Winning the AI race hinges less on whether models are open or closed source than on access to quality data
On June 9, China’s RedNote open-sourced its first large language model (LLM), becoming the latest of a growing number of tech firms in China and around the world to do so. LLMs and other generative artificial intelligence (AI) models depend heavily on large datasets and advanced algorithms. A growing debate now pits advocates of open-source models — whose code and training methods are publicly available — against proponents of proprietary models, which keep these details confidential.
What’s next
Subsidiary Impacts
- Strategic decisions by companies and governments will shape how AI capabilities evolve.
- Senior-level expertise will be crucial in determining AI’s trajectory, putting a premium on skilled decision-makers in the field.
- China is making significant progress in developing highly capable AI chips.
Analysis
‘Open source’ was a guiding principle for large communities of software developers long before the current AI boom began in late 2022 with the launch of OpenAI’s ChatGPT.
Its proponents value transparency, community collaboration and the ability for researchers, developers and organisations to inspect, modify and build upon previous work.
Open-source generative models are those whose underlying code, training methodologies and the model weights — the parameters that determine how the model processes data — are publicly accessible.
They are often likened to an open recipe: anyone can see the ingredients and the cooking process, modify it to their taste and even share their own version with others.
Examples include models such as those in the Llama family (from Meta), China’s DeepSeek and France’s Mistral, as well as those of US-headquartered Hugging Face, which both develops its own models and serves as a platform for hosting and sharing third-party models (see INT: DeepSeek may prompt tighter US restrictions – February 5, 2025).
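In practice, ‘open weights’ means anyone can download and run such a model locally. The minimal sketch below, which assumes the Hugging Face transformers library (plus PyTorch) and uses one of EleutherAI’s small open releases purely for illustration, shows how little this requires:

```python
# Minimal sketch: pulling openly published weights from the Hugging Face
# hub and generating text locally. Assumes `transformers` and `torch`
# are installed; the model ID is a small open release chosen purely
# for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Open-source models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the weights sit on local disk after download, the same artefact can be inspected, fine-tuned or redistributed under the model’s licence.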
Proprietary models, by contrast, are developed by companies that keep their code, training methods and data sources confidential. Users can interact with the final product, such as by asking questions, but regardless of technical expertise they have no insight into how the model was built or what data it was trained on, and no ability to modify its underlying functionality. Examples include models in the ChatGPT (OpenAI), Claude (Anthropic) and Grok (xAI) families.
In the culinary metaphor, these models are compared to restaurant dishes — users can ‘enjoy the meal’, but are not allowed to see the recipe or inspect the kitchen.
Transparency and auditability
These differences matter for several reasons. Issues such as bias and ‘hallucinations’ — plausible but incorrect outputs — are major challenges for AI developers. These problems are hard to fix because models often function as ‘black boxes’, with even their creators unsure how specific outputs are generated.
Transparency is essential for improving auditability, and open-source models make this more feasible by allowing independent researchers to examine, test and verify their behaviour.
Innovation
Economically, open models lower the barriers to entry and fuel innovation by enabling startups, smaller firms and researchers to create specialised applications without costly proprietary licences (see CHINA: Open-source AI push fuels global ambitions – March 24, 2025).
Open access to models and code encourages experimentation and rapid iteration, accelerating improvements in techniques and efficiency. By building on existing work instead of starting from scratch, the community speeds breakthroughs and fosters quicker identification and resolution of model vulnerabilities and limitations.
For instance, the EleutherAI community rapidly advanced the development of open-source LLMs such as GPT-Neo and GPT-J by building on existing work and collectively iterating on architecture, training data selection and evaluation methods. Their collaborative efforts have resulted in competitive models that challenge proprietary offerings and drive the field forward.
On the other hand, proponents of proprietary models argue that the promise of exclusive economic returns incentivises companies to invest heavily in innovation.
However, open-source models also present challenges, particularly around misuse. Their wide availability enables adaptation for harmful purposes, such as disinformation campaigns or automated hacking tools. In contrast, proprietary models give companies greater control over usage, which is crucial for complying with regulatory and safety requirements.
Business costs and security
Open-source generative models have the advantage of significantly reducing long-term costs for organisations, notably those with large-scale operations.
When deploying AI models, businesses can:
- run them on ‘public’ cloud platforms, such as AWS and Microsoft Azure, while retaining full ownership of their data, the models themselves and the strategic outcomes they deliver; or
- use their own data centres to run these models in-house.
Many companies that can afford to do so prefer to use customised versions of open-source models on their own infrastructure. This approach is particularly common for applications that handle commercially sensitive or proprietary data, as it offers greater control and security.
Companies tend to prefer using the private cloud to process sensitive data
Once the initial investment in infrastructure (eg, servers, networking, security and maintenance tools) is made, the marginal cost of running additional models can be quite low, although the demands on operational expertise remain significant.
By contrast, owners of proprietary models charge businesses ongoing per-token Application Programming Interface (API) fees to use them. Tokens are small units, such as parts of words, that models process as input or generate as output. Because models produce responses token by token, pricing is based on token usage. APIs enable users to access these models remotely and on demand. For businesses with high-volume needs, running open-source models in-house can yield significant cost savings.
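A rough, purely illustrative calculation shows why volume matters; every price and workload figure below is a hypothetical assumption, not vendor pricing:

```python
# Back-of-the-envelope comparison of per-token API fees versus
# self-hosting an open-source model. All figures are hypothetical
# assumptions for illustration only.
API_PRICE_PER_1K_TOKENS = 0.01    # assumed blended API price (USD)
TOKENS_PER_MONTH = 2_000_000_000  # assumed high-volume workload

SELF_HOST_FIXED_MONTHLY = 15_000  # assumed amortised servers, power, staff (USD)
SELF_HOST_PER_1K_TOKENS = 0.001   # assumed marginal compute cost (USD)

api_cost = TOKENS_PER_MONTH / 1000 * API_PRICE_PER_1K_TOKENS
self_host_cost = SELF_HOST_FIXED_MONTHLY + TOKENS_PER_MONTH / 1000 * SELF_HOST_PER_1K_TOKENS

print(f"API:       ${api_cost:,.0f} per month")
print(f"Self-host: ${self_host_cost:,.0f} per month")
```

Under these assumptions the fixed costs of self-hosting are recouped at high volumes, while at low volumes the pay-per-token API remains cheaper.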
Moreover, self-hosting further increases the control organisations have over the data they feed into these models. This is critical for data-driven competitiveness. Instead of sending sensitive, proprietary or strategic data to external vendors, companies boost security by maintaining their data within their own systems.
By fine-tuning or adapting these models internally, organisations can develop unique, context-specific capabilities that are harder for competitors to replicate. For example, telecommunications companies and banks often choose to self-host open-source LLMs on their own infrastructure to process vast amounts of customer data securely, leveraging their domain-specific knowledge while ensuring that no third party gains access to their proprietary insights.
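A minimal sketch of what such internal adaptation can look like, using parameter-efficient fine-tuning (LoRA) with the Hugging Face peft library; the base model and target modules here are illustrative assumptions and would differ by model family:

```python
# Minimal sketch: adapting a self-hosted open-source model with LoRA so
# proprietary data never leaves the organisation's infrastructure.
# Model ID and target modules are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")

config = LoraConfig(
    r=8,                                  # low-rank adapter dimension
    lora_alpha=16,                        # adapter scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only a small fraction of weights is trained

# Training would then run on in-house data with a standard Trainer loop;
# the adapted weights remain on internal servers.
```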
Performance
While open-source models have many advantages, proprietary generative models frequently outperform them on accuracy, coherence and reasoning depth. The expectation that proprietary models will eventually generate profits attracts investment, allowing their vendors to fund massive pre-training runs on unique, proprietary datasets and to deploy specialised hardware and large engineering teams: advantages unaffordable to most individual organisations.
These investments often result in models that are more capable in complex or nuanced tasks, offering better generalisation and reliability.
OpenAI’s GPT-4 is a case in point. Its massive scale, sophisticated training processes and use of reinforcement learning from human feedback (RLHF) have boosted its performance in areas such as coding, reasoning and multilingual understanding, capabilities that smaller open-source models often struggle to match directly.
However, the emergence of ‘foundation’ (generalist) open models of quality comparable to their proprietary peers, notably DeepSeek, has raised growing questions about the extent to which closed models still enjoy a significant advantage in this area, and whether any such gap will disappear in the near future.
Ready-made solution
Proprietary generative models offered via cloud APIs provide a turnkey solution for deploying advanced AI at scale. Managed by vendors with robust infrastructure and specialised engineering teams, they ensure high uptime, automatic scaling to handle demand spikes and continuous maintenance for security and performance.
For most organisations, replicating this reliability and scalability internally would require substantial investment in machine learning operations (MLOps), including complex systems for model deployment, monitoring and hardware management.
Cloud APIs eliminate these challenges, enabling businesses to integrate powerful AI quickly without building extensive infrastructure or staffing specialised teams.
For example, a startup developing an AI-powered customer support chatbot can leverage the OpenAI API to access GPT-4 instantly, allowing it to focus on product features and user experience instead of managing GPU clusters or outages.
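The integration itself can amount to a few lines of code. The sketch below uses OpenAI’s published Python SDK; the model name and prompts are illustrative, and an API key is assumed to be set in the environment:

```python
# Minimal sketch: calling a proprietary model through a vendor API
# instead of self-hosting. Assumes the `openai` package is installed
# and OPENAI_API_KEY is set; prompt text is illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a customer support assistant."},
        {"role": "user", "content": "My order has not arrived. What should I do?"},
    ],
)
print(response.choices[0].message.content)
```

Billing is per token processed, while scaling, uptime and model updates are handled by the vendor.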
Blurred line
In any case, the argument among companies that develop generative models over the merits and downsides of open versus proprietary approaches obscures the fact that the line between the two is increasingly blurred.
The line between proprietary and open models is increasingly blurred
Many companies are experimenting with hybrid approaches, releasing some components of their models while keeping others closed. For example, Meta has published the model weights for some versions of Llama, but not the full training dataset or training code, seeking to balance openness against competitive advantage.
Data role
Furthermore, while models are fundamental tools for running successful AI systems, data is the raw material that provides these systems with their knowledge, capabilities and even cultural context.
Access to data that reflect the complex realities of areas such as healthcare, agriculture, finance, defence and legal systems not only requires years or even decades of curation; it often also depends on regulatory permissions that external actors cannot easily obtain.
Hence, although the arguments between proponents of open and closed models will persist, the companies and countries with superior data assets will hold the decisive advantage in the AI race.
Countries with rich, diverse and high-quality datasets can train models attuned to their language nuances, cultural references, economic structures and societal needs — capabilities that foreign models struggle to replicate.
Sovereignty considerations
Governments are increasingly recognising that depending entirely on foreign-trained models means outsourcing critical decisions about AI behaviour, values alignment and even basic factual understanding of local contexts.
This also means that countries with small populations, which therefore generate less data and have smaller digital footprints, face a harder battle to train models on local languages and specific use cases. The challenge is particularly acute for tasks requiring deep cultural understanding or the handling of local dialects.
In contrast, countries that generate and exploit large volumes of real-world equipment data, collected from ‘Internet of Things’ (IoT) devices in manufacturing plants, industrial robots, wind turbines and agricultural machinery, will build a far stronger competitive advantage than those relying solely on digital AI strategies.
Disputes over data control between manufacturers and users of industrial equipment are set to intensify
This will likely accentuate disputes over data control between equipment manufacturers — which often also process the data generated by their products — and their customers.
Geopolitical and commercial considerations will likely drive further growth in ‘data nationalism’, with more jurisdictions passing or tightening legislation requiring that data involving their citizens and/or companies remain within their borders.
Physical world
As AI becomes more integrated with the physical world, leading companies in sectors such as manufacturing, healthcare and agriculture will gain leverage in their relationships with major technology firms.
While ‘big tech’ dominates consumer data and builds many of the AI models these companies use, it cannot match the unique data access and insights that, for example, a medical device company gains through its direct relationships and access to patient information.
Sectors that do not rely solely on human interaction for data gathering, such as those using sensors, machines or other automated systems, might have a further advantage in leveraging these AI opportunities.
Synthetic data
Some in the industry have long hoped that the quality of ‘synthetic data’, data generated by software specifically for training purposes, would eventually be high enough for such datasets to replace those collected from the internet, original documents and physical-world equipment in the training of AI models.
This approach would not only benefit countries and companies with limited access to critical real-world data but also help address privacy concerns and provide a way to keep improving models when high-quality data becomes scarce — a point some researchers believe could arrive within a few years.
However, synthetic data is unlikely to match the quality of real-world data in the near future, as challenges persist in bridging the gap between training on synthetic data and real-world deployment. Several studies have found that generative AI models trained solely or primarily on such data fail to generalise well to real-world situations, a key capability of useful models.
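The underlying problem can be reproduced in a toy experiment: a classifier trained on data drawn from a crude generative model typically generalises worse to ‘real’ data than one trained on the real data itself. In the sketch below, scikit-learn’s moons dataset stands in for real-world data with non-trivial structure:

```python
# Toy illustration: training on naively generated synthetic data hurts
# generalisation to 'real' data. The moons dataset stands in for
# real-world data; the synthetic generator is a deliberately crude
# per-class Gaussian fitted to it.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

X, y = make_moons(n_samples=2000, noise=0.15, random_state=0)
X_train, y_train, X_test, y_test = X[:1000], y[:1000], X[1000:], y[1000:]

# Crude synthetic generator: one Gaussian per class
X_syn, y_syn = [], []
for label in (0, 1):
    pts = X_train[y_train == label]
    X_syn.append(rng.multivariate_normal(pts.mean(axis=0), np.cov(pts.T), size=1000))
    y_syn.append(np.full(1000, label))
X_syn, y_syn = np.vstack(X_syn), np.concatenate(y_syn)

real_model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
syn_model = RandomForestClassifier(random_state=0).fit(X_syn, y_syn)

print(f"trained on real data:      {real_model.score(X_test, y_test):.2f}")
print(f"trained on synthetic data: {syn_model.score(X_test, y_test):.2f}")
```

The gap between the two scores illustrates the kind of generalisation loss those studies describe; richer generative models narrow it but have not closed it.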