OpenAI Furious: DeepSeek Might Have Stolen All the Data OpenAI Stole From Us

OpenAI accuses DeepSeek of unfairly using its data to train AI models, an ironic complaint given that OpenAI itself has faced criticism for sweeping up vast amounts of public data without authorization.

OpenAI and Microsoft are investigating whether Chinese AI startup DeepSeek improperly used OpenAI’s data to train its R1 model, which has reportedly outpaced OpenAI’s models despite relying on older chips and a far smaller budget. The irony is palpable: OpenAI, long accused of data hoarding, now claims to be the victim of similar tactics. Venture capitalist David Sacks argues that DeepSeek likely used an AI training technique called distillation, mimicking OpenAI’s models to build a competitor. The debate raises ethical concerns about AI training and the broader implications of OpenAI’s own data-gathering practices.

The AI industry is in turmoil after OpenAI and Microsoft launched investigations into whether DeepSeek, a rising Chinese artificial intelligence startup, used OpenAI-generated data in unauthorized ways to train its R1 model. Bloomberg and the Financial Times have reported that the AI giant and its tech partner are probing whether DeepSeek's sudden success was fueled by outputs from OpenAI models.

According to sources, OpenAI’s terms of service might have been violated, or DeepSeek may have bypassed restrictions meant to limit data extraction. The irony of OpenAI—a company built on sweeping up publicly available data—accusing another entity of improper data usage has not been lost on critics.

The Rise of DeepSeek

DeepSeek quickly attracted attention after releasing its R1 model, which it claims is more effective than OpenAI's models while running on older-generation hardware and costing far less to build. How can that be? OpenAI and Microsoft suspect that DeepSeek improperly used an AI training technique called distillation, which lets one model learn from another by querying it, potentially millions of times, and training on the answers. This helps a "student" model approximate the reasoning and knowledge contained in a more sophisticated "teacher" model.

Venture capitalist and recently appointed White House AI advisor David Sacks explained this phenomenon on Fox News, emphasizing the possibility that DeepSeek systematically extracted knowledge from OpenAI’s models. He described the process as AI-driven knowledge transfer, akin to how a human student learns from a teacher by repeatedly asking questions and refining their understanding.

The Irony of OpenAI’s Outrage

As OpenAI and its supporters cry foul, many in the tech community have been quick to point out the hypocrisy. OpenAI has been at the center of legal and ethical debates surrounding data collection, with lawsuits alleging that the company indiscriminately scraped vast amounts of data from the internet. Its defense has largely hinged on the argument that collecting publicly available data for training is permissible.

For years, OpenAI and other AI firms have assumed that anything publicly available—or even semi-restricted—can be fair game for training large language models. Now, the company finds itself on the receiving end of the very practices it helped normalize.

Ethical and Legal Implications

This dispute raises serious questions about the ethics of AI model training and data usage:

What constitutes fair use of AI-generated content?

If DeepSeek trained its model on OpenAI’s outputs, is this fundamentally different from OpenAI training its models on web-scraped data?

Should AI-generated content be protected as intellectual property?

OpenAI’s argument hinges on the idea that its model’s responses constitute proprietary material. But does that mean all AI-generated content—regardless of the source—should be protected?

How should AI companies regulate data-sharing policies?

The industry is at a crossroads where defining and enforcing rules around data acquisition is becoming increasingly critical.

The Bigger Picture: AI’s Data Dilemma

The clash between OpenAI and DeepSeek reflects challenges the AI industry faces more broadly. Because companies like OpenAI, Microsoft, Google, and Meta rely on large amounts of data to train and fine-tune their models, competition between them is likely to grow increasingly heated over data ownership and ethical training.

Ironically, OpenAI’s anger over DeepSeek’s alleged practices serves as a case study in AI’s escalating conflicts over data control. If DeepSeek truly did “distill” OpenAI’s models, it is simply following a precedent set by the very company that is now condemning it.

FAQ

What is AI distillation, and how does it work?

AI distillation is a machine learning technique in which a smaller model learns from a larger, more advanced one by training on its outputs. Afterward, the smaller model can approximate much of the larger model's reasoning.
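To make the idea concrete, here is a minimal Python sketch of distillation on a toy problem. Everything in it is an assumption made for illustration: the `teacher` function stands in for a model that can only be queried, the student is a two-parameter logistic model, and all names and numbers are hypothetical. It is not a description of how DeepSeek or OpenAI train their models.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def teacher(x):
    # Hypothetical black-box "teacher": the student can only see its outputs,
    # not the parameters (3.0 and -1.0) hidden inside it.
    return sigmoid(3.0 * x - 1.0)


def distill(num_queries=500, epochs=500, lr=2.0, seed=0):
    """Fit a student p(x) = sigmoid(w*x + b) to the teacher's answers."""
    rng = random.Random(seed)

    # Step 1: query the teacher many times and record its soft outputs.
    queries = [rng.uniform(-2.0, 2.0) for _ in range(num_queries)]
    soft_labels = [teacher(x) for x in queries]

    # Step 2: train the student with gradient descent on the cross-entropy
    # between its predictions and the teacher's recorded outputs.
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, t in zip(queries, soft_labels):
            p = sigmoid(w * x + b)
            grad_w += (p - t) * x
            grad_b += p - t
        w -= lr * grad_w / num_queries
        b -= lr * grad_b / num_queries
    return w, b


if __name__ == "__main__":
    w, b = distill()
    for x in (-1.0, 0.0, 1.0):
        print(f"x={x:+.1f}  teacher={teacher(x):.3f}  student={sigmoid(w * x + b):.3f}")
```

The student never sees how the teacher was built. It only sees question-and-answer pairs, which is why distillation can work through an ordinary query interface.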

Why does OpenAI accuse DeepSeek of improper data use?

OpenAI and Microsoft suspect that DeepSeek trained its R1 model on data generated by OpenAI's models in a way the company did not authorize. If so, that could breach OpenAI's terms of service or indicate that DeepSeek bypassed restrictions meant to limit data extraction.

Is OpenAI hypocritical for complaining about data usage?

Many critics argue that OpenAI has long swept up huge amounts of public data to train its models. In their view, its complaints about DeepSeek using a similar approach expose an inconsistency in its stance on data ethics.

What are the potential legal consequences for DeepSeek?

If OpenAI determines that DeepSeek violated its terms of service or engaged in illegal data extraction, the result could be lawsuits and new regulations that influence AI model training worldwide.

What are the implications of this controversy on AI development?

The controversy underscores the need for clear rules on data use, intellectual property rights, and how AI training is conducted. As competition in AI intensifies, the industry will need clearer ethical boundaries to guide its growth.
