Publishers increasingly face a tough choice between collaborating with AI firms and pursuing legal action

In 2024, artificial intelligence (AI) firm OpenAI struck several deals with the publishing industry to acquire content to train the models behind its generative AI (GenAI) chatbot, ChatGPT. The growing adoption of GenAI has left for-profit news publishers grappling with a critical choice: whether to license their content to AI developers in the hope of salvaging revenue, or to defend their copyrights through legal action, as The New York Times and others have done.

What next

Having accumulated most publicly available content, AI developers will increasingly depend on content deals with publishers to continue training their models and avoid copyright lawsuits. As social media accelerates changes in the media landscape, questions about the future of traditional journalism will intensify — especially if Donald Trump follows through on his promise to take legal action against the US news media.

Analysis

Generative AIs, such as OpenAI’s ChatGPT and its growing cohort of rivals, need vast sets of data to train on. To that end, their developers have increasingly utilised publicly available text — from social media posts to blogs and Wikipedia entries. However, the most contentious aspect has been their use of high-quality language datasets — such as books, news articles and scientific papers — many of which are protected by copyright laws.

Publishing companies, whose content is protected by these laws, have faced a difficult choice between collaborating with AI firms and taking legal action (see INT: AI’s promise and pitfalls will affect key domains – April 24, 2024).

Many publishers, including Axel Springer, the FT Group, the Associated Press and News Corp, have opted to license their content to GenAI companies. According to Columbia University’s Tow Center for Digital Journalism, which monitors publicly disclosed licensing and partnership agreements, OpenAI has spent an estimated USD300mn on such deals. A significant portion of this comes from a USD250mn, five-year global agreement with News Corp. OpenAI typically pays publishers a flat fee, whereas competitors such as Perplexity often negotiate revenue-sharing arrangements.

Not all publishers have followed suit, however. The New York Times became the first major news publisher to sue OpenAI for copyright infringement in December 2023 (see INT: Key deal will divide publishing industry – August 21, 2024). One year later, five of Canada’s leading news media companies joined the fray, taking similar legal action.

Copyright suits

In their lawsuits, publishers contend that OpenAI scraped their websites and other digital properties, copying protected content without permission and profiting from it without compensating the original creators.

OpenAI’s defence hinges on two key arguments: that using such material to train its AIs falls under the “fair use” provisions of copyright law and that scraping does not constitute copying (see UNITED KINGDOM: AI data scraping – October 17, 2024).

OpenAI’s defence hinges on definitions of “fair use” and “scraping”

While copyright gives an owner the exclusive legal rights to a creative work, fair use allows others limited use of portions of the work for purposes such as commentary, criticism, news reporting, scholarship and research.

Neither US nor Canadian copyright law clearly quantifies how much of a work may be used, even in print. Online, where sharing content by linking or cutting and pasting has been the norm since the outset of web publishing, the boundaries of fair use are vaguer still.

Publishing companies have largely tolerated this extensive unlicensed reuse, accepting it in exchange for links, often from search engines, that drive readers back to their websites, where the resulting traffic can be monetised. However, GenAIs threaten even this flow of web traffic, since they reduce the need for readers to leave the AI’s environment.

Novel defence

Some prominent legal scholars have proposed supportive interpretations of copyright law that bolster OpenAI’s arguments.

These posit that scraping is not copying, because GenAI models learn from copyrighted content at an abstract (and thus uncopyrightable) level, extracting patterns and relationships rather than reproducing expression. Further, the outcome of the process, new content that is substantially different from the original work, is no different in concept from the unlicensed use of copyrighted material in criticism, comment, news reporting, scholarship and research that is allowed as “fair use”.

The courts have yet to rule on these novel legal arguments. In November 2024, two training-data cases filed by small publishers against OpenAI were dismissed after the judge determined that the publishers had failed to demonstrate tangible harm. Because those publishers had accused OpenAI of bypassing copyright protections rather than of directly infringing their rights, OpenAI’s fair use defence was not put to the test.

Scaling limits

AI developers rely on scaling, using more computing power and data during a model’s pre-training phase, to make their models more capable. Yet they risk hitting a plateau as the returns on additional compute and data diminish and the rapid innovation of recent years approaches the limits of current techniques.
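The intuition behind this plateau can be made concrete with a standard result from the AI research literature (an illustration, not a finding of this brief): the ‘Chinchilla’ scaling law of Hoffmann et al. (2022) models pre-training loss L as a power law in model size N (parameters) and training data D (tokens):

L(N, D) = E + A/N^α + B/D^β

Here E is an irreducible loss floor, and A, B, α and β are empirically fitted constants. Because both correction terms shrink only polynomially, each doubling of compute or data buys a progressively smaller improvement, and the supply of high-quality text for D is the input most at risk of running out.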

Work is under way to improve model performance beyond simply adding raw computing power. However, as GenAI expands rapidly beyond text chatbots into content summarisation, search and image processing, the risk that developers will run out of publicly accessible training data becomes more acute.

Media fragmentation

Even without GenAI, mainstream media business models were already under severe stress from the transition to digital publishing. The news industry has contracted significantly over the past two decades, especially among local news outlets, and closures continue apace.

GenAI puts more pressure on mainstream media business models

According to Northwestern University’s Medill School of Journalism, the United States has one-third fewer news publishers than it did in 2005. Meanwhile, 1,563 US counties, more than half of the country’s total, have just one local news outlet or none at all. To fill the vacuum in these ‘news deserts’, residents have turned to social media platforms such as TikTok and X.

As the public increasingly relies on social media for news, the risk of misinformation grows significantly. According to the Pew Research Center, one in five US adults now regularly gets news from social media influencers, a figure that rises to nearly two in five among adults under 30. Yet a new UNESCO survey finds that nearly two-thirds of social media influencers do not vet the accuracy of content before sharing it with their followers.

Outside specialised markets, the news industry is increasingly, and intentionally, migrating to a non-profit model supported by reader subscriptions and philanthropic foundations. Earlier this year, the News Leaders Association, an editors’ trade group that succeeded the once-influential but now dissolved American Society of News Editors (ASNE), announced that it too would dissolve, distributing its remaining assets to non-profit journalism organisations.

Yet the biggest problem news publishers face may be declining trust in their products. One in four US adults opposes requiring AI firms to compensate publishers for content used in training. This surprisingly high figure may stem from growing public distrust of mainstream media, fuelled by conservative populists in recent years. That antipathy will intensify if and when President-elect Donald Trump follows through on campaign threats of legal action against the news media.

ChatGPT illustration image (Cheng Xin/Getty Images)

Authored by:

Sarah Fowler

Senior Analyst,
International Economy

Tatia Bolkvadze

Technology Analyst
