A handful of cherries does not make a sundae

Why content licensing cannot solve AI’s training-data problem

I have just published an article, “The False Hope of Content Licensing at Internet Scale,” as part of the ProMarket symposium at the University of Chicago Booth School of Business.

The article itself is fairly short, but I thought I would summarize its central point even more briefly here.

AI developers have been on a shopping spree. Since mid-2023, OpenAI, Google, Anthropic and Meta have collectively spent hundreds of millions of dollars striking deals with publishers. OpenAI alone has inked agreements with everyone from the Associated Press to Condé Nast, gaining access to archives from The New Yorker, Vogue, The Wall Street Journal and dozens of other publications.

To many watching from the sidelines, these deals offer tantalizing proof that AI companies can—and should—pay for the content they consume.

However, the agreements grabbing headlines represent a tiny fraction of the data needed to train cutting-edge language models. Modern AI systems require trillions of diverse tokens scraped from across the internet—a scale and diversity that traditional licensing simply cannot reach.
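To make the mismatch concrete, here is a rough back-of-envelope sketch. The numbers are illustrative assumptions on my part, not figures from the article: frontier models are publicly reported to train on on the order of ten trillion tokens or more, and a single publisher's full archive is assumed here to hold a few billion tokens. Even under generous assumptions, dozens of licensing deals cover only a small slice of the total.

# Back-of-envelope comparison: licensed archives vs. total training data.
# All constants below are illustrative assumptions, not reported figures.

TRAINING_TOKENS = 15e12        # assumed tokens in one frontier training run
TOKENS_PER_PUBLISHER = 5e9     # assumed size of one publisher's full archive
LICENSED_PUBLISHERS = 50       # assumed number of licensing deals

licensed_tokens = TOKENS_PER_PUBLISHER * LICENSED_PUBLISHERS
share = licensed_tokens / TRAINING_TOKENS

print(f"Licensed tokens: {licensed_tokens:.2e}")
print(f"Share of training data: {share:.2%}")  # roughly 1.7% under these assumptions

Under these assumptions the licensed material amounts to a couple of percent of the corpus; the remainder still has to come from broad web scraping, which is precisely the part licensing cannot reach.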

To see why, read the full article: The False Hope of Content Licensing at Internet Scale