The Fallacy of Compression

This post is a very lightly edited extract from my forthcoming article in the Duke Law Journal, Copyright’s Jagged Frontier (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6319379).

What does AI memorization prove?

Some argue that any evidence of memorization necessarily negates the claim that AI models are transformative. They advance this claim by injecting the term “compression” into the conversation in a way that suggests that AI models like GPT, Claude, and Gemini are compressed representations of their training data in the same way that an MP3 music file is a compressed version of music from a compact disc.

“[model training is] similar to what’s called lossy compression, which one way to describe it is if you have a giant file and you compress it into a ZIP file, you lose some of the contents of the work, but effectively you’re just actually compressing the file. … it’s actually taking the expressive content of the training data and compressing it down into a model. And that confirms that there’s no actual transformative use going on here … what the model is doing is actually just repeating over and over the training data over and over again.”

— Bartz v. Anthropic, Transcript of Motion for Summary Judgment Oral Argument, May 22, 2025, pp. 44–45 (explaining Plaintiff’s expert’s view)

Alex Reisner (AI’s Memorization Crisis, The Atlantic), for example, draws on the Cooper and Ahmed studies, and argues that the evidence of memorization undermines the learning metaphor and reveals generative AI training for what it really is: “compression.” The upshot is, “Large language models don’t ‘learn’—they copy[.]” See also Ted Chiang’s famous essay: ChatGPT Is a Blurry JPEG of the Web.

Technically accurate but thoroughly misleading

Associating AI training with compression is technically accurate if you understand the term the way computer scientists do; but it is also thoroughly misleading if you associate compression with MP3s, JPEGs, and Zip files, as most of us do.

AI models learn compact internal representations of their training data which capture whatever patterns enable more accurate predictions. It is equally valid to label this process as “abstraction,” “learning,” “dimension reduction,” or “compression”; but the compression label invites analogy to familiar media formats such as MP3s and JPEGs.

These formats store approximations of original works that can later be reconstructed in forms that closely resemble their sources and are usually regarded as functionally indistinguishable. Other than hipsters with a taste for vinyl records, consumers interact with ZIP files, JPEGs, and MP3s as functionally equivalent to their uncompressed originals; whatever information is discarded is socially normalized as imperceptible. Side note, I highly recommend Jonathan Sterne, MP3: The Meaning of a Format (2012).

Calling it compression tells you nothing

Training an AI model is nothing like ripping music into an MP3 format. Calling that process “compression” tells you nothing about the level of detail of what is learned or the significance of the information discarded. The compression metaphor is further misleading because it implies uniformity and predictability. In conventional audio or image compression, the same categories of information are discarded from every file according to stable and transparent criteria that reflect advance judgments about what matters and what does not. By contrast, memorization in large language models is uneven, incidental, and difficult to anticipate. We know that memorization is more likely when a model is exposed to multiple copies of the same work, and that the timing of exposure during training can matter. Beyond such generalities, however, it is not possible to predict in advance which works will be retained verbatim or to what degree.
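To make the contrast concrete, here is a toy sketch (purely illustrative, not any real codec) of what makes conventional lossy compression uniform and predictable: a single quantization rule, chosen in advance, discards the same category of detail from every input.

```python
# Toy lossy "compression": snap every sample to a fixed grid,
# discarding low-order detail. The rule is chosen before any file
# is seen and applied identically to all inputs -- so we know in
# advance exactly which information will be lost.

def lossy_quantize(samples, step=16):
    """Quantize each sample to the nearest multiple of `step`."""
    return [round(s / step) * step for s in samples]

original = [3, 18, 33, 47, 130, 255]
compressed = lossy_quantize(original)
# compressed -> [0, 16, 32, 48, 128, 256]
```

Nothing like this fixed, transparent criterion exists for predicting which passages a large language model will memorize.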

The rhetoric of compression is really just an effort to sidestep a difficult empirical question, rather than to answer it. The fact that one thing is memorized to a degree that seems relevant under copyright law doesn’t prove that everything is memorized to a similar degree.

To evaluate whether memorization actually has significance under copyright law requires some kind of qualitative and quantitative assessment of the nature and extent of memorization. But even that statement is overbroad. As I explain in Copyright’s Jagged Frontier, what actually matters in terms of a fair use analysis is not memorization in the abstract, but memorization that finds its way into production.

Copyright’s Jagged Frontier

Why the Line Between Legal and Infringing AI Won’t Be a Line at All

By Matthew James Sag

Everyone wants to know whether training AI on copyrighted works is legal. The real answer is: it depends—and the boundary between what’s permissible and what isn’t will be far messier than anyone expects.

In my forthcoming article in the Duke Law Journal, I argue that the copyright boundary for generative AI will be jagged rather than smooth. Not a clean bright line, but an irregular, context-dependent frontier shaped by the interaction of varying memorization rates across different AI models, divergent legal standards of similarity across different creative media, and the interplay of three distinct bodies of copyright doctrine (substantial similarity, fair use and secondary liability).

Understanding that jaggedness turns out to be essential—not just for predicting litigation outcomes, but for seeing the opportunities that lie on the other side.

The phrase “jagged frontier” will be familiar to many. It comes from the influential 2023 study by Fabrizio Dell’Acqua, Ethan Mollick, and colleagues, who used it to describe the uneven capability landscape of AI itself. It’s a useful concept because it captures the way that AI can be astonishingly good at some tasks while failing at others that seem equally difficult.

I borrow the metaphor deliberately, because copyright law presents generative AI with an analogous problem. The legal boundary between permissible and infringing AI conduct is similarly jagged: not because AI’s capabilities are uneven (though they are), but because the legal standards that determine infringement are themselves uneven across different creative domains. It seems likely that an AI system can cross the line into copyright infringement far more easily when generating music or images of recognizable characters than when generating prose—even when the underlying technology is essentially the same.

Explaining how and why the intersection of copyright and AI leads to a jagged frontier accounts for the first third of the article.

That jagged frontier is only the beginning of the story. Drawing on Ronald Coase’s insight that legal rules are starting points for adaptation and negotiation rather than final allocations, the Article argues that the extensive literature on AI and copyright has focused almost exclusively on fair use while ignoring what comes next. I might have something to say about that in a future post.

Matthew James Sag is the Jonas Robitscher Professor of Law in Artificial Intelligence, Machine Learning, and Data Science at Emory University School of Law. His article “Copyright’s Jagged Frontier” is forthcoming in the Duke Law Journal.

The Mouse and the Model: the Disney-OpenAI Deal

The other shoe has finally dropped.

Today, December 11, 2025, OpenAI and Disney announced a partnership that essentially signals a marriage between generative AI and legacy media. Although some kind of deal was inevitable, the range and scope of this one are striking. Disney is sinking $1 billion into OpenAI for an equity stake and warrants, while simultaneously inking a three-year licensing deal.

The immediate result? OpenAI’s Sora and ChatGPT will legally ingest over 200 marquee characters from the Disney, Marvel, Pixar, and Star Wars vaults. We’ll see AI-generated Disney content on Disney+, and Disney employees will get enterprise-grade access to OpenAI’s tools. Notably, actor likenesses are off the table—a nod to the sensitivities of the recent labor strikes—but the direction of travel is clear. For more reporting, see the Verge.

Why it matters

Addressing the “Snoopy Problem”

AI companies and copyright industries are beginning to understand, and become reconciled to, the fact that neither side is going to score an absolute victory on the fair use issue for AI training. AI training that results in a model that learns from, but does not reproduce, the training data looks very likely to be upheld as fair use. Two recent cases held as much on summary judgment, and this aligns with a line of “nonexpressive use” precedents that predate generative AI.

However, it’s becoming increasingly clear that it’s hard to train generative AI models to be really useful without some degree of memorization of the training data along the way. This is particularly problematic when it comes to copyrightable characters, because copyright protects characters more abstractly than most things. This is the well-known Snoopy problem (a term I coined in 2023).

Faced with this increasingly clear reality, it makes sense for consumer-facing AI companies and entertainment giants like Disney to think about licensing arrangements.

This deal signals a retreat from the fair use absolutism of early AI development. OpenAI and Disney have effectively priced the risk of memorization. Instead of spending the next decade in discovery arguing over pixel similarities, they are moving to a licensing regime. Disney gets paid and retains control; OpenAI gets legal certainty and the ability to serve the entertainment industry without looking over its shoulder.

Capital Crunch?

With competitors like Anthropic eyeing public listings, OpenAI’s decision to take strategic capital from a corporate giant like Disney may be telling. It suggests we are hitting a saturation point for traditional venture capital at the scale these foundation models require. It also hints that OpenAI sees more value in “smart money” than in the volatility of the public markets. Disney isn’t just a piggy bank; it’s a hedge. By entangling itself with the world’s premier IP holder, OpenAI makes itself indispensable to the very industry that threatened to sue it out of existence. Or at least that’s the theory; whether it pans out that way remains to be seen.

The End of the Scaling Era?

Finally, this move also adds to the “Data Scarcity” thesis. The era of simply scraping the open web to make models smarter (2017–2025) might be over. The low-hanging fruit of the public internet has been picked, processed, recycled into synthetic data, and processed again, every which way you can imagine. To get better, and to stay ahead of open source rivals, companies like OpenAI are going to need access to data that no one else has. Google has YouTube; OpenAI now has the Magic Kingdom.

The Bottom Line

This is the template for the future. We are moving away from total war between AI and Content, toward a negotiated partition of the world. The tech companies provide the engine; the media giants provide the fuel. And for now, at least, both sides seem to think that’s a better outcome than leaving it up to a judge.

I wrote this blog post the morning the deal was announced, because it fits surprisingly well with a Law Review article I am writing, “The Snoopy Solution: How Fair Use and Licensing for Generative AI Can Coexist” based on a talk I gave at Yale last month.

Copyright Winter is Coming (to Wikipedia?)

Judge Stein’s Order Denying OpenAI’s Motion to Dismiss in Authors Guild v. OpenAI, Inc., No. 25-md-3143 (SHS) (OTW) (S.D.N.Y. Oct. 27, 2025)

A new ruling in Authors Guild v. OpenAI has major implications for copyright law, well beyond artificial intelligence. On October 27, 2025, Judge Sidney Stein of the Southern District of New York denied OpenAI’s motion to dismiss claims that ChatGPT outputs infringed the rights of authors such as George R.R. Martin and David Baldacci. The opinion suggests that short summaries of popular works of fiction are very likely infringing (unless fair use comes to the rescue).

This is a fundamental assault on the idea-expression distinction as applied to works of fiction. It places thousands of Wikipedia entries in the copyright crosshairs and suggests that any kind of summary or analysis of a work of fiction is presumptively infringing.

A white walker in a desolate field reading Wikipedia (an AI Image by Gemini)

Copyright and derivative works

In Penguin Random House LLC v. Colting, the Southern District of New York found that defendant’s “The Kinderguide” series, which condensed classic works of literature into children’s books, infringed the copyrights in the original works despite being marketed as educational tools for parents to introduce literature to young children.

Every year, I ask students in my copyright class why the children’s versions of classic novels in Colting were found to be infringing but a Wikipedia summary of the plots of those same books probably wouldn’t be. A recent ruling in the consolidated copyright cases against OpenAI means I might have to reconsider.

The ruling

On October 27, 2025, Judge Stein of the Southern District of New York denied OpenAI’s motion to dismiss the output-based copyright infringement claims brought by a class of authors including David Baldacci, George R.R. Martin, and others.

OpenAI had argued, reasonably enough, that the authors’ complaint failed to plausibly allege substantial similarity between any of their works and any of ChatGPT’s outputs. It is standard practice in copyright litigation to attach a copy of the plaintiff’s work and the allegedly infringing work to the complaint, but the court held that “the outputs plaintiffs submitted along with their opposition to OpenAI’s motion were incorporated into the Consolidated Class Action Complaint by reference” and that it was enough that their Complaint repeatedly made “clear, definite and substantial references” to the outputs. Losing that civil procedure skirmish was probably a bad sign for OpenAI—a bit like the menacing prologue in A Game of Thrones, you sense that Copyright Winter is Coming.

Judge Stein then went on to evaluate one of the more detailed ChatGPT-generated summaries relating to A Game of Thrones, the 694-page novel by George R. R. Martin which eventually became the famous HBO series of the same name. Even though this was only a motion to dismiss, where the cards are stacked against the defendant, I was surprised by how easily the judge could conclude that:

“A more discerning observer could easily conclude that this detailed summary is substantially similar to Martin’s original work, including because the summary conveys the overall tone and feel of the original work by parroting the plot, characters, and themes of the original.”

The judge described the ChatGPT summaries as:

“most certainly attempts at abridgment or condensation of some of the central copyrightable elements of the original works such as setting, plot, and characters”

He saw them as:

“conceptually similar to—although admittedly less detailed than—the plot summaries in Twin Peaks and in Penguin Random House LLC v. Colting, where the district court found that works that summarized in detail the plot, characters, and themes of original works were substantially similar to the original works.” (emphasis added).

To say that the roughly 580-word GPT summary of A Game of Thrones is “less detailed” than the 128-page Welcome to Twin Peaks Guide in the Twin Peaks case, or the various children’s books based on famous works of literature in the Colting case, is a bit of an understatement.

The Wikipedia comparison

To see why the latest OpenAI ruling is so surprising, it helps to compare the ChatGPT summary of A Game of Thrones to the equivalent Wikipedia plot summary. I read them both so you don’t have to.

The ChatGPT summary of A Game of Thrones is about 580 words long and captures the essential narrative arc of the novel. It covers all three major storylines: the political intrigue in King’s Landing culminating in Ned Stark’s execution (spoiler alert), Jon Snow’s journey with the Night’s Watch at the Wall, and Daenerys Targaryen’s transformation from fearful bride (more on this shortly) to dragon mother across the Narrow Sea. In this regard, it is very much like the 800-word Wikipedia plot summary. Each summary presents the central conflict between the Starks and Lannisters, the revelation of Cersei and Jaime’s incestuous relationship, and the key plot points that set the larger series in motion.

I could say more about their similarities, but I’m concerned that if I explored the summaries in any greater detail, the Authors Guild might think that I am also infringing George R. R. Martin’s copyright, so I’ll move on to the minor differences.

The key difference between the Wikipedia summary and the GPT summary is structural. The Wikipedia summary takes a geographic approach, dividing the narrative into three distinct sections based on location: “In the Seven Kingdoms,” “On the Wall,” and “Across the Narrow Sea.” This structure mirrors the way the novel follows different characters in different locations, to the point where you begin to wonder whether these characters will ever meet. In contrast, the GPT summary follows a more analytical structure, beginning with contextual information about the setting and the series as a whole, then proceeding through sections that follow a roughly chronological progression through the major plot points.

There are some minor differences. The Wikipedia summary provides more granular plot details and clearer causal chains between events. It explains, for instance, how Catelyn’s arrest of Tyrion leads to Tywin’s retaliatory raids on the Riverlands, which in turn necessitates Robb’s strategic alliance with House Frey to secure a crucial bridge crossing. The Wikipedia summary also includes more secondary characters and subplots, such as Tyrion’s recruitment of Bronn as his champion in trial by combat, and Jon’s protection of Samwell Tarly.

The Wikipedia summary probably assumes a greater familiarity with the fantasy genre, whereas the GPT summary might be more helpful to the uninitiated. The GPT summary explains the significance of the long summer and impending winter and explicitly sets out the novel’s major themes.

In broad strokes, however, there is very little daylight between these two summaries. They are remarkably similar in what they include and in what they leave out. Most notably, both summaries sanitize Daenerys’s storyline by omitting the sexual violence that is fundamental to her character arc. This is particularly striking because sexual violence is central to Martin’s narrative in so many places and to the narrative arc of several of the main characters.

If GPT is substantially similar, so is Wikipedia

I don’t see how the ChatGPT summary could infringe the copyright in George R. R. Martin’s novel, if the Wikipedia summary doesn’t. A chilling prospect indeed, but I don’t think that either one is infringing.

It’s absolutely true that you can infringe the copyright in a novel by merely borrowing some of the key characters, plot points, and settings, and spinning out a sequel or a prequel. In copyright, we call this a derivative work. But just because sequels and children’s versions of novels are often infringing doesn’t mean that a dry and concise analytical summary of a novel is infringing.

Why not? It’s actually the act of taking those key structural elements, the skeleton of the novel if you like, and adding new flesh to them to create a new fully realized work that makes an unauthorized sequel infringing.

What’s at stake

Judge Stein’s order doesn’t resolve the authors’ claims, not by a long shot. And he was careful to point out that he was only considering the plausibility of the infringement allegation and not any potential fair use defenses. Nonetheless, I think this is a troubling decision that sets the bar on substantial similarity far too low.

The fact that “[w]hen prompted, ChatGPT can generate accurate summaries of books authored by plaintiffs and generate outlines for potential sequels to plaintiffs’ books” falls well short of demonstrating that such outputs by themselves would be regarded by the ordinary observer as substantially similar to a fully realized novel.

Competition from AI music: Country Girls Make Do

As of October 2025, Suno and Udio are two text-to-music AI platforms that let users create full songs—including lyrics, vocals, and artwork—simply by entering text prompts. Some of this music is unappealing, even to its creators (protagonists?), but music scene insiders have assured me that some of the music emanating from these platforms is good enough to provoke a wistful, “I wish I had written that.”

AI music is also becoming more popular. A recent article in The Economist (of all places) recounts the viral success of “Country Girls Make Do,” a raunchy parody country song generated by artificial intelligence under the pseudonym Beats By AI. The song apparently features on TikTok where users prank the unsuspecting by playing it under false pretenses.

This is more than a one-off. Acts such as Aventhis and The Velvet Sundown, also AI-based, have attracted hundreds of thousands of monthly listeners on Spotify. These tools allow for rapid and prolific production: Beats By AI reportedly releases a new song every day. This is not simply a case of streaming fraud where AI slop steals music plays from real artists by adopting confusing names—Spotify recently removed 75 million such tracks, citing “bad actors” flooding the platform with low-quality content. Some people, at least, like some AI music. The Economist reports a Luminate survey finding that one-third of Americans accept AI-written instrumentals, nearly 30% are fine with AI lyrics, and over a quarter do not mind AI vocals.

No music stands alone, but AI music arguably even less so

The appeal of these tracks lies partly in their echoes of established genres and tropes, with a dash of irony and experimentation thrown in. It remains to be seen whether this portends a consumer-driven revolution in content creation where listeners generate their own entertainment rather than relying on record labels.

What does this mean for copyright law?

Although the Copyright Office would not regard the works of The Velvet Sundown or Beats By AI as copyrightable, Spotify seems happy to pay royalties for AI music, provided the works themselves (as opposed to the copying that fed the AI process that created the works) don’t infringe on other artists’ songs.

AI music may destabilize entrenched business models at the fringes, but it might also foster broader participation and new forms of cultural expression. Does AI pose the same threat to the economic and cultural standing of musicians as it does to stock photography and digital art? Or will AI-generated music remain a hybrid layer within popular culture that feeds off and refers back to mainstream music without replacing the central role of human creation? If so, perhaps at least some country girls will make do.

Piracy, Proxies, and Performance: Rethinking Books3’s Reported Gains

A new NBER working paper by Stella Jia and Abhishek Nagaraj makes some stunning claims about the effects of pirated book corpora on large-language-model (LLM) performance. In Cloze Encounters: The Impact of Pirated Data Access on LLM Performance (May 19, 2025) (working paper) (https://www.nber.org/papers/w33598), the authors contend that access to Books3—a pirated collection of full-text books—raises measured performance by roughly 21–23 percent in some LLMs.

This astonishing finding is an artifact of the paper’s methodology and the very narrow definition of “performance” that it adopts; as such, it should not be taken at face value.

Cloze Encounters’ methodology and claims

Jia and Nagaraj assemble a 12,916-book evaluation set and apply a “name cloze” task: mask a named entity in a short passage and ask the model to supply it.

For instance, given a sentence like “Because you’re wrong. I don’t care what he thinks. [MASK] pulled his feet up onto the branch” from The Lightning Thief, the model should identify “Grover” as the missing name.
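For readers who want the mechanics, the probe can be sketched in a few lines of code. This is my own illustrative reconstruction, not the paper’s actual evaluation code; the function name and scoring rule are assumptions.

```python
# Illustrative sketch of a name-cloze probe (not the paper's code):
# mask a named entity in a short passage, ask a model to fill it in,
# and score a binary hit against the ground-truth name.

def score_name_cloze(model_guess: str, masked_name: str) -> bool:
    """Binary hit: did the model supply the masked named entity?"""
    return model_guess.strip().lower() == masked_name.strip().lower()

passage = ("Because you're wrong. I don't care what he thinks. "
           "[MASK] pulled his feet up onto the branch")
# In the real task, `passage` would be sent to the model under test;
# here we just imagine a hypothetical model response.
hit = score_name_cloze("Grover", "Grover")  # True
```

Aggregating such binary hits across thousands of passages yields the “performance” score whose interpretation the rest of this post questions.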

The main results of Cloze Encounters are estimates of “performance” showing large, statistically significant gains for GPT-class models (about 21–23 percent relative to baseline), smaller gains for Claude/Gemini/Llama-70B (about 7–9 percent), and no detectable effect for Llama-8B. The effects are stronger for less-popular book titles, consistent with fewer substitutes (Internet reviews or summaries) in other training data.

This is all well and good, but the way the authors explicitly link these findings to current controversies relating to copyright policy, licensing markets, and training-data attribution is troubling.

Cloze Encounters is not measuring “performance” in any way that people should care about

The first thing that raised my suspicion about this paper is that I had already seen this exact methodology used as a clever way to illustrate memorization and to show how some books are memorized more than others. See Kent Chang et al., “Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4” (https://arxiv.org/abs/2305.00118). Cloze Encounters scales and repurposes that approach for a causal analysis of how access to pirated books in the Books3 dataset led to improved LLM “performance.” But it doesn’t make sense to me that what counted as a memorization probe in one paper could just be relabeled as a general “performance” metric in another.

Why is memorization so different from performance?

This is a question of construct validity. The method in Cloze Encounters tests recall of a masked name from a short passage, scored as a binary hit. This kind of lexical recall is a narrow slice of linguistic ability that is highly sensitive to direct exposure to the source text. It’s a proxy for memorization rather than the broad competencies that make LLMs interesting and useful.

The capabilities that matter in practice—long-context understanding, abstraction and synthesis, factual grounding outside narrative domains, reliable instruction following—are largely orthogonal to masked-name recall. Calling the cloze score “LLM performance” is a massive over-generalization from a task that measures a thin, exposure-sensitive facet of behavior. As an evaluation device, name-cloze is sharp for detecting whether models learned from—or memorized—a specific source; it is blunt for assessing overall performance. There is no reason to think that evidence of snippets of memorization from particular works in the Books3 dataset has any necessary relationship with being a better translator, drafter, summarizer, brainstorming partner, etc.

This paper is begging to be misread and misapplied in policy and legal debates

I wouldn’t go so far as to say that success on the cloze score tells us “literally nothing” about LLM performance: “almost nothing” is a fairer estimate. To see why, think about the process of pre-training. Pre-training optimizes next-token prediction over trillions of tokens; the cloze outcome is, by construction, basically the same as that objective. So it is not surprising that it is unusually sensitive to direct exposure to given pieces of training data. There probably is a broad correlation between next-token accuracy and perceived usefulness (we certainly saw this in the transition from GPT-3.5 to GPT-4), but the relationship is not lockstep, and it’s easy to imagine a model that excels at memorization alone but generalizes poorly.

The authors nod to these limitations at various points in the manuscript, but they still frame it as a measure of “LLM performance” in a way that is just begging to be misread and misapplied in policy and legal debates. Abstract-level claims travel further than caveats; many readers will see the former and miss the latter.

Nor does the identification strategy employed in the paper do anything to overcome the limits of the construct. The instrumental variable—publication-year share in Books3—may isolate an exogenous shock to exposure. But even granting the exclusion restriction, the estimate remains the effect of Books3 on a name-cloze score. It tells us little about summarization, reasoning, instruction following, safety behavior, or cross-domain generalization.

Bottom line

Cloze Encounters usefully documents that access to Books3 leaves a measurable imprint on exposure-sensitive recall. But its central metric does not justify the broad claims it makes about “LLM performance.” The study measures whether models can fill in masked strings drawn from particular books; it does not show that such access improves the flexible, user-tailored generation that makes these systems valuable.

My testimony to the US Senate Judiciary Subcommittee on IP re: Copyright and AI

I had the great honor of testifying before the US Senate Judiciary Subcommittee on Intellectual Property in relation to Artificial Intelligence and Copyright on Wednesday, July 12th, 2023.

Video and my written submission are available here: https://www.judiciary.senate.gov/artificial-intelligence-and-intellectual-property_part-ii-copyright and I have also linked to my written statement here in case that other link is unavailable.

In my testimony I explained that although we are still a long way from the science fiction version of artificial general intelligence that thinks, feels, and refuses to “open the pod bay doors”, recent advances in machine learning AI raise significant issues for copyright law.

I explained why copyright law does not, and should not, recognize computer systems as authors and why training generative AI on copyrighted works is usually fair use because it falls into the category of non-expressive use.

For more on copyright and generative AI, read Matthew Sag, Copyright Safety for Generative AI (Houston Law Review, Forthcoming) (https://ssrn.com/abstract=4438593)

NAFTA must include fair use commitments

I joined with over seventy international copyright law experts today in calling for NAFTA and other trade negotiators to support a set of balanced copyright principles.

Policies like fair use, online safe harbors, and other exceptions and limitations to copyright permit and encourage access to knowledge, flourishing creativity, and innovation.

The following copyright principles are essential to ensure consumers’ digital rights. Copyright law should:

  • Protect and promote copyright balance, including fair use
  • Provide technology-enabling exceptions, such as for search engines and text- and data-mining
  • Include safe harbor provisions to protect online platforms from users’ infringement
  • Ensure legitimate exceptions for anti-circumvention, such as documentary filmmaking, cybersecurity research, and allowing assistive reading technologies for the blind
  • Adhere to existing multilateral commitments on copyright term
  • Guarantee proportionality and due process in copyright enforcement

Measuring the value of copyright and the value of copyright exceptions is methodologically challenging, but if we use the same criteria that WIPO adopts to estimate the value of copyright, then in the U.S., fair use industries represent 16% of annual GDP and employ 18 million American workers.

The Washington Principles on Copyright Balance in Trade Agreements and the new research on Measuring the Impact of Copyright Balance are located at http://infojustice.org/flexible-use

Text Mining, Non-Expressive Use and the Technological Advantage of Fair Use

On March 29, 2017, I attended a fantastic conference on “Globalizing Fair Use: Exploring the Diffusion of General, Open and Flexible Exceptions in Copyright Law” hosted by American University Washington College of Law’s Program on Information Justice and Intellectual Property. As part of that event we held a webcast Q&A session moderated by Sasha Moss of the R Street Institute. The following is a rough transcript of my comments in response to Sasha’s questions about the legality of the non-expressive use of copyrighted works.

Copyright Questions For the Digital Age

There is no country in the world where simply reading a book and giving someone information about the book, such as its subject or themes, whether it uses particular words or particular combinations of words, the number of words, the number of pages, the ratio of female to male pronouns, etc., would amount to copyright infringement.

Why? Because information about the book is not the book. It is metadata. The question for the digital age is, “Can we use computers to produce that kind of data?” This question is important because although I can read a few books and produce some useful metadata, I can’t read a million books. But a computer can.
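To make the distinction concrete, here is a small illustrative sketch (my own, not from the post) of the kind of metadata a computer can produce about a text. The function name and the pronoun lists are my own choices for illustration; the point is that the output reports facts about the text without reproducing its expression.

```python
import re

def text_metadata(text):
    """Compute simple non-expressive metadata about a text:
    word count, vocabulary size, and the female/male pronoun tally."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    female = sum(w in {"she", "her", "hers", "herself"} for w in words)
    male = sum(w in {"he", "him", "his", "himself"} for w in words)
    return {
        "word_count": len(words),
        "unique_words": len(set(words)),
        "female_pronouns": female,
        "male_pronouns": male,
    }

sample = "She read the book, and he read it too. She liked it."
print(text_metadata(sample))
```

Run over a million digitized books instead of one sentence, this is exactly the kind of analysis at issue: the statistics are useful to researchers, but no one could reconstruct the books from them.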

We have the technology

We have the technology to digitize large collections of books in order to produce data that enables computer scientists, linguists, historians, English professors, and the like, to answer important research questions. The data and the questions it can be used to answer do nothing to communicate the original expression of all those millions of books. However, technically speaking, this kind of digitization is still copying.

But is this the kind of copying that copyright law should be concerned about? If a tree falls in an empty forest, does it truly make a sound? If something is copied but only read by a computer, and the computer only communicates metadata about the work, is that the kind of copying that should amount to copyright infringement?

Text mining is vital for machine learning, automatic translation, and developing language models

It seems to me that once you phrase the question that way the answer is clear. We all use this amazing technology on a daily basis when we rely on Internet search engines, but text mining is about much more than this. By data mining vast quantities of scientific papers, researchers have been able to identify new treatments for diseases. Text mining has also allowed humanities scholars to identify patterns in vast libraries of literature. Text mining is vital for machine learning, automatic translation, and developing the language models that power dictation software.

Fair use and technological advantage

The United States is a world leader in various applications of text mining, starting with Internet search, but going far beyond that. In the United States, once people realized what was possible they more or less started doing it. If Larry Page and Sergey Brin had had the idea for the Google Internet search engine in Canada, Australia, England, or Germany in the 1990s, it would have been crystal-clear that because their search engine relied on making copies of other people’s HTML webpages, and there was no realistic way to obtain permission from all those people, building a search engine would be illegal. In countries with a closed list of copyright exceptions and limitations, or with fair dealing provisions that are tied to specific narrowly defined purposes, a lawyer would have looked at the list and said, “I don’t see Internet search or data mining on that list, so you can’t do it.”

The fair use doctrine reinforces copyright rather than negating it

In the United States, we have the fair use doctrine, which means that the list is not closed. Fair use means you at least get a chance to explain why your particular use of a copyrighted work is for a purpose that promotes the goals of copyright, is reasonable in light of that purpose, and is unlikely to harm the interests of copyright owners. The fair use doctrine reinforces copyright rather than negating it; fair use doesn’t mean that you get to do whatever you want. Fair use is a system for determining how copyright should apply in new situations. That is especially important when the law was written decades ago and society and technology are changing fast.

Without something like fair use, other countries can only follow the United States

Without something like fair use, other countries can only follow the United States. Non-expressive uses of copyrighted works such as text mining, building an Internet search engine, or running plagiarism detection software have all been held to be fair use in the United States and are slowly becoming more accepted around the world. Of course, now that it is readily apparent that these activities are immensely beneficial and entirely non-prejudicial to the interests of copyright owners, we could probably write some specific amendments to the copyright act to make them legal. The problem is, we didn’t know this two decades ago when we actually needed those rules. I don’t know what the next thing that we don’t know is, but I do know that experience has shown that the flexibility of the fair use doctrine—which has been part of copyright law virtually since the English Statute of Anne in 1710, by the way—has worked better than a system of closed lists.

The fair use doctrine is a real source of competitive advantage for technologists and academic researchers in the United States. Right now, there are technologies being developed and research being done in the United States that either can’t be done in other countries, or can only be done by particular people subject to various arbitrary restrictions. Whether it’s Internet search, digital humanities research, machine learning, or cloud computing, other countries have followed the United States in adopting technologies that make non-expressive use of copyrighted works, because some of the copyright risks begin to look less daunting once the practice has become accepted. The Europeans, for example, are pretty sure building a search engine must be legal, but they can’t quite agree why. But the thing to understand is that you can follow this way, but you can never lead. It’s much harder to do the new thing if by the letter of the law it is illegal and you have no forum to argue that it should be allowed.

The future doesn’t have a lobby group

Of course, that’s not quite true, you have one forum … you can spend a vast amount of money on lobbyists and go to the government, go to Congress and try to get some favorable rules written. But even if that is successful from time to time, those rules have a particular character. A company that spends millions of dollars on a lobbying campaign to change the law is always going to try and make sure that those new rules only benefit its business. Special interests will get some laws changed, but usually in ways that disadvantage their competitors or exclude alternative technologies that might one day compete with them. The fundamental problem with relying on static lists of copyright exceptions and lobbying to get those lists revised as needed is that the future doesn’t have a lobby group.

If you would like to read more about these topics:

Loyola is hosting the Society for Economic Research on Copyright Issues Annual Meeting Today

The SERCI Annual Congress 2016 is being held at Loyola University Chicago School of Law, Chicago, on July 7–8, and is co-hosted by the University of Illinois College of Law.

The Society for Economic Research on Copyright Issues or SERCI was established in 2001 to provide a solid academic platform for the application of economic theory to copyright policy.

The complete program is posted online at http://www.serci.org/congress.htm.

My slides for my presentation on empirical studies of copyright litigation are available here.