Copyright Winter is Coming (to Wikipedia?)

Judge Stein’s Order Denying OpenAI’s Motion to Dismiss in Authors Guild v. OpenAI, Inc., No. 25-md-3143 (SHS) (OTW) (S.D.N.Y. Oct. 27, 2025)

A new ruling in Authors Guild v. OpenAI has major implications for copyright law, well beyond artificial intelligence. On October 27, 2025, Judge Sidney Stein of the Southern District of New York denied OpenAI’s motion to dismiss claims that ChatGPT outputs infringed the rights of authors such as George R.R. Martin and David Baldacci. The opinion suggests that short summaries of popular works of fiction are very likely infringing (unless fair use comes to the rescue).

This is a fundamental assault on the idea-expression distinction as applied to works of fiction. It places thousands of Wikipedia entries in the copyright crosshairs and suggests that any kind of summary or analysis of a work of fiction is presumptively infringing.

A white walker in a desolate field reading Wikipedia (an AI Image by Gemini)

Copyright and derivative works

In Penguin Random House LLC v. Colting, the Southern District of New York found that the defendant’s “KinderGuides” series, which condensed classic works of literature into children’s books, infringed the copyrights in the original works despite being marketed as educational tools for parents to introduce literature to young children.

Every year, I ask students in my copyright class why the children’s versions of classic novels in Colting were found to be infringing but a Wikipedia summary of the plots of those same books probably wouldn’t be. A recent ruling in the consolidated copyright cases against OpenAI means I might have to reconsider.

The ruling

On October 27, 2025, Judge Stein of the Southern District of New York denied OpenAI’s motion to dismiss the output-based copyright infringement claims brought by a class of authors including David Baldacci, George R.R. Martin, and others.

OpenAI had argued, reasonably enough, that the authors’ complaint failed to plausibly allege substantial similarity between any of their works and any of ChatGPT’s outputs. It is standard practice in copyright litigation to attach a copy of the plaintiff’s work and the allegedly infringing work to the complaint, but the court held that “the outputs plaintiffs submitted along with their opposition to OpenAI’s motion were incorporated into the Consolidated Class Action Complaint by reference” and that it was enough that their Complaint repeatedly made “clear, definite and substantial references” to the outputs. Losing that civil procedure skirmish was probably a bad sign for OpenAI—a bit like the menacing prologue in A Game of Thrones, you sense that Copyright Winter is Coming.

Judge Stein then went on to evaluate one of the more detailed ChatGPT-generated summaries relating to A Game of Thrones, the 694-page novel by George R. R. Martin which eventually became the famous HBO series of the same name. Even though this was only a motion to dismiss, where the cards are stacked against the defendant, I was surprised by how easily the judge could conclude that:

“A more discerning observer could easily conclude that this detailed summary is substantially similar to Martin’s original work, including because the summary conveys the overall tone and feel of the original work by parroting the plot, characters, and themes of the original.”

The judge described the ChatGPT summaries as:

“most certainly attempts at abridgment or condensation of some of the central copyrightable elements of the original works such as setting, plot, and characters”

He saw them as:

“conceptually similar to—although admittedly less detailed than—the plot summaries in Twin Peaks and in Penguin Random House LLC v. Colting, where the district court found that works that summarized in detail the plot, characters, and themes of original works were substantially similar to the original works.” (emphasis added).

To say that the less-than-580-word GPT summary of A Game of Thrones is “less detailed” than the 128-page Welcome to Twin Peaks Guide in the Twin Peaks case, or the various children’s books based on famous works of literature in the Colting case, is a bit of an understatement.

The Wikipedia comparison

To see why the latest OpenAI ruling is so surprising, it helps to compare the ChatGPT summary of A Game of Thrones to the equivalent Wikipedia plot summary. I read them both so you don’t have to.

The ChatGPT summary of A Game of Thrones is about 580 words long and captures the essential narrative arc of the novel. It covers all three major storylines: the political intrigue in King’s Landing culminating in Ned Stark’s execution (spoiler alert), Jon Snow’s journey with the Night’s Watch at the Wall, and Daenerys Targaryen’s transformation from fearful bride (more on this shortly) to dragon mother across the Narrow Sea. In this regard, it is very much like the 800-word Wikipedia plot summary. Each summary presents the central conflict between the Starks and Lannisters, the revelation of Cersei and Jaime’s incestuous relationship, and the key plot points that set the larger series in motion.

I could say more about their similarities, but I’m concerned that if I explored the summaries in any greater detail, the Authors Guild might think that I am also infringing George R. R. Martin’s copyright, so I’ll move on to the minor differences.

The key difference between the Wikipedia summary and the GPT summary is structural. The Wikipedia summary takes a geographic approach, dividing the narrative into three distinct sections based on location: “In the Seven Kingdoms,” “On the Wall,” and “Across the Narrow Sea.” This structure mirrors the way the novel follows different characters in different locations, to the point where you begin to wonder whether these characters will ever meet. In contrast, the GPT summary follows a more analytical structure, beginning with contextual information about the setting and the series as a whole, then proceeding through sections that follow a roughly chronological progression through the major plot points.

There are other minor differences. The Wikipedia summary provides more granular plot details and clearer causal chains between events. It explains, for instance, how Catelyn’s arrest of Tyrion leads to Tywin’s retaliatory raids on the Riverlands, which in turn necessitates Robb’s strategic alliance with House Frey to secure a crucial bridge crossing. The Wikipedia summary also includes more secondary characters and subplots, such as Tyrion’s recruitment of Bronn as his champion in trial by combat, and Jon’s protection of Samwell Tarly.

The Wikipedia summary probably assumes a greater familiarity with the fantasy genre, whereas the GPT summary might be more helpful to the uninitiated. The GPT summary explains the significance of the long summer and impending winter and explicitly sets out the novel’s major themes.

In broad strokes, however, there is very little daylight between these two summaries. They are remarkably similar in what they include and in what they leave out. Most notably, both summaries sanitize Daenerys’s storyline by omitting the sexual violence that is fundamental to her character arc. This is particularly striking because sexual violence is central to Martin’s narrative in so many places and to the narrative arc of several of the main characters.

If GPT is substantially similar, so is Wikipedia

I don’t see how the ChatGPT summary could infringe the copyright in George R. R. Martin’s novel, if the Wikipedia summary doesn’t. A chilling prospect indeed, but I don’t think that either one is infringing.

It’s absolutely true that you can infringe the copyright in a novel by merely borrowing some of the key characters, plot points, and settings, and spinning out a sequel or a prequel. In copyright, we call this a derivative work. But just because sequels and children’s versions of novels are often infringing does not mean that a dry and concise analytical summary of a novel is infringing.

Why not? It’s actually the act of taking those key structural elements, the skeleton of the novel if you like, and adding new flesh to them to create a new fully realized work that makes an unauthorized sequel infringing.

What’s at stake

Judge Stein’s order doesn’t resolve the authors’ claims, not by a long shot. And he was careful to point out that he was only considering the plausibility of the infringement allegation and not any potential fair use defenses. Nonetheless, I think this is a troubling decision that sets the bar on substantial similarity far too low.

The fact that “[w]hen prompted, ChatGPT can generate accurate summaries of books authored by plaintiffs and generate outlines for potential sequels to plaintiffs’ books” falls well short of demonstrating that such outputs by themselves would be regarded by the ordinary observer as substantially similar to a fully realized novel.

Do law schools need Harvey.AI?

Harvey.AI is following the playbook of Westlaw and Lexis by trying to establish itself as the go-to AI tool for lawyers before they even become lawyers. I asked my university library to organize a Harvey demo so that we could think about joining the ranks of Stanford, UCLA, NYU, Notre Dame, WashU, Penn, UChicago, Boston University, Fordham, BYU, UGA, Villanova, Baylor, SMU, and Vanderbilt (as reported by Above The Law: https://abovethelaw.com/2025/10/harvey-snags-even-more-seats-in-the-t14).

This post is primarily based on a one-hour product demonstration given to us by a Harvey representative. To have a really well-informed view of the product, I would want more hands-on experience, but there is surprisingly little information online about what Harvey is actually offering beyond the company’s own press releases. So, I thought my colleagues at other universities might find this assessment interesting.

TLDR

Meh, it’s OK, but law schools probably don’t need it and are probably only jumping on the bandwagon so that they can be part of the press release.

What is Harvey?

Harvey.AI is a legal-tech and professional services AI company whose flagship product is a generative AI assistant designed specifically for legal workflows used by law firms, in-house legal teams, and other professional services organizations. On its website, Harvey characterizes itself as “Professional Class AI” for leading professional service firms, emphasizing that its technology is domain-specific. In other words, it’s an AI system fine-tuned and optimized for legal and related professional work.

Use Cases and Contraindications

The first thing to understand about Harvey is that it is categorically not a legal research tool. Harvey essentially offers its clients a way of integrating generative AI into some routine drafting and analytical tasks that are quite common in legal practice.

Here are some common use scenarios:

If you have already identified the relevant case law and have a memo template to hand, Harvey AI can help you draft a legal research memo in double-quick time.

Alternatively, Harvey can help you review the key terms of a lengthy contract or almost any other synthesis or summarization task you could imagine.

Another good use case for the Harvey AI platform would be drafting an agreement or marking up the other side’s agreement in light of your own preferred templates. Harvey’s process for drafting from scratch seems directly analogous to vibe coding in software, but with a nice Microsoft Word integration.

You can also use Harvey for analysis and ideation (i.e., brainstorming). I can imagine coming to the end of a 3-month trial, throwing all the relevant documents into Harvey, and then launching into a discussion about closing argument strategy. Or, uploading a motion for summary judgment and the other side’s response, and then trying to anticipate the kinds of questions you might get from the bench.

Harvey’s Value Proposition

You can already do almost all of this with ChatGPT, Gemini, Claude, and the like, subject to volume limitations on how many documents you upload. So, the natural question is, what value add does Harvey AI offer?

Fine-tuning and model switching

One of the advantages claimed by Harvey is that rather than using foundation models like GPT directly, you would be engaging with custom versions of those models, fine-tuned on training data relevant to law and legal analysis. I could imagine that in some fields this would be a significant advantage, but I wonder how much of an advantage it is in the legal field given that most of that fine-tuning data is going to be public domain legal texts that are already well represented in the foundation models.

Another thing Harvey sees as a benefit is that they are not tied to any one model. They currently use three different fine-tuned foundation models, GPT, Gemini, and Claude, and they allocate tasks according to comparative advantage.

Security and confidentiality

By default, prompts and documents transmitted to a company like OpenAI may be used in training, will definitely be stored on OpenAI’s servers (at least for a while), and thus might be subject to discovery through appropriate legal processes. OpenAI offers a setting that lets users opt out of training and specifies that their data will be retained for only 30 days. This is probably good enough for many casual uses and even some mildly sensitive uses, but it’s obviously not enough for material that is subject to attorney-client privilege.

Accordingly, one of the key differentiators offered by Harvey AI is that the documents you upload and the prompts you write will not be accessible to Harvey or any third party, and that all of the information processing takes place in a secure Microsoft Azure environment with end-to-end encryption. This is probably the absolute minimum necessary to use LLMs for legal work. A large law firm could go one step further and actually host its own model in-house rather than relying on Microsoft. That extra layer of security might be required by some especially restrictive protective orders in litigation or by some especially sensitive clients. That sounds great, but I’m pretty sure I already get all that from Microsoft Copilot (although, to be sure, I would have to do a deep dive into the terms and conditions Microsoft offers my university).

Another nice feature of Harvey is that the client administrator can set permissions for individual users and for particular teams of users. This is critical in a corporate law environment where access to sensitive documents needs to be compartmentalized. It’s also critical if Harvey is being made available to students in a law school environment, because students taking foundational courses such as Legal Writing and Research should probably not have access to Harvey AI.

Document Review (Retrieval-Augmented Generation)

Harvey AI has a good user interface for analyzing large volumes of documents. That is essentially an implementation of retrieval-augmented generation (RAG).

What’s RAG?

In very simple terms, RAG is an alternative to just answering a question through next-token prediction, relying on bulk context and whatever knowledge and understanding is latent in a foundation model. In a RAG process, the user query is translated into a document query. The document query identifies sections of documents that seem relevant to the query. Those sections are then collated and fed back into a general model, which attempts to answer the question based on the specifically retrieved chunks of text. Platforms like ChatGPT use a process like this any time you see them searching the web and providing links back to particular documents.
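To make the mechanics concrete, here is a minimal sketch of the retrieval step in Python. The document chunks and query are invented, TF-IDF similarity stands in for whatever proprietary embedding and ranking method a system like Harvey actually uses, and the generation step is left as a placeholder:

    # Minimal RAG sketch: rank document chunks against the query, then hand the
    # best matches to a language model as context. TF-IDF is a stand-in for a
    # real system's embedding model; the chunks and query are invented examples.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    chunks = [
        "Section 8.2: Either party may terminate upon a change of control...",
        "Section 3.1: The licensee shall pay royalties quarterly...",
        "Section 12.4: This agreement is governed by the laws of Delaware...",
    ]
    query = "Is there a change of control provision?"

    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(chunks)         # index the chunks
    query_vec = vectorizer.transform([query])             # embed the query the same way
    scores = cosine_similarity(query_vec, doc_matrix)[0]  # rank chunks by similarity

    top_k = scores.argsort()[::-1][:2]                    # keep the two best matches
    context = "\n\n".join(chunks[i] for i in top_k)

    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    # answer = call_llm(prompt)  # generation step; provider-specific and omitted here

The point of the structure is that the model only ever sees the retrieved chunks, which is why the quality of the matching step matters so much.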

Harvey does RAG pretty well

RAG sounds like a great idea in theory. But whether it works in practice depends on how good the matching method is, which can vary a lot from context to context. In any RAG process, you will never know what relevant chunks of text were overlooked, and you won’t know whether the interpretive part of the model has drawn the appropriate inferences from the chunks it has retrieved unless you go back and check the original sources. One of the things I liked about the Harvey UX is that it made it easy to inspect the original document fragments and it had a clear process for checking off that these had actually been interrogated.

Example use cases would be looking for change-of-control provisions in licensing agreements as part of merger due diligence, or in document review for litigation. The Harvey representative we spoke to candidly admitted that the system performed really well in establishing a chronology, except in relation to emails. This makes sense, because an email thread contains lots of different dates all jumbled in together, but it is clearly a major limitation.

Prompting and training

Another value-add our representative stressed was prompting. Our representative seemed to be saying not only that Harvey would be running some thoughtfully-crafted prompts in the background, essentially running interference between user instructions and the models, but also that individual clients could do this for themselves. I can see why this might be an appealing feature to some people, but I’m not entirely convinced that obscuring the steps in an analytical process from the user is a good idea.

My Assessment

Generative AI as legal technology

Before we get into the specific pros and cons of Harvey, we need to consider the appropriate uses of generative AI as a legal technology more generally.

Many key deliverables in the legal field are in the form of text. But it’s relatively rare that the value of that text is entirely contained within the document itself. When a lawyer explains something to a client, they aren’t just helping their client understand something. They are also making a set of representations about the thought, diligence, and analysis that has gone into formulating that advice. Clients don’t just want text for its own sake, they want text you stand behind.

Accordingly, the most significant uses of generative AI in the legal field will be ones that accelerate a drafting-review or document-analysis process, as opposed to merely substituting for the underlying analysis.

Responsible use of generative AI in the legal field must be accompanied by either:

  • strong validation mechanisms (such as a process for clicking through the footnotes to confirm that the document in question really says what the model represented),
  • a knowing and well-informed acceptance of certain risks, or
  • the kind of external validation that a lawyer who is already familiar with the underlying materials intrinsically provides.

The validity questions that need to be answered before deploying generative AI as a legal technology are not limited to the problem of hallucinations in the narrow sense of invented cases, citations, and quotations.

Harvey claims to do very well in dealing with hallucinations, but it’s important to situate this in the context that Harvey is not a legal research tool. The kinds of tasks that Harvey says its product should be used for are exactly the kind of tasks where one would expect a much lower incidence of hallucinations. Why? Because they are mostly summary or translation tasks where the model has specific documents or templates to draw from. Even so, I’m a bit skeptical that the rate of hallucinations is really as low as Harvey claims.

The value proposition for law firms

Depending on the cost, I can see that Harvey would be a very attractive proposition for law firms of all sizes. Most of what Harvey offers can be replicated through an enterprise agreement with one of the main AI providers. Harvey offers a turnkey solution and a good user interface. You can think of it as ChatGPT in a black turtleneck, but that’s no bad thing.

Is it worth it? That depends on the cost, and the cost of the alternatives.

The value proposition for law schools

There is no doubt that most of our students are already using generative AI. It seems appropriate that we begin training them to do so properly and responsibly at the earliest opportunity. That said, the availability of generative AI to students taking specific skills courses could easily undermine the development of those skills. Rather than simply making Harvey available to all students, it makes sense to exclude first-year students and perhaps some upper-level skills courses. But obviously, we would want students in our Advanced Legal Writing course (where we are teaching AI skills) to have access to this tool.

If we decide that we don’t want students in our clinics using generative AI, then one of the major selling points of Harvey disappears. Our students don’t need the robust confidentiality protection that Harvey offers.

If Harvey is offering commercially reasonable terms, I still think it is an attractive proposition. But its value in legal education seems to me to be really quite limited. Our students are not conducting massive document review exercises or working with in-house templates. Most of the things students would find compelling about using Harvey, they can already do with Microsoft Copilot, ChatGPT, Gemini, and Claude.

Legal Scholars Roundtable on Artificial Intelligence 2026 (save the date)

Emory Law is proud to host the 5th annual Legal Scholars Roundtable on Artificial Intelligence on April 9-10, 2026, at Emory University in Atlanta, Georgia. The Legal Scholars Roundtable on Artificial Intelligence is a forum for the discussion of current legal scholarship on AI, covering a range of methodologies, topics, perspectives, and legal intersections.

We will make a formal call for papers in January, with a submission deadline sometime in February.

The AI Roundtable is convened by Prof. Matthew Sag (Emory Law) and Prof. Charlotte Tschider (Loyola Law Chicago).

Competition from AI music: Country Girls Make Do

As of October 2025, Suno and Udio are two text-to-music AI platforms that let users create full songs—including lyrics, vocals, and artwork—simply by entering text prompts. Some of this music is unappealing, even to its creators (protagonists?), but music scene insiders have assured me that some of the music emanating from these platforms is good enough to provoke a wistful, “I wish I had written that.”

AI music is also becoming more popular. A recent article in The Economist (of all places) recounts the viral success of “Country Girls Make Do,” a raunchy parody country song generated by artificial intelligence under the pseudonym Beats By AI. The song apparently features on TikTok where users prank the unsuspecting by playing it under false pretenses.

This is more than a one-off. Acts such as Aventhis and The Velvet Sundown, also AI-based, have attracted hundreds of thousands of monthly listeners on Spotify. These tools allow for rapid and prolific production: Beats By AI reportedly releases a new song every day. This is not simply a case of streaming fraud where AI slop steals music plays from real artists by adopting confusing names—Spotify recently removed 75 million such tracks, citing “bad actors” flooding the platform with low-quality content. Some people, at least, like some AI music. The Economist reports a Luminate survey finding that one-third of Americans accept AI-written instrumentals, nearly 30% are fine with AI lyrics, and over a quarter do not mind AI vocals.

No music stands alone, but AI music arguably even less so

The appeal of these tracks lies partly in their echoes of established genres and tropes, with a dash of irony and experimentation thrown in. It remains to be seen whether this portends a consumer-driven revolution in content creation where listeners generate their own entertainment rather than relying on record labels.

What does this mean for copyright law?

Although the Copyright Office would not regard the works of The Velvet Sundown or Beats By AI as copyrightable, Spotify seems happy to pay royalties for AI music, provided the works themselves (as opposed to the copying that fed the AI process that created the works) don’t infringe on other artists’ songs.

AI music may destabilize entrenched business models at the fringes, but it might also foster broader participation and new forms of cultural expression. Does AI pose the same threat to the economic and cultural standing of musicians as it does to stock photography and digital art? Or will AI-generated music remain a hybrid layer within popular culture that feeds off and refers back to mainstream music without replacing the central role of human creation? If so, perhaps at least some country girls will make do.

Skater Beagle and the Puzzle of AI Creativity

Generative AI poses a puzzle for copyright lawyers, and many others besides. How can a soulless mechanical process lead to the creation of new expression, seemingly out of nothing, or if not nothing, very little?

This essay will help you understand where the apparent creativity in generative AI outputs comes from, why a lot of AI works are not copyrightable, and why the outputs of generative AI are mostly very different to the works those AIs were trained on.

Who is the author of Skater Beagle?

The image below was created by one LLM (Google Gemini) using a long prompt written by another LLM (Anthropic’s Claude) following the instruction “draft a prompt for an arresting image of a beagle on skateboard.”

AI generated “arresting image of a beagle on skateboard.” From a low angle, a joyful beagle with ears flying expertly rides a skateboard down a steep urban hill during a cinematic, “golden hour” sunset. A city skyline is backlit by the setting sun.

If I took this photo in real life, I would be recognized as the author. Likewise, if I painted it as a picture. But because the image was created by a process that involved very little direct human contribution, it is uncopyrightable. For many people, this seems odd. How can an image that looks creative not be recognized as copyrightable, just because it was created with AI rather than an iPhone camera or a set of water-based paints? After all, artists use tools to make art all the time.

No copyright for the AI

The first question to address is whether Google’s image generation model is the author of Skater Beagle. The answer is no, for many reasons, but let’s focus on the copyright issues, because they are the most interesting.

The AI can’t get copyright protection because the AI itself is not creative in any of the ways we generally understand that term (at least if you are a copyright lawyer): it lacks any desire or intention to express. In Burrow-Giles Lithographic Co. v. Sarony (1884), the U.S. Supreme Court recognized that a photograph could be copyrighted, but only because the photographer’s creative choices made the image an “original intellectual conception[] of the author” rather than a mere mechanical capture. LLMs are impressive, but they don’t have any intentions separate from the math that makes them predict one thing and not another. LLMs don’t have an original intellectual conception they are trying to express.

No copyright for the simple prompt engineer

If not the AI, then maybe the person who writes the prompts should be credited with the resulting expression? After all, isn’t choosing the right words in the prompt a creative act?

That doesn’t work either. Sure, choosing the right words in the prompt might be creative in some senses, but copyright law doesn’t protect creativity in the sense of “hey, that’s a good idea”; it protects creativity that manifests in original expression. This idea-expression distinction is one of the foundations of copyright law. Copyright attaches to the final expression, not the upstream idea or instruction that triggered it. Even if you think my idea to get one LLM to write a prompt for another LLM “for an arresting image of a beagle on skateboard” is creative, it’s really just a simple idea and nothing copyrightable.

Surely, it must be one or the other?

But still, many would say, if Skater Beagle exhibits all the tell-tale signs of subjective creative authorship, that creativity must come from somewhere. So it’s either the AI or the person who wrote the prompt?

This line of thinking is half right: the generative AI is doing something important, seemingly creating something from nothing, but it’s not “creativity” in the relevant sense. If you want to think of all of the details of the skater-beagle picture as expression, that expression does not magically appear from the ether; it comes from the latent space implied by the training data as processed by the model during training. In some ways it’s fair to say it comes from the collective efforts of all of the authors of all of the works in the training data. But not in the sense of a simple remix or cut-and-paste job.

Not from nothing, but not a remix

Generative AI systems come in different kinds: GANs, diffusion models, multimodal large language models, and more. The common feature of all these systems is that they are trained on a large volume of prior works and, through a mathematical process, they are able to produce new works, often with very limited additional human input. But that doesn’t mean Skater Beagle belongs to the millions (tens of millions? hundreds of millions?) of authors of the works in the training data. This beagle is not a simple remix or collage. Although generative AI models are data dependent, they don’t just remix the training data; they produce genuinely new outputs.

AI Creativity comes from latent space

Generative AI models learn an abstract model of the training data, a model that is in many ways more than the sum of its parts. When you prompt a generative AI model, you are not querying a database, you are navigating a latent space implied by the training data.

What do I mean by “navigating a latent space implied by the training data”? Let’s start with a simple analogy. When you fit a linear regression to a handful of data points you generate a line of best fit implied by the data as seen in the figure below. Think of the dots as the training data and the line as the model implied by the training data.

Illustration of fitting a line to scattered data. Two side-by-side scatter plots on a beige background. Left: Five orange data points scattered in an upward trend without a line. Right: The same points with a straight diagonal line drawn from bottom left to top right, representing a best-fit line. Both axes are labeled X and Y, ranging from 0 to 10.

The line illustrated above is simple; it is in fact an equation that you can use to answer the question, “if y is 6, what is x?” The point (6,6) is not in the data, but it is implied by the data and the model we used to fit the data. When you plug y=6 into the model, you are navigating to a point implied by the data that tells you x=6, as seen in the figure below. That is what I mean by navigating the latent space.

Illustration of navigating to point implied by linear regression. A scatter plot with five orange data points, a green dashed diagonal line representing a trend, and red dashed lines intersecting at the point (6,6). Axes are labeled X and Y, ranging from 0 to 10, on a beige background.

But of course, if we used a different model, the data would imply a slightly different latent space, as illustrated in the figure below. Here the model is not linear but quadratic, and just changing that starting assumption gives us a different line of best fit.

Illustration of fitting a different model to the data. A scatter plot with five orange data points on a textured blue-and-beige background. A green dashed curve rises steeply before leveling off, intersecting red dashed lines at the point (4,6). Axes are labeled X and Y, ranging from 0 to 10.

The difference between the straight line and the curved line here is analogous to the difference between different LLMs. Obviously, generative AI models are much more complicated than a two-dimensional regression model. Generative AI models have thousands of dimensions, and so they construct a much richer latent space, but the analogy holds. Any number of dimensions above three is hard to conceptualize; don’t bother trying to imagine thousands of dimensions, your brain might melt.
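If you would like to see the analogy in code, here is a tiny numpy sketch. The data points are invented for illustration, and for simplicity I ask the question in the conventional direction (given a new x, what value does each model imply?); the point is only that two different modeling assumptions imply two different values at the same query point, a point that is not in the data itself.

    import numpy as np

    # Five invented "training" points, standing in for the dots in the figures above
    x = np.array([1.0, 3.0, 4.0, 7.0, 9.0])
    y = np.array([1.2, 3.1, 4.0, 6.9, 8.8])

    linear = np.polyfit(x, y, deg=1)      # line of best fit
    quadratic = np.polyfit(x, y, deg=2)   # same data, different modeling assumption

    # "Navigating the latent space": ask each model what it implies at x = 6,
    # a point that is not in the data but is implied by the data plus the model.
    print(np.polyval(linear, 6.0))
    print(np.polyval(quadratic, 6.0))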

Does Latent Space Solve the Creativity Puzzle?

Understanding latent space helps resolve the creativity puzzle. The image of Skater Beagle looks original because the model has generated a point in a vast space of possible images implied by its training data — not because a human author made free and creative choices about the details. The model navigates to a statistically plausible combination of features, but no person decides where the beagle’s ears should fly, how steep the hill should be, or what the sunset should look like. Understanding latent space helps explain why the output of a model can feel creative but still lacks the human authorship copyright law requires.

But wait, …

But in practice, it seems like almost any photo you send to the Copyright Office will be deemed creative enough to meet the requirements for registration. If I can get copyright for just pointing my iPhone at a beagle on skateboard and pressing a button, why can’t I get copyright in an image of a beagle on skateboard that I created using generative AI?

This seems inconsistent at first blush, but only because the question overlooks the difference between the “thin” copyright that attaches to photos based in reality and the “thick” copyright that typically attaches to illustrations drawn from imagination.

Small jumps versus big jumps

When you take a photo, you are making a copyrightable selection and arrangement from reality. You get no rights in the underlying reality, just a specific photographic representation therein. In most copyrightable photos there is only a small jump between idea and expression and so the resulting copyright is limited to that jump. Taking a photo does not give you exclusive rights on the underlying ideas, subjects, locations, etc.

There are two critical differences between the typical iPhone snap and an image generated with AI.

The first difference is that there is a much more significant jump between idea and expression in the transition from text prompt to final image, compared to the jump from a real-life scene to a photo capturing that scene. The second difference is that in photography, a human still makes some minimal creative decisions (framing, timing, composition) that manifest in the look of the resulting image. The human makes the jump, even if it’s only a small jump. In AI generation, the algorithm fills in the details that transform the prompt into a specific visual expression. The AI makes the jump between your idea for a photo and the details of the photo itself.

There is no copyright in the Skater Beagle image Gemini made for me. The work of bridging the gap from abstract concept to concrete image was done entirely by algorithms trained on trillions of words and millions of photos. The details that we might think of as expression in the image didn’t come from nothing, but nor did they come directly from any particular photo featuring low angle action shots, beagles, dogs with ears flying, skateboard riders, steep hills, urban settings, “golden hour” sunsets, city skylines, etc. The details that we might think of as expression don’t reflect the free and creative choices of any human mind. They are details implied by a model trained on millions of photos, but those details don’t come from those photos either. They come from the universe of possibilities those photos imply; they come from latent space.

Skater Beagle is an extreme example

Generative AI lets us navigate a latent space implied by works too numerous to count so that we can create genuinely new digital artifacts. I began this essay with the promise that understanding this would shed light on how copyright applies to AI-generated works, but Skater Beagle is an extreme example drawn from one end of the continuum. Understanding why Skater Beagle is not a copy of beagles in the training data, but is also not my creative expression tells us that the Copyright Office is right to deny copyright to some generative AI creations. But it does not tell us at what point a user would cross the line from commissioning editor to guiding hand or creative mastermind. It’s hard to imagine crossing that line with a single text prompt, but it’s easy to see how you would leap over it in an iterative process as in A Single Piece of American Cheese. Iterative interactive use of generative AI will often be an act of authorship, so long as it is more than just choosing a winner in a beauty pageant of AI creations.

[This essay was adapted from Matthew Sag, Copyright Law in the Age of AI (2025)]

Piracy, Proxies, and Performance: Rethinking Books3’s Reported Gains

A new NBER working paper by Stella Jia and Abhishek Nagaraj makes some stunning claims about the effects of pirated book corpora on large-language-model (LLM) performance. In Cloze Encounters: The Impact of Pirated Data Access on LLM Performance (May 19, 2025) (working paper) (https://www.nber.org/papers/w33598), the authors contend that access to Books3—a pirated collection of full-text books—raises measured performance by roughly 21–23 percent in some LLMs.

This astonishing finding is an artifact of the paper’s methodology and the very narrow definition of “performance” that it adopts; as such, it should not be taken at face value.

Cloze Encounters’ methodology and claims

Jia and Nagaraj assemble a 12,916-book evaluation set and apply a “name cloze” task: mask a named entity in a short passage and ask the model to supply it.

For instance, given a sentence like “Because you’re wrong. I don’t care what he thinks. [MASK] pulled his feet up onto the branch” from The Lightning Thief, the model should identify “Grover” as the missing name.
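To give a sense of the shape of the task, here is a rough sketch of how a single name-cloze probe might be scored. This is not the authors’ code: the prompt wording is my guess, and the model call is a placeholder.

    # Rough sketch of a name-cloze probe: mask a character name, ask a model to
    # fill in the blank, and score the answer as a binary hit. Not the paper's code.
    passage = ("Because you're wrong. I don't care what he thinks. "
               "[MASK] pulled his feet up onto the branch")
    answer = "Grover"

    prompt = ("The following passage comes from a novel. One proper name has been "
              "replaced with [MASK]. Reply with the missing name and nothing else.\n\n"
              + passage)

    def ask_model(prompt: str) -> str:
        # Placeholder: a real probe would call an LLM API here (provider-specific).
        return "Grover"

    guess = ask_model(prompt).strip()
    hit = int(guess.lower() == answer.lower())  # 1 if the model recalls the name, else 0
    print(hit)

Aggregated over thousands of such probes, those binary hits become the paper’s “performance” measure.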

The main results of Cloze Encounters are estimates of “performance” showing large, statistically significant gains for GPT-class models (about 21–23 percent relative to baseline), smaller gains for Claude/Gemini/Llama-70B (about 7–9 percent), and no detectable effect for Llama-8B. The effects are stronger for less-popular book titles, consistent with fewer substitutes (Internet reviews or summaries) in other training data.

This is all well and good, but the way the authors explicitly link these findings to current controversies relating to copyright policy, licensing markets, and training-data attribution is troubling.

Cloze Encounters is not measuring “performance” in any way that people should care about

The first thing that raised my suspicion about this paper is that I had already seen this exact methodology used as a clever way to illustrate memorization and to show how some books are memorized more than others. See Kent Chang et al., “Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4” (https://arxiv.org/abs/2305.00118). Cloze Encounters scales and repurposes that approach for a causal analysis of how access to pirated books in the Books3 dataset led to improved LLM “performance.” But it doesn’t make sense to me that what counted as a memorization probe in one paper could just be relabeled as a general “performance” metric in another.

Why is memorization so different to performance?

This is a question of construct validity. The method in Cloze Encounters tests recall of a masked name from a short passage, scored as a binary hit. This kind of lexical recall is a narrow slice of linguistic ability that is highly sensitive to direct exposure to the source text. It’s a proxy for memorization rather than the broad competencies that make LLMs interesting and useful.

The capabilities that matter in practice—long-context understanding, abstraction and synthesis, factual grounding outside narrative domains, reliable instruction following—are largely orthogonal to masked-name recall. Calling the cloze score “LLM performance” is a massive over-generalization from a task that measures a thin, exposure-sensitive facet of behavior. As an evaluation device, name-cloze is sharp for detecting whether models learned from—or memorized—a specific source; it is blunt for assessing overall performance. There is no reason to think that evidence of snippets of memorization from particular works in the Books3 dataset has any necessary relationship with being a better translator, drafter, summarizer, brainstorming partner, etc.

This paper is begging to be misread and misapplied in policy and legal debates

I wouldn’t go so far as to say that success on the cloze score tells us “literally nothing” about LLM performance: “almost nothing” is a fairer estimate. To see why, think about the process of pre-training. Pre-training optimizes next-token prediction over trillions of tokens; the cloze outcome is, by construction, basically the same as that objective. So it is not surprising that it is unusually sensitive to direct exposure to given pieces of training data. There probably is a broad correlation between next-token accuracy and perceived usefulness (we certainly saw this in the transition from GPT-3.5 to GPT-4), but the relationship is not lockstep, and it’s easy to imagine a model that excels at memorization alone but generalizes poorly.
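For readers who want the formalism, the standard pre-training objective is next-token cross-entropy over a sequence of tokens x_1, ..., x_T:

    \mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}(x_t \mid x_{<t})

A name-cloze hit essentially asks the model to do the same thing for one masked token drawn from a passage it may have seen in training, which is why the probe is so sensitive to exposure.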

The authors nod to these limitations at various points in the manuscript, but they still frame it as a measure of “LLM performance” in a way that is just begging to be misread and misapplied in policy and legal debates. Abstract-level claims travel further than caveats; many readers will see the former and miss the latter.

Nor does the identification strategy employed in the paper do anything to rescue the construct from its limits. The instrumental variable—publication-year share in Books3—may isolate an exogenous shock to exposure. Even granting the exclusion restriction, the estimate remains the effect of Books3 on a name-cloze score. It tells us little about summarization, reasoning, instruction following, safety behavior, or cross-domain generalization.
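To make clear what the instrument can and cannot buy you, the setup is, in generic two-stage least squares terms (my gloss, not necessarily the paper’s exact specification), something like:

    \text{First stage:}\quad \text{Exposure}_i = \pi_0 + \pi_1 Z_i + v_i
    \text{Second stage:}\quad \text{ClozeScore}_i = \beta_0 + \beta_1 \widehat{\text{Exposure}}_i + \varepsilon_i

where Z_i is the instrument (the Books3 share for the book’s publication year). Even if Z_i is exogenous, \beta_1 is still only the causal effect of exposure on the cloze score, not on any broader notion of capability.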

Bottom line

Cloze Encounters usefully documents that access to Books3 leaves a measurable imprint on exposure-sensitive recall. But its central metric does not justify the broad claims it makes about “LLM performance.” The study measures whether models can fill in masked strings drawn from particular books; it does not show that such access improves the flexible, user-tailored generation that makes these systems valuable.

Thomson Reuters v. ROSS Intelligence (Summary Judgment)

In a closely watched decision revising a previous summary judgment, Judge Stephanos Bibas, a Third Circuit judge sitting by designation, sided largely with Thomson Reuters in its copyright dispute against ROSS Intelligence. The ruling granted partial summary judgment on direct copyright infringement claims while dismissing ROSS’s argument that its use of Thomson Reuters’ content qualified as fair use.

With Ross Intelligence now bankrupt and the technology at issue a decidedly niche application, attention is shifting to the broader implications for AI training and the use of copyrighted materials—particularly in the realm of generative AI. Earlier, Judge Bibas had refused to grant summary judgment on fair use, insisting the matter be put before a jury. However, upon further reflection, he reversed course, ultimately rejecting the defendant’s fair use defense outright.

Background

Thomson Reuters, the owner of Westlaw, accused the AI-driven legal research firm ROSS of copyright infringement, alleging that it had improperly used legal summaries—so-called Bulk Memos—derived from Westlaw’s editorial materials, particularly its headnotes, to train its technology. Thomson Reuters had refused to license its content to ROSS, a rival developing an AI-powered legal research tool requiring a database of legal questions and answers for training. To obtain the necessary data, ROSS partnered with LegalEase, which compiled and sold approximately 25,000 Bulk Memos—summaries created by lawyers referencing Westlaw headnotes. Whether the Bulk Memos involved verbatim copying or otherwise infringing copying was an issue in the case that ultimately went against ROSS. Upon discovering that ROSS had used content derived from these headnotes, Thomson Reuters filed a copyright infringement lawsuit. The summary judgment pertains only to a subset of the contested headnotes, leaving broader legal questions unresolved.

The court ruled against ROSS, determining that it had copied 2,243 headnotes and dismissing its various legal defenses, including claims of innocent infringement, copyright misuse, and the merger doctrine.

Ross’s use was not transformative

Judge Bibas ruled that ROSS’s use of Thomson Reuters’ material was commercial and non-transformative, a conclusion that weighed heavily in the publisher’s favor. According to the court, the use did not qualify as transformative because it lacked a distinct purpose or character from Thomson Reuters’ original work.

The court’s conclusion that Ross’s use was not transformative is puzzling, especially given its acknowledgment—while discussing the third fair use factor—that the output of Ross’s system did not replicate Westlaw’s copyrighted headnotes but rather produced uncopyrighted judicial opinions.

The court did distinguish two significant cases, Sega Enterprises Ltd. v. Accolade, Inc. and Sony Computer Entertainment, Inc. v. Connectix Corp., but failed to consider cases like iParadigms, HathiTrust, and Google Books. Even the way the court dealt with the reverse engineering cases is a bit suspect. The court set them aside for two reasons: first, because those cases involved copying software code, and second, because such copying was “necessary for competitors to innovate.” To be sure, Oracle v. Google suggests that cases involving software may merit special treatment, but it is not clear why the software context should make a difference here. Judge Bibas’s invocation of necessity is undercooked as well. Whether an act of copying is “necessary” is inextricably tied to the level of generality at which you ask the question. In Oracle v. Google, Google’s replication of the Java APIs was essential for compatibility with the expectations of existing Java programmers, but whether that compatibility was a necessity or a luxury again depends on how you frame the question. After all, other smartphones ran without making life easy for Java programmers.

Not generative AI, but why?

The judge took care to distinguish this case from generative AI, yet the distinction remains murky. The court stated: “Ross was using Thomson Reuters’s headnotes as AI data to create a legal research tool to compete with Westlaw. It is undisputed that Ross’s AI is not generative AI (AI that writes new content itself).” And later that “Because the AI landscape is changing rapidly, I note for readers that only non-generative AI is before me today.”

But what, exactly, sets this apart from generative AI? More broadly, how does this differ from other cases where nonexpressive uses have been deemed fair use? The opinion offers little guidance. It fails to engage with seemingly comparable precedents, such as plagiarism detection tools, library digitization for text analysis and digital humanities research, or the creation of a book search engine—cases where courts have found fair use.

The closest we get to an explanation of why Ross’s use of the Westlaw headnotes is different to the intermediate copying in iParadigms, HathiTrust, and Google Books is that Ross merely retrieves and presents judicial opinions in response to user queries. This process, the court observed, closely parallels Westlaw’s own practice of using headnotes and key numbers to identify relevant cases. Consequently, the court concluded that Ross’s use was not transformative, as it primarily served to facilitate the development of a competing legal research tool rather than to add new expression or meaning to the copied material.

Market effect

The court determined that ROSS’s actions impaired Thomson Reuters’ market for legal AI training data, and in its reasoning, the fourth fair use factor carried substantial weight. Without qualification, the opinion echoes Harper & Row’s assertion that the fourth factor “is undoubtedly the single most important element of fair use.” This is problematic. Asserting the absolute primacy of the fourth factor is obviously in error in light of Campbell, as well as the Court’s more recent decisions in Google v. Oracle and Andy Warhol Foundation. The Court’s contemporary approach to fair use eschews rigid hierarchies among the statutory factors.

That said, the judge’s finding in relation to the fourth factor may not be entirely unreasonable in this case: Ross explicitly intended to compete with Westlaw by creating a viable market alternative. For the court the key fact was that Ross “meant to compete with Westlaw by developing a market substitute.” “And it does not matter whether Thomson Reuters has used the data to train its own legal search tools; the effect on a potential market for AI training data is enough.”

Implications

One district court opinion that barely engages with the relevant caselaw will not change U.S. fair use law overnight, but it will certainly be welcome news for the plaintiffs in the more than 30 ongoing AI copyright cases currently being litigated.

I think what is really going on in this decision is that the judge has confused the first factor with the fourth factor. There is no obvious way to distinguish training on the question and answer memos to develop a model that directly links user questions to the relevant case law from cases involving search engines and plagiarism detection software. The real distinction, if there is one, is that ROSS used Westlaw’s product to create a directly competing product.

Looking at the case this way, the decision might actually be good for the generative AI defendants, in cases like NYT v OpenAI, because there isn’t the same direct competition. 

* This is my first quick take on the decision just hours after it was handed down.

* Citation: Thomson Reuters Enter. Ctr. GmbH v. ROSS Intelligence Inc., No. 1:20-cv-613-SB (D. Del. Feb. 11, 2025)

Book Review: Nick Seaver, Computing Taste: Algorithms and the Makers of Music Recommendation

(University of Chicago Press, 2022)

In Computing Taste, Nick Seaver provides an ethnographic exploration of the world of music recommendation systems, revealing how algorithms are deeply shaped by the humans who design them. He shows how the algorithms that drive music recommendations are shaped by human judgment, creativity, and cultural assumptions. The data companies collect, the way they construct models, how they intuitively test whether their models are working, and how they define success are all deeply human and subjective choices.

Beyond Man vs. Machine

Seaver points out that textbook definitions describe algorithms as “well-defined computational procedures” that take inputs and generate outputs, portraying them as deterministic and straightforward systems. This narrow view leads to a man-versus-machine narrative that is trite and unilluminating. Treating algorithms as though their defining quality is the absence of human influence reinforces misconceptions about their neutrality. Instead, Seaver advocates for focusing on the sociotechnical arrangements that produce different forms of “humanness and machineness,” echoing observations by Donna Haraway and others.

In practice, algorithmic systems are messy, constantly evolving, and shaped by human judgment. As Seaver notes, “these ‘cultural’ details are technical details,” meaning that the motivations, preferences, and biases of the engineering teams that design algorithms are inseparable from the technical aspects of the systems themselves. Therefore, understanding algorithms requires acknowledging the social and cultural contexts in which they operate.

From Information Overload to Capture

Seaver shows how the objective of recommendation systems has shifted from the founding myth of information overload to the current obsession with capturing user attention. Pioneers of recommender systems told stories of information overload that presented growing consumer choice as a problem in need of a solution. The notion of overwhelming users with too much content has been a central justification for creating algorithms designed to filter and organize information. If users are helpless in the face of vast amounts of data, algorithms become necessary tools to help them navigate this digital landscape. Seaver argues that the framing of overload justifies the control algorithms exert over what users see, hear, and engage with. The idea of “too much music” or “too much content” becomes a convenient rationale for developing systems that, in practice, do more than assist—they guide, constrain, and shape user choices.

In any event, commercial imperatives soon led to rationales based on information overload giving way to narratives of capture. Seaver compares recommender systems to traps designed to “hook” users, analyzing how metrics such as engagement and retention guide the development of algorithms. Seaver traces the evolution of recommender systems from their origins as tools to help users navigate the overwhelming abundance of digital content to their current role in capturing and retaining user attention. The Netflix Prize, a 2006 competition aimed at improving Netflix’s recommendation algorithm, serves as a key example of this shift. Initially, algorithms were designed to help users manage “information overload” by personalizing content based on user preferences, as Netflix sought to predict what users would enjoy. However, Netflix never used the winning entry. As streaming services became central to Netflix’s business model, the focus of recommendation systems shifted from merely helping users find content to keeping them engaged on the platform for as long as possible. This transition from personalization to attention retention shows the shift in the industry’s goals. Recommender systems, including those at Netflix, began to focus on encouraging continuous engagement by suggesting binge-worthy content to maximize viewing hours, implementing autoplay features to keep the next episode or movie rolling without user interaction, and focusing on actual viewing habits (e.g., “skip intro” clicks, time spent on a show, completion rates) rather than ratings to keep users hooked.

Seaver’s perspective is insightful, not unrelentingly critical. The final chapter investigates how the design of recommendation systems reflects the metaphor of a “park”—a managed, curated space that users are guided through. Recommender systems are neither strictly benign nor malign, but they do entail a loss of user agency. We, the listening public, are not trapped animals so much as a managed flock. Seaver recognizes that recommendation systems open up new possibilities for exploration while also constraining user behavior by narrowing choices based on past preferences.

Why Do My Playlists Still Suck?

The book also answers the question that motivated me to read it: why do my playlists still suck? No one has a good model for why we like the music that we like, when we like it, or how that extrapolates to music we haven’t heard yet. And Spotify and other corporate interests have no real interest in solving that puzzle for us. The algorithms that shape our cultural lives now prioritize engagement, rely on past behavior, and reflect a grab bag of assumptions about user preferences that are often in conflict. There is very little upside to offering us fresh or risky suggestions when a loop of familiarity will keep us more reliably engaged.

A response to Lee and Grimmelmann

Tim Lee (@binarybits) and James Grimmelmann have written an insightful article on “Why The New York Times might win its copyright lawsuit against OpenAI” in Ars Technica and on Tim’s newsletter (https://www.understandingai.org/p/the-ai-community-needs-to-take-copyright).

Quite a few people emailed me asking for my thoughts, so here they are. This is a rough first take that began as a tweet before I realized it was too long.

Yes, we should take the NYT suit seriously

It’s hard to disagree with the bottom line that copyright poses a significant challenge to copy-reliant AI, just as it has done to previous generations of copy-reliant technologies (reverse engineering, plagiarism detection, search engine indexing, text data mining for statistical analysis of literature, text data mining for book search).

One important insight offered by Tim and James is that building a useful technology that is consistent with some people’s rough sense of fairness, like MP3.com, is no guarantee of fair use. People loved Napster and probably would have loved MP3.com, but these services were essentially jukeboxes competing with record companies’ own distribution models for the exact same content. We could add ReDigi to this list, too. Unlike the copy-reliant technologies listed above, Napster, MP3.com, and ReDigi fell foul of copyright law because they made expressive uses of other people’s expressive works.

Tim and James make another important point, that academic researchers and Silicon Valley types might have got the wrong idea about copyright. Certainly, prior to November 2022 you almost never saw any mention of copyright in papers announcing new breakthroughs in text data mining, machine learning, or generative AI. This is why I wrote “Copyright Safety for Generative AI” (Houston Law Review 2023).

Tim and James’ third insight is that some conduct might be fair use at a small noncommercial scale but not at a large commercial scale. This is right sometimes, but in fact a lot of fair use scales up quite nicely. 2 Live Crew sold millions of copies of their fair use parody of Roy Orbison’s “Oh, Pretty Woman,” and of course, the key non-expressive use precedents were all about different versions of text data mining at scale: iParadigms (commercial plagiarism detection), HathiTrust (text mining for statistical analysis of literature, including machine learning), and Google Books (commercial book search).

But how seriously?

I agree with Tim and James that the AI companies’ best fair use arguments will be some version of the non-expressive use argument I outlined in Copyright and Copy-Reliant Technology (2009) and several other papers since, such as The New Legal Landscape for Text Mining and Machine Learning (2019).

In a nutshell, that argument is that a technical process that creates some effectively invisible copies along the way but ultimately produces only uncopyrightable facts, abstractions, associations, and styles should be fair use because it does not interfere with the author’s right to communicate her original expression to the public.
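To make the intuition concrete, here is a toy, self-contained example of a non-expressive use (my own sketch, with hypothetical file paths): the script makes transient copies of texts in memory along the way, but the only thing it retains and reports is word-frequency statistics, facts about the works rather than their expression.

```python
# A minimal sketch of a non-expressive use: copies are made in memory along
# the way, but the only output is statistical metadata, not expression.
# The directory and file paths are hypothetical placeholders.
from collections import Counter
from pathlib import Path
import re

def word_frequencies(corpus_dir: str) -> Counter:
    counts = Counter()
    for path in Path(corpus_dir).glob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore")  # transient copy
        counts.update(re.findall(r"[a-z']+", text.lower()))       # keep only word counts
    return counts

if __name__ == "__main__":
    freqs = word_frequencies("corpus/")   # e.g., a directory of digitized novels
    print(freqs.most_common(20))          # facts about the works, not the works themselves
```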

I also agree that this argument begins to unravel if generative AI models are in fact memorizing and delivering the underlying original expression from the training data. I don’t think we know enough about the facts to say whether individual examples of memorization are an obscure bug or an endemic problem.

The NYT v. OpenAI litigation will shed some light on this but there is a lot of discovery still to come. My gut feeling is that the NYT’s superficially compelling examples of memorization are actually examples of GPT-4 working as an agent to retrieve information from the Internet. This is still a copyright problem, but it’s a very small, easily fixed, copyright problem, not an existential threat to text data mining research, machine learning, and generative AI.

If the GPT series models are really memorizing and regurgitating vast swaths of NYT content, that is a problem for OpenAI. If pervasive memorization is unavoidable in LLMs, that would be a problem for the entire generative AI industry, but I very much doubt the premise. Avoiding memorization (or reducing it to trivial levels) is a hard technical problem in LLMs, but not an impossible one.
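For what it’s worth, testing for this kind of verbatim memorization is conceptually simple, even if doing it at scale is not. A crude sketch of my own (not anyone’s actual evaluation pipeline): prompt the model with the opening of a known training document and measure how much of its continuation reproduces the rest of that document verbatim. The generate function is a hypothetical stand-in for whatever model API is being probed.

```python
# A crude verbatim-memorization probe: feed the model the start of a known
# training document and measure the longest verbatim overlap between the
# model's continuation and the remainder of the document.
# `generate` is a hypothetical stand-in for a real model API.
from difflib import SequenceMatcher

def longest_verbatim_overlap(continuation: str, source_remainder: str) -> int:
    """Length (in characters) of the longest block reproduced verbatim."""
    match = SequenceMatcher(None, continuation, source_remainder).find_longest_match(
        0, len(continuation), 0, len(source_remainder))
    return match.size

def memorization_probe(document: str, generate, prefix_chars: int = 500) -> int:
    prefix, remainder = document[:prefix_chars], document[prefix_chars:]
    continuation = generate(prefix)   # ask the model to continue the prefix
    return longest_verbatim_overlap(continuation, remainder)

# A long overlap (hundreds of characters) suggests regurgitation; a short one
# is consistent with the model having learned only facts, style, and structure.
```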

Avoiding memorization in image models is more difficult because of the “Snoopy Problem.” Tim and James call this the “Italian plumber problem,” but I named it first and I like Snoopy better.

The Snoopy Problem is that the more abstractly a copyrighted work is protected, the more likely it is that a generative AI model will “copy” it. Text-to-image models are prone to produce potentially infringing works when the same text descriptions are paired with relatively simple images that vary only slightly. 

Generative AI models are especially likely to generate images that would infringe on copyrightable characters because characters like Snoopy appear often enough in the training data that the models learn the consistent traits and attributes associated with those names. Deduplication won’t solve this problem because the output can still infringe without closely resembling any particular image from the training data. Some people think this is really a problem with copyright being too loose with characters and morphing into trademark law. Maybe, but I don’t see that changing.

How serious is the Snoopy Problem? Tim and James frame the problem as though they innocently requested a combination of [Nationality] + [Occupation] + “from a video game” and just happened to stumble upon repeated images of the world’s most famous Italian plumber, Mario from Mario Kart.

But of course, a random assortment of “Japanese software developers,” “German fashion designers,” “Australian novelists,” “Kenyan cyclists,” “Turkish archaeologists,” and a “New Zealand plumber” doesn’t reveal any such problem. The problem is specific to Mario because he dominates representations of Italian plumbers from video games in the training data.

The Snoopy Problem presents a genuine difficulty for video, image, and multimodal generative AI, but it’s far from an existential threat. Partly that is because the class of potential plaintiffs is significantly smaller: there are a lot fewer owners of visual copyrightable characters than there are plain old copyright owners. And partly it is because the problem can be addressed in training, by monitoring prompts, or by filtering outputs.
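The prompt-side and output-side guardrails I have in mind can start as simply as a denylist, although a real system would need image classifiers on top. The sketch below is illustrative only; the character names are examples, and nothing here describes any vendor’s actual safety stack.

```python
# A minimal sketch of prompt monitoring for character-driven outputs.
# The denylist and the refusal behavior are illustrative, not any vendor's system.
PROTECTED_CHARACTERS = {"snoopy", "mario", "mickey mouse", "pikachu"}

def prompt_requests_protected_character(prompt: str) -> bool:
    lowered = prompt.lower()
    return any(name in lowered for name in PROTECTED_CHARACTERS)

def guarded_generate(prompt: str, generate_image):
    if prompt_requests_protected_character(prompt):
        raise ValueError("Prompt appears to request a protected character.")
    return generate_image(prompt)

# A real system would also need an output-side classifier, since "Italian
# plumber from a video game" asks for Mario without ever naming him.
```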

Tim and James’s final point of concern is that the prospect of licensing markets for training data will undermine the case for fair use. Companies building AI models rely on the argument that they are simply scraping training data from the “open Internet,” and that argument becomes more persuasive when these companies are careful to avoid scraping content from sites where they are not welcome.

Respecting existing robots.txt signals and helping to develop more effective ones in the future will facilitate robust licensing markets for entities like the New York Times and the Associated Press.
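Honoring robots.txt is also cheap in engineering terms. Python’s standard library already parses the file; a crawler only has to ask before fetching. A minimal sketch, with example.com and the bot name standing in as placeholders:

```python
# A minimal robots.txt check using only the Python standard library.
# "ExampleTrainingBot" and the URLs are placeholders.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

url = "https://example.com/articles/some-story.html"
if rp.can_fetch("ExampleTrainingBot", url):
    print("Allowed to crawl:", url)
else:
    print("Publisher has opted out; skip", url)
```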

I don’t think that OpenAI will need to sign 100 million licensing deals before training its next model. Courts have already considered and rejected the circular argument that copyright owners must be given the right to charge for non-expressive uses to avoid the harm of not being able to charge for non-expressive uses. This specific argument was raised by the Authors Guild in HathiTrust and Google Books and squarely rejected in both.

Tim and James temper their note of caution with a note of realism: judges will be reluctant to shut down an innovative and useful service with tens of millions of users. We saw a similar dynamic when the US Supreme Court held that time shifting with videocassette recorders was fair use.

But there is another element of realism to add. If the US courts reject the idea that non-expressive uses should be fair use, most AI companies will simply move their scraping and training operations overseas to places like Japan, Israel, Singapore, and even the European Union. As long as the models don’t memorize the training data, they can then be hosted in the US without fear of copyright liability.

Tim and James are two of the smartest, most insightful people writing about copyright and AI at the moment. The AI community should take them seriously, and it should take copyright seriously, but it should not see Snoopy (or the Italian Plumber) as an existential threat.

PS: Updated to correct typos helpfully identified by ChatGPT.