Legal Scholars Roundtable on Artificial Intelligence 2026 (save the date)

Emory Law is proud to host the 5th annual Legal Scholars Roundtable on Artificial Intelligence on April 9-10, 2026, at Emory University in Atlanta, Georgia. The Legal Scholars Roundtable on Artificial Intelligence is a forum for the discussion of current legal scholarship on AI, covering a range of methodologies, topics, perspectives, and legal intersections.

We will issue a formal call for papers in January, with a submission deadline in February.

The AI Roundtable is convened by Prof. Matthew Sag (Emory Law) and Prof. Charlotte Tschider (Loyola Law Chicago).

Competition from AI music, Country Girls Make Do

As of October 2025, Suno and Udio are two text-to-music AI platforms that let users create full songs—including lyrics, vocals, and artwork—simply by entering text prompts. Some of this music is unappealing, even to its creators (protagonists?), but music scene insiders have assured me that some of the music emanating from these platforms is good enough to provoke a wistful, “I wish I had written that.”

AI music is also becoming more popular. A recent article in The Economist (of all places) recounts the viral success of “Country Girls Make Do,” a raunchy parody country song generated by artificial intelligence under the pseudonym Beats By AI. The song apparently features on TikTok where users prank the unsuspecting by playing it under false pretenses.

This is more than a one-off. Acts such as Aventhis and The Velvet Sundown, also AI-based, have attracted hundreds of thousands of monthly listeners on Spotify. These tools allow for rapid and prolific production: Beats By AI reportedly releases a new song every day. Nor is this simply a case of streaming fraud, where AI slop steals music plays from real artists by adopting confusing names; Spotify recently removed 75 million such tracks, citing “bad actors” flooding the platform with low-quality content. Some people, at least, like some AI music. The Economist reports a Luminate survey finding that one-third of Americans accept AI-written instrumentals, nearly 30% are fine with AI lyrics, and over a quarter do not mind AI vocals.

No music stands alone, but AI music arguably even less so

The appeal of these tracks lies partly in their echoes of established genres and tropes, with a dash of irony and experimentation thrown in. It remains to be seen whether this portends a consumer-driven revolution in content creation, where listeners generate their own entertainment rather than relying on record labels.

What does this mean for copyright law?

Although the Copyright Office would not regard the works of The Velvet Sundown or Beats By AI as copyrightable, Spotify seems happy to pay royalties for AI music, provided the works themselves (as opposed to the copying that fed the AI process that created them) don’t infringe on other artists’ songs.

AI music may destabilize entrenched business models at the fringes, but it might also foster broader participation and new forms of cultural expression. Does AI pose the same threat to the economic and cultural standing of musicians as it does to stock photography and digital art? Or will AI-generated music remain a hybrid layer within popular culture that feeds off and refers back to mainstream music without replacing the central role of human creation? If so, perhaps at least some country girls will make do.

Skater Beagle and the Puzzle of AI Creativity

Generative AI poses a puzzle for copyright lawyers, and many others besides. How can a soulless mechanical process lead to the creation of new expression, seemingly out of nothing, or if not nothing, very little?

This essay will help you understand where the apparent creativity in generative AI outputs comes from, why many AI works are not copyrightable, and why the outputs of generative AI are mostly very different to the works those AIs were trained on.

Who is the author of Skater Beagle?

The image below was created by one LLM (Google Gemini) using a long prompt written by another LLM (Anthropic’s Claude) following the instruction “draft a prompt for an arresting image of a beagle on skateboard.”

AI generated “arresting image of a beagle on skateboard.” From a low angle, a joyful beagle with ears flying expertly rides a skateboard down a steep urban hill during a cinematic, “golden hour” sunset. A city skyline is backlit by the setting sun.

If I took this photo in real life, I would be recognized as the author. Likewise, if I painted it as a picture. But because the image was created by a process that involved very little direct human contribution, it is uncopyrightable. For many people, this seems odd. How can an image that looks creative not be recognized as copyrightable, just because it was created with AI rather than an iPhone camera or a set of water-based paints? After all, artists use tools to make art all the time.

No copyright for the AI

The first question to address is whether Google’s image generation model is the author of Skater Beagle. The answer is no, for many reasons, but let’s focus on the copyright issues, because they are the most interesting.

The AI can’t get copyright protection because the AI itself is not creative in any of the ways we generally understand that term (at least if you are a copyright lawyer): it lacks any desire or intention to express. In Burrow-Giles Lithographic Co. v. Sarony (1884), the U.S. Supreme Court recognized that a photograph could be copyrighted, but only because the photographer’s creative choices made the image an “original intellectual conception[] of the author” rather than a mere mechanical capture. LLMs are impressive, but they don’t have any intentions separate from the math that makes them predict one thing and not another. LLMs have no original intellectual conception they are trying to express.

No copyright for the simple prompt engineer

If not the AI, then maybe the person who writes the prompts should be credited with the resulting expression? After all, isn’t choosing the right words in the prompt a creative act?

That doesn’t work either. Sure, choosing the right words in the prompt might be creative in some sense, but copyright law doesn’t protect creativity in the sense of “hey, that’s a good idea”; it protects creativity that manifests in original expression. This idea-expression distinction is one of the foundations of copyright law. Copyright attaches to the final expression, not the upstream idea or instruction that triggered it. Even if you think my idea to get one LLM to write a prompt for another LLM “for an arresting image of a beagle on skateboard” is creative, it’s really just a simple idea, and nothing copyrightable.

Surely, it must be one or the other?

But still, many would say, if Skater Beagle exhibits all the tell-tale signs of subjective creative authorship, that creativity must come from somewhere. So it’s either the AI or the person who wrote the prompt?

This line of thinking is half right. The generative AI is doing something important: it’s creating something from nothing, or at least from very little, but it’s not “creativity” in the relevant sense. If you want to think of all the details of the Skater Beagle picture as expression, that expression does not magically appear from the ether; it comes from the latent space implied by the training data, as processed by the model during training. In some ways it’s fair to say it comes from the collective efforts of all of the authors of all of the works in the training data. But not in the sense of a simple remix or cut-and-paste job.

Not from nothing, but not a remix

Generative AI systems come in different kinds: GANs, diffusion models, multimodal large language models, and more. The common feature of all these systems is that they are trained on a large volume of prior works and, through a mathematical process, are able to produce new works, often with very limited additional human input. But that doesn’t mean Skater Beagle belongs to the millions (tens of millions? hundreds of millions?) of authors of the works in the training data. This beagle is not a simple remix or collage. Although generative AI models are data-dependent, they don’t just remix the training data; they produce genuinely new outputs.

AI Creativity comes from latent space

Generative AI models learn an abstract model of the training data, a model that is in many ways more than the sum of its parts. When you prompt a generative AI model, you are not querying a database; you are navigating a latent space implied by the training data.

What do I mean by “navigating a latent space implied by the training data”? Let’s start with a simple analogy. When you fit a linear regression to a handful of data points, you generate a line of best fit implied by the data, as seen in the figure below. Think of the dots as the training data and the line as the model implied by the training data.

Illustration of fitting a line to scattered data. Two side-by-side scatter plots on a beige background. Left: Five orange data points scattered in an upward trend without a line. Right: The same points with a straight diagonal line drawn from bottom left to top right, representing a best-fit line. Both axes are labeled X and Y, ranging from 0 to 10.

The line illustrated above is simple; it is in fact an equation that you can use to answer the question, “if y is 6, what is x?” The point (6, 6) is not in the data, but it is implied by the data and the model we used to fit the data. When you plug y = 6 into the model, you are navigating to a point implied by the data that tells you x = 6, as seen in the figure below. That is what I mean by navigating the latent space.

Illustration of navigating to point implied by linear regression. A scatter plot with five orange data points, a green dashed diagonal line representing a trend, and red dashed lines intersecting at the point (6,6). Axes are labeled X and Y, ranging from 0 to 10, on a beige background.

But of course, if we used a different model, the data would imply a slightly different latent space, as illustrated in the figure below. Here the model is not linear but quadratic, and just changing that starting assumption gives us a different line of best fit.

Illustration of fitting a different model to the data. A scatter plot with five orange data points on a textured blue-and-beige background. A green dashed curve rises steeply before leveling off, intersecting red dashed lines at the point (4,6). Axes are labeled X and Y, ranging from 0 to 10.
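For readers who prefer code, here is a minimal sketch of the same analogy using made-up data points (so the exact numbers differ from the figures above): fit a linear model and a quadratic model to the same five dots, then “navigate” each model by asking what x it implies when y = 6.

```python
# A sketch of the regression analogy with made-up data points;
# the exact numbers differ from the figures above.
import numpy as np

x = np.array([1.0, 3.0, 4.5, 7.0, 9.0])   # the "training data" dots
y = np.array([1.2, 4.2, 5.6, 7.0, 7.8])

linear = np.polyfit(x, y, deg=1)       # fit y = a*x + b
quadratic = np.polyfit(x, y, deg=2)    # fit y = a*x^2 + b*x + c

# Navigate the linear model: solve a*x + b = 6 for x.
a, b = linear
x_linear = (6 - b) / a

# Navigate the quadratic model: solve a*x^2 + b*x + (c - 6) = 0
# and keep the real root that falls within the plot's range.
roots = np.roots(quadratic - np.array([0.0, 0.0, 6.0]))
x_quad = [r.real for r in roots if abs(r.imag) < 1e-9 and 0 <= r.real <= 10]

print(f"linear model implies x ≈ {x_linear:.1f} when y = 6")      # ~6.0
print(f"quadratic model implies x ≈ {x_quad[0]:.1f} when y = 6")  # ~5.0
```

Same dots, different starting assumption, different implied point: that is the sense in which the latent space is implied jointly by the data and the model.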

The difference between the straight line and the curved line here is analogous to the difference between different LLMs. Obviously, generative AI models are much more complicated than a two-dimensional regression model. Generative AI models have thousands of dimensions, and so they construct a much richer latent space, but the analogy holds. Any number of dimensions above three is hard to conceptualize; don’t bother trying to imagine thousands of dimensions, your brain might melt.

Does Latent Space Solve the Creativity Puzzle?

Understanding latent space helps resolve the creativity puzzle. The image of Skater Beagle looks original because the model has generated a point in a vast space of possible images implied by its training data — not because a human author made free and creative choices about the details. The model navigates to a statistically plausible combination of features, but no person decides where the beagle’s ears should fly, how steep the hill should be, or what the sunset should look like. Understanding latent space helps explain why the output of a model can feel creative but still lacks the human authorship copyright law requires.

But wait, …

But in practice, it seems like almost any photo you send to the Copyright Office will be deemed creative enough to meet the requirements for registration. If I can get copyright for just pointing my iPhone at a beagle on skateboard and pressing a button, why can’t I get copyright in an image of a beagle on skateboard that I created using generative AI?

This seems inconsistent at first blush, but only because the question overlooks the difference between the “thin” copyright that attaches to photos based in reality and the “thick” copyright that typically attaches to illustrations drawn from imagination.

Small jumps versus big jumps

When you take a photo, you are making a copyrightable selection and arrangement from reality. You get no rights in the underlying reality, just in a specific photographic representation of it. In most copyrightable photos there is only a small jump between idea and expression, and so the resulting copyright is limited to that jump. Taking a photo does not give you exclusive rights over the underlying ideas, subjects, locations, etc.

There are two critical differences between the typical iPhone snap and an image generated with AI.

The first difference is that there is a much more significant jump between idea and expression in the transition from text prompt to final image, compared to the jump from a real-life scene to a photo capturing that scene. The second difference is that in photography, a human still makes some minimal creative decisions (framing, timing, composition) that manifest in the look of the resulting image. The human makes the jump, even if it’s only a small jump. In AI generation, the algorithm fills in the details that transform the prompt into a specific visual expression. The AI makes the jump between your idea for a photo and the details of the photo itself.

There is no copyright in the Skater Beagle image Gemini made for me. The work of bridging the gap from abstract concept to concrete image was done entirely by algorithms trained on trillions of words and millions of photos. The details that we might think of as expression in the image didn’t come from nothing, but they didn’t come directly from any particular photo featuring low-angle action shots, beagles, dogs with ears flying, skateboard riders, steep hills, urban settings, “golden hour” sunsets, city skylines, etc. The details that we might think of as expression don’t reflect the free and creative choices of any human mind. They are details implied by a model trained on millions of photos, but those details don’t come from those photos either. They come from the universe of possibilities those photos imply; they come from latent space.

Skater Beagle is an extreme example

Generative AI lets us navigate a latent space implied by works too numerous to count so that we can create genuinely new digital artifacts. I began this essay with the promise that understanding this would shed light on how copyright applies to AI-generated works, but Skater Beagle is an extreme example drawn from one end of the continuum. Understanding why Skater Beagle is not a copy of beagles in the training data, but is also not my creative expression tells us that the Copyright Office is right to deny copyright to some generative AI creations. But it does not tell us at what point a user would cross the line from commissioning editor to guiding hand or creative mastermind. It’s hard to imagine crossing that line with a single text prompt, but it’s easy to see how you would leap over it in an iterative process as in A Single Piece of American Cheese. Iterative interactive use of generative AI will often be an act of authorship, so long as it is more than just choosing a winner in a beauty pageant of AI creations.

[This essay was adapted from Matthew Sag, Copyright Law in the Age of AI (2025)]

Drafting Law School AI Policies

There are a lot of poorly thought through AI policies out there

Law schools are realizing that they need student conduct policies that address generative AI. But after reviewing many of their policies (and some undergrad policies as well), I feel they often miss the mark. Here are five problems that crop up again and again.

First, many conflate using AI with plagiarism.

Plagiarism, properly defined, is the unacknowledged appropriation of another’s words or ideas. Violations of prohibitions on AI use, by contrast, are often better conceptualized as breaches of disclosure obligations, misrepresentation, or general academic integrity violations. While AI misuse can sometimes constitute plagiarism, it does not necessarily do so. Rules that lump these activities together are too blunt. There are sometimes sound reasons to prohibit both, but they should not be conflated. Tarring a wide set of AI uses with the brush of plagiarism is unlikely to win acceptance from students, who will reasonably see such policies as overreach.

Second, definitions are a muddle.

Many policies leave key operative terms—such as “compose,” “proofread,” “substantially edit,” or “small part”—undefined. Absent bright-line rules or illustrative examples, students and faculty are left to infer the policy’s scope, producing inconsistent enforcement and potential due process concerns.

Sweeping prohibitions on “AI use” may unintentionally extend to widely accepted tools, including spellcheckers, grammar correction software, and dictation systems. Such breadth is rarely the drafters’ intent and risks chilling legitimate academic practice. Blanket prohibitions, especially without accommodation mechanisms, may disproportionately disadvantage non-native speakers and students with disabilities who rely on technological assistance, even as comparable human support (e.g., writing centers) remains permissible. If that kind of restriction is intended, it should be express.

Third, some schools are leaning on unreliable technology to police AI use.

Recommendations to use AI detectors or plagiarism software to identify AI-generated work are problematic given their poor reliability. Without cautionary limits, such tools risk false positives and undermine due process.

It is important to understand three key limitations here:

(1) Anti-plagiarism software does not detect novel generative AI outputs;

(2) AI detectors are not reliable in the way anti-plagiarism software is reliable;

(3) AI detectors generate a large percentage of false positives. They are especially prone to do so in cases involving neurodivergent authorship or use of standard proofreading programs such as Grammarly.

Honestly, you would be better off tossing a coin; at least then you would have a realistic assessment of how far you should trust the answer.

Fourth, few schools offer clear ways for students to disclose their use of AI.

Standardized disclosure mechanisms would enhance transparency and promote consistent expectations across courses and instructors.

Fifth, the policies themselves are often inconsistent.

One policy I read takes a categorical approach to prohibiting AI use but then, in a later part of the document, suggests allowing AI “for parts of assignments” and asks instructors to clarify expectations. What?

A template for a better Law School AI policy

So, what should your AI policy look like? It should be clear, specific, comprehensive, and custom-tailored for each course you teach. You can do that with the template I suggest below, just by changing the “mays” to “may nots.”

I’m sure this is not perfect, but I think it’s a useful place to begin. Your use of this template is not plagiarism, I am posting it here because I think you should copy it.

Generic Law School Syllabus AI Use Policy

(1) The use of generative AI in this course is restricted but not entirely prohibited. The restrictions serve multiple, sometimes overlapping, purposes: preserving pedagogical integrity, preserving the integrity of assessment, and helping you avoid plagiarism, misrepresentation, and shoddy work. These restrictions are tailored to this course, so you need to review them carefully.

(2) Key Prohibitions:

(a) In this course you are prohibited from presenting text generated by generative AI as your own in any assessable work product. This means that you may not copy-paste more than 8 consecutive words from any source without specific attribution (superficial changes designed to evade the substance of this rule will be disregarded); you may not present specific insights and ideas from external sources without specific attribution to an appropriate source. In addition, you may not include factual information or citations from generative AI that you have not verified. Work containing obvious AI “hallucinations” of citations or quotations will merit a failing grade.

(b) In addition, you may not use generative AI to develop insights and strategies for specific assigned class activities or assessable work product without specific authorization from your professor. For example, in that context you may not use generative AI:

  • to review legal documents (real and simulated) for potential issues where learning to spot relevant issues is part of the skillset being taught;
  • to suggest negotiation strategies for a simulated deal where learning to develop negotiation strategies is part of the skillset being taught;
  • to practice role-playing as opposing counsel for such a simulated deal or negotiation; and
  • to identify ethical issues in a fact pattern where identifying such issues is part of the skillset being taught.

(c) You may not use generative AI to assist with answering questions presented in class in real time: if you are on-call that does not mean ChatGPT is on-call.

(3) You may use generative AI for research and source discovery provided you do so responsibly and in compliance with (2) above. Examples of acceptable uses include asking a generative AI tool for caselaw, statutes, and regulations relating to a particular topic, or asking it to review a draft of your work product and suggest additional sources or authorities.

(4) You may use generative AI to improve your work product, provided you do so responsibly and in compliance with (2) above. For example, you may use generative AI for brainstorming/ideation for essay topics, or to suggest a more logical structure for a paper; you may use generative AI to identify weaknesses in your argument, counter-arguments you may have overlooked, and otherwise to critically evaluate your written work. Likewise, you may use generative AI to improve your understanding of complex legal doctrines, including by asking for different types of explanations thereof, but again, provided you do so responsibly and in compliance with (2) above.

(5) You may use generative AI for detailed assistance with drafting, editing, and style, provided you do so responsibly and in compliance with (2) above and with an appropriate disclosure. For example, you may draft a passage and then ask generative AI to rewrite it in a particular style (law review, client email, opening argument), or to maintain a particular style but reduce the word count; you may draft a passage in a language other than English and then ask generative AI for an English translation; you may use generative AI to suggest more effective transitions and topic sentences, introductions and conclusions; you may use generative AI for suggestions as to how to more effectively integrate quotations into your main text.

The disclosure for the editorial assistance described above should be in the following form: “Approximately [10-25 | 25-50]% of this [essay] was redrafted with the assistance of generative AI (list all); however, all of the ideas and analysis are either my own or are appropriately cited.”

(6) You may use generative AI to generate images and charts in assessable work product with specific disclosure, such as a visible note in the caption or figure description: “Chart produced with [name of tool] based on [general description of prompt or underlying data]”.

(7) You may use spell check and dictation software without any disclosure.

(8) You may use generative AI to support your learning and comprehension of course materials, provided you do so responsibly and in compliance with (2) above. For example, you may use generative AI as a tutor or a study partner, or to create flashcards, hypotheticals, explanations, quiz questions, etc.; you may use generative AI to summarize and outline course materials; you may use generative AI to suggest answers to non-assessable problem questions, or to evaluate your answers to non-assessable problem questions.

(9) Permitted uses are not necessarily recommended. Direct engagement with primary sources and your own analysis will yield the deepest learning and the most reliable work product. AI may serve as a useful complement—helping to clarify, organize, or refine ideas—but it should be employed thoughtfully and never as a substitute for the skills this course is designed to develop.

For term papers, you need a bit more

I suggest the following additional instructions.

Write in your own voice:

To avoid the impression that your work was written by a chatbot or is just a superficial rephrasing of a few original sources, you must ensure that it reflects your own original analysis, voice, and understanding. Submissions that exhibit unusually advanced legal knowledge, overly polished or professional tone, highly structured policy-style formatting, or extensive use of comparative law without appropriate scaffolding may raise concerns about authorship. Likewise, papers that rely heavily on secondary authority without clear personal engagement can suggest inappropriate use of generative AI or outside assistance.

A good way to demonstrate the originality of your contribution is to explore a narrowly defined, non-obvious topic, rather than a broad or generalized theme arising from the course. A greater level of specificity usually indicates that a student has chosen a unique angle shaped by personal interest or experience.

Research, sources, and citation practices:

Good research and appropriate citation practices go hand in hand.

For most law research papers, you should prioritize primary sources and academic sources. However, for many topics in this course, you will be discussing recent trends and developments, so it will often be appropriate to cite journalistic reports and even blog posts as well. Here are some guidelines for citing propositions relating to Law, Opinion, Key arguments, Facts as summarized by someone else, and Specific facts.

(1) Law: If you are making an assertion about what the law is, you should generally cite case law, statutes, or academic treatises.

(2) Opinion: If you are discussing academic commentary or opinion, cite the relevant source directly.

(3) Key arguments: If you are making an academic argument that already exists in the literature, you should identify who made that argument first. What if you can’t say for sure? If the argument is central to your thesis, put in the effort to be sure! If it is not, sometimes it will suffice to note others who have made the same point in a form such as “For arguments that …, see, for example, …”

(4) Facts as summarized by someone else: If you are referencing facts that have been summarized in academic commentary, you have some discretion as to whether to cite the academic source or go directly to primary sources for the underlying facts. Government reports and think tank publications are also useful for consolidated discussions of facts, as well as insightful commentary and analysis. In general, citing primary sources is preferable unless you are relying on an author’s summary or synthesis of multiple sources.

(5) Specific facts: Background information often comes from blogs, news articles, magazine articles, or even Wikipedia. That is fine. When using these as secondary sources, ask yourself: Is this the most direct source? Is this a reliable source? Whenever possible, prioritize more direct, reliable and authoritative sources to ensure accuracy and credibility. For example, do not cite to a blog post that summarizes an article in the NY Times, if you can read the underlying article and cite it directly.

Caution: AI summaries and dialogs with chatbots are not a reliable source of any external fact. Obviously, you can cite a ChatGPT session for a proposition like, “ChatGPT (version 4o) often recommends Kyoto when asked to suggest a random city.” But you can’t use ChatGPT as authority for the proposition that Kyoto was Japan’s capital from 794 to 1868.

Concluding thoughts

“Law schools are uniquely positioned to model thoughtful, principled engagement with new technology. A well-crafted AI policy can uphold academic integrity without stifling innovation or disadvantaging students. The goal is not to ban the future, but to teach students how to use it responsibly.”

Or that’s what ChatGPT said when I asked for suggestions on how to conclude this post. I use LLMs in lots of different ways, and this post benefited from long discussions with ChatGPT and with my Emory Law colleagues, but this post does not reflect the views of Emory Law, or ChatGPT for that matter.

Piracy, Proxies, and Performance: Rethinking Books3’s Reported Gains

A new NBER working paper by Stella Jia and Abhishek Nagaraj makes some stunning claims about the effects of pirated book corpora on large-language-model (LLM) performance. In Cloze Encounters: The Impact of Pirated Data Access on LLM Performance (May 19, 2025) (working paper) (https://www.nber.org/papers/w33598), the authors contend that access to Books3—a pirated collection of full-text books—raises measured performance by roughly 21–23 percent in some LLMs.

This astonishing finding is an artifact of the paper’s methodology and the very narrow definition of “performance” that it adopts; as such, it should not be taken at face value.

Cloze Encounters’ methodology and claims

Jia and Nagaraj assemble a 12,916-book evaluation set and apply a “name cloze” task: mask a named entity in a short passage and ask the model to supply it.

For instance, given a sentence like “Because you’re wrong. I don’t care what he thinks. [MASK] pulled his feet up onto the branch” from The Lightning Thief, the model should identify “Grover” as the missing name.
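To make the task concrete, here is a minimal sketch of how a name-cloze probe can be scored. This is my illustration, not the authors’ code; the query_model function is a hypothetical stand-in for whatever LLM API is being tested, and the prompt wording is my own paraphrase of the task.

```python
# Hypothetical name-cloze scorer; query_model is a stand-in for an
# actual LLM call, not part of any released code from the paper.
def name_cloze_score(examples, query_model):
    """examples: list of (masked_passage, expected_name) pairs.
    Returns the fraction of binary hits (case-insensitive exact match)."""
    hits = 0
    for masked_passage, expected_name in examples:
        prompt = (
            "Fill in the [MASK] with the proper name it replaces. "
            "Reply with the name only.\n\n" + masked_passage
        )
        answer = query_model(prompt).strip()
        hits += int(answer.lower() == expected_name.lower())  # binary hit
    return hits / len(examples)

# Example with the passage quoted above:
# name_cloze_score(
#     [("Because you're wrong. I don't care what he thinks. "
#       "[MASK] pulled his feet up onto the branch.", "Grover")],
#     query_model=my_llm)
```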

The main results of Cloze Encounters are estimates of “performance” showing large, statistically significant gains for GPT-class models (about 21–23 percent relative to baseline), smaller gains for Claude/Gemini/Llama-70B (about 7–9 percent), and no detectable effect for Llama-8B. The effects are stronger for less-popular book titles, consistent with fewer substitutes (Internet reviews or summaries) in other training data.

This is all well and good, but the way the authors explicitly link these findings to current controversies relating to copyright policy, licensing markets, and training-data attribution is troubling.

Cloze Encounters is not measuring “performance” in any way that people should care about

The first thing that raised my suspicion about this paper is that I had already seen this exact methodology used as a clever way to illustrate memorization and to show how some books are memorized more than others. See Kent Chang et al., “Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4” (https://arxiv.org/abs/2305.00118). Cloze Encounters scales and repurposes that approach for a causal analysis of how access to pirated books in the Books3 dataset led to improved LLM “performance.” But it doesn’t make sense to me that what counted as a memorization probe in one paper can simply be relabeled as a general “performance” metric in another.

Why is memorization so different to performance?

This is a question of construct validity. The method in Cloze Encounters tests recall of a masked name from a short passage, scored as a binary hit. This kind of lexical recall is a narrow slice of linguistic ability that is highly sensitive to direct exposure to the source text. It’s a proxy for memorization rather than the broad competencies that make LLMs interesting and useful.

The capabilities that matter in practice—long-context understanding, abstraction and synthesis, factual grounding outside narrative domains, reliable instruction following—are largely orthogonal to masked-name recall. Calling the cloze score “LLM performance” is a massive over-generalization from a task that measures a thin, exposure-sensitive facet of behavior. As an evaluation device, name-cloze is sharp for detecting whether models learned from—or memorized—a specific source; it is blunt for assessing overall performance. There is no reason to think that evidence of snippets of memorization from particular works in the Books3 dataset has any necessary relationship with being a better translator, drafter, summarizer, brainstorming partner, etc.

This paper is begging to be misread and misapplied in policy and legal debates

I wouldn’t go so far as to say that success on the cloze score tells us “literally nothing” about LLM performance: “almost nothing” is a fairer estimate. To see why, think about the process of pre-training. Pre-training optimizes next-token prediction over trillions of tokens; the cloze outcome is, by construction, basically the same as that objective. So it is not surprising that it is unusually sensitive to direct exposure to given pieces of training data. There probably is a broad correlation between next-token accuracy and perceived usefulness (we certainly saw this in the transition from GPT-3.5 to GPT-4), but the relationship is not lockstep, and it’s easy to imagine a model that excels at memorization alone but generalizes poorly.

The authors nod to these limitations at various points in the manuscript, but they still frame their metric as a measure of “LLM performance” in a way that is just begging to be misread and misapplied in policy and legal debates. Abstract-level claims travel further than caveats; many readers will see the former and miss the latter.

Nor does the identification strategy employed in the paper do anything to rescue the construct from its limits. The instrumental variable—publication-year share in Books3—may isolate an exogenous shock to exposure. But even granting the exclusion restriction, the estimate remains the effect of Books3 on a name-cloze score. It tells us little about summarization, reasoning, instruction following, safety behavior, or cross-domain generalization.

Bottom line

Cloze Encounters usefully documents that access to Books3 leaves a measurable imprint on exposure-sensitive recall. But its central metric does not justify the broad claims it makes about “LLM performance.” The study measures whether models can fill in masked strings drawn from particular books; it does not show that such access improves the flexible, user-tailored generation that makes these systems valuable.

Emory Law AI Roundtable 2025

The Fourth Annual Legal Scholars Roundtable on Artificial Intelligence 2025 will be held next week at Emory Law and I am very excited by the amazing line-up of speakers and commentators we have.

AI Roundtable Papers

Neel Guha, Information in AI Regulation
Michael Goodyear, Dignity and Deepfakes
Kat Geddes, AI’s Attribution Problem
Deven Desai & Mark Riedl, Responsible AI Agents
Nikola Datzov, AI Jurisprudence: Toward Automated Justice
Yiyang Mei & Matthew Sag, The Illusion of Rights-Based AI Regulation
David Rubenstein, Federalism & Algorithms
Oren Bracha, Generative AI’s Two Information Goods

Some of these papers are available in draft on SSRN.com or arXiv.org; others are still in development.

AI Roundtable Keynote

We also have a special keynote from Prof. Barton Beebe, presenting his new book manuscript “Technological Change and the Beautiful Deaths of Law: A Recurring History.” The Roundtable is invitation only; Emory faculty and students who are interested in attending should contact me for details.

History of the Legal Scholars Roundtable on Artificial Intelligence

The Roundtable was founded by Professor Matthew Sag and Professor Charlotte Tschider in March 2022 as an online event (due to the COVID-19 pandemic) and has been conducted as an annual event at Emory Law School ever since. The Roundtable is supported by Emory University School of Law and by Emory’s AI.Humanity initiative.

The following were recognized as the Roundtable’s Best Paper in their respective years: Rebecca Crootof, Margot Kaminski, & Nicholson Price, Humans in the Loop, 76 Vanderbilt Law Review 429 (2023) (Best paper of 2022); Matthew T. Wansley, Regulating Driving Automation Safety, 73 Emory Law Journal 505 (2024) (Best paper of 2023); Mark Bartholomew, A Right to Be Left Dead, 112 California Law Review 1591 (2024) (Best paper of 2024).

Copyright and the AI Action Plan

On March 14, 2025, I submitted my comments to the Office of Science and Technology Policy in relation to the “AI Action Plan”. For context, the Office of Science and Technology Policy requested input on the Development of an Artificial Intelligence (AI) Action Plan to define the priority policy actions needed to sustain and enhance America’s AI dominance, and to ensure that unnecessarily burdensome requirements do not hamper private sector AI innovation. See Exec. Order No. 14,179, 90 Fed. Reg. 8741 (Jan. 31, 2025) (Executive Order titled “Removing Barriers to American Leadership in Artificial Intelligence,” signed by President Trump).

What follows is a lightly edited version of those comments (mostly removing footnotes, but also making a couple of minor improvements).

AI Action Plan, Submission to the Office of Science and Technology Policy

I am the Jonas Robitscher Professor of Law in Artificial Intelligence, Machine Learning, and Data Science, Emory University. I appreciate the opportunity to contribute to OSTP’s call for policy ideas aimed at enhancing America’s global leadership in Artificial Intelligence (AI).

My primary points in this submission are, first, that if, contrary to precedent and sound policy, American courts rule that training AI models on copyrighted works is not permissible as fair use, the U.S. government must be ready to act; and second, that to maintain U.S. leadership in artificial intelligence, the AI Action Plan should explicitly affirm the importance of broad copyright exceptions—particularly fair use for nonexpressive activities like AI model training.

How copyright law in various countries deals with AI training

In The Globalization of Copyright Exceptions for AI Training, my co-author Professor Peter Yu and I examine how copyright frameworks across the world have addressed the apparent tension between copyright law and copy-reliant technologies such as computational data analysis in the form of text data mining (TDM), machine learning, and AI.

Our research reveals that, although the world has yet to achieve a true consensus on copyright and AI training, an international equilibrium has emerged. In this equilibrium, countries recognize that TDM, machine learning and AI training can be socially valuable and do not inherently prejudice the copyright holders’ legitimate interests. Policymakers in the European Union, Japan, Israel, and Singapore agree in general terms that such uses should therefore be allowed without express authorization in some, but not necessarily all, circumstances.

Major industrialized economies have found different ways to this equilibrium position. Some, like the U.S. and Israel, have done so through the fair use doctrine. Others, like Japan, Singapore, and the European Union, have crafted express copyright exceptions for TDM and computational data analysis. Other nations where the rule of law is not so clearly established are energetically pursuing AI development with state backing, without updated copyright laws to facilitate AI training. There is little doubt that if the Chinese Communist Party deems copyright law an impediment to its AI ambitions, the law in China will change almost instantaneously, and very likely retrospectively.

U.S. litigation could unsettle global AI copyright norms

American courts have historically recognized fair use protections for technologies relying on nonexpressive copying, such as reverse engineering, plagiarism detection software, digital library searches, and computational humanities research spanning millions of scanned texts. Extending this principle logically, training AI models—which similarly involves copying without directly reproducing expressive content—would usually qualify as fair use. (For citations and discussion of the relevant literature, see Matthew Sag, Fairness and Fair Use in Generative AI, 92 Fordham Law Review 1887 (2024))

Yet plaintiffs in more than 30 ongoing lawsuits across U.S. district courts contest this view. Collectively, they seek injunctions barring AI training without explicit consent, billions in monetary compensation, and even destruction of existing AI models. Although, in my estimation and that of many copyright experts, the plaintiffs should not prevail on sweeping arguments that would bring AI training in the U.S. to a halt, they might.

A bad court decision may drive AI innovation offshore

Adverse outcomes in U.S. litigation will not stop the development of AI, they will simply push AI innovation overseas. The reason is straightforward: AI models, once trained, are easily portable. Companies seeking to avoid restrictive copyright rules could simply move their training operations to innovation-friendly jurisdictions like Singapore, Israel, or Japan, and then serve U.S. customers remotely, entirely free of domestic copyright concerns.

How is this possible? AI developers need fair use for all the copying that takes place to make training possible, but they don’t need fair use once the models have been trained because, by and large, trained AI models do not replicate the expressive details of their training datasets; instead, they distill general patterns, abstractions, and insights from that training data.

Thus, in the eyes of copyright law, these models are neither copies nor derivative works based on the training data. If U.S. copyright law turns against our AI industry, companies in the U.S. will still be able to use models trained in AI-friendly jurisdictions, either by setting up a data pipeline so that the model stays overseas or by hosting a model in the United States once it has been trained. Consequently, imposing overly restrictive copyright interpretations domestically will do very little to turn back the tide on AI, but it risks surrendering America’s AI advantage to more AI-friendly jurisdictions.

Licensing deals are no substitute for fair use

While licensing agreements between AI developers and media companies are becoming more common, they cannot solve copyright concerns surrounding AI training. The sheer scale of AI training data makes the licensing approach impractical at the cutting edge. For instance, Meta’s recent Llama 3 model consumed over 15 trillion (15,000,000,000,000) tokens drawn from publicly accessible sources. To put this into perspective, assuming that the New York Times print edition is roughly fifty pages per day, each page has 4000 words (this is probably way over!), and there are 1.3 tokens per word, the newspaper would generate roughly 1.82 million tokens per week. At that rate, it would take about 158,500 years for the New York Times to generate 15 trillion tokens.
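For the skeptical, here is the back-of-envelope arithmetic in code form; every input is the rough assumption stated above, not a measured figure.

```python
# Back-of-envelope check of the scale argument above. All inputs are
# the essay's own rough assumptions.
pages_per_day = 50                   # NYT print edition, roughly
words_per_page = 4_000               # "probably way over"
tokens_per_word = 1.3
llama3_tokens = 15_000_000_000_000   # 15 trillion training tokens

tokens_per_week = pages_per_day * words_per_page * tokens_per_word * 7
years_needed = llama3_tokens / tokens_per_week / 52

print(f"{tokens_per_week:,.0f} tokens per week")   # ~1.82 million
print(f"{years_needed:,.0f} years")                # ~158,500
```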

Licensing may be possible for some AI training, but licensing at the scale required to train frontier LLMs is not a realistic foundation for American industrial policy; it is a fantasy.

Nevertheless, existing deals with major media companies illustrate something important: AI developers are willing to pay for efficient access to high-quality datasets otherwise locked behind paywalls or machine-readable restrictions. Such agreements suggest that licensing has a niche but crucial role—not as a substitute for broad exceptions like fair use, but rather as a complementary source of premium training data. This dynamic becomes particularly valuable in AI-powered search scenarios, where language models frequently generate outputs closely resembling original copyrighted content, pushing the boundaries between acceptable use and potential infringement.

The U.S. Government must be ready to act

If, contrary to precedent and sound policy in my view, American courts rule that training AI models on copyrighted works is not permissible as fair use, the U.S. government should act. Specifically, the government would need to introduce legislation to reinstate the principle that training AI models typically falls under fair use or create a specific statutory exemption. I see no way this could be done through agency rulemaking or executive action. Legislative intervention would be necessary to safeguard America’s competitive edge against innovation-friendly jurisdictions like Japan, Singapore, Israel, and, in this context, even the European Union.

To maintain U.S. leadership in artificial intelligence, the AI Action Plan should explicitly affirm the importance of broad copyright exceptions—particularly fair use for nonexpressive activities like AI model training.

Thomson Reuters v. ROSS Intelligence (Summary Judgment)

In a closely watched decision revising a previous summary judgment, Judge Stephanos Bibas, a Third Circuit judge sitting by designation, sided largely with Thomson Reuters in its copyright dispute against ROSS Intelligence. The ruling granted partial summary judgment on direct copyright infringement claims while dismissing ROSS’s argument that its use of Thomson Reuters’ content qualified as fair use.

With Ross Intelligence now bankrupt and the technology at issue a decidedly niche application, attention is shifting to the broader implications for AI training and the use of copyrighted materials—particularly in the realm of generative AI. Earlier, Judge Bibas had refused to grant summary judgment on fair use, insisting the matter be put before a jury. However, upon further reflection, he reversed course, ultimately rejecting the defendant’s fair use defense outright.

Background

Thomson Reuters, the owner of Westlaw, accused the AI-driven legal research firm ROSS of copyright infringement, alleging that it had improperly used legal summaries—so-called Bulk Memos—derived from Westlaw’s editorial materials, particularly its headnotes, to train its technology. Thomson Reuters had refused to license its content to ROSS, a rival developing an AI-powered legal research tool requiring a database of legal questions and answers for training. To obtain the necessary data, ROSS partnered with LegalEase, which compiled and sold approximately 25,000 Bulk Memos—summaries created by lawyers referencing Westlaw headnotes. Whether the Bulk Memos involved verbatim copying or otherwise infringing copying was an issue in the case that ultimately went against ROSS. Upon discovering that ROSS had used content derived from these headnotes, Thomson Reuters filed a copyright infringement lawsuit. The summary judgment pertains only to a subset of the contested headnotes, leaving broader legal questions unresolved.

The court ruled against ROSS, determining that it had copied 2,243 headnotes and dismissing its various legal defenses, including claims of innocent infringement, copyright misuse, and the merger doctrine.

Ross’s use was not transformative

Judge Bibas ruled that ROSS’s use of Thomson Reuters’ material was commercial and non-transformative, a conclusion that weighed heavily in the publisher’s favor. According to the court, the use did not qualify as transformative because it lacked a distinct purpose or character from Thomson Reuters’ original work.

The court’s conclusion that Ross’s use was not transformative is puzzling, especially given its acknowledgment—while discussing the third fair use factor—that the output of Ross’s system did not replicate Westlaw’s copyrighted headnotes but rather produced uncopyrighted judicial opinions.

The court did distinguish two significant cases, Sega Enterprises Ltd. v. Accolade, Inc. and Sony Computer Entertainment, Inc. v. Connectix Corp., but failed to consider cases like iParadigms, HathiTrust, and Google Books. Even the way the court dealt with the reverse engineering cases is a bit suspect. The court set them aside for two reasons: first, because those cases involved copying software code, and second, because such copying was “necessary for competitors to innovate.” To be sure, Oracle v. Google suggests that cases involving software may merit special treatment, but it is not clear why the software context should make a difference here. Judge Bibas’s invocation of necessity is undercooked as well. Whether an act of copying is “necessary” is inextricably tied to the level of generality at which you ask the question. In Oracle v. Google, Google’s replication of the Java APIs was essential for compatibility with existing Java programmers, but whether that compatibility was a necessity or a luxury depends on how the question is framed. After all, other smartphones ran without making life easy for Java programmers.

Not generative AI, but why?

The judge took care to distinguish this case from generative AI, yet the distinction remains murky. The court stated: “Ross was using Thomson Reuters’s headnotes as AI data to create a legal research tool to compete with Westlaw. It is undisputed that Ross’s AI is not generative AI (AI that writes new content itself).” And later that “Because the AI landscape is changing rapidly, I note for readers that only non-generative AI is before me today.”

But what, exactly, sets this apart from generative AI? More broadly, how does this differ from other cases where nonexpressive uses have been deemed fair use? The opinion offers little guidance. It fails to engage with seemingly comparable precedents, such as plagiarism detection tools, library digitization for text analysis and digital humanities research, or the creation of a book search engine—cases where courts have found fair use.

The closest we get to an explanation of why Ross’s use of the Westlaw headnotes is different to the intermediate copying in iParadigms, HathiTrust, and Google Books is that Ross merely retrieves and presents judicial opinions in response to user queries. This process, the court observed, closely parallels Westlaw’s own practice of using headnotes and key numbers to identify relevant cases. Consequently, the court concluded that Ross’s use was not transformative, as it primarily served to facilitate the development of a competing legal research tool rather than to add new expression or meaning to the copied material.

Market effect

The court determined that ROSS’s actions impaired Thomson Reuters’ market for legal AI training data, and in its reasoning, the fourth fair use factor carried substantial weight. Without qualification, the opinion echoes Harper & Row’s assertion that the fourth factor “is undoubtedly the single most important element of fair use.” This is problematic. Asserting the absolute primacy of the fourth factor is obviously in error in light of Campbell, as well as the Court’s more recent decisions in Google v. Oracle and Andy Warhol Foundation. The Court’s contemporary approach to fair use eschews rigid hierarchies among the statutory factors.

That said, the judge’s finding in relation to the fourth factor may not be entirely unreasonable in this case: Ross explicitly intended to compete with Westlaw by creating a viable market alternative. For the court the key fact was that Ross “meant to compete with Westlaw by developing a market substitute.” “And it does not matter whether Thomson Reuters has used the data to train its own legal search tools; the effect on a potential market for AI training data is enough.”

Implications

One district court opinion that barely engages with the relevant caselaw will not change U.S. fair use law overnight, but it will certainly be welcome news for the plaintiffs in the more than 30 ongoing AI copyright cases currently being litigated.

I think what is really going on in this decision is that the judge has confused the first factor with the fourth factor. There is no obvious way to distinguish training on the question and answer memos to develop a model that directly links user questions to the relevant case law from cases involving search engines and plagiarism detection software. The real distinction, if there is one, is that ROSS used Westlaw’s product to create a directly competing product.

Looking at the case this way, the decision might actually be good for the generative AI defendants, in cases like NYT v OpenAI, because there isn’t the same direct competition. 

* This is my first quick take on the decision just hours after it was handed down.

* Citation: Thomson Reuters Enter. Ctr. GmbH v. ROSS Intelligence Inc., No. 1:20-cv-613-SB (D. Del. Feb. 11, 2025)

Book Review: Nick Seaver, Computing Taste: Algorithms and the Makers of Music Recommendation

(University of Chicago Press, 2022)

In Computing Taste, Nick Seaver provides an ethnographic exploration of the world of music recommendation systems, revealing how algorithms are deeply shaped by the humans who design them. He shows how the algorithms that drive music recommendations are shaped by human judgment, creativity, and cultural assumptions. The data companies collect, the way they construct models, how they intuitively test whether their models are working, and how they define success are all deeply human and subjective choices.

Beyond Man vs. Machine

Seaver points out that textbook definitions describe algorithms as “well-defined computational procedures” that take inputs and generate outputs, portraying them as deterministic and straightforward systems. This narrow view leads to a man-versus-machine narrative that is trite and unilluminating. Treating algorithms as though their defining quality is the absence of human influence reinforces misconceptions about their neutrality. Instead, Seaver advocates for focusing on the sociotechnical arrangements that produce different forms of “humanness and machineness,” echoing observations by Donna Haraway and others.

In practice, algorithmic systems are messy, constantly evolving, and shaped by human judgment. As Seaver notes, “these ‘cultural’ details are technical details,” meaning that the motivations, preferences, and biases of the engineering teams that design algorithms are inseparable from the technical aspects of the systems themselves. Therefore, understanding algorithms requires acknowledging the social and cultural contexts in which they operate.

From Information Overload to Capture

Seaver shows how the objective of recommendation systems has shifted from the founding myth of information overload to the current obsession with capturing user attention. Pioneers of recommender systems told stories of information overload that presented growing consumer choice as a problem in need of a solution. The notion of overwhelming users with too much content has been a central justification for creating algorithms designed to filter and organize information. If users are helpless in the face of vast amounts of data, algorithms become necessary tools to help them navigate this digital landscape. Seaver argues that the framing of overload justifies the control algorithms exert over what users see, hear, and engage with. The idea of “too much music” or “too much content” becomes a convenient rationale for developing systems that, in practice, do more than assist—they guide, constrain, and shape user choices.

In any event, commercial imperatives soon led to rationales based on information overload giving way to narratives of capture. Seaver compares recommender systems to traps designed to “hook” users, analyzing how metrics such as engagement and retention guide the development of algorithms.

The Netflix Prize, a 2006 competition aimed at improving Netflix’s recommendation algorithm, serves as a key example of this shift. Initially, algorithms were designed to help users manage “information overload” by personalizing content based on user preferences, as Netflix sought to predict what users would enjoy. However, Netflix never used the winning entry. As streaming services became central to Netflix’s business model, the focus of recommendation systems shifted from merely helping users find content to keeping them engaged on the platform for as long as possible. This transition from personalization to attention retention reflects the shift in the industry’s goals. Recommender systems, including those at Netflix, began to encourage continuous engagement by suggesting binge-worthy content to maximize viewing hours, implementing autoplay features to keep the next episode or movie rolling without user interaction, and focusing on actual viewing habits (e.g., “skip intro” clicks, time spent on a show, completion rates) rather than ratings to keep users hooked.

Seaver’s perspective is insightful, not unrelentingly critical. The final chapter investigates how the design of recommendation systems reflects the metaphor of a “park”—a managed, curated space that users are guided through. Recommender systems are neither strictly benign nor malign, but they do entail a loss of user agency. We, the listening public, are not trapped animals so much as a managed flock. Seaver recognizes that recommendation systems open up new possibilities for exploration while also constraining user behavior by narrowing choices based on past preferences.

Why Do My Playlists Still Suck?

The book also answers the question that motivated me to read it: why do my playlists still suck? No one has a good model for why we like the music that we like, when we like it, or how that extrapolates to music we haven’t heard yet. And Spotify and other corporate interests have no real interest in solving that puzzle for us. The algorithms that shape our cultural lives now prioritize engagement, rely on past behavior, and reflect a grab bag of assumptions about user preferences that are often in conflict. There is very little upside to offering us fresh or risky suggestions when a loop of familiarity will keep us more reliably engaged.

London Marathon!

On April 27, 2025, I will be running the London Marathon to raise money for research and treatment of pancreatic cancer.

The future is what we make it

My sister Rebecca was diagnosed with pancreatic cancer in 2018. She was as brave as anyone could be, but her battle was short-lived; Becky passed away just weeks after her diagnosis. Her story is typical: pancreatic cancer is almost always fatal because it is diagnosed too late.

But we can change this story.

I am asking you to help me raise money for Pancreatic Cancer UK which is pioneering efforts to develop earlier detection methods to give others the fighting chance that Becky deserved.

There are several different ways you can contribute:

(1) donate directly to Pancreatic Cancer UK at https://2025tcslondonmarathon.enthuse.com/pf/matthew-sag (immediate impact)

(2) send me money via Venmo (@Matthew-Sag) that I will then aggregate, convert into GBP, and donate in your name (minimizes foreign transaction fees)

(3) donate to a different pancreatic cancer charity in your country of choice, send me the details, and I’ll match that with a donation to Pancreatic Cancer UK (local impact)

Every contributor gets to add a song to my London Marathon Spotify Playlist!

If you donate to this cause, you can let me know what song I should add to my London Marathon Spotify Playlist. I will set it to shuffle during the race and try to remember who suggested which song.

Please join me 

Please join me on this journey: donate in Rebecca’s memory and bring us closer to a future where no more lives are stolen by this devastating disease.

Follow my progress

To see how my training is going, follow me on Strava, check out my Google spreadsheet (https://docs.google.com/spreadsheets/d/16hV2e-IxXbo01uM7_X5tGyOz6ONPtSZGJZdjKUeVH0k/edit?usp=sharing), or get narrative updates here on this page (https://2025tcslondonmarathon.enthuse.com/pf/matthew-sag).

Playlist so far …

Queen, Don’t Stop Me Now (my pick)

The Weeknd, Blinding Lights (my pick)

Monty Python, Always Look on the Bright Side of Life (Matt & Mindy Lawrence)

The Clash, London Calling (Spencer Waller)

Bruce Springsteen, Born to Run (Richard Fields)

Olivia Newton-John, Xanadu (Jo Groube)

The Beatles, The Long and Winding Road (Jan & Andy Sag)