David Kemp’s AI Policy Builder for Legal Education

David Kemp has just released a policy builder designed to help instructors who are struggling to craft an AI policy relevant to their specific courses.

According to the website, the policy content is “grounded in Sag, AI Policies for Law Schools (2025); Bliss, Teaching Law in the Age of Generative AI (2024); Perkins, AI Now: The Duty to Integrate AI Education in Law Schools (2024); Moppett, Preparing Students for the AI Era (2025); ABA Formal Opinion 512; and state bar guidance. All content is dedicated to the public domain under CC0 1.0 Universal.”

I have not tested this, but it seems like a fabulous idea. If you want to know more about my thoughts on building AI policies for law school classes, I have a paper on this topic on SSRN, the upshot of which is that effective AI policies must be course-specific, enforceable, and focused on teaching students to use AI responsibly as future legal professionals.

The Fallacy of Compression

This post is a very lightly edited extract from my forthcoming article in the Duke Law Journal, Copyright’s Jagged Frontier (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6319379).

What does AI memorization prove?

Some argue that any evidence of memorization necessarily negates the claim that AI models are transformative. They advance this claim by injecting the term “compression” into the conversation in a way that suggests that AI models like GPT, Claude, and Gemini are compressed representations of their training data in the same way that an MP3 music file is a compressed version of music from a compact disc.

“[model training is] similar to what’s called lossy compression, which one way to describe it is if you have a giant file and you compress it into a ZIP file, you lose some of the contents of the work, but effectively you’re just actually compressing the file. … it’s actually taking the expressive content of the training data and compressing it down into a model. And that confirms that there’s no actual transformative use going on here … what the model is doing is actually just repeating over and over the training data over and over again.”

— Bartz v. Anthropic, Transcript of Motion for Summary Judgment Oral Argument, May 22, 2025, pp. 44–45 (explaining Plaintiff’s expert’s view)

Alex Reisner (AI’s Memorization Crisis, The Atlantic), for example, draws on the Cooper and Ahmed studies, and argues that the evidence of memorization undermines the learning metaphor and reveals generative AI training for what it really is: “compression.” The upshot is, “Large language models don’t ‘learn’—they copy[.]” See also Ted Chiang’s famous essay: ChatGPT Is a Blurry JPEG of the Web.

Technically accurate but thoroughly misleading

Associating AI training with compression is technically accurate if you understand the term the way computer scientists do; but it is also thoroughly misleading if you associate compression with MP3s, JPEGs, and Zip files, as most of us do.

AI models learn compact internal representations of their training data that capture whatever patterns enable more accurate predictions. It is equally valid to label this process “abstraction,” “learning,” “dimension reduction,” or “compression”; but the compression label invites analogy to familiar media formats such as MP3s and JPEGs.

These formats store approximations of original works that can later be reconstructed in forms that closely resemble their sources and are usually regarded as functionally indistinguishable. Other than hipsters with a taste for vinyl records, consumers interact with ZIP files, JPEGs, and MP3s as functionally equivalent to their uncompressed originals; whatever information is discarded (ZIP, strictly speaking, discards nothing at all) is socially normalized as imperceptible. Side note: I highly recommend Jonathan Sterne, MP3: The Meaning of a Format (2012).
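The distinction can be made concrete with a few lines of Python. This is my own illustration, not anything drawn from the article or the litigation: DEFLATE (the algorithm inside ZIP files) is lossless and recovers the original exactly, while a toy lossy scheme discards information from every sample according to a fixed, transparent rule; that uniformity is precisely what model memorization lacks.

```python
import zlib

# Lossless compression (DEFLATE, the algorithm inside ZIP files):
# every byte of the original is recovered exactly.
text = b"The quick brown fox jumps over the lazy dog. " * 100
packed = zlib.compress(text)
assert zlib.decompress(packed) == text
print(f"{len(text)} bytes -> {len(packed)} bytes, recovered exactly")

# A toy lossy scheme: quantize sample values, discarding the same
# low-order detail from every sample under a fixed, known rule.
samples = [0, 3, 7, 12, 200, 255]
step = 8
lossy = [s // step * step for s in samples]
print(lossy)  # [0, 0, 0, 8, 200, 248] -- uniform, predictable loss
```

In the lossy case, one can state in advance exactly what will be lost (everything below the quantization step); nothing comparable can be said about which passages a language model will retain verbatim.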

Calling it compression tells you nothing

Training an AI model is nothing like ripping music into an MP3 format. Calling that process “compression” tells you nothing about the level of detail of what is learned or the significance of the information discarded. The compression metaphor is further misleading because it implies uniformity and predictability. In conventional audio or image compression, the same categories of information are discarded from every file according to stable and transparent criteria that reflect advance judgments about what matters and what does not. By contrast, memorization in large language models is uneven, incidental, and difficult to anticipate. We know that memorization is more likely when a model is exposed to multiple copies of the same work, and that the timing of exposure during training can matter. Beyond such generalities, however, it is not possible to predict in advance which works will be retained verbatim or to what degree.

The rhetoric of compression is really just an effort to sidestep a difficult empirical question, rather than to answer it. The fact that one thing is memorized to a degree that seems relevant under copyright law doesn’t prove that everything is memorized to a similar degree.

To evaluate whether memorization actually has significance under copyright law requires some kind of qualitative and quantitative assessment of the nature and extent of memorization. But even that statement is overbroad: as I explain in Copyright’s Jagged Frontier, what actually matters in a fair use analysis is not memorization in the abstract, but memorization that finds its way into production.

Copyright’s Jagged Frontier

Why the Line Between Legal and Infringing AI Won’t Be a Line at All

By Matthew James Sag

Everyone wants to know whether training AI on copyrighted works is legal. The real answer is: it depends—and the boundary between what’s permissible and what isn’t will be far messier than anyone expects.

In my forthcoming article in the Duke Law Journal, I argue that the copyright boundary for generative AI will be jagged rather than smooth. Not a clean bright line, but an irregular, context-dependent frontier shaped by the interaction of varying memorization rates across different AI models, divergent legal standards of similarity across different creative media, and the interplay of three distinct bodies of copyright doctrine (substantial similarity, fair use, and secondary liability).

Understanding that jaggedness turns out to be essential—not just for predicting litigation outcomes, but for seeing the opportunities that lie on the other side.

The phrase “jagged frontier” will be familiar to many. It comes from the influential 2023 study by Fabrizio Dell’Acqua, Ethan Mollick, and colleagues, who used it to describe the uneven capability landscape of AI itself. It’s a useful concept because it captures the way that AI can be astonishingly good at some tasks while failing at others that seem equally difficult.

I borrow the metaphor deliberately, because copyright law presents generative AI with an analogous problem. The legal boundary between permissible and infringing AI conduct is similarly jagged: not because AI’s capabilities are uneven (though they are), but because the legal standards that determine infringement are themselves uneven across different creative domains. It seems likely that an AI system can cross the line into copyright infringement far more easily when generating music or images of recognizable characters than when generating prose—even when the underlying technology is essentially the same.

Explaining how and why the intersection of copyright and AI leads to a jagged frontier accounts for the first third of the article.

That jagged frontier is only the beginning of the story. Drawing on Ronald Coase’s insight that legal rules are starting points for adaptation and negotiation rather than final allocations, the Article argues that the extensive literature on AI and copyright has focused almost exclusively on fair use while ignoring what comes next. I might have something to say about that in a future post.

Matthew James Sag is the Jonas Robitscher Professor of Law in Artificial Intelligence, Machine Learning, and Data Science at Emory University School of Law. His article “Copyright’s Jagged Frontier” is forthcoming in the Duke Law Journal.