This post is a very lightly edited extract from my forthcoming article in the Duke Law Journal, Copyright’s Jagged Frontier (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6319379)
What does AI memorization prove?
Some argue that any evidence of memorization necessarily negates the claim that AI models are transformative. They advance this claim by injecting the term “compression” into the conversation in a way that suggests that AI models like GPT, Claude, and Gemini are compressed representations of their training data in the same way that an MP3 music file is a compressed version of music from a compact disc.
“[model training is] similar to what’s called lossy compression, which one way to describe it is if you have a giant file and you compress it into a ZIP file, you lose some of the contents of the work, but effectively you’re just actually compressing the file. … it’s actually taking the expressive content of the training data and compressing it down into a model. And that confirms that there’s no actual transformative use going on here … what the model is doing is actually just repeating over and over the training data over and over again.”
— Bartz v. Anthropic, Transcript of Motion for Summary Judgment Oral Argument, May 22, 2025, pp. 44–45 (explaining Plaintiff’s expert’s view)
Alex Reisner (AI’s Memorization Crisis, The Atlantic), for example, draws on the Cooper and Ahmed studies, and argues that the evidence of memorization undermines the learning metaphor and reveals generative AI training for what it really is: “compression.” The upshot is, “Large language models don’t ‘learn’—they copy[.]” See also Ted Chiang’s famous essay: ChatGPT Is a Blurry JPEG of the Web.
Technically accurate but thoroughly misleading
Associating AI training with compression is technically accurate if you understand the term the way computer scientists do; but it is also thoroughly misleading if you associate compression with MP3s, JPEGs, and ZIP files, as most of us do.
AI models learn compact internal representations of their training data which capture whatever patterns enable more accurate predictions. It is equally valid to label this process “abstraction,” “learning,” “dimension reduction,” or “compression”; but the compression label invites analogy to familiar media formats such as MP3s and JPEGs.
These formats store approximations of original works that can later be reconstructed in forms that closely resemble their sources and are usually regarded as functionally indistinguishable. Other than hipsters with a taste for vinyl records, consumers interact with ZIP files, JPEGs, and MP3s as functionally equivalent to their uncompressed originals; whatever information is discarded is socially normalized as imperceptible. Side note: I highly recommend Jonathan Sterne, MP3: The Meaning of a Format (2012).
Calling it compression tells you nothing
Training an AI model is nothing like ripping music into an MP3 format. Calling that process “compression” tells you nothing about the level of detail of what is learned or the significance of the information discarded. The compression metaphor is further misleading because it implies uniformity and predictability. In conventional audio or image compression, the same categories of information are discarded from every file according to stable and transparent criteria that reflect advance judgments about what matters and what does not. By contrast, memorization in large language models is uneven, incidental, and difficult to anticipate. We know that memorization is more likely when a model is exposed to multiple copies of the same work, and that the timing of exposure during training can matter. Beyond such generalities, however, it is not possible to predict in advance which works will be retained verbatim or to what degree.
The rhetoric of compression is really just an effort to sidestep a difficult empirical question, rather than to answer it. The fact that one thing is memorized to a degree that seems relevant under copyright law doesn’t prove that everything is memorized to a similar degree.
Evaluating whether memorization actually has significance under copyright law requires some kind of qualitative and quantitative assessment of the nature and extent of memorization. But even that statement is overbroad: as I explain in Copyright’s Jagged Frontier, what actually matters for a fair use analysis is not memorization in the abstract, but memorization that finds its way into production.
