Piracy, Proxies, and Performance: Rethinking Books3’s Reported Gains

A new NBER working paper by Stella Jia and Abhishek Nagaraj makes some stunning claims about the effects of pirated book corpora on large-language-model (LLM) performance. In Cloze Encounters: The Impact of Pirated Data Access on LLM Performance (working paper, May 19, 2025, https://www.nber.org/papers/w33598), the authors contend that access to Books3, a pirated collection of full-text books, raises measured performance by roughly 21–23 percent in some LLMs.

This astonishing finding is an artifact of the paper’s methodology and of the very narrow definition of “performance” it adopts; as such, it should not be taken at face value.

Cloze Encounters’ methodology and claims

Jia and Nagaraj assemble a 12,916-book evaluation set and apply a “name cloze” task: mask a named entity in a short passage and ask the model to supply it.

For instance, given a sentence like “Because you’re wrong. I don’t care what he thinks. [MASK] pulled his feet up onto the branch” from The Lightning Thief, the model should identify “Grover” as the missing name.
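
Mechanically, the probe is very simple. Here is a minimal sketch of one trial, assuming an OpenAI-compatible Python client; the prompt wording, the “gpt-4o-mini” model name, and the exact-match rule are illustrative choices of mine, not the paper’s exact protocol:

```python
# A minimal sketch of one name-cloze trial, assuming an OpenAI-compatible
# Python client. The prompt wording, the "gpt-4o-mini" model name, and
# the exact-match rule are illustrative choices, not the paper's protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def name_cloze_hit(passage: str, answer: str, model: str = "gpt-4o-mini") -> bool:
    """Ask the model to fill in [MASK]; score the reply as a binary hit."""
    prompt = (
        "Fill in [MASK] in the passage below with the single proper name "
        "that belongs there. Reply with the name only.\n\n" + passage
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    guess = response.choices[0].message.content.strip()
    # All-or-nothing scoring on the surface string: one bit per passage.
    return guess.lower() == answer.lower()

passage = ("Because you're wrong. I don't care what he thinks. "
           "[MASK] pulled his feet up onto the branch")
print(name_cloze_hit(passage, "Grover"))
```

Each trial yields one bit: the name either matches or it does not. Aggregated over thousands of passages, that hit rate is the paper’s entire measure of “performance.”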

The main results of Cloze Encounters are estimates of “performance” showing large, statistically significant gains for GPT-class models (about 21–23 percent relative to baseline), smaller gains for Claude/Gemini/Llama-70B (about 7–9 percent), and no detectable effect for Llama-8B. The effects are stronger for less-popular book titles, consistent with fewer substitutes (Internet reviews or summaries) in other training data.

This is all well and good, but the way the authors explicitly link these findings to current controversies over copyright policy, licensing markets, and training-data attribution is troubling.

Cloze Encounters is not measuring “performance” in any way that people should care about

The first thing that raised my suspicion about this paper is that I had already seen this exact methodology used as a clever way to illustrate memorization and to show that some books are memorized more than others. See Kent Chang et al., “Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4” (https://arxiv.org/abs/2305.00118). Cloze Encounters scales and repurposes that approach into a causal analysis of how access to pirated books in Books3 leads to improved LLM “performance.” But it doesn’t make sense to me that what counted as a memorization probe in one paper can simply be relabeled as a general “performance” metric in another.

Why is memorization so different from performance?

This is a question of construct validity. The method in Cloze Encounters tests recall of a masked name from a short passage, scored as a binary hit. This kind of lexical recall is a narrow slice of linguistic ability that is highly sensitive to direct exposure to the source text. It’s a proxy for memorization rather than the broad competencies that make LLMs interesting and useful.

The capabilities that matter in practice (long-context understanding, abstraction and synthesis, factual grounding outside narrative domains, reliable instruction following) are largely orthogonal to masked-name recall. Calling the cloze score “LLM performance” is a massive over-generalization from a task that measures a thin, exposure-sensitive facet of behavior. As an evaluation device, name-cloze is sharp for detecting whether models learned from, or memorized, a specific source; it is blunt for assessing overall performance. There is no reason to think that evidence that a model has memorized snippets of particular works in the Books3 dataset bears any necessary relationship to being a better translator, drafter, summarizer, brainstorming partner, and so on.

This paper is begging to be misread and misapplied in policy and legal debates

I wouldn’t go so far as to say that success on the cloze task tells us “literally nothing” about LLM performance: “almost nothing” is a fairer estimate. To see why, think about the process of pre-training. Pre-training optimizes next-token prediction over trillions of tokens; the cloze outcome is, by construction, essentially the same as that objective, so it is not surprising that it is unusually sensitive to direct exposure to particular pieces of training data. There probably is a broad correlation between next-token accuracy and perceived usefulness (we certainly saw this in the transition from GPT-3.5 to GPT-4), but the relationship is not lockstep, and it’s easy to imagine a model that excels at memorization alone but generalizes poorly.
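
To make the overlap concrete, here is a rough sketch using a small Hugging Face causal LM (“gpt2”) as a stand-in; the model choice and boundary handling are mine, purely for illustration. The score for the masked name is just a sum of next-token log-probabilities, the very quantity pre-training maximizes:

```python
# A rough illustration of why the name-cloze score sits so close to the
# pre-training objective, using a small Hugging Face causal LM ("gpt2")
# as a stand-in. Tokenizer boundary effects are glossed over here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
lm.eval()

prefix = "Because you're wrong. I don't care what he thinks."
name = " Grover"  # leading space so it tokenizes as the next word

prefix_len = tok(prefix, return_tensors="pt").input_ids.shape[1]
ids = tok(prefix + name, return_tensors="pt").input_ids

with torch.no_grad():
    logprobs = torch.log_softmax(lm(ids).logits, dim=-1)

# The cloze "score" for the name is the sum of the log-probabilities of
# its tokens, each conditioned on everything before it; this is precisely
# the next-token quantity that pre-training maximizes.
score = sum(
    logprobs[0, pos - 1, ids[0, pos]].item()
    for pos in range(prefix_len, ids.shape[1])
)
print(f"log p({name.strip()!r} | prefix) = {score:.2f}")
```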

The authors nod to these limitations at various points in the manuscript, but they still frame the result as a measure of “LLM performance” in a way that is just begging to be misread and misapplied in policy and legal debates. Abstract-level claims travel further than caveats; many readers will see the former and miss the latter.

Nor does the identification strategy employed in the paper rescue the construct from these limits. The instrumental variable, a title’s publication-year share in Books3, may well isolate an exogenous shock to exposure. But even granting the exclusion restriction, the estimate remains the effect of Books3 on a name-cloze score. It tells us little about summarization, reasoning, instruction following, safety behavior, or cross-domain generalization.
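
To see the shape of the argument, consider a toy two-stage least squares sketch on synthetic data. The variable names echo the paper’s design, but every number is invented; the point is only that a valid instrument purges confounding without changing what the outcome variable is:

```python
# A toy two-stage least squares sketch on synthetic data. Variable names
# echo the paper's design (instrument = publication-year share in Books3),
# but every number here is invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Instrument: exogenous variation in how much of a publication year's
# output ended up in Books3.
books3_share = rng.uniform(0, 1, n)
# Unobserved confounder (say, a book's popularity) drives both training
# exposure and cloze success, which is what biases a naive regression.
popularity = rng.normal(size=n)
exposure = 0.8 * books3_share + 0.5 * popularity + rng.normal(scale=0.3, size=n)
cloze_score = 1.0 * exposure + 0.7 * popularity + rng.normal(scale=0.5, size=n)

def design(*cols):
    """Design matrix with an intercept column."""
    return np.column_stack([np.ones(n), *cols])

def ols(y, X):
    """Least-squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

naive_slope = ols(cloze_score, design(exposure))[1]  # biased by popularity
fitted = design(books3_share) @ ols(exposure, design(books3_share))  # stage 1
iv_slope = ols(cloze_score, design(fitted))[1]  # stage 2
print(f"naive OLS: {naive_slope:.2f}, 2SLS: {iv_slope:.2f}, true effect: 1.00")
```

The 2SLS coefficient recovers the true effect where naive OLS is biased upward, but in both regressions the thing being explained is still only a cloze score.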

Bottom line

Cloze Encounters usefully documents that access to Books3 leaves a measurable imprint on exposure-sensitive recall. But its central metric does not justify the broad claims it makes about “LLM performance.” The study measures whether models can fill in masked strings drawn from particular books; it does not show that such access improves the flexible, user-tailored generation that makes these systems valuable.