Competition from AI music: Country Girls Make Do

As of October 2025, Suno and Udio are two text-to-music AI platforms that let users create full songs—including lyrics, vocals, and artwork—simply by entering text prompts. Some of this music is unappealing, even to its creators (protagonists?), but music scene insiders have assured me that some of the music emanating from these platforms is good enough to provoke a wistful, “I wish I had written that.”

AI music is also becoming more popular. A recent article in The Economist (of all places) recounts the viral success of “Country Girls Make Do,” a raunchy parody country song generated by artificial intelligence under the pseudonym Beats By AI. The song has apparently become a fixture on TikTok, where users prank the unsuspecting by playing it under false pretenses.

This is more than a one-off. Acts such as Aventhis and The Velvet Sundown, also AI-based, have attracted hundreds of thousands of monthly listeners on Spotify. These tools allow for rapid and prolific production: Beats By AI reportedly releases a new song every day. Nor is this simply a case of streaming fraud, where AI slop steals music plays from real artists by adopting confusing names—Spotify recently removed 75 million such tracks, citing “bad actors” flooding the platform with low-quality content. Some people, at least, like some AI music. The Economist reports a Luminate survey finding that one-third of Americans accept AI-written instrumentals, nearly 30% are fine with AI lyrics, and over a quarter do not mind AI vocals.

No music stands alone, but AI music arguably even less so

The appeal of these tracks lies partly in their echoes of established genres and tropes, with a dash of irony and experimentation thrown in. It remains to be seen whether this portends a consumer-driven revolution in content creation where listeners generate their own entertainment rather than relying on record labels.

What does this mean for copyright law?

Although the Copyright Office would not regard works of The Velvet Sundown or Beats By AI as copyrightable, Spotify seems happy to pay royalties for AI music, provided the works themselves (as opposed to the copying that fed the AI process that created the works) don’t infringe on other artists’ songs.

AI music may destabilize entrenched business models at the fringes, but it might also foster broader participation and new forms of cultural expression. Does AI pose the same threat to the economic and cultural standing of musicians as it does to stock photography and digital art? Or will AI-generated music remain a hybrid layer within popular culture that feeds off and refers back to mainstream music without replacing the central role of human creation? If so, perhaps at least some country girls will make do.

Piracy, Proxies, and Performance: Rethinking Books3’s Reported Gains

A new NBER working paper by Stella Jia and Abhishek Nagaraj makes some stunning claims about the effects of pirated book corpora on large-language-model (LLM) performance. In Cloze Encounters: The Impact of Pirated Data Access on LLM Performance (May 19, 2025) (working paper, https://www.nber.org/papers/w33598), the authors contend that access to Books3—a pirated collection of full-text books—raises measured performance by roughly 21–23 percent in some LLMs.

This astonishing finding is an artifact of the paper’s methodology and the very narrow definition of “performance” that it adopts; as such, it should not be taken at face value.

Cloze Encounters’ methodology and claims

Jia and Nagaraj assemble a 12,916-book evaluation set and apply a “name cloze” task: mask a named entity in a short passage and ask the model to supply it.

For instance, given a sentence like “Because you’re wrong. I don’t care what he thinks. [MASK] pulled his feet up onto the branch” from The Lightning Thief, the model should identify “Grover” as the missing name.
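The mechanics of the name-cloze task can be sketched in a few lines of Python. This is my own illustration of the general setup (the prompt template and exact-match scoring rule are my assumptions, not the authors’ actual implementation):

```python
# Illustrative sketch of a "name cloze" probe: mask a named entity in a
# passage, ask a model to supply it, and score the answer as a binary hit.
# Prompt wording and scoring are assumptions, not the paper's exact code.

def make_cloze_prompt(passage: str) -> str:
    """Wrap a passage containing a [MASK] token in an instruction prompt."""
    return (
        "Fill in the single proper name that replaces [MASK] in the passage.\n"
        f"Passage: {passage}\nName:"
    )

def score_guess(guess: str, gold: str) -> int:
    """Binary hit: 1 if the model's guess matches the masked name exactly."""
    return int(guess.strip().lower() == gold.strip().lower())

passage = (
    "Because you're wrong. I don't care what he thinks. "
    "[MASK] pulled his feet up onto the branch"
)
prompt = make_cloze_prompt(passage)
print(score_guess("Grover", "Grover"))  # 1: a hit
print(score_guess("Percy", "Grover"))   # 0: a miss
```

Note that the binary exact-match score is what makes this probe so exposure-sensitive: partial understanding of the passage earns no credit unless the model reproduces the specific name from the source text.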

The main results of Cloze Encounters are estimates of “performance” showing large, statistically significant gains for GPT-class models (about 21–23 percent relative to baseline), smaller gains for Claude/Gemini/Llama-70B (about 7–9 percent), and no detectable effect for Llama-8B. The effects are stronger for less-popular book titles, consistent with fewer substitutes (Internet reviews or summaries) in other training data.

This is all well and good, but the way the authors explicitly link these findings to current controversies relating to copyright policy, licensing markets, and training-data attribution is troubling.

Cloze Encounters is not measuring “performance” in any way that people should care about

The first thing that raised my suspicion about this paper is that I had already seen this exact methodology used as a clever way to illustrate memorization and to show how some books are memorized more than others. See Kent Chang et al., “Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4” (https://arxiv.org/abs/2305.00118). Cloze Encounters scales and repurposes that approach for a causal analysis of how access to pirated books in the Books3 dataset led to improved LLM “performance.” But it doesn’t make sense to me that what counted as a memorization probe in one paper could simply be relabeled as a general “performance” metric in another.

Why is memorization so different from performance?

This is a question of construct validity. The method in Cloze Encounters tests recall of a masked name from a short passage, scored as a binary hit. This kind of lexical recall is a narrow slice of linguistic ability that is highly sensitive to direct exposure to the source text. It’s a proxy for memorization rather than the broad competencies that make LLMs interesting and useful.

The capabilities that matter in practice—long-context understanding, abstraction and synthesis, factual grounding outside narrative domains, reliable instruction following—are largely orthogonal to masked-name recall. Calling the cloze score “LLM performance” is a massive over-generalization from a task that measures a thin, exposure-sensitive facet of behavior. As an evaluation device, name-cloze is sharp for detecting whether models learned from—or memorized—a specific source; it is blunt for assessing overall performance. There is no reason to think that evidence of memorized snippets from particular works in the Books3 dataset has any necessary relationship with being a better translator, drafter, summarizer, brainstorming partner, etc.

This paper is begging to be misread and misapplied in policy and legal debates

I wouldn’t go so far as to say that success on the cloze score tells us “literally nothing” about LLM performance: “almost nothing” is a fairer estimate. To see why, think about the process of pre-training. Pre-training optimizes next-token prediction over trillions of tokens; the cloze outcome is, by construction, basically the same as that objective. So it is not surprising that it is unusually sensitive to direct exposure to given pieces of training data. There probably is a broad correlation between next-token accuracy and perceived usefulness (we certainly saw this in the transition from GPT-3.5 to GPT-4), but the relationship is not lockstep, and it’s easy to imagine a model that excels at memorization alone but generalizes poorly.

The authors nod to these limitations at various points in the manuscript, but they still frame the cloze score as a measure of “LLM performance” in a way that is just begging to be misread and misapplied in policy and legal debates. Abstract-level claims travel further than caveats; many readers will see the former and miss the latter.

Nor does the identification strategy employed in the paper rescue the construct from these limits. The instrumental variable—publication-year share in Books3—may isolate an exogenous shock to exposure. But even granting the exclusion restriction, the estimate remains the effect of Books3 on a name-cloze score. It tells us little about summarization, reasoning, instruction following, safety behavior, or cross-domain generalization.
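For readers unfamiliar with the technique, the instrumental-variable logic can be sketched with simulated data. The data-generating process and variable names below are my own illustration of two-stage least squares in general, not the paper’s actual specification:

```python
# Minimal two-stage least squares (2SLS) sketch using NumPy.
# z plays the role of the instrument (e.g., publication-year share in Books3),
# x the endogenous exposure, u an unobserved confounder, y the outcome.
import numpy as np

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)                         # instrument
u = rng.normal(size=n)                         # unobserved confounder
x = 0.8 * z + 0.5 * u + rng.normal(size=n)     # endogenous exposure
y = 1.0 * x + 0.5 * u + rng.normal(size=n)     # outcome; true effect = 1.0

def tsls(y, x, z):
    """2SLS with one instrument and a constant."""
    Z = np.column_stack([np.ones_like(z), z])
    X = np.column_stack([np.ones_like(x), x])
    # First stage: project the endogenous regressor onto the instrument.
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    # Second stage: regress the outcome on the fitted values.
    return np.linalg.lstsq(X_hat, y, rcond=None)[0][1]

print(tsls(y, x, z))  # estimate should be close to the true effect of 1.0
```

The point of the sketch is that even when 2SLS cleanly recovers the causal effect of exposure on the chosen outcome, it says nothing about whether that outcome is the right one to care about—which is precisely the construct-validity problem here.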

Bottom line

Cloze Encounters usefully documents that access to Books3 leaves a measurable imprint on exposure-sensitive recall. But its central metric does not justify the broad claims it makes about “LLM performance.” The study measures whether models can fill in masked strings drawn from particular books; it does not show that such access improves the flexible, user-tailored generation that makes these systems valuable.

I have moved to Emory University School of Law

Posts on this website are infrequent these days. But I thought it was worth mentioning that I have moved to Atlanta to take a position on the amazing Emory Law faculty. I was hired as a Professor of Law in Artificial Intelligence, Machine Learning, and Data Science as part of Emory’s bold new AI.Humanity initiative.

You can read the Emory announcement here: https://law.emory.edu/news-and-events/releases/2022/04/sag_joins_emory_law.html