Matthew Sag – Writes about Copyright, AI, Machine learning, and Empirical legal studies

April 2, 2025

Emory Law AI Roundtable 2025

The Fourth Annual Legal Scholars Roundtable on Artificial Intelligence 2025 will be held next week at Emory Law and I am very excited by the amazing line-up of speakers and commentators we have.

AI Roundtable Papers

Neel Guha, Information in AI Regulation
Michael Goodyear, Dignity and Deepfakes
Kat Geddes, AI’s Attribution Problem
Deven Desai & Mark Riedl, Responsible AI Agents
Nikola Datzov, AI Jurisprudence: Toward Automated Justice
Yiyang Mei & Matthew Sag, The Illusion of Rights-Based AI Regulation
David Rubenstein, Federalism & Algorithms
Oren Bracha, Generative AI Two Information Goods

Some of these papers are available in draft on SSRN.com or arXiv.com, others are still in development.

AI Roundtable Keynote

We also have a special keynote from Prof. Barton Beebe, presenting his new book manuscript “Technological Change and the Beautiful Deaths of Law: A Recurring History.” The Roundtable is invitation only, Emory faculty and students who are interested in attending should contact me for details.

History of the Legal Scholars Roundtable on Artificial Intelligence

The Roundtable was founded by Professor Matthew Sag and Professor Charlotte Tschider in March 2022 as an online event (due to the Covid-19 Pandemic) and has been conducted as an annual event at Emory Law School ever since. The Roundtable is supported by Emory University School of Law and by Emory’s AI.Humanity initiative.

The following were recognized as the Roundtable’s Best Paper in their respective years: Rebecca Crootof, Margot Kaminski, & Nicholson Price, Humans in the Loop, 76 Vanderbilt Law Review 429 (2023) (Best paper of 2022); Matthew T. Wansley, Regulating Driving Automation Safety, 73 Emory Law Journal 505 (2024) (Best paper of 2023); Mark Bartholomew, A Right to Be Left Dead, 112 California Law Review 1591 (2024) (Best paper of 2024)

March 20, 2025

Copyright and the AI Action Plan

On March 14, 2025, I submitted my comments to the Office of Science and Technology Policy in relation to the “AI Action Plan”. For context, the Office of Science and Technology Policy requested input on the Development of an Artificial Intelligence (AI) Action Plan to define the priority policy actions needed to sustain and enhance America’s AI dominance, and to ensure that unnecessarily burdensome requirements do not hamper private sector AI innovation. See Exec. Order No. 14,179, 90 Fed. Reg. 8741 (Jan. 31, 2025)(Executive Order titled “Removing Barriers to American Leadership in Artificial Intelligence,” signed by President Trump).

What follows is a lightly edited version of those comments (mostly removing footnotes, but also making a couple minor improvements).

AI Action Plan, Submission to the Office of Science and Technology Policy

I am the Jonas Robitscher Professor of Law in Artificial Intelligence, Machine Learning, and Data Science, Emory University. I appreciate the opportunity to contribute to OSTP’s call for policy ideas aimed at enhancing America’s global leadership in Artificial Intelligence (AI).

My primary points in this submission are that if, contrary to precedent and sound policy, American courts rule that training AI models on copyrighted works is not permissible as fair use, the U.S. government must be ready to act. And furthermore, to maintain U.S. leadership in artificial intelligence, the AI Action Plan should explicitly affirm the importance of broad copyright exceptions—particularly fair use for nonexpressive activities like AI model training.

How copyright law in various countries deals with AI training

In “The Globalization of Copyright Exceptions for AI Training” my co-author Professor Peter Yu and I examine how copyright frameworks across the world have addressed the apparent tension between copyright law and copy-reliant technologies such as computational data analysis in the form of text data mining (TDM), machine learning and AI.

Our research reveals that, although the world has yet to achieve a true consensus on copyright and AI training, an international equilibrium has emerged. In this equilibrium, countries recognize that TDM, machine learning and AI training can be socially valuable and do not inherently prejudice the copyright holders’ legitimate interests. Policymakers in the European Union, Japan, Israel, and Singapore agree in general terms that such uses should therefore be allowed without express authorization in some, but not necessarily all, circumstances.

Major industrialized economies have found different ways to this equilibrium position. Some, like the U.S. and Israel have done so through the fair use doctrine. Others, like Japan, Singapore, and the European Union, have crafted express copyright exceptions for TDM and computational data analysis. Other nations where the rule of law is not so clearly established are energetically pursuing AI development with state backing without updated copyright laws to facilitate AI training. There is little doubt that if the Chinese Communist Party deems copyright law an impediment to its AI ambitions, the law in China will change almost instantaneously, and very likely retrospectively.

U.S. litigation could unsettle global AI copyright norms

American courts have historically recognized fair use protections for technologies relying on nonexpressive copying, such as reverse engineering, plagiarism detection software, digital library searches, and computational humanities research spanning millions of scanned texts. Extending this principle logically, training AI models—which similarly involves copying without directly reproducing expressive content—would usually qualify as fair use. (For citations and discussion of the relevant literature, see Matthew Sag, Fairness and Fair Use in Generative AI, 92 Fordham Law Review 1887 (2024))

Yet, plaintiffs in more than 30 ongoing lawsuits across U.S. district courts contest this view. Collectively, they seek injunctions barring AI training without explicit consent, billions in monetary compensation, and even destruction of existing AI models. Although, in my estimation and that of many copyright experts, the plaintiffs are should not prevail on sweeping arguments that would bring AI training in the U.S. to a halt, they might.

A bad court decision may drive AI innovation offshore

Adverse outcomes in U.S. litigation will not stop the development of AI, they will simply push AI innovation overseas. The reason is straightforward: AI models, once trained, are easily portable. Companies seeking to avoid restrictive copyright rules could simply move their training operations to innovation-friendly jurisdictions like Singapore, Israel, or Japan, and then serve U.S. customers remotely, entirely free of domestic copyright concerns.

How is this possible? AI developers need fair use for all the copying that takes place to make training possible, but they don’t need fair use once the models have been trained because, by-and-large, trained AI models do not replicate the expressive details of their training datasets; instead, they distill general patterns, abstractions, and insights from that training data.

Thus, in the eyes of copyright law, these models are neither copies nor derivative works based on the training data. If U.S. copyright law turns against our AI industry, companies in the U.S. will still be able to use models trained in AI-friendly jurisdictions by either setting up a data pipeline so that the model stays overseas or hosting their models in the United States once it has been trained. Consequently, imposing overly restrictive copyright interpretations domestically will do very little to turn back the tide on AI, but risks surrendering America’s AI advantage to more AI-friendly jurisdictions.

Licensing deals are no substitute for fair use

While licensing agreements between AI developers and media companies are becoming more common, they cannot solve copyright concerns surrounding AI training. The sheer scale of AI training data makes the licensing approach impractical at the cutting edge. For instance, Meta’s recent Llama 3 model consumed over 15 trillion (15,000,000,000,000) tokens drawn from publicly accessible sources. To put this into perspective, assuming that the New York Times print edition is roughly fifty pages per day, each page has 4000 words (this is probably way over!), and there are 1.3 tokens per word, the newspaper would generate roughly 1.82 million tokens per week. At that rate, it would take about 158,500 years for the New York Times to generate 15 trillion tokens.

Licensing may be possible for some AI training, but licensing at the scale required to train frontier LLMs is not a realistic foundation for American industrial policy, it is a fantasy.

Nevertheless, existing deals with major media companies illustrate something important: AI developers are willing to pay for efficient access to high-quality datasets otherwise locked behind paywalls or machine-readable restrictions. Such agreements suggest that licensing has a niche but crucial role—not as a substitute for broad exceptions like fair use, but rather as a complementary source of premium training data. This dynamic becomes particularly valuable in AI-powered search scenarios, where language models frequently generate outputs closely resembling original copyrighted content, pushing the boundaries between acceptable use and potential infringement.

The U.S. Government must be ready to act

If, contrary to precedent and sound policy in my view, American courts rule that training AI models on copyrighted works is not permissible as fair use, the U.S. government should act. Specifically, the government would need to introduce legislation to reinstate the principle that training AI models typically falls under fair use or create a specific statutory exemption. I see no way this could be done through agency rulemaking or executive action. Legislative intervention would be necessary to safeguard America’s competitive edge against innovation-friendly jurisdictions like Japan, Singapore, Israel, and, in this context, even the European Union.

To maintain U.S. leadership in artificial intelligence, the AI Action Plan should explicitly affirm the importance of broad copyright exceptions—particularly fair use for nonexpressive activities like AI model training.

February 11, 2025

Thomson Reuters v. ROSS Intelligence (Summary Judgement)

In a closely watched decision revising a previous summary judgment, Judge Stephanos Bibas, a Third Circuit judge sitting by designation, sided largely with Thomson Reuters in its copyright dispute against ROSS Intelligence. The ruling granted partial summary judgment on direct copyright infringement claims while dismissing ROSS’s argument that its use of Thomson Reuters’ content qualified as fair use.

With Ross Intelligence now bankrupt and the technology at issue a decidedly niche application, attention is shifting to the broader implications for AI training and the use of copyrighted materials—particularly in the realm of generative AI. Earlier, Judge Bibas had refused to grant summary judgment on fair use, insisting the matter be put before a jury. However, upon further reflection, he reversed course, ultimately rejecting the defendant’s fair use defense outright.

Background

Thomson Reuters, the owner of Westlaw, accused the AI-driven legal research firm ROSS of copyright infringement, alleging that it had improperly used legal summaries—so-called Bulk Memos—derived from Westlaw’s editorial materials, particularly its headnotes, to train its technology. Thomson Reuters had refused to license its content to ROSS, a rival developing an AI-powered legal research tool requiring a database of legal questions and answers for training. To obtain the necessary data, ROSS partnered with LegalEase, which compiled and sold approximately 25,000 Bulk Memos—summaries created by lawyers referencing Westlaw headnotes. Whether the Bulk Memos involved verbatim copying or otherwise infringing copying was an issue in the case that ultimately went against ROSS. Upon discovering that ROSS had used content derived from these headnotes, Thomson Reuters filed a copyright infringement lawsuit. The summary judgment pertains only to a subset of the contested headnotes, leaving broader legal questions unresolved.

The court ruled against ROSS, determining that it had copied 2,243 headnotes and dismissing its various legal defenses, including claims of innocent infringement, copyright misuse, and the merger doctrine.

Ross’s use was not transformative

Judge Bibas ruled that ROSS’s use of Thomson Reuters’ material was commercial and non-transformative, a conclusion that weighed heavily in the publisher’s favor. According to the court, the use did not qualify as transformative because it lacked a distinct purpose or character from Thomson Reuters’ original work.

The court’s conclusion that Ross’s use was not transformative is puzzling, especially given its acknowledgment—while discussing the third fair use factor—that the output of Ross’s system did not replicate Westlaw’s copyrighted headnotes but rather produced uncopyrighted judicial opinions.

The court did distinguish two significant cases, Sega Enterprises Ltd. v. Accolade, Inc. and Sony Computer Entertainment, Inc. v. Connectix Corp. but failed to consider cases like iParadigms, HathiTrust and Google Books. Even the way the court dealt with the reverse engineering cases is a bit suspect. The court sets them aside for two reasons, first because those cases involved copying software code, and second, that such copying was “necessary for competitors to innovate.” To be sure, Oracle v. Google suggests that cases involving software may merit special treatment, but it is not clear why the software context should make a difference here. Judge Bibas’s invocation of necessity is undercooked as well. Whether an act of copying is “necessary” is inextricably tied to the level of generality at which you ask the question. In Oracle v. Google, Google’s replication of Java APIs was essential for compatibility with existing Java programmers, but whether that compatibility was a necessity or luxury depends on the level of generality at which you pose the question. After all, other smartphones ran without making life easy for Java programmers.

Not generative AI, but why?

The judge took care to distinguish this case from generative AI, yet the distinction remains murky. The court stated: “Ross was using Thomson Reuters’s headnotes as AI data to create a legal research tool to compete with Westlaw. It is undisputed that Ross’s AI is not generative AI (AI that writes new content itself).” And later that “Because the AI landscape is changing rapidly, I note for readers that only non-generative AI is before me today.”

But what, exactly, sets this apart from generative AI? More broadly, how does this differ from other cases where nonexpressive uses have been deemed fair use? The opinion offers little guidance. It fails to engage with seemingly comparable precedents, such as plagiarism detection tools, library digitization for text analysis and digital humanities research, or the creation of a book search engine—cases where courts have found fair use.

The closest we get to an explanation of why Ross’s use of the Westlaw headnotes is different to the intermediate copying iParadigms, HathiTrust and Google Books is that Ross merely retrieves and presents judicial opinions in response to user queries. This process, the court observed, closely parallels Westlaw’s own practice of using headnotes and key numbers to identify relevant cases. Consequently, the court concluded that Ross’s use was not transformative, as it primarily served to facilitate the development of a competing legal research tool rather than to add new expression or meaning to the copied material.

Market effect

The court determined that ROSS’s actions impaired Thomson Reuters’ market for legal AI training data, and in its reasoning, the fourth fair use factor carried substantial weight. Without qualification, the opinion echoes Harper & Row’s assertion that the fourth factor “is undoubtedly the single most important element of fair use.” This is problematic. Asserting the absolute primacy of the Fourth factor is obviously in error in light of Campbell, as well as the Court’s more recent decisions in Google v. Oracle and Andy Warhol Foundation. The Court’s contemporary approach to fair use eschews rigid hierarchies among the statutory factors.

That said, the judge’s finding in relation to the fourth factor may not be entirely unreasonable in this case: Ross explicitly intended to compete with Westlaw by creating a viable market alternative. For the court the key fact was that Ross “meant to compete with Westlaw by developing a market substitute.” “And it does not matter whether Thomson Reuters has used the data to train its own legal search tools; the effect on a potential market for AI training data is enough.”

Implications

One district court opinion that barely engages with the relevant caselaw will not change U.S. fair use law overnight, but it will certainly be welcome news for the plaintiffs in the more than 30 ongoing AI copyright cases currently being litigated.

I think what is really going on in this decision is that the judge has confused the first factor with the fourth factor. There is no obvious way to distinguish training on the question and answer memos to develop a model that directly links user questions to the relevant case law from cases involving search engines and plagiarism detection software. The real distinction, if there is one, is that ROSS used Westlaw’s product to create a directly competing product.

Looking at the case this way, the decision might actually be good for the generative AI defendants, in cases like NYT v OpenAI, because there isn’t the same direct competition.

* This is my first quick take on the decision just hours after it was handed down.

* Citation: Thomson Reuters Enter. Ctr. GmbH v. ROSS Intelligence Inc., No. 1:20-cv-613-SB (D. Del. Feb. 11, 2025)

October 3, 2024

Book Review: Nick Seaver, Computing Taste: Algorithms and the Makers of Music Recommendation

(University of Chicago Press, 2022)

In Computing Taste, Nick Seaver provides an ethnographic exploration of the world of music recommendation systems, revealing how algorithms are deeply shaped by the humans who design them. He shows how the algorithms that drive music recommendations are shaped by human judgment, creativity, and cultural assumptions. The data companies collect, the way they construct models, how they intuitively test whether their models are working, and how they define success are all deeply human and subjective choices.

Beyond Man vs. Machine

Seaver points out that textbook definitions describe algorithms as “well-defined computational procedures” that take inputs and generate outputs, portraying them as deterministic and straightforward systems. This narrow view leads to a man-versus-machine narrative that is trite and unilluminating. Treating algorithms as though their defining quality is the absence of human influence reinforces misconceptions about their neutrality. Instead, Seaver advocates for focusing on the sociotechnical arrangements that produce different forms of “humanness and machineness,” echoing observations by Donna Haraway and others.

In practice, algorithmic systems are messy, constantly evolving, and shaped by human judgment. As Seaver notes, “these ‘cultural’ details are technical details,” meaning that the motivations, preferences, and biases of the engineering teams that design algorithms are inseparable from the technical aspects of the systems themselves. Therefore, understanding algorithms requires acknowledging the social and cultural contexts in which they operate.

From Information Overload to Capture

Seaver shows how the objective of recommendation systems has shifted from the founding myth of information overload to the current obsession with capturing user attention. Pioneers of recommender systems told stories of information overload that presented growing consumer choice as a problem in need of a solution. The notion of overwhelming users with too much content has been a central justification for creating algorithms designed to filter and organize information. If users are helpless in the face of vast amounts of data, algorithms become necessary tools to help them navigate this digital landscape. Seaver argues that the framing of overload justifies the control algorithms exert over what users see, hear, and engage with. The idea of “too much music” or “too much content” becomes a convenient rationale for developing systems that, in practice, do more than assist—they guide, constrain, and shape user choices.

In any event, commercial imperatives soon led to rationales based on information overload giving way to narratives of capture. Seaver compares recommender systems to traps designed to “hook” users, analyzing how metrics such as engagement and retention guide the development of algorithms. Seaver traces the evolution of recommender systems from their origins as tools to help users navigate the overwhelming abundance of digital content to their current role in capturing and retaining user attention. The Netflix Prize, a 2006 competition aimed at improving Netflix’s recommendation algorithm, serves as a key example of this shift. Initially, algorithms were designed to help users manage “information overload” by personalizing content based on user preferences, as Netflix sought to predict what users would enjoy. However, Netflix never used the winning entry. As streaming services became central to Netflix’s business model, the focus of recommendation systems shifted from merely helping users find content to keeping them engaged on the platform for as long as possible. This transition from personalization to attention retention shows the shift in the industry’s goals. Recommender systems, including those at Netflix, began to focus on encouraging continuous engagement by suggesting binge-worthy content to maximize viewing hours, implementing autoplay features to keep the next episode or movie rolling without user interaction, and focusing on actual viewing habits (e.g., “skip intro” clicks, time spent on a show, completion rates) rather than ratings to keep users hooked.

Seaver’s perspective is insightful, not unrelentingly critical. The final chapter investigates how the design of recommendation systems reflects the metaphor of a “park”—a managed, curated space that users are guided through. Recommender systems are neither strictly benign nor malign, but they do entail a loss of user agency. We, the listening public, are not trapped animals so much as a managed flock. Seaver recognizes that recommendation systems open up new possibilities for exploration while also constraining user behavior by narrowing choices based on past preferences.

Why Do My Playlists Still Suck?

The book also answers the question that motivated me to read it: why do my playlists still suck? No one has a good model for why we like the music that we like, when we like it, or how that extrapolates to music we haven’t heard yet. And Spotify and other corporate interests have no real interest in solving that puzzle for us. The algorithms that shape our cultural lives now prioritize engagement, rely on past behavior, and reflect a grab bag of assumptions about user preferences that are often in conflict. There is very little upside to offering us fresh or risky suggestions when a loop of familiarity will keep us more reliably engaged.

August 8, 2024December 12, 2024

London Marathon!

On April 27, 2025, I will be running the London Marathon to raise money for research and treatment of pancreatic cancer.

The future is what we make it

My sister Rebecca was diagnosed with pancreatic cancer in 2018. She was as brave as anyone could be, but her battle was short-lived; Becky passed away just weeks after her diagnosis. Her story is typical, pancreatic cancer is almost always fatal because it is diagnosed too late.

But we can change this story.

I am asking you to help me raise money for Pancreatic Cancer UK which is pioneering efforts to develop earlier detection methods to give others the fighting chance that Becky deserved.

There are several different ways you can contribute

(1) donate directly to Pancreatic Cancer UK at https://2025tcslondonmarathon.enthuse.com/pf/matthew-sag (immediate impact)

(2) send me money via venmo that I will then aggregate and convert into GBP and donate in your name (@Matthew-Sag) (minimize foreign transaction fees)

(3) donate to a different pancreatic cancer in your country of choice, send me the details and I’ll match that with a donation to Pancreatic Cancer UK (local impact)

Every contributor gets to add a song to my London Marathon Spotify Playlist!

If you donate for this cause, you can let me know what song I should add to my London Marathon Spotify Playlist. I will set this to shuffle during the race and try to remember who suggested which song

Please join me

Please join me on this journey and donate in Rebecca’s memory and bringing us closer to a future where no more lives are stolen by this devastating disease.

Follow my progress

To see how my training is going, follow me on Strava, check out my google spreadsheet (https://docs.google.com/spreadsheets/d/16hV2e-IxXbo01uM7_X5tGyOz6ONPtSZGJZdjKUeVH0k/edit?usp=sharing) or get narrative updates here on this page (https://2025tcslondonmarathon.enthuse.com/pf/matthew-sag)

Playlist so far …

Queen, Don’t Stop Me Now (my pick)

The Weekend, Blinding Lights (my pick)

Monty Python, Always Look on the Bright Side of Life (Matt & Mindy Lawrence)

The Clash, London Calling (Spencer Waller)

Bruce Springsteen, Born to Run (Richard Fields)

Olivia Newton-John, Xanadu (Jo Groube)

The Beatles, Long and Winding Road (Jan & Andy Sag)

February 21, 2024February 22, 2024

A response to Lee and Grimmelmann

TIM LEE (@binarybits) and JAMES GRIMMELMANN have written an insightful article on “Why The New York Times might win its copyright lawsuit against OpenAI” in Ars Technica and on Tim’s newsletter (https://www.understandingai.org/p/the-ai-community-needs-to-take-copyright).

Quite a few people emailed me asking for my thoughts, so here they are. This is a rough first take that began as a tweet before I realized it was too long.

Yes, we should take the NYT suit seriously

It’s hard to disagree with the bottom-line that copyright poses a significant challenge to copy-reliant AI, just as it has done to previous generations of copy-reliant technologies (reverse engineering, plagiarism detection, search engine indexing, text data mining for statistical analysis of literature, text data mining for book search).

One important insight offered by Tim and James is that building a useful technology that is consistent with some people’s rough sense of fairness, like MP3.com, is no guarantee of fair use. People loved Napster and probably would have loved MP3.com, but these services were essentially jukeboxes competing with record companies’ own distribution models for the exact same content. We could add ReDigi to this list, too. Unlike the copy-reliant technologies listed above, Napster, MP3.com, and ReDigi fell foul of copyright law because they made expressive uses of other people’s expressive works.

Tim and James make another important point, that academic researchers and Silicon Valley types might have got the wrong idea about copyright. Certainly, prior to November 2022 you almost never saw any mention of copyright in papers announcing new breakthroughs in text data mining, machine learning, or generative AI. This is why I wrote “Copyright Safety for Generative AI” (Houston Law Review 2023).

Tim and James’ third insight is that some conduct might be fair use in a small noncommercial scale but not fair use on a large commercial scale. This is right sometimes, but in fact, a lot of fair use scales up quite nicely. 2 Live crew sold millions of copies of their fair use parody of Roy Orbison’s Pretty Woman, and of course, some of the key non-expressive use precedents were all about different versions of text data mining at scale: iParadigms (commercial plagiarism detection), HathiTrust (text mining for statistical analysis of the literature, including machine learning), Google Books (commercial book search).

But how seriously?

I agree with Tim and James that the AI companies’ best fair use arguments will be some version of the non-expressive use argument I outlined in Copyright and Copy-Reliant Technology (2009) and several other papers since, such as The New Legal Landscape for Text Mining and Machine Learning (2019).

In a nutshell, that argument is that a technical process that creates some effectively invisible copies along way but ultimately produces only uncopyrightable facts, abstractions, associations, and styles should be fair use because it does not interfere with the author’s right to communicate her original expression to the public.

I also agree that this argument begins to unravel if generative AI models are in fact memorizing and delivering the underlying original expression from the training data. I don’t think we know enough about the facts to say whether individual examples of memorization are just an obscure bug or endemic problem.

The NYT v. OpenAI litigation will shed some light on this but there is a lot of discovery still to come. My gut feeling is that the NYT’s superficially compelling examples of memorization are actually examples of GPT-4 working as an agent to retrieve information from the Internet. This is still a copyright problem, but it’s a very small, easily fixed, copyright problem, not an existential threat to text data mining research, machine learning, and generative AI.

If the GPT series models are really memorizing and regurgitating vast swaths of NYT content, that is a problem for OpenAI. If pervasive memorization is unavoidable in LLMs, that would be a problem for the entire generative AI industry, but I very much doubt the premise. Avoiding memorization (or reducing to trivial levels) is a hard technical problem in LLMs, but not an impossible one.

Avoiding memorization in image models is more difficult because of the “Snoopy Problem.” Tim and James call this the “Italian plumber problem,” but I named it first and I like Snoopy better.

The Snoopy Problem is that the more abstractly a copyrighted work is protected, the more likely it is that a generative AI model will “copy” it. Text-to-image models are prone to produce potentially infringing works when the same text descriptions are paired with relatively simple images that vary only slightly.

Generative AI models are especially likely to generate images that would infringe on copyrightable characters because characters like Snoopy appear often enough in the training data that the models learn the consistent traits and attributes associated with those names. Deduplication won’t solve this problem because the output can still infringe without closely resembling any particular image from the training data. Some people think this is really a problem with copyright being too loose with characters and morphing into trademark law. Maybe, but I don’t see that changing.

How serious is the Snoopy Problem? Tim and James frame the problem as though they innocently requested a combination of [Nationality] + [Occupation] + “from a video game” and just happened stumble upon repeated images of the world most famous Italian plumber, Mario from Mario Kart.

But of course, a random assortment of “Japanese software developers” “German fashion designers” “Australian novelists” “Kenyan cyclists” “Turkish archaeologists” and a “New Zealand plumber” don’t reveal any such problem. The problem is specific to Mario because he dominates representations of Italian plumbers from video games in the training data.

The Snoopy Problem presents a genuine difficulty for video, image, and multimodal generative AI, but it’s far from an existential threat. Partly, because the class of potential plaintiffs is significantly smaller. There are a lot fewer owners of visual copyrightable characters than there are just plain old copyright owners. And partly because the problem can be addressed in training, by monitoring prompts, or by filtering outputs.

Tim and James’s final point of concern is that the prospect of licensing markets for training data will undermine the case for fair use. Companies building AI models rely on the fact that they are simply scraping training data from the “open Internet,” the argument becomes more persuasive when these companies are more careful to avoid scraping content from sites where they are not welcome.

Respecting existing robots.txt signals and helping to develop more effective ones in the future will facilitate robust licensing markets for entities like the New York Times and the Associated Press.

I don’t think that OpenAI will need to sign a 100 million licensing deals before training its next model. Courts have already considered and rejected the circular argument that copyright owners must be given the right to charge for non-expressive uses to avoid the harm of not being able to charge for non-expressive uses. This specific argument was raised by the Authors Guild in HathiTrust and Google Books and squarely rejected in both.

Tim and James and their note of caution with a note of realism: judges will be reluctant to shut down an innovative and useful service with tens of millions of uses. We saw a similar dynamic when the US Supreme Court held that time shift in using videocassette recorders was fair use.

But there is another element of realism to add. If the US courts reject the idea that non-expressive uses should be fair use, most AI companies will simply move their scraping and training operations overseas to places like Japan, Israel, Singapore, and even the European Union. As long as the models don’t memorize the training data, they can then be hosted in the US without fear of copyright liability.

Tim and James are two of the smartest most insightful people writing about copyright and AI at the moment. The AI community should take them seriously, they should take copyright seriously, but they should not see Snoopy (or the Italian Plumber) as an existential threat.

PS: Updated to correct typos helpfully identified by ChatGPT.

February 12, 2024

Third Annual Legal Scholars Roundtable on Artificial Intelligence 2024,

Call For Papers

Roundtable

Emory Law is proud to host the third annual Legal Scholars Roundtable on Artificial Intelligence. The Roundtable will take place on April 11-12, 2024 at Emory University in Atlanta, Georgia. The Legal Scholars Roundtable on Artificial Intelligence (AI) is designed to be a forum for the discussion of current legal scholarship on AI, covering a range of methodologies, topics, perspectives, and legal intersections.

Format
Participation at the Roundtable will be limited and invitation-only. Participants are expected to read all the papers in advance and be prepared to offer substantive comments. We will try to accommodate a limited number of Zoom-based participants, but in person attendance is strongly preferred.

Applications to present, comment, or participate
We invite applications to participate, to comment, and/or to present from academics working on any topic relating to legal issues in AI. To request to present, you need to submit a substantially complete draft paper. Microsoft word format is strongly preferred for these purposes, but you can submit a pdf version for broader distribution. The deadline for submission is February 23, 2024, and decisions on participation will be made shortly thereafter, ideally, by March 4, 2024. If selected, final manuscripts are due April 1, 2024, to permit all participants an opportunity to read the papers prior to the conference.

To apply to participate, comment, or present, please fill out the google form:(https://forms.gle/Ubv2maLWfMK5tbPs8).

What to expect from the Legal Scholars Roundtable on Artificial Intelligence
The Legal Scholars Roundtable on Artificial Intelligence is a forum for the discussion of current legal scholarship on AI, spanning a range of methodologies, topics, perspectives, and legal intersections. Authors who present at the Roundtable will be selected from a competitive application process, and commentators are assigned based on their expertise. Participants will have an opportunity to provide direct feedback in paper sessions and will have access to draft papers but will be asked not to post papers publicly or share without author permission. Robust sessions involve energetic feedback from other paper authors, commentators, and participants. Our goal is to ensure all authors have the full participation of all workshop participants in each author’s session.

Essential logistics
The Roundtable will be held in person on the Emory campus in Atlanta, Georgia. The conference will begin on Thursday morning and run until 1PM on Friday. You can expect to be at the Atlanta airport by 1:45 PM, in time for a 2:30 PM flight or later on Friday. We will pay for your reasonable (economy) travel and accommodation expenses within the U.S. At the roundtable you will be well fed and caffeinated.

Organizers
Matthew Sag, Professor of Law in Artificial Intelligence, Machine Learning, and Data Science at Emory University Law School (msag@emory.edu)
Charlotte Tschider, Associate Professor at Loyola Law Chicago (ctschider@luc.edu)

Emory Law’s Commitment to AI
Emory University recognizes that artificial intelligence (AI) is a transformative technology that is already reshaping almost every aspect of our lives. Through its AI.Humanity initiative, Emory is building capacity in key areas of AI research and policy, including health care, medical research, business, law, and the humanities.

July 24, 2023

ABA Webinar on Scraping/Mining Public-Facing Information for Generative AI

Thanks for everyone involved in the ABA Section on IP Lawwebinar toady (July 24, 2023). As promised, here is a link to my slides.

July 24, 2023

My testimony to the US Senate Judiciary Subcommittee on IP re: Copyright and AI

I had the great honor of testifying to the US Senate Judiciary Subcommittee on Intellectual Property in relation to Artificial Intelligence Copyright on Wednesday, July 12th, 2023.

Video and my written submission are available here: https://www.judiciary.senate.gov/artificial-intelligence-and-intellectual-property_part-ii-copyright and I have also linked to written statement here in case that other link is unavailable.

In my testimony I explained that although we are still a long way from the science fiction version of artificial general intelligence that thinks, feels, and refuses to “open the pod bay doors”, recent advances in machine learning AI raise significant issues for copyright law.

I explained why copyright law does not, and should not, recognize computer systems as authors and why training generative AI on copyrighted works is usually fair use because it falls into the category of non-expressive.

For more on copyright and generative AI, read Matthew Sag, Copyright Safety for Generative AI (Houston Law Review, Forthcoming) (https://ssrn.com/abstract=4438593)

June 1, 2023

Law School Academic Impact Rankings, with FLAIR

Cross-posted with Prawfsblog

I am pleased to announce the release of the Forward-Looking Academic Impact Rankings (FLAIR) for US law schools for 2023. I began this project two years ago because of my intense frustration that my law faculty (Loyola Chicago, at the time) had yet again been left out of the Sisk Rankings. The project has evolved and matured since then, and the design of the FLAIR rankings owes a great deal to debates that I have had with Prof. Gregory Sisk, partly in public, but mostly in private.

You can download the full draft paper from SSRN or wait for it to come out in the Florida State University Law Review.

How do the FLAIR rankings work?

I combined individual five-year citation data from HeinOnline with faculty lists scraped directly from almost 200 Law school websites to calculate the mean and median five-year citation numbers for every ABA accredited law school. Yes, that was a lot of work. Based on faculty websites, hiring announcements, and other data sources, I excluded assistant professors and faculty who began their tenure-track career in 2017 or later. I also limited the focus to what is traditionally considered to be the “doctrinal” faculty. The paper provides more details and the rationales for both of these decisions.

How do the FLAIR rankings compare to other law school rankings?

Among their many flaws, the U.S News law school rankings rely on poorly designed, highly subjective surveys to gauge “reputational strength,” rather than looking to easily available, objective citation data that is more valid and reliable. Would-be usurpers of U.S. News use better data but make other arbitrary choices that limit and distort their rankings. One flaw common to U.S. News and those who would displace it is the fetishization of minor differences in placement that do not reflect actual differences in substance. In my view, this information is worse than trivial: it is actively misleading.

The FLAIR rankings use objective citation data that is more valid and reliable than the U.S. News surveys, and unlike the Sisk rankings, FLAIR gives every ABA accredited law school a chance have the work of its faculty considered. Obviously, it is much fairer to assess every school rather than arbitrarily excluding some based on an intuition (a demonstrably faulty intuition at that) that particular schools have no chance to ranking the top X%. Well, it’s obvious to me at least. But perhaps more importantly, looking out all the data gives us a valid context to assess individual data points. The FLAIR rankings are designed to convey relevant distinctions without placing undue emphasis on minor differences in rank that are substantively unimportant. This goes against the horserace mentality that drives so much interest in U.S. News, but I’m not here to sell anything.

What are the relevant distinctions?

The FLAIR rankings assign law faculties to four separate tiers based on how their mean and median five-year citation counts compared to the standard deviation of the means and mediums of all faculties. Tier 1 is made up of those faculties that are more than one standard deviation above the mean, Tier 2 is between zero and one standard deviations above the mean, Tier 3 ranges from the mean to half a standard deviation below, and Tier 4 includes all of the schools more than half a standard deviation below the mean. In other words, Tier 1 schools are exceptional, Tier 2 schools are above average, Tier 3 are below average, and Tier 4 are well-below average.

The figure below illustrates a boxplot for the distribution of citation counts for each tier. (There is a more complete explanation in the paper, but essentially, the middle of the boxplot is the median, the box around the median is the middle 50%, and the “whiskers” at either and are the lowest/highest 25%.) The boxplot figure below illustrates the substantial differences between the tiers, but it also underscores that there is nonetheless considerable overlap between tiers.

The FLAIR rankings

The next figure focuses on Tier 1. The FLAIR rank for each school is indicated in parentheses. The boxplot next to each school’s name indicates the distribution of citations for each doctrinal faculty member within that school.

Readers who pay close attention to the U.S. News rankings will note that the top tier consists of 23 schools, not the much vaunted “T14”. The T14 is a meaningless category; it does not reflect any current empirical reality or any substantial differences between the 14^th and 15^th rank. Attentive readers will also note that several schools well outside of the (hopefully now discredited concept of the) T14—namely U.C. Irvine, U.C. Davis, Emory, William & Mary, and George Washington—are in the top tier. These schools’ academic impact outpaces their overall U.S. News rankings significantly. U.C. Davis outperforms its U.S. News ranking by 42 places!

Looking at the top tier of the FLAIR rankings as visualized in the figure above also illustrates how misleading ordinal differences in ranking can be. There is very little difference between Virginia, Vanderbilt, and the University of Pennsylvania in terms of academic impact. The medians and the general distribution of each of these faculties are quite similar. And thus we can conclude that differences between ranks 6 and 8 are unimportant and that it is not news if Virginia “drops” to 8^th or Pennsylvania rises to 6^th in the FLAIR rankings, or indeed in the U.S. News rankings.

The differences that matter, and those that don’t

In the Olympics, third place is a bronze medal, and fourth place is nothing; but there are no medals in the legal academy and there is no difference in academic impact between third and fourth that is worth talking about. Minor differences in placement rarely correspond to differences in substance. Accordingly, rather than emphasizing largely irrelevant ordinal comparisons between schools only a few places apart, what we should really focus on is which tier in the rankings a school belongs to. Moreover, even when a difference in ranking suggests that there is a genuine difference in the overall academic impact of one faculty versus another, those aggregate differences say very little about the academic impact of individual faculty members. There is a lot of variation within faculties!

Objections to quantification

Many readers will object to any attempt to quantify academic impact, or to the use of data from HeinOnline specifically. Some of these objections make sense in relation to assessing individuals, but I don’t think that any of them retain much force when applied to assessing faculties as a whole. If we are really interested in the impact of individual scholars, we need to assess a broad range of objective evidence in context; that context comes from reading their work and understanding the field as whole. In contrast, no one could be expected to read the works of an entire faculty to get a sense of its academic influence. Indeed, citation counts, or other similarly reductive measures are the only feasible way to make between-faculty comparisons with any degree of rigor. What is more, aggregating the data at the faculty level reduces the impact of individual distortions, much like a mutual fund reduces the volatility associated with individual stocks.

One thing I should be very clear about is that academic impact is not the same thing as quality or merit. This is important because, although I think that the data can be an important tool for overcoming bias, I also need to acknowledge that citation counts will reflect the structural inequalities that pervade the legal academy. A glance at the most common first names among law school doctrinal faculty in the United States is illustrative. In order of frequency, the 15 most common first names are Michael, David, John, Robert, Richard, James, Mark, Daniel, William, Stephen, Paul, Christopher, Thomas, Andrew, and Susan. It should be immediately apparent that this group is more male and probably a lot whiter than a random sample of the U.S. population would predict. As I said, citation counts are a measure of impact, not merit. This is not a problem with citation counts as such, qualitative assessments and reputational surveys suffer the same problem. There is no objective way to assess what the academic impact of individuals or faculties would be in an alternative universe free from racism, sexism, and ableism. A better system of ranking the academic impact of law faculties will more accurately reflect the world we live in, that increased accuracy might help make the world better at the margins, but it won’t do much to fix underlying structural inequalities.

Corrections and updates

Several schools took the opportunity to email me with corrections or updates to their faculty lists in the past three months. If I receive other corrections that might meaningfully change the rankings, I will post a revised version.