copyright – Matthew Sag

February 11, 2025

Thomson Reuters v. ROSS Intelligence (Summary Judgement)

In a closely watched decision revising a previous summary judgment, Judge Stephanos Bibas, a Third Circuit judge sitting by designation, sided largely with Thomson Reuters in its copyright dispute against ROSS Intelligence. The ruling granted partial summary judgment on direct copyright infringement claims while dismissing ROSS’s argument that its use of Thomson Reuters’ content qualified as fair use.

With Ross Intelligence now bankrupt and the technology at issue a decidedly niche application, attention is shifting to the broader implications for AI training and the use of copyrighted materials—particularly in the realm of generative AI. Earlier, Judge Bibas had refused to grant summary judgment on fair use, insisting the matter be put before a jury. However, upon further reflection, he reversed course, ultimately rejecting the defendant’s fair use defense outright.

Background

Thomson Reuters, the owner of Westlaw, accused the AI-driven legal research firm ROSS of copyright infringement, alleging that it had improperly used legal summaries—so-called Bulk Memos—derived from Westlaw’s editorial materials, particularly its headnotes, to train its technology. Thomson Reuters had refused to license its content to ROSS, a rival developing an AI-powered legal research tool requiring a database of legal questions and answers for training. To obtain the necessary data, ROSS partnered with LegalEase, which compiled and sold approximately 25,000 Bulk Memos—summaries created by lawyers referencing Westlaw headnotes. Whether the Bulk Memos involved verbatim copying or otherwise infringing copying was an issue in the case that ultimately went against ROSS. Upon discovering that ROSS had used content derived from these headnotes, Thomson Reuters filed a copyright infringement lawsuit. The summary judgment pertains only to a subset of the contested headnotes, leaving broader legal questions unresolved.

The court ruled against ROSS, determining that it had copied 2,243 headnotes and dismissing its various legal defenses, including claims of innocent infringement, copyright misuse, and the merger doctrine.

Ross’s use was not transformative

Judge Bibas ruled that ROSS’s use of Thomson Reuters’ material was commercial and non-transformative, a conclusion that weighed heavily in the publisher’s favor. According to the court, the use did not qualify as transformative because it lacked a distinct purpose or character from Thomson Reuters’ original work.

The court’s conclusion that Ross’s use was not transformative is puzzling, especially given its acknowledgment—while discussing the third fair use factor—that the output of Ross’s system did not replicate Westlaw’s copyrighted headnotes but rather produced uncopyrighted judicial opinions.

The court did distinguish two significant cases, Sega Enterprises Ltd. v. Accolade, Inc. and Sony Computer Entertainment, Inc. v. Connectix Corp. but failed to consider cases like iParadigms, HathiTrust and Google Books. Even the way the court dealt with the reverse engineering cases is a bit suspect. The court sets them aside for two reasons, first because those cases involved copying software code, and second, that such copying was “necessary for competitors to innovate.” To be sure, Oracle v. Google suggests that cases involving software may merit special treatment, but it is not clear why the software context should make a difference here. Judge Bibas’s invocation of necessity is undercooked as well. Whether an act of copying is “necessary” is inextricably tied to the level of generality at which you ask the question. In Oracle v. Google, Google’s replication of Java APIs was essential for compatibility with existing Java programmers, but whether that compatibility was a necessity or luxury depends on the level of generality at which you pose the question. After all, other smartphones ran without making life easy for Java programmers.

Not generative AI, but why?

The judge took care to distinguish this case from generative AI, yet the distinction remains murky. The court stated: “Ross was using Thomson Reuters’s headnotes as AI data to create a legal research tool to compete with Westlaw. It is undisputed that Ross’s AI is not generative AI (AI that writes new content itself).” And later that “Because the AI landscape is changing rapidly, I note for readers that only non-generative AI is before me today.”

But what, exactly, sets this apart from generative AI? More broadly, how does this differ from other cases where nonexpressive uses have been deemed fair use? The opinion offers little guidance. It fails to engage with seemingly comparable precedents, such as plagiarism detection tools, library digitization for text analysis and digital humanities research, or the creation of a book search engine—cases where courts have found fair use.

The closest we get to an explanation of why Ross’s use of the Westlaw headnotes is different to the intermediate copying iParadigms, HathiTrust and Google Books is that Ross merely retrieves and presents judicial opinions in response to user queries. This process, the court observed, closely parallels Westlaw’s own practice of using headnotes and key numbers to identify relevant cases. Consequently, the court concluded that Ross’s use was not transformative, as it primarily served to facilitate the development of a competing legal research tool rather than to add new expression or meaning to the copied material.

Market effect

The court determined that ROSS’s actions impaired Thomson Reuters’ market for legal AI training data, and in its reasoning, the fourth fair use factor carried substantial weight. Without qualification, the opinion echoes Harper & Row’s assertion that the fourth factor “is undoubtedly the single most important element of fair use.” This is problematic. Asserting the absolute primacy of the Fourth factor is obviously in error in light of Campbell, as well as the Court’s more recent decisions in Google v. Oracle and Andy Warhol Foundation. The Court’s contemporary approach to fair use eschews rigid hierarchies among the statutory factors.

That said, the judge’s finding in relation to the fourth factor may not be entirely unreasonable in this case: Ross explicitly intended to compete with Westlaw by creating a viable market alternative. For the court the key fact was that Ross “meant to compete with Westlaw by developing a market substitute.” “And it does not matter whether Thomson Reuters has used the data to train its own legal search tools; the effect on a potential market for AI training data is enough.”

Implications

One district court opinion that barely engages with the relevant caselaw will not change U.S. fair use law overnight, but it will certainly be welcome news for the plaintiffs in the more than 30 ongoing AI copyright cases currently being litigated.

I think what is really going on in this decision is that the judge has confused the first factor with the fourth factor. There is no obvious way to distinguish training on the question and answer memos to develop a model that directly links user questions to the relevant case law from cases involving search engines and plagiarism detection software. The real distinction, if there is one, is that ROSS used Westlaw’s product to create a directly competing product.

Looking at the case this way, the decision might actually be good for the generative AI defendants, in cases like NYT v OpenAI, because there isn’t the same direct competition.

* This is my first quick take on the decision just hours after it was handed down.

* Citation: Thomson Reuters Enter. Ctr. GmbH v. ROSS Intelligence Inc., No. 1:20-cv-613-SB (D. Del. Feb. 11, 2025)

February 21, 2024February 22, 2024

A response to Lee and Grimmelmann

TIM LEE (@binarybits) and JAMES GRIMMELMANN have written an insightful article on “Why The New York Times might win its copyright lawsuit against OpenAI” in Ars Technica and on Tim’s newsletter (https://www.understandingai.org/p/the-ai-community-needs-to-take-copyright).

Quite a few people emailed me asking for my thoughts, so here they are. This is a rough first take that began as a tweet before I realized it was too long.

Yes, we should take the NYT suit seriously

It’s hard to disagree with the bottom-line that copyright poses a significant challenge to copy-reliant AI, just as it has done to previous generations of copy-reliant technologies (reverse engineering, plagiarism detection, search engine indexing, text data mining for statistical analysis of literature, text data mining for book search).

One important insight offered by Tim and James is that building a useful technology that is consistent with some people’s rough sense of fairness, like MP3.com, is no guarantee of fair use. People loved Napster and probably would have loved MP3.com, but these services were essentially jukeboxes competing with record companies’ own distribution models for the exact same content. We could add ReDigi to this list, too. Unlike the copy-reliant technologies listed above, Napster, MP3.com, and ReDigi fell foul of copyright law because they made expressive uses of other people’s expressive works.

Tim and James make another important point, that academic researchers and Silicon Valley types might have got the wrong idea about copyright. Certainly, prior to November 2022 you almost never saw any mention of copyright in papers announcing new breakthroughs in text data mining, machine learning, or generative AI. This is why I wrote “Copyright Safety for Generative AI” (Houston Law Review 2023).

Tim and James’ third insight is that some conduct might be fair use in a small noncommercial scale but not fair use on a large commercial scale. This is right sometimes, but in fact, a lot of fair use scales up quite nicely. 2 Live crew sold millions of copies of their fair use parody of Roy Orbison’s Pretty Woman, and of course, some of the key non-expressive use precedents were all about different versions of text data mining at scale: iParadigms (commercial plagiarism detection), HathiTrust (text mining for statistical analysis of the literature, including machine learning), Google Books (commercial book search).

But how seriously?

I agree with Tim and James that the AI companies’ best fair use arguments will be some version of the non-expressive use argument I outlined in Copyright and Copy-Reliant Technology (2009) and several other papers since, such as The New Legal Landscape for Text Mining and Machine Learning (2019).

In a nutshell, that argument is that a technical process that creates some effectively invisible copies along way but ultimately produces only uncopyrightable facts, abstractions, associations, and styles should be fair use because it does not interfere with the author’s right to communicate her original expression to the public.

I also agree that this argument begins to unravel if generative AI models are in fact memorizing and delivering the underlying original expression from the training data. I don’t think we know enough about the facts to say whether individual examples of memorization are just an obscure bug or endemic problem.

The NYT v. OpenAI litigation will shed some light on this but there is a lot of discovery still to come. My gut feeling is that the NYT’s superficially compelling examples of memorization are actually examples of GPT-4 working as an agent to retrieve information from the Internet. This is still a copyright problem, but it’s a very small, easily fixed, copyright problem, not an existential threat to text data mining research, machine learning, and generative AI.

If the GPT series models are really memorizing and regurgitating vast swaths of NYT content, that is a problem for OpenAI. If pervasive memorization is unavoidable in LLMs, that would be a problem for the entire generative AI industry, but I very much doubt the premise. Avoiding memorization (or reducing to trivial levels) is a hard technical problem in LLMs, but not an impossible one.

Avoiding memorization in image models is more difficult because of the “Snoopy Problem.” Tim and James call this the “Italian plumber problem,” but I named it first and I like Snoopy better.

The Snoopy Problem is that the more abstractly a copyrighted work is protected, the more likely it is that a generative AI model will “copy” it. Text-to-image models are prone to produce potentially infringing works when the same text descriptions are paired with relatively simple images that vary only slightly.

Generative AI models are especially likely to generate images that would infringe on copyrightable characters because characters like Snoopy appear often enough in the training data that the models learn the consistent traits and attributes associated with those names. Deduplication won’t solve this problem because the output can still infringe without closely resembling any particular image from the training data. Some people think this is really a problem with copyright being too loose with characters and morphing into trademark law. Maybe, but I don’t see that changing.

How serious is the Snoopy Problem? Tim and James frame the problem as though they innocently requested a combination of [Nationality] + [Occupation] + “from a video game” and just happened stumble upon repeated images of the world most famous Italian plumber, Mario from Mario Kart.

But of course, a random assortment of “Japanese software developers” “German fashion designers” “Australian novelists” “Kenyan cyclists” “Turkish archaeologists” and a “New Zealand plumber” don’t reveal any such problem. The problem is specific to Mario because he dominates representations of Italian plumbers from video games in the training data.

The Snoopy Problem presents a genuine difficulty for video, image, and multimodal generative AI, but it’s far from an existential threat. Partly, because the class of potential plaintiffs is significantly smaller. There are a lot fewer owners of visual copyrightable characters than there are just plain old copyright owners. And partly because the problem can be addressed in training, by monitoring prompts, or by filtering outputs.

Tim and James’s final point of concern is that the prospect of licensing markets for training data will undermine the case for fair use. Companies building AI models rely on the fact that they are simply scraping training data from the “open Internet,” the argument becomes more persuasive when these companies are more careful to avoid scraping content from sites where they are not welcome.

Respecting existing robots.txt signals and helping to develop more effective ones in the future will facilitate robust licensing markets for entities like the New York Times and the Associated Press.

I don’t think that OpenAI will need to sign a 100 million licensing deals before training its next model. Courts have already considered and rejected the circular argument that copyright owners must be given the right to charge for non-expressive uses to avoid the harm of not being able to charge for non-expressive uses. This specific argument was raised by the Authors Guild in HathiTrust and Google Books and squarely rejected in both.

Tim and James and their note of caution with a note of realism: judges will be reluctant to shut down an innovative and useful service with tens of millions of uses. We saw a similar dynamic when the US Supreme Court held that time shift in using videocassette recorders was fair use.

But there is another element of realism to add. If the US courts reject the idea that non-expressive uses should be fair use, most AI companies will simply move their scraping and training operations overseas to places like Japan, Israel, Singapore, and even the European Union. As long as the models don’t memorize the training data, they can then be hosted in the US without fear of copyright liability.

Tim and James are two of the smartest most insightful people writing about copyright and AI at the moment. The AI community should take them seriously, they should take copyright seriously, but they should not see Snoopy (or the Italian Plumber) as an existential threat.

PS: Updated to correct typos helpfully identified by ChatGPT.

July 24, 2023

ABA Webinar on Scraping/Mining Public-Facing Information for Generative AI

Thanks for everyone involved in the ABA Section on IP Lawwebinar toady (July 24, 2023). As promised, here is a link to my slides.

July 24, 2023

My testimony to the US Senate Judiciary Subcommittee on IP re: Copyright and AI

I had the great honor of testifying to the US Senate Judiciary Subcommittee on Intellectual Property in relation to Artificial Intelligence Copyright on Wednesday, July 12th, 2023.

Video and my written submission are available here: https://www.judiciary.senate.gov/artificial-intelligence-and-intellectual-property_part-ii-copyright and I have also linked to written statement here in case that other link is unavailable.

In my testimony I explained that although we are still a long way from the science fiction version of artificial general intelligence that thinks, feels, and refuses to “open the pod bay doors”, recent advances in machine learning AI raise significant issues for copyright law.

I explained why copyright law does not, and should not, recognize computer systems as authors and why training generative AI on copyrighted works is usually fair use because it falls into the category of non-expressive.

For more on copyright and generative AI, read Matthew Sag, Copyright Safety for Generative AI (Houston Law Review, Forthcoming) (https://ssrn.com/abstract=4438593)

October 17, 2022October 17, 2022

Lessons for Empirical Studies of Copyright Litigation … A Case Study of Copyright Injunctions

This morning I presented Lessons for Empirical Studies of Copyright Litigation … A Case Study of Copyright Injunctions, CREATe@10 – Copyright Evidence: Synthesis and Futures, University of Glasgow October 17, 2022.

For those who missed the slides, here they are!

The presentation is based on Matthew Sag and Pamela Samuelson, Discovering eBay’s Impact on Copyright Injunctions Through Empirical Evidence forthcoming in the William & Mary Law Review 2023 ( https://ssrn.com/abstract=3898460)

August 20, 2019August 21, 2019

So, you got a copyright infringement demand letter from Higbee & Associates?

Some context

In 2018 Jake Haskell and I published an article called “Defense Against the Dark Arts of Copyright Trolling” in the Iowa Law Review. The article focused on BitTorrent related litigation that accounted for roughly half of all copyright cases filed in the United States at the time. As we described in the article, in the typical BitTorrent case,

“the plaintiff’s claims of infringement rely on a poorly substantiated form pleading and are targeted indiscriminately at non-infringers as well as infringers. This practice is a subset of the broader problem of opportunistic litigation, but it persists due to certain unique features of copyright law and the technical complexity of Internet technology. The plaintiffs bringing these cases target hundreds or thousands of defendants nationwide and seek quick settlements priced just low enough that it is less expensive for the defendant to pay rather than to defend the claim, regardless of the claim’s merits.”

Given my interest in this topic, I get a lot of emails and phone calls asking about another high volume copyright plaintiff’s lawyer, Higbee & Associates.

I am writing this post so that people have something to go on without waiting for a response from me (which can often take a while, sorry).

Is Higbee & Associates a copyright troll?

Some people call Higbee & Associates (or the clients they represent) copyright trolls. Certainly, they seem more interested in monetizing infringement than simply stopping it. After all, they could use DMCA takedowns in most of these cases and it would be just as effective.

Fair point, but even if they are looking primarily to the rewards of the courthouse rather than the market place, they would no doubt respond that litigation is required to make people understand that photography is not free for the taking. The performing rights organization, ASCAP, files a lot of lawsuits for exactly this reason.

So, in terms of motive, the copyright troll label might not be a great fit, what about methods?

Higbee & Associates are a little different to the copyright trolls Jake and I discussed in Defense Against the Dark Arts of Copyright Trolling. As far as I know, they don’t make a habit of go after obvious non-infringers. Although they don’t seem to recognize many potential fair use arguments either. Also they don’t appear to rely on dodgy technology or bogus experts to make their case — a feature that is endemic of in the BitTorrent litigation.

However, Higbee does seem to send a lot of out letters of demand without much underlying depth. These letters often fail to provide a copyright registration. They often claim to represent a copyright owner who is not the author without evidencing any assignment of rights. You don’t need a registration to make a demand, but you absolutely need one to file a claim in federal court and to get statutory damages. So that seems a bit odd. Not connecting the dots between the person who took the photo and the client they say they represent is also a bit odd.

Moreover, the copyright troll label certainly fits with the sense of being ambushed that many defendants experience. I hear from a lot of these recipients. Receiving a letter from Higbee & Associates feels like an ambush because so many people don’t really understand how copyright works. It also feels like an ambush because the settlement amounts Higbee & Associates demand in a typical letter don’t seem to reflect the value of the underlying work.

Instead of demanding some multiple of the standard license fee for the work in question, Higbee will demand a settlement amount based on what they could get in court under copyright’s rather imprecise statutory damages rules. Which makes their oft noted failure to provide proof of registration even more interesting.

Assuming the work was registered at the relevant time, the prevailing plaintiff in copyright litigation can get statutory damages in the range of $750 to $150,000 per work infringed, regardless of the amount of actual damage. This is a pretty terrifying prospect for most accused infringers. But it gets worse. The real kicker is that if you fight the infringement accusation and lose, you risk just adding to your pain because if they are the prevailing party, the plaintiff has a good chance of getting their attorneys fees as well as statutory damages!

So, what to do?

Step one: figure out whether you have a good story to tell on the merits

You might have a case on the merits. Here are some examples:

you paid for a license to use the photo (or you thought you did);
you made fair use of the photo by using it as the foundation for commentary, parody or criticism (if you made changes to the photo that reinforce this transformative purpose, the merits of your fair use defense will be even clearer);
the party Higbee & Associates represents does not actually own the photo;
the photo was not registered with the U.S. Copyright Office before you started using it;
you didn’t post the photo, one of your users did it. This gets complicated. You might be covered by the DMCA, but only if you jump through the right hoops including registering an agent with the Copyright Office every three years. If you are not covered by the DMCA, you still might not be responsible for infringing acts by your users, it depends on a number of issues too detailed to summarized here.

Arguments on the merits that won’t help:

you didn’t post the photo, one of your employees did — sorry, you are responsible for your employees in a case like this.
you didn’t know the photo was copyrighted — this doesn’t help as much as you might think.
you thought that photos on the Internet were in the public domain — they aren’t.
you were not making a profit on your website — this doesn’t help as much as you might think.

Step two: ask for more information

Request copy of copyright registration, the deposit material that accompanied application, and documents sufficient to show Higbee is authorized by copyright owner to act as agent.

Explain that any settlement you agree to will have to contain proposed settlement a warranty that Higbee is the duly authorized agent of the copyright owner, that their client owns the copyright asserted, and that such copyright is valid. If they won’t do this, why not?

Step two: If you realize now that you might have been infringing the photographer’s copyright

Take down the photo and audit the rest of the images on your website.
If the work was unregistered. Do what your conscience tells you is right. The reality is that it is not worthwhile for them to take this case to court unless they can show actual damages of more than a few hundred dollars.
If the work was registered and they actually represent the copyright owner, make a reasonable settlement offer.
- What’s a reasonable offer? Based on the cases I have seen, probably, $1000 and go up to $1250 but your individual facts may vary.
If the plaintiff won’t settle, don’t contest every point in the litigation. Instead try to keep everyone’s costs as low as possible; make an “offer of judgment” and hope that you get a reasonable judge who can see that there is no virtue in awarding more than $750 minimum in statutory damages. If you make this strategy clear to them, they should agree to a reasonable offer and move on to their next target.

Do you need a lawyer?

Probably, yes.

You could try to settle (or tell them to take a hike) by yourself, but without a lawyer representing you it’s hard to know how to respond to the arguments that the Higbee are going to throw back.

If you need a referral to a lawyer with experience in these matters, I can try to provide one. I don’t handle these cases myself. You should also know that because I am not your lawyer, any emails you send me are not going to be protected by attorney client privilege.

Good luck.

August 9, 2019August 9, 2019

The missing theory of transformative use

Today I will be presenting on “The Missing Theory of Transformative Use” at the Intellectual Property Scholars Conference at DePaul University in Chicago. My presentation is basically a distillation of the first three chapters of a book I am writing on the modern law of fair use.

November 15, 2017

NAFTA must include fair use commitments

I joined with over seventy international copyright law experts today in calling for NAFTA and other trade negotiators to support a set of balanced copyright principles.

Policies like fair use, online safe harbors, and other exceptions and limitations to copyright permit and encourage access to knowledge, flourishing creativity, and innovation.

The following copyright principles are essential to ensure consumers’ digital rights. Copyright law should:

Protect and promote copyright balance, including fair use
Provide technology-enabling exceptions, such as for search engines and text- and data-mining
Include safe harbor provisions to protect online platforms from users’ infringement
Ensure legitimate exceptions for anti-circumvention, such as documentary filmmaking, cybersecurity research, and allowing assistive reading technologies for the blind
Adhere to existing multilateral commitments on copyright term
Guarantee proportionality and due process in copyright enforcement

Measuring the value of copyright and the value of copyright exceptions is methodologically challenging, but if we use the same criteria that WIPO adopts to estimate the value of copyright, then in the U.S., fair use industries represent 16% of annual GDP and employ 18 million American workers.

The Washington Principles on Copyright Balance in Trade Agreements and the new research on Measuring the Impact of Copyright Balance are located at http://infojustice.org/flexible-use

May 25, 2017

Text Mining, Non-Expressive Use and the Technological Advantage of Fair Use

On March 29, 2017, I attended a fantastic conference on “Globalizing Fair Use: Exploring the Diffusion of General, Open and Flexible Exceptions in Copyright Law” hosted by American University Washington College of Law’s Program and Information Justice and Intellectual Property. As part of that event we held a webcast Q&A session moderated by Sasha Moss of the R Street Institute. The following is rough transcript of my comments in response to Sasha’s questions about the legality of the non-expressive use copyrighted works.

Copyright Questions For the Digital Age

There is no country in the world where simply reading a book and giving someone information about the book, such its subject or themes, whether it uses particular words or particular combinations of words, the number of words, the number of pages, the ratio of female to male pronouns, etc., would amount to copyright infringement.

Why? Because information about the book is not the book. It is metadata. The question for the digital age is, “Can we use computers to produce that kind of data?” This question is important because although I can read a few books and produce some useful metadata, I can’t read a million books. But a computer can.

We have the technology

We have the technology to digitize large collections of books in order to produce data that enables computer scientists, linguists, historians, English professors, and the like, to answer important research questions. The data and the questions it can be used to answer do nothing to communicate the original expression of all those millions of books. However, technically speaking, this kind of digitalization is still copying.

But is this the kind of copying that copyright law should be concerned about? If a tree falls in an empty forest, does it truly make a sound? If something is copied but only read by a computer and the computer only communicates metadata about the work, is that the kind of copying this should amount to copyright infringement?

Text mining is vital for machine learning, automatic translation, and developing the language models

It seems to me, that once you phrase the question that way the answer is clear. We all use this amazing technology on a daily basis when we rely on Internet search engines, but text mining use is about much more than this. By data mining vast quantities of scientific papers, researchers have been able to identify new treatments for diseases. Text mining has also allowed humanities scholars to identify patterns in vast libraries of literature. Text mining is vital for machine learning, automatic translation, and developing the language models the power dictation software.

Fair use and technological advantage

The United States is a world leader in various applications of text mining, starting with Internet search, but going far beyond that. In the United States, once people realized what was possible they more or less start doing it. If Larry Page and Sergy Brin had had the idea for the Google Internet search engine in Canada, Australia, England, or Germany in the 1990s it would have been crystal-clear that because their search engine relied on making copies of other people’s HTML webpages and there was no realistic way to obtain permission from all those people, building search engine would be illegal. In countries with a closed list of copyright exceptions and limitations, or with fair dealing provisions that are tied to specific narrowly defined purposes, a lawyer would have looked at the list and said, “I don’t see Internet search or data mining on that list, so you can’t do it.”

The fair use doctrine reinforces copyright rather than negating it

In the United States, we have the fair use doctrine, which means that the list is not closed. In the United States, the fair use doctrine means you at least get a chance to explain why your particular use of a copyrighted work is for a purpose that promotes the goals of copyright, is reasonable in light of that purpose, and is unlikely to harm the interests of copyright owners. The fair use doctrine reinforces copyright rather than negating it; fair use doesn’t mean that you get to do whatever you want. Fair use is a system for determining how copyright should apply in new situations. That is especially important whether the law was written decades ago and society and technology are changing fast.

Without something like fair use, other countries can only follow the United States

Without something like fair use, other countries can only follow the United States. Non-expressive uses of copyrighted works such as text mining, building an Internet search engine, or running plagiarism detection software have all been held to be fair use in the United States and are slowly becoming more accepted around the world. Of course, now that it is readily apparent that these activities are immensely beneficial and entirely non-prejudicial to the interests of copyright owners we could probably write some specific amendments to the copyright act to make them legal. The problem is, do we didn’t know this two decades ago when we actually needed those rules. I don’t know what the next thing that we don’t know is, but I do know that experience has shown that the flexibility of the fair use doctrine—which has been part of copyright law virtually since the English Statute of Anne in 1710, by the way—has worked better than a system of closed lists.

The fair use doctrine is a real source of competitive advantage for technologists and academic researchers in the United States. Right now, there are technologies being developed and research being done in the United States that either can’t be done in other countries, or can only be done by particular people subject to various arbitrary restrictions. Whether it’s Internet search, digital humanities research, machine learning or cloud computing, other countries have followed the United States in adopting technologies that make non-expressive use of copyrighted works, because some of the copyright risks begin to look less daunting once the practice has become accepted. The Europeans, for example, are pretty sure building a search engine must be legal, but they can’t quite agree why. But the thing to understand is that you can follow this way but you can never lead. It’s much harder to do the new thing if by the letter of the law it is illegal and you have no forum to argue that it should be allowed.

The future doesn’t have a lobby group

Of course, that’s not quite true, you have one forum … you can spend a vast amount of money are lobbyists and go to the government, go to Congress and try to get some favorable rules written. But even if that is successful from time to time, those rules have a particular character. A company that spends millions of dollars on a lobbying campaign to change the law is always going to try and make sure that those new rules only benefit its business. Special interests will get some laws changed, but usually in ways that disadvantage their competitors or exclude alternative technologies that might one day compete with them. The fundamental problem with relying on static lists of copyright exceptions and lobbying to get those lists revised as needed is that the future doesn’t have a lobby group.

If you would like to read more about these topics:

Matthew Jockers, Matthew Sag & Jason Schultz, Brief of Digital Humanities and Law Scholars in Support of Defendants-Appellees and Affirmance in Authors Guild v. Google (13-4829) (July 10, 2014)
Matthew Jockers, Matthew Sag & Jason Schultz, Digital Archives: Don’t Let Copyright Block Data Mining, 490 Nature 29-30 (October 4, 2012)
Matthew Sag, Orphan Works as Grist for the Data Mill, 27 Berkeley Technology Law Journal 1503 – 1550 (2012)
Matthew Sag, Copyright and Copy-Reliant Technology, 103 Northwestern University Law Review 1607–1682 (2009)

March 10, 2017

Internet Safe Harbors and the Transformation of Copyright Law will be published in the Notre Dame Law Review

My article, Internet Safe Harbors and the Transformation of Copyright Law, will be published in the Notre Dame Law Review, Vol. 93, 2017, later this year.

This Article shows how the substantive balance of copyright law has been overshadowed online by the system of intermediary safe harbors enacted as part of the Digital Millennium Copyright Act (“DMCA”) in 1998. The Internet safe harbors and the system of notice-and-takedown fundamentally changed the incentives of platforms, users, and rightsholders in relation to claims of copyright infringement. These different incentives interact to yield a functional balance of copyright online that diverges markedly from the experience of copyright law in traditional media environments. This article also explores a second divergence: the DMCA’s safe harbor system is being superseded by private agreements between rightsholders and large commercial Internet platforms made in the shadow of those safe harbors. These agreements relate to automatic copyright filtering systems, such as YouTube’s Content ID, that not only return platforms to their gatekeeping role, but encode that role in algorithms and software.

The normative implications of these developments are contestable. Fair use and other axioms of copyright law still nominally apply online; but in practice, the safe harbors and private agreements made in the shadow of those safe harbors are now far more important determinants of online behavior than whether that conduct is, or is not, substantively in compliance with copyright law. The diminished relevance of substantive copyright law to online expression has benefits and costs that appear fundamentally incommensurable. Compared to the offline world, online platforms are typically more permissive of infringement, and more open to new and unexpected speech and new forms of cultural participation. However, speech on these platforms is also more vulnerable to over-reaching claims by rightsholders. There is no easy metric for comparing the value of non-infringing expression enabled by the safe harbors to that which has been unjustifiably suppressed by misuse of the notice-and-takedown system. Likewise, the harm that copyright infringement does to rightsholders is not easy to calculate, nor is it easy to weigh against the many benefits of the safe harbors.

DMCA-plus agreements raise additional considerations. Automatic copyright enforcement systems have obvious advantages for both platforms and rightsholders; they may also allow platforms to be more hospitable to certain types of user content. However, automated enforcement systems may also place an undue burden on fair use and other forms of non-infringing speech. The design of copyright enforcement robots encodes a series of policy choices made by platforms and rightsholders and, as a result, subjects online speech and cultural participation to a new layer of private ordering and private control. In the future, private interests, not public policy will determine the conditions under which users get to participate in online platforms that adopt these systems. In a world where communication and expression is policed by copyright robots, the substantive content of copyright law matters only to the extent that those with power decide that it should matter.