Text Mining, Non-Expressive Use and the Technological Advantage of Fair Use

On March 29, 2017, I attended a fantastic conference on “Globalizing Fair Use: Exploring the Diffusion of General, Open and Flexible Exceptions in Copyright Law” hosted by American University Washington College of Law’s Program on Information Justice and Intellectual Property. As part of that event we held a webcast Q&A session moderated by Sasha Moss of the R Street Institute. The following is a rough transcript of my comments in response to Sasha’s questions about the legality of the non-expressive use of copyrighted works.

Copyright Questions For the Digital Age

There is no country in the world where simply reading a book and giving someone information about the book, such as its subject or themes, whether it uses particular words or particular combinations of words, the number of words, the number of pages, the ratio of female to male pronouns, etc., would amount to copyright infringement.

Why? Because information about the book is not the book. It is metadata. The question for the digital age is, “Can we use computers to produce that kind of data?” This question is important because although I can read a few books and produce some useful metadata, I can’t read a million books. But a computer can.
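To make the idea of machine-produced metadata concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function name `book_metadata`, the sample sentence, and the very crude tokenization are all invented for this example and are not drawn from any real text mining system. The point is simply that the function has to read the whole text, but what it hands back is facts about the text rather than the text itself.

```python
import re

def book_metadata(text, words_of_interest=(), phrases_of_interest=()):
    """Return facts about a text without returning any of its sentences.

    Hypothetical helper for illustration; the tokenization is deliberately crude.
    """
    # Lowercase the text and keep only runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    token_set = set(tokens)
    # Simplification: phrases are matched on the token stream, so a match
    # can span a sentence boundary.
    padded = " " + " ".join(tokens) + " "
    return {
        "word_count": len(tokens),
        "distinct_words": len(token_set),
        "uses_word": {w: w.lower() in token_set for w in words_of_interest},
        "uses_phrase": {
            p: f" {p.lower()} " in padded for p in phrases_of_interest
        },
    }

# An invented sample sentence standing in for a digitized book.
sample = "The whale rose. The crew watched the whale, and she watched them."
meta = book_metadata(
    sample,
    words_of_interest=["whale", "ship"],
    phrases_of_interest=["the whale"],
)
print(meta)
```

A person could produce the same answers by reading and counting; the computer just does it for a million books instead of a few.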

We have the technology

We have the technology to digitize large collections of books in order to produce data that enables computer scientists, linguists, historians, English professors, and the like, to answer important research questions. The data and the questions it can be used to answer do nothing to communicate the original expression of all those millions of books. However, technically speaking, this kind of digitalization is still copying.

But is this the kind of copying that copyright law should be concerned about? If a tree falls in an empty forest, does it truly make a sound? If something is copied but only read by a computer and the computer only communicates metadata about the work, is that the kind of copying that should amount to copyright infringement?

Text mining is vital for machine learning, automatic translation, and developing the language models

It seems to me that once you phrase the question that way, the answer is clear. We all use this amazing technology on a daily basis when we rely on Internet search engines, but text mining is about much more than this. By data mining vast quantities of scientific papers, researchers have been able to identify new treatments for diseases. Text mining has also allowed humanities scholars to identify patterns in vast libraries of literature. Text mining is vital for machine learning, automatic translation, and developing the language models that power dictation software.

Fair use and technological advantage

The United States is a world leader in various applications of text mining, starting with Internet search, but going far beyond that. In the United States, once people realized what was possible they more or less started doing it. If Larry Page and Sergey Brin had had the idea for the Google Internet search engine in Canada, Australia, England, or Germany in the 1990s it would have been crystal-clear that because their search engine relied on making copies of other people’s HTML webpages and there was no realistic way to obtain permission from all those people, building a search engine would be illegal. In countries with a closed list of copyright exceptions and limitations, or with fair dealing provisions that are tied to specific narrowly defined purposes, a lawyer would have looked at the list and said, “I don’t see Internet search or data mining on that list, so you can’t do it.”

The fair use doctrine reinforces copyright rather than negating it

In the United States, we have the fair use doctrine, which means that the list is not closed. In the United States, the fair use doctrine means you at least get a chance to explain why your particular use of a copyrighted work is for a purpose that promotes the goals of copyright, is reasonable in light of that purpose, and is unlikely to harm the interests of copyright owners. The fair use doctrine reinforces copyright rather than negating it; fair use doesn’t mean that you get to do whatever you want. Fair use is a system for determining how copyright should apply in new situations. That is especially important when the law was written decades ago and society and technology are changing fast.

Without something like fair use, other countries can only follow the United States

Without something like fair use, other countries can only follow the United States. Non-expressive uses of copyrighted works such as text mining, building an Internet search engine, or running plagiarism detection software have all been held to be fair use in the United States and are slowly becoming more accepted around the world. Of course, now that it is readily apparent that these activities are immensely beneficial and entirely non-prejudicial to the interests of copyright owners, we could probably write some specific amendments to the copyright act to make them legal. The problem is, we didn’t know this two decades ago when we actually needed those rules. I don’t know what the next thing that we don’t know is, but I do know that experience has shown that the flexibility of the fair use doctrine (which has been part of copyright law virtually since the English Statute of Anne in 1710, by the way) has worked better than a system of closed lists.

The fair use doctrine is a real source of competitive advantage for technologists and academic researchers in the United States. Right now, there are technologies being developed and research being done in the United States that either can’t be done in other countries, or can only be done by particular people subject to various arbitrary restrictions. Whether it’s Internet search, digital humanities research, machine learning or cloud computing, other countries have followed the United States in adopting technologies that make non-expressive use of copyrighted works, because some of the copyright risks begin to look less daunting once the practice has become accepted. The Europeans, for example, are pretty sure building a search engine must be legal, but they can’t quite agree why. But the thing to understand is that you can follow this way but you can never lead. It’s much harder to do the new thing if by the letter of the law it is illegal and you have no forum to argue that it should be allowed.

The future doesn’t have a lobby group

Of course, that’s not quite true, you have one forum … you can spend a vast amount of money on lobbyists and go to the government, go to Congress and try to get some favorable rules written. But even if that is successful from time to time, those rules have a particular character. A company that spends millions of dollars on a lobbying campaign to change the law is always going to try and make sure that those new rules only benefit its business. Special interests will get some laws changed, but usually in ways that disadvantage their competitors or exclude alternative technologies that might one day compete with them. The fundamental problem with relying on static lists of copyright exceptions and lobbying to get those lists revised as needed is that the future doesn’t have a lobby group.

If you would like to read more about these topics:

Some thoughts on Fair use, Transformative Use and Non-Expressive Use

Fair use, Transformative Use and Non-Expressive Use

Or,

Campbell v. Acuff-Rose and the Future of Digital Technologies, notes on a short presentation at the Fair Use In The Digital Age: The Ongoing Influence of Campbell v. Acuff-Rose’s “Transformative Use Test” Conference, April 17 & 18, 2015, University of Washington School of Law.

Copyright and disintermediation technologies

Copyright policy was hit by an analog wave of disintermediation technology in the post-war era and a digital wave of disintermediation technologies beginning in the 1990s. These successive waves of technology have forced us to reevaluate the foundational assumption of copyright law; that assumption being that any reproduction of the work should be seen as an exchange of value passing from the author (or copyright owner) to the consumer.

Technologies such as the photocopier and the videocassette recorder and then later the personal computer significantly destabilized copyright policy because these inventions, for the first time, placed commercially significant copying technology directly in the hands of large numbers of consumers. This challenge has only been accelerated by digitalization and the Internet. Digitalization allows for perfect reproduction such that the millionth copy of an MP3 file sounds just as good as the first copy.

The implications of the copying that these devices enabled were not clear-cut. In some cases, the new copying technology simply enabled greater flexibility in consumption; in others, it generated new copies to be released into the stream of commerce as competitors with the author’s original authorized versions. The Internet has connected billions of people together, leading to an outpouring of creativity and user-generativity, but from the perspective of the entertainment industry it has also brought people together to undertake piracy on a massive scale.

The significance of Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569 (1994)

The Supreme Court in Sony v. Universal[1] had already shown that it was willing to apply fair use in a flexible manner in situations where the use was personal and immaterial to the copyright owner. The significance of the Court’s decision in Campbell[2] was that, by reorienting the fair use doctrine around the concept of transformative use, the Court prepared the way for a flexible consideration of technical acts of reproduction that do not have the usual copyright significance.

Internet search engines, plagiarism detection software, text mining software and other copy-reliant technologies do not read, understand, or enjoy copyrighted works, nor do they deliver these works directly to the public. They do, however, necessarily copy them in order to process them as grist for the mill, raw materials that feed various algorithms and indices. Campbell arrived just in time to provide a legal framework far more hospitable to copy-reliant technology than had previously existed. Even in its broadest sense, transformative use is not the be all and end all of fair use. At the risk of over-simplification, Sony v. Universal safeguarded the future of the MP3 player, whereas Campbell secured the future of the Internet and reading machines.

Copy-reliant technology and non-expressive use

Some of the most important recent technological fair use cases can be summarized as follows: Copying that occurs as an intermediate technical step in the production of a non-infringing end product is a ‘non-expressive’ use and thus ordinarily constitutes fair use.[3] The main examples of non-expressive use I have in mind are the construction of search engine indices,[4] the operation of plagiarism detection software[5] and, most recently, library digitization to make paper books text-searchable.[6]
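For readers who have never looked inside a search index, the following Python sketch shows the general idea of an inverted index. It is a toy under stated assumptions: the function names `build_index` and `search` and the three sample documents are invented for illustration, and real search engines are vastly more elaborate. What it is meant to show is the shape of the intermediate step: every document is read in full, but what is stored and returned is a map from words to document identifiers.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each word to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Reading the full text is unavoidable, but only word-to-ID
        # associations are retained.
        for word in set(re.findall(r"[a-z]+", text.lower())):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def search(index, query):
    """Return IDs of documents that contain every word in the query."""
    words = re.findall(r"[a-z]+", query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)

# Invented documents, used only to exercise the sketch.
docs = {
    "doc1": "Fair use is flexible.",
    "doc2": "Search engines copy pages to build an index.",
    "doc3": "An index supports fair and fast search.",
}
index = build_index(docs)
results = search(index, "fair index")
print(results)
```

The search result points the user to a document; it does not deliver the document’s expression.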

To have a coherent concept of fair use, or any particular category of fair use, one needs a coherent concept of copyright. As expressed in the U.S. Constitution, copyright’s motivating purpose is “to promote the Progress of Science and useful Arts.”[7] Ever since the Statute of Anne in 1710, the purpose of copyright law has been to encourage the creativity of authors and to promote the creation and dissemination of works of authorship. Copyright is not a guarantee of total control; in general, the copyright owner’s rights are limited and defined in reference to the communication of the expressive aspects of the work to the public. This is evident in the idea-expression distinction, the way courts determine whether two works are substantially similar and the focus of fair use cases on expressive substitution. Thus, subsequent authors may not compete with the copyright owner by offering her original expression to the public as a substitute for the copyright owner’s work, but they are free to compete with their own expression of the same facts, concepts and ideas. They are also free to expose, criticize and even vilify the original work. Genuine parodies, critiques and illustrative uses are fair use so long as the copying they partake in is reasonable in light of those purposes.

If public communication and expressive substitution are rightly understood as copyright’s basic organizing principles, then it follows that non-expressive uses (i.e., uses that involve copying, but don’t communicate the expressive aspects of the work to be read or otherwise enjoyed) must be fair use. In fact, they are arguably the purest essence of fair use. Grokking the concept of non-expressive use simply involves taking the well understood distinction between expressive and non-expressive works and making the same distinction in relation to potential acts of infringement.

The legal status of actual copying for nonexpressive uses was not a burning issue before digital technology. Outside the context of reading machines like search engines, plagiarism software and the like, courts have quite reasonably presumed that every copy of an expressive work is for an expressive purpose. But this assumption no longer holds. At a minimum, preserving the functional force of the idea-expression distinction in the digital context requires that copying for purely non-expressive purposes, such as the automated extraction of data, should not be infringing.

Some limits to the non-expressive use framework

Non-expressive use is a sufficient but not necessary condition of fair use. For example, parody is an expressive use, but it is fair use because it does not tend to threaten expressive substitution. Even within the realm of recent technology cases, non-expressive use is not the right framework for addressing important man-machine interaction questions such as disability access, also a key issue in the HathiTrust litigation, but it does tie together a number of disparate threads.

The cases which hold that software reverse engineering is fair use are grounded firmly in the idea-expression distinction,[8] but they are not exactly non-expressive use cases for the reasons that follow.[9] The non-expressive use framework is also not the right tool in cases where software is copied in order to access its functionality: after all, software is primarily functional and its primary (perhaps exclusive) value comes from the function it performs. Software piracy can’t be justified as a non-expressive use, because to do so would defeat the statutory scheme wherein Congress chose to graft computer software protection onto copyright. However, the reverse engineering cases still follow the logic of non-expressive use. In those cases, copying to access certain APIs and other unprotectable elements enabled the copyists to either independently recreate that functionality (akin to conveying the same ideas with different expression) or to develop programs or machines that would complement the original software.

Non-expressive use versus transformative use?

The main issue left to resolve in terms of copy-reliant technology and non-expressive use seems to be one of nomenclature. Is non-expressive use simply a subset of transformative use? Or is it a separate species of fair use with similar implications to those of transformative use?

Non-expressive use, as I have defined and elucidated in a series of law review articles and amicus briefs, is a clear coherent concept that ties a broad set of fair use cases directly to one of copyright’s core principles, the idea-expression distinction. Transformative use, as explained by Pierre Leval and adopted by the Supreme Court is rooted in the constitutional imperative for copyright protection – the creation of new works and the promotion of progress in culture, learning, science and knowledge. But for all that, if transformative use is invoked as an umbrella term, it is often hard to see what holds the category together.

The Campbell Court did not posit transformative use as a unified, exhaustive theory, but it did say that “[a]lthough such transformative use is not absolutely necessary for a finding of fair use, the goal of copyright, to promote science and the arts, is generally furthered by the creation of transformative works. Such works thus lie at the heart of the fair use doctrine’s guarantee of breathing space within the confines of copyright, …”[10] No doubt, when the Supreme Court spoke of transformative use, it had various communicative and expressive uses, such as parody, the right of reply, public comment and criticism in mind. But since Campbell, lower courts have applied the same purposive interpretation of copyright to a broader set of challenges. Campbell was decided in a different technological context and it is true that many of today’s technological fair use issues were entirely unimaginable before the birth of the World Wide Web and our modern era of big data, cloud computing, social media, mobile connectivity and the “Internet of Things”.

Non-expressive use is a useful concept because it provides a way for courts to recognize the legitimacy of copying that is inconsequential in terms of expressive substitution, but does not necessarily lead to the creation of the type of new expression that the Supreme Court had in mind in Campbell. The use of reading machines in digital humanities research is easy to justify, both in terms of the lack of expressive substitution and in the obvious production of meaning, new insights and potentially new and utterly transformative works of authorship. But what of less generative non-expressive uses? For example, in the future a robot might ‘read’ a copyrighted poster on a subway wall advertising a rock concert in Central Park. The robot might then ‘decide’ to change its travel plans in light of the predictable disruption. The acts of ‘reading’ and ‘deciding’ are both simply computational. Even if reading involves making a copy of the work inside the brain of a machine, it seems nonsensical to conclude that the robot was used to infringe copyright. In the age of the printing press, copying a work had clear and obvious implications. Copying was invariably for expressive ends and it was almost always the point of exchange of value between author and reader. The copyright implications of copying are much more contingent in the digital age.

There is much clarity to be gained by talking directly in terms of non-expressive use rather than relying on transformative use as a broad umbrella for a range of expressive and non-expressive fair uses. Such clear thinking would hopefully ease the anxieties of the entertainment industry that still fears that fair use is simply a stalking horse for dismantling copyright. Nonetheless, it would not be surprising if courts were more comfortable sticking with the language of transformativeness that Judge Pierre Leval gave us in “Toward a Fair Use Standard”,[11] and the Supreme Court adopted in Campbell.

This is a sketch of some ideas; no doubt revisions will follow after this exciting conference.

Related Publications:

Matthew Sag, Copyright and Copy-Reliant Technology 103 Northwestern University Law Review 1607–1682 (2009)

Matthew Sag, Orphan Works as Grist for the Data Mill, 27 Berkeley Technology Law Journal 1503–1550 (2012)

Matthew Jockers, Matthew Sag & Jason Schultz, Digital Archives: Don’t Let Copyright Block Data Mining, 490 Nature 29-30 (October 4, 2012)

Somewhat Related Publications:

Peter DiCola & Matthew Sag, An Information-Gathering Approach to Copyright Policy, 34 Cardozo Law Review 173–247 (2012)

Matthew Sag, Predicting Fair Use 73 Ohio State Law Journal 47–91 (2012)

Matthew Sag, The Pre-History of Fair Use 76 Brooklyn Law Review 1371–1412 (2011)

 

[1] Sony Corp. of America v. Universal City Studios, Inc., 464 U.S. 417 (1984).

[2] Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569 (1994).

[3] See generally, Matthew Sag, Copyright and Copy-Reliant Technology 103 Northwestern University Law Review 1607–1682 (2009)

[4] There is no case addressing the legality of the process of making a text-based search index (as opposed to caching or display of search results), but the proposition naturally flows from Kelly v. Arriba Soft Corp., 336 F.3d 811 (9th Cir. 2003) and Perfect 10, Inc. v. Amazon.com, Inc., 508 F.3d 1146 (9th Cir. 2007) and is a necessary implication of Authors Guild, Inc. v. HathiTrust, 755 F.3d 87 (2d Cir. 2014) and Authors Guild, Inc. v. Google Inc., 954 F. Supp. 2d 282 (S.D.N.Y. 2013).

[5] A.V. ex rel. Vanderhye v. iParadigms, LLC, 562 F.3d 630 (4th Cir. 2009).

[6] Authors Guild, Inc. v. HathiTrust, 755 F.3d 87 (2d Cir. 2014); Authors Guild, Inc. v. Google Inc., 954 F. Supp. 2d 282 (S.D.N.Y. 2013). See also Matthew Sag, Orphan Works as Grist for the Data Mill, 27 Berkeley Technology Law Journal 1503–1550 (2012); Matthew Jockers, Matthew Sag & Jason Schultz, Digital Archives: Don’t Let Copyright Block Data Mining, 490 Nature 29-30 (October 4, 2012).

[7] U.S. Const. art. I, § 8, cl. 8.

[8] Sega Enter. Ltd. v. Accolade, Inc., 977 F.2d 1510 (9th Cir. 1992); Sony Computer Entm’t, Inc. v. Connectix Corp., 203 F.3d 596, 606 (9th Cir. 2000).

[9] These reasons are more fully elaborated in Matthew Sag, Copyright and Copy-Reliant Technology 103 Northwestern University Law Review 1607–1682 (2009).

[10] Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569, 579 (1994)(citation omitted).

[11] 103 Harv. L. Rev. 1105 (1990)

Why digital humanities researchers support Google’s fair use defense

I posted a guest blog post over at the Authors Alliance explaining why digital humanities researchers support Google’s fair use defense in Authors Guild v. Google. The Authors Alliance supports Google’s fair use defense because it helps authors reach readers. In my post, I explained another reason why this case is important to the advancement of knowledge and scholarship.

Earlier this month a group of more than 150 researchers, scholars and educators with an interest in the ‘Digital Humanities’ joined an amicus brief urging the Second Circuit Court of Appeals to side with Google in this dispute. Why would so many teachers and academics from fields ranging from Computer Science, English Literature, History, Law, to Linguistics care about this lawsuit? It’s not because they are worried about Google—Google surely has the resources to look after itself—but because they are concerned about the future of academic inquiry in a world of ‘big data’ and ubiquitous copyright.

For decades now, physicists, biologists and economists have used massive quantities of data to explore the world around them. With increases in computing power, advances in computational linguistics and natural language processing, and the mass digitization of texts, researchers in the humanities can apply these techniques to the study of history, literature, language and so much more.

Conventional literary scholars, for example, rely on the close reading of selected canonical works. Researchers in the ‘Digital Humanities’ are able to enrich that tradition with a broader analysis of patterns emergent in thousands, hundreds of thousands, or even millions of texts. Digital Humanities scholars fervently believe that text mining and the computational analysis of text are vital to the progress of human knowledge in the current Information Age. Digitization enhances our ability to process, mine, and ultimately better understand individual texts, the connections between texts, and the evolution of literature and language.

A Simple Example of the Power of the Digital Humanities

The figure below is an Ngram-generated chart that compares the frequency with which authors of texts in the Google Book Search database refer to the United States as a single entity (“is”) as opposed to a collection of individual states (“are”). As the chart illustrates, it was only in the latter half of the Nineteenth Century that the conception of the United States as a single, indivisible entity was reflected in the way a majority of writers referred to the nation. This is a trend with obvious political and historical significance, of interest to a wide range of scholars and even to the public at large. But this type of comparison is meaningful only to the extent that it uses as raw data a digitized archive of significant size and scope.

The United States is/are

There are two very important things to note here. First, the data used to produce this visualization can only be collected by digitizing the entire contents of the relevant books, because no one knows in advance which books to look in for this kind of search. Second, not a single sentence of the underlying books has been reproduced in the finished product. The original authors’ expression was an input to the process, but it was not a recognizable part of the output. This is the fundamental distinction that the Digital Humanities Amici are asking the court to preserve: the distinction between ideas and expression.
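The kind of comparison described above can be sketched in a few lines of Python. This is an illustration only, not how the Google Ngram tool is actually implemented: the function names `count_phrase` and `phrase_counts_by_year` are hypothetical, and the tiny “corpus” consists of invented sentences with invented years rather than real data. It shows the two points just made: the whole text of every item has to be processed, and the output is a table of counts per year containing none of the input sentences.

```python
import re
from collections import Counter

def count_phrase(tokens, phrase_tokens):
    """Count how many times a sequence of words appears in a token list."""
    n = len(phrase_tokens)
    return sum(
        1
        for i in range(len(tokens) - n + 1)
        if tokens[i:i + n] == phrase_tokens
    )

def phrase_counts_by_year(corpus, phrases):
    """Tally phrase occurrences per year.

    corpus: iterable of (year, full_text) pairs.
    Returns {phrase: Counter mapping year -> count}.
    """
    counts = {p: Counter() for p in phrases}
    for year, text in corpus:
        tokens = re.findall(r"[a-z]+", text.lower())
        for p in phrases:
            counts[p][year] += count_phrase(tokens, p.lower().split())
    return counts

# Invented sentences and years, standing in for a digitized library.
corpus = [
    (1800, "The United States are at peace. The United States are growing."),
    (1800, "The United States is young."),
    (1900, "The United States is a nation. The United States is large."),
    (1900, "The United States are many."),
]
counts = phrase_counts_by_year(
    corpus, ["the united states is", "the united states are"]
)
for phrase, by_year in counts.items():
    print(phrase, dict(by_year))
```

A chart like the one above is just a plot of such counts, normalized by the amount of text available for each year.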

Will Copyright Law Prevent the Computational Analysis of Text?

The computational analysis of text has opened the door to new fields of inquiry in the humanities–it allows researchers to ask questions that were simply inconceivable in the analog era. However, the lawsuit by the Authors Guild threatens to slam that door shut.

For over 300 years, copyright has balanced the author’s right to control the copying of her expression with the public’s freedom to access the facts and ideas contained within that expression. Authors get the chance to sell their books to the public, but they don’t get to say how those books are read, how people react to them, whether they choose to praise them or pan them, how they talk to their friends about them. Copyright protects the author’s expression (for a limited time and subject to a number of exceptions and limitations not relevant here) but it leaves the information within that expression and information about that expression “free as the air to common use.” The protection of expression and the freedom of non-expression are both fundamental pillars of American copyright law. However, the Authors Guild’s long-running campaign against library digitization threatens to erase that distinction in the digital age and fundamentally alter the balance of copyright law.

In the pre-digital era, the only reason to copy a book was to read it, or at least preserve the option of reading it. But this is no longer true. There are a host of modern technologies that literally copy text as an input into some larger data-processing application that has nothing to do with reading. For want of a better term, we call these ‘non-expressive uses’ because they don’t necessarily involve any human being reading the author’s original expression at the end of the day.

Most authors, if asked, support making their works searchable because they want them to be discovered by new generations of readers. But this is not our central point. Our point is that if it is permissible for a human to pick up a book and count the number of occurrences of the word “whale” (1119 times in Moby Dick) or the ratio of male to female pronouns (about 2:1 in A Game of Thrones Book 1—A Song of Ice and Fire), etc., then there is no reason the law should prevent researchers doing this on a larger and more systematic basis.
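Doing this kind of counting “on a larger and more systematic basis” is not complicated. The Python sketch below is a rough illustration under stated assumptions: the function name `word_stats`, the pronoun lists, and the two miniature “books” are invented for this example. Different choices about tokenization and about which pronouns to include will give different numbers, so this sketch should not be expected to reproduce the figures quoted above.

```python
import re
from collections import Counter

# Assumed pronoun lists; a real study would need to justify these choices.
MALE = {"he", "him", "his", "himself"}
FEMALE = {"she", "her", "hers", "herself"}

def word_stats(text, target):
    """Count a target word and gendered pronouns in one text."""
    freq = Counter(re.findall(r"[a-z]+", text.lower()))
    male = sum(freq[w] for w in MALE)
    female = sum(freq[w] for w in FEMALE)
    return {
        "target_count": freq[target.lower()],
        "male_pronouns": male,
        "female_pronouns": female,
        # The ratio is undefined when there are no female pronouns.
        "male_to_female": male / female if female else None,
    }

# Invented miniature texts standing in for a digitized collection.
corpus = {
    "book_a": "He saw the whale. She saw him. The whale saw his boat.",
    "book_b": "She took her book, and he followed.",
}
report = {title: word_stats(text, "whale") for title, text in corpus.items()}
for title, stats in report.items():
    print(title, stats)
```

The same loop works whether the collection holds two texts or two million; only the time it takes changes.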

Game of Thrones Pronouns Etc

Digitizing a library collection to make it searchable or to allow researchers to create and analyze metadata does not interfere with the interests that copyright owners have in the underlying expression in their books.

Who knows what the next generation of humanities researchers will uncover about literature, language, and history if we let them?

You can download the Brief of Digital Humanities and Law Scholars as Amici Curiae here.