Why digital humanities researchers support Google’s fair use defense

I posted a guest post over at the Authors Alliance explaining why digital humanities researchers support Google’s fair use defense in Authors Guild v. Google. The Authors Alliance supports Google’s fair use defense because it helps authors reach readers. In my post, I explained another reason why this case is important to the advancement of knowledge and scholarship.

Earlier this month a group of more than 150 researchers, scholars and educators with an interest in the ‘Digital Humanities’ joined an amicus brief urging the Second Circuit Court of Appeals to side with Google in this dispute. Why would so many teachers and academics from fields as varied as Computer Science, English Literature, History, Law and Linguistics care about this lawsuit? It’s not because they are worried about Google (Google surely has the resources to look after itself) but because they are concerned about the future of academic inquiry in a world of ‘big data’ and ubiquitous copyright.

For decades now, physicists, biologists and economists have used massive quantities of data to explore the world around them. With increases in computing power, advances in computational linguistics and natural language processing, and the mass digitization of texts, researchers in the humanities can apply these techniques to the study of history, literature, language and so much more.

Conventional literary scholars, for example, rely on the close reading of selected canonical works. Researchers in the ‘Digital Humanities’ are able to enrich that tradition with a broader analysis of patterns emergent in thousands, hundreds of thousands, or even millions of texts. Digital Humanities scholars fervently believe that text mining and the computational analysis of text are vital to the progress of human knowledge in the current Information Age. Digitization enhances our ability to process, mine, and ultimately better understand individual texts, the connections between texts, and the evolution of literature and language.

A Simple Example of the Power of the Digital Humanities

The figure below is an Ngram-generated chart that compares the frequency with which authors of texts in the Google Book Search database refer to the United States as a single entity (“is”) as opposed to a collection of individual states (“are”). As the chart illustrates, it was only in the latter half of the Nineteenth Century that the conception of the United States as a single, indivisible entity was reflected in the way a majority of writers referred to the nation. This is a trend with obvious political and historical significance, of interest to a wide range of scholars and even to the public at large. But this type of comparison is meaningful only to the extent that it uses as raw data a digitized archive of significant size and scope.

The United States is/are

There are two very important things to note here. First, the data used to produce this visualization can only be collected by digitizing the entire contents of the relevant books: no one knows in advance which books to look in for this kind of search. Second, not a single sentence of the underlying books has been reproduced in the finished product. The original author’s expression was an input to the process, but it was not a recognizable part of the output. This is the fundamental distinction that the Digital Humanities Amici are asking the court to preserve: the distinction between ideas and expression.
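As a rough illustration of the kind of counting that sits behind a chart like this, the sketch below tallies how often two phrases appear per year in a tiny corpus. The corpus is invented, and the function name `phrase_counts_by_year` and the (year, text) layout are my own assumptions for illustration; this is not a description of how Google's Ngram tool is actually built. The point to notice is that the output consists only of counts, not of any sentence from the input texts.

```python
from collections import defaultdict

# Toy corpus of (publication year, text) pairs, invented for illustration.
CORPUS = [
    (1820, "The United States are bound by treaty. The United States are many."),
    (1850, "The United States are divided, yet the United States is one nation."),
    (1890, "The United States is a great power. The United States is at peace."),
]

def phrase_counts_by_year(corpus, phrases):
    """Return {year: {phrase: count}} using case-insensitive substring counts."""
    counts = defaultdict(lambda: {p: 0 for p in phrases})
    for year, text in corpus:
        lowered = text.lower()
        for phrase in phrases:
            counts[year][phrase] += lowered.count(phrase.lower())
    return dict(counts)

result = phrase_counts_by_year(
    CORPUS, ["the united states is", "the united states are"]
)
for year in sorted(result):
    print(year, result[year])
```

A real study would normalize these counts by the total number of words published each year, but even this toy version shows why the whole text of every book has to be scanned before the trend can be measured.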

Will Copyright Law Prevent the Computational Analysis of Text?

The computational analysis of text has opened the door to new fields of inquiry in the humanities–it allows researchers to ask questions that were simply inconceivable in the analog era. However, the lawsuit by the Authors Guild threatens to slam that door shut.

For over 300 years Copyright has balanced the author’s right to control the copying of her expression with the public’s freedom to access the facts and ideas contained within that expression. Authors get the chance to sell their books to the public, but they don’t get to say how those books are read, how people react to them, whether they choose to praise them or pan them, how they talk to their friends about them. Copyright protects the author’s expression (for a limited time and subject to a number of exceptions and limitations not relevant here) but it leaves the information within that expression and information about that expression “free as the air to common use.” The protection of expression and the freedom of non-expression are both fundamental pillars of American Copyright law. However, the Authors Guild’s long-running campaign against library digitization threatens to erase that distinction in the digital age and fundamentally alter the balance of copyright law.

In the pre-digital era, the only reason to copy a book was to read it, or at least preserve the option of reading it. But this is no longer true. There are a host of modern technologies that literally copy text as an input into some larger data-processing application that has nothing to do with reading. For want of a better term, we call these ‘non-expressive uses’ because they don’t necessarily involve any human being reading the author’s original expression at the end of the day.

Most authors, if asked, support making their works searchable because they want them to be discovered by new generations of readers. But this is not our central point. Our point is that if it is permissible for a human to pick up a book and count the number of occurrences of the word “whale” (1119 times in Moby Dick) or the ratio of male to female pronouns (about 2:1 in A Game of Thrones Book 1—A Song of Ice and Fire), etc., then there is no reason the law should prevent researchers doing this on a larger and more systematic basis.
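The counting described above is easy to automate. The following is a minimal sketch, assuming the text of a book is already available as a plain string; the sample sentence, the function names and the short pronoun lists are hypothetical choices of mine, not the method used to produce the figures quoted above.

```python
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text and count each word (letters and apostrophes only)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def pronoun_ratio(counts):
    """Ratio of male to female pronouns; None if no female pronouns occur."""
    male = sum(counts[w] for w in ("he", "him", "his"))
    female = sum(counts[w] for w in ("she", "her", "hers"))
    return male / female if female else None

# Invented sample text standing in for a full book.
sample = "The whale rose. He saw the whale; she saw him. His whale, her whale."
counts = word_counts(sample)
print(counts["whale"])        # occurrences of "whale" in the sample
print(pronoun_ratio(counts))  # male-to-female pronoun ratio in the sample
```

Run over a whole library rather than one sentence, the same few lines yield metadata about books without anyone reading, or being shown, the books themselves.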

Game of Thrones Pronouns Etc

Digitizing a library collection to make it searchable or to allow researchers to create and analyze metadata does not interfere with the interests that copyright owners have in the underlying expression in their books.

Who knows what the next generation of humanities researchers will uncover about literature, language, and history if we let them?

You can download the Brief of Digital Humanities and Law Scholars as Amici Curiae here.

Brief of Digital Humanities and Legal Scholars in Authors Guild v. Google filed

On Thursday this week, we filed a brief on behalf of over 150 researchers, scholars and educators in Authors Guild v. Google, currently on appeal to the Second Circuit Court of Appeals.
The Brief of Digital Humanities and Legal Scholars argues that Copyright law is not, and should not be, an obstacle to the computational analysis of text. Copyright law has long recognized the distinction between protecting an author’s original expression and the public’s right to access the facts and ideas contained within that expression.
We are confident that the Second Circuit will vote to maintain that distinction in the digital age so that library digitization, internet search and related non-expressive uses of written works remain legal.
The final version of the brief is available on the free online repository SSRN at: http://ssrn.com/abstract=2465413.
We are grateful for the support of so many wonderful scholars in this important case and we are even more grateful for all the fascinating research that these computer scientists, English professors, historians, linguists, and all those working in the digital humanities do to enrich our lives.
We would also like to thank The Association for Computers and the Humanities and the Canadian Society of Digital Humanities/Société canadienne des humanités numériques for their support as institutions.
Matthew Jockers
Matthew Sag
Jason Schultz

Call for signatories: Digital Humanities Amicus in Authors Guild v. Google

Matthew Jockers, Jason Schultz and I have written an amicus brief in the upcoming Court of Appeals round of Authors Guild v. Google, Inc.

Download the draft here: DH Amicus AG v Google CA2

Background

Since we started working on this project just over two years ago, two district courts and the Court of Appeals for the Second Circuit have rejected the Authors Guild’s attacks on library digitization and the legality of text-mining. We are confident that the Second Circuit will uphold Judge Chin’s decision last year where he rejected (on a motion for summary judgment) the Authors Guild’s copyright infringement claim against Google over its Google Book Search product. The rulings in Authors Guild v. Google and the parallel case of Authors Guild v. Hathitrust are a critical moment in the fight to define fair use for the Digital Humanities.

In Authors Guild v. Google, Judge Chin expressly based his ruling in part on the fact that

“Google Books … has transformed book text into data for purposes of substantive research, including data mining and text mining in new areas, thereby opening up new fields of research. Words in books are being used in a way they have not been used before. Google Books has created something new in the use of book text — the frequency of words and trends in their usage provide substantive information.”
In his decision, Judge Chin cites the Brief of Digital Humanities and Law Scholars as Amici Curiae that we submitted on behalf of more than 100 researchers and scholars last year. Chin wrote that
“Google Books permits humanities scholars to analyze massive amounts of data — the literary record created by a collection of tens of millions of books.”

The Authors Guild is now appealing Judge Chin’s decision (on this and other grounds). A different panel of that same court has already upheld the decision in Authors Guild v. Hathitrust. We believe that these cases will have a dramatic effect on research in fields from computer science to linguistics, history, literature and the digital humanities.

Argument in a nutshell

According to the U.S. Constitution, the purpose of copyright is “To promote the Progress of Science and useful Arts”. Copyright law should not be an obstacle to statistical and computational analysis of the millions of books owned by university libraries. Copyright law has long recognized the distinction between protecting an author’s original expression and the public’s right to access the facts and ideas contained within that expression. That distinction must be maintained in the digital age so that library digitization, internet search and related non-expressive uses of written works remain legal.

What can you do?

If you are a legal academic or student, or an academic or researcher who would be affected by this issue, you can help preserve the balance of copyright law by joining our brief as a signatory (we need your name and affiliation, e.g. Associate Professor, Jane Doe, Springfield University).

Does this concern you?

If you are still reading this post, the answer is probably YES. We are collecting signatures from a wide range of fields, including computer science, English, history, law, linguistics and philosophy. We need your name, etc., by July 9, 2014. Please enter your details directly via this online tool:

https://docs.google.com/forms/d/1QSA_fUSaRpw47wwRcXh0SXkZFx1NQ2NbjhBbfTrICnA/viewform?usp=send_form

Please feel free to share this invitation with other interested academics and PhD students.

Thank you!

Matthew Jockers explains why you can’t read a book through snippets

The Authors Guild’s war on search engines, text-mining and academic research is in its final throes. Over the last two years two different US Federal District Courts have held that library digitization for the purpose of building a search index and running a search engine is fair use. See Authors Guild v. Hathitrust 902 F. Supp. 2d 445 (S.D.N.Y. 2012) and Authors Guild v. Google 954 F. Supp. 2d 282 (S.D.N.Y. 2013). The Hathitrust decision was upheld on appeal on June 10 this year (Authors Guild v. Hathitrust, 2nd Circuit 2014) and the parties and interested amici are gearing up for a final showdown in the appeal of Authors Guild v. Google.

In the Guild’s latest legal salvo it argues – by repeated assertion – that the text snippets Google displays to users allow 78% of the contents of any book to be reconstructed. (e.g., at p.10 “The scanning process resulted in an index that contains the complete text of all the books copied in the Library Project.”)

My sometime co-author and accomplished Digital Humanities researcher, Matthew Jockers, tested out the Guild’s claims on his own book and … it turns out that you can’t read a book through snippets, unless you already have the book, and that even then it takes about 30 minutes to trick the search engine into giving you the next 100 words beyond the free-view.

As Matt explains:

“Reading 78% of my book online, as the Guild asserts, requires that the reader anticipate what words will appear in the concealed sections of the book.”

He concludes

“Well, this certainly is reading Macroanalysis the hard way. I’ve now spent 30 minutes to gain access to exactly 100 words beyond what was offered in the initial preview. And, of course, my method involved having access to the full text! Without the full text, I don’t think such a process of searching and reading is possible, and if it is possible, it is certainly not feasible!

But let’s assume that a super savvy text pirate, with extensive training in English language syntax could guess the right words to search and then perform at least as well as I did using a full text version of my book as a crutch. My book contains, roughly, 80,000 words. Not counting the ~5k offered in the preview, that leaves 75,000 words to steal. At a rate of 200 words per hour, it would take this super savvy text pirate 375 hours to reconstruct my book. That’s about 47 days of full-time, eight-hour work.”
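The arithmetic in Matt’s estimate can be checked in a few lines. The figures are taken from the passage quoted above; the variable names are mine and the rounding to whole workdays is an assumption about how the 47-day figure was reached.

```python
# Figures from the quoted passage.
total_words = 80_000         # approximate length of the book
preview_words = 5_000        # approximate words available in the free preview
words_per_hour = 100 / 0.5   # 100 words recovered in 30 minutes

remaining_words = total_words - preview_words
hours_needed = remaining_words / words_per_hour
workdays_needed = hours_needed / 8  # eight-hour working days

print(remaining_words, hours_needed, round(workdays_needed))
```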

Matt’s book is Macroanalysis: Digital Methods and Literary History and — as seen on the screen shot I just made of Google Books — you can buy the eBook version, linked to from the Google Books web page, for $14.95.

Screen Shot 2014-06-19 at 11.48.07 AM

Authors Guild v. HathiTrust — Libraries 3 : Authors Guild 0

The Second Circuit Court of Appeals has upheld the most important parts of the District Court decision in Authors Guild v. HathiTrust. Here is a link to the decision: AGvHathiTrust_CA2_2013.

Along with the district court decision in this case and the one in Authors Guild v. Google, this makes the current score Libraries 3 : Authors Guild 0.

The decision confirms that library digitization (as performed by Google in conjunction with the University of Michigan, University of Illinois and many others) does not infringe copyright if it is done for the purpose of allowing blind and visually disabled people to read books. 

Access to the Print‐Disabled
The HDL also provides print‐disabled patrons with versions of all of the works contained in its digital archive in formats accessible to them. In order to obtain access to the works, a patron must submit documentation from a qualified expert verifying that the disability prevents him or her from reading printed materials, and the patron must be affiliated with an HDL member that has opted‐into the program. Currently, the University of Michigan is the only HDL member institution that has opted‐in. We conclude that this use is also protected by the doctrine of fair use.

The decision confirms that library digitization does not infringe copyright if it is done for the purpose of text-mining or creating a search engine. This is the core of the non-expressive use argument that Matthew Jockers, Jason Schultz and I made in the Digital Humanities Amicus Brief (http://ssrn.com/abstract=2274832). That brief was joined by over 100 professors and scholars who teach, write, and research in computer science, the digital humanities, linguistics or law, and two associations that represent Digital Humanities scholars generally.

The crux of our argument was that mass digitization of books for text-mining purposes is a form of incidental or “intermediate” copying that enables ultimately non-expressive, non-infringing, and socially beneficial uses without unduly treading on any expressive—i.e., legally cognizable—uses of the works. The Court of Appeals appears to have agreed.

Full‐Text Search
It is not disputed that, in order to perform a full‐text search of books, the Libraries must first create digital copies of the entire books. Importantly, as we have seen, the HDL does not allow users to view any portion of the books they are searching. Consequently, in providing this service, the HDL does not add into circulation any new, human‐readable copies of any books. Instead, the HDL simply permits users to “word search”—that is, to locate where specific words or phrases appear in the digitized books. Applying the relevant factors, we conclude that this use is a fair use.

The Court left itself some room to maneuver if it turns out that, for some reason, digitization for non-expressive uses like text mining causes unforeseen harm in different circumstances. For example, a digitization project that did not bother with any kind of security might not be fair use.

Without foreclosing a future claim based on circumstances not now predictable, and based on a different record, we hold that the balance of relevant factors in this case favors the Libraries. In sum, we conclude that the doctrine of fair use allows the Libraries to digitize copyrighted works for the purpose of permitting full‐text searches.

With that appropriate caveat, this is a great win for humanity and the Digital Humanities respectively.

I am proud to have played my small part in this case over the years.

Google Books held to be fair use

Authors Guild v. Google: library digitization as fair use vindicated, again.

After more than eight years of litigation, the legality of the Google Books Search engine has finally been vindicated.

Screen Shot 2013-11-14 at 10.35.00 AM

Authors Guild v Google Summary Judgement (Nov. 14, 2013)

The heart of the decision

The key to understanding Authors Guild v. Google is not in the court’s explanation of any of the individual fair use factors — although there is a great deal here for copyright lawyers to mull over —  but rather in the court’s description of its overall assessment of how the statutory factors should be weighed together in light of the purposes of copyright law.

“In my view, Google Books provides significant public benefits. It advances the progress of the arts and sciences, while maintaining respectful consideration for the rights of authors and other creative individuals, and without adversely impacting the rights of copyright holders. It has become an invaluable research tool that permits students, teachers, librarians, and others to more efficiently identify and locate books. It has given scholars the ability, for the first time, to conduct full-text searches of tens of millions of books. It preserves books, in particular out-of-print and old books that have been forgotten in the bowels of libraries, and it gives them new life. It facilitates access to books for print-disabled and remote or underserved populations. It generates new audiences and creates new sources of income for authors and publishers. Indeed, all society benefits.”  (Authors Guild v. Google, p.26)

Even before last year’s HathiTrust decision (Authors Guild v. Hathitrust), the case law on transformative use and market effect was stacked in Google’s favor. Nonetheless, Judge Chin’s rulings in other cases (e.g. WNET, THIRTEEN v. Aereo, Inc.) suggest that he takes the rights of copyright owners very seriously and that it was essential to persuade him that Google was not merely evading the rights of authors through clever legal or technological structures. The court’s conclusion that the Google Library Project “advance[d] the progress of the arts and sciences, while maintaining respectful consideration for the rights of authors and other creative individuals, and without adversely impacting the rights of copyright holders” pervades all of its more specific analysis.

Data mining, text mining and digital humanities

An entire page of the judgment is devoted to explaining how digitization enables data mining. This discussion relies substantially on the amicus brief of Digital Humanities and Law Scholars signed by over 100 academics last year.

“Second, in addition to being an important reference tool, Google Books greatly promotes a type of research referred to as “data mining” or “text mining.”  (Br. of Digital Humanities and Law Scholars as Amici Curiae at 1 (Doc. No. 1052)).  Google Books permits humanities scholars to analyze massive amounts of data — the literary record created by a collection of tens of millions of books.  Researchers can examine word frequencies, syntactic patterns, and thematic markers to consider how literary style has changed over time.  …

Using Google Books, for example, researchers can track the frequency of references to the United States as a single entity (“the United States is”) versus references to the United States in the plural (“the United States are”) and how that usage has changed over time.  (Id. at 7).  The ability to determine how often different words or phrases appear in books at different times “can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology.”  Jean-Baptiste Michel et al., Quantitative Analysis of Culture Using Millions of Digitized Books, 331 Science 176, 176 (2011) (Clancy Decl. Ex. H)” (Authors Guild v. Google, p.9-10)

The court held that Google Books was “[transformative] in the sense that it has transformed book text into data for purposes of substantive research, including data mining and text mining in new areas, thereby opening up new fields of research. Words in books are being used in a way they have not been used before. Google Books has created something new in the use of book text — the frequency of words and trends in their usage provide substantive information.”

A snippet of new law

Last year, the court in HathiTrust ruled that library digitization for the non-expressive use of text mining and the expressive use of providing access to the visually disabled was fair use. Today’s decision in Authors Guild v. Google supports both of those conclusions; it further holds that the use of snippets of text in search results is also fair use. The court noted that displaying snippets of text as search results is similar to the display of thumbnail images of photographs as search results and that these snippets may help users locate books and determine whether they may be of interest.

The judgment clarifies something that confuses a lot of people: the difference between “snippet” views on Google Books and more extensive document previews. Google has scanned over 20 million library books to create its search engine, mostly without permission. However, Google has agreements with thousands of publishers and authors who authorize it to make far more extensive displays of their works, presumably because these authors and publishers understand that even greater exposure on Google Books will further drive sales.

The court was not convinced that Google Books poses any threat of expressive substitution because, although it is a powerful tool for learning about books individually and collectively, “it is not a tool to be used to read books.”

The Authors Guild had attempted to show that an accumulation of individual snippets could substitute for books, but the court found otherwise: the kind of accumulation of snippets that the plaintiffs were suggesting was both technically infeasible because of certain security measures and, perhaps more importantly, was bizarre and unlikely: “Nor is it likely that someone would take the time and energy to input countless searches to try and get enough snippets to comprise an entire book.  Not only is that not possible as certain pages and snippets are blacklisted, the individual would have to have a copy of the book in his possession already to be able to piece the different snippets together in coherent fashion.”

Significance

Today’s decision is an important victory for Google and the entire United States technology sector; it also confirms the recent victory of libraries, academics and the visually disabled in Authors Guild v. HathiTrust.

Unless today’s decision is overruled by the Second Circuit or the Supreme Court (something I personally think is very unlikely), it is now absolutely clear that technical acts of reproduction that facilitate purely non-expressive uses of copyrighted works such as books, manuscripts and webpages do not infringe United States copyright law. This means that copy-reliant technologies including plagiarism detection software, caching, search engines and data mining more generally now stand on solid legal ground in the United States. Copyright law in the majority of other nations does not provide the same kind of flexibility for new technology.

All in all, an excellent result.

* Updated at 4.57pm. The initial draft of this post contained several dictation errors which I will now endeavor to correct. My apologies. Updated at 5.17pm with additional links and minor edits.

University of Iowa presentation on copyright, mass digitization and the digital humanities

I am giving a talk today on copyright, mass digitization and the digital humanities at the University of Iowa law school. My talk will focus on the ongoing litigation between the Authors Guild and Google and the separate case of Authors Guild v. HathiTrust. The case against Google began in 2005 shortly after Google launched its ambitious library digitization project. The case against the HathiTrust, a digital library that pulls together the resources of a number of American universities, began much later in September 2011.

These cases raise complicated issues about standing, the scope of class actions, statutory interpretation, the interaction of general and specific limitations and exceptions to copyright under the Copyright Act of 1976, and probably a few others besides. However, at the heart of both cases is actually a very simple question — does copying for non-expressive use require the express approval of the copyright owner?

A non-expressive use is one which involves some technical act of copying, but for which the resultant copy is not read by any human being. For example, checking work for plagiarism involves comparing the suspect work against a database of potential sources. It is certainly valuable to know that work A is suspiciously like work B, but that knowledge is entirely independent of the expressive value of either of the underlying works.
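To make the plagiarism example concrete, here is a minimal sketch of one common way to compare a suspect work against a set of sources: break each text into overlapping three-word sequences ("shingles") and measure how many the two texts share (Jaccard similarity). This is an assumption about how such a comparison could be done, not a description of any particular plagiarism detection product; the function names and sample texts are hypothetical.

```python
def shingles(text, n=3):
    """Return the set of n-word sequences (shingles) in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def most_similar(suspect, sources, n=3):
    """Return (source name, score) for the source most similar to the suspect."""
    suspect_shingles = shingles(suspect, n)
    scores = {name: jaccard(suspect_shingles, shingles(text, n))
              for name, text in sources.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Invented database of potential sources and an invented suspect text.
SOURCES = {
    "work_b": "it was the best of times it was the worst of times",
    "work_c": "call me ishmael some years ago never mind how long",
}
suspect = "truly it was the best of times it was the worst of days"
print(most_similar(suspect, SOURCES))
```

The output is a name and a number. The comparison requires copying every source into the database, yet nothing in the result conveys the expression of any of them.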

Non-expressive use was not a particularly pressing concern before the digital era – from the printing press to the photocopier, the only plausible reason to copy a work was in anticipation of reading it. In the present, however, scanning technology, computer processing power and powerful software tools make it possible to crunch the numbers on the written word in all sorts of remarkable ways. The non-expressive use that most people will be familiar with relates to Internet search engines. Search engines direct users to sites of interest based on a complicated set of algorithms, but underlying those algorithms is an extraordinary database describing the contents of billions of individual webpages. Building such a database requires copying and indexing billions of individual webpages.
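The database behind a search engine is, at its simplest, an inverted index: a table that maps each word to the places where it occurs. The sketch below is a toy version under my own assumptions (two invented "pages", whitespace tokenization, hypothetical function names); real search engines are vastly more elaborate. It does show why every page has to be copied and processed in full before a single query can be answered.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to a dict of {page id: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for page_id, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word][page_id].append(position)
    return index

def search(index, word):
    """Return {page id: positions} for a word, or an empty dict if absent."""
    return {page: list(pos) for page, pos in index.get(word.lower(), {}).items()}

# Invented pages standing in for billions of webpages.
PAGES = {
    "page1": "copyright protects expression not ideas",
    "page2": "ideas are free as the air",
}
index = build_index(PAGES)
print(search(index, "ideas"))
```

A query returns locations only: which page, and where on it. The reader still has to go to the page to read it.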

Authors Guild v. Google will determine whether it was legitimate for Google to extend its Internet search model to the off-line world and apply it to paper-based works which had never been digitized. However, the significance of this case goes well beyond building a better library catalog, although the importance of that should not be casually dismissed. Authors Guild v. Google and Authors Guild v. HathiTrust will shape the future of the digital humanities. If the District Court ruling in HathiTrust stands, as I believe it should, academics who wish to combine data science and a love of literature will not be shackled to the pre-1923 public domain. They will be able to apply the same analytical techniques to the works of William Faulkner as to those of William Shakespeare. More importantly, distant reading empowered by computational analysis will allow scholars to extend their gaze beyond a narrow literary canon or even the few thousand works that most of us can hope to read in our lifetime and address questions on a broader scale.

Slides are available here: Copyright and Mass Digitization, Iowa 2013

Archives & Copyright: Developing An Agenda For Reform starts tomorrow #dh #archivescopyright

Archives & Copyright: Developing An Agenda For Reform

This is a one-day symposium, co-organised by CREATe and the Wellcome Library. The symposium considers forthcoming changes to the copyright regime in the UK as it impacts the work of archives, as well as the role that risk-management plays in copyright compliance for archival digitization projects.

I will be speaking on a panel along with Professors Peter Jaszi and Peter Hirtle. We will discuss how cultural heritage institutions in the US work with copyright law, and in particular the ongoing Authors Guild v. HathiTrust case (currently on appeal).

I plan to talk about my experience bringing together (along with Jason Schultz and Matthew Jockers) the digital humanities amicus briefs for Authors Guild v. HathiTrust I and II and Authors Guild v. Google. My slides are available right here.

The hashtag for the symposium is #archivescopyright

A Collection of Briefs in Authors Guild v. HathiTrust

I have collected all the briefs in Authors Guild v. Hathitrust for anyone who is interested.

The leading number refers to the court docket. There are some briefs in support of the plaintiffs, but the majority are in support of the defendants.

You can download the whole set as a zip file (26 MB) here: AG v. Ht Appeal Briefs as filed 2013 …

Or individually from the links below: