Call for signatories: Digital Humanities Amicus in Authors Guild v. Google

Matthew Jockers, Jason Schultz and I have written an amicus brief in the upcoming Court of Appeals round of Authors Guild v. Google, Inc.

Download the draft here: DH Amicus AG v Google CA2

Background

Since we started working on this project just over two years ago, two district courts and the Court of Appeals for the Second Circuit have rejected the Authors Guild’s attacks on library digitization and the legality of text-mining. We are confident that the Second Circuit will uphold Judge Chin’s decision last year in which he rejected (on a motion for summary judgment) the Authors Guild’s copyright infringement claim against Google over its Google Book Search product. The rulings in Authors Guild v. Google and the parallel case of Authors Guild v. HathiTrust are a critical moment in the fight to define fair use for the Digital Humanities. In Authors Guild v. Google, Judge Chin expressly based his ruling in part on the fact that

“Google Books … has transformed book text into data for purposes of substantive research, including data mining and text mining in new areas, thereby opening up new fields of research. Words in books are being used in a way they have not been used before. Google Books has created something new in the use of book text — the frequency of words and trends in their usage provide substantive information.”
In his decision, Judge Chin cites the Brief of Digital Humanities and Law Scholars as Amici Curiae that we submitted on behalf of more than 100 researchers and scholars last year. Chin wrote that
“Google Books permits humanities scholars to analyze massive amounts of data — the literary record created by a collection of tens of millions of books.”

The Authors Guild is now appealing Judge Chin’s decision (on this and other grounds).  A different panel of that same court has already upheld the decision in Authors Guild v. HathiTrust. We believe that these cases will have a dramatic effect on research in fields ranging from computer science to linguistics, history, literature, and the digital humanities.

Argument in a nutshell

According to the U.S. Constitution, the purpose of copyright is “To promote the Progress of Science and useful Arts”. Copyright law should not be an obstacle to statistical and computational analysis of the millions of books owned by university libraries. Copyright law has long recognized the distinction between protecting an author’s original expression and the public’s right to access the facts and ideas contained within that expression. That distinction must be maintained in the digital age so that library digitization, internet search and related non-expressive uses of written works remain legal.

What can you do?

If you are a legal academic or student, or an academic or researcher who would be affected by this issue, you can help preserve the balance of copyright law by joining our brief as a signatory (we need your name and affiliation, e.g., Associate Professor Jane Doe, Springfield University).

Does this concern you?

If you are still reading this post, the answer is probably YES.  We are collecting signatures from a wide range of fields, including computer science, English, history, law, linguistics, and philosophy. We need your name and affiliation by July 9, 2014. Please enter your details directly via this online tool:

https://docs.google.com/forms/d/1QSA_fUSaRpw47wwRcXh0SXkZFx1NQ2NbjhBbfTrICnA/viewform?usp=send_form

Please feel free to share this invitation with other interested academics and PhD students.

Thank you!

Matthew Jockers explains why you can’t read a book through snippets

The Authors Guild’s war on search engines, text-mining and academic research is in its final throes. Over the last two years, two different US Federal District Courts have held that library digitization for the purpose of building a search index and running a search engine is fair use. See Authors Guild v. HathiTrust, 902 F. Supp. 2d 445 (S.D.N.Y. 2012) and Authors Guild v. Google, 954 F. Supp. 2d 282 (S.D.N.Y. 2013). The HathiTrust decision was upheld on appeal on June 10 this year (Authors Guild v. HathiTrust, 2d Cir. 2014), and the parties and interested amici are gearing up for a final showdown in the appeal of Authors Guild v. Google.

In the Guild’s latest legal salvo it argues – by repeated assertion – that the text snippets Google displays to users allow 78% of the contents of any book to be reconstructed. (e.g., at p.10 “The scanning process resulted in an index that contains the complete text of all the books copied in the Library Project.”)

My sometime co-author and accomplished Digital Humanities researcher, Matthew Jockers, tested out the Guild’s claims on his own book and … it turns out that you can’t read a book through snippets, unless you already have the book, and that even then it takes about 30 minutes to trick the search engine into giving you the next 100 words beyond the free-view.

As Matt explains:

“Reading 78% of my book online, as the Guild asserts, requires that the reader anticipate what words will appear in the concealed sections of the book.”

He concludes:

“Well, this certainly is reading Macroanalysis the hard way. I’ve now spent 30 minutes to gain access to exactly 100 words beyond what was offered in the initial preview. And, of course, my method involved having access to the full text! Without the full text, I don’t think such a process of searching and reading is possible, and if it is possible, it is certainly not feasible!

But let’s assume that a super savvy text pirate, with extensive training in English language syntax could guess the right words to search and then perform at least as well as I did using a full text version of my book as a crutch. My book contains, roughly, 80,000 words. Not counting the ~5k offered in the preview, that leaves 75,000 words to steal. At a rate of 200 words per hour, it would take this super savvy text pirate 375 hours to reconstruct my book. That’s about 47 days of full-time, eight-hour work.”
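Matt’s back-of-the-envelope arithmetic is easy to verify. Here is a minimal sketch of the calculation, using his figures (roughly 80,000 words, a ~5,000-word preview, 200 recovered words per hour) rather than any measurements of my own:

    # Rough check of the estimate in Matt's post (his figures, not mine).
    total_words = 80000    # approximate length of Macroanalysis
    preview_words = 5000   # roughly what the Google Books preview already shows
    words_per_hour = 200   # rate at which snippet-guessing recovered new text

    remaining = total_words - preview_words   # 75,000 words left to "steal"
    hours = remaining / words_per_hour        # 375 hours
    workdays = hours / 8                      # about 47 eight-hour days
    print(remaining, hours, round(workdays))  # 75000 375.0 47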

Matt’s book is Macroanalysis: Digital Methods and Literary History and — as seen on the screen shot I just made of Google Books — you can buy the eBook version, linked to from the Google Books web page, for $14.95.

[Screenshot: the Google Books page for Macroanalysis, showing the $14.95 eBook link]

Authors Guild v. HathiTrust — Libraries 3 : Authors Guild 0

The Second Circuit Court of Appeals has upheld the most important parts of the District Court decision in Authors Guild v. HathiTrust. Here is a link to the decision: AGvHathiTrust_CA2_2013.

Along with the district court decision in this case and the one in Authors Guild v. Google, this makes the current score Libraries 3 : Authors Guild 0.

The decision confirms that library digitization (as performed by Google in conjunction with the University of Michigan, University of Illinois and many others) does not infringe copyright if it is done for the purpose of allowing blind and visually disabled people to read books. 

Access to the Print-Disabled
The HDL also provides print‐disabled patrons with versions of all of the works contained in its digital archive in formats accessible to them. In order to obtain access to the works, a patron must submit documentation from a qualified expert verifying that the disability prevents him or her from reading printed materials, and the patron must be affiliated with an HDL member that has opted‐into the program. Currently, the University of Michigan is the only HDL member institution that has opted‐in. We conclude that  this use is also protected by the doctrine of fair use.

The decision confirms that library digitization does not infringe copyright if it is done for the purpose of text-mining or creating a search engine. This is the core of the non-expressive use argument that Matthew Jockers, Jason Schultz and I made in the Digital Humanities Amicus Brief (http://ssrn.com/abstract=2274832). That brief was joined by over 100 professors and scholars who teach, write, and research in computer science, the digital humanities, linguistics or law, and two associations that represent Digital Humanities scholars generally.

The crux of our argument was that mass digitization of books for text-mining purposes is a form of incidental or “intermediate” copying that enables ultimately non-expressive, non-infringing, and socially beneficial uses without unduly treading on any expressive—i.e., legally cognizable—uses of the works. The Court of Appeals appears to have agreed.

Full-Text Search
It is not disputed that, in order to perform a full‐text search of books, the Libraries must first create digital copies of the entire books. Importantly, as we have seen, the HDL does not allow users to view any portion of the books they are searching. Consequently, in providing this service, the HDL does not add into circulation any new, human‐readable copies of any books. Instead, the HDL simply permits users to “word search”—that is, to locate where specific words or phrases appear in the digitized books. Applying the relevant factors, we conclude that this use is a fair use.

The Court left itself some room to maneuver if it turns out that, for some reason, digitization for non-expressive uses like text mining causes unforeseen harm in different circumstances. For example, a digitization project that did not bother with any kind of security might not be fair use.

Without foreclosing a future claim based on circumstances not now predictable, and based on a different record, we hold that the balance of relevant factors in this case favors the Libraries. In sum, we conclude that the doctrine of fair use allows the Libraries to digitize copyrighted works for the purpose of permitting full‐text searches.

With that appropriate caveat, this is a great win for humanity and for the Digital Humanities.

I am proud to have played my small part in this case over the years.

Google Books held to be fair use

Authors Guild v. Google: library digitization as fair use vindicated, again.

After more than eight years of litigation, the legality of the Google Books Search engine has finally been vindicated.


Authors Guild v Google Summary Judgement (Nov. 14, 2013)

The heart of the decision

The key to understanding Authors Guild v. Google is not in the court’s explanation of any of the individual fair use factors — although there is a great deal here for copyright lawyers to mull over —  but rather in the court’s description of its overall assessment of how the statutory factors should be weighed together in light of the purposes of copyright law.

“In my view, Google Books provides significant public benefits. It advances the progress of the arts and sciences, while maintaining respectful consideration for the rights of authors and other creative individuals, and without adversely impacting the rights of copyright holders. It has become an invaluable research tool that permits students, teachers, librarians, and others to more efficiently identify and locate books. It has given scholars the ability, for the first time, to conduct full-text searches of tens of millions of books. It preserves books, in particular out-of-print and old books that have been forgotten in the bowels of libraries, and it gives them new life. It facilitates access to books for print-disabled and remote or underserved populations. It generates new audiences and creates new sources of income for authors and publishers. Indeed, all society benefits.”  (Authors Guild v. Google, p.26)

Even before last year’s HathiTrust decision (Authors Guild v. Hathitrust), the case law on transformative use and market effect was stacked in Google’s favor. Nonetheless, Judge Chin’s rulings in other cases (e.g. WNET, THIRTEEN v. Aereo, Inc.) suggest that he takes the rights of copyright owners very seriously and that it was essential to persuade him that Google was not merely evading the rights of authors through clever legal or technological structures. The court’s conclusion that the Google Library Project “advance[d] the progress of the arts and sciences, while maintaining respectful consideration for the rights of authors and other creative individuals, and without adversely impacting the rights of copyright holders” pervades all of its more specific analysis.

Data mining, text mining and digital humanities

An entire page of the judgment is devoted to explaining how digitization enables data mining. This discussion relies substantially on the Brief of Digital Humanities and Law Scholars as Amici Curiae, signed by over 100 academics last year.

“Second, in addition to being an important reference tool, Google Books greatly promotes a type of research referred to as “data mining” or “text mining.”  (Br. of Digital Humanities and Law Scholars as Amici Curiae at 1 (Doc. No. 1052)).  Google Books permits humanities scholars to analyze massive amounts of data — the literary record created by a collection of tens of millions of books.  Researchers can examine word frequencies, syntactic patterns, and thematic markers to consider how literary style has changed over time.  …

Using Google Books, for example, researchers can track the frequency of references to the United States as a single entity (“the United States is”) versus references to the United States in the plural (“the United States are”) and how that usage has changed over time.  (Id. at 7).  The ability to determine how often different words or phrases appear in books at different times “can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology.”  Jean-Baptiste Michel et al., Quantitative Analysis of Culture Using Millions of Digitized Books, 331 Science 176, 176 (2011) (Clancy Decl. Ex. H)” (Authors Guild v. Google, p.9-10)

The court held that Google Books was “[transformative] in the sense that it has transformed book text into data for purposes of substantive research, including data mining and text mining in new areas, thereby opening up new fields of research. Words in books are being used in a way they have not been used before. Google Books has created something new in the use of book text — the frequency of words and trends in their usage provide substantive information.”
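The phrase-frequency comparison the court describes can be illustrated with a toy sketch. The corpus below is a handful of invented sentences, offered purely for illustration; it is nothing like Google’s actual pipeline or data:

    # Toy illustration of the "the United States is" vs. "the United States are" comparison.
    # The corpus is a few invented sentences, not real Google Books data.
    corpus = {
        1865: "the united states are a federal union of many states",
        1900: "the united states is a rising industrial power",
        1950: "the united states is a global power with worldwide interests",
    }

    for year, text in sorted(corpus.items()):
        singular = text.count("the united states is")
        plural = text.count("the united states are")
        print(year, "is:", singular, "are:", plural)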

A snippet of new law

Last year, the court in HathiTrust ruled that library digitization for the non-expressive use of text mining and the expressive use of providing access to the visually disabled was fair use. Today’s decision in Authors Guild v. Google supports both of those conclusions; it further holds that the use of snippets of text in search results is also fair use. The court noted that  displaying snippets of text as search results is similar to the display of thumbnail images of photographs as search results and that these snippets may help users locate books and determine whether they may be of interest.

The judgment clarifies something that confuses a lot of people — the difference between “snippet” views on Google Books and more extensive document previews. Google has scanned over 20 million library books to create its search engine, mostly without permission. However, Google has agreements with thousands of publishers and authors who authorize it to make far more extensive displays of their works – presumably because these authors and publishers understand that even greater exposure on Google Books will further drive sales.

The court was not convinced that Google Books poses any threat of expressive substitution because, although it is a powerful tool for learning about books individually and collectively, “it is not a tool to be used to read books.”

The Authors Guild had attempted to show that an accumulation of individual snippets could substitute for books, but the court found otherwise: the kind of accumulation of snippets that the plaintiffs were suggesting was both technically infeasible because of certain security measures and, perhaps more importantly, was bizarre and unlikely: “Nor is it likely that someone would take the time and energy to input countless searches to try and get enough snippets to comprise an entire book.  Not only is that not possible as certain pages and snippets are blacklisted, the individual would have to have a copy of the book in his possession already to be able to piece the different snippets together in coherent fashion.”

Significance

Today’s decision is an important victory for Google and the entire United States technology sector; it also confirms the recent victory for libraries, academics and the visually disabled in Authors Guild v. HathiTrust.

Unless today’s decision is overruled by the Second Circuit or the Supreme Court — something I personally think is very unlikely — it is now absolutely clear that technical acts of reproduction that facilitate purely non-expressive uses of copyrighted works such as books, manuscripts and webpages do not infringe United States copyright law. This means that copy-reliant technologies including plagiarism detection software, caching, search engines and data mining more generally now stand on solid legal ground in the United States. Copyright law in the majority of other nations does not provide the same kind of flexibility for new technology.

All in all, an excellent result.

* Updated at 4.57pm. The initial draft of this post contained several dictation errors which I will now endeavor to correct. My apologies. Updated at 5.17pm with additional links and minor edits. 


University of Iowa presentation on copyright, mass digitization and the digital humanities

I am giving a talk today on copyright, mass digitization and the digital humanities at the University of Iowa law school. My talk will focus on the ongoing litigation between the Authors Guild and Google and the separate case of Authors Guild v. HathiTrust. The case against Google began in 2005 shortly after Google launched its ambitious library digitization project. The case against the HathiTrust, a digital library that pulls together the resources of a number of American universities, began much later in September 2011.

These cases raise complicated issues about standing, the scope of class actions, statutory interpretation, the interaction of general and specific limitations and exceptions to copyright under the Copyright Act of 1976, and probably a few others besides. However, at the heart of both cases is actually a very simple question — does copying for non-expressive use require the express approval of the copyright owner?

A non-expressive use is one that involves some technical act of copying for which the resultant copy is not read by any human being. For example, checking work for plagiarism involves comparing the suspect work against a database of potential sources. It is certainly valuable to know that work A is suspiciously like work B, but that knowledge is entirely independent of the expressive value of either of the underlying works.

Non-expressive use was not a particularly pressing concern before the digital era – from the printing press to the photocopier, the only plausible reason to copy a work was in anticipation of reading it. In the present, however, scanning technology, computer processing power and powerful software tools make it possible to crunch the numbers on the written word in all sorts of remarkable ways. The non-expressive use that most people will be familiar with relates to Internet search engines. Search engines direct users to sites of interest based on a complicated set of algorithms, but underlying those algorithms is an extraordinary database describing the contents of billions of individual webpages. Building that database requires copying and indexing billions of individual webpages.
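To make the idea concrete, here is a minimal sketch of the kind of index a search engine builds: a toy inverted index, assumed for illustration only and not any real search engine’s code. Building it requires copying and processing the full text of every document, but answering a query reveals only where a term appears, not the expressive content of the works.

    from collections import defaultdict

    # Two invented "documents"; building the index requires reading every word of each.
    documents = {
        "book_a": "call me ishmael some years ago never mind how long precisely",
        "book_b": "it was the best of times it was the worst of times",
    }

    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.split()):
            index[word].append((doc_id, position))

    # A "word search" in the HathiTrust sense: where does the term appear?
    print(index["times"])  # [('book_b', 5), ('book_b', 11)]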

Authors Guild v. Google will determine whether it was legitimate for Google to extend its Internet search model to the off-line world and apply it to paper-based works which had never been digitized. However, the significance of these cases goes well beyond building a better library catalog — although the importance of that should not be casually dismissed — Authors Guild v. Google and Authors Guild v. HathiTrust will shape the future of the digital humanities. If the District Court ruling in HathiTrust stands, as I believe it should, academics who wish to combine data science and a love of literature will not be shackled to the pre-1923 public domain. They will be able to apply the same analytical techniques to the works of William Faulkner as to those of William Shakespeare. More importantly, distant reading empowered by computational analysis will allow scholars to extend their gaze beyond a narrow literary canon, or even the few thousand works most of us can hope to read in our lifetimes, and address questions on a broader scale.

Slides are available here: Copyright and Mass Digitization, Iowa 2013


No adjournment in Authors Guild v. Google. Oral argument is set for Sept. 23, 2013

In an order dated yesterday, the court denied a request for adjournment of the oral argument, saying:

The date of September 23, 2013 was set on July 8, 2013, more than five weeks ago. The Court will not adjourn oral argument because new counsel in this eight-year old litigation is unavailable on September 23rd because he will be attending a conference on copyright law.

I don’t know who the new counsel for the Authors Guild is or what conference she or he was going to be attending.

Lessig v. Liberation Music is almost too good to be true

Fair use and DMCA takedowns

Lawrence Lessig has written many fine books and articles, played a key role in founding the Creative Commons and worked tirelessly to promote the interests of the public in copyright law. And now he has also given us a DMCA takedown case that perfectly illustrates the tension between fair use and the takedown procedure.

Lessig v. Liberation Music Pty Ltd 1:13-cv-12028 (D. Mass. Aug 22, 2013)

Complaint in Lessig_v_Liberation_Music

On June 4, 2010,  Lessig delivered the keynote address at a Creative Commons conference in Seoul, South Korea. In the course of his 49-minute lecture Lessig discussed the present and future of cultural and technological innovation.

The lecture included several clips of amateur music videos in order to illustrate cultural developments in the age of the Internet.

Lisztomania

One set of clips depicts groups of people dancing to the same song, “Lisztomania,” by the band Phoenix. The “Lisztomania” craze began when a YouTube user, called “avoidant consumer,” posted on YouTube a video combining scenes from several movies, with the song “Lisztomania” serving as the soundtrack to the video. Inspired by avoidant consumer’s work, other YouTube users from around the world created their own versions of the video, with real people “performing” the roles of the actors in the original movies, and again with “Lisztomania” as the soundtrack. Lessig’s evident purpose in including these clips in his lecture was to illustrate how young people are using videos and other tools to create and communicate via the Internet.

Lessig taken down

Liberation Music sent a takedown notice to YouTube and then threatened to sue Lessig when he made a counter-notification. Rather than risk statutory damages and banishment from YouTube, Lessig withdrew his counter-notice and filed this complaint instead.

Fair Use and the DMCA notice and takedown process

Is this fair use?

Yes. If the facts as set out in the complaint are true, and I have no reason to doubt them, this is quite obviously fair use.

Lessig’s use was self-evidently highly “transformative” as that term is used in the fair use case law.

The original work exists to entertain music fans; Lessig’s use of small slices of the work was intended to inform and illustrate a broader cultural phenomenon.

Lessig’s use would not have affected the market for or value of the Lisztomania composition or sound recording (a) because it was transformative (see above) and (b) because the five clips used in the lecture were fairly short compared to the original track — they ranged from 10 seconds to 47 seconds.

Is it wrong to file a takedown notice in the face of a compelling fair use defense?

The DMCA provisions that set up the notice and takedown regime (§ 512 of the Copyright Act) do not make any explicit reference to the fair use doctrine. However, for a copyright owner to proceed with “a good faith belief that use of the material in the manner complained of is not authorized by the copyright owner, its agent, or the law,” the copyright owner must determine whether the material makes fair use of the copyright. (See Lenz v. Universal Music Corp., 572 F. Supp. 2d 1150 (N.D. Cal. 2008))

I have no doubt that Lessig will win his declaratory judgment action. But prevailing on the misrepresentation claim under § 512(f) might be harder. He will have to show that the copyright owner acted in bad faith by issuing a takedown notice without “proper consideration” of the fair use doctrine. Case law suggests that this is a subjective, rather than objective, standard.

Assuming that there is no admission by the defendant, the court will have to determine whether this particular claim to fair use was so obviously valid that it merited further consideration before filing the DMCA notice. The assertion by Liberation Music seems so reckless to me that this requirement may well have been met.


Code of Best Practices in Fair Use for Academic and Research Libraries


The Association of Research Libraries (ARL) has just released the Code of Best Practices in Fair Use for Academic and Research Libraries. The Code of Best Practices is intended to work as a clear and easy-to-use statement of fair and reasonable approaches to fair use developed by and for librarians who support academic inquiry and higher education.

What are these Best Practices codes about?

Best Practice statements such as this have been developed over the past decade in relation to classroom teaching, documentary filmmaking, online video, open courseware, media and communications studies, librarianship, poetry, and more. In general, Best Practices statements seek to identify points of strong and general agreement within user communities about the circumstances in which the unauthorized use of copyrighted material is crucial to the fulfillment of that community’s shared artistic or informational mission.

What does a Code of Best Practice Achieve? 

Best Practices are not a form of legal guarantee, but they are an important way for various communities to educate themselves, bring together disparate sources of information, and state a common position. They also enable these communities to educate important third-party stakeholders.

For example, following the development of the Documentary Filmmakers’ Statement of Best Practices in Fair Use in November 2005, every U.S. insurer that provides coverage against “errors and omissions” was willing to offer coverage for films that followed the Best Practices. This, in turn, meant that films that had not been able to obtain copyright clearance but relied on fair use could be picked up for theatrical showing, DVD distribution, and television broadcasting – something that was not possible before the Best Practices. There is ample evidence that filmmakers rely both extensively and successfully on their own Statement of Best Practices, and the same is true of other creative communities that have created such documents for their own collective use.

What does the Code of Best Practices in Fair Use for Academic and Research Libraries do?

I have not read it yet, but taking its authors at their word, the Code deals with such common questions in higher education as:

  • When and how much copyrighted material can be digitized for student use? And should video be treated the same way as print?
  • How can libraries’ special collections be made available online?
  • Can libraries archive websites for the use of future students and scholars?

The Code identifies the relevance of fair use in eight recurrent situations for librarians:

  • Supporting teaching and learning with access to library materials via digital technologies
  • Using selections from collection materials to publicize a library’s activities, or to create physical and virtual exhibitions
  • Digitizing to preserve at-risk items
  • Creating digital collections of archival and special collections materials
  • Reproducing material for use by disabled students, faculty, staff, and other appropriate users
  • Maintaining the integrity of works deposited in institutional repositories
  • Creating databases to facilitate non-consumptive research uses (including search)
  • Collecting material posted on the web and making it available

In the Code, librarians affirm that fair use is available in each of these contexts, providing helpful guidance about the scope of best practice in each.

The development of the Code of Best Practices in Fair Use for Academic and Research Libraries is supported by a grant from The Andrew W. Mellon Foundation. The Code was developed in partnership with the Center for Social Media and the Washington College of Law at American University.

Post Script: 

The wonderful Peter Jaszi has been the driving force behind many of these Best Practices projects. You can read all about it in Patricia Aufderheide and Peter Jaszi, Reclaiming Fair Use (University of Chicago Press, 2011).