The Multiple Copy Argument – Some thoughts on #fairuse and Authors Guild v. HathiTrust (pt4)

Introduction and Necessary Disclaimer

This is one of a series of posts concerning the Authors Guild v. HathiTrust case. Specifically, these posts take the form of commentary on the Authors Guild Appeal Brief (February 25, 2013). The views expressed on this site are purely my own.

Today’s topic …

The Multiple Copy Argument

The Authors Guild Appeal Brief contains an interesting argument that is hard to summarize with perfect fidelity because it appears in so many places throughout the document (illustrations to follow). Essentially, the plaintiffs now appear to argue that even if some copying would be allowed for certain library digitization purposes, the defendants created too many copies and that these copies, or their retention, exceed the parameters of any fair use claim.

Examples from the Authors Guild Appeal Brief

The multiple copy argument first appears in the plaintiffs’ Statement of Issues Presented:

“3. Did the District Court err by failing to recognize that the Libraries’ online storage of multiple copies of the unauthorized digital library goes far beyond what is necessary to accomplish any transformative purpose of the MDP?” (Authors Guild Ap. Br. page 4)

However, it also appears on pages 8, 9-10, 12, 18, 30, 31, 32, 33, 36, 37 and 38.

“Each digital replica would include a set of image files representing every page of the work and a text file of the book’s words generated through an optical character recognition process.” (Authors Guild Ap. Br. page 8)

“the Libraries receive their own digital copies of the works to store and use.” (Authors Guild Ap. Br. page 8)

“In addition to the copies retained by Google, four digital copies of each book are maintained in the HDL, with two such copies stored on servers located in Michigan and Indiana and two additional copies stored on backup tapes.” (Authors Guild Ap. Br. 9-10)

“Moreover, even if certain of the Libraries’ uses are deemed transformative, their online storage of multiple digital duplicates of the books goes far beyond what is necessary to fulfill that purpose.” (Authors Guild Ap. Br. 12)

“[I]n analyzing whether the Mass Digitization Program is fair use under Section 107, the District Court failed to consider whether the Libraries could have made the uses the court found to be transformative – facilitating search and access for the print-disabled – without keeping multiple copies of the Authors’ works online and subjecting them to unauthorized access and widespread distribution.” (Authors Guild Ap. Br. 18)

“Moreover, to the extent that there is any transformative or other legitimate purpose to the Libraries’ actions, the making of multiple copies of the works and then storing the full text and image files online where they are susceptible to theft and widespread distribution goes far beyond what is needed to satisfy such purpose.” (Authors Guild Ap. Br. 30)

“the District Court erred by failing to recognize that the Libraries are able to facilitate text searching and to provide access to the print-disabled without creating and storing so many digital copies online.” (Authors Guild Ap. Br. page 31)

“(ii) Even if Copying Millions of Books to Facilitate Search is Transformative, There is No Justification for Storing Multiple Copies of the Image and Text Files Online” … “The Authors maintain, as they did below, that the Libraries have no right to copy and use millions of books without authorization or payment. If the Libraries want to scan print books in order to create indices or to facilitate text mining or other research tools, they should be required to ask for and obtain permission for their copying. But more importantly for purposes of this appeal, to the extent that any of the Libraries’ goals fit within the rubric of fair use, the Libraries should be permitted to do no more than is necessary to accomplish that particular purpose.” (Authors Guild Ap. Br. 32)

“Moreover, unlike HathiTrust’s perpetual storage of high resolution image files and text files of every book, the Web pages copied by a search engine are incidental to the search function.” (Authors Guild Ap. Br. page 33)

“[O]nce a book’s text is recorded in the index, the image and text files are no longer necessary for the operation of the search engine.” (Authors Guild Ap. Br. page 37)

“[E]ven if it is necessary to digitize an entire work in order to index the contents for facilitating search, the third factor weighs heavily against the Libraries because they are unnecessarily retaining complete image and text files comprising every page of every book.” (Authors Guild Ap. Br. page 36)

Some thoughts on the Multiple Copy Argument

It is entirely plausible that a plaintiff might look at a defendant who has made lots and lots of copies and argue that the very multiplicity of the copying is evidence that the real purpose was not the transformative use claimed, but some other use. For example, if Borders (1971-2011) had scanned its whole inventory and made 60,000 copies of the collection in DVD bundles, we might have begun to suspect they were planning on selling them.

However, in the context of the library digitization being litigated in Authors Guild v. HathiTrust, there is no similar mystery about the extent of copying. The libraries maintain the original scan images because those images are needed to quality-check the OCR (optical character recognition) text versions. The images are also needed so that the collection can be re-digitized when, inevitably, someone invents a smarter OCR program that is less prone to error. A biologist would not throw out an original specimen after taking their initial notes; a social scientist would not delete her original data after running her initial set of regressions. It would be somewhere between reckless and crazy to throw out the original scans.

The same applies to the OCR-text files. It might be true that once you create a search index you don’t need the original text files to actually implement search. But as anyone with any experience in software development or working with data will tell you, there are always new and better ways to process information. It would be hubris, almost a crime against knowledge, to pretend that search indexing or optical character recognition in 2013 are as good as they will ever be.

The Authors Guild Appeal Brief appears (to me) to be deliberately obtuse when it says “… even if this Court were to hold that HathiTrust in its current configuration satisfies these criteria, the Libraries still have not demonstrated their need to retain the digital image files in order to facilitate access to the print-disabled, as the assistive technology uses text files to convert the text from the book into speech.” (Authors Guild Ap. Br. page 38). Does the Authors Guild seriously intend that the print-disabled should be held hostage to the state of the art in OCR and text-to-speech as of 2013?

Any library digitization exercise should generate a handful of copies per book – you have to keep the original image and OCR files safe; you have to duplicate them so people can examine them; you have to store everything in multiple locations in case of flood, fire, terrorist attack or simple human error, and if scientists are regularly testing new equations against the original data you might need to mirror some of that data to increase the speed of the network. There is no reason why the universities should treat these digitized files any more cavalierly than Facebook treats the 267 photos of my dog I have posted to the social network.