Matthew Jockers explains why you can’t read a book through snippets

The Authors Guild’s war on search engines, text-mining and academic research is in its final throes. Over the last two years, two different US Federal District Courts have held that library digitization for the purpose of building a search index and running a search engine is fair use. See Authors Guild v. HathiTrust, 902 F. Supp. 2d 445 (S.D.N.Y. 2012) and Authors Guild v. Google, 954 F. Supp. 2d 282 (S.D.N.Y. 2013). The HathiTrust decision was upheld on appeal on June 10 this year (Authors Guild v. HathiTrust, 2nd Circuit 2014), and the parties and interested amici are gearing up for a final showdown in the appeal of Authors Guild v. Google.

In its latest legal salvo, the Guild argues, by repeated assertion, that the text snippets Google displays to users allow 78% of the contents of any book to be reconstructed (e.g., at p.10: “The scanning process resulted in an index that contains the complete text of all the books copied in the Library Project.”).

My sometime co-author and accomplished Digital Humanities researcher, Matthew Jockers, tested out the Guild’s claims on his own book and … it turns out that you can’t read a book through snippets, unless you already have the book, and that even then it takes about 30 minutes to trick the search engine into giving you the next 100 words beyond the free-view.

As Matt explains:

“Reading 78% of my book online, as the Guild asserts, requires that the reader anticipate what words will appear in the concealed sections of the book.”

He concludes:

“Well, this certainly is reading Macroanalysis the hard way. I’ve now spent 30 minutes to gain access to exactly 100 words beyond what was offered in the initial preview. And, of course, my method involved having access to the full text! Without the full text, I don’t think such a process of searching and reading is possible, and if it is possible, it is certainly not feasible!

But let’s assume that a super savvy text pirate, with extensive training in English language syntax could guess the right words to search and then perform at least as well as I did using a full text version of my book as a crutch. My book contains, roughly, 80,000 words. Not counting the ~5k offered in the preview, that leaves 75,000 words to steal. At a rate of 200 words per hour, it would take this super savvy text pirate 375 hours to reconstruct my book. That’s about 47 days of full-time, eight-hour work.”
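
The arithmetic behind that estimate is easy to check. Here is a minimal Python sketch that uses only the figures Matt gives above (100 words recovered in 30 minutes, roughly 80,000 words in the book, about 5,000 in the preview, an eight-hour working day). The function and variable names are illustrative and do not come from his post.

```python
# Figures taken from the quoted estimate; all names here are illustrative.
TOTAL_WORDS = 80_000       # approximate length of the book
PREVIEW_WORDS = 5_000      # approximate size of the free preview
WORDS_RECOVERED = 100      # words obtained in the experiment
MINUTES_SPENT = 30         # time the experiment took
HOURS_PER_WORKDAY = 8      # "full-time, eight-hour work"


def reconstruction_estimate(total_words, preview_words,
                            words_recovered, minutes_spent,
                            hours_per_workday):
    """Return (words per hour, total hours, working days) for the estimate."""
    words_per_hour = words_recovered * 60 / minutes_spent
    remaining_words = total_words - preview_words
    hours = remaining_words / words_per_hour
    days = hours / hours_per_workday
    return words_per_hour, hours, days


rate, hours, days = reconstruction_estimate(
    TOTAL_WORDS, PREVIEW_WORDS, WORDS_RECOVERED,
    MINUTES_SPENT, HOURS_PER_WORKDAY,
)
print(f"{rate:.0f} words/hour, {hours:.0f} hours, about {round(days)} working days")
```

With these inputs the sketch reproduces the numbers in the quote: 200 words per hour, 375 hours, and about 47 working days. It also inherits the quote's generous assumption that the pirate guesses search terms as well as someone holding the full text.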

Matt’s book is Macroanalysis: Digital Methods and Literary History and, as seen in the screenshot I just made of Google Books, you can buy the eBook version, linked to from the Google Books web page, for $14.95.

