When Matthew Jockers, Jason Schultz and I were writing the Digital Humanities Amicus Briefs in the Google Books and HathiTrust cases, we searched for an illustration that would concisely explain why data mining expressive works is (a) socially valuable and (b) no threat to the copyright interests of the authors of the underlying works. We came across a graph produced using the Google Ngram tool that fit the bill perfectly. The graph below was part of the Digital Humanities Amicus Brief in both the HathiTrust and Google Books cases.
This graph is a reconstruction of data generated using Google Ngram, sampled at five-year intervals. The y-axis is scaled to 1/100,000 of a percent, such that 1 = 0.00001%.
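For readers who want to reproduce the scaling described in the caption, here is a minimal Python sketch of the two steps: keeping one data point every five years, and converting a frequency into the chart's units. It assumes the input frequencies are expressed as percentages; the function names and the `start_year` parameter are my own illustrative choices, not part of any Google tool, and the sketch does not reproduce the brief's actual data.

```python
import math

# From the caption: 1 unit on the y-axis equals 0.00001 percent.
PERCENT_PER_CHART_UNIT = 0.00001


def to_chart_units(percent):
    """Convert a frequency given as a percentage into y-axis units.

    Assumption: the input is already a percentage (0.00002 means 0.00002%).
    """
    return percent / PERCENT_PER_CHART_UNIT


def sample_every_five_years(yearly, start_year):
    """Keep only the entries that fall on a five-year step from start_year.

    `yearly` is a dict mapping a year (int) to a frequency.
    """
    return {
        year: value
        for year, value in sorted(yearly.items())
        if year >= start_year and (year - start_year) % 5 == 0
    }


# A frequency of exactly 0.00001% is one unit on the chart.
assert math.isclose(to_chart_units(0.00001), 1.0)
```

If the frequencies are stored as plain fractions rather than percentages, multiply them by 100 before calling `to_chart_units`.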
The graph was cited by the District Court in Authors Guild v. HathiTrust and again in last week’s decision in Authors Guild v. Google. As we explained in our brief, “[the figure] compares the frequency with which authors of texts in the Google Book Search database refer to the United States as a single entity (‘is’) as opposed to a collection of individual states (‘are’). As the chart illustrates, it was only in the latter half of the Nineteenth Century that the conception of the United States as a single, indivisible entity was reflected in the way a majority of writers referred to the nation. This is a trend with obvious political and historical significance, of interest to a wide range of scholars and even to the public at large. But this type of comparison is meaningful only to the extent that it uses as raw data a digitized archive of significant size and scope.”
Metadata like this can only be collected by digitizing the entire contents of books, and it clearly does not communicate any author’s original expression to the reading public.
I decided that the graph deserved its own post.