All posts by Matthew Sag

About Matthew Sag

Technology enthusiast, law professor, copyright and internet law specialist.

Pornography and copyright trolling, some data.

“the Northern District of Illinois can be proud to be the pornography copyright trolling capital of the United States”


Compared to patent trolls, copyright trolls have received relatively little attention. Admittedly, patent trolls may have broader economic significance, but copyright trolling raises its own unique set of issues that deserve to be addressed. Defining exactly what makes an individual or an organization a troll is inevitably controversial. As initially invoked, the term was meant to be a disparaging caricature. However, over the years, the term patent troll has come to mean something more than merely “someone whose lawsuit is inconvenient to me,” although it is certainly still invoked in that way from time to time.

The paradigmatic patent troll is a nonpracticing entity who asserts patent infringement against companies who have clearly not copied its technology and seeks either (a) to extract a series of nuisance value settlements or (b) to extract a large settlement from a deep pocket (eBay, Microsoft, etc.) that is entirely out of proportion with the contribution of the patented technology. This latter approach has become significantly more difficult after the Supreme Court’s decision in eBay.

The nature of copyright trolling is a reflection of the economics of statutory damages and the principal technologies of infringement. The Electronic Frontier Foundation defines a copyright troll as a person or organization that files a copyright infringement lawsuit against as many defendants as possible for the purposes of extracting the most settlements with the least court costs. Generally, these suits take the form of “Copyright Owner v. John Does 1–1,000” or some other large number. Not all of these suits are related to pornography, but a very large number of them are.

The theory behind these multi-party John Doe lawsuits is that every participant in a BitTorrent swarm is engaged in an act of copyright infringement and that each participant is jointly liable for the resulting infringement. Courts and commentators have formed the distinct impression that such lawsuits are never intended to go to trial – they are simply a mechanism to compel Internet service providers to give the plaintiff the names and addresses that match the IP addresses it already has. With this information in hand, the plaintiff can negotiate hundreds, even thousands, of settlements. Reports indicate that $3,000 is a typical settlement figure. This is a lot to pay for an adult movie, but it’s a small fraction of the potential statutory damages for willful copyright infringement, which could be as high as $150,000 per work infringed. The threat of statutory damages and the threat of exposure and embarrassment drive many settlements.

The Data

Just how widespread is this practice? I examined all copyright cases filed in the federal district courts associated with the Second, Seventh and Ninth Circuits between January 1, 2001 and August 31, 2013. I identified “John Doe” lawsuits by looking for those words in the case title, and I differentiated pornography copyright trolls from other plaintiffs by reviewing at least one underlying complaint per plaintiff. Figure 1, below, breaks the filing data down by state and into three-year periods based on the year of filing, beginning with 2001. This figure shows the prevalence of all “John Doe” actions as a percentage of all copyright filings. The figure highlights the recent growth of “John Doe” lawsuits and their uneven geographic concentration. It’s particularly noteworthy that in 2013 these suits make up the majority of filings in Illinois, Indiana, Washington and Wisconsin.
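The title-matching step described above can be sketched in a few lines. This is a toy illustration, not the actual dataset: the case records and field names are invented.

```python
# A toy sketch of the filtering step: flag "John Doe" suits by case title and
# compute their share of all copyright filings per state. Hypothetical data.

def is_john_doe_suit(case_title):
    """Flag a case as a 'John Doe' suit if the title contains those words."""
    return "john doe" in case_title.lower()

def john_doe_share(cases):
    """Percentage of filings that are John Doe suits, per state."""
    totals, doe_counts = {}, {}
    for case in cases:
        state = case["state"]
        totals[state] = totals.get(state, 0) + 1
        if is_john_doe_suit(case["title"]):
            doe_counts[state] = doe_counts.get(state, 0) + 1
    return {s: 100.0 * doe_counts.get(s, 0) / totals[s] for s in totals}

cases = [
    {"title": "Studio X v. John Does 1-1000", "state": "IL"},
    {"title": "Author A v. Publisher B", "state": "IL"},
    {"title": "Label C v. Distributor D", "state": "NY"},
]
print(john_doe_share(cases))  # {'IL': 50.0, 'NY': 0.0}
```

Keyword matching on case titles is deliberately simple; as noted above, each plaintiff still had to be classified manually by reviewing at least one underlying complaint.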

Figure 1: Percentage of John Doe Lawsuits by State



Figure 2, below, is based on the same data except that it differentiates between pornography related John Doe litigation and other John Doe litigation.

Figure 2: Percentage of John Doe (Pornography) Lawsuits by State



Figure 3, below, is similar to Figure 2 except that it focuses on the last four years. Like the previous figures, it illustrates the prevalence of John Doe lawsuits as a percentage of all copyright lawsuits.

Figure 3: Percentage of John Doe (Pornography) Lawsuits by State 2010–2013



Figure 4, below, shows the raw counts for John Doe pornography and other John Doe copyright litigation.

Figure 4: John Doe (Pornography) Lawsuits by District 2001–2013


I have not examined the data from district courts beyond the Second, Seventh and Ninth Circuits, but for the moment it appears that the Northern District of Illinois can be proud to be the pornography copyright trolling capital of the United States.

United States Is versus United States Are

When Matthew Jockers, Jason Schultz and I were writing the Digital Humanities Amicus Briefs relating to the Google Books and HathiTrust cases, we searched for an illustration that would concisely explain why data mining expressive works was (a) socially valuable and (b) no threat to the copyright interests of the authors of the underlying works. We came across a graph produced using the Google n-gram tool that perfectly fit the bill. The graph below was part of the Digital Humanities Amicus Brief in both the HathiTrust and Google Books cases.



This graph is a reconstruction of data generated using Google Ngram, sampled at five-year intervals. The y-axis is scaled to 1/100,000 of a percent, such that 1 = 0.00001%.

The graph was referred to by the District Court in Authors Guild v. HathiTrust and in last week’s decision in Authors Guild v. Google. As we explained in our brief, “[the figure] compares the frequency with which authors of texts in the Google Book Search database refer to the United States as a single entity (“is”) as opposed to a collection of individual states (“are”). As the chart illustrates, it was only in the latter half of the Nineteenth Century that the conception of the United States as a single, indivisible entity was reflected in the way a majority of writers referred to the nation. This is a trend with obvious political and historical significance, of interest to a wide range of scholars and even to the public at large. But this type of comparison is meaningful only to the extent that it uses as raw data a digitized archive of significant size and scope.”

Metadata like this can only be collected by digitizing the entire contents of books, and it clearly does not communicate any author’s original expression to the reading public.
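The comparison underlying the graph reduces to a simple computation over phrase frequencies. The sketch below uses invented toy values, not the actual Ngram data, purely to show the shape of the analysis.

```python
# A toy reconstruction of the "is vs. are" comparison. The frequencies below
# are invented for illustration; the real values come from the Google Ngram
# data sampled at five-year intervals, as described above.

def crossover_year(series):
    """First year in which the 'is' phrase overtakes the 'are' phrase."""
    for year, (is_freq, are_freq) in sorted(series.items()):
        if is_freq > are_freq:
            return year
    return None

# year -> (frequency of "the United States is", frequency of "... are")
toy_series = {
    1840: (0.2, 0.8),
    1860: (0.4, 0.6),
    1880: (0.6, 0.4),
    1900: (0.8, 0.2),
}
print(crossover_year(toy_series))  # 1880
```

The point of the example is that the analysis consumes only aggregate phrase counts; no passage of any book is ever reproduced.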

I decided that the graph deserved its own post.

Google Books held to be fair use

Authors Guild v. Google: library digitization as fair use vindicated, again.

After more than eight years of litigation, the legality of the Google Books Search engine has finally been vindicated.


Authors Guild v. Google Summary Judgment (Nov. 14, 2013)

The heart of the decision

The key to understanding Authors Guild v. Google is not in the court’s explanation of any of the individual fair use factors — although there is a great deal here for copyright lawyers to mull over —  but rather in the court’s description of its overall assessment of how the statutory factors should be weighed together in light of the purposes of copyright law.

“In my view, Google Books provides significant public benefits. It advances the progress of the arts and sciences, while maintaining respectful consideration for the rights of authors and other creative individuals, and without adversely impacting the rights of copyright holders. It has become an invaluable research tool that permits students, teachers, librarians, and others to more efficiently identify and locate books. It has given scholars the ability, for the first time, to conduct full-text searches of tens of millions of books. It preserves books, in particular out-of-print and old books that have been forgotten in the bowels of libraries, and it gives them new life. It facilitates access to books for print-disabled and remote or underserved populations. It generates new audiences and creates new sources of income for authors and publishers. Indeed, all society benefits.”  (Authors Guild v. Google, p.26)

Even before last year’s HathiTrust decision (Authors Guild v. Hathitrust), the case law on transformative use and market effect was stacked in Google’s favor. Nonetheless, Judge Chin’s rulings in other cases (e.g. WNET, THIRTEEN v. Aereo, Inc.) suggest that he takes the rights of copyright owners very seriously and that it was essential to persuade him that Google was not merely evading the rights of authors through clever legal or technological structures. The court’s conclusion that the Google Library Project “advance[d] the progress of the arts and sciences, while maintaining respectful consideration for the rights of authors and other creative individuals, and without adversely impacting the rights of copyright holders” pervades all of its more specific analysis.

Data mining, text mining and digital humanities

An entire page of the judgment is devoted to explaining how digitization enables data mining. This discussion relies substantially on the Amicus Brief of Digital Humanities and Law Scholars signed by over 100 academics last year.

“Second, in addition to being an important reference tool, Google Books greatly promotes a type of research referred to as “data mining” or “text mining.”  (Br. of Digital Humanities and Law Scholars as Amici Curiae at 1 (Doc. No. 1052)).  Google Books permits humanities scholars to analyze massive amounts of data — the literary record created by a collection of tens of millions of books.  Researchers can examine word frequencies, syntactic patterns, and thematic markers to consider how literary style has changed over time.  …

Using Google Books, for example, researchers can track the frequency of references to the United States as a single entity (“the United States is”) versus references to the United States in the plural (“the United States are”) and how that usage has changed over time.  (Id. at 7).  The ability to determine how often different words or phrases appear in books at different times “can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology.”  Jean-Baptiste Michel et al., Quantitative Analysis of Culture Using Millions of Digitized Books, 331 Science 176, 176 (2011) (Clancy Decl. Ex. H)” (Authors Guild v. Google, p.9-10)

The court held that Google Books was “[transformative] in the sense that it has transformed book text into data for purposes of substantive research, including data mining and text mining in new areas, thereby opening up new fields of research. Words in books are being used in a way they have not been used before. Google Books has created something new in the use of text — the frequency of words and trends in the usage provide substantial information.”

A snippet of new law

Last year, the court in HathiTrust ruled that library digitization for the non-expressive use of text mining and the expressive use of providing access to the visually disabled was fair use. Today’s decision in Authors Guild v. Google supports both of those conclusions; it further holds that the use of snippets of text in search results is also fair use. The court noted that displaying snippets of text as search results is similar to the display of thumbnail images of photographs as search results and that these snippets may help users locate books and determine whether they may be of interest.

The judgment clarifies something that confuses a lot of people — the difference between “snippet” views on Google books and more extensive document previews. Google has scanned over 20 million library books to create its search engine, mostly without permission. However, Google has agreements with thousands of publishers and authors who authorize it to make far more extensive displays of their works – presumably because these authors and publishers understand that even greater exposure on Google Books will further drive sales.

The court was not convinced that Google Books poses any threat of expressive substitution because, although it is a powerful tool for learning about books individually and collectively, “it is not a tool to be used to read books.”

The Authors Guild had attempted to show that an accumulation of individual snippets could substitute for books, but the court found otherwise: the kind of accumulation of snippets that the plaintiffs were suggesting was both technically infeasible because of certain security measures and, perhaps more importantly, was bizarre and unlikely: “Nor is it likely that someone would take the time and energy to input countless searches to try and get enough snippets to comprise an entire book.  Not only is that not possible as certain pages and snippets are blacklisted, the individual would have to have a copy of the book in his possession already to be able to piece the different snippets together in coherent fashion.”


Today’s decision is an important victory for Google and the entire United States technology sector; it also confirms the recent victory of libraries, academics and the visually disabled in Authors Guild v. HathiTrust.

Unless today’s decision is overruled by the Second Circuit or the Supreme Court — something I personally think is very unlikely — it is now absolutely clear that technical acts of reproduction that facilitate purely non-expressive uses of copyrighted works such as books, manuscripts and webpages do not infringe United States copyright law. This means that copy-reliant technologies, including plagiarism detection software, caching, search engines and data mining more generally, now stand on solid legal ground in the United States. Copyright law in the majority of other nations does not provide the same kind of flexibility for new technology.

All in all, an excellent result.

* Updated at 4.57pm. The initial draft of this post contained several dictation errors which I will now endeavor to correct. My apologies. Updated at 5.17pm with additional links and minor edits. 





University of Iowa presentation on copyright, mass digitization and the digital humanities

I am giving a talk today on copyright, mass digitization and the digital humanities at the University of Iowa law school. My talk will focus on the ongoing litigation between the Authors Guild and Google and the separate case of Authors Guild v. HathiTrust. The case against Google began in 2005 shortly after Google launched its ambitious library digitization project. The case against the HathiTrust, a digital library that pulls together the resources of a number of American universities, began much later in September 2011.

These cases raise complicated issues about standing, the scope of class actions, statutory interpretation, the interaction of general and specific limitations and exceptions to copyright under the Copyright Act of 1976, and probably a few others besides. However, at the heart of both cases is actually a very simple question — does copying for non-expressive use require the express approval of the copyright owner?

A non-expressive use is one which involves some technical act of copying but for which the resultant copy is not read by any human being. For example, checking work for plagiarism involves comparing the suspect work against a database of potential sources. It is certainly valuable to know that work A is suspiciously like work B, but that knowledge is entirely independent of the expressive value of either of the underlying works.
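As a rough illustration of that kind of comparison, here is a minimal sketch of how a program might measure overlap between two texts without any human reading them. The texts and the particular similarity measure (word n-gram overlap) are my own illustrative choices, not a description of any real plagiarism detector.

```python
# A minimal sketch of non-expressive comparison: measure how much two texts
# overlap by comparing their sets of n-word sequences ("shingles").

def shingles(text, n=3):
    """Set of n-word sequences in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def similarity(a, b, n=3):
    """Jaccard similarity of the two texts' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

original = "the quick brown fox jumps over the lazy dog"
suspect = "the quick brown fox leaps over the lazy dog"
print(similarity(original, suspect))  # 0.4
```

The output is a single number; no expression from either work is ever communicated to a reader.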

Non-expressive use was not a particularly pressing concern before the digital era – from the printing press to the photocopier, the only plausible reason to copy a work was in anticipation of reading it. In the present, however, scanning technology, computer processing power and powerful software tools make it possible to crunch the numbers on the written word in all sorts of remarkable ways. The non-expressive use that most people will be familiar with relates to Internet search engines. Search engines direct users to sites of interest based on a complicated set of algorithms, but underlying those algorithms is an extraordinary database describing the contents of billions of individual webpages. Building that database requires copying and indexing billions of individual webpages.
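The indexing step can be illustrated with a toy inverted index, the core data structure behind full-text search. The page contents below are invented examples; real search engines add ranking algorithms on top of this basic structure.

```python
# A toy inverted index: each word maps to the set of pages containing it.
# This is the sense in which building a search engine requires copying and
# indexing every page, even though no page is ever "read" by a person.

def build_index(pages):
    """Map each word to the set of page ids whose text contains it."""
    index = {}
    for page_id, text in pages.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(page_id)
    return index

def search(index, query):
    """Pages containing every word of the query."""
    sets = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*sets) if sets else set()

pages = {
    "p1": "copyright law and fair use",
    "p2": "patent law reform",
    "p3": "fair use in copyright litigation",
}
index = build_index(pages)
print(sorted(search(index, "copyright fair")))  # ['p1', 'p3']
```

Note that the index records only which words appear where; the copies made to build it serve retrieval, not reading.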

Authors Guild v. Google will determine whether it was legitimate for Google to extend its Internet search model to the off-line world and apply it to paper-based works which had never been digitized. However, the significance of this case goes well beyond building a better library catalog — although the importance of that should not be casually dismissed — Authors Guild v. Google and Authors Guild v. HathiTrust will shape the future of the digital humanities. If the District Court ruling in HathiTrust stands, as I believe it should, academics who wish to combine data science and a love of literature will not be shackled to the pre-1923 public domain. They will be able to apply the same analytical techniques to the works of William Faulkner as to those of William Shakespeare. More importantly, distant reading empowered by computational analysis will allow scholars to extend their gaze beyond a narrow literary canon, or even the few thousand works that most of us can hope to read in our lifetime, and address questions on a broader scale.

Slides are available here: Copyright and Mass Digitization, Iowa 2013


Some thoughts on the use of bio photos

I have noticed over the years that whenever someone puts together a bio page for me in relation to a talk or a conference presentation, they tend to grab just any old photo from the Internet. Quite frankly, some of these photographs are more flattering than others. Most of them are not as good as the selfie I took on my iPhone this morning. Photos from 10 years ago might be considered too flattering in terms of hairline.

Perhaps with some strategic tagging and linking I can get this to the top of the Google search engine.

Photo of Prof. Matthew Sag 2013

Matthew Sag


I also have a full bio page, which contains all sorts of useful information.

Archives & Copyright: Developing An Agenda For Reform starts tomorrow #dh #archivescopyright

Archives & Copyright: Developing An Agenda For Reform

This is a one day symposium, co-organised by CREATe and the Wellcome Library. The symposium considers forthcoming changes to the copyright regime in the UK as it impacts the work of archives, as well as the role that risk-management plays in copyright compliance for archival digitization projects.

I will be speaking on a panel along with Professors Peter Jaszi and Peter Hirtle. We will discuss how cultural heritage institutions in the US work with copyright law, and in particular the ongoing Authors Guild v. HathiTrust case (currently on appeal).

I plan to talk about my experience bringing together (along with Jason Schultz and Matthew Jockers) the digital humanities amicus briefs for Authors Guild v. HathiTrust I and II and Authors Guild v. Google. My slides are available right here.

The #hashtag for the symposium is #archivescopyright

Empirical Studies of Copyright Litigation: Can we rely on PACER’s Nature of Suit coding?

I have just posted a new paper titled Empirical Studies of Copyright Litigation: Nature of Suit Coding. The paper investigates reliance on the Nature of Suit coding in the PACER records for empirical studies of copyright litigation. It concludes that although the PACER Nature of Suit coding for copyright does not in fact capture all copyright cases, it is a good enough sample for most purposes.

In spite of the increasing popularity of empirical legal studies more generally, there are relatively few empirical studies of copyright law, and even fewer of copyright litigation. This state of affairs cannot continue. The creation and distribution of copyrighted works is an important economic driver of the U.S. economy, and copyright law’s interactions with freedom of expression and cultural participation have made it an area of significant public policy focus. If we truly want to understand copyright litigation, then we need to look at LITIGATION and not just at cases. But before we go too far down the rabbit hole of docket analysis, someone needs to ask whether we are studying the right dockets.

As part of a broader ongoing study of copyright litigation, I selected every case in the Lexis database published (by Lexis, not necessarily designated as such by the court) between 2000 and 2012 that included the word “copyright”. The search was designed to be over-inclusive. From this broad sample, I randomly selected one fifth of the district court opinions and all of the court of appeals opinions.
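The sampling step can be sketched as follows. The records are placeholders, and a fixed random seed is an assumption on my part, added so the sketch draws a reproducible sample.

```python
# A sketch of the sampling step: keep all court of appeals opinions and
# randomly select one fifth of the district court opinions. Hypothetical data.
import random

def draw_sample(opinions, seed=0):
    district = [o for o in opinions if o["level"] == "district"]
    appeals = [o for o in opinions if o["level"] == "appeals"]
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sampled = rng.sample(district, k=len(district) // 5)
    return appeals + sampled

opinions = [{"id": i, "level": "district"} for i in range(100)]
opinions += [{"id": 100 + i, "level": "appeals"} for i in range(10)]
sample = draw_sample(opinions)
print(len(sample))  # 30: all 10 appeals opinions plus 20 of the 100 district
```

Random sampling of the larger district court pool keeps the hand-coding workload manageable while preserving the full, smaller appellate pool.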

A team of Loyola Law School students reviewed each opinion following a detailed coding form and determined, among other things, whether the case was truly a copyright case. Of the 472 cases coded, 102 were not copyright cases. More specifically, of the 137 court of appeals cases and 275 district court cases selected, 42 appeals cases and 60 district court cases only mentioned copyright in passing or in the course of discussing copyright case law but did not relate to a claim of copyright infringement.


Determining the NOS coding for these true copyright cases was a simple but laborious matter of cross-referencing the docket number with the PACER records. As set forth in Table 3, below, almost 80% of district court and 85% of court of appeals true copyright cases were filed as NOS=Copyright [820].

Table 3
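The cross-referencing step amounts to a simple lookup. The docket numbers and codes below are invented for illustration; the real work was the laborious matching against PACER records described above.

```python
# A sketch of the cross-referencing step: look up each hand-coded true
# copyright case's docket number in a PACER-style table of Nature of Suit
# codes, then compute the share filed as NOS=820. Data is illustrative.

def share_coded_820(true_copyright_dockets, pacer_nos):
    """Percentage of true copyright cases filed under NOS=820 (Copyright)."""
    hits = sum(1 for docket in true_copyright_dockets
               if pacer_nos.get(docket) == 820)
    return 100.0 * hits / len(true_copyright_dockets)

pacer_nos = {"1:10-cv-0001": 820, "1:10-cv-0002": 190, "1:10-cv-0003": 820,
             "1:10-cv-0004": 820, "1:10-cv-0005": 820}
dockets = ["1:10-cv-0001", "1:10-cv-0002", "1:10-cv-0003",
           "1:10-cv-0004", "1:10-cv-0005"]
print(share_coded_820(dockets, pacer_nos))  # 80.0
```

Cases filed under other codes (here, 190 for Contract) are exactly the ones a study limited to the 820 cohort would miss.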

The “other” category included: Contract; Cable/Sat TV; Other Statutory Actions; Insurance; Assault, Libel, & Slander; Other Personal Property Damage; Civil Rights; Fraud; Personal Injury; and even some criminal filings. What does this imply for empirical research? Most obviously, it implies that docket analysis of copyright disputes relying solely on the Nature of Suit coding misses one in five of the kind of copyright cases that are likely to end up as a written opinion at the district court level.

Is 80% good enough? It’s not bad. If we assume that most attorneys are competent enough to know what the major focus of their case is, then the copyright cases that are overlooked by focusing solely on the 820 cohort are likely to be only partially about copyright. However, researchers should also be aware that some dockets that grow up to be copyright cases, even some that make it into textbooks, will be missed by reliance on the 820 coding. They should thus understand that the selection is probably not random and may not be inconsequential. Consider, for example, the difference in duration between district-level true copyright cases coded as NOS=820 and those that were not.

The average duration of terminated district court true copyright cases was 752 days (488 median) if the case was filed as NOS=820. For the corresponding set filed as something other than NOS=820, the average duration was 506 days (479 median). The average duration of unterminated district court true copyright cases as of January 1, 2013 was 1232 days (1074 median) if the case was filed as NOS=820. For the corresponding set filed as something other than NOS=820, the average duration was 1099 days (942 median). Figures 1 and 2, below, present the same information in the form of histograms indicating the distribution of duration for all four categories.

Figure 1

Figure 2

In simple terms, district court true copyright cases tended to be longer in average duration if filed as NOS=820, although it is noteworthy that they are not that different at the median.
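The kind of comparison reported above can be sketched as follows. The durations are invented examples, chosen only to show how a skewed distribution can pull the mean well away from the median, as in the real data.

```python
# A sketch of the duration comparison: mean and median case length for the
# NOS=820 group versus the rest. The day counts below are invented examples.
from statistics import mean, median

def summarize(durations_by_group):
    """(mean, median) duration in days for each group."""
    return {group: (round(mean(days), 1), median(days))
            for group, days in durations_by_group.items()}

durations = {
    "NOS=820": [300, 488, 752, 1200, 1500],  # days from filing to termination
    "other":   [200, 479, 506, 700, 800],
}
print(summarize(durations))
```

In this toy data, as in the real figures above, a handful of long-running 820 cases inflate the mean while the medians of the two groups stay much closer together.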

What does all this mean for empirical studies of copyright litigation?
My conclusion is that, for copyright at least, although the PACER Nature of Suit coding does not in fact capture all copyright cases, as long as researchers are clear about their methods and what data they are excluding, it is a good enough sample for most purposes.

Ivan Sag 1949 – 2013

Ivan was a large and brilliant man; the world feels like a smaller place without him. Ivan loved to drink, he loved to eat, he loved ideas, he loved his wife and he loved his friends. We loved him right back.

Ivan made significant contributions to the fields of syntax, semantics, pragmatics, and language processing. He wrote at least 10 books and over 100 articles. Ivan was the Sadie Dernham Patek Professor in Humanities, Professor of Linguistics, and Director of the Symbolic Systems Program at Stanford University. A fellow of the American Academy of Arts and Sciences and the Linguistic Society of America, in 2005 he received the LSA’s Fromkin Prize for distinguished contributions to the field of linguistics. All of which is to say that he was a brilliant wonderful man who I proudly call my uncle (even though he is in fact my first cousin, once removed). He will be missed.

A true scientist, Ivan was proud to live and die as an atheist.