All posts by Matthew Sag

About Matthew Sag

Technology enthusiast, law professor, copyright and internet law specialist.

Copyright Trolling Data, Updated to June 30 2014

Copyright Trolls, Pornography, Statutory Damages…

[Revised at 5:43pm to account for an idiotic mistake in Excel – which just goes to show that you should not use Excel for even the simplest things]

The gifts that keep on giving.

I have updated my data on copyright trolling to include cases filed up to June 30, 2014. The data is now available to anyone interested in replication. I have also revised my paper, Copyright Trolling, An Empirical Study (download the full paper from SSRN), with the following table that shows the phenomenal influence of Malibu Media.

Bottom line: Malibu Media accounted for 10% of all copyright suits filed in 2012, 27% in 2013 and 40% in the first half of 2014.

Copyright Suits Filed in U.S. District Courts – 2001 to June 30 2014



The top section of the table shows how many cases were filed under the 820 code for Copyright in U.S. Federal District Courts in the years 2003 to 2014. The bottom section of the table translates the same information into percentages. The “Copyright – All” category includes all copyright cases. “Copyright – John Doe” includes all copyright cases where the defendant was a John Doe, without differentiating as to the underlying subject matter of the complaint. “Copyright – John Doe (Porn)” is a subset of the previous category and includes all cases identified as relating to pornography. The final category, “Malibu Media v. Doe(s)”, includes every case filed by Malibu Media against one or more John Does.


Call for signatories: Digital Humanities Amicus in Authors Guild v. Google

Matthew Jockers, Jason Schultz and I have written an amicus brief in the upcoming Court of Appeals round of Authors Guild v. Google, Inc.

Download the draft here: DH Amicus AG v Google CA2


Since we started working on this project just over two years ago, two district courts and the Court of Appeals for the Second Circuit have rejected the Authors Guild’s attacks on library digitization and the legality of text-mining. We are confident that the Second Circuit will uphold Judge Chin’s decision last year, in which he rejected (on a motion for summary judgment) the Authors Guild’s copyright infringement claim against Google over its Google Book Search product. The rulings in Authors Guild v. Google and the parallel case of Authors Guild v. HathiTrust mark a critical moment in the fight to define fair use for the Digital Humanities.

In Authors Guild v. Google, Judge Chin expressly based his ruling in part on the fact that

“Google Books … has transformed book text into data for purposes of substantive research, including data mining and text mining in new areas, thereby opening up new fields of research. Words in books are being used in a way they have not been used before. Google Books has created something new in the use of book text — the frequency of words and trends in their usage provide substantive information.”
In his decision, Judge Chin cites the Brief of Digital Humanities and Law Scholars as Amici Curiae that we submitted on behalf of more than 100 researchers and scholars last year. Chin wrote that
“Google Books permits humanities scholars to analyze massive amounts of data — the literary record created by a collection of tens of millions of books.”

The Authors Guild is now appealing Judge Chin’s decision (on this and other grounds). A different panel of the Second Circuit has already upheld the decision in Authors Guild v. HathiTrust. We believe that these cases will have a dramatic effect on research in fields ranging from computer science to linguistics, history, literature and the digital humanities.

Argument in a nutshell

According to the U.S. Constitution, the purpose of copyright is “To promote the Progress of Science and useful Arts”. Copyright law should not be an obstacle to statistical and computational analysis of the millions of books owned by university libraries. Copyright law has long recognized the distinction between protecting an author’s original expression and the public’s right to access the facts and ideas contained within that expression. That distinction must be maintained in the digital age so that library digitization, internet search and related non-expressive uses of written works remain legal.

What can you do?

If you are an academic, student or researcher who would be affected by this issue, you can help preserve the balance of copyright law by joining our brief as a signatory (we need your name and affiliation, e.g. Associate Professor, Jane Doe, Springfield University).

Does this concern you?

If you are still reading this post, the answer is probably YES. We are collecting signatures from a wide range of fields, including computer science, English, history, law, linguistics and philosophy. We need your name etc. by July 9, 2014. Please enter your details directly via this online tool. Feel free to share this invitation with other interested academics and PhD students.

Thank you!

Matthew Jockers explains why you can’t read a book through snippets

The Authors Guild’s war on search engines, text-mining and academic research is in its final throes. Over the last two years, two different US Federal District Courts have held that library digitization for the purpose of building a search index and running a search engine is fair use. See Authors Guild v. HathiTrust, 902 F. Supp. 2d 445 (S.D.N.Y. 2012) and Authors Guild v. Google, 954 F. Supp. 2d 282 (S.D.N.Y. 2013). The HathiTrust decision was upheld on appeal on June 10 this year (Authors Guild v. HathiTrust, 2nd Circuit 2014) and the parties and interested amici are gearing up for a final showdown in the appeal of Authors Guild v. Google.

In the Guild’s latest legal salvo it argues – by repeated assertion – that the text snippets Google displays to users allow 78% of the contents of any book to be reconstructed. (e.g., at p.10 “The scanning process resulted in an index that contains the complete text of all the books copied in the Library Project.”)

My sometime co-author and accomplished Digital Humanities researcher, Matthew Jockers, tested out the Guild’s claims on his own book and … it turns out that you can’t read a book through snippets, unless you already have the book, and that even then it takes about 30 minutes to trick the search engine into giving you the next 100 words beyond the free-view.

As Matt explains:

“Reading 78% of my book online, as the Guild asserts, requires that the reader anticipate what words will appear in the concealed sections of the book.”

He concludes:

“Well, this certainly is reading Macroanalysis the hard way. I’ve now spent 30 minutes to gain access to exactly 100 words beyond what was offered in the initial preview. And, of course, my method involved having access to the full text! Without the full text, I don’t think such a process of searching and reading is possible, and if it is possible, it is certainly not feasible!

But let’s assume that a super savvy text pirate, with extensive training in English language syntax could guess the right words to search and then perform at least as well as I did using a full text version of my book as a crutch. My book contains, roughly, 80,000 words. Not counting the ~5k offered in the preview, that leaves 75,000 words to steal. At a rate of 200 words per hour, it would take this super savvy text pirate 375 hours to reconstruct my book. That’s about 47 days of full-time, eight-hour work.”
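The arithmetic behind that estimate can be checked with a short sketch. The figures are the ones quoted above (100 words recovered in 30 minutes, roughly 80,000 words in the book, about 5,000 in the preview, eight-hour days); the function and variable names are illustrative assumptions only.

```python
def reconstruction_estimate(total_words, preview_words,
                            words_recovered, minutes_spent,
                            hours_per_day=8):
    """Estimate the effort needed to rebuild a book from search snippets."""
    # Observed recovery rate, converted from minutes to hours.
    words_per_hour = words_recovered / (minutes_spent / 60)
    # Words not already visible in the free preview.
    remaining_words = total_words - preview_words
    hours = remaining_words / words_per_hour
    days = hours / hours_per_day
    return words_per_hour, hours, days


rate, hours, days = reconstruction_estimate(80_000, 5_000, 100, 30)
print(f"{rate:.0f} words/hour, {hours:.0f} hours, about {days:.0f} working days")
# prints "200 words/hour, 375 hours, about 47 working days"
```

Note that this is the best case: it assumes the searcher keeps up the rate that was only achieved with the full text in hand.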

Matt’s book is Macroanalysis: Digital Methods and Literary History and — as seen on the screen shot I just made of Google Books — you can buy the eBook version, linked to from the Google Books web page, for $14.95.




Authors Guild v. HathiTrust — Libraries 3 : Authors Guild 0

The Second Circuit Court of Appeals has upheld the most important parts of the District Court decision in Authors Guild v. HathiTrust. Here is a link to the decision – AGvHathiTrust_CA2_2013.

Along with the district court decision in this case and the one in Authors Guild v. Google, this makes the current score Libraries 3 : Authors Guild 0.

The decision confirms that library digitization (as performed by Google in conjunction with the University of Michigan, University of Illinois and many others) does not infringe copyright if it is done for the purpose of allowing blind and visually disabled people to read books. 

Access to the Print‐Disabled
The HDL also provides print‐disabled patrons with versions of all of the works contained in its digital archive in formats accessible to them. In order to obtain access to the works, a patron must submit documentation from a qualified expert verifying that the disability prevents him or her from reading printed materials, and the patron must be affiliated with an HDL member that has opted‐into the program. Currently, the University of Michigan is the only HDL member institution that has opted‐in. We conclude that this use is also protected by the doctrine of fair use.

The decision confirms that library digitization does not infringe copyright if it is done for the purpose of text-mining or creating a search engine. This is the core of the non-expressive use argument that Matthew Jockers, Jason Schultz and I made in the Digital Humanities Amicus Brief. That brief was joined by over 100 professors and scholars who teach, write, and research in computer science, the digital humanities, linguistics or law, and two associations that represent Digital Humanities scholars generally.

The crux of our argument was that mass digitization of books for text-mining purposes is a form of incidental or “intermediate” copying that enables ultimately non-expressive, non-infringing, and socially beneficial uses without unduly treading on any expressive—i.e., legally cognizable—uses of the works. The Court of Appeals appears to have agreed.

Full‐Text Search
It is not disputed that, in order to perform a full‐text search of books, the Libraries must first create digital copies of the entire books. Importantly, as we have seen, the HDL does not allow users to view any portion of the books they are searching. Consequently, in providing this service, the HDL does not add into circulation any new, human‐readable copies of any books. Instead, the HDL simply permits users to “word search”—that is, to locate where specific words or phrases appear in the digitized books. Applying the relevant factors, we conclude that this use is a fair use.

The Court left itself some room to maneuver if it turns out that, for some reason, digitization for non-expressive uses like text mining causes unforeseen harm in different circumstances. For example, a digitization project that did not bother with any kind of security might not be fair use.

Without foreclosing a future claim based on circumstances not now predictable, and based on a different record, we hold that the balance of relevant factors in this case favors the Libraries. In sum, we conclude that the doctrine of fair use allows the Libraries to digitize copyrighted works for the purpose of permitting full‐text searches.

With that appropriate caveat, this is a great win for humanity and for the Digital Humanities.

I am proud to have played my small part in this case over the years.

Updated Copyright Trolling and Pornography Data for 2014


Copyright Lawsuits Filed in U.S. Federal Courts 2001 – 2014


This graph is from my forthcoming article, Copyright Trolling, An Empirical Study. The graph illustrates the effect of two separate waves of John Doe litigation. The first wave was the recording industry’s battle with filesharing technology from 2004 through 2008. The second wave began in 2010, continues to the present, and is dominated to a remarkable degree by lawsuits relating to pornography.

There is an article in the New Yorker Online today about this phenomenon with a closeup view of one of the major plaintiffs, Malibu Media; see Gabe Friedman, The Biggest Filer of Copyright Lawsuits? This Erotica Web Site. Well worth a read.


Patent troll statistics, a correction

In a post on Friday I mentioned that in 2012, businesses and individuals targeted by patent aggregators and patent holding companies accounted for fifty-six percent of all patent defendants. That number should actually be 37.8% of all patent defendants. 

There are estimates of even higher numbers; see Colleen Chien, Patent Trolls by the Numbers, reporting RPX’s estimate that PAEs initiated 62% of all patent litigation suits in 2012. However, it is not exactly clear how RPX determines who is and is not a patent troll. Also, RPX is in the business of providing “patent risk management services”, so it is not exactly a disinterested bystander in the patent troll debate.

Christopher A. Cotropia, Jay P. Kesan & David L. Schwartz, in Unpacking Patent Assertion Entities (PAEs), have provided some great data on this issue – but it still needs to be read carefully to understand what it means.

Figure 3 of that paper reports patent litigation numbers in terms of the number of individual defendants sued.

On this metric:

  • suits by large aggregators and patent holding companies increased from 31.6% of all patent litigation in 2010 to 37.8% in 2012;
  • in contrast, suits by operating companies went down from 48.9% in 2010 to 47.3% in 2012;
  • if you include the IP holding companies of operating companies, suits by operating companies went down from 51.0% in 2010 to 47.8% in 2012.

Cotropia, Kesan & Schwartz round out this picture by reporting the numbers for universities & colleges, individuals & family trusts, failed operating companies & failed start-ups, and technology development companies. Some of these suits may be troll litigation, but without case-specific information it is hard to tell.


Some measured thoughts on patent trolls

This post expands on the remarks I made today at the Chicago Tech Roundtable meeting on Patent Trolls and Chicago’s Tech Community. The meeting was attended by a number of elected officials and their representatives as well as start-ups such as Jump Rope and Options Away Travel who have had direct experiences with patent trolls. This is an important issue for Chicago’s technology sector.

Trolls and Trolling – The Nature of the Problem

Patent trolls are in the news, and they have been high on the agenda of intellectual property policy makers and academics for over a decade now. I started thinking about these issues when I worked as an IP lawyer in Silicon Valley in the early 2000s. The Federal Trade Commission sounded an important call to action on patent trolls and the balance of competition and patent law and policy in 2003 [FTC, To Promote Innovation: The Proper Balance of Competition and Patent Law and Policy (2003)], and again in 2011 [FTC, The Evolving IP Marketplace: Aligning Patent Notice and Remedies with Competition (2011)].

I first wrote about these issues in a paper published in 2007, Patent Reform and Differential Impact (with Kurt W. Rohde, who was then a student of mine and is now a partner with McDonnell Boehnen Hulbert & Berghoff LLP). Some things have changed since then, but the patent troll problem persists; it may even be getting worse. In 2012, businesses and individuals targeted by patent aggregators and patent holding companies accounted for 37.8% of all patent defendants.*

* An earlier version of this post contained the assertion that businesses and individuals targeted by patent aggregators and patent holding companies accounted for 58% of all patent defendants. That was a very unfortunate transcription error.

Not every non-practicing entity, patent aggregator and patent holding company is necessarily a troll. There is obviously a role in our innovation ecosystem for people who invent but don’t have the complementary skills to commercialize. But the rough and ready correlation between patent assertion entities and trolls seems to fit. Any business that is sending out infringement notices by the hundreds or thousands can’t plausibly be doing the kind of diligence it should take to make a meritorious claim of infringement.

Debating who is and is not a troll is beside the point – it is trolling behavior that we need to address. Trolling is abusive and opportunistic behavior such as asserting bad patents or using patents to extort settlements that are only justified by the threat of legal fees. Trolling is mostly a numbers game – trolls target hundreds or thousands of defendants, seeking quick settlements priced just low enough that it is easier for the defendant to pay the troll directly rather than pay his lawyers to defend the claim. Anyone who takes part in this kind of systematic opportunism that undermines innovation is a troll, NPE or not.

Patent trolls thrive by opportunistically taking advantage of the uncertain scope of patent claims, the poor quality of patent examination, the high cost of litigation and the asymmetry of risks and costs of litigation. There is nothing wrong with licensing your technology, or with litigating against infringers who would rather take your technology without a license. The patent system is meant to encourage investment in R&D and innovation. Businesses monetizing their technology ought to be celebrated, but using the threat of costly litigation to monetize bad or ill-fitting patent claims takes money away from R&D budgets for no social gain.


There will always be opportunists who try to exploit the system – the goal of patent reform should be to limit unproductive rent-seeking while leaving the door open to those businesses that actually contribute to the research and development that makes the U.S. a world leader in so many fields. We need reforms targeted at bad patents, such as limiting continuations, rejecting highly abstract functional claims (especially in software and business method patents) and improving the quality of patent examination. But we also need reforms targeted at bad conduct. We need reforms that level the litigation playing field – it is far too easy to impose millions of dollars of defense costs based on dubious patents and tenuous theories of the scope of those patents.

Some of the most useful reforms – reforms that leave the door open for well-founded claims – include: fee shifting, such that the losing party pays the winning party’s fees in ordinary cases; heightened pleading standards; delaying discovery until after “claim construction”; and limiting discovery to those documents likely to be relevant to the specific litigation at hand. “Consumer stay” provisions would also do a lot to end opportunistic threats of litigation – under this proposal if, for example, a cafe was sued for offering Wi-Fi to its customers, the manufacturer of the router could intervene and effectively consolidate the cases of all of its customers. Reforms aimed at transparency of patent ownership and taking action against misleading and deceptive language in demand letters would also be of some assistance.

The House recently passed the Innovation Act to take up some of these issues and the Senate Judiciary Committee seems close to finalizing a complementary bill. Neither of these bills will put an end to opportunism, but they have the potential to make life a little harder for patent trolls and a little easier for the rest of us.

Patent Troll Statistics

According to RPX Corporation, PAEs initiated 62% of all patent litigation suits in 2012. [See Colleen Chien, Patent Trolls by the Numbers.] However, it is not exactly clear how RPX determines who is and is not a patent troll. Also, RPX is in the business of providing “patent risk management services”, so it is not exactly a disinterested bystander in the patent troll debate.

The best empirical analysis of patent troll numbers so far is provided by Christopher A. Cotropia, Jay P. Kesan & David L. Schwartz, Unpacking Patent Assertion Entities (PAEs) (working paper). Figure 3 of that paper reports these numbers in terms of the number of individual defendants sued. On this metric:

  • suits by large aggregators and patent holding companies increased from 31.6% of all patent litigation in 2010 to 37.8% in 2012;
  • in contrast, suits by operating companies went down from 48.9% in 2010 to 47.3% in 2012;
  • if you include the IP holding companies of operating companies, suits by operating companies went down from 51.0% in 2010 to 47.8% in 2012.

Cotropia, Kesan & Schwartz round out this picture by reporting the numbers for universities & colleges, individuals & family trusts, failed operating companies & failed start-ups, and technology development companies. Some of these suits may be troll litigation, but without case-specific information it is hard to tell.


Garcia v. Google amicus briefs with very brief summaries

Due to the level of interest in Garcia v. Google, the Ninth Circuit has a dedicated page providing information and key court documents to the public.

I have listed the Amicus briefs along with very cursory descriptions below. The briefs are all quite short.

Internet Law Professors — addressing the implications of the Court’s decision for Section 230 of the Communications Decency Act. Section 230 is vital to the health of e-commerce and web 2.0 businesses; it provides the legal foundation for many of the most popular websites that enable users to communicate with each other or the world at large. The panel’s broad interpretation of copyright law undermines Section 230 immunity without even expressly considering it. * I am one of the Amici for this brief.

Professors of Intellectual Property Law — arguing that the Court’s opinion misinterprets the baseline requirements for copyrightability. ** This is the brief to read for students of copyright law.

Adobe Systems, et al. It is not surprising that eBay, Facebook, Gawker, Kickstarter, Pinterest, Tumblr, Twitter, and Yahoo! would feel strongly about this case. These amici argue that the Court’s decision and order place too much responsibility on service providers to monitor their services and deny the public’s interest in free expression and access to information. They also argue that the Court’s order is unworkable and that the ruling poses a serious threat to online service providers’ businesses.

Netflix, Inc. — arguing that the ruling creates a new species of copyright and risks wreaking havoc with established copyright and business rules on which third party distributors, such as Netflix, depend.

International Documentary Ass’n — arguing that the Court’s opinion has created uncertainty as to several fundamental concepts that are essential to modern filmmaking.

Floor64 (publisher of & Organization for Transformative Works — arguing that the Court’s decision undermines Congress’s goal of fostering online speech by effectively stripping intermediaries of the statutory protection they depend on to deliver it, i.e. the safe harbors created by § 230 of the Communications Decency Act and the Digital Millennium Copyright Act (17 U.S.C. § 512).)

California Broadcasters — arguing that a finding that individual performances within films and television programs may be entitled to copyright protection creates uncertainty for entertainment media creators and distributors.

News Organizations — arguing that the Court’s decision did not properly consider important First Amendment interests and that it poses serious risks to news organizations that extend far beyond the unique facts of the case at hand.

Electronic Frontier Foundation, et al. — framing the issues in Constitutional terms and addressing the standard for preliminary injunctions.

Public Citizen Litigation Group — focusing on the correct standard for issuing an injunction restraining speech.