Copyright and Pornography — Is now the time to panic?

There were 2,004 copyright lawsuits filed in federal district courts in the United States between January 1 and June 30, 2014. Just under 48% of these suits were filed by copyright owners against anonymous IP addresses accused of copyright infringement online. This is not surprising given the extent of online piracy, but what is more than a little surprising is that almost all of these lawsuits relate to pornographic films. Lawsuits alleging illegal file sharing of pornography were virtually non-existent before 2010; they now (Jan–Jun 2014) account for more than 41% of all copyright suits filed.

[Slide 3]

In my talk tomorrow at the 14th Annual Intellectual Property Scholars Conference at Berkeley Law School I will address this phenomenon and answer three fundamental questions: (1) When did this happen? (2) How did it happen? and (3) Is now the time to panic?

Here are some of the slides from my talk (below); the full paper is available here (download Copyright Trolling, An Empirical Study).

 

 

[Slides 5, 6, 8, 9, 10, and 11 from the talk]

 

Why digital humanities researchers support Google’s fair use defense

I posted a guest blog over at the Authors Alliance explaining why digital humanities researchers support Google’s fair use defense in Authors Guild v. Google. The Authors Alliance supports Google’s fair use defense because it helps authors reach readers. In my post, I explained another reason why this case is important to the advancement of knowledge and scholarship.

Earlier this month a group of more than 150 researchers, scholars and educators with an interest in the ‘Digital Humanities’ joined an amicus brief urging the Second Circuit Court of Appeals to side with Google in this dispute. Why would so many teachers and academics from fields ranging from computer science and English literature to history, law, and linguistics care about this lawsuit? It’s not because they are worried about Google—Google surely has the resources to look after itself—but because they are concerned about the future of academic inquiry in a world of ‘big data’ and ubiquitous copyright.

For decades now, physicists, biologists and economists have used massive quantities of data to explore the world around them. With increases in computing power, advances in computational linguistics and natural language processing, and the mass digitization of texts, researchers in the humanities can apply these techniques to the study of history, literature, language and so much more.

Conventional literary scholars, for example, rely on the close reading of selected canonical works. Researchers in the ‘Digital Humanities’ are able to enrich that tradition with a broader analysis of patterns emergent in thousands, hundreds of thousands, or even millions of texts. Digital Humanities scholars fervently believe that text mining and the computational analysis of text are vital to the progress of human knowledge in the current Information Age. Digitization enhances our ability to process, mine, and ultimately better understand individual texts, the connections between texts, and the evolution of literature and language.

A Simple Example of the Power of the Digital Humanities

The figure below is an Ngram-generated chart that compares the frequency with which authors of texts in the Google Book Search database refer to the United States as a single entity (“is”) as opposed to a collection of individual states (“are”). As the chart illustrates, it was only in the latter half of the Nineteenth Century that the conception of the United States as a single, indivisible entity was reflected in the way a majority of writers referred to the nation. This is a trend with obvious political and historical significance, of interest to a wide range of scholars and even to the public at large. But this type of comparison is meaningful only to the extent that it uses as raw data a digitized archive of significant size and scope.

[Figure: Ngram chart of “The United States is” vs. “The United States are”]

There are two very important things to note here. First, the data used to produce this visualization can only be collected by digitizing the entire contents of the relevant books–no one knows in advance which books to look in for this kind of search. Second, not a single sentence of the underlying books has been reproduced in the finished product. The original authors’ expression was an input to the process, but it was not a recognizable part of the output. This is the fundamental distinction that the Digital Humanities Amici are asking the court to preserve–the distinction between ideas and expression.
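To make the mechanics concrete, here is a minimal sketch of the kind of tally that underlies such a chart. It assumes a hypothetical corpus of (year, text) pairs; it is not Google’s Ngram pipeline, just an illustration of how phrase frequencies can be computed without reproducing any of the underlying expression in the output.

```python
from collections import defaultdict

def phrase_frequency_by_year(corpus, phrases):
    """Tally, per year, how often each phrase occurs relative to the total number
    of words published that year. `corpus` is a hypothetical iterable of
    (year, text) pairs; the output contains only numbers, never any passage
    of the underlying texts."""
    hits = defaultdict(lambda: defaultdict(int))
    words = defaultdict(int)
    for year, text in corpus:
        lowered = text.lower()
        words[year] += len(lowered.split())
        for phrase in phrases:
            hits[year][phrase] += lowered.count(phrase.lower())
    return {y: {p: hits[y][p] / words[y] for p in phrases} for y in sorted(hits)}

# Illustrative usage with made-up stand-in sentences (not real data):
corpus = [
    (1850, "The United States are a union of sovereign states."),
    (1900, "The United States is a single, indivisible nation."),
]
print(phrase_frequency_by_year(corpus, ["the United States is", "the United States are"]))
```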

Will Copyright Law Prevent the Computational Analysis of Text?

The computational analysis of text has opened the door to new fields of inquiry in the humanities–it allows researchers to ask questions that were simply inconceivable in the analog era. However, the lawsuit by the Authors Guild threatens to slam that door shut.

For over 300 years copyright has balanced the author’s right to control the copying of her expression with the public’s freedom to access the facts and ideas contained within that expression. Authors get the chance to sell their books to the public, but they don’t get to say how those books are read, how people react to them, whether they choose to praise them or pan them, or how they talk to their friends about them. Copyright protects the author’s expression (for a limited time and subject to a number of exceptions and limitations not relevant here), but it leaves the information within that expression and information about that expression “free as the air to common use.” The protection of expression and the freedom of non-expressive use are both fundamental pillars of American copyright law. However, the Authors Guild’s long-running campaign against library digitization threatens to erase that distinction in the digital age and fundamentally alter the balance of copyright law.

In the pre-digital era, the only reason to copy a book was to read it, or at least to preserve the option of reading it. But this is no longer true. There are a host of modern technologies that literally copy text as an input into some larger data-processing application that has nothing to do with reading. For want of a better term, we call these ‘non-expressive uses’ because they don’t necessarily involve any human being reading the author’s original expression at the end of the day.

Most authors, if asked, support making their works searchable because they want them to be discovered by new generations of readers. But this is not our central point. Our point is that if it is permissible for a human to pick up a book and count the number of occurrences of the word “whale” (1,119 times in Moby Dick) or the ratio of male to female pronouns (about 2:1 in A Game of Thrones, the first book of A Song of Ice and Fire), then there is no reason the law should prevent researchers from doing the same thing on a larger and more systematic basis.

[Figure: pronoun and word counts in A Game of Thrones]
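Automating such a count is straightforward. The sketch below is a minimal illustration, assuming you have a plain-text copy of a book at a hypothetical file path; exact numbers will of course vary with the edition and with which pronouns you choose to count.

```python
import re
from collections import Counter

def simple_word_stats(path):
    """Count occurrences of 'whale' and the ratio of male to female pronouns
    in a plain-text file. Only aggregate counts are returned; no passage of
    the text is reproduced in the output."""
    with open(path, encoding="utf-8") as f:
        counts = Counter(re.findall(r"[a-z']+", f.read().lower()))
    whale = counts["whale"]
    male = counts["he"] + counts["him"] + counts["his"]
    female = counts["she"] + counts["her"] + counts["hers"]
    return whale, male, female

# Hypothetical usage: 'moby_dick.txt' is a placeholder path, not a bundled file.
whale, male, female = simple_word_stats("moby_dick.txt")
print(f"'whale' occurs {whale} times; male:female pronoun ratio is {male / female:.2f}:1")
```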

Digitizing a library collection to make it searchable or to allow researchers to create and analyze metadata does not interfere with the interests that copyright owners have in the underlying expression in their books.

Who knows what the next generation of humanities researchers will uncover about literature, language, and history if we let them?

You can download the Brief of Digital Humanities and Law Scholars as Amici Curiae here.

Brief of Digital Humanities and Legal Scholars filed in Authors Guild v. Google

On Thursday this week, we filed a brief on behalf of over 150 researchers, scholars and educators in Authors Guild v. Google, currently on appeal to the Second Circuit Court of Appeals.
The Brief of Digital Humanities and Legal Scholars argues that Copyright law is not, and should not be, an obstacle to the computational analysis of text. Copyright law has long recognized the distinction between protecting an author’s original expression and the public’s right to access the facts and ideas contained within that expression.
We are confident that the Second Circuit will vote to maintain that distinction in the digital age so that library digitization, internet search and related non-expressive uses of written works remain legal.
The final version of the brief is available on the free online repository SSRN at http://ssrn.com/abstract=2465413.
We are grateful for the support of so many wonderful scholars in this important case, and we are even more grateful for all the fascinating research that these computer scientists, English professors, historians, linguists, and all those working in the digital humanities do to enrich our lives.
We would also like to thank The Association for Computers and the Humanities and the Canadian Society of Digital Humanities/Société canadienne des humanités numériques for their support as institutions.
Matthew Jockers
Matthew Sag
Jason Schultz

Copyright Trolling Data, Updated to June 30, 2014

Copyright Trolls, Pornography, Statutory Damages…

[Revised at 5:43pm to account for an idiotic mistake in Excel – just goes to show that you should not use Excel for even the simplest things]

The gifts that keep on giving.

I have updated my data on copyright trolling to include cases filed up to June 30, 2014. The data is now available to anyone interested in replication. I have also revised my paper, Copyright Trolling, An Empirical Study (download the full paper from SSRN), with the following table, which shows the phenomenal influence of Malibu Media.

Bottom line: Malibu Media accounted for 10% of all copyright suits filed in 2012, 27% in 2013 and 40% in the first half of 2014.

Copyright Suits Filed in U.S. District Courts – 2001 to June 30 2014


 

The top section of the table shows how many cases were filed under the 820 code for Copyright in U.S. Federal District Courts in the years 2003 to 2014. The bottom section of the table translates the same information into percentages. The “Copyright – All” category includes all copyright cases. “Copyright – John Doe” includes all copyright cases where the defendant was a John Doe, without differentiating as to the underlying subject matter of the complaint. “Copyright – John Doe (Porn)” is a subset of the previous category and includes all cases identified as relating to pornography. The final category, “Malibu Media v. Doe(s),” includes every case filed by Malibu Media against one or more John Does.
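For readers interested in replication, the percentages in the bottom section are simply nested shares of each year’s total. A minimal sketch of that tabulation is below; it assumes a hypothetical list of case records with 'year', 'john_doe', 'porn', and 'plaintiff' fields rather than the actual format of the replication data.

```python
from collections import Counter

def category_shares(cases):
    """Compute, for each year, the share of all copyright filings that are
    John Doe suits, pornography-related John Doe suits, and Malibu Media suits.
    `cases` is a hypothetical list of dicts with 'year', 'john_doe', 'porn',
    and 'plaintiff' keys."""
    total, doe, porn, malibu = Counter(), Counter(), Counter(), Counter()
    for case in cases:
        y = case["year"]
        total[y] += 1
        if case["john_doe"]:
            doe[y] += 1
            if case["porn"]:          # porn cases are a subset of John Doe cases
                porn[y] += 1
        if case["plaintiff"] == "Malibu Media":
            malibu[y] += 1
    return {
        y: {
            "John Doe %": 100 * doe[y] / total[y],
            "John Doe (Porn) %": 100 * porn[y] / total[y],
            "Malibu Media %": 100 * malibu[y] / total[y],
        }
        for y in sorted(total)
    }
```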

 

Call for signatories: Digital Humanities Amicus in Authors Guild v. Google

Matthew Jockers, Jason Schultz and I have written an amicus brief in the upcoming Court of Appeals round of Authors Guild v. Google, Inc.

Download the draft here: DH Amicus AG v Google CA2

Background

Since we started working on this project just over two years ago, two district courts and the Court of Appeals for the Second Circuit have rejected the Authors Guild’s attacks on library digitization and the legality of text mining. We are confident that the Second Circuit will uphold Judge Chin’s decision last year rejecting (on a motion for summary judgment) the Authors Guild’s copyright infringement claim against Google over its Google Book Search product. The rulings in Authors Guild v. Google and the parallel case of Authors Guild v. HathiTrust are a critical moment in the fight to define fair use for the Digital Humanities. In Authors Guild v. Google, Judge Chin expressly based his ruling in part on the fact that

“Google Books … has transformed book text into data for purposes of substantive research, including data mining and text mining in new areas, thereby opening up new fields of research. Words in books are being used in a way they have not been used before. Google Books has created something new in the use of book text — the frequency of words and trends in their usage provide substantive information.”
In his decision, Judge Chin cites the Brief of Digital Humanities and Law Scholars as Amici Curiae that we submitted on behalf of more than 100 researchers and scholars last year. Chin wrote that
“Google Books permits humanities scholars to analyze massive amounts of data — the literary record created by a collection of tens of millions of books.”

The Authors Guild is now appealing Judge Chin’s decision (on this and other grounds). A different panel of that same court has already upheld the decision in Authors Guild v. HathiTrust. We believe that these cases will have a dramatic effect on research in fields ranging from computer science and linguistics to history, literature, and the digital humanities.

Argument in a nutshell

According to the U.S. Constitution, the purpose of copyright is “To promote the Progress of Science and useful Arts”. Copyright law should not be an obstacle to statistical and computational analysis of the millions of books owned by university libraries. Copyright law has long recognized the distinction between protecting an author’s original expression and the public’s right to access the facts and ideas contained within that expression. That distinction must be maintained in the digital age so that library digitization, internet search and related non-expressive uses of written works remain legal.

What can you do?

If you are a legal academic, student, academic, or researcher who would be affected by this issue, you can help preserve the balance of copyright law by joining our brief as a signatory (we need your name and affiliation, e.g., Associate Professor Jane Doe, Springfield University).

Does this concern you?

If you are still reading this post, the answer is probably YES. We are collecting signatures from a wide range of fields, including computer science, English, history, law, linguistics and philosophy. We need your name, etc., by July 9, 2014. Please enter your details directly via this online tool:

https://docs.google.com/forms/d/1QSA_fUSaRpw47wwRcXh0SXkZFx1NQ2NbjhBbfTrICnA/viewform?usp=send_form

Please feel free to share this invitation with other interested academics and PhD students.

Thank you!

Matthew Jockers explains why you can’t read a book through snippets

The Authors Guild’s war on search engines, text mining, and academic research is in its final throes. Over the last two years, two different U.S. federal district courts have held that library digitization for the purpose of building a search index and running a search engine is fair use. See Authors Guild v. HathiTrust, 902 F. Supp. 2d 445 (S.D.N.Y. 2012), and Authors Guild v. Google, 954 F. Supp. 2d 282 (S.D.N.Y. 2013). The HathiTrust decision was upheld on appeal on June 10 this year (Authors Guild v. HathiTrust, 2d Cir. 2014), and the parties and interested amici are gearing up for a final showdown in the appeal of Authors Guild v. Google.

In the Guild’s latest legal salvo it argues – by repeated assertion – that the text snippets Google displays to users allow 78% of the contents of any book to be reconstructed (e.g., at p. 10: “The scanning process resulted in an index that contains the complete text of all the books copied in the Library Project.”).

My sometime co-author and accomplished Digital Humanities researcher, Matthew Jockers, tested out the Guild’s claims on his own book and … it turns out that you can’t read a book through snippets, unless you already have the book, and that even then it takes about 30 minutes to trick the search engine into giving you the next 100 words beyond the free-view.

As Matt explains:

“Reading 78% of my book online, as the Guild asserts, requires that the reader anticipate what words will appear in the concealed sections of the book.”

He concludes:

“Well, this certainly is reading Macroanalysis the hard way. I’ve now spent 30 minutes to gain access to exactly 100 words beyond what was offered in the initial preview. And, of course, my method involved having access to the full text! Without the full text, I don’t think such a process of searching and reading is possible, and if it is possible, it is certainly not feasible!

But let’s assume that a super savvy text pirate, with extensive training in English language syntax could guess the right words to search and then perform at least as well as I did using a full text version of my book as a crutch. My book contains, roughly, 80,000 words. Not counting the ~5k offered in the preview, that leaves 75,000 words to steal. At a rate of 200 words per hour, it would take this super savvy text pirate 375 hours to reconstruct my book. That’s about 47 days of full-time, eight-hour work.”
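Restated in a few lines (the figures are Jockers’ own, from the passage above):

```python
total_words = 80_000      # approximate length of Macroanalysis
preview_words = 5_000     # roughly what the free preview already shows
words_per_hour = 200      # the rate Jockers managed even with the full text in hand

hours = (total_words - preview_words) / words_per_hour
print(hours, "hours, or about", round(hours / 8), "eight-hour days")
# -> 375.0 hours, or about 47 eight-hour days
```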

Matt’s book is Macroanalysis: Digital Methods and Literary History and — as seen on the screen shot I just made of Google Books — you can buy the eBook version, linked to from the Google Books web page, for $14.95.

[Screenshot of the Google Books page for Macroanalysis]

 

 

Authors Guild v. HathiTrust — Libraries 3 : Authors Guild 0

The Second Circuit Court of Appeals has upheld the most important parts of the District Court decision in Authors Guild v. HathiTrust. Here is a link to the decision: AGvHathiTrust_CA2_2013.

Along with the district court decision in this case and the one in Authors Guild v. Google, this makes the current score Libraries 3 : Authors Guild 0.

The decision confirms that library digitization (as performed by Google in conjunction with the University of Michigan, University of Illinois and many others) does not infringe copyright if it is done for the purpose of allowing blind and visually disabled people to read books. 

Access to the Print-Disabled
The HDL also provides print-disabled patrons with versions of all of the works contained in its digital archive in formats accessible to them. In order to obtain access to the works, a patron must submit documentation from a qualified expert verifying that the disability prevents him or her from reading printed materials, and the patron must be affiliated with an HDL member that has opted-into the program. Currently, the University of Michigan is the only HDL member institution that has opted-in. We conclude that this use is also protected by the doctrine of fair use.

The decision confirms that library digitization does not infringe copyright if it is done for the purpose of text mining or creating a search engine. This is the core of the non-expressive use argument that Matthew Jockers, Jason Schultz and I made in the Digital Humanities Amicus Brief (http://ssrn.com/abstract=2274832). That brief was joined by over 100 professors and scholars who teach, write, and research in computer science, the digital humanities, linguistics or law, and two associations that represent Digital Humanities scholars generally.

The crux of our argument was that mass digitization of books for text-mining purposes is a form of incidental or “intermediate” copying that enables ultimately non-expressive, non-infringing, and socially beneficial uses without unduly treading on any expressive—i.e., legally cognizable—uses of the works. The Court of Appeals appears to have agreed.

Full-Text Search
It is not disputed that, in order to perform a full-text search of books, the Libraries must first create digital copies of the entire books. Importantly, as we have seen, the HDL does not allow users to view any portion of the books they are searching. Consequently, in providing this service, the HDL does not add into circulation any new, human-readable copies of any books. Instead, the HDL simply permits users to “word search”—that is, to locate where specific words or phrases appear in the digitized books. Applying the relevant factors, we conclude that this use is a fair use.
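A toy version of such a “word search” index makes the non-expressive point concrete. The sketch below is purely illustrative and is not the HDL’s actual system; it builds an index from hypothetical digitized pages and answers queries with locations only, never with readable text.

```python
from collections import defaultdict

def build_index(pages):
    """Build an inverted index mapping each word to the (book, page) locations
    where it appears. `pages` is a hypothetical iterable of
    (book_title, page_number, page_text) tuples. Only locations are stored."""
    index = defaultdict(set)
    for book, page, text in pages:
        for word in set(text.lower().split()):
            index[word].add((book, page))
    return index

def search(index, word):
    """Return where a word appears; the underlying expression is never displayed."""
    return sorted(index.get(word.lower(), set()))

# Hypothetical usage with stand-in page data:
index = build_index([
    ("Moby Dick", 1, "Call me Ishmael."),
    ("Moby Dick", 42, "the whale swam on"),
])
print(search(index, "whale"))   # -> [('Moby Dick', 42)]
```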

The Court left itself some room to maneuver if it turns out that, for some reason, digitization for non-expressive uses like text mining causes unforeseen harm in different circumstances. For example, a digitization project that did not bother with any kind of security might not be fair use.

Without foreclosing a future claim based on circumstances not now predictable, and based on a different record, we hold that the balance of relevant factors in this case favors the Libraries. In sum, we conclude that the doctrine of fair use allows the Libraries to digitize copyrighted works for the purpose of permitting full-text searches.

With that appropriate caveat, this is a great win for humanity in general and for the Digital Humanities in particular.

I am proud to have played my small part in this case over the years.

Updated Copyright Trolling and Pornography Data for 2014

 

Copyright Lawsuits Filed in U.S. Federal Courts 2001 – 2014

[Figure 8]

This graph is from my forthcoming article, Copyright Trolling, An Empirical Study. The graph illustrates the effect of two separate waves of John Doe litigation. The first wave was the recording industry’s battle with file-sharing technology from 2004 through 2008. The second wave began in 2010, continues through to the present, and is dominated to a remarkable degree by lawsuits relating to pornography.

There is an article in the New Yorker Online today about this phenomenon, with a close-up view of one of the major plaintiffs, Malibu Media: “The Biggest Filer of Copyright Lawsuits? This Erotica Web Site,” by Gabe Friedman. Well worth a read.

 

Patent troll statistics, a correction

In a post on Friday I mentioned that in 2012, businesses and individuals targeted by patent aggregators and patent holding companies accounted for fifty-six percent of all patent defendants. That number should actually be 37.8% of all patent defendants. 

There are estimates of even higher numbers; see Colleen Chien, Patent Trolls by the Numbers, reporting RPX’s estimate that PAEs initiated 62% of all patent litigation suits in 2012. However, it is not exactly clear how RPX determines who is and is not a patent troll. RPX is also in the business of providing “patent risk management services,” so it is not exactly a disinterested bystander in the patent troll debate.

Christopher A. Cotropia, Jay P. Kesan & David L. Schwartz, Unpacking Patent Assertion Entities (PAEs) have provided some great data on this issue – but it still needs to be read carefully to understand what it means.

Figure 3 of that paper reports patent litigation numbers in terms of the number of individual defendants sued.

On this metric:

  • suits by large aggregators and patent holding companies increased from 31.6% of all patent litigation in 2010 to 37.8% in 2012;
  • in contrast, suits by operating companies went down from 48.9% in 2010 to 47.3% in 2012;
  • if you include the IP holding companies of operating companies, suits by operating companies went down from 51.0% in 2010 to 47.8% in 2012.

Cotropia, Kesan & Schwartz round out this picture by reporting the numbers for universities & colleges, individuals & family trusts, failed operating companies & failed start-ups, and technology development companies. Some of these suits may be troll litigation, but without case-specific information it is hard to tell.