On Monday, my laptop and I set up shop in my local university library so I could escape the world of digital content and get some writing done. The university in question has internet connectivity, of course, but I have no affiliation and, therefore, no access. In the heart of an institution dedicated to preserving and sharing knowledge, I managed to find a place where I could disconnect from incoming data long enough to get some writing done.
It didn’t work. It turns out I need Google to write these days. I wanted to reference a longshoremen’s strike in Long Beach, and couldn’t remember whether it took place in 2002 or 2005. A better man would have left a question mark in the text – I found myself using my phone to try to Google the answer, then broke down and went to visit the public internet terminals. Realizing how silly this was, I went back to the coffee shop where I usually write, logged on and went back to researching and writing. (And yes, I realize there are likely ways I could have solved this on paper. But I find I’m not consulting the Reader’s Guide to Periodicals nearly as often now that I can search newspaper websites. Are you?)
There’s a problem with this discovery. Much of the good stuff isn’t on the web yet. It’s still in the library. And those of us who’ve gotten out of the library habit are missing information we need. I got a rude awakening to this when I started writing about media attention. Unable to find anyone online writing about media attention, I assumed I was one of the first… missing critical work done by Johan Galtung in 1965, which would have been obvious searching the literature in the library.
Those of us who live and work in a digital world can’t wait for a day when an online search for information looks not only at data stored on millions of webservers, but also at the millions of books in the Library of Congress. That vision motivates pioneers like Brewster Kahle of the Internet Archive, Michael Hart of Project Gutenberg and Maura Marx of the Open Knowledge Commons.
Marx spoke at the Berkman Center earlier this week, introducing Open Knowledge Commons, a new project funded by the Alfred P. Sloan Foundation to help coordinate the myriad of projects working towards the goal of a universal digital library. Sloan’s motivation for funding this new organization has to do with fear of duplication of effort – they are supporting a wide range of efforts to digitize content, and realize that there’s a need for a central registry of content that’s been digitized. There’s also a great need to coordinate legal and advocacy efforts to make the larger vision of a global, multilingual open library possible.
Formerly with the Boston Public Library’s digitization project, Marx sees the project of a universal digital library as an extension of the work Josiah Quincy Jr. and others took up when they formed the American public library movement – the availability of knowledge that would be “free to all”. In a digital age, Marx argues that an open knowledge commons needs to be without enclosure, encompassing both all recorded media and the “cognitive processes applied to it” – the uses of that media – and maintained in the public sphere for the use and benefit of everyone. Her vision is broader than just having access to all texts digitally; it includes being able to do complex, cross-text work like named entity analysis and text extraction on a huge corpus.
Of course, it’s not as simple as putting all the world’s books in a pile and scanning them one at a time. There’s a great deal of complex legal uncertainty around what libraries can and cannot do with scanned books. Public libraries are possible – in intellectual property terms in the US – through the doctrine of “first sale”: if you’ve bought a book, you can lend it to others, if you’d like, rather than forcing them to buy their own copy. It’s not so clear how this applies in a digital age, and there are open questions about copyright, licensing and fair use in a digital age.
As a result, most of the projects working on a digital library are starting with content that’s out of copyright. Project Gutenberg began keying in public domain books in the 1970s, avoiding licensing issues by focusing on texts where copyright has expired. The Million Books Project, started early this decade, has used OCR to create a collection of books focused on agriculture, scanning public domain works and asking permission to scan copyrighted works. The Internet Archive, Marx tells us, was very active in this project, helping invent the high-speed scanning systems used by most libraries today, including the Northeast Regional Scanning Center, funded by 20 libraries in the Boston area. These libraries have scanned half a million books in the past three years, focusing on the Biodiversity Heritage Library project.
The landscape around library digitization shifted in 2004 when Google announced partnerships with major university libraries to scan a huge set of texts and make them accessible via Google Book Search. Google has the resources to scan a huge number of books, and its ambitions in the field make a wide range of people in the book world nervous. Publishers and some copyright-holders worry that Google Book Search could become a Napster for books, allowing users to download copyrighted material without paying – the Association of American Publishers and the Authors Guild sued Google in 2005 for “massive copyright infringement”. Many of the other library digitization projects aren’t real happy about Google’s plans either. Google isn’t handing the output from its scans to other digitization projects – it’s making them available through its search service. And because Google is scanning so much content and making much of it available at no cost, it is likely undercutting other projects to digitize texts and make them available in a free, open way. Those projects also worry that Google may not be fighting hard enough for concepts enshrined in US copyright law, like fair use.
Marx is critical of the recent settlement between Google and the publisher and author groups. The 300-page settlement (Google’s summary of it here, EFF’s reader’s guide to the settlement here) allows Google to offer “previews” of works that are in copyright, but out of print, and give access to the full text for a fee. This involves creation of a licensing body that will provide this subscription access, and a book rights registry which should make it easier for authors to register “orphan works” and get paid for their work. Google has already scanned 7 million books and seems likely to scan many more now that a solution is in sight.
But there are real problems with this settlement, Marx tells us. Libraries get limited access to this subscription content – a single terminal with access per library. While Google has rights to the entire corpus to run analyses and experiments, other researchers can gain access only through two research centers. These centers must evaluate research before it takes place, and Google and the publisher and authors groups can block said research. Furthermore, the research can’t be used for commercial purposes without Google’s permission. Finally, there are concerns about privacy – will Google be as committed to the privacy of readers as libraries are?
In other words, Marx and the Open Knowledge Commons group would like a solution that’s a good bit more open. And they’re worried that a pretty good but closed solution will remove incentives to build a high quality open solution. Just because Google has the power to achieve a settlement with the publisher and author groups doesn’t mean that other digitization projects will be offered the same terms.
In contrast to this unsatisfying outcome, she points to a recent argument over the use of catalog records in the WorldCat format. OCLC, a non-profit library organization, attempted to force users of its catalog record system to add a field to their records, a license assigning the data to OCLC and demanding it not be used in competing systems. The library community reacted strongly to the policy, and OCLC retreated, recommending but not obligating the record field.
Marx hopes that OKC will be able to monitor and weigh in on battles like these, but will also push the envelope for open content. She argues that open access can increase sales, and wants to see ways for libraries to make affordable printed copies of texts accessible. This requires working with publishers and authors, as well as fighting copyright battles in the courts. She also believes the effort would benefit from a massive public works project – a giant effort to create a public good of openly licensed, freely available digitized books, funded by governments as well as foundations. Having this data available openly would allow experimentation with annotation, as with the Library of Congress’s Commons experiment on Flickr, which asks Flickr users to help tag and categorize images from the LOC.
I’m a huge admirer of these projects, but I worry that they are – for completely understandable, wise and legally logical reasons – scanning the wrong stuff. In approaching the elephant in the room – America’s broken copyright laws – innovators have worked on two sides of the problem, avoiding the massive middle. Projects like Creative Commons urge creators to release content under less restrictive licenses, allowing reuse and remix. This is starting to have a modest effect on new content. It’s certainly possible to build a good PowerPoint presentation using only CC-licensed images on your slides; it can be a bit trickier to find enough CC-licensed music to keep your mp3 player full. The digitization projects work on the other side of the spectrum, focusing on content old enough to have returned to the public domain. The new frontier is orphaned works: books still governed by copyright, but whose copyright holder can’t be found.
It’s exciting to be able to freely download Treasure Island, or remix a Jonathan Coulton song without violating laws. But much of the content we want and need isn’t going to be available anytime soon, via legal means. It is, however, available via illegal means, and the discussion at Berkman revealed how many of us have resorted to pirate media to access books we’ve wanted to read. (I’ll own up. I’ve been known to download the PDF versions of Harry Potter books so I can read them on my laptop while I travel. But I’m a good boy and buy them – in hardcover – when I get home.)
My concern is that if projects like OKC are seen as focusing exclusively on collections of texts primarily interesting to historians, the larger vision of a universal digital library gets positioned as a fight for academics, not a mainstream concern. I hope OKC will consider taking on larger fights as well, perhaps attempting to win access to put texts online by asserting fair use, or testing whether the first sale doctrine can apply to digital, not just analog media. If not, I worry we’re going to end up with two parallel systems – a carefully worked out, legal system for texts 99% of people aren’t looking for, and well-developed black markets for texts people are looking for.