DocumentCloud - improving search for journalists

At IBM’s Transparent Text symposium, Aron Pilhofer and Jeremy Ashkenas introduce DocumentCloud, a powerful new tool for journalists, bloggers and activists. On the abstract level, it’s designed to “turn unstructured text into structured data”, to let folks work with documents in a more meaningful way. In practical terms, that means unlocking the many documents that newspapers usually publish as PDF image files. The goal is to make the background documents for journalism open, accessible, structured and searchable in ways that benefit investigative journalism, improve transparency and contribute to the universe of linked data.

The project is inspired by tools that are available to New York Times journalists. They’ve got a huge database, built on a platform from FAST, which not only indexes documents but also does powerful entity extraction. This means that a reporter can search for mentions of an individual or a corporation across a very wide range of documents. Subjects of these investigations sometimes remark, “How are you pulling these needles out of haystacks?” The answer, for the New York Times, is a very expensive system.

With support from the Knight Foundation and others, DocumentCloud is trying to make this functionality more accessible to journalists and others without the massive price tag. They’re using OpenCalais to extract entities from documents (as Media Cloud does as well) and then allowing search on entities as well as text strings. The goal is to make it possible to run searches that are unambiguous, allowing a researcher to find mentions of IBM in geographic areas within fifty miles of Paris, for instance.
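To make the idea of an unambiguous entity search concrete, here is a minimal sketch in Python. It is not DocumentCloud’s code and it does not call OpenCalais; it assumes entity extraction has already happened and that each document carries a set of organization names plus a map of place names to coordinates. The `Document` class, the `search` function and the sample records are all hypothetical, made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


@dataclass
class Document:
    """Hypothetical record: a document plus entities already extracted from it."""
    doc_id: str
    organizations: set[str] = field(default_factory=set)
    # place name -> (latitude, longitude) in degrees
    places: dict[str, tuple[float, float]] = field(default_factory=dict)


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def search(docs: list[Document], organization: str,
           center: tuple[float, float], radius_miles: float) -> list[str]:
    """Return ids of documents that mention `organization` as an entity
    and mention at least one place within `radius_miles` of `center`."""
    hits = []
    for doc in docs:
        if organization not in doc.organizations:
            continue
        if any(haversine_miles(*center, *coords) <= radius_miles
               for coords in doc.places.values()):
            hits.append(doc.doc_id)
    return hits


PARIS_FRANCE = (48.8566, 2.3522)

# Invented sample data; coordinates are approximate.
DOCS = [
    Document("memo-1", {"IBM"}, {"Versailles": (48.8049, 2.1204)}),
    Document("memo-2", {"IBM"}, {"Paris, Texas": (33.6609, -95.5555)}),
    Document("memo-3", {"Knight Foundation"}, {"Versailles": (48.8049, 2.1204)}),
]

if __name__ == "__main__":
    print(search(DOCS, "IBM", PARIS_FRANCE, 50))
```

A plain text search for “IBM” and “Paris” would also match the Texas memo; matching on entities with coordinates attached is what lets the query rule it out.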

New releases of code will include a powerful document reader, based around Docstoc, and a very cool system for crowdsourcing tasks, specifically OCR’ing PDF documents. That system, CloudCrowd, has been released as open source, and looks like a fantastic new system for distributed computing tasks.

2 thoughts on “DocumentCloud – improving search for journalists”

Scott Klein September 21, 2009 at 4:27 pm

Thanks for the post Ethan! One thing though: The NYT Document Reader isn’t based on DocStoc. It’s written by the coders at the Times.