Meetings with the board of the Information Program of Open Society Institute generally involve travelling to exotic places around the globe and sitting in a non-descript conference room while brilliant people from that nation stream in and talk with us.
We’re trying something different this time around, and I’ve been feeling like my (rusty) Bay Area navigational skills have been far more important this week than any knowledge I have about technology in the developing world as I’ve piloted an SUV of OSI colleagues all around the 415 and 510 area codes. We’ve visited UC Berkeley, Stanford University and several of the major tech players in the Silicon Valley ecosystem so far, and there’s still two days of meetings and travel pending.
Part of the fun today involved a tour of the Internet Archives, led by chief archivist Brewster Kahle. I was wonderfully happy to discover that the headquarters of a project that sometimes sounds like Isaac Azimov’s “Foundation” is located in a decidedly humble building in San Francisco’s Presidio. (Okay, as this is San Francisco, the “humble building” probably costs vastly more than my personal net worth, but it’s refreshingly unflashy.) In fact, the small white house the Archive is housed in reminds me profoundly of the small white house we ran Tripod out of for its first few years…
I spent much of our time at the Internet Archive nostalgic for the early days of the web. Using the Wayback Machine, I looked up some early versions of Tripod’s site… which caused those pages to be displayed on the huge monitor in the office adjoining the conference room I was sitting in, which made me deeply happy. Of course, the Archive only has editions of our site back to late 1996, when Kahle started his ambitious project of capturing as much of the web as possible, which means that the versions of the Tripod site I had a hand in designing and coding are lost to the winds of time. Which, in turn, makes me a little sad, and makes me realize the reason why it’s nice to have archives of a medium as fluid as the web.
There’s much to be impressed about with the work Brewster and Rick Prelinger are doing, both in terms of its impact, its sheer audacity, and the raw numbers behind it. I’ve helped build large server installations, and have a rough idea of how difficult it is to put multiple petabytes of spinning storage online – that the Internet Archives crew have done so much and still have so much more to do impresses the heck out of me.
I was introduced to Brewster years ago by Tim O’Reilly because I became deeply interested in the problems of bringing books to the developing world. Since the Internet Archives’ ambitions extend to digitizing works of literature as well as capturing passing information on the Internet and on TV, Brewster is a very useful guy to talk to. The Internet Archives have made huge progress with scanning books non-destructively with a beautifully engineered system, pictured above.
Books are held open at an angle that’s minimally damaging to their spines. The pages are pressed into place by glass panes and photographed by a pair of 12 megapixel cameras. The resulting images are corrected for skew, autorotated and displayed to the operator, who can manually adjust them. When they’re ready, he or she presses a pedal, which saves the images, lifts the glass and lets the operator turn the book pages. The whole operation takes about five seconds, and Brewster tells us that, in large installations (ten workstations in the basement of a library), books can be scanned at about $0.10 a page.
Once the scans are saved, they’re converted into multiple formats – JPEG, PDF, DjVu, animated GIF. They’re also run through an OCR (optical character recognition) engine, which provides rough transcriptions of the image files so they can be indexed (and maybe at a later date, automatically tagged or summarized.) The OCR isn’t hand-checked, so is likely far from perfect, but is close enough for indexing purposes.
Obviously, this is extremely high-end tech for major libraries… but I want one! I can’t tell you how often I’ve been 5000 miles from home and wanted a book in my home library… not just the text, but my margin notes, dogeared pages and all. In the meantime, I’ll settle for an Internet Archive that keeps growing, putting more and more information into the hands of people around the web.