Google’s Matthew Gray has been playing with data sets for a long time. He’s the author of the world’s first known web spider, and he’s now working on Google Books. And he wants to know, “What would you like to do with a 30 billion word diachronic corpus?” (We’re at IBM’s Transparent Text summit, so it’s not like he’s walking around Kendall Square asking this question at random.)
For those of you who aren’t computational linguists, that’s a two-terabyte set of words, tagged by date, over a span of 400 years. This is one of the interesting consequences of Google’s attempt to close the world’s information gap around books. By non-destructively scanning books in great libraries of the world, and partnering with publishers to destructively scan their works, Google has created an amazing resource for understanding the evolution of text.
Gray admits that metadata is an issue – a surprisingly difficult one. We’d expect libraries to have information like the place of publication and author birthdates. But it’s harder than you think – some libraries don’t know when their books were published and simply used “1899” as a placeholder for any book whose date they didn’t know. You see mentions of “Internet” in publications dated 1620 – that’s because the mentions appeared in a 1997 issue of a journal that’s been published since 1620.
Oddly enough, you can use Google’s corpus to solve this problem. You can analyze language models from texts published in different years and then make intelligent guesses about when a book with bad or no metadata was published. Gray has been building Markov models of language from given years – he then seeds them with a phrase and can construct arbitrary text from different times in history. (Useful for writers of period romance novels.)
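To make that a little more concrete, here’s a toy sketch of both ideas: guessing a publication year from per-year language models, and generating text from a seed phrase. This is not Gray’s code, and I don’t know how Google’s models are actually built. I’m assuming simple bigram (word-pair) models with add-one smoothing, and the tiny `CORPUS`, along with the names `train_bigrams`, `guess_year` and `generate`, are made up for illustration.

```python
import math
import random
from collections import Counter, defaultdict

# Hypothetical toy corpus: a few dated sentences standing in for scanned books.
CORPUS = {
    1820: ["the carriage did arrive at the inn",
           "a gentleman did arrive by carriage"],
    1990: ["the computer arrived at the office",
           "a programmer arrived by car"],
}

def train_bigrams(sentences):
    """Count, for every word, which words follow it and how often."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            model[prev][nxt] += 1
    return model

def log_likelihood(model, text, vocab_size):
    """Log-probability of text under one year's model (add-one smoothing)."""
    words = text.split()
    total = 0.0
    for prev, nxt in zip(words, words[1:]):
        followers = model.get(prev, Counter())
        seen = followers[nxt]
        total += math.log((seen + 1) / (sum(followers.values()) + vocab_size))
    return total

def guess_year(models, text, vocab_size):
    """Pick the year whose model finds the text least surprising."""
    return max(models, key=lambda year: log_likelihood(models[year], text, vocab_size))

def generate(model, seed, length=8, rng=None):
    """Extend a seed phrase by repeatedly sampling a likely next word."""
    rng = rng or random.Random(0)
    out = seed.split()
    for _ in range(length):
        followers = model.get(out[-1])
        if not followers:  # dead end: nothing ever followed this word
            break
        words, weights = zip(*followers.items())
        out.append(rng.choices(words, weights=weights)[0])
    return " ".join(out)

VOCAB = {w for sentences in CORPUS.values() for s in sentences for w in s.split()}
MODELS = {year: train_bigrams(sentences) for year, sentences in CORPUS.items()}

if __name__ == "__main__":
    print(guess_year(MODELS, "the carriage did arrive", len(VOCAB)))
    print(generate(MODELS[1820], "a gentleman"))
```

With four sentences this is a parlor trick, but the shape is the same at scale: one model per year, score the mystery book against each, and take the best fit. A real system would presumably want longer contexts, better smoothing, and some way to cope with the mislabeled dates in its own training data.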
This sort of huge data set analysis lets you do truly fun stuff. Gray asks, “How was the world represented in books in 1820?” The map above takes a stab at that question.
The ability to analyze these texts suggests the possibility of answering some very big questions. If we can analyze how verbs get regularized over time, would this give credence to the “great man” theory of history versus the “in the air” hypothesis of language development… or of history in general?
Gray asks the room, “What should we be doing with this data?” I wondered whether Google might hand over this data to Erin McKean’s Wordnik project, to help generate a vast, open-source dictionary. Another questioner wonders whether Google is trying to align texts in different languages to create aligned corpora for machine translation. Gray offers a fascinating answer: Yes, but the really worthwhile corpora are patents, because they’re very precisely translated from one language to another. Wow.