I gave a talk at the New Museum in New York City on Saturday, along with Omar Wasow, the co-founder of Black Planet. It was good fun – Omar and I are nerds of the same generation, and as he showed slides of his beloved VIC-20, I found myself throwing devil horns in a gesture of 8-bit solidarity.
The talk was… sparsely… attended. Which is to say, the two friends from Global Voices who came to support me represented a third of the paying audience. And while this can be a bit dispiriting, Lova Rakotomalala was kind enough not only to attend, but to videotape the beginning of the discussion, which is now posted on his blog. (For those of you who can’t get enough video of me – and, frankly, you should seek professional help – Saturday’s talk was descended in part from my Berkman talk on mapping infrastructure and flow, available here.) And the conversation Omar, Lova, Solana Larsen and others had afterwards at the museum, and later over drinks, made for an extremely rewarding seminar experience, if not the grand NYC lecture I might have been hoping for.
Anyway… the talk tried to draw a line from the idea that we should study what people actually do online, rather than what we might hope they would do, to the phenomenon of social translation. As I put slides together, I kept coming back to a talk Zhang Lei gave at the China Internet Research conference. He was making the argument that, while Chinese-speaking internet users may now represent the largest group of internet users, there’s probably much less Chinese-language content online than English-language content.
As Lei admitted, we don’t really know that this is true, however. It’s really hard to get an answer to the question, “What percent of the internet is in English? In Chinese?” Near as I can tell, the last time anyone made a serious, rigorous attempt to answer the question was a study by Excite AtHome in 1999, which looked at 600 million webpages and concluded that 72% were in English. (If I’m wrong and missing the obvious, definitive study, please let me know in the comments – would love to know about it.) I haven’t found that study, but it’s referred to in an excellent piece in the American Prospect – “Will the Internet Always Speak English?” – and might be part of the study announced here, which looks at a wide range of research questions around a corpus of Excite data.
If Excite’s estimates were right, and the Internet was 72% English in 1999, it’s probably much less so now – in the past decade, the ITU estimates that internet penetration has almost tripled, and much of this growth has been in countries where English isn’t the predominant language. The rise of read/write web technologies like blogs has made it far easier for people to author content in their native languages. While the NITLE blog census, last updated in 2003, saw English, followed by Catalan (?!), as the dominant languages in the blogosphere, Technorati’s “State of the Live Web” in 2007 saw English as a plurality, roughly tied with Japanese… and I believe Technorati’s methods missed lots of Chinese blogs, undercounting that language. Still, search for “What percent of the Internet is in English?” and you’ll probably get directed to a Wikipedia page that tells you that 80% is a widely cited figure, but that it could be lower.
Why is this question so hard to answer? Well, it’s worth a close look at how people have tried to answer this in the past. In 1997, the Internet Society and Alis Technologies used a random number generator to find IP addresses. They retrieved whatever content these pages turned up (if any) and ran the text through a language analyzer to determine what language the page was in. This method was pretty cool for 12 years ago, but it would fail pretty badly today – thanks to virtual hosting, millions of sites in hundreds of different languages live at the handful of IP addresses that represent blogger.com, for instance.
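The 1997 sampling approach is simple enough to sketch. In the sketch below, `fetch_and_classify` is a stub of my own standing in for the two steps the ISOC/Alis study actually performed – an HTTP fetch of the address and a language-analysis pass over whatever text came back:

```python
import random

def random_ipv4():
    """Draw a random IPv4 address, one octet at a time."""
    return ".".join(str(random.randint(1, 254)) for _ in range(4))

def fetch_and_classify(ip):
    """Stub for the study's real work: fetch http://<ip>/ and run the
    returned text through a language analyzer. Here it just models the
    common case where no web server answers at a random address."""
    return None  # a real version would return e.g. "en", "zh", or None

# Sample a batch of addresses and tally the languages of any pages found.
sample = [random_ipv4() for _ in range(1000)]
languages = [fetch_and_classify(ip) for ip in sample]
hits = [lang for lang in languages if lang is not None]
```

The weakness the post describes falls straight out of this design: the sample is uniform over *addresses*, not over *sites*, so the millions of virtually hosted sites sharing a few IPs are nearly invisible to it.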
The NITLE blog census tried to solve this by spidering content, much as a search engine does, following links from one weblog to another. This isn’t a bad method for finding weblogs, but your results depend greatly on what blogs you seed the engine with – start with a lot of Catalan-language blogs and you’re going to get links to lots of other Catalan blogs, while if you don’t have any Chinese blogs in your starting set, you probably won’t find any. Basically, to get a fair picture of what languages are represented online, you’d need a pretty good guess at what languages you should be paying attention to and where to find good hubs likely to point to other content.
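The seeding problem is easy to see on a toy link graph. The blog names and links below are invented for illustration – the point is just that a crawl can only reach what its seeds connect to:

```python
from collections import deque

# Toy link graph: blogs link mostly within their own language community,
# and the (hypothetical) Chinese cluster has no inbound links from the rest.
links = {
    "catalan_a": ["catalan_b", "catalan_c"],
    "catalan_b": ["catalan_a"],
    "catalan_c": ["catalan_a", "english_x"],
    "english_x": ["english_y"],
    "english_y": ["english_x"],
    "chinese_p": ["chinese_q"],
    "chinese_q": ["chinese_p"],
}

def crawl(seeds):
    """Breadth-first crawl: visit every blog reachable from the seeds."""
    seen, queue = set(seeds), deque(seeds)
    while queue:
        blog = queue.popleft()
        for target in links.get(blog, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Seeding with a Catalan blog discovers the Catalan and English clusters,
# but never reaches the disconnected Chinese one.
found = crawl(["catalan_a"])
```

Swap the seed for "chinese_p" and the crawl finds only the Chinese pair – the census’s language breakdown is, in part, a property of its seed list.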
The Excite survey was based on search engine data. Search engines run huge spiders, which capture lots more pages than most researchers can hope for. And engines are tuned to filter out spam pages – nonsense content designed to game search engines – which tend to defeat most automated retrieval methods. But search engines don’t index all content: they can’t spider sites that require a login, like Facebook, or other authorization-based sites such as bulletin boards. They miss chat rooms, of course, and don’t always get data on fast-moving sites like Twitter. And search engines aren’t generally very generous with their data – most engines have stopped posting figures about how many pages they’re indexing, and I haven’t been able to find one that publishes statistics on how much of each language they’re indexing.
In his talk, Lei Zhang outlined a straightforward, probably overly simple, method for comparing English and Chinese content indexed by Google – search for an English term (he suggested “breast cancer”) and its Chinese equivalent, and compare the number of results. In a study on how Google’s PageRank algorithm works across languages, researchers suggest that “http” is a good term for these sorts of comparisons, as it shows up in text of any language that talks about the Internet. So I did a quick study:
Using data on the top ten languages listed by people who speak them as first or second languages (from the 15th edition of Ethnologue, listed on Wikipedia), I looked at how many entries each language’s Wikipedia contains, and used Google’s ability to search in selected languages to count matches for the term “http”. Hindi and Bengali both have an order of magnitude fewer entries than their linguistic peers, and Google doesn’t allow you to search specifically in either language, perhaps because there are too few pages indexed. German is the second-largest Wikipedia and ranks highly in matches for “http”.
The Chinese Wikipedia is comparatively small, ranking 12th, but Chinese outpaces every language but English on the “http” test. If the http test is a valid one – and I have no reason to believe it is – it implies that there are roughly a quarter as many Chinese pages as English ones, and that Chinese outpaces Japanese by almost five to one.
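The test itself is simple enough to sketch. The `lr` language-restrict parameter in the query URLs below is a real Google search parameter, but the result counts are hypothetical placeholders of mine, picked only to match the rough ratios described above – treat this as an illustration of the arithmetic, not a reproduction of the numbers:

```python
from urllib.parse import urlencode

def language_restricted_query(term, lang_code):
    """Build a Google search URL restricted to one language via the
    lr parameter (e.g. lang_en, lang_zh-CN, lang_ja)."""
    return "https://www.google.com/search?" + urlencode(
        {"q": term, "lr": f"lang_{lang_code}"}
    )

queries = {lang: language_restricted_query("http", code)
           for lang, code in [("English", "en"), ("Chinese", "zh-CN"),
                              ("Japanese", "ja")]}

# Hypothetical result counts standing in for what each query reports --
# chosen only so the ratios match the ones described in the post.
hits = {"English": 4_000_000_000,
        "Chinese": 1_000_000_000,
        "Japanese": 210_000_000}

chinese_share_of_english = hits["Chinese"] / hits["English"]   # ~0.25
chinese_vs_japanese = hits["Chinese"] / hits["Japanese"]       # ~4.8x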
It would be great if Google or other engines would reveal this sort of data – for those of us arguing that we’re entering the polyglot internet, a world where massive social translation is necessary, it would be awfully helpful to have a sense of whether English is still the majority language on the Internet, and when that might change.