My colleague Hal Roberts has been hard at work on a fascinating research question: where in the world are the websites we pay attention to? It’s an important question for his work on surveillance – if most of the popular sites for Chinese audiences are hosted in mainland China (they are), then surveillance doesn’t have to occur at the edges of the national network, but within China.
The question is critical for my research, too. I’m interested in whether we’re more or less cosmopolitan in an internet age than we were in earlier media ages. One proxy for cosmopolitanism could be the ratio of domestic to international news sites that are popular in a given country – we might consider a country that looks abroad for news coverage to be more cosmopolitan than one where most media attention is focused on domestic sites.
For both of our research projects, we need really good data on global language distribution. It appears that a country that acts as the center of a global language, as China does for Chinese or France for French, will tend to have a higher degree of locality than a country in a language’s periphery (Algeria for French or Taiwan for Chinese). To test this, and to see whether it’s hiding other factors that explain our data, we need to know how many people in a given country are fluent in a given language. And we need to know, or be able to calculate, what percentage of a language’s speakers are located in one country or another.
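To make the two numbers we're after concrete, here's a rough sketch in Python of the calculation, assuming we had a table of speaker counts per country and language. The record layout, the function name `language_shares`, and the figures are all made up for illustration; the countries "A" and "B" and languages "X" and "Y" are placeholders, not real data.

```python
from collections import defaultdict

# Hypothetical placeholder records: (country, language, speakers, population).
# These numbers are invented for illustration and are not real statistics.
RECORDS = [
    ("A", "X", 80, 100),
    ("B", "X", 20, 200),
    ("B", "Y", 150, 200),
]


def language_shares(records):
    """Return, for each (country, language) pair, two fractions:
    the share of the country's population that speaks the language, and
    the share of the language's speakers (across all records) in that country.
    """
    # First pass: total speakers of each language across every country listed.
    total_by_language = defaultdict(int)
    for _country, language, speakers, _population in records:
        total_by_language[language] += speakers

    # Second pass: compute both fractions for each record.
    shares = {}
    for country, language, speakers, population in records:
        shares[(country, language)] = {
            "share_of_country": speakers / population,
            "share_of_language": speakers / total_by_language[language],
        }
    return shares


shares = language_shares(RECORDS)
for (country, language), values in sorted(shares.items()):
    print(country, language, values)
```

Note that the "share of language" figure is only as good as the coverage of the table: if a country is missing, every other country's share of that language is overstated. That's part of why we need data for every country, not just half of them.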
Hal’s been discovering that this is really tough data to get. The CIA World Factbook has data for about half the countries we’re interested in, and we need data for all of them. Ethnologue is focused on mother tongues, which leads to weird distortions in the data. Although nearly everyone in Tanzania speaks Swahili, fewer than 5% speak it as a mother tongue, so it shows up as a minority language in Ethnologue’s data. Wolfram Alpha seems to have what we need… but you’re banned from scraping Alpha, and there’s no source given for their data, which leaves us very reluctant to use it. The data on Wikipedia isn’t especially helpful: it’s largely extracted from either the CIA World Factbook or Ethnologue, and not well footnoted.
I’m posting the question in the hopes that one of the brilliant folks who read this blog might have a line on a data set for us, or could pass this query to someone who does. We need information on who speaks what where, and on what percentage of a language’s speakers globally are in a particular country. We also need to know the source of the data, and we would greatly prefer to work with open data. If you’ve got any leads, please post ’em in the comments, or drop me a line. Thanks in advance.