
Help us find some language data

My colleague Hal Roberts has been hard at work on a fascinating research question: where in the world are the websites we pay attention to? It’s an important question for his work on surveillance – if most of the popular sites for Chinese audiences are hosted in mainland China (they are), then surveillance doesn’t have to occur at the edges of the national network, but within China.

The question is critical for my research, too. I’m interested in whether we’re more or less cosmopolitan in an internet age than we were in earlier media ages. One proxy for cosmopolitanism could be the ratio of domestic to international news sites that are popular in a given country – we might consider a country that looks abroad for news coverage to be more cosmopolitan than one where most media attention is focused on domestic sites.
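
To make that proxy concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the function name `domestic_share`, the `.example` site names, and the idea that each site can be hand-labeled with a single home country. Real input would have to come from a traffic ranking plus a judgment call about where each site is based.

```python
def domestic_share(top_sites, country):
    """Fraction of a country's popular news sites that are domestic.

    top_sites: list of (site, home_country) pairs. The home_country
    label is hypothetical and hand-assigned, not derived from real data.
    """
    if not top_sites:
        raise ValueError("need at least one site")
    domestic = sum(1 for _, home in top_sites if home == country)
    return domestic / len(top_sites)


# Invented example data, not real rankings.
sample = [
    ("news-a.example", "FR"),
    ("news-b.example", "FR"),
    ("news-c.example", "US"),
    ("news-d.example", "GB"),
]

share = domestic_share(sample, "FR")
print(f"domestic share: {share:.2f}")
```

A lower share would suggest an audience that looks abroad for news. The sketch treats every site as equally important; weighting by pageviews would be a natural refinement.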

For both of our research projects, we need really good data on global language distribution. It appears that a country that acts as a center for a global language, as China does for Chinese, or France for French, will tend to have a higher degree of locality than a country in a language’s periphery (Algeria for French or Taiwan for Chinese). To test this – and to see if it’s hiding other factors that explain our data – we need to know how many people in a given country are fluent in a given language. And we need to know, or be able to calculate, what percent of a language’s speakers are located in one country or another.
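
The two numbers we're after are simple to compute once the underlying table exists. A minimal sketch in Python, assuming a table of fluent speakers keyed by (country, language) and a table of country populations. The countries "A", "B", "C", the languages "x" and "y", and all the figures are invented, and picking the "center" as the country with the most speakers is my simplification, not a settled definition.

```python
# Invented figures for illustration only (millions of people).
population = {"A": 60.0, "B": 40.0, "C": 20.0}
speakers = {
    ("A", "x"): 57.0,
    ("B", "x"): 10.0,
    ("C", "x"): 3.0,
    ("B", "y"): 36.0,
}


def fluency_rate(country, language):
    """Share of a country's population that is fluent in a language."""
    return speakers.get((country, language), 0.0) / population[country]


def share_of_speakers(country, language):
    """Share of a language's speakers worldwide who live in a country."""
    total = sum(n for (_, lang), n in speakers.items() if lang == language)
    if total == 0:
        raise ValueError(f"no speakers recorded for {language!r}")
    return speakers.get((country, language), 0.0) / total


def center_country(language):
    """Assumed heuristic: the country holding the most speakers."""
    counts = {c: n for (c, lang), n in speakers.items() if lang == language}
    if not counts:
        raise ValueError(f"no speakers recorded for {language!r}")
    return max(counts, key=counts.get)


print(f"fluency of x in A: {fluency_rate('A', 'x'):.2f}")
print(f"share of x speakers living in B: {share_of_speakers('B', 'x'):.2f}")
print(f"center for x: {center_country('x')}")
```

The hard part is filling in `speakers` with fluency counts rather than mother-tongue counts, and doing so from a source we can cite.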

Hal’s been discovering that this is really tough data to get. The CIA World Factbook has data for about half the countries we’re interested in, and we need data for all. Ethnologue is focused on mother tongues, which leads to weird distortions in the data. Despite the fact that nearly everyone in Tanzania speaks Swahili, less than 5% speak it as a mother tongue, so it shows up as a minority language in Ethnologue’s data. Wolfram Alpha seems to have what we need… but you’re banned from scraping Alpha, and there’s no source for their data, which leaves us very reluctant to use it. The data on Wikipedia isn’t especially helpful – it’s largely extracted from either CIA or Ethnologue, and not well footnoted.

I’m posting the question in the hopes that one of the brilliant folks who reads this blog might have a line on a data set for us, or could pass this query to someone who does. We need information on who speaks what where, and on what percentage of a language’s speakers globally are in a particular country; we need to know the source of the data; and we would greatly prefer to work with open data. If you’ve got any leads, please post ’em in the comments, or drop me a line. Thanks in advance.

9 thoughts on “Help us find some language data”

  1. I’ve just finished a conference where every participant spoke at least three different languages fluently (and each about a third of the time, it seemed), so I suspect language data is going to keep getting fuzzier and fuzzier. From what I’ve seen, Ethnologue is always cited as the authority, but Wikipedia has a list of other sources:


    I’m not sure if there is a global aggregate of national census data with regard to language, but that seems like it would be the most accurate.

    Two footnotes:

    1) There is a difference between the news we consume and where we consume it. Measuring site visits alone confuses that.

    2) I think multilingualism is only one aspect – and probably not the most important aspect – of cosmopolitanism. After all, you’re one of the most cosmopolitan guys I know … but also the only multilingual cosmopolitan I know.

    Have you seen http://peace.facebook.com? Now there’s a fascinating data set!

  2. One thing which is fun to look at is what languages people have their computers/browsers set to when they visit Global Voices in English (according to Google Analytics).

    Current top 10


    When you compare it to our current top 10 visitor countries you can see that the languages don’t always match the country ranking, which makes you wonder (generally) what bits of the data are inaccurate, or whether it says something about language behavior of people who visit Global Voices wherever they live – and of course what language services are comfortably available in various computer operating systems.

    United States
    United Kingdom

    Anyway, that’s just GV. But I thought if you were able to get lots more data on global internet traffic as well as language settings you might be able to deduce something – about some things.

  3. I’m sure you’ve already had a look at this, but I thought it was right to mention: as the EU emphasizes multilingualism in Europe, and since a specific commission has been created for that purpose, you will find interesting stuff on their website.

    Surveys about the topic you ask for:


    As well, here you can find a map of European minorities:


    I can imagine several regional organisations should provide such resources.


  4. I think that you will find some interesting ideas in the paper below. They looked at using other census data to predict the different languages spoken in Australia. Demographics being, after all, the science of prediction based on fuzzy, inaccurate, incomplete data that is changing over time.

    Modelling languages other than English spoken in Australia using census data, Lujuan Chen, Paul Romanis and Katie Palin, Research paper 1351.0.55.002, Australian Bureau of Statistics, Presented at the Australian Population Association 12th Biennial Conference, 15–17 September 2004, Canberra, Australia [PDF, 241 kb]


    The Australian Bureau of Statistics also has Standards for Statistics on Cultural and Language Diversity. They don’t seem to actually talk much about language, but they do talk about ethnic diversity, which is part of what you are talking about when you talk about cosmopolitanism, I think.


    Perhaps you would be better off with the Australian Standard Classification of Languages (ASCL), 2005-06 (go to the Downloads tab for the good stuff).


    Some of the complexity in this issue is expressed by this submission to an Australian Bureau of Statistics review by the Federation of Ethnic Community Councils of Australia in 2004.


    Keeping all that in mind, here is a lovely spreadsheet of languages spoken in Australia over time (1996, 2001 and 2006 censuses).


    Have you thought about cross-checking using other data sets? One of my colleagues here at RMIT is looking at remittances, which is a great way to look at links between countries at the family level. Match it up with trade data and you get a different view of cosmopolitanism. But maybe I don’t understand what you are looking for.

    PS: Sorry about the crappy long links, but I didn’t know if I could embed HTML tags into this comment box and didn’t want to have to do it twice.

    Jonathan O’Donnell
    Nautilus Institute
    RMIT University

  5. There’s some great stuff here that participants have posted, but I am wondering whether there isn’t a glitch in your assumptions: in my experience language use tends to be transactional. That is, you use what languages you have in a given environment to accomplish what you need to accomplish (to the degree you can). Case in point (maybe stupid): I can remember having “conversations” with people in Prague circa 1969 that consisted entirely of Beatles lyrics (the lingua franca of the 1960s, even behind the Iron Curtain) – this is sort of Solanasaurus’ point, I think.

    Hal’s criterion of attention is also transactional – pay attention FOR WHAT PURPOSE? That redounds also, I think, to the issue of “cosmopolitanism” – the Internet facilitates access to “otherness” but that isn’t a measure that says there was no interest previously. Your own fascination with sumo, for example, might have been latent (say) in 1960, but there would have been no resources for satisfying that interest. I had a schoolmate (in my tiny Montana mining town birthplace) who wanted IN THEORY to be interested in soccer, but at that time there were simply no resources for gaining information – no magazines, no newspapers, and certainly no TV (we got two channels, one of them poorly). So the desire was there… unfulfilled.

    Finally, how much language facility constitutes “fluency”? Do you have to speak English like a college prof to be “fluent”? Or is it enough that you can get done what you want to get done? Especially in a confined environment like the internet… I got fascinated at one point with Persian on-line magazines like “Cappucino” (now defunct, I think) even though I have no Persian (pix I could understand). I can kind of follow Beppe Grillo, even though I miss maybe 80% a lot of the time. I was able to get what Yoani is about (Cuba blogger) even though my Spanish is high school French with bad endings… and so on.

    The point is not to boast (pointless, since most of my languages make natives weep either with pain or laughter or both) but rather to point out that language use is multi-level, with all kinds of degrees of motivation and purposes. My point – if I have one? – is that there is a certain reductionism lurking here. People who want to access information will find a way (case in point – I know a Latvian kid who learned English from the Cartoon Network, so for a time he spoke fluent Batman), whereas people who have closed minds will stay closed, no matter how many languages they formally possess.

  6. Thanks for the suggestions, folks, and for the pushback.

    I’m aware that it’s an oversimplification to build a model in which people speak a single language, and I think Tony’s right to point out that people are frequently capable of interacting without achieving fluency in a language.

    The sort of transactional border-crossing Tony outlines in his comment is exactly what I’d expect to be seeing as made possible by the internet. In rural Montana, it’s now possible for the cricket fan to read cricinfo.com, one of the most popular sites in the world. That there’s very little readership for cricinfo.com in Montana is an obvious, but also interesting finding – the availability of the information is important for the small fringe of people interested in that content, but most people don’t choose to access that information. What’s more interesting is that, in some cases, people do choose to access information that’s assembled abroad – in UAE, for instance, many of the most popular news sites are not focused on Gulf audiences. This reflects the fact that the residents of UAE are primarily immigrants, focused on news from home.

    A language prevalence model – like the one we’re looking for – won’t be accurate. We know that. The hope is to find one that’s representative on a gross scale. I.e., it would tell us that Canada has a lot of English speakers, but isn’t the “anchor nation” for English, and has a lot of French and Mandarin speakers, but doesn’t anchor those languages either. That would then help explain the high degree of non-locality for news. I don’t deny the validity of the Beatles-lyric conversation in Prague (actually, you can have quite a rich dialog by just quoting Beatles lyrics…) but think that behavior is going to be a rounding error when considering tens of millions of pageviews in Prague of people reading Czech newspapers…

    In other words… to get a set that lets us correlate, we’re going to have to simplify, and perhaps to oversimplify. That said, the links gathered here already are a great introduction to some of the complexities that orbit this issue.

  7. You might get some pointers from Nicholas Ostler (linguist and author of the book Empires of the Word). In his book he makes some attempt to classify world languages by number of native and second language speakers, which I haven’t seen in other places. I don’t know what he’s worked on recently, but he seems to have given this issue some thought anyway.

  8. Interesting project. I think browser settings are not very helpful however. A huge percentage of Norwegians just download English Firefox or Internet Explorer, and never bother to set the language setting – many use Windows in English too, but they might still prefer to read in Norwegian, etc.

    Correct me if I am wrong, but the language setting never seems to have “taken off” – it doesn’t seem like many pages offer various language alternatives based on this setting… which would be confusing as well, since so few people bother to set it. The Canadian government, for example, insists on always offering me huge buttons for English and French, although this theoretically would be an ideal use for the language setting.

    (Although I suspect that part of it is also the need for showing off their bilingualism. If English-speakers could just surf seamlessly in English, without ever noticing that there were French versions available, the French-speakers might find that situation less than optimal, even if they had exactly the same experience with French.)
