Your language or mine?

My friend SJ Klein and I spent a chunk of yesterday evening talking about Wikimedia’s language issues. SJ is a wikipedian and a language enthusiast – a polyglot; I am embarrassingly monolingual (and, lately, have been having difficulty spelling common names in my own native language. Sorry, Michael…) That key difference aside, we’re both interested in how generative media projects on the web can include speakers of as many languages as possible.

My regular readers know that I’m obsessed with the question of how the Internet will change as an additional billion users join the network. It’s a safe assumption that many of these billion users will not read English… and will not create content in English. Recent statistics from Technorati suggest that more blogposts currently exist in Japanese than in English; my research suggests that there might be even more blogposts in Chinese than in Japanese.

Wikipedia gives an interesting introduction to some of the potentials and challenges of a massively multilingual internet. To fulfill Jimmy Wales’s vision of a free encyclopedia for everyone in the world, in their own native language, Wikipedia needs to do one of two things (or a combination of the two): create a comprehensive encyclopedia in one language and translate it into multiple languages, or create comprehensive encyclopedias in every target language.

Wikipedia’s doing a little of each, with an emphasis on the second strategy – there are now ten huge encyclopedias (100,000+ articles), 29 big encyclopedias (10,000+ articles) and hundreds of smaller wikipedias. While translation between wikipedias takes place – and the Simple English wikipedia exists, in part, so it can be translated and serve as the starting point for a new Wikipedia – the global Wikipedia community is engaged in the creation of hundreds of encyclopedias, not just one.

But not all wikipedias are growing at the same rate. Some wikipedias are surprisingly large, given how few people are native speakers of the language. Others are surprisingly small given how widely spoken the language is.

I took a close look at this question today. Of the most widely spoken native languages in the world – languages with over a million native speakers – several are well represented by very large (100,000+ article) wikipedias: Spanish, English, Portuguese, Japanese, French and German. Some are represented by smaller, growing wikipedias: Chinese, Russian, Arabic. And three have very small wikipedias: Hindi, Bengali and Punjabi. The Punjabi wikipedia has 42 entries – the language is the first language for 104 million speakers.

Putting together a very rough metric, I calculated the number of wikipedia articles per million native speakers of the language (WA/MS) for languages with over 30 million speakers. The leader in the set is Polish, with a 233,740 article wikipedia and 46 million native speakers, a WA/MS of 5081.3. The German and English speakers have strong showings as well, with WA/MS of 3925 and 3656 respectively. (If we extend beyond the 30 most spoken languages in the world, the Scandinavians begin displaying their strength – Swedish weighs in with 18,041 articles per million speakers of the language. And the Icelanders have created a 10,059 article Wikipedia, despite the fact that the language has fewer than 300,000 native speakers. Were we to consider languages with no native speakers – Esperanto, Ido, Interlingua – we’d encounter division by zero errors… but discover that Esperanto has 43,687 entries in its wikipedia.)
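For anyone who wants to rework the numbers, here is a minimal Python sketch of the WA/MS calculation. The function name and the small table are my own illustration, not part of any existing tool; the figures are the ones quoted in this post (Polish and Punjabi above, Amharic from later in the post), and speaker counts are in millions. Languages with no native speakers are rejected explicitly rather than dividing by zero.

```python
def articles_per_million_speakers(articles: int, native_speakers_millions: float) -> float:
    """Return WA/MS: wikipedia articles per million native speakers."""
    if native_speakers_millions <= 0:
        # A language with no native speakers has no defined WA/MS.
        raise ValueError("WA/MS is undefined without native speakers")
    return articles / native_speakers_millions


# (article count, native speakers in millions), figures as quoted in this post.
figures = {
    "Polish": (233_740, 46.0),
    "Punjabi": (42, 104.0),
    "Amharic": (312, 27.0),
}

# Rank languages from highest to lowest WA/MS.
ranked = sorted(
    ((name, articles_per_million_speakers(a, s)) for name, (a, s) in figures.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{name}: {score:.1f}")
```

With these inputs Polish comes out at about 5081.3, Amharic at roughly 11.6 and Punjabi at roughly 0.4, which is the gap the rest of this post tries to explain.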

Of the ten languages that score lowest on this metric, eight (Punjabi, Oriya, Hindi, Gujarati, Bengali, Urdu, Malayalam and Tamil) are Indian languages. (So are the next three – Kannada, Telugu and Marathi.) The other two are Southeast Asian – Burmese and Javanese.

I strongly suspect that the slow growth of these wikipedias is not a function of their geography – Amharic, with 27 million native speakers and 312 articles, would place between Bengali and Urdu in terms of WA/MS if I extended the calculations beyond the top 30 languages. It’s a function of the digital divide. Swedish may only have 8.8 million native speakers, but the majority of them have Internet access. Net penetration in India is much lower (perhaps 5%)… and it’s extremely low and heavily restricted in Burma, which helps explain the size of that wikipedia.

Still, even with only 5% internet penetration, India has an estimated 50 million Internet users – more than any nation other than the US, China or Japan. This raises a complicating factor, which SJ and I argued about at length last night: what language does a multilingual person choose to write in?

I’ve talked about this question at some length with multilingual blogger friends. Many of the bloggers I know who speak Arabic fluently choose to blog in English. Their audience isn’t their countrymen – it’s an international audience. Furthermore, many of their countrymen (and women!) who are online are also bilingual – the ability to read and write in English is closely correlated with high levels of education, high incomes and internet access. The same is likely true in India – popular bloggers in India appear to be blogging primarily in English.

But wikipedia’s different, isn’t it? The goal is to create a body of knowledge useful for the wider world – surely Punjabi speakers want to ensure that people who speak only Punjabi have useful content when they come onto the Internet?

Well, yes. But they may also want to influence perception and opinion on topics important to them by creating articles on political figures, important issues, issues of national or regional pride. And it makes sense to contribute to the wikipedia which has a broad audience and, therefore, a maximum chance of being read and influencing opinion. Which likely argues that it makes more sense to edit the English wikipedia – with a huge audience – than the Punjabi one. There’s also a critical mass issue – until the Punjabi wikipedia hits a certain size – 1,000 articles? 10,000 articles? – it’s a project, rather than a resource.

If I understand SJ correctly, he’d like to see wikipedians editing articles in their native language, and another group of translators making sure those unique articles are translated into additional languages. I wonder if this will work – our experience with Global Voices is that it’s harder to get people to translate than it is to get them to write. But SJ’s proposed method would be critically important for projects like the One Laptop per Child project – children learn to read better in their native language. Having a wikipedia in Kannada or Burmese makes it easier to use the computer as a teaching tool. And it means that language is less of a barrier to some of the next billion users having access to critical content.

But it won’t be easy, I suspect. I hope that the Boston Wikimania conference can include some conversations with Swedish and Polish wikipedians so we can find out how those language communities have generated such interest in wikipedia. At the last Global Voices summit, participants from around the world listened intently to Ory Okolloh as she explained how the Kenyan blog community had become so robust – I hope wikipedians in successful communities can help along the language groups that are struggling at present.

Some rough data on the WA/MS index… By the way, the largest language not to have a wikipedia appears to be Madura, spoken by 8-14 million people in Indonesia. The most widely spoken language which doesn’t have a Wikipedia entry in the English wikipedia: Maninka, spoken by 3.3 million people in eastern Mali and Guinea.

  1. “China’s leading web search company has launched an online, user-generated encyclopedia modelled on Wikipedia, the hugely popular co-operative reference website that is blocked by Beijing censors. Unlike the encyclopedia developed by donation-funded Wikipedia, “Baidupedia” – the new service from Nasdaq-listed Baidu.com – is heavily censored to avoid offending the Chinese government”


  2. Ethan:

    I enjoyed reading this entry on wikimedia’s language issues.

    With reference to India, where many people are polyglots, I guess the default language for the moment is English. The reason being that PC users in India tend to write mostly in English. But this trend could undergo a change in the next couple of years as broadband users grow in India and the use of the Internet increases. I think in the next few months you might see an increase in regional language wikimedia from India, especially in Tamil, Telugu and Hindi. Why did I pick these languages? Tamil is spoken in other countries besides India…Sri Lanka, Singapore and Malaysia to mention a few, and the online Tamil community is pretty strong and robust. Telugu because Andhra Pradesh (where this language is spoken) has a sizeable PC penetration.

    About Punjabi wikimedia…I suspect that many of those who write in Punjabi probably live outside of India in the UK, Canada and the US. This trend could be true of those who are writing in Gujarati. I suspect that many people of Indian origin who write in Indian languages may be living outside India…this is an educated guess on my part, and I could be wrong in my assumption.


  3. A small role game: let us suppose you are a politician of a country with a “not-so-represented-on-wikipedia” language (for example, we choose India and Tamil language).
    Question: Would you invest some amount of money to employ a group of professionals (let us say 10 highly literate people) to write full time in the Tamil Wikipedia? If yes, why? If no, why? How would this change the quality of the wikipedia? Its growth? The amount of volunteer contributions? The perception of this wikipedia as a collective resource? The NPOV issue on, for example, entries about the government or politicians?
    I mean, would it be a savvy-investment in your opinion or something that would disrupt the “grass-root” power of Wikipedia?

  4. Hi Ethan, I enjoyed your blog entry as usual, but I have just one comment. Your statement that “children learn to read better in their native language” may be true, but is that really the point? Think of it in classic cost-benefit terms. Perhaps it is less “costly” in terms of effort to learn to read in one’s native language, but the “benefit” –being able to communicate with other people– might be a great deal less as well. Wouldn’t that child be better off able to communicate with as many people as possible, a la Metcalfe’s Law? When I was in Mali, a fellow (non-Malian) Geek started the Bambara Wikipedia, but was met with tepid (at best) enthusiasm from native Bambara speakers when he asked them to write articles. It turns out that even if they speak Bambara daily, when they wish to commit thoughts to paper (virtual or otherwise) they all do so in French. You can say what you want about colonialist legacies or cultural hegemony, but at least part of the reason for doing so must be that by writing in French, Malians expose themselves to a broader readership.

    The tension that exists between the mutually-exclusive forces of globalization and of balkanization is evident in many spheres besides Wikipedia, of course, and I am not at all advocating mass homogenization of culture. But I think the risk for that is very low. MTV tried to go global with their programming, and they found out that (surprise!) people in, say, Botswana didn’t really want to see a Cyndi Lauper video. (Hey, it was the ’80s.) Now, MTV’s hegemony is nearly global, but they merely replicated the format; the programming is local. Come to think of it, that is what Wikipedia is doing…I am just saying that it may be of little value for a Dogon child to learn to read in Dogon, which, even if it existed as a written language (and assuming 100% literacy among the Dogon) would have a readership of at most a couple of hundred thousand people worldwide…

  5. Amadomon – I’ve had a similar experience to yours working in Africa – while there’s a lot of talk (especially from aid workers and educators) about the preservation of languages, parents want their kids to learn the language of business, which is usually English or French.

    The other night, I was arguing for the economic logic of publishing in English as a Punjabi author – you now suggest another logical reason. I suspect there are great reasons on the other side, besides a sentimental attachment to the preservation of living languages – I hope SJ or others will weigh in and support that viewpoint because I fear I won’t do it justice…

  6. I am now, Ian – that would have saved me a lot of time… but hey, sometimes it’s worth working the data yourself to get a better understanding of what it really means…

  8. Sorry, Esperantist. I’d understood that Esperanto was supposed to be a second language so that everyone would be at an equal disadvantage speaking it – hadn’t realized that people had learned it as a native language.

  9. Great piece Ethan, definitely worth the time it took me to read and absorb it.

    Some comments if I may.

    While you quite correctly identified a ton of issues with regards to work on Wikipedia (and on the web in general), I see one fundamental issue that you may have missed simply because you are monolingual.

    I speak about 5 languages but, paradoxically, I only WRITE one very well (and another pretty badly). The reason is that most of the (traditional ) languages I know are spoken, not written. They have non-roman alphabets and I have not (and I know that this is true for a large part of the population I come from) had a chance to study them in a formal setting. I simply picked them up through use.

    All this might be neither here nor there but the end effect which is the crux of the issue is that when I have to document something by writing, I have no choice but to write it in English.

    Now, I will grant you that many of the South East Asian languages that you listed have users who can write in them since they have written alphabets but I am willing to bet a tremendous amount (how about 50cts?) that you will invariably find that most of these languages are not taught formally or used as official languages.

    There lies the issue: lots of people who can speak the language but no one who can write it.


  10. Just a follow up to your comments … the Bengali language wikipedia now has 2801 articles. Not too many, but definitely a 400% (or 5 times) growth since March (we had 542 articles back then).


  14. I heard unpublished Wikipedia statistics from its founder that most of the South Asian language wikipedias were edited from IPs outside the region! How’s that for linguistic ties getting stronger when one is distanced…

