Much of the conversation I read online about Wikipedia seems to be focused on the radical, audacious idea that an encyclopedia written by amateurs could rival the quality and comprehensiveness of encyclopedias written by professionals.
I’d like to suggest that this is by no means the most audacious aspect of the project Jimmy Wales has taken on.
Guest-blogging on Larry Lessig’s site a year ago, Jimmy wrote, “The goal of Wikipedia (and the core goal of the Wikimedia Foundation) is to create and provide a freely licensed and high quality encyclopedia to every single person on the planet in his or her own language.”
It’s that last clause that’s the radical one. It implies a massive data dissemination effort – either the distribution of billions of print copies of an encyclopedia, or participation in a huge digital divide project like the One Laptop Per Child effort. And it implies a translation and localization effort on a scale that boggles the mind.
Jimmy constrains the problem somewhat:
I will define a reasonable degree of success as follows, while recognizing that it does leave out a handful of people around the world who only speak rare languages: this problem will be solved when Wikipedia versions with at least 250,000 articles exists in every language which has at least 1,000,000 speakers and significant efforts exist for even very small languages. There are many local languages which are spoken by people who also speak a more common international language — both facts are relevant.
Ethnologue, a leading resource for language study, lists 6,912 known, living languages. 94% of the people in the world speak one of 347 languages which have one million or more speakers.
By contrast, Wikipedias with 250,000 articles currently exist only in four languages: English, German, French and Polish. (The Japanese wikipedia is at 242,000 articles as I’m writing this post.) Of languages with 100 million or more native speakers, the Chinese, Hindi, Spanish, Bahasa Indonesia/Bahasa Malaysia, Arabic, Portuguese, Bengali, Russian, Japanese and Punjabi wikipedias still need work – the Punjabi wikipedia is apparently up to 50 articles, eight more than the last time I wrote about it.
Jimmy had an interesting proposal at Wikimania about addressing this problem: paid coordinators who help recruit contributors and build these new Wikipedias. I hope whoever these new coordinators are, they’ll have a chance to learn from Ndesanjo Macha, who is both the father of the Kiswahili blogosphere and one of the key movers behind the Kiswahili wikipedia, which recently crossed the thousand article mark. At one of the sessions I moderated at Wikimania, Ndesanjo told us that building the Kiswahili wikipedia has involved extensive evangelization, leveraging offline and online social networks, strong-arming bloggers into writing articles, publishing articles on Wikipedia in Tanzanian newspapers, and persuading Kiswahili teachers in the US to make writing articles for the Wikipedia a class project.
Talking with Ndesanjo and other multi-lingual Wikipedians, I became aware of an interesting debate within the Wikipedia community. In trying to achieve Jimmy’s dream of a free encyclopedia for everyone in their own language, is the goal to create a single, coherent encyclopedia that can be translated into many different languages? Or to help every language community around the world create their own encyclopedia which will have somewhere from a little to a lot of overlap with another encyclopedia?
No one was brave, anglocentric or foolish enough to suggest that the solution to Wikipedia language problems was to start translating the English wikipedia into as many languages as possible… though I mentioned the Simple English Wikipedia, which is designed to help people learn English and as a source of simply-worded articles which could be translated into other languages. (This earned me a tongue-lashing from my friend Alek Tarkowski, who pointed out that speakers of other languages weren’t stupid, just uneducated in that language…) But some Wikipedians suggested that much of the translation problem could be tackled by finding the ur-version of articles and translating them into different languages: if the French version of the article on cheese was the definitive cheese article, the English wikipedia article on cheese should be a translation of the French article.
Searching Wikipedia for information on 18th-century encyclopedias and the idea of an encyclopedia as a summary of human knowledge, I found a real-world example of this suggestion. The English Wikipedia article on the 18th-century, Diderot-edited Encyclopédie appears to be substantially based on the 1911 Encyclopaedia Britannica article on Encyclopaedias. Challenging the neutrality of the article, Wikipedian Hardouin notes that the article “tries to portray the Encyclopédie as essentially an English work pirated by evil Frenchmen using dubious legal proceedings to dispossess innocent English editors” – an understandable bias in a hundred-year-old British encyclopedia, but perhaps less forgivable now – and suggests a translation of the French article on the Encyclopédie as an alternative.
While this may be the right way to settle a debate about the Encyclopédie, it’s unlikely to solve some other cross-language arguments. Which article is the ur-article to translate on Jerusalem? The Hebrew, the Arabic or the English? (Even if you don’t read all three languages, the choice of images on each of the three articles is an interesting contrast.) Raising the Jerusalem article in one of the sessions I was moderating, one Wikipedian suggested that the point of NPOV – neutral point of view – was that it should enable creation of a factual article satisfactory to the Arabic, Hebrew and English authors. It could present – but not assert – opinions held by Christians, Muslims and Jews, but would be sufficiently neutral as to satisfy all audiences. Whether or not such a compromise is possible, it raises other questions – how do debates about an article take place between speakers of different languages? Do we decide the ur-language and then debate in that language? Is this fair to an author who is weaker in the language of debate than in her native tongue?
Ndesanjo suggests another possibility – if we consider Wikipedia to be a project to “decolonize” cyberspace, as he does, it makes more sense to consider each language’s encyclopedia independent, with its own priorities, standards and processes. In some languages, the priority might be to create a widely usable reference quickly, which might focus on translating a lot of articles from a convenient encyclopedia, like the English wikipedia. Or it might be to document aspects of the culture associated with the language likely to be undocumented in other languages. Ndesanjo gives the example of the Kiswahili wikipedia article on Mbege, a beer made from fermented millet and bananas. Mbege has merited a two-sentence stub in the English wikipedia, but it’s an important part of Tanzanian culture that Ndesanjo and collaborators want to ensure is preserved in cyberspace… which they’ve done with a much longer article.
The difference between the Mbege article in English and Kiswahili suggests that it might be worth searching for language-specific articles, articles that exist (or exist as full entries) only in smaller Wikipedias. Of the 1000 articles in the Kiswahili wikipedia, how many have no satisfactory parallel in the English wikipedia? How about for a large Wikipedia, like Polish? Are 10 of the 272,000 articles unique to the Polish edition, or 10,000?
(It wouldn’t be all that difficult to conduct this experiment, since Wikipedia articles link to versions of that article in other languages. Spider the wikipedia for a target language and follow the links to the English version of each article. When those links aren’t present, or when they link to a much shorter version of an article, flag that article as linguistically unique. It may also not be necessary to go through all this trouble – some wikipedias appear to feature their “unique” articles more often as featured articles of the day than articles that are derivative of other wikipedias – this data could be mined as well.)
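Here is a minimal Python sketch of the flagging step of that experiment. It assumes the spidering has already been done and that, for each article, we know its size and the size of the English article it links to (if any). The `Article` record, the `ratio` threshold of 0.25 and the sample sizes are all hypothetical, made up for illustration; a real run would have to decide what “much shorter” means.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Article:
    title: str
    length: int                    # size of the article in the target language
    english_length: Optional[int]  # size of the linked English article, None if no link


def is_linguistically_unique(article: Article, ratio: float = 0.25) -> bool:
    """Flag an article with no English counterpart, or with a much shorter one.

    `ratio` is an assumed cutoff: the English version counts as "much shorter"
    when it is less than `ratio` times the size of the target-language article.
    """
    if article.english_length is None:
        return True
    return article.english_length < ratio * article.length


def find_unique(articles: List[Article], ratio: float = 0.25) -> List[str]:
    """Return the titles of the flagged articles, in their original order."""
    return [a.title for a in articles if is_linguistically_unique(a, ratio)]


# Invented sample data, not real article sizes.
sample = [
    Article("Mbege", 6000, 300),        # English version is a short stub
    Article("Tanzania", 8000, 60000),   # English version is longer
    Article("Local topic", 2000, None), # no English link at all
]
print(find_unique(sample))
```

Comparing raw sizes is a crude proxy, since the same content takes a different amount of text in different languages, but it is probably enough to produce a shortlist for a human to review.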
Finding and translating the articles that are linguistically unique would have the effect of strengthening large wikipedias, like the English wikipedia, as well as calling attention to the original work being done in building smaller wikipedias. A Serbian contributor – bilingual between English and Serbian – noted that he rarely writes for the English wikipedia because so much already exists in the English version. Identifying and translating the unique articles in the Serbian wikipedia might balance these content flows.
This also opens an intriguing possibility for potentially controversial topics, like Jerusalem: the English language article on Jerusalem might include not only links to the Arabic and Hebrew versions, but to English translations of the Arabic and Hebrew versions, letting English readers see how the subject is covered in other languages. (“English” here is a placeholder – I think this would be interesting to try in any language where you can find translators capable of the language pairing.) There are lots of practical problems – you need to retranslate as the other articles grow, you need to find ways to present the translations that don’t confuse a casual user, and you need ways to deal with the combinatorial explosion of languages. (347 languages with more than a million speakers implies the need for a Polish – Punjabi translator, who may or may not exist. And it suggests 347 × 346 / 2 = 60,031 language pairings, each needing translation in both directions, for every article that gets this treatment…)
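For anyone who wants to check the scale of that combinatorial explosion, here is a small Python sketch. It counts pairings among n languages: n(n-1)/2 unordered pairs (Polish and Punjabi counted once) and n(n-1) directed translations (Polish to Punjabi and Punjabi to Polish counted separately). The function names are mine, chosen for illustration.

```python
import math


def language_pairs(n: int) -> int:
    """Unordered pairs of distinct languages: n choose 2."""
    return math.comb(n, 2)


def directed_translations(n: int) -> int:
    """Ordered (source, target) pairs of distinct languages."""
    return n * (n - 1)


n = 347  # languages with a million or more speakers, per Ethnologue
print(language_pairs(n))         # 60031
print(directed_translations(n))  # 120062
```

Even the smaller figure means tens of thousands of translator pools to find, which is why the Polish – Punjabi case is more than a joke.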
As Wales and the rest of the Wikipedia community start addressing the immense problem of producing free encyclopedias in 347 languages, it’s worth asking: “Are we writing one encyclopedia and translating it, or writing 347 encyclopedias and translating when necessary?” When I put that question to Wikipedians, some expressed confident opinions… which contradicted each other. I’m hoping I can provoke Jimmy to offer a more definitive statement on the approach Wikipedia is taking… or invite a larger discussion on a topic that I think is critical to the success of the project.
In thinking about language and encyclopedias, I found myself reading more about the Rosetta Project, which is attempting to document all the world’s languages in an online database. They’ve got word lists for roughly 3,000 languages, perhaps half the world’s living languages. One of the more audacious parts of their project involves creating a nickel disc microengraved with 15,000 pages of text which could serve as a Rosetta stone for future generations trying to decipher long-dead languages. Oh, and they’re launching a copy into space to rendezvous with the Wirtanen Comet in 2011. You know, an off-site backup.
TOTH to Brewster Kahle, for letting me know about the project.
This post is intended as a sequel to my May post, “Your Language or Mine?”