If you want to know what people around the world are thinking and feeling, you need help from a translator. Recent events in Iran are a reminder that the internet and citizen media aren’t enough to give us access to events throughout the world – we need tools and strategies for bridging language gaps as well, or we limit ourselves to only the voices we can understand.
For those of us who think the Internet is a powerful tool for international understanding, language is a challenge we need to confront, a complex set of problems we need to address. I just had the chance to join a small band of people dedicated to solving these problems, joining in the Open Translation Tools summit, held this week in Amsterdam. I came away hopeful, sobered by the size and complexity of the problems, but thrilled that such a smart, creative and global group was willing to take on these challenges.
The internet has been polyglot since early days, but the rise of read/write technologies has brought issues of linguistic diversity to the fore. In our experience with Global Voices, we saw lots of people blogging in English as a second language until there were lots of their fellow speakers online… then we saw lots more bloggers in local languages. Once you’ve got an audience that speaks your language, it makes sense to blog, twitter or otherwise publish in that language. It’s extremely difficult to accurately estimate how many people are blogging in Chinese – figures from companies like Spinn3r or Technorati aren’t counting most of the China-hosted blogging platforms. The number is somewhere between enormous and freaking huge, and people who want to know what Chinese netizens are thinking had better hope we figure out how to clone Roland Soong sometime soon. (Roland and the EastSouthWestNorth blog are so important to English/Chinese dialog that I know of several folks who refer to plans for massive Chinese/English translation as “the distributed Roland Soong problem”.)
Other languages are moving online as a way to ensure their survival in a digital age. The 27,000+ articles in the Lëtzebuergesch wikipedia don’t reflect the size of the language (spoken by roughly 390,000 people in Luxembourg) but the passion of that community to ensure the language exists in the 21st century. While Jay Walker may predict the rise of English as the globe’s second language, I’m predicting that the internet will make it easier to document, share and keep alive the world’s linguistic diversity. (They’re not incompatible ideas, BTW, though I still think Jay’s overstating the trend.)
In other words, every single day, there’s more content online in languages you don’t speak, and you can read a smaller percentage of the internet. It’s not just a matter of learning Chinese, though that would be a great first step. We’re seeing content in Tagalog, in Malagasy, in Hindi, and it’s not clear how we’re going to read, index, search, amplify and understand all of it.
The folks at the Open Translation Tools summit (OTT09) have been working on this problem for a long time. Allen Gunn – “Gunner” to anyone who knows him – characterized the participants as toolbuilders, translators, and publishers. But the common ground is that the people represented at the gathering are pioneers, people who’ve pushed the boundaries to ensure that languages can be present online, and that we can translate between them.
Some of the folks in the crowd, like Javier Solá, can claim credit for bringing whole languages online. (That Solá, a Spaniard, can claim that credit for Khmer is its own wonderful story.) Dwayne Bailey, who’s done excellent work bringing African languages online through his project, translate.org.za, reminded the crowd of the painstaking steps necessary to bring a language online: one or more fonts to represent the character set, a keyboard map to allow text entry, appropriate unicode representations, support for the language within software like OpenOffice, the creation of utilities like spellcheckers. Internationalization is now part of virtually any open source project, but it still tends to be an afterthought, and several groups at the summit were focused on the painstaking work necessary to bring Indian, Central Asian and African languages online for the first time.
Thanks in part to the Global Voices tendency to occupy other people’s conferences – we don’t have an office, so we simply send a dozen people to cool conferences and hold our meetings before or after – publishers were probably the best represented group at the meeting. Many of the projects I most admire were represented, including Meedan, which bridges between Arabic and English speakers via translation, and Yeeyan, which translates English-language content into Chinese. It’s interesting to see the different models emerging around social translation. Meedan translates everything, first with machine translation, and then with volunteer human translators, to make English/Arabic conversation seamless. Yeeyan invites readers to suggest English-language content they think Chinese readers would benefit from reading – Jiamin Zhao, who leads their Beijing team, says this hasn’t been very popular with their users, and that much of the translation happens around large, established projects like the translation of The Guardian. And Global Voices just lets anything go – each language team gets to pick what content they want to translate and what tools they want to use.
Some of the publishers are toolbuilders as well. Ed Zad showed off dotsub’s lovely platform for subtitling and translating online video. While dotsub hosts thousands of subtitled videos, many of us know it better as the toolkit underlying TED’s ambitious open translation project. This model of hosting subtitled and translated videos for third parties is a major part of dotsub’s business model – Ed showed us subtitled videos from the US Army, which allow the Army to meet legal obligations to make all its content accessible to the hearing impaired at lower cost, since dotsub’s tools are far more efficient than the other technologies available.
Meedan offers a beautiful set of tools to allow volunteer translators to turn machine translations into more readable, human translations, and is working closely with Brian McConnell’s WorldWide Lexicon, which focuses on giving publishers a great deal of control over how their site is translated while embracing the model of social translation. I was excited to get a peek at Traduxio, which focuses on translating cultural texts, like Balzac and Chekhov, and building complex translation memories in the process.
One of the central questions at the meeting was whether toolbuilders were building the right tools for translators to use. A number of projects focused on building open source translation memories. These are tools that keep track of how a translator has rendered a particular word or phrase in the past and prompt her with past translations in a new document. Many professional translators use Trados, which is apparently one of those tools that’s an industry standard without being well loved. (One of the odd quirks of the translation industry, Ed Zad tells us, is that translation clients, not the translators, own the contents of these translation memories.) It’s not clear whether social translation projects are really using translation memories. We’ve talked about the subject a great deal within Global Voices, but none of our translation teams is using one… perhaps because they’re not aware of the open source ones available, perhaps because few of those open source tools are very good, or perhaps because it’s not how they’re used to working. Jiamin from Yeeyan made the same confession – perhaps because we’re working with volunteers who are translating, rather than translators who are volunteering their time, there’s not much push from within our communities for translation memory tools.
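The core of a translation memory is simple enough to sketch in a few lines. This is a minimal illustration of the idea, not any particular tool’s design – the segments and the fuzzy-match threshold here are invented for the example:

```python
# Minimal translation-memory sketch: store past (source, target) segment
# pairs and suggest the closest stored translations for a new segment.
from difflib import SequenceMatcher


class TranslationMemory:
    def __init__(self):
        self.pairs = []  # list of (source_segment, target_segment)

    def add(self, source, target):
        self.pairs.append((source, target))

    def suggest(self, segment, threshold=0.7):
        """Return (source, target, score) matches above a fuzzy threshold."""
        matches = []
        for src, tgt in self.pairs:
            score = SequenceMatcher(None, segment.lower(), src.lower()).ratio()
            if score >= threshold:
                matches.append((src, tgt, score))
        # Best matches first, so the translator sees the closest segment on top.
        return sorted(matches, key=lambda m: m[2], reverse=True)


tm = TranslationMemory()
tm.add("The meeting starts at noon.", "La réunion commence à midi.")
tm.add("The meeting starts at nine.", "La réunion commence à neuf heures.")
print(tm.suggest("The meeting starts at ten."))
```

Real tools layer terminology databases, segmentation rules and exchange formats (like TMX) on top of this basic lookup, but the prompt-the-translator-with-past-work loop is the same.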
There might be more traction for tools that helped with translation workflow. Professional translators tend to be closely project-managed, and work in teams, with a translator, an editor and a proofreader. Most of the social translation models use less complex systems – an editor usually reviews a translated text in a Global Voices community, for instance, but the system isn’t as formalized. And there seemed to be great demand for tools that matched potential readers of texts with translators, systems that could allow readers to flag a text they wanted to read in another language or show translators potential readership for a particular text. I moderated a session on “demand” which generated a wide range of ideas, from seeking data from Google Translate on what documents were most requested by users to creating Firefox plugins that automatically translated texts and allowed readers to request human-translated versions. My Global Voices comrades were exploring a set of ideas about rewarding translators, with recognition, with karma ratings that might translate into professional translation work, with micropayments for translations – all these ideas require new tools and working methods.
Google wasn’t present at the conference, but was the unspoken presence in almost every session. While there was widespread agreement that Google’s machine translation tools were far from perfect – and sometimes farcically bad – they’ve been getting lots better and some participants wondered whether we should be putting the effort into building new social translation systems if they’re going to obviate all our work in a few years. Personally, I think it’s a bad mistake to stop work because we think Google might be working on the same issues.
The languages where Google is good are ones where we’ve got huge corpora – sets of documents that exist in two or more languages, which have been “aligned” by algorithms so that it’s possible to see how one phrase has been translated into another. A corpus like the Europarl corpus – which contains millions of aligned sentences in eleven languages, taken from human translations of European parliament proceedings – can make it fairly easy to build these tools… though one wonders if they’re better at translating bureaucratic memos than casual conversations. (Another major corpus, the Acquis Communautaire, offers the whole body of EU law in 23 languages. Sounds like a blast to read.) These statistical machine translation methods get stronger as more aligned documents become available.
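A toy sketch shows why aligned corpora matter: even naive co-occurrence counts over sentence pairs start to reveal which words translate which. This is a deliberately simplified illustration (the three sentence pairs are invented, and real systems like the one behind Google Translate use far more sophisticated alignment models), not anyone’s actual pipeline:

```python
# Count how often each source word co-occurs with each target word across
# aligned sentence pairs, then normalize the counts into rough translation
# probabilities. With millions of pairs, the signal sharpens dramatically.
from collections import defaultdict

aligned = [  # tiny stand-in for a corpus like Europarl
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
]

cooc = defaultdict(lambda: defaultdict(float))
for src_sent, tgt_sent in aligned:
    for s in src_sent.split():
        for t in tgt_sent.split():
            cooc[s][t] += 1.0

# Normalize each source word's counts into a probability distribution.
prob = {
    s: {t: c / sum(ts.values()) for t, c in ts.items()}
    for s, ts in cooc.items()
}

best = max(prob["house"], key=prob["house"].get)
print(best)  # "maison" wins: it co-occurs with "house" in every pair
```

This is essentially the starting point of the classic IBM alignment models – the “more aligned documents” point above falls out directly: every additional sentence pair is another vote on which words go together.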
But some languages don’t have large corpora available – I don’t know where we’re going to find a large set of English/Malagasy translations, for instance. In these cases, rule-based machine translation might work better – one of our participants, who studies rule-based systems, argues that they’ve proved their utility in translating between closely related languages like Spanish and Catalan. They parse sentences into parts of speech, or into more complex intermediate representations, then translate word by word, restructuring the sentences into grammatically correct forms. Our friend pointed to a study he’d helped conduct which saw these rule-based systems doubling the efficiency of human translators from 3000 words a day to 6000 words, in closely-related languages.
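For closely related languages, the word-by-word approach really can be sketched this simply. The tiny Spanish-to-Catalan lexicon below is hand-built for illustration only; real rule-based systems (Apertium, for instance, which began with Spanish–Catalan) add morphological analysis and structural transfer rules on top of the dictionary lookup:

```python
# Toy word-by-word rule-based translator for a closely related language
# pair, where word order rarely needs to change. The lexicon is a
# hypothetical hand-built fragment, not real language-pair data.
lexicon = {
    "el": "el", "gato": "gat", "negro": "negre",
    "come": "menja", "pescado": "peix",
}


def translate(sentence):
    out = []
    for word in sentence.lower().split():
        # Unknown words are flagged rather than silently dropped,
        # so a human post-editor can spot them quickly.
        out.append(lexicon.get(word, f"*{word}*"))
    return " ".join(out)


print(translate("El gato negro come pescado"))  # "el gat negre menja peix"
```

The flagged-unknown-word convention is what makes this useful in the post-editing workflow the study describes: the machine does the rote substitution, and the human translator’s attention goes to the starred gaps and awkward spots.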
My sense is that the most exciting potential in the near future may be to use social translation to create corpora that could benefit statistical machine translation. That probably means ensuring that Google – admired and feared at gatherings like this one – has a seat at the table in a future discussion.
It’s a long path from the discussions in Amsterdam to a system that allows me to stumble upon a blogpost in Persian and request (and perhaps offer a bounty for) a translation. But those conversations have to start somewhere, and it was a pleasure to have a ringside seat for them in Amsterdam.
One of the projects taking place around the OTT summit is a “book sprint”, a five-day project to write a book that outlines the state of the art in open source translation systems. If that sounds crazy… well, it is, but not as nuts as you think. My friend Tomas Krag pioneered the model a few years back with a brilliant book on wireless networking in the developing world, and it’s been adopted by the fine folks at FLOSS Manuals. I’ll link when the book is available… which should be about three days from now!
You can read notes on each of the sessions on the OTT wiki – it’s a great summary of the discussions that took place.
Great summary of the state of the industry. I am feeling quite bullish on some major breakthroughs on the horizon, and there are now enough of us raising the language flag that the tipping point is getting closer all the time.
Keep up the inspirational writing
“The internet has been polyglot since early days”
Completely untrue. While other languages have existed online, it was by no means a multilingual internet. Just building a website requires a good command of English, since all the technologies are based in English, which excludes a vast chunk of the world. We couldn’t even type accented characters on the same page without encoding them as entities until UTF-8 became widespread a couple of years ago.
It’s a markedly Anglophone web and will undoubtedly continue to be so for some time to come. If ICANN were to become a world entity, this might shift a lot faster, but as it is, with so much of the net based in the US and its defiantly monolingual culture, this will not change. This becomes painfully clear when you work in more than just English.
Great summary. Thanks. Can you tell us how many people were there?
Renato, there were about 70 people there, much more than the organizers had expected.
Miquel, I absolutely agree that English has been the dominant language, and that it’s taken hard work to bring languages online – I tried to acknowledge some of those pioneers in this post. But there were dozens of languages represented in Usenet in the 1980s, and the origins of the web at CERN meant that we saw a number of European languages in the early web.
Thanks for this excellent post Ethan – this is such a good summary of the key questions and debates that were discussed in Amsterdam. A great resource for us all.
I’ve also added a post to the Meedan blog today about the question of demand for translation. We’re translating the post into Arabic too. Please do add your views:
Thanks for writing up your notes; sounds really interesting and as with the last one I can only say I wish I was there!
Your comments on the difference between “volunteers who are translating” vs. “translators who are volunteering their time” were interesting. It’s certainly interesting to see how the huge numbers of volunteer translators on the web will take to translation memory, e.g. in Google’s Translator Toolkit.
(I work for ToggleText, on proprietary MT, but I’m also very interested in this topic from the open source/Wikipedia perspective.)
I can tell you the size of the Japanese blogosphere…..
About 750k posts per hour….. 3x the size of the English blogosphere.
We haven’t started to index it yet as our customers haven’t been demanding it …. but it’s on the horizon.
We have access to the content…. we just haven’t turned it on as the hardware costs obviously aren’t cheap.
Thanks for weighing in, Kevin. That’s an interesting update on Technorati’s numbers from two years ago, where they saw Japanese roughly equaling English.
The really interesting question for me is the Chinese-language blogosphere. My guess is it’s very hard for you guys to index as blog platforms there don’t generally use pingservers. Would be very interesting to hear if you’ve got size estimates, though…
This is an exciting development on the horizon.
What I have been doing over the last few months is translating not-so-commonly-translated (yet popular) songs and poems from the Indian subcontinent. I have tried collaborating with various other readers of my blog, and it has been a success. In the process I have developed a good body of translations of songs and poems from a few major Indian-subcontinent languages such as Hindi, Urdu, Punjabi and Bengali.
I think organic translation collaboration is also an important component of this translation effort, precisely because some elements just won’t make sense if translated by algorithm – literature (particularly poetry and lyrics) falls into that category. There is a possibility of creating a platform for organic translation and translators as a sub-branch of this whole endeavor.
I would be very much open to participate in brain storming (and heart storming) in the process. Pls feel free to ping me at email@example.com