Update: The interface to search the PeopleFinder database is up. According to our friends at Salesforce.com, there are now 87,000 records in the database, all entered by hand.
Jeff Jarvis, who’s done an excellent job of blogging various Katrina recovery efforts, sees an opportunity for a dialog about reactions to future natural (or, god forbid, manmade) disasters – he’s calling the idea Recovery 2.0.
I think Jeff’s on the right track here, although I think we’re probably at Recovery 0.2a in software terms rather than 2.0 – we’re a long way away from a 1.0 response from the web community that we could all be happy with. I hope folks will take time to document the work they did to help out with Katrina, and that we’ll keep developing and refining these tools after the immediate need has passed. Unfortunately we all understand that we’ll eventually face another disaster of one sort or another.
In that spirit, I wanted to offer some reflections on the small part of the Katrina PeopleFinder project I’ve been involved with. My basic conclusion: we got an amazing amount done in a very short time with very, very bad tools. If we’re lucky enough to get the same sort of response from the web community next time, and we take the time now to build some better tools, we’ll be able to tackle huge data entry challenges the next time around.
Timeline, as I saw it. Apologies to folks who are misidentified, or not identified.
– Friday afternoon, David Geilhufe starts organizing geeks to start “screen scraping” databases and bulletin boards with information about hurricane survivors. Some time that evening, David and others develop PFIF – the PeopleFinder Interchange Format, a spec and XML format for missing and found person information.
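For reference, a PFIF record is plain XML. The sketch below is from my memory of the 1.1 spec – the namespace, field names and values should be treated as illustrative, not normative:

```xml
<!-- Illustrative sketch of a PFIF-style record; not a normative example -->
<pfif:person xmlns:pfif="http://zesty.ca/pfif/1.1">
  <pfif:person_record_id>example.org/person.1001</pfif:person_record_id>
  <pfif:first_name>Joe</pfif:first_name>
  <pfif:last_name>Brown</pfif:last_name>
  <pfif:home_city>New Orleans</pfif:home_city>
  <pfif:home_state>LA</pfif:home_state>
  <pfif:source_url>http://example.org/board/post/1001</pfif:source_url>
  <pfif:note>
    <pfif:found>false</pfif:found>
    <pfif:text>Reported missing by his daughter in Houston.</pfif:text>
  </pfif:note>
</pfif:person>
```

As I understand it, the point of the format was that both the screen-scraping teams and the manual data entry team could emit the same structure, so everything could be merged into one database.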
– Saturday morning, David sends an email to some of the “usual suspects” in the activist technology world, asking for assistance in organizing a part of the PeopleFinder project: manual entry of data from “unstructured” sources, like bulletin boards, blog comments, etc. Other teams are working on importing data from structured databases and building the database where all this data will live – Zach Rosen from CivicspaceLabs is leading the structured data entry team.
I find Jon Lebkowsky in the #globalvoices channel on irc.freenode.net – we commandeer the channel as headquarters for the project. Jon agrees to take on the human element of the project – volunteer management; I take on the technical part – breaking bulletin boards into chunks and assigning them to users.
– We set up a wiki on the GlobalVoices Wiki and start assigning chunks of databases in a truly brain-dead fashion. After a few hours, we move the wiki to Katrinahelp.info, to clear up namespace confusion.
We rapidly figure out that assigning people a page of bulletin board results isn’t going to work, as the posts on each page change as new posts are added to the system. A pair of Craigslist geeks solve the problem on their site by creating HTML pages with the contents of 25 Craigslist posts each – they place them at constant URLs so we can index the pages easily for the wiki.
Nate Kurz comes up with a clever hack to index posts on bulletin boards that use sequential post IDs. I write an ugly perl script using his hack to generate assignment pages that have links to bulletin board posts.
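The post-ID hack, roughly, as a sketch (the URL pattern and names here are hypothetical, not taken from any real board): on a board with sequential post IDs, an assignment is just a range of IDs turned into direct links.

```python
CHUNK_SIZE = 25  # posts per volunteer assignment, matching the Craigslist pages

def assignment_chunks(base_url, first_id, last_id, chunk_size=CHUNK_SIZE):
    """Yield one list of direct post URLs per volunteer assignment."""
    ids = range(first_id, last_id + 1)
    for start in range(0, len(ids), chunk_size):
        yield [f"{base_url}?postid={i}" for i in ids[start:start + chunk_size]]

# 63 posts become three assignments of 25, 25 and 13 links
chunks = list(assignment_chunks("http://example-board.org/viewpost", 1000, 1062))
```

The weakness of this approach – that it hammers the board's servers with direct lookups – comes back to bite us later in the story.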
– Over Saturday night, a few volunteers check in and start entering data, primarily from the Craigslist pages. Sunday morning, we post links to several new data sets, using the technique Nate and I have developed. A small cadre of volunteers starts entering data… and promoting the data entry effort on their blogs.
A-List blogs start promoting the effort and we’re quickly swamped by volunteers. The wiki slows to a crawl, and we’ve got countless edit conflicts as new wiki users discover what happens when they try to edit a page at the same time as another user is also editing.
The database used to collect information from volunteers is crashing under the load – its load average is between 35 and 50, perhaps ten times what it should be. There’s a hasty decision made to take down the database and stop data entry until we can put data into a more robust database.
– During the data entry downtime, the team running the wiki reconfigures it to handle a greater load. Nick Branstator, a developer who’s already helped develop Air America Radio’s Katrina Voicemail for VoodooVox, comes over to my house and we start developing tools to scrape bulletin boards that don’t have sequentially numbered posts. He scrapes two large boards before heading home after dinner.
Building on Nick’s model, Steven Skoczen, Intelliseek’s Matt Hurst and other programmers scrape another dozen bulletin boards overnight. By Monday morning, we’ve got thousands of 25-post chunks ready to be entered into the database.
– The new database, hosted by Salesforce.com, is up by 10pm EDT Sunday night, and volunteers enter data through the night. By 4am, there are 7,000 records. When I log on at 8am on Labor Day, there are 12,000.
Volunteers pour in through Labor Day and by 9pm, we’ve reached the 50,000 record mark. I also realize I’ve reached the burnout point and tell David that I need to hand off my part of the project. David quickly finds volunteers to take over my role, including Paul Schreiber, Deborah Finn and others. A little more than 48 hours after clocking in to the project, I’ve clocked out.
With absolutely no figures to back up this statement, I’m guessing we’d readied about 90% of the known bulletin board posts for assignment by midnight last night. The vast majority of the data entry work is done… but the PR machine is just kicking into gear. As of 6am, the PeopleFinder Volunteer page on the wiki is the tenth-most linked page according to Daypop and folks on the team are starting to get phone calls from the press.
The project’s not done – more data will keep coming in as more refugees get online access and can post information about their whereabouts. And the key part of the system – an interface to the data in the database – is still missing. But a group of loosely organized people did an amazing job of tackling a huge data entry problem in roughly 36 hours.
People want to help.
None of us were prepared for the volunteer turnout – indeed, the willingness of people to help us out brought our system to its knees more than once. Midday Sunday, I recognized the tags of most of the people claiming chunks from the wiki – many were friends from my LiveJournal community. By the time the database melted down, people I knew were in the minority.
Basically, hundreds of people saw the requests for help on BoingBoing, Metafilter or elsewhere and pitched in. In many cases, it was the first time a volunteer had encountered a wiki… but people coped with the new technology remarkably well.
I got dozens of emails thanking me for an opportunity to help out. I suspect a huge number of people were sitting at home in front of the TV this weekend feeling helpless, and were grateful for something they could do, above and beyond writing a check, that made them feel hopeful.
Sometimes code is the solution. Sometimes 2,000 loosely organized people are the solution.
I got a dozen emails or blog comments from people asking – basically – why we were being luddites and having people enter data into forms instead of writing scripts to do the data entry automatically. I responded to some of these by asking people to look at five of the entries on a bulletin board and to get back to me if they still thought scripts were a good idea.
Here’s the problem. A typical message board post looked something like this:
My father, Joe, was working in New Orleans and hadn’t evacuated – he was living in Jefferson Parish. We don’t know if he’s okay. Please call me or Mom in Houston – Lisa Brown, Houston, TX.
To parse that post automatically, a script needs to figure out that “My father, Joe” is probably named “Joe Brown” and that “We don’t know if he’s okay” means he should be marked as “missing” in the database. While it’s very simple for a human to draw those conclusions, programming a computer to make them is a major artificial intelligence challenge.
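To make the difficulty concrete, here’s an illustrative sketch (the regex is my own, not anything we actually ran) of what a naive name-extraction script does with that post:

```python
import re

POST = ("My father, Joe, was working in New Orleans and hadn't evacuated - "
        "he was living in Jefferson Parish. We don't know if he's okay. "
        "Please call me or Mom in Houston - Lisa Brown, Houston, TX.")

# A naive script grabs capitalized word pairs as candidate names...
candidates = re.findall(r"\b([A-Z][a-z]+ [A-Z][a-z]+)\b", POST)

# ...and gets place names plus the *reporter* ("Lisa Brown"), but never
# "Joe Brown": the missing person's surname exists only by inference from
# "My father" combined with the signature at the end.
```

Telling “missing” from “found” from “just worried” requires the same kind of inference, which is why hand entry won.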
Computer programmers are naturally inclined to solve problems with code. That’s because we’re lazy – not lazy in the bad, won’t-get-out-of-bed sense of the word, but in the good, avoid-boring-repetitive-tasks-at-all-costs type of lazy. This is usually a good thing – most people don’t like boring, repetitive work, and it costs money to hire people to do even the most mind-numbing jobs.
But when 2,000 people show up and ask for something to do, it’s a great idea to take advantage of their generosity. Estimating that it took roughly two minutes to enter each name into the database, volunteers donated roughly 2,250 hours of time over the past 48 hours to do data entry. That’s an $11,600 in-kind contribution, valuing people’s time at US minimum wage.
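Reconstructing that back-of-envelope estimate, assuming two minutes per record and the 2005 US federal minimum wage of $5.15/hour:

```python
records = 67_500              # roughly what ~2,250 hours at 2 min/record implies
minutes_per_record = 2
minimum_wage = 5.15           # dollars/hour, US federal rate in 2005

hours = records * minutes_per_record / 60   # 2,250 volunteer-hours
value = hours * minimum_wage                # roughly $11,600 of in-kind labor
```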
Could a talented programmer solve the unstructured data parsing problem in 120 hours at $100 an hour? Possibly. Probably not. And 1,999 other people wouldn’t have had the chance to help out and feel good about doing their part.
Simple tools work surprisingly well.
Wikis are spam-prone, hard for beginners to use, subject to arcane problems (edit conflicts) and make it too easy to create long, complex, unreadable pages (some as bad as my blog posts…).
Despite those flaws, they work surprisingly well as ad hoc workflow management systems.
In a perfect world, I would sit down with a couple of good developers and develop a workflow management system for the next time we need to get a thousand volunteers together to enter some data. It would have a simple, web-based interface that logged users in, assigned them a task, nagged them via email until they completed it, and gave administrators a comprehensive view of what was and wasn’t assigned.
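A minimal sketch of what that tracker’s core might look like – in-memory only, with no web front end or email nagging, and all names hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    task_id: str
    url: str                            # the 25-post chunk this task covers
    assigned_to: Optional[str] = None
    done: bool = False

class Tracker:
    def __init__(self, tasks):
        self.tasks = {t.task_id: t for t in tasks}

    def claim(self, volunteer):
        """Hand the next unassigned chunk to a volunteer; None if all taken."""
        for t in self.tasks.values():
            if t.assigned_to is None:
                t.assigned_to = volunteer
                return t
        return None

    def complete(self, task_id):
        self.tasks[task_id].done = True

    def overview(self):
        """The administrators' view: how much work is left, and where."""
        unassigned = sum(1 for t in self.tasks.values() if t.assigned_to is None)
        done = sum(1 for t in self.tasks.values() if t.done)
        return {"unassigned": unassigned,
                "in_progress": len(self.tasks) - unassigned - done,
                "done": done}

# Usage: three chunks, one volunteer claims and finishes one
tracker = Tracker([Task(f"chunk-{i}", f"http://example.org/chunk/{i}")
                   for i in range(3)])
task = tracker.claim("volunteer-1")
tracker.complete(task.task_id)
```

Notice that this is exactly what we made the wiki do by hand: claiming a chunk was an edit, and the page history was the admin view.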
But I don’t think it’s a burning need, because MediaWiki – once it was tuned to handle the load – worked pretty damned well. Some key things:
– Assignment pages need to be small. Huge ones lead to edit conflicts. Lots of small pages are better than a few big ones.
– Wikis where you need to log in to edit might turn some users away, but they do a nice job of preventing spam and make it very easy to track users who are having problems.
– It’s a good idea to put someone – or multiple people – in charge of wiki gardening early in a project.
It’s not just your tools that need to be robust. You’re dependent on everyone else’s tools as well.
As we turned hundreds of volunteers onto message boards to read and index posts, those boards – predictably – crashed. Wondering why so many users were accessing posts by post ID, two board sysadmins changed their indexing scheme to block that sort of access. That broke our data entry process midstream and forced numerous volunteers to abandon their work.
We made two big mistakes here. One was that we didn’t properly respect the sysadmins running these message boards. We should have let them know what we were planning to do and throttled our volunteer force so that we didn’t swamp their systems.
More critically, my post ID hack was a stupid way to solve this problem – a well-written scraper is a better approach. Then the data lives on servers the volunteer team controls, not on servers already overloaded with people posting missing-person information. Next time: lots of scrapers, no URL hacks.
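What the scraper approach looks like in outline – the network fetch is deliberately omitted here, and the function below (hypothetical names throughout) just turns already-scraped posts into the kind of static 25-post pages volunteers worked from:

```python
import html

def chunk_pages(posts, chunk_size=25):
    """Turn a list of scraped post texts into self-contained HTML pages."""
    pages = []
    for start in range(0, len(posts), chunk_size):
        body = "\n".join(
            f"<div class='post'>{html.escape(p)}</div>"
            for p in posts[start:start + chunk_size]
        )
        pages.append(f"<html><body>\n{body}\n</body></html>")
    return pages

# 60 scraped posts become three static pages volunteers can work from,
# even if the original board later goes down or renumbers its posts
pages = chunk_pages([f"post {i}" for i in range(60)])
```

The key property is that each post is fetched exactly once, by the scraper, instead of once per volunteer who opens the assignment page.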
Many of the people who are working on chunks of the PeopleFinder project are people who’ve known each other for years – sometimes in person, sometimes virtually – and trust each other a great deal. Most of the people I reached out to for help on coding problems are people I’ve known and worked with for over a decade. Many of the first volunteers who started entering data into the system – and who debugged our first data entry problems – are part of an extended LiveJournal community.
Basically, when net people try to solve a problem, they bring their posse with them. For me, one of the lessons of the weekend was discovering what a powerful force my posse can be, and how effective the network of posses around the net can be.
Around 4pm yesterday I realized two things: 1) despite the fact that we’d entered 80% of the existing data, data was still going to be generated and we might be entering data for another month to come; 2) I’m scheduled to give two talks in the next ten days and to submit an academic paper. Next time I get involved with one of these projects, I’m going to rope someone in early in the process so that I can hand my tasks off to her when, inevitably, I have to return to normal life. She, in turn, would find someone to shadow her, and so on.
Because you’re going to burn out, use generic email addresses that can be redirected. Roughly a thousand volunteers around the world now have my gmail address and I’m going to be redirecting email for weeks to come, because I’m an idiot. Don’t be an idiot – set up firstname.lastname@example.org before you do anything else.
We’re all going to learn more as we share the lessons of the online relief effort over the next couple of months – I hope these notes are useful to someone else as they think about how to build the set of tools we need to cope better with the next emergency. And a thank you, from the bottom of my heart, to everyone who pitched in on this project, whether you wrote code, entered data or promoted it. During a dark time, it’s a wonderful reminder just how many people want to do the right thing and lend a hand.
Pingback: infundibulum » More on Katrina and Translation
Pingback: BuzzMachine » Blog Archive » Recovery 2.0: A call to convene
Ethan, you and yours (all 2100 of ’em) are amazing. Truly, truly amazing.
Pingback: infundibulum » Katrina Data Entry Doohickey
It’s an amazing effort to use the Net and volunteers to accomplish something so very useful for the survivors. Projects like this can be organized in the future to help people in major emergencies and catastrophes around the world.
I tried the database out with a search for “John Smith” and the results were fast and cross-referenced all people located with John Smith at a refugee center. I hope that the folks at Salesforce.com keep that baby (database) super secure and ready to handle increasingly heavy loads from 1,000s of users.
Do you still need people for data entry and promotion of the database? I’m in if you need me, anytime.
BRE – Thanks for the offer. Looks like – for the moment – we’re done. 95,000 records, no chunks left to be assigned right at the moment. I’ll post a notice if that situation changes.
Thanks for documenting these lessons learned, and it’s very inspiring to see what can get done with some leadership and tools to facilitate collaboration.
Anyone who gives out personal email addresses to the burgeoning masses is a hero in my book. This is a great retrospective for those who come after.
Pingback: Community Knowledge Works
Wow! What a fantastic write-up of the effort & lessons learned. Super-valuable, and, as ever, thoughtful, sensible, and generous of spirit.
It was truly an impressive effort. I did my small bit, and I was exhausted after a few hours of data entry. I really admire all the work others put into this!
One thing that strikes me is that we need to ensure that systems are in place to make sure that people leave proper data in the first place. Most of the data I entered lacked any clear statement of which state/county it referred to, basic contact information to use if people were found (I assume because posters expected the information to be posted back to the forums rather than entered elsewhere), and, most importantly, any sense of how urgent the notice was. A lot of what I entered seemed to be people simply expressing concern about people they hadn’t seen in a while rather than what I would call a “missing person” report – but it was impossible to distinguish between the two. It seems to me that for rescue workers, such information might actually make things harder.
I don’t know if it is worth the effort, but perhaps there should be some standards developed for collecting such information – in consultation with rescue agencies – and then there should be a campaign to educate web administrators about these standards. One might object that in a real disaster such things are ad hoc. Nobody knew that NOLA.org would become such an important site for such info – but it strikes me as being comparable to knowing where the fire exits are in your building. If every web administrator knew where to go to look for such info they could quickly set up a proper data collection system. Maybe Google and Yahoo could even be convinced to provide the backend for such a system…
I posted an update at SmartMobs.com. The project’s still active; we’re still chunking and entering data. We’ve had several issues along the way, but for a distributed project with vague leadership, we’ve done remarkably well, and I think we’re building tools that will be more effective next time.
This is a great analysis of the PeopleFinder project. “Plan for Burnout”, in particular strikes me, as I’ve consistently burnt-out on activist projects, and never thought to simply accept that as reality and plan for that fact. That’s brilliant, thank you.
I think you’ve covered most of the bases, but here are a couple of other thoughts that came to mind for me:
Why Triage is your friend / Keep an eye on the top of the tree.
One trend – at least in the wiki-page generation area – was that people tended to pick the lowest-hanging fruits. Sites that were easy to index took priority over sites with larger amounts of data. While this isn’t necessarily a bad thing at first, it can mean that important tasks are left undone.
The stunning example of this is the Sun Herald site. It was a complete pain to index and, consequently, didn’t get done until day 3 or so. But it ended up having more data than any other site besides Craigslist (35,000+ entries). Ironically, data entry was suspended 8 hours after the Sun Herald page was posted, so barely any of that data was entered. Right now, there are still ~15,000 entries that never made it into the database. This is an instance where the system didn’t work.
The simple way to counter this is to have one of the first tasks be a simple triage step. Sites to be pulled are tagged with a rough count of data size, and an indication of how hard it’s going to be to pull that data. Since tasks aren’t really being assigned, this doesn’t require a leader per se, just a description of the work needed. In total, things worked well, but this might have made them work a bit better.
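That triage step could be as simple as a tagged list and a sort – the site names and numbers below are made up for illustration:

```python
# Tag each candidate site with an estimated size and scraping difficulty,
# then queue by payoff so a hard-but-huge site isn't left for day 3.
sites = [
    {"name": "easy-small-board", "est_entries": 500,    "difficulty": 1},
    {"name": "sun-herald-like",  "est_entries": 35_000, "difficulty": 5},
    {"name": "craigslist-like",  "est_entries": 40_000, "difficulty": 1},
]

queue = sorted(sites, key=lambda s: -s["est_entries"])
```

Sorting purely by size is one possible policy; weighting size against difficulty is another. The important part is that the estimate exists at all before anyone picks a task.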
Entropy takes breaks sometimes
I was surprised at how well self-organization worked with remarkably little communication. Since the group was goal-oriented, the simple completion and addition of tasks served as the communication process, and nothing else was needed.
Compared to nearly any other organization I’ve encountered, this group’s talk-to-work ratio is simply mind-blowing. Perhaps that’s a function of the task-centered, time-critical nature of this project, or perhaps it’s from the medium used to communicate. Either way, it’s definitely something to note in terms of group dynamics.
In any case, it was great ‘working’ with you. When recovery 3.0 is needed, we’ll have to do it again.
You said it man… I was in a doctor’s office for anxiety meds and will definitely need to change my email addresses :)
Pingback: …My heart’s in Accra » Recovery 0.2.1 - more on geeky attempts to help out with Katrina relief
This is why I think people should be taking a closer look at wifi community groups. Not because what we’re doing is absolutely essential (most often it’s not), but because it’s amazing that a group of individuals is using open-source methods to organize and create a specific change in their immediate environment.
My wifi group (in Montreal) has 20-30 volunteers. We have to hold meetings, organize a board that acts as an API with other bodies (like universities and funders), buy equipment, and trek all around Montreal installing hotspots and repairing them. We also organize press relations and marketing. And this is all done on the open-source model, because that’s what most of us have experience with and it is those values that have brought us together. How do we do it? I’m not completely sure, but we try this and that, see what works, and discard what doesn’t. For example, we very consciously use reputation and visibility within the community as a way of repaying volunteers. We consciously try to keep things task-based rather than role-based to allow for “swarming” and to limit non-use of volunteers. We use transparency, and we don’t draw clear limits between being a member and not being a member, because that helps in attracting volunteers (turning from a lurker into a volunteer is a short step).
Not revolutionary stuff, but an OSS influence is clearly visible – compared to other, more traditional community groups. And all the wifi groups (over a hundred or so worldwide) are experimenting with different ways to implement the OSS model in their real-world activities, including stuff like searching for funding, dealing with municipalities, attracting press attention, etc.
Based on the stuff that you and Joi talk about often, I think there is a hugely interesting area of looking at ways to use the OSS model in traditionally off-line activities.
Pingback: Weblogs Work » What We Learned From Disaster Blogging
Pingback: WorldChanging: Another World Is Here
Pingback: Open Source Disaster Recovery » humanitarian.info
Pingback: Unthinkingly.com» Blog Archive » Geeks Responding to Katrina: Relief 2.0
Pingback: …My heart’s in Accra » Digital activists find ways to help Kenya
Pingback: …My heart’s in Accra » Kenya: mapping the dark and the light
Pingback: …My heart’s in Accra » “How do I help?” - Introducing Nabuur
Pingback: Daniel Zimnikov: In the Wake of Krymsk Floods, Social Media Powers Russian Relief Efforts | Russian News by Daniel Zimnikov