Two weekends ago, I wrote a couple of scripts designed to let me (and anyone else who was interested) study the emergence of memes on Twitter over the course of days or weeks. I built the tools to study use of the #pman tag during the Chisinau protests in Moldova, but colleagues immediately pointed towards other stories they wanted to track, like the #amazonfail campaign. I’ve got high hopes that we’ll be able to say something coherent about how ideas spread on Twitter at some point in the future.
This weekend, I’ve been inundated with emails from friends warning me about precautions I should be taking to protect myself from swine flu. (There are some pretty good wikis emerging, for those who are interested.) And though I’m not especially planning on going out of my way to avoid human (or porcine, for that matter) contact, it’s been pretty amazing to watch Twitter get flooded with flu posts. I searched for “flu” on Twitter, walked away from my machine to get a beer, and came back to the message “5670 results since you started searching”.
It’ll be worth studying the spread of swine flu on Twitter – Evgeny Morozov is already worried that Twitter is spreading panic and misinformation, and it would be interesting to see if we can find correlations between flu chatter and the actual incidence of the disease, or discover whether media hype has a cycle independent of disease cycles. But who can wait for real data? Isn’t it worth figuring out just precisely how much people are freaking out, right now?
So I wrote a cute little script that quickly calculates what percentage of current Twitter traffic includes a particular keyword or tag. It takes advantage of the fact that Twitter sequentially numbers its posts, and includes this information in search results. This means you can retrieve a page of 100 search results and calculate how many tweets it took to get 100 results. That, in turn, lets you calculate what percentage of tweets, recently, contained the term you’re searching for.
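Here’s a rough sketch of that calculation in Python. It isn’t my actual script, just the arithmetic. The function name keyword_share and the sample IDs are made up for illustration, and it assumes you’ve already pulled the post IDs out of a page of search results and that those IDs come from a single sequential counter.

```python
def keyword_share(result_ids):
    """Estimate what percentage of recent tweets matched a search.

    result_ids: post IDs from one page of search results. Assumes IDs
    are assigned sequentially across all tweets, so the gap between the
    oldest and newest match counts every tweet posted in between.
    """
    if len(result_ids) < 2:
        raise ValueError("need at least two results to measure a span")
    newest, oldest = max(result_ids), min(result_ids)
    # Total tweets (matching or not) between the oldest and newest match.
    span = newest - oldest + 1
    return 100.0 * len(result_ids) / span


# Made-up IDs: 100 matches, one every 100 tweets.
sample_ids = [1_600_000_000 + 100 * i for i in range(100)]
print(f"{keyword_share(sample_ids):.3f} %")
```

With one match in every hundred tweets, this comes out at roughly 1%, which is about where “flu” has been sitting.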
Earlier today, I saw levels as high as 1.5% of all tweets mentioning the word or string “flu”. It’s quieted down by this point in the evening. Here’s a recent comparison of flu terms:
1.003 % flu
0.794 % swine
0.171 % swineflu
0.143 % #swineflu
0.055 % #influenza
0.005 % #flu
0.004 % gripa
(#influenza is in there because it’s been the dominant term in Spanish-language flu posts. gripa is there because my friend David Sasaki wondered why people weren’t tweeting about “gripacochina”.)
Just for comparison’s sake, “redsox” shows up in 0.12% of posts, and we’re in the 9th inning of a very good Red Sox game.
Some interesting data in there – looks like I can safely ignore the #flu tag, in favor of #swineflu. And I’d love to figure out the typical ratio between people using a phrase in plain text and people using it as a hashtag. But it’s hard to generalize anything from single data points – the fun is probably in running this tool once an hour or so and watching how it trends over time – perhaps I’ll do that tonight.
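For the curious, the plain-text-to-hashtag ratio for the numbers above can be worked out like this. It’s only a sketch: plain_to_tag_ratio is a hypothetical helper, the percentages are copied from the list above, and I haven’t checked whether a plain-text search also counts the hashtagged form, so treat the ratios as rough.

```python
# Percentages copied from the comparison above (one snapshot, one evening).
shares = {
    "flu": 1.003,
    "#flu": 0.005,
    "swineflu": 0.171,
    "#swineflu": 0.143,
}


def plain_to_tag_ratio(shares, word):
    """How many times more often the plain word appears than its hashtag."""
    return shares[word] / shares["#" + word]


for word in ("flu", "swineflu"):
    print(f"{word}: {plain_to_tag_ratio(shares, word):.1f}x")
```

The two ratios are wildly different (a couple of hundred to one for “flu”, close to even for “swineflu”), which is another reason not to generalize from a single snapshot.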
I’ve got a cute little Perl script that will take an arbitrary number of terms to search for as command line arguments – if anyone wants to turn this into a CGI program, let me know and I’ll send you my code. Too tired to write the CGI tonight…