I’ve blogged before about the idea that weblogs are a selective amplifier. For the most part, bloggers aren’t doing original reporting – they’re reading websites and RSS feeds and commenting on – or sometimes merely reposting – stories that they find most interesting. As a result, the blogosphere’s coverage of global events is twice distorted – once, by what mainstream media chooses to cover, and again by what bloggers choose to amplify.
Roughly four months ago, I started tracking New York Times headlines, looking for which were – and which weren’t – picked up by the blogosphere, according to Technorati. While there’s a tough methodology problem to solve with the data (more on that further down in this post), I’ve got a couple of months of it to play with – you can see today’s results here and get earlier data by changing the date in the URL…
Earlier this month, I wrote a set of scripts that pull data from the RSS feeds of popular news websites (CNN, BBC, the Guardian) and check them for the next five days for their presence in Technorati. The BBC scripts are sufficiently debugged that I’m putting them on Harvard’s server, and I’ll post a link to them here, and on my research page on Monday, just before I leave for India, so you can keep track of what bloggers are amplifying from the BBC and the New York Times.
With believable data, I decided to spend a bit of time today doing analysis. (Feel free to play along – the data I’m referring to is here – it’s sorted by most popular stories, so you can see what stories from the BBC are currently getting the most blogjuice.)
Of the 504 stories posted to BBC RSS feeds in the past five days, the average story was posted 2.5 times to weblogs. (That’s a bit deceptive – 2.5 is the mean. Both the median and mode are zero, which is to say that less than half of those BBC stories were blogged at all.)
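A quick way to see how a mean of 2.5 can sit alongside a median and mode of zero is the Python sketch below. The counts are made up for illustration and are not the BBC data; the point is only that a few heavily blogged stories pull the mean up while most stories stay at zero.

```python
from statistics import mean, median, mode

# Hypothetical blog-post counts for ten stories (NOT the real BBC data).
post_counts = [0, 0, 0, 0, 0, 0, 1, 2, 7, 15]

summary = {
    "mean": mean(post_counts),      # pulled up by the two popular stories
    "median": median(post_counts),  # middle of the sorted counts
    "mode": mode(post_counts),      # most common count
    # Fraction of stories with at least one blog post.
    "share_blogged": sum(1 for c in post_counts if c > 0) / len(post_counts),
}
print(summary)
```

With a distribution this lopsided, the share of stories blogged at all is probably a more honest headline number than the mean.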
While bloggers are fast, it takes us a while to adopt a story. Stories posted the same day I ran the script had 0.89 posts per story. Stories a day old had 2.28. There was little change for two-day-old stories – 2.38. Three- and four-day-old stories had 3.28 and 4.09 posts per story, respectively. My guess – it takes 1-2 days for stories to make it off the feeds into the blogs of people who follow those feeds closely. Then there’s an amplification effect where people read about the stories on popular blogs and reblog them on their own blogs. (Obviously, that’s a huge generalization from a small data set – once I’ve got a bit more data, I’ll try to address this issue in a more convincing fashion.)
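The age breakdown above amounts to grouping stories by how many days old they are and averaging within each group. Here is a minimal Python sketch of that grouping. It is not the script behind these numbers, and the function name and sample pairs are hypothetical.

```python
from collections import defaultdict
from datetime import date

def posts_per_story_by_age(stories, run_date):
    """Average blog posts per story, keyed by story age in whole days."""
    totals = defaultdict(lambda: [0, 0])  # age -> [post total, story count]
    for published, posts in stories:
        age = (run_date - published).days
        totals[age][0] += posts
        totals[age][1] += 1
    return {age: total / count for age, (total, count) in sorted(totals.items())}

# Hypothetical (publication date, blog post count) pairs, NOT the BBC data.
sample_stories = [
    (date(2005, 1, 27), 0),
    (date(2005, 1, 27), 2),
    (date(2005, 1, 26), 3),
    (date(2005, 1, 24), 4),
]
by_age = posts_per_story_by_age(sample_stories, date(2005, 1, 27))
print(by_age)
```

Note that this compares different stories of different ages in a single snapshot; following the same stories day by day would be a cleaner test of the amplification guess.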
Want to get a story into the blogosphere? While the front page of the BBC is a great way to get noticed – 75% of stories that run there get blogged at least once, averaging 5.57 posts per story – it’s got nothing on the technology section, where 90% of stories get blogged, averaging 10.3 posts per story. Rounding out the top five most popular sections are health, science/nature and the Middle East. (Again, disclaimer about a small data set applies. And again, I’ll try this over a month’s worth of data sometime soon.)
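Both per-section figures used here, the percentage of stories blogged at least once and the mean posts per story, can be computed from a list of (section, post count) pairs. The Python sketch below shows one way to do it. The function name and sample data are hypothetical, and this is an illustration rather than the analysis I actually ran.

```python
def section_stats(stories):
    """Map section -> (percent of stories blogged, mean posts per story)."""
    by_section = {}
    for section, posts in stories:
        by_section.setdefault(section, []).append(posts)
    return {
        section: (
            100 * sum(1 for p in counts if p > 0) / len(counts),
            sum(counts) / len(counts),
        )
        for section, counts in by_section.items()
    }

# Hypothetical (section, blog post count) pairs, NOT the BBC data.
sample_sections = [
    ("technology", 12), ("technology", 8),
    ("africa", 0), ("africa", 2), ("africa", 0), ("africa", 0),
]
stats = section_stats(sample_sections)

# The two orderings can disagree, so sort on each measure separately.
by_percent = sorted(stats, key=lambda s: stats[s][0], reverse=True)
by_posts = sorted(stats, key=lambda s: stats[s][1], reverse=True)
print(stats, by_percent, by_posts)
```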
One of the reasons I wanted to try this analysis on the BBC is that their Africa coverage is so consistently good. In some of my previous work, I’ve made the argument that mainstream media (MSM) tends to focus on wealthy nations at the expense of poor ones – the BBC is the one MSM outlet I’ve found that doesn’t exhibit this bias.
That said, bloggers do appear to exhibit a bias against African news, which ranks second from last in percentage of stories blogged and third from last in blog posts per story. (19.35% of stories blogged, 0.8 posts per story, less than a third of the overall average of 2.5 posts per story.) But African stories aren’t the least likely to be blogged – UK News stories are (17.54% of stories blogged, 0.63 posts per story). Some possible explanations for this: the BBC runs a lot of UK news – 114 stories in the period watched, versus roughly 32 for each other region. And many bloggers reading the BBC may be turning to it as an “alternative voice” for international stories, rather than for domestic coverage.
More weirdness – BBC’s entertainment and business sections are far less popular than I would have thought. Again, I wonder if this is a result of lefty bloggers coming to the BBC for world news at the expense of other sections. Would love folks’ thoughts on why this behavior might be exhibited – I’ll work to see whether I see the same results in other data sets.
Here are the results I came up with earlier today – apologies for formatting, but the Harvard blog software doesn’t always like tables:
analysis of bbc RSS/Technorati data from 1/27/2005
(extremely preliminary results)

sorted by % of stories picked up by bloggers

section           % blogged    posts per story
technology        90%          10.3
front page        75%          5.57
health            69.23%       4
science/nature    63.63%       3.73
middle east       45.83%       2.76
asia-pacific      43.58%       1.15
business          42.86%       0.93
americas          41.95%       1.68
entertainment     35.29%       0.59
south asia        32.25%       1.32
europe            29.41%       1.61
africa            19.35%       0.8
uk news           17.54%       0.63

sorted by blog posts per story

section           % blogged    posts per story
technology        90%          10.3
front page        75%          5.57
health            69.23%       4
science/nature    63.63%       3.73
middle east       45.83%       2.76
americas          41.95%       1.68
europe            29.41%       1.61
south asia        32.25%       1.32
asia-pacific      43.58%       1.15
business          42.86%       0.93
africa            19.35%       0.8
uk news           17.54%       0.63
entertainment     35.29%       0.59

Some quick methodology thoughts, for you methodology geeks out there:
Establishing canonical URLs is a real problem. BBC stories tend to have two URLs per story, of the form http://news.bbc.co.uk/2/hi/uk_news/education/4194669.stm. That “2”, right after the co.uk/, sometimes also appears as a “1”. I’m guessing this serves as an edition number. Some stories appear only as 1’s, others only as 2’s, some as both. To get comprehensive counts on Technorati, I’m searching for both 1’s and 2’s and summing the results.
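For the curious, the "search both editions and sum" step can be sketched in a few lines of Python. This is an illustration under assumptions, not my actual script: the function names are made up, and count_links is a hypothetical stand-in for whatever Technorati lookup you have on hand, not a real API call.

```python
import re

# Matches the edition digit right after the host, as in the example URL above.
EDITION = re.compile(r"^(http://news\.bbc\.co\.uk/)[12](/.*)$")

def edition_variants(url):
    """Return both the /1/ and /2/ forms of a BBC story URL."""
    match = EDITION.match(url)
    if match is None:
        return [url]  # not in the expected form; search it unchanged
    prefix, rest = match.groups()
    return [f"{prefix}{edition}{rest}" for edition in ("1", "2")]

def total_links(url, count_links):
    """Sum link counts over both edition variants of a story URL.

    count_links is a hypothetical callable (URL -> int) standing in for
    a link-count lookup; it is an assumption, not a Technorati function.
    """
    return sum(count_links(variant) for variant in edition_variants(url))

story = "http://news.bbc.co.uk/2/hi/uk_news/education/4194669.stm"
print(edition_variants(story))
```

Summing assumes a blog post links to only one of the two forms; a post that links to both would be counted twice.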
If only it were always that simple. The New York Times uses a content management system that coats URLs in cruft and batter-fries them. Some bloggers remove the cruft before blogging, others don’t. And there’s a special feed that many bloggers use that allows linking to stories in a way that they won’t get hidden behind the for-pay firewall. This makes tracking how a story appears in the blogosphere very, very difficult.
What would help a great deal is if a search for a partial URL within the Technorati API matched all subsidiary URLs. For instance, I could search for “http://news.bbc.co.uk”, get the thousands of links that begin with that string, and do the sorting on my own to identify which story is which (seizing on the seven-digit number + .stm string, for instance). The API doesn’t seem to work that way – do that search and you get about 100 BBC matches, instead of the 800+ you’d get by searching URL by URL.
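If a prefix search did return every linked BBC URL, the sorting on my end could look something like this Python sketch. It assumes every story URL ends in a seven-digit number followed by .stm, as in the example earlier; the function name and the second sample story ID are hypothetical.

```python
import re
from collections import Counter

# Seven digits followed by ".stm", e.g. ".../education/4194669.stm".
STORY_ID = re.compile(r"/(\d{7})\.stm")

def count_by_story(linked_urls):
    """Count inbound links per story ID, ignoring URLs with no story ID."""
    counts = Counter()
    for url in linked_urls:
        match = STORY_ID.search(url)
        if match:
            counts[match.group(1)] += 1
    return counts

# Sample links: the first two are the same story under both edition numbers.
links = [
    "http://news.bbc.co.uk/2/hi/uk_news/education/4194669.stm",
    "http://news.bbc.co.uk/1/hi/uk_news/education/4194669.stm",
    "http://news.bbc.co.uk/2/hi/africa/1234567.stm",  # hypothetical story ID
    "http://news.bbc.co.uk/",  # front-page link, no story ID
]
story_counts = count_by_story(links)
print(story_counts)
```

Keying on the story number has a nice side effect: the “1” and “2” editions of a story collapse into a single count without any special handling.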
I sat down with Dave Sifry last weekend and he seemed to think the behavior I’m seeing is a bug – he expects an abbreviated URL search to return all the subsidiary pages. I chatted briefly with Kevin Marks over IRC today, and he thinks the behavior is due to the API polling a smaller, faster DB for searches for “popular” URLs, rather than polling the more comprehensive DB. (Makes a lot of sense – we did that both at Tripod and at Lycos to improve catalog speed.) Technorati’s been hugely helpful with this research and I’m sure we’ll find a solution at some point soon.