2011 has been an exciting year for those of us who usually complain that US audiences don’t encounter enough international news. Since protests in Tunisia succeeded in ousting Ben Ali from power, the news cycle has been dominated by stories of revolution in the Arab world and, tragically, by the destruction caused by the earthquake and tsunami in Japan and the drama of a possible nuclear disaster as a result. International news is very rarely the dominant story in US media – when the fine folks at the Project for Excellence in Journalism noted that the protests in Iran were one of the very few international stories that led a US news cycle, I analyzed a few years of their data and concluded that, aside from coverage of the Olympics, it was virtually the only non-US story in recent years to have done so. This year, we’re seeing this trend reversed – interest in the Japan disasters, and in the protests in Egypt and Libya, has been extremely high in US media. Perhaps there’s been a shift in public attention, in media coverage, or both.
We (by which I mean the folks actually capable of making it run – Hal Roberts, David Larochelle, Zoe Fraade-Blanar) have been revamping the Media Cloud tool this past year, trying to make it more powerful for media watchers to understand what stories have been receiving mainstream and citizen media attention, and how to characterize that coverage. We’re in a closed beta test of the new tool until May, but will be rolling it out to the general public before all the snow is melted in my native Pittsfield.
The most obvious way Media Cloud helps us understand what’s being covered is via word clouds, visualizations of what terms appear most frequently in news stories or blog posts in a set of media sources. The cloud above is the words that appeared most often in a set of 25 mainstream media sources in a week that ended on March 21st. Our servers subscribe to all the RSS feeds offered on the websites of those 25 MSM sources, and we download the entire text of the stories posted on those sites several times a day. So our word clouds visualize the words most common in the full text of all those stories, minus common words we remove using a stoplist we’ve developed. This is a pretty international word cloud – Japan, Libya, Qaddafi are some of the largest words visible, and prominent terms include Fukushima, a term we’ve probably never seen in a word cloud this broad until the nuclear crisis story broke.
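For the curious, here is a minimal sketch of the counting step behind a word cloud. It is not Media Cloud's actual code: the tokenizer is a simple regular expression, and `STOPLIST` and `top_terms` are hypothetical names with a tiny made-up stoplist standing in for the one we've developed.

```python
import re
from collections import Counter

# Hypothetical, tiny stoplist for illustration only; a real one is much longer.
STOPLIST = {"the", "a", "an", "and", "of", "in", "to", "is", "on", "for", "as"}

def top_terms(stories, n=10):
    """Count non-stoplist words across the full text of a list of stories."""
    counts = Counter()
    for text in stories:
        # Lowercase the text and split it into runs of letters/apostrophes.
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOPLIST)
    return counts.most_common(n)

# Made-up headlines standing in for a week of downloaded stories.
stories = [
    "Japan struggles to contain the Fukushima reactor crisis",
    "Rebels in Libya retreat as Qaddafi forces advance",
    "Japan and the world watch Fukushima",
]
print(top_terms(stories, 2))
```

The size of each word in the cloud would then be scaled by its count.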
How different is this word cloud from a more typical, less international cloud? Well, here’s the 3/21 MSM cloud next to a cloud of the same sources from late October, 2010, when US media attention was focused on congressional races and not on revolution and natural disaster. While it’s helpful to compare the two side by side, it’s easier to see some differences when we superimpose one on another.
Terms in blue appeared more often in our set of MSM sources in March, while those in red appeared more often in October. Terms in purple were approximately equally prominent in both sets – many represent evergreen topics or terms, figures who are always in the news, like President Obama.
While this is helpful in showing how the foci of coverage changed between late October and now, it’s less helpful in helping us understand the magnitude of that change. Just how different is the coverage we saw this past week from the coverage we saw in late October? Was October 2010’s coverage more similar to June 2010’s than it is to March 2011’s?
We can actually make a stab at answering this question quantitatively using a cool trick called “cosine similarity”. This is a technique computer scientists use to detect a type of similarity between documents. Basically, a computer program counts the appearances of words in a document (in this case, a week’s worth of media coverage by 25 outlets) and compares that frequency list to that of another document. If those documents are identical in word frequency – both mention Obama 23 times, Libya 5 times and basketball twice – they score a 1. If they’ve got no words in common, they score a 0.
(The actual math behind this is wonderfully cool, if slightly mind-bending. Imagine a set of documents with only two words in them – “Obama” and “NCAA”. In source A, Obama is mentioned 8 times, NCAA 2 times. Put a point on a graph at (8,2) – Obama’s our X axis, NCAA our Y axis, and draw a line that passes through 0,0 and 8,2 – that’s the vector that represents set A. In source B, Obama gets mentioned twice, NCAA 8 times – put the point at 2,8 and draw the vector for source B. The angle between vectors A and B is a measure of how similar the sets are, and taking the cosine of that angle is a simple way to scale the value to be between 0 and 1 for angles between 0 and 90 degrees. The trick, of course, is that documents contain words other than Obama and NCAA, and cosine similarity adds a new dimension to our graph for each new term. So the vectors we’re measuring when we compare all the words in 25 media sources over a week to another comparable week exist in 3000-dimensional space. Don’t bother imagining 3000-dimensional space – it will make your head hurt. Just imagine three dimensional space and think about two vectors that each emerge from 0,0,0 and each pass through an arbitrary point in positive x,y,z space – it’s easy enough to imagine measuring the angle between those two vectors. Then take it on faith that, mathematically, you can do the same thing in many-dimensional space.)
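The two-word example can be worked through in a few lines of Python. This is a sketch of the general technique rather than Media Cloud's implementation; `cosine_similarity` is a name chosen here for illustration, and the function handles any number of words (dimensions), not just two.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (dicts or Counters)."""
    # Dot product: only words that appear in both documents contribute.
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    # Length of each vector in word-count space.
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty document has no direction; treat as no similarity.
    return dot / (norm_a * norm_b)

source_a = Counter({"Obama": 8, "NCAA": 2})
source_b = Counter({"Obama": 2, "NCAA": 8})
print(round(cosine_similarity(source_a, source_b), 3))  # 0.471
```

Here the dot product is 8×2 + 2×8 = 32 and each vector has length √68, so the score is 32/68, or about 0.47. Because the measure depends only on the angle, a document that mentions every word twice as often as another still scores 1 against it.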
We’re tracking thousands of media sources with Media Cloud, and we’ve organized some into “media sets” for the sake of comparison and analysis. We’ve got a set of 25 “mainstream” sources that include some large US and British newspapers, some TV networks, and the Huffington Post. Another set of hand-chosen US political blogs includes three subsets, hand-coded (by Hal, over a very painful three days) into left, right and center collections. There’s a set of popular blogs, based on the 1000 most-trafficked blogs on Bloglines. These blogs cover topics much broader than just politics – many focus on technology, and a substantial subset cover topics like knitting, quilting and other crafts.
If cosine similarity is a useful way of comparing sources, we’d expect to see high similarity scores for sources that cover similar topics, and lower similarities for those focused on different subjects. And (fortunately) that’s what we see, comparing all sources in these collections for the week of 3/14 – 3/20/2011:
Highly similar collections are coded in red, highly dissimilar in blue, with orange and green as intermediate shades. We see very high similarity between centrist political blogs and political blogs as a whole (we’d expect to see lots of similarity here, since one collection is a subset of the other!), and more surprisingly, between mainstream media and centrist blogs, suggesting that the two sets use much the same language and talk about many of the same topics.
Collection comparisons that show some similarity include left and right political blogs to political blogs as a whole (again, to be expected as they are subsets), political blogs to mainstream media and mainstream media to popular blogs. Perhaps more surprising is the significant similarity between left and right political blogs – one explanation is that the left and right are often talking about the same topics, even if they’re trying to frame stories differently.
We see the least overlap between right-leaning blogs and mainstream media (perhaps consistent with a frequent rightwing complaint that media has a leftwing bias?) and between political blogs of all stripes and popular blogs. This last finding is reassuring – when we began building collections of popular blogs, we were surprised to discover how apolitical most were, and the cosine similarity test suggests that we might see this topical diversity in the lower similarity score.
This form of cross-comparison offers some intriguing directions for future research. We might compare each of the sources within our mainstream media set, for instance, and see what sources are most similar. This might suggest that one is derivative of the other… or that they’ve simply got similar interests and taste in language. Knowing that the New York Times and The Guardian have a high level of similarity, but the BBC is dissimilar to both (just a for instance – I haven’t calculated and don’t know this to be true) might offer instructions on how to focus our reading if we wanted as broad a diversity of stories as possible. Similarity between blogs and mainstream media might suggest clues for how the larger media ecosystem works – we might expect ideas to spread more often from mainstream sources to blogs (or vice versa) when we see similar coverage patterns.
The question raised at the beginning of this post – are we seeing an unusual focus on international news – suggests that we try another type of comparison – comparison over time. While the news cycle has shortened from weekly, to daily, to hourly, it still takes time for the “restless searchlight” to move from one part of the world to another – events like an earthquake or a rebellion tend to generate stories for days and weeks at a time, and even if other important events arise, it takes time for journalists to redeploy and cover other events. So we’d expect to see more similarity between last week’s news and this week’s than between this week’s and six months ago. And again, we do:
Cosine similarity (y-axis) versus weeks from 3/14, Mainstream Media collection
The graph above shows the similarity between the words used in the 3/14-3/20 coverage of our 25 mainstream media sources and collections 1-48 weeks earlier. (I didn’t test all sets, just the first three weeks, then monthly through six months, then another check to 48 weeks prior.) There’s a pretty distinctive pattern – current coverage is quite similar to last week’s coverage, and the similarity rapidly drops off, reaching what seems to be a steady level. (Mean cosine similarity in this set is 0.723, with a standard deviation of 0.06, so the “steady state” line we might draw at 0.68 is within one SD…) We see similar graphs for popular and political blogs.
Week to week comparisons of MSM data sets. Graph shows cosine similarity versus the start date for the earlier data set.
How similar is the news week to week? Walter Lippmann famously observed, “The press is… like the beam of a searchlight that moves restlessly about, bringing one episode and then another out of darkness into vision.” Just how restless is that searchlight?
Well, that depends on the week. I ran comparisons between successive weeks for 27 weeks on our mainstream media collection. The mean cosine similarity from one week to the next is 0.905, with a standard deviation of 0.061. The only weeks that fall more than one standard deviation from the mean are the past four weeks, which differ from one another much more sharply.
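A rough sketch of that check: given a list of week-to-week similarity scores, flag the ones that sit more than one standard deviation from the mean. The scores below are made up for illustration and are not our data, `flag_unusual` is a hypothetical helper, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev

def flag_unusual(similarities, k=1.0):
    """Return indices of scores more than k sample standard deviations from the mean."""
    m = mean(similarities)
    sd = stdev(similarities)  # Sample standard deviation (an assumption here).
    return [i for i, s in enumerate(similarities) if abs(s - m) > k * sd]

# Made-up week-to-week cosine similarities, oldest first (NOT the post's data).
scores = [0.93, 0.92, 0.94, 0.93, 0.92, 0.93, 0.80, 0.78]
print(flag_unusual(scores))  # [6, 7]
```

In this toy series only the last two comparisons are flagged, which is the shape of result described above: a long run of similar weeks followed by a few sharp breaks.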
When data differs from the mean this sharply, it’s a wise idea to go check your findings again. We did, and we’re confident that we’re seeing something truly unusual happening in news coverage this year. The coverage for the past few weeks has been highly discontinuous, and also highly international. Media Cloud draws maps of media coverage based on mentions of nations. This is an imperfect way to measure coverage of some nations – “US”, “USA” or “America” rarely appears in US media stories about national or regional issues, for instance – but we generally see significantly more mentions of the US than of any other nation. In the past two weeks, both “Libya” and “Japan” have been mentioned more than “US”, a situation that’s nearly unprecedented in our data set.
The pair of weeks where we see the biggest discontinuity in coverage – the lowest cosine similarity – is the week of 2/28-3/6/11 and the week of 3/7-3/13/11. (See the comparison word cloud above.) The catastrophic Japanese earthquake and tsunami occurred on March 11 and radically shifted coverage from protests in North Africa to recovery from the disaster. The next biggest discontinuity is between the week of 2/21-2/27/11 and the preceding week. There the shift may have to do with attention moving from Egypt to the Libyan protests… but it’s harder to trace the shift to a specific event, as we can with the Japanese earthquake.
Personally, I find this discovery – that the agenda’s been shifting more sharply than at other points in our study – to be somewhat reassuring. About six weeks ago, I offered the thought that history appeared to be accelerating in 2011 – there is simply too much happening too quickly for most of us to process and comprehend. The graph above suggests that there may be some validity to that observation. What we’re paying attention to in terms of breaking news is changing much more quickly than it normally does.
It’s worth remembering that cosine similarity is comparing words used in articles, not their meaning. So the shift in focus from popular protests in Egypt to Libya will be very apparent using this metric (less Tahrir and Mubarak, more Qaddafi and Benghazi), though it’s possible to consider the two stories as part of a larger narrative of public protest. On the other hand, a shift in coverage from Obama and Republicans arguing over health care to arguing over deficits might register much less of a difference in cosine similarity, as many of the words involved in the stories are going to be identical even with a change in their subject.
There’s a valid and dismissive takeaway one might draw from these results – we’ve just statistically proven that revolutions, catastrophic natural disasters and potential nuclear meltdowns are major news stories. Duh. But it’s a good sign that this method can detect huge shifts in media attention. Now we just have to see if it can detect much more subtle shifts.
If we are able to determine that more subtle shifts in cosine similarity really do correspond to relevant shifts in media focus, we might have a useful statistical technique to measure the relative tranquility of the overall media environment, a quantitative definition of the “slow news week”. That would be a useful data point for advocates looking to call attention to a cause – don’t launch a new campaign at a moment of turmoil as you’ll get ignored – or for advertisers promoting a new product. And it might be a helpful tool for those of us trying to understand broader dynamics of the media ecosystem. Is there a natural length of attention to major stories like the Japanese earthquakes or protests in Libya? Do we need additional developments to keep a story in the news beyond that initial period of interest? Do journalists begin looking for “the next thing” at regular intervals, or are they reacting to external factors – what actually happens in the world? I don’t know that we’ll discover satisfying answers to these questions, but it’s exciting to have a new tool we can try.
To tease out context, would it maybe be useful to take your top words and then run a cosine similarity between them in your document/periodical space? Using pairwise similarities you could then construct a weighted graph, and kick out any edges that fall below some relevant threshold. Looking at the local properties of the topic space over time rather than the global properties of the document space over time would be revealing (using persistence or centrality measures especially). Really looking forward to the launch of the platform though. Hope the raw data is made available!
Fascinating. Do you guys have any plans to interview news editors from these organizations to see how closely your theories about the reasons for focusing on specific things at specific times mesh with their actual decision-making process?
Interviews, RMack? You mean ethnography? Talking to people instead of looking at numbers? Madness!
(It’s a great idea. I’d like to get slightly more comfortable with the quantitative analysis, then talk to editors to see if we could get them to think about decisionmaking at points of maximum and minimum newscycle churn…)
Yeah.. in my own personal experience, decisionmaking in news orgs is a very human, intuitive, unscientific, and not particularly rational thing. It has a lot to do with personalities of and relationships between individual editors and individual journalists, etc., plus all kinds of things related to management’s understanding of what the audience both needs and wants, on top of budget and resource allocation issues, etc. etc.. So quant is only going to get you so far in understanding why news orgs behave as they do. But what do I know, I’m a journalist who always sucked at math…