Introducing MediaCloud

We’re launching a very cool new project at Berkman today, Media Cloud. Basically, Media Cloud is a platform to help researchers find quantitative answers to questions like:

– What type of stories are covered more heavily in blogs than in newspapers?
– How does coverage of a topic like Iran differ between national newspapers, local newspapers and political blogs?
– How much overlap in coverage do two news sources have? If you’re reading the New York Times and the Boston Globe, how much do the two sources differ in the topics they cover?
– How do news stories move between bloggers and mainstream journalists? How common or infrequent is it that bloggers “break” stories or introduce new analytic frames?

This last question is one that Yochai Benkler and I have had a friendly disagreement about. Yochai has documented the value of peer production in his groundbreaking “The Wealth of Networks” and believes bloggers will be increasingly important in breaking news stories and framing issues. I’m more skeptical and worry that lots of what happens in blogs follows, rather than leads, the news media.

This argument usually involves the two of us marshalling our favorite anecdotes and tossing them back and forth, neither convincing the other. We realized that many of the debates about media – new and old – are centered on anecdotes and not on data, and that there’s too little hard data about what topics are covered in different mediums to answer these questions.

That’s when a project Hal Roberts, Berkman’s brilliant chief geek, and I had been working on metastasized into Media Cloud. Hal and I had been reworking Global Attention Profiles, the first project I’d worked on at Berkman, which mapped attention paid to different nations of the world on different media websites. We realized that by updating the antiquated methods GAP had relied on, and embracing new toolsets like Calais, a content analysis system from Thomson Reuters, we could build a tool useful not just for my research, or the questions Yochai and I were debating, but for a wide range of media researchers.

The site we’re launching today gives a peek at the system we’ve built, which is still in early stages. For six months – thanks to the hard work of Hal, Yochai, Steven Schultze and David Larochelle, as well as amazing support from the Berkman Center – we’ve been collecting data from several hundred US political blogs, from the US’s largest newspapers, a selection of smaller newspapers and some international newspapers and news agencies like the BBC. We subscribe to these sources’ RSS feeds, retrieve the full HTML of every piece of content posted, use a set of algorithms to separate story text from formatting information, and feed the story text into Calais and other classification tools to associate “named entities” and topic tags with each story.
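For readers who want a feel for the first steps of that pipeline, here is a minimal sketch in Python. It is not Media Cloud's actual code: the function and class names (`feed_links`, `TextExtractor`, `story_text`) are invented for illustration, the text extraction is a naive tag-stripper rather than the algorithms described above, and the fetching and Calais tagging steps are left out entirely.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def feed_links(rss_xml):
    """Return the <link> of every <item> in an RSS 2.0 document (hypothetical helper)."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("link") for item in root.iter("item")]


class TextExtractor(HTMLParser):
    """Collect visible text, skipping anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def story_text(html):
    """Naive stand-in for separating story text from formatting."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Made-up sample data; a real run would download the feed and each linked page.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example</title><link>http://example.com/</link>
<item><title>A</title><link>http://example.com/a</link></item>
<item><title>B</title><link>http://example.com/b</link></item>
</channel></rss>"""

SAMPLE_PAGE = ("<html><head><style>p{}</style></head><body>"
               "<p>Iran talks <b>resume</b></p><script>x()</script></body></html>")

links = feed_links(SAMPLE_FEED)
text = story_text(SAMPLE_PAGE)
```

A real system would need something much smarter than tag-stripping to drop navigation, ads and comments, which is presumably why a set of algorithms is needed for that step.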

The result? We can report, with a pretty good degree of certainty, the main topics covered on Fox News in the past week. Or on any of a thousand other news sources. In some ways, this is very similar to what our friends at the PEJ News Coverage Index are doing with their excellent weekly profile of media coverage. But while they’re using a high-accuracy, hand-counting strategy, we’re casting a wider net and using automated tools. We’re also releasing tools that let you dive more deeply into the data – you can see what topics are most closely associated with a term like “Iran” in different media sources, or build maps that visualize what parts of the world different media sources are paying attention to.
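To make "topics most closely associated with a term" a little more concrete, here is one simple way it could be computed, assuming each story has already been given a source name and a list of topic tags. This is a sketch under that assumption, not a description of how the Media Cloud tools work; `associated_tags` and the sample stories are made up.

```python
from collections import Counter, defaultdict


def associated_tags(stories, term, top_n=3):
    """For each source, list the tags that most often appear alongside `term`.

    `stories` is an iterable of (source, tags) pairs; this shape is an assumption.
    """
    term = term.lower()
    by_source = defaultdict(Counter)
    for source, tags in stories:
        tags = {t.lower() for t in tags}
        if term in tags:
            # Count every other tag on a story that mentions the term.
            by_source[source].update(tags - {term})
    return {
        source: [tag for tag, _ in counts.most_common(top_n)]
        for source, counts in by_source.items()
    }


# Invented sample data for illustration only.
stories = [
    ("paper", ["Iran", "Nuclear", "Israel"]),
    ("paper", ["Iran", "Nuclear"]),
    ("paper", ["Sports"]),
    ("blog", ["Iran", "Obama"]),
    ("blog", ["Iran", "Obama", "Israel"]),
]

top = associated_tags(stories, "Iran", top_n=1)
```

Raw co-occurrence counts favor tags that are common everywhere, so a more careful version would probably normalize by how often each tag appears overall.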

Ethan Zuckerman on Media Cloud from Nieman Journalism Lab on Vimeo.

My friend Josh Benton from Nieman Journalism Lab came by the Center yesterday and I gave him a tour of the tools we’ve released. The video of that tour is above, and Josh has a full transcript on the Nieman site. And you can play with almost all of the tools we tour on the Media Cloud site. (It’s a bit slow today – we’re getting slammed with traffic, which is exciting but can be frustrating if you have trouble getting it to load. Sorry about that.)

One of the things that becomes clear in the video is that Media Cloud is more a tool to find new research leads than to answer questions. Josh and I play with a visualization of the term “Iran” through the various sources we track for a month’s interval. The tool shows us when “Iran” is a hot topic for discussion, and how related terms – like Israel, Iraq and Nuclear – fare during that period. We’re also able to see how different media outlets covered Iran during that period, and see what terms occur more frequently in their coverage than in most newspapers or blogs.

Drilling into conservative blog Powerline’s coverage of Iran, we see the terms “pearl” and “kantar” emerge as unusually common in that source, in comparison either to blogs or newspapers. I mention in the video that perhaps a Pearl Kantar writes for Powerline. The truth is far more interesting: Samir Kantar was a member of the Palestine Liberation Front, released as part of an Israel-Hezbollah prisoner swap. Powerline condemns his reception by Iranian president Ahmadinejad and points to an article by Judea Pearl, father of the late Daniel Pearl, expressing concerns about the mainstreaming of terrorism as a resistance tactic.

One of the questions Yochai and I have debated is whether cases of bloggers shaping the news agenda are common or rare. To answer this question, we’d need to track not just cases when bloggers succeed in propagating memes, but those cases where propagation fails. My guess is that looking for terms that are unusually common in specific blogs versus blogs or newspapers as a whole is one way to find these cases – someone at Powerline was hoping that Samir Kantar would become a major discussion point in talking about Obama’s engagement with Iran. It didn’t. Being able to identify these failures as well as successes is a first step towards understanding how ideas do and don’t move between blogs and mainstream media.
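One plausible way to look for terms that are unusually common in a single source is to compare each word's share of that source's text with its share of a wider reference collection. The sketch below does that with a smoothed frequency ratio. The scoring method, the function names and the sample text are all assumptions made for illustration; Media Cloud may well compute this differently.

```python
import re
from collections import Counter


def word_counts(texts):
    """Count lowercase word tokens across a list of story texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


def distinctive_terms(source_texts, corpus_texts, top_n=5, min_count=2):
    """Rank terms by how much more frequent they are in one source than in a reference corpus."""
    src = word_counts(source_texts)
    ref = word_counts(corpus_texts)
    src_total = sum(src.values())
    ref_total = sum(ref.values())
    vocab = len(set(src) | set(ref))
    scores = {}
    for term, n in src.items():
        if n < min_count:
            continue  # ignore words seen too rarely to mean much
        # Add-one smoothing so words absent from the corpus don't divide by zero.
        p_src = (n + 1) / (src_total + vocab)
        p_ref = (ref[term] + 1) / (ref_total + vocab)
        scores[term] = p_src / p_ref
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Invented toy data: one source's text versus a wider reference corpus.
one_blog = ["kantar iran kantar pearl iran pearl"]
everyone = ["iran nuclear talks iran israel iran iraq", "iran election iran nuclear"]

unusual = distinctive_terms(one_blog, everyone, top_n=2)
```

In this toy data “kantar” and “pearl” rank above “iran”, because everyone is writing about Iran but only the one source mentions the other two. Terms that surface this way and never spread to other sources are candidates for the failed propagation cases.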

The best thing about MediaCloud, in my opinion, is that we’re not trying to answer these questions by ourselves. We’re releasing all the code created for MediaCloud under the GPL later this month, and hope to make a dump of the data we’ve collected thus far accessible shortly afterwards. We’re reaching out to academics and researchers around the world to help them build experiments that lean on the MediaCloud data, and we’re planning on making it possible for other folks to build experiments via the API in the near future.

There’s a lot that I enjoy about being based at the Berkman Center, but this project helps illustrate the single most amazing thing about the center, in my opinion – the incredible willingness to try out cool ideas, whether or not we know where they’re going. We’re still looking for funders for the project, but Berkman’s directors have scraped together the funding to get the project off the ground, not knowing whether it would catch the attention of foundations that study journalism, political science or the Internet. It’s pretty rare to have the chance to build big, complex research tools without having a high degree of certainty that someone will be willing to support that research in the long term – projects like Media Cloud are proof positive to me that Berkman is one of the most creative, inventive and imaginative academic environments out there, and it’s an institution I’m thrilled to work with.

11 thoughts on “Introducing MediaCloud”

  2. LK, we’ll absolutely be adding new sources to the project. One of the challenges at this point is adding sources in non-English languages – Calais, which we use to classify stories, only works in a small number of languages. But as they expand the languages they work in, we’re likely to expand in turn. We’d very much like to be able to compare what the French-language press covers versus the English-language, for instance… but that’s probably several months off.

  4. Ethan
    In telecom, gigaom breaks more stories than almost anyone, with only WSJ coming close. The Times picks up my stories with credit regularly, and gigaom is everywhere. We were rapidly treated as peers when we did good reporting. There’s a great deal of informal back and forth between the top print in telecom and those online (Gigaom, Jeff Pulver) that break substantial stories. Techcrunch and Techdirt, with a slightly different focus, also get read and respected. Arrington breaks more startup and silicon valley stories than anyone these days.
    Which may not apply in less technical fields, but is how it’s gone where I work. (I’m an online newsletter rather than a blog, but I think the distinction doesn’t matter here.)
