Media and provenance - Ethan Zuckerman

On Wednesday, June 20th, Matt Smith and Aura Bogado broke a harrowing story about the Shiloh Treatment Center, south of Houston, TX, one of the contractors the Trump administration is using to house migrant children who were separated from their parents. Their report for Reveal, a Center for Investigative Reporting publication, and The Texas Tribune is based on an analysis of federal court filings, which allege that children held at Shiloh have been forcibly subdued with powerful psychiatric drugs. Released at a moment when media attention has been focused on separation of children from their families at the US/Mexico border, the story was widely shared online – as of this morning, Reveal’s tweet about the story had been retweeted 22,000 times.

The story gained attention for reasons other than its harrowing revelations. When Reveal tried to “boost” their post on Facebook, the platform alerted them that they were “Not Authorized for Ads with Political Content”. This is a new safety feature implemented by Facebook in the wake of scrutiny of the company’s role in the 2016 election, when it permitted over 3000 ads to be illegally posted by the Russia-based Internet Research Agency, with the goal of sowing discontent in the US. Facebook is in a tough bind – they need to vet purchasers of political ads far more carefully than they have been, but thus far, their algorithmic review process is flagging some stories as ads, and allowing some ads to pass through unscreened. And Facebook Ads VP, Rob Goldman, didn’t help clarify matters by telling Reveal “…this ad, not the story, was flagged because it contains political content.”

Last night, one of the authors of the Reveal story, Aura Bogado, pointed to another problem she and Matt Smith are experiencing:

I’m an immigrant woman of color. I’m also an investigative reporter for @reveal and the ONLY reporter who has talked to a child who was at Shiloh. We broke this story with the @TexasTribune. As this story goes viral, I’d appreciate it if fellow reporters credited us for our work.

— Aura Bogado (@aurabogado) June 21, 2018

One of the long-standing patterns of the news industry is the tendency to copy reporting someone has already done. In the days when most people subscribed to a single newspaper, this copying served a helpful civic function – it helped spread news to multiple audiences, helping citizens have a common basis of news to inform democratic participation. A very clear journalistic ethic emerged around this practice: you prominently credit the publication that broke the story. You’ll see even fierce competitors, like the New York Times and the Washington Post, do this with their biggest scoops.

The internet has changed these dynamics. On the one hand, there’s no longer any civic need to copy stories – you could simply link to them instead. But there’s also a powerful financial incentive to make any story your own – the ad clicks. This story, written by Andrew Hay and bylined “Reuters staff”, shows how easily original reporters and outlets can disappear – it contains original reporting, in that it has a novel quote from Carlos Holguin, a lawyer for the Center for Human Rights & Constitutional Law, who’s cited in the Reveal piece… but it doesn’t mention Smith and Bogado, the Texas Tribune or Reveal. (Reuters is not the only outlet that’s scrubbed provenance from this story. But they are a publicly traded company with 45,000 employees and $11 billion in annual revenue, and they have been in the news industry since 1851. They should know better.)

This is not only a shitty thing to do, it’s a profitable thing to do. Reuters gets the ad views from the story they largely rewrote, while the two non-profits responsible for the original reporting get nothing, not even credit.

I’ve been thinking about this problem for some time, because tracing the origins of important news stories is one of the main uses for Media Cloud, the system we’ve been developing for almost a decade at the Center for Civic Media and the Berkman Klein Center. One of our first publications, “The Battle for Trayvon Martin: Mapping a Media Controversy online and offline”, is at its heart a provenance paper, trying to understand who first reported on Trayvon’s death as a way of understanding how the story turned into a national conversation on race and violence. (TL;DR: Trayvon’s family worked with civil rights attorney Benjamin Crump to pitch the story to Reuters and CBS This Morning. It was well over a week before the internet began amplifying the story with petitions and protests.) Rob Faris and Yochai Benkler’s massive Media Cloud analysis of the 2016 US Presidential election focuses on provenance, tracing influential stories in mainstream media publications to their origins in the fringes of the right-wing blogosphere that surround Breitbart, Gateway Pundit and others.

Media Cloud works by ingesting (usually via RSS, sometimes via scraping) all the stories from tens of thousands of media publications, multiple times a day. We can often trace the provenance of a story by identifying an appropriate search string – “Shiloh” AND (migrant* OR drug*) might work in this case – and looking to see which stories hit our database first. Sometimes a story breaks in several places simultaneously – that’s usually an indicator that it was written in reaction to a statement made by a public official or a corporate leader, not the result of long investigative reporting. This process is imperfect and requires the input of knowledgeable humans to create search strings. What if we could automate it?
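To make the manual process concrete, here is a minimal sketch in Python. It is not Media Cloud's code or API: the `Story` record, the `matches` and `earliest_matches` helpers, the one-hour window and the sample outlets are all hypothetical, and the search string is hand-translated into regular expressions rather than parsed. It filters a pile of collected stories by the query, sorts the hits by the time they were first seen, and returns every hit that landed close to the earliest one. One result suggests a single origin; several suggest a simultaneous break.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Story:
    """Hypothetical record for one collected story (not a Media Cloud type)."""
    outlet: str
    title: str
    text: str
    collected_at: datetime  # when the crawler first saw the story


def matches(story: Story) -> bool:
    """Hand-written equivalent of: "Shiloh" AND (migrant* OR drug*)."""
    body = f"{story.title} {story.text}".lower()
    has_shiloh = re.search(r"\bshiloh\b", body) is not None
    has_topic = re.search(r"\b(migrant|drug)\w*", body) is not None
    return has_shiloh and has_topic


def earliest_matches(stories, window=timedelta(hours=1)):
    """Return matching stories first seen within `window` of the earliest hit."""
    hits = sorted((s for s in stories if matches(s)), key=lambda s: s.collected_at)
    if not hits:
        return []
    first_seen = hits[0].collected_at
    return [s for s in hits if s.collected_at - first_seen <= window]


# Made-up sample data, for illustration only.
stories = [
    Story("Outlet A", "Migrant children drugged at Shiloh, filings allege", "",
          datetime(2018, 6, 20, 9, 0)),
    Story("Outlet B", "Report: drugs used on children at Shiloh", "",
          datetime(2018, 6, 20, 15, 30)),
    Story("Outlet C", "Shiloh church bake sale this weekend", "",
          datetime(2018, 6, 19, 8, 0)),
]

for s in earliest_matches(stories):
    print(s.outlet, s.collected_at.isoformat())
```

The bake sale story shows why the search string needs a knowledgeable human: “Shiloh” alone would have put the wrong outlet first.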

We’re working on this problem, looking to create automatic signatures that identify clusters of related stories. Duncan Watts is working on it at MSR as well, generating “fingerprints” for these clusters that rely in part on named entities. And obviously Google has a clustering system working that they use to organize related stories in Google News. With automated signatures and clustering, combined with a deep database of stories collected many times a day, we might be able to identify the initial stream that leads to a later media cascade.
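One very simple way to picture a signature built on named entities is sketched below in Python. This is an illustration under loud assumptions, not our method, not Watts’s fingerprints and not Google’s clustering: it assumes the entities have already been extracted by some other tool, uses plain Jaccard overlap between entity sets, and picks an arbitrary threshold of 0.5. The `jaccard` and `cluster_by_entities` names and the sample stories are hypothetical. Because stories are processed in time order, the first member of each cluster is the earliest story we saw for it.

```python
from datetime import datetime


def jaccard(a: frozenset, b: frozenset) -> float:
    """Share of entities two stories have in common (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_entities(stories, threshold=0.5):
    """Greedy single-pass clustering over stories sorted by first-seen time.

    Each story is an (outlet, first_seen, entities) tuple. A story joins the
    first cluster whose seed fingerprint is similar enough; otherwise it
    starts a new cluster and becomes that cluster's seed.
    """
    clusters = []
    for story in sorted(stories, key=lambda s: s[1]):
        fingerprint = frozenset(story[2])
        for cluster in clusters:
            if jaccard(fingerprint, cluster["fingerprint"]) >= threshold:
                cluster["members"].append(story)
                break
        else:
            clusters.append({"fingerprint": fingerprint, "members": [story]})
    return clusters


# Made-up sample data; entity sets are assumed to come from an external extractor.
stories = [
    ("Outlet B", datetime(2018, 6, 20, 15, 30),
     {"Shiloh Treatment Center", "Carlos Holguin", "Texas"}),
    ("Outlet A", datetime(2018, 6, 20, 9, 0),
     {"Shiloh Treatment Center", "Carlos Holguin", "Texas", "Houston"}),
    ("Outlet C", datetime(2018, 6, 20, 10, 0),
     {"Larry Nassar", "USA Gymnastics"}),
]

for cluster in cluster_by_entities(stories):
    origin = cluster["members"][0]
    print(origin[0], "first, followed by", [m[0] for m in cluster["members"][1:]])
```

Comparing only against the seed story keeps the sketch short, but it also means a story that drifts as it is rewritten may fall out of its cluster; a real system would need something sturdier.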

Attention in US mainstream media to “Larry Nassar” from January 2017 to present, via mediacloud.org

What then? Well, that would depend on what media platforms did with this data. Consider a major, ongoing story like Dr. Larry Nassar’s abuse of US gymnasts. That horrific story was uncovered by the Indy Star, which began a massive investigative series on sexual abuse within US gymnastics in August 2016, months before Nassar’s name became a household word. When platforms that aggregate, distribute and monetize news – Apple, Google, Facebook – share revenues with publishers, maybe they should check against a provenance service to find out whether they’re rewarding someone who did original journalism, or someone who’s simply chasing clicks. Perhaps one or more platforms would end up sharing revenues between the publisher that captured the clicks and the one that initially sponsored the investigation.

Could this ever really happen? Yes, but it would require not only working technology, but also pressure from readers for ethically sourced journalism. It took a great deal of work for consumers to demand that their coffee be sustainably grown and that Apple look into whether suppliers are using child labor. What Bogado and her colleagues are asking for is good for anyone who cares about the long-term future of journalism. We need more resources to investigate stories like the abuse of children at the hands of the US government. We don’t need hundreds of news outlets rushing to cover the same stories. Establishing – and rewarding – provenance of stories that start with investigative journalism could help shift the playing field for original reporting.