The NSA documents Edward Snowden leaked have sparked a debate within the US about surveillance. While Americans understood that the US government was likely intercepting telephone and social media data from terrorism suspects, it’s been an uncomfortable discovery that the US collected massive sets of email and telephone data from Americans and non-Americans who aren’t suspected of any crimes. These revelations add context to other discoveries of surveillance in post-9/11 America, including the Mail Isolation Control and Tracking program, which scans the outside of all paper mail sent in the US and stores the scans for later analysis. (The Smoking Gun reported on the program early last month – I hadn’t heard of it until the Times report today.)
The Obama administration and its supporters have responded to criticism of these programs by assuring Americans that the information collected is “metadata”: information on who is talking to whom, not the substance of conversations. As Senator Dianne Feinstein put it, “This is just metadata. There is no content involved.” By analyzing the metadata, officials claim, they can identify potential suspects, then seek judicial permission to access the content directly. Nothing to worry about. You’re not being spied on by your government – they’re just monitoring the metadata.
Of course, that’s a naïve and oversimplified view of metadata, which turns out to be a surprisingly rich source of information on who people are, who they know and what they do. Congress has historically recognized that metadata is important and deserves protection. While the Supreme Court ruled in Smith v. Maryland that people should not expect the phone numbers they dial to be private, since those numbers are exposed to the phone company, Congress put restrictions on the use of “pen registers”, devices that track the calls made and received by a phone, requiring law enforcement to go to court to institute such tracking. The same logic in Smith v. Maryland applies to the Mail Isolation Control and Tracking program – since information on envelopes is visible to the public, or at least to mail carriers, it’s monitorable and storable, even without “mail covers”, US Postal Service administrative orders used to trace mail coming to criminal suspects. And, perhaps, the policymakers who approved NSA’s surveillance projects would argue that the logic applies to email headers as well.
Put aside for the moment the question of whether monitoring metadata is reading public information or is more analogous to a pen register. There’s a scale issue that comes into play here. One major constraint on pen registers and mail covers historically has been the sheer amount of data they generate. Potential overreach by law enforcement is held in check by two factors – the need to get court or administrative approval to trace metadata, and the ability to process said metadata. As a result, USPS insiders report that it processes about 15,000 – 20,000 mail covers a year related to crime, and as security researcher Chris Soghoian discovered, internet and telecommunications companies charge law enforcement agencies for pen registers, putting some practical limits on their use.
But the NSA surveillance of email and phone networks and the Mail Isolation Control and Tracking program have no such limits. While it’s likely quite expensive to scan all US mail, once you’ve committed to doing so, it’s comparatively cheap to store that information and analyze it at later dates, as investigators evidently did to arrest Shannon Richardson for sending ricin to President Obama and New York City Mayor Bloomberg. And, since the costs of NSA surveillance are evidently borne primarily by internet and telephony companies, it’s downright cheap to keep metadata on email and phone calls. All the postal mail, email and phone calls.
It’s also much, much cheaper to analyze this data than in years past. The current frenzy for “big data” and “data science” has called attention to techniques that allow analysts to pull subtle patterns out of data. A New York Times story suggesting that retailer Target was able to identify pregnant customers based on their purchasing behavior (unscented lotion!) and target ad flyers to them gives a sense of the commercial applications of these techniques.
Sociologist Kieran Healy shows another set of applications of these techniques, using a much smaller, historical data set. He looks at a small number of 18th century colonists and the societies in Boston they were members of to identify Paul Revere as a key bridge tie between different organizations. In Healy’s brilliant piece, he writes in the voice of a junior analyst reporting his findings to superiors in the British government, and suggests that his superiors consider investigating Revere as a traitor. He closes with this winning line: “…if a mere scribe such as I — one who knows nearly nothing — can use the very simplest of these methods to pick the name of a traitor like Paul Revere from those of two hundred and fifty four other men, using nothing but a list of memberships and a portable calculating engine, then just think what weapons we might wield in the defense of liberty one or two centuries from now.”
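To make the idea of a “bridge tie” concrete, here is a minimal sketch of one common way to find one from nothing but membership lists. It is not a reproduction of Healy’s analysis or his data: the group names, people and the choice of betweenness centrality (computed here with Brandes’ algorithm) are assumptions made for illustration.

```python
from collections import defaultdict, deque
from itertools import combinations

# Hypothetical membership lists (group -> members), invented for illustration.
memberships = {
    "GroupA": ["Ann", "Bob", "Cat"],
    "GroupB": ["Cat", "Dan", "Eve"],
    "GroupC": ["Cat", "Fay", "Gus"],
}


def co_membership_graph(memberships):
    """Link two people whenever they belong to at least one common group."""
    graph = defaultdict(set)
    for members in memberships.values():
        for a, b in combinations(sorted(set(members)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def betweenness(graph):
    """Betweenness centrality for an unweighted, undirected graph (Brandes)."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        order = []                        # nodes in order of distance from source
        preds = {v: [] for v in graph}    # predecessors on shortest paths
        paths = {v: 0 for v in graph}     # number of shortest paths from source
        dist = {v: -1 for v in graph}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}


scores = betweenness(co_membership_graph(memberships))
for person, value in sorted(scores.items(), key=lambda item: -item[1]):
    print(person, value)
```

In this toy data the person who belongs to all three groups sits on every shortest path between members of different groups, so they come out on top. No one’s conversations were read to learn that.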
If you are a member of a secret organization planning the overthrow of the government, you’ve probably already thought hard about what your metadata might reveal. But if you’re an average citizen with “nothing to hide”, it may be less obvious why your metadata may not be something you are comfortable sharing. After all, Frank Rich recently proclaimed that “privacy jumped the shark in America long ago” and that we are all members of “the America that prefers to be out there, prizing networking, exhibitionism, and fame more than privacy, introspection, and solitude.” Lured by reality television and social networks, we all want to be watched and have therefore given up our distaste for surveillance.
I think it’s possible to be both a heavy user of social media and concerned about the security of your metadata. It simply requires understanding that, for many of us, social media is a performance. When I share links on Twitter, I’m aware that I’m constructing an image for my followers as someone who’s interested in certain topics and uninterested in others. I don’t share every article that I read, both because I suspect not all are interesting to my followers and also because I don’t really want my professional community to know just how much mental energy I spend worrying about who the Green Bay Packers will field at running back in the coming season.
This may not be how you use social media, but it probably should be. As danah boyd and others have pointed out, youth have had to figure out how to navigate a world in which their interpersonal and social interactions are archived, searchable and persist long enough to present a problem in adulthood – as a result, they’re continually engaged in “identity performance”, as well as in developing codes and other ways to speak on social networks to defy monitoring.
By contrast, most of us aren’t maintaining a persistent, public performance when we’re using telephones or email. (For an example of what this might feel like, consider this story from This American Life, where lawyers who work with Guantanamo detainees talk about how having the US government monitor their personal phone calls changes their behavior.) Our metadata can reveal things we may not want to share with others, or may not know ourselves.
As it happens, I have a pretty good sense of what my email metadata might tell an investigator. This fall, I co-taught a class with Cesar Hidalgo, Catherine Havasi and Sep Kamvar at the Media Lab titled “Big Data”. Two of the students who took the class, Daniel Smilkov and Deepak Jagdish, worked on a project called Immersion, which uses Gmail metadata to map someone’s social network. I’m one of about 500 alpha testers of the software, developed by Cesar, Daniel and Deepak, and have been one of the poster boys for the project as it’s been on display at the Media Lab, as I’ve got the largest network of Gmail contacts of anyone who’s used the system. (This isn’t because I’m especially popular, I suspect. Most of my MIT colleagues use mit.edu addresses. As someone new to MIT, who maintains a number of different affiliations, I have been a heavy Gmail user.)
The largest node in the graph, the person I exchange the most email with, is my wife, Rachel. I find this reassuring, but Daniel and Deepak have told me that people’s romantic partners are rarely their largest node. Because I travel a lot, Rachel and I have a heavily email-dependent relationship, but many people’s romantic relationships are conducted mostly face to face and don’t show up clearly in metadata. But the prominence of Rachel in the graph is, for me, a reminder that one of the reasons we might be concerned about metadata is that it shows strong relationships, whether those relationships are widely known or are secret.
The other large nodes on the graph are associated with specific clusters. Rebecca is my co-founder at Global Voices and Ivan and Georgia run the organization day-to-day – they dominate the green cluster, which includes key people in that organization. Hal is my chief collaborator at the Berkman Center, and Colin is my boss – they dominate the orange cluster, which includes fellow Berkman folks as well as a number of prominent internet law and policy folks who work closely with the Center. Lorrie is assistant director at the Center for Civic Media and is the person I work with most closely at MIT – the red cluster represents the people I work with at the Media Lab.
Anyone who knows me reasonably well could have guessed at the existence of these ties. But there’s other information in the graph that’s more complicated and potentially more sensitive. My primary Media Lab collaborators are my students and staff – Cesar is the only Media Lab node who’s not affiliated with Civic who shows up on my network, which suggests that I’m collaborating less with my Media Lab colleagues than I might hope to be. One might read into my relationships with the students I advise based on the email volume I exchange with them – I’d suggest that the patterns have something to do with our preferred channels of communication, but it certainly shows who’s demanding and receiving attention via email. In other words, absence from a social network map is at least as revealing as presence on it.
Another sensitive piece of information comes from how Immersion draws and codes clusters. Immersion’s algorithm is sensitive to who you include on the same email. Global Voices emails include Ivan, Georgia, Rebecca and others – people who I email when I email those three get placed in the same cluster. People who exist as bridges between clusters are particularly interesting, as they are people who appear in multiple roles in your social network. Joi Ito appears on my graph twice (as “Joi” and “Joichi”) because he uses multiple email addresses, but in either role, he’s a bridge between my MIT existence, my Global Voices existence and my Berkman life, which reflects my long and multi-faceted relationship with him. But he’s colored red, as a Media Lab person, whereas other bridge figures like danah boyd show up as blue, as they have close relationships with Rachel as well. In other words, I have important, long-standing, multifaceted relationships with both danah and Joi, but danah is part of my family life as well, while Joi is not.
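For a rough sense of how little data this kind of map needs, here is a small sketch that works only from who appears on each message. It is not Immersion’s algorithm, which I have not reproduced here; the contact names and messages are invented, and treating a “bridge” as a contact whose removal splits a group apart is an assumption made to keep the example short.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical header data: for each message, the other addresses on it
# (sender and recipients, minus the mailbox owner). No subjects, no bodies.
messages = [
    ["spouse"], ["spouse"], ["spouse"], ["spouse"],
    ["org_a", "org_b", "org_c"],
    ["org_a", "org_b"],
    ["law_a", "law_b"],
    ["law_a", "law_b", "connector"],
    ["org_a", "org_b", "connector"],
]


def build_graph(messages):
    """Count messages per contact and link contacts who share a message."""
    volume = Counter()
    edges = defaultdict(set)
    for people in messages:
        people = sorted(set(people))
        volume.update(people)
        for person in people:
            edges.setdefault(person, set())  # keep solo contacts in the graph
        for a, b in combinations(people, 2):
            edges[a].add(b)
            edges[b].add(a)
    return volume, edges


def components(edges, skip=None):
    """Connected groups of contacts, optionally ignoring one contact."""
    seen, groups = set(), []
    for start in edges:
        if start == skip or start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node == skip or node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(edges[node])
        groups.append(group)
    return groups


def bridge_contacts(edges):
    """Contacts whose removal splits a group into more pieces."""
    baseline = len(components(edges))
    return sorted(p for p in edges if len(components(edges, skip=p)) > baseline)


volume, edges = build_graph(messages)
print("largest node:", volume.most_common(1)[0])
print("bridges:", bridge_contacts(edges))
```

Even this crude version surfaces the two things discussed above: the contact with the heaviest traffic, and the one person who ties otherwise separate circles together.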
My point here isn’t to elucidate all the peculiarities of my social network (indeed, analyzing these diagrams is a bit like analyzing your dreams – fascinating to you, but off-putting to everyone else). It’s to make the case that this metadata paints a very revealing portrait of oneself. And while there’s currently a waiting list to use Immersion, this is data that’s accessible to NSA analysts and to the marketing teams at Google. That makes me uncomfortable, and it makes me want to have a public conversation about what’s okay and what’s not okay to track.
While popular outcry over revelations about the NSA has been somewhat muted so far, it’s possible that widespread protests planned for July 4th will spark more dialog about what represents unconstitutional surveillance. Here’s hoping that conversation will take a close look at metadata and ask hard questions about whether or not this is information we are willing to share with governments and corporations, or whether we need to regulate and limit this power to monitor as we’ve historically done in the United States. Restore the Fourth.
For another example of what metadata may reveal, see Malte Spitz’s phone records. As I discuss in “Rewire”, Spitz sued his mobile phone provider to obtain his records, then worked with Zeit Online to build a visualization of his movements based purely on that set of data.