Cynthia Dwork is a theoretical computer scientist who is known for applying the theory of her field to problems that seem like they might not be best solved by computer science. One of those problems is privacy, the topic of her lunch talk at Berkman titled, “I’m in the Database, But Nobody Knows”.
When we think about loss of privacy, we often think about threats like data theft, phishing, virii as well as the changing privacy policies of sites like Facebook. There’s another risk – the danger of data leakage through data analysis.
We often want to analyze large data sets without knowing the identity of individuals represented in those sets. Dwork distinguishes between first-tier and second-tier examples. First tier examples deal with potentially personally identifying data (census data, medical outcomes data, epidemiological data. One apparently innocuous data set is vehicle braking data. It’s used in aggregate to design brake systems – used specifically, an insurance company might use it to identify a specific dangerous driver. Second-tier data involves analyzing preferences in aggregate, training an advertising classifier (to determine which people respond to which ads), or recommendation systems like Amazon or Netflix. Data leakage is less clear in these systems, but possible.
Dwork describes the issues with making data accessible to analysis without allowing release of information about a specific individual as a “very pure privacy problem.” In other words, it’s difficult to solve, even if a curator sits between an analyst and the database, even if the curator’s behavior is unimpeachable and the data is secured in a vault.
We have a database that includes sensitive medical information, including whether people are susceptible to certain genetically-linked diseases. We want to allow queries on this database, but we don’t want anyone to be able to discover the attributes of a specific individual.
Dwork proposes that we query for the susceptibilities associated with female distinguished scientists who work at Microsoft and have very curly hair… i.e., a set of one. If we don’t prohibit these queries, we discover she’s susceptible to Sickle Cell. So we put in place a policy that won’t return very small sets of information, i.e., information on individuals.
The attacker might respond by asking how many Microsoft employees have the sickle cell trait, then asking how many Microsoft employees who are not female distinguished scientists with very curly hair who have the trait. Take the difference between those queries and you’ve got the answer to whether Dwork has the trait – limiting to large set queries alone doesn’t protect privacy.
Another possible solution involves adding random noise to the answer the database gives, at a level low enought that it shouldn’t distort results, but it should make it possible to pull individuals out of the set. The problem here is that we can average responses to repeated queries and obtain a result that converges to true answers. She argues that it’s computationally impossible to detect repetition – the problem is undecidable – and therefore we can’t simply limit queries to defeat this method. (Salil Vadhan walked me through this after the talk – the difficulty is that, with a sufficiently rich query language, it’s possible to encode the same query many different ways, and it’s not possible to reduce all those queries to an identical query and prevent it from being asked iteratively.)
Finally, there are cases where having the curator of a database decide not to answer a question discloses information.
Because it’s so difficult to build a database that’s analyzable and doesn’t release an individual’s data, there’s been a litany of cases where data has been revealed:
– In 2006, a journalist identified an individual in Georgia based on analyzing “anonymized” AOL search queries
– In 1998, a graduate student discovered Massachusetts governor William Weld’s medical record by linking voter registration data and medical records – the overlap of zipcode, birthdate and sex allowed private information to be discovered via this linkage attack
– In 2006, Netflix stopped their second recommendation challenge because a researcher was able to identify an anonymous participant by linking recommendations to behavior on IMDB.
– In 2008, Nils Homer and colleagues demonstrated that you can use certain statistics about single nucleotide polymorphisms (SNPs) to tell whether someone has participated in a genome worldwide associated study. Participation in those studies usually indicates diagnosis with a disease, and so could be very sensitive information.
– In 2007, Lars Backstrom (working with Dwork) published a theoretical paper on social networks, suggesting that in an anonymized social network, 12 participants in the network could work together to determine whether any two people in the network were friends.
Are these failures of privacy? Yes, but they’re also failures of definition. Definitions of privacy tend to fail to cope with auxiliary information – database privacy policies weren’t intended to deal with information in other, future databases. And the definitions, in general, were “syntactic and ad-hoc”, and “don’t capture a robust, global notion of privacy.”
In 1977, statistician Tor Dalenius suggested an “ad omnia” guarantee of database privacy: “Anything that can be learned about a respondent from the statistical database can be learned without access to the database.” In other words, it’s possible to get personal data from a database, but only if the person is an extrovert and puts their information on the web. This is a great definition… unfortunately, it’s unachievable.
To explain this unachievability, Dwork offers a parable, which she discloses is an “Admittedly Unreasonable Impossibility Proof.” Assume Pamela Jones of Groklaw doesn’t want her height disclosed. However, she’s a participant in a database that reveals the average heights of population subgroups. Someone discloses the information that Paula is two inches shorter than the average Swedish woman. (This is the unreasonable part of this proof, as it’s unclear where this information comes from.) If you have access to the database, you can trivially learn her height. This, Dwork says, shows a hole in Dalenius’s definition – we can learn sensitive information not by querying about her in the database, but by issuing another type of query. Dwork argues that Jones loses privacy whether or not she is in the database – the leakage comes from the stated and desired utility of the database.
Dwork offers a different definition for database privacy – differential privacy. The outcome of any analysis is, essentially, independent of whether an individual joins or refrains from join in the dataset. We can define differential privacy by calculating the probability of release occurring if an individual is in the dataset versus someone not in the dataset. Epsilon is the difference between those privacy functions, so a small epsilon is desirable.
The useful property of this definition is that it neutralizes linkage attacks. We’re no longer concerned with what the adversary can bring in from the outside. Differential privacy gives us resilience to all auxiliary information.
This, she tells us, is achievable. There are low-error, high privacy differential privacy techniques that exist for many problems in
datamining. We can reveal data that supports analysis through association rules, decision trees, clustering, contingency tables, histograms, synthetic data sets, machine learning and recommendation systems in ways that are differential privacy compliant. Indeed, there are even programming platforms designed to allow us to analyze our data sets and protect ourselves from disclosure using a differential privacy model.
“Never store the data in the first place,” she argues. That way you can’t respond to subpoenas. And there’s a set of algorithms she’s working on – pan private streaming algorithms – that store information on data streams in a way that inherently keeps the patterns of appearance of any individual undetectable, protecting against mission creep, subpoena, and intrusion.
There are downsides to differential privacy – the techniques tend to hide outliers within a data set, so sets make it difficult to study outliers. Privacy erodes over multiple analyses – cumulatively – and also over the inclusion of an individual into multiple databases.
The question Dwork and colleagues are currently worried about is the size of epsilon… which is to say, how much more dangerous can it be to be in the database rather than not in the database? To answer this question, she and colleagues are modeling a “lifetime of exposure against worst-cases” to offer a formal definition of that a “reasonable” value might be.
To ensure that a differential privacy solution is useful, Dwork needs to work closely with social scientists to understand what questions we might be asking about social networks. We know that removing individuals from social network graphs can sharply change the properties we’re trying to understand in these networks – which implies that we really need solutions that allow us to study all members in a network without revealing private information. If we know more about what social scientists want to know about networks, perhaps that will shape the algorithms we build.
There’s also a way of analyzing differential privacy that considers the harms that can come from the exposure of data. A database that shows us that smoking causes cancer can cause harm to an individual located in the database – smoker S in the database might be harmed with higher insurance premiums. But, Dwork argues, since learning that smoking causes cancer is the point, the database might lead smoker S to enroll in a smoking cessation program. Who gets to decide whether that tradeoff of harms and benefits is appropriate in first and second-tier cases of data exposure?
The benefits of defining differential privacy, Dwork tells us, are that we’ve got an ad omnia definition of privacy that’s achievable, and frequently achievable in a way that gives us highly accurate data usage. You can program using these techniques using a privacy-preserving interface. This might present a path forward for addressing real-world problems involving privacy in behavioral targeting, as well as in retrieving data from a database.
Dwork offers examples of possible attacks on privacy via behavioral targeting. She asks us to imagine an ad targeted to a strange fetish – she offers women with their legs in casts as an example. If you click on the ad, you’re communicating your interest in women’s legs in casts. But “what if the ad is targeted not just to cast enthusiasts but to wealthy, whale loving long leg cast enthusiasts?” The user knows the ad was targeted to characteristic 1, but not the second characteristic – it may turn out that the set of wealthy, whale-loving cast fetishists is small enough that it’s a single person, and that clicking on the ad reveals your identity.
This might be worth worrying about even if you’re not a whale-hugging cast fetishist. It raises the question of whether ads should be targetable in terms of race, for example. Should poor children see different ads that rich children, given that we believe ads have a role in forming aspirations. The Wall Street Journal reported in August that Capital One financial used tracking information from company [x+1] to steer minority customers to loans that carried higher rates. In principle, Dwork tells us, the law seems to allow credit companies to target products based on a customer’s browsing history… but what if that history encodes race? Does this fall into the realm of unethical or illegal targeting of products based on race?
This was one of the more challenging talks to blog of the presentations I’ve heard at Berkman. It’s very clear that Dwork is doing important work in the field and that the questions she’s tackling are ones that are fascinating to the lawyers, librarians, sociologists and whatever-I-am that fill our conference room. But there’s a huge gap between the mathematic and algorithmic work she’s doing and the issues we generally talk about within a Berkman setting. It’s a good challenge, and an excuse for me to lean on my colleagues from CRCS to understand the issues at play. Apologies for stupid errors I’ve made in attempting to blog this talk – Dr. Dwork’s work is worlds away from what I work on and I may be misunderstanding some of her points.