
Differential privacy

January 3, 2012

Do you know what’s awesome about writing a blog? Sometimes you’re left with a technical question, and someone with the technical answer comes right along and comments. It’s like magic.

That’s what happened to me when I wrote a post about privacy vs. openness and suggested that the world needs more people to think about anonymizing data. Along comes Aaron Roth who explains to me that the world is already doing that. Cool!

He sent me a few links to survey articles (here and here) on a concept called differential privacy. The truth is, though, I got confused and ended up just reading the Wikipedia entry on it anyway.

The setup is that there is some data, stored in a database, and there’s some “release mechanism” that allows outside users to ask questions about the data; this is called querying for a statistic. Each row of the data is assumed to be associated with a person, so it could contain, for example, their scores on some medical test, as well as other private information that identifies them.

The basic question is, how can we set up the mechanism so that the users can get as much useful information as possible while never exposing an individual’s information?

Actually the exact condition posed is even a bit more nuanced: how can we set up the mechanism so that any individual, whether they are in the database or not, is indifferent to being taken out or added?

This is a kind of information theory question, and it’s tricky. First they define a metric of information loss or gain when you take out exactly one person from the database: how much do the resulting statistics change? Do they change enough for the outside user to infer (with confidence) the private information in that lost record? If so, not good.

For example, suppose the user queries for the mean test score of a population with and without a given record (call the associated person the “flip-flopper” (my term)), gets the exact answer both times, and knows how many people were in the population besides the flip-flopper. Then the user can figure out the exact test score of the flip-flopper.
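
To make that concrete, here’s a tiny sketch in Python, with made-up numbers, of how exact answers to the “with” and “without” queries give the flip-flopper’s score away:

```python
# A made-up population of test scores; the last entry belongs to the flip-flopper.
scores = [71, 88, 64, 93, 77, 85]

n_without = len(scores) - 1                  # population size without the flip-flopper
mean_with = sum(scores) / len(scores)        # exact answer to "mean with the flip-flopper"
mean_without = sum(scores[:-1]) / n_without  # exact answer to "mean without the flip-flopper"

# Knowing both exact means and the population size, the attacker recovers the score:
recovered = mean_with * (n_without + 1) - mean_without * n_without
print(recovered)  # 85, the flip-flopper's exact test score (up to floating-point error)
```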

One way of making this harder for the user, which is a bad one for a reason I’ll explain shortly, is to independently “add noise” to a given statistic after computing it. Then the answers aren’t exact in either case, so you’d have low confidence in your resulting guess at the test score of the flip-flopper, assuming of course that the population is large and their test score isn’t a huge outlier.
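
Here’s a minimal sketch of that “independently add noise” idea. The Laplace distribution is the usual choice in this literature, but the noise scale here is made up rather than calibrated to anything:

```python
import random

def laplace_noise(scale):
    # Laplace(0, scale) noise, built as the difference of two independent exponentials.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def noisy_mean(scores, noise_scale):
    # Compute the true mean, then perturb it with fresh, independent noise.
    return sum(scores) / len(scores) + laplace_noise(noise_scale)

scores = [71, 88, 64, 93, 77, 85]
print(noisy_mean(scores, 1.0))  # a different, slightly-off answer every time
print(noisy_mean(scores, 1.0))
```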

But this is a crappy solution for the purpose I originally wanted an anonymization method for, namely to set up an open source data center that allows outside users (i.e. you) to query to their hearts’ content. A given user could simply query the same thing over and over, and after a million queries (depending on how much noise we’ve added) would get a very good approximation of the actual answer (i.e. the actual average test score). The added noise would cancel out.
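
Continuing the sketch above (the functions are repeated so it runs on its own), here’s that averaging attack:

```python
import random

def laplace_noise(scale):
    # Laplace(0, scale) noise, as in the previous sketch.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def noisy_mean(scores, noise_scale):
    # Fresh, independent noise is drawn on every query; that's the flaw.
    return sum(scores) / len(scores) + laplace_noise(noise_scale)

scores = [71, 88, 64, 93, 77, 85]

# Ask the same question a million times and average the noisy answers.
n_queries = 1_000_000
estimate = sum(noisy_mean(scores, 1.0) for _ in range(n_queries)) / n_queries

print(estimate)                   # very close to the true mean...
print(sum(scores) / len(scores))  # ...which the noise was supposed to protect
```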

So instead, you’d have to add noise in a different way, i.e. not independently to each statistic. Another possibility is to add noise to the original data, but that doesn’t feel as good to me, especially for the privacy of outliers, unless the noise is really noisy. But then again maybe that’s exactly what you do, so that any individual’s numbers are obfuscated but on a large enough scale you still have a good sense of the statistics. I’ll have to learn more about it.
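
For the “noise the original data” version, here’s a toy sketch of what that might look like; the noise scale is again made up, and this is just an illustration of the idea, not a method from the surveys:

```python
import random

def laplace_noise(scale):
    # Laplace(0, scale) noise as the difference of two independent exponentials.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

# A made-up population of 10,000 test scores.
scores = [random.gauss(75, 10) for _ in range(10_000)]

# Perturb each record once, up front, and publish only the noisy copy.
noisy_scores = [s + laplace_noise(20.0) for s in scores]

# Any single noisy score is a poor guide to the real one, but the noise has mean
# zero, so population-level statistics like the mean mostly survive.
print(sum(scores) / len(scores))              # true mean, around 75
print(sum(noisy_scores) / len(noisy_scores))  # noisy mean, still around 75
```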

Aaron offered another possibility which I haven’t really understood yet, namely for the mechanism to be stateful. In fact I’m not sure what that means, but it seems to have something to do with it being aware of other databases. I’ve still got lots more to learn, obviously, which is exactly how I like to feel about interesting things.

  1. Stephen
    January 3, 2012 at 11:27 am

    I’ve never seen an example (that scientists actually probed) where differential privacy worked effectively. So, you really could break new ground.


  2. Aaron
    January 3, 2012 at 1:16 pm

    Hi Cathy,

    One thing you can do if you want to have a safe database you can release to the public/query arbitrarily many times is to produce “synthetic data” privately. As you say, one way to do that is to just add noise directly to the database, but this might end up destroying a lot of the information present. But that’s not the only thing you can do.

    Another approach that tends to have much better results is to “learn” a good model for the database with respect to some class of statistics, and then release the model. It turns out that for many kinds of statistics, you can learn such a model by only making a small number of noisy queries to the original database (and because the number of queries is small, the noise can be small). For example, suppose your database entries have d binary attributes, and you are interested in the set of all 2^d marginals on those attributes (i.e. estimates of the form “What fraction of entries satisfy attributes 1, 15, and 23?”). It turns out you can learn a model which agrees on all of these 2^d marginals up to a maximum error E by making only (d/E)^2 queries to the database. You can get the answers to these queries with high accuracy while preserving privacy, so long as your database has >> d/E entries in it.
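
    As a rough illustration of the kind of query involved (this is not Aaron’s learning algorithm, just a single noisy marginal query over made-up binary data, with a made-up noise scale):

    ```python
    import random

    def laplace_noise(scale):
        # Laplace(0, scale) noise as the difference of two independent exponentials.
        return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

    # Made-up database: 1,000 entries, each with d = 30 binary attributes.
    d = 30
    database = [[random.randint(0, 1) for _ in range(d)] for _ in range(1000)]

    def noisy_marginal(db, attrs, noise_scale):
        # Noisy answer to "what fraction of entries have all of these attributes set?"
        frac = sum(all(row[a] for a in attrs) for row in db) / len(db)
        return frac + laplace_noise(noise_scale)

    # One marginal query of the kind described above, e.g. attributes 1, 15 and 23.
    # A learning algorithm would make a small number of queries like this, fit a
    # model to the noisy answers, and release the model (or synthetic data) instead.
    print(noisy_marginal(database, [1, 15, 23], 0.01))
    ```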

    This avoids Cathy’s attack on the flip-flopper, because every time a data analyst queries the synthetic data with the same query, they necessarily get the same answer. The result is that the database administrator has been able to release the answers to 2^d queries, while only having needed to actually make (d/E)^2 privacy-preserving queries to the database.

    Re: Stephen above, here is a paper that runs something like the above learning algorithm on real data sets and is able to accurately release the answers to all 3-way marginals. (http://arxiv.org/abs/1012.4763)


  3. itsik
    January 3, 2012 at 5:49 pm

    Geneticists have been obsessed with privacy for a while, and more specifically with de-anonymization of (then public) summary statistics since http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1000167&annotationId=info%3Adoi%2F10.1371%2Fannotation%2F717d0f33-e634-48c0-afe8-2699f12db55e
    In response, NIH just pulled all those stats out of the public domain.

