Do you know what’s awesome about writing a blog? Sometimes you’re left with a technical question, and someone with the technical answer comes right along and comments. It’s like magic.
That’s what happened to me when I wrote a post about privacy vs. openness and suggested that the world needs more people to think about anonymizing data. Along comes Aaron Roth who explains to me that the world is already doing that. Cool!
The setup is that there’s some data stored in a database, and a “release mechanism” that lets outside users ask questions about the data; this is called querying for a statistic. Each row of the data is assumed to belong to one person, so a row might contain, for example, that person’s scores on some medical test, along with other private information that identifies them.
The basic question is, how can we set up the mechanism so that the users can get as much useful information as possible while never exposing an individual’s information?
Actually the exact condition posed is even a bit more nuanced: how can we set up the mechanism so that any individual, whether they’re in the database or not, is indifferent to being removed from it or added to it?
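If I’m piecing the formal version together correctly (this is my paraphrase, so treat it as a sketch rather than the official definition), the condition says: for any two databases D and D′ that differ in exactly one record, and any set S of possible outputs, the release mechanism M should satisfy

Pr[M(D) ∈ S] ≤ exp(ε) · Pr[M(D′) ∈ S],

where ε is a small number controlling how much any one person’s presence can tilt the answers. In other words, nobody looking at the output can tell with much confidence whether you were in the database at all.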
This is a kind of information theory question, and it’s tricky. First they define a measure of how much information is gained or lost when you take exactly one person out of the database: how much do the resulting statistics change? Do they change enough for the outside user to infer, with confidence, what the lost record contained? If so, not good.
For example, suppose the user queries for the mean test score of a population both with and without a given record (call the associated person the “flip-flopper,” my term), gets the exact answer both times, and knows how many people are in the population besides the flip-flopper. Then the user can figure out the flip-flopper’s exact test score.
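To make that concrete, here’s a toy version in Python (all the numbers are made up, and the flip-flopper is just the last row):

```python
# Toy attack: recover one person's score from two exact mean queries.
scores = [81.0, 92.5, 77.0, 88.0, 95.5]    # made-up database; flip-flopper is last
n = len(scores)

mean_with = sum(scores) / n                 # query 1: mean including the flip-flopper
mean_without = sum(scores[:-1]) / (n - 1)   # query 2: mean after removing them

# The first mean times n is the total of all scores; subtract the
# total of everyone else and the flip-flopper's score pops out.
recovered = n * mean_with - (n - 1) * mean_without
print(recovered)                            # prints 95.5, their exact score
```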
One way of making this harder for the user (a bad one, for a reason I’ll explain shortly) is to independently “add noise” to a given statistic after computing it. Then the answers aren’t exact in either case, so you’d have low confidence in your resulting guess at the flip-flopper’s test score, assuming of course that the population is large and their score isn’t a huge outlier.
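Here’s a minimal sketch of what that might look like in Python. I’m using Laplace noise because that seems to be the standard choice in this literature, and the scale I’ve picked, (hi − lo)/(n·ε), is my understanding of how you’d calibrate it for a mean over bounded scores; treat the details as assumptions, not gospel:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_mean(scores, epsilon=0.1, lo=0.0, hi=100.0):
    """Answer a mean query with fresh, independent Laplace noise.

    Adding or removing one person changes a bounded mean by at most
    (hi - lo) / n, so the noise scale is that divided by epsilon.
    """
    scale = (hi - lo) / (len(scores) * epsilon)
    return float(np.mean(scores)) + rng.laplace(loc=0.0, scale=scale)
```

Each call draws new noise, so asking with and without the flip-flopper gives two fuzzy answers whose difference tells you very little about any one row.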
But this is a crappy solution for the purpose I originally had in mind for an anonymization method, namely setting up an open source data center that lets outside users (i.e. you) query to their hearts’ content. A given user could simply submit the same query over and over, and after a million queries (depending on how much noise we’ve added) would have a very good approximation of the actual answer, i.e. the actual average test score. The added noise would cancel out.
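Here’s that averaging attack against the noisy-mean sketch above (same made-up calibration), vectorized so a million queries run fast:

```python
import numpy as np

rng = np.random.default_rng(1)
scores = rng.uniform(60, 100, size=1000)      # made-up population
true_mean = scores.mean()

# A million answers to the same query: true mean plus fresh independent noise.
scale = (100.0 - 0.0) / (len(scores) * 0.1)   # same calibration as the sketch above
answers = true_mean + rng.laplace(scale=scale, size=1_000_000)

print(true_mean)
print(answers.mean())   # nearly identical: the independent noise averages away
```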
So instead, you’d have to add noise in a different way, i.e. not independently to each statistic. Another possibility is to add noise to the original data itself, but that doesn’t feel as good to me, especially for the privacy of outliers, unless the noise is really large. But then again maybe that’s exactly what you do, so that any individual’s numbers are obfuscated but on a large enough scale you still have a good sense of the statistics. I’ll have to learn more about it.
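A toy version of that second idea (noise baked into the data itself, once, up front; again this is my own made-up sketch, not anything from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
scores = rng.uniform(60, 100, size=1000)    # the true, private data

# Perturb each record ONCE and discard the originals. The scale is
# arbitrary here -- just big enough to hide any single person's score.
noisy_scores = scores + rng.laplace(scale=20.0, size=len(scores))

# Repeated queries now all hit the same perturbed data, so averaging a
# million answers only recovers the perturbed mean, not the true one.
# But over 1000 records the per-row noise mostly cancels, so aggregate
# statistics survive even though each individual row is badly obfuscated:
print(scores.mean())
print(noisy_scores.mean())
```

Each individual row is now junk, but the large-scale statistics come out roughly intact, which is the trade-off I was gesturing at.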
Aaron offered another possibility which I haven’t really understood yet, namely for the mechanism to be stateful. In fact I’m not sure what that means, but it seems to have something to do with the mechanism being aware of other databases. I’ve still got lots more to learn, obviously, which is exactly how I like to feel about interesting things.