Privacy vs. openness
I believe in privacy rights when it comes to modern technology and data science models, especially when the models are dealing with things like email or health records. It makes sense, for example, that the data itself is not made public when researchers study diseases and treatments.
The IRB rules deal with questions like whether the study participants have agreed to let their data be shared if the data is first anonymized. But the crucial question is whether it’s really possible to anonymize data at all.
It turns out that’s not so easy, especially when the database is large. There have been famous cases, like the Netflix Prize dataset, where individuals were re-identified even though the data had been “anonymized.”
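To see why “anonymized” data can fail, here is a minimal sketch of the standard linkage attack: strip the names from a table, and the remaining quasi-identifiers (zip code, birth year, sex) can still pin down individuals when matched against a public record like a voter roll. All the names, values, and records below are made up for illustration.

```python
from collections import Counter

# Toy "anonymized" medical records: names removed, but quasi-identifiers
# (zip code, birth year, sex) kept. All values are invented.
anonymized = [
    {"zip": "10001", "birth_year": 1975, "sex": "F", "diagnosis": "flu"},
    {"zip": "10001", "birth_year": 1975, "sex": "M", "diagnosis": "asthma"},
    {"zip": "10002", "birth_year": 1980, "sex": "F", "diagnosis": "diabetes"},
]

# Public records (think: a voter roll) with the same quasi-identifiers plus names.
public = [
    {"name": "Alice", "zip": "10002", "birth_year": 1980, "sex": "F"},
    {"name": "Bob",   "zip": "10001", "birth_year": 1975, "sex": "M"},
]

def quasi_id(row):
    """The combination of attributes shared by both tables."""
    return (row["zip"], row["birth_year"], row["sex"])

# How many anonymized rows share each quasi-identifier combination?
counts = Counter(quasi_id(r) for r in anonymized)

# Linkage: any public person whose combination is unique in the
# anonymized table is re-identified, diagnosis and all.
reidentified = {
    p["name"]: r["diagnosis"]
    for p in public
    for r in anonymized
    if quasi_id(p) == quasi_id(r) and counts[quasi_id(p)] == 1
}
print(reidentified)  # every match printed here is a privacy breach
```

In a large, detailed database most quasi-identifier combinations are unique, so the `counts[...] == 1` condition fires for almost everyone; that is exactly what made the Netflix case work.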
On the other hand, we don’t want people building and deploying secret models under the excuse that they’re protecting people’s privacy. First, because the models may not work: we want scientific claims to be substantiated by retesting, for example (this was the point of Gelman’s post). But we also generally want a view into how personal information about people is being used.
Most of the modeling involving personal information nowadays is probably fueled not by academic interest in curing diseases but by commercial interest in selling stuff and monitoring people.
As two examples: this Bloomberg article describes how annoyed people get when they’re tracked while shopping in malls, even though the tracking is arguably far more intrusive when people shop online, and this Wall Street Journal article describes the use of French surveillance systems by the Gadhafi regime.
I think we should separate two issues here: the model versus the data. In cases of public surveillance, like at the mall or online, or anything involving public employees, I think people should be able to see how their data is being used even if the database itself is kept out of their view. That way nobody can say their privacy is being invaded.
For example, if the public school system uses data from students and teachers to score teachers’ value added, then the teachers should have access to the model being used to score them. In particular, they’d be able to see how their score would change if certain of their attributes changed, like which school they teach at or how many kids are in their class.
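What “access to the model” buys a teacher can be sketched concretely. The model below is entirely hypothetical: a toy linear score with invented attribute names and weights, standing in for whatever the district actually uses. The point is only that, once the model is published, a teacher can run counterfactuals on her own attributes.

```python
# Hypothetical value-added model: a linear score over teacher attributes.
# Attribute names, weights, and intercept are all invented for illustration.
WEIGHTS = {
    "avg_test_score_gain": 10.0,   # points per unit of student test gain
    "class_size": -0.5,            # penalty per student in the class
    "school_poverty_rate": -2.0,   # penalty per unit of school poverty
}
INTERCEPT = 50.0

def score(attrs):
    """Score a teacher under the (published, hypothetical) model."""
    return INTERCEPT + sum(WEIGHTS[k] * v for k, v in attrs.items())

teacher = {"avg_test_score_gain": 1.2, "class_size": 30, "school_poverty_rate": 4.0}
baseline = score(teacher)

# Counterfactual: same teacher, but with 20 kids in the class instead of 30.
what_if = score({**teacher, "class_size": 20})
print(baseline, what_if)
```

The gap between `baseline` and `what_if` tells the teacher exactly how much of her score is driven by class size rather than by anything she controls, which is the kind of transparency the paragraph above is asking for.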
It is unlikely that private companies would be happy to expose the models they use to sell merchandise or clicks. If private companies don’t want to reveal their secret sauce, then one possibility is to make their modeling opt-in (rather than opt-out). By the way, right now you can opt out of most things online by consistently clearing your cookies.
I’m being pretty extreme in these suggestions, but even if we don’t go this far, I think it’s clear we’ll have to confront these questions, and many more like them, soon. The idea that online data modeling can be self-regulating is pretty laughable to me, especially when you consider how well that worked in finance. The kinds of “stalker apps” popping up everywhere are very scary and very creepy to people who like the idea of privacy.
In the meantime we need some nerds to figure out a better way to anonymize data. Please tell me if you know of progress in that field.