Privacy vs. openness

December 15, 2011 Cathy O'Neil, mathbabe

I believe in privacy rights when it comes to modern technology and data science models, especially when the models are dealing with things like email or health records. It makes sense, for example, that the data itself is not made public when researchers study diseases and treatments.

Andrew Gelman’s blog post recently brought this up, and clued me in to the rules for sharing data that come from the Institutional Review Board (IRB).

The IRB rules deal with questions like whether the study participants have agreed to let their data be shared if the data is first anonymized. But the crucial question is whether it’s really possible to anonymize data at all.

It turns out it’s not that easy, especially if the database is large. There have been famous cases (Netflix prize) where people have been identified even though the data was “anonymized.”
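To make the difficulty concrete, here is a toy sketch of a linkage attack, the general idea behind that kind of re-identification: match rows in an "anonymized" table against some public source that shares a few details. Everything in it is made up for illustration (the records, the names, and the `best_match` helper), and it is not the actual method or data from any real case.

```python
# Toy linkage attack. All records and names below are hypothetical.

# "Anonymized" release: real names replaced by opaque user ids,
# but each row still carries (title, date) pairs.
anonymized = [
    {"user": "u1", "ratings": {("Movie A", "2005-03-01"),
                               ("Movie B", "2005-03-04"),
                               ("Movie C", "2005-04-10")}},
    {"user": "u2", "ratings": {("Movie A", "2005-03-02"),
                               ("Movie D", "2005-05-20"),
                               ("Movie E", "2005-06-11")}},
]

# Public side information: reviews someone posted under their own name.
public_reviews = {
    "alice": {("Movie D", "2005-05-20"), ("Movie E", "2005-06-11")},
}


def best_match(known_ratings, rows):
    """Return (user_id, overlap) for the row sharing the most ratings."""
    scored = [(len(known_ratings & row["ratings"]), row["user"]) for row in rows]
    overlap, user = max(scored)
    return user, overlap


user, overlap = best_match(public_reviews["alice"], anonymized)
print(f"alice is probably {user} ({overlap} matching ratings)")
```

With only two matching ratings, the named person lines up with exactly one "anonymous" row. In a large database there are more rows to confuse, but also far more details per person to match on, which is why size doesn't save you.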

On the other hand, we don’t want people creating and running with secret models with the excuse that they are protecting people’s privacy. First, because the models may not work: we want scientific claims to be substantiated by retesting, for example (this was the point of Gelman’s post). But also we generally want a view into how personal information about people is being used.

Most modeling going on nowadays involving personal information is probably not fueled by academic interest in curing diseases, but rather by interest in how to sell stuff and how to monitor people.

Here are two examples. This Bloomberg article describes how annoyed people get when they are being tracked while they’re shopping in malls, even though the actual intrusiveness of the tracking is arguably much worse when people shop online. And this Wall Street Journal article describes the use of French surveillance systems by the Gadhafi regime.

I think we should separate two issues here, namely the model versus the data. In the cases of public surveillance, like at the mall or online, or something involving public employees, I think people should be able to see how their data is being used even if the entire database is being kept out of their view. This way nobody can say their privacy is being invaded.

For example, if the public school system uses data from students and teachers to score the value added of teaching, then the teachers should have access to the model being used to score them. In particular this would mean they’d be able to see how their score would have changed if certain of their attributes changed, like which school they teach at or how many kids are in their class.
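Here is a minimal sketch of what that kind of access could look like in practice. The model, its weights, and the attribute names are all invented for illustration; real value-added models are more complicated and I am not claiming any school system uses this formula. The point is only that a teacher can ask "what if" questions of the model without ever seeing anyone else's data.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Teacher:
    # Hypothetical attributes; a real model would have many more.
    avg_score_gain: float       # average test-score gain of the class
    class_size: int             # number of kids in the class
    school_poverty_rate: float  # fraction between 0 and 1


def value_added_score(t: Teacher) -> float:
    """Hypothetical linear scoring model; the weights are made up."""
    return (2.0 * t.avg_score_gain
            + 0.1 * (t.class_size - 25)
            + 5.0 * t.school_poverty_rate)


def what_if(t: Teacher, **changes) -> float:
    """Change in score if the given attributes had been different."""
    return value_added_score(replace(t, **changes)) - value_added_score(t)


teacher = Teacher(avg_score_gain=3.0, class_size=25, school_poverty_rate=0.4)
print("current score:", value_added_score(teacher))
print("change with 35 kids in the class:", what_if(teacher, class_size=35))
```

Note that `what_if` only needs the model and the teacher's own row, which is exactly the separation between model and data I am arguing for.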

It is unlikely that private companies would be happy to expose the models they use to sell merchandise or clicks. If private companies don’t want to reveal their secret sauce, then one possibility is to make their modeling opt-in (rather than opt-out). By the way, right now you can opt out of most things online by consistently clearing your cookies.

I am being pretty extreme here in my suggestions, but even if we don’t go this far, I think it’s clear that we will have to consider these questions and many more like them soon. The idea that online data modeling can be self-regulating is pretty laughable to me, especially when you consider how well that worked in finance. The kinds of “stalker apps” that are popping up everywhere are very scary and very creepy to people who like the idea of privacy.

In the meantime we need some nerds to figure out a better way to anonymize data. Please tell me if you know of progress in that field.

Categories: data science

Comments (5)

Mary

December 15, 2011 at 9:05 am

Important correction: it is not “the International Review Board” that Andrew is discussing, which would imply a single monolithic standard. Rather, as your Wikipedia link makes clear, he is referring to *institutional* review boards (abbreviated IRB) in the plural, which each research institution (i.e., college, university or think-tank accepting grant funds) must set up to set its own standards for ethical conduct of research involving human subjects.

There are broad guidelines for IRBs, but they also have a lot of individual latitude to interpret those guidelines. There is no universal consensus on exactly how IRBs should apply the guidelines in every individual case. There are cases of collaborating researchers from different universities on a single project where one gets her research proposal approved by the IRB at her university while another submits the same proposal to her university’s IRB and gets turned down flatly.

- Cathy O'Neil, mathbabe
  
  December 15, 2011 at 9:11 am
  
  Thanks Mary, I’ve corrected that.
  
  Cathy
  
Bertie

December 15, 2011 at 7:31 pm

Off topic, sorry, an article for anyone interested that discusses how recent events have changed traditional ideas about class and how these changes are embodied in the OWS movement

http://www.tomdispatch.com/blog/175480/

Aaron

January 2, 2012 at 1:08 pm

There is starting to be a lot of research on how to compute on data while providing provable privacy guarantees. If you are interested, you can google around for the phrase “differential privacy”. I’ve also got some course notes here: http://www.cis.upenn.edu/~aaroth/courses/privacyF11.html

- Cathy O'Neil, mathbabe
  
  January 2, 2012 at 1:11 pm
  
  Wish I could come to this course! Thanks for telling me about it, really cool.
  
  Cathy
  