Privacy concerns
Yesterday's theme here at the IMA in Minnesota was privacy: there was a talk on privacy and information, followed later in the afternoon by a discussion among the participants.
The talk, given by Vitaly Shmatikov, was pretty bad news for anyone who is still hoping to keep private info out of the hands of random internet trolls. Vitaly explained how de-identifying data is a relatively hopeless task, at least if you want the result to retain useful information, because most human data is so sparse.
An example he gave: the average Netflix user has rated about 200 films but can be identified from just 4 of those ratings, at least if you include timestamps. He also pointed out that, even though Netflix doesn’t directly expose its users’ movie preferences, you can infer quite a bit just by watching how its recommendations (of the form “people who like House also like Friends”) evolve over time.
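To make the sparsity point concrete, here is a minimal sketch, with entirely invented data, of how a handful of approximately dated ratings can pin down a single record in an “anonymized” table. This is not Shmatikov’s actual algorithm, just an illustration of the style of linkage attack:

```python
from datetime import date

# Toy "anonymized" ratings table: user id -> list of (movie, stars, date).
# Everything here is invented for illustration.
anon_db = {
    "user_17": [("Heat", 5, date(2005, 3, 1)), ("Clue", 2, date(2005, 3, 4))],
    "user_42": [("Heat", 5, date(2005, 3, 2)), ("Fargo", 4, date(2005, 6, 9)),
                ("Clue", 2, date(2005, 3, 4)), ("Big", 3, date(2005, 8, 20))],
}

def matches(known, record, day_slack=3):
    """True if every known (movie, stars, approximate date) fits the record."""
    return all(
        any(m == movie and s == stars and abs((d - when).days) <= day_slack
            for m, s, d in record)
        for movie, stars, when in known
    )

# The attacker knows a few of the target's ratings from a public source
# (say, reviews posted under their real name), with only approximate dates.
known = [("Heat", 5, date(2005, 3, 1)), ("Fargo", 4, date(2005, 6, 10))]

candidates = [uid for uid, rec in anon_db.items() if matches(known, rec)]
print(candidates)  # ['user_42'] -- a unique match from just two ratings
```

With sparse data, even a crude matching rule like this tends to leave exactly one candidate standing, which is the whole problem.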
And never mind Facebook or other social media, which he explained can be crawled directly to recover the graph structure; even without any labels on the nodes (which represent people), one can infer an outrageous amount given a separate, labeled graph that overlaps with the first.
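A toy version of that graph idea, again with made-up data: even with all names stripped, the local structure around a node acts as a fingerprint that a labeled auxiliary graph can resolve. Real attacks are far more robust than this degree-signature sketch, but the principle is the same:

```python
# Anonymous graph: numeric node ids only (invented example).
anon = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2}, 4: {2}}

# Auxiliary labeled graph (e.g. crawled from another network)
# that overlaps with the anonymous one.
labeled = {"ada": {"bob", "cat"}, "bob": {"ada", "cat", "dan"},
           "cat": {"ada", "bob"}, "dan": {"bob"}}

def signature(graph, node):
    """Crude structural fingerprint: (degree, sorted neighbor degrees)."""
    return (len(graph[node]),
            tuple(sorted(len(graph[n]) for n in graph[node])))

# Map each anonymous node to labeled nodes with an identical signature.
for node in anon:
    hits = [name for name in labeled
            if signature(labeled, name) == signature(anon, node)]
    print(node, "->", hits)
# Nodes with unique signatures (here 2 -> bob, 4 -> dan) are re-identified
# outright; real attacks then propagate outward from such seed matches.
```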
Conclusion: when people think they’ve de-identified data, they usually haven’t, because other people can collect lots of such data sets, plus some small amount of partial, identified private information about individual users, and piece it all together to learn creepy amounts of stuff.
An example I heard later that day: someone has figured out how to take a picture of a person and recover a large part of their Social Security number.
The conversation later that day focused more on how companies should protect their client data (through some de-identifying procedure), and how for the most part they do absolutely nothing right now. This is perhaps because the problem “isn’t solved,” so people see no reason to do something imperfect, even though some basic procedures would make attacks a lot harder. My suspicion is that by doing nothing they’re betting they’re more protected from litigation than if they did something that turned out to be too weak. Call me a skeptic, but it’s always about litigation.
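For concreteness, the sort of “basic procedure” in question looks something like the following sketch (field names invented): drop the direct identifiers and coarsen the quasi-identifiers. As the talk made clear, this raises the bar without solving the problem:

```python
# A minimal sketch of baseline de-identification: illustrative only,
# and NOT a guarantee of privacy against the linkage attacks above.

def crude_deidentify(record):
    """Drop direct identifiers and coarsen quasi-identifiers."""
    return {
        "age_band": f"{(record['age'] // 10) * 10}s",  # 37 -> '30s'
        "zip3": record["zip"][:3],                     # keep ZIP prefix only
        "diagnosis": record["diagnosis"],              # the useful payload
        # name, full zip, exact birthdate etc. are deliberately omitted
    }

row = {"name": "J. Doe", "age": 37, "zip": "55455", "diagnosis": "flu"}
print(crude_deidentify(row))
# {'age_band': '30s', 'zip3': '554', 'diagnosis': 'flu'}
```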
My contribution to both the talk and the conversation was this: why don’t we stop worrying about data getting out, since it’s kind of a done deal (no data ever disappears, and our algorithms are only getting better)? Why don’t we assume that all historical information about everyone is already out there?
So, besides how to get into my bank account (I haven’t sorted that one out yet; maybe I’ll just have to carry all my money in physical form), I’ll assume everyone knows everything about me, including my DNA.
Then the question becomes, how can we live in a reasonable world given that? How can we, for example, prevent private insurance companies from looking up someone’s DNA or HIV status in order to deny coverage?
This is not an idle concern. As someone pointed out yesterday, 15 years ago nobody would have believed someone who described how much information about the average Facebook user is available now. We have no idea what it will look like in another 15 years.

If data scientists exploit personal data ‘in a race to the bottom’, analogous to what we’ve seen with some HFT models, some risk models, Apple’s labor practices with Foxconn, etc., then there will eventually be ‘push back’. The data science profession will lose respect and credibility (it currently has ‘rock star’ status, much as the ‘Masters of the Universe’ did in finance before they ‘blew up’ the global economy).
Is there any way to ‘weed the garden’ and prevent the abuse of data? Even a NY Times front-page exposé on Apple’s use of labor at Foxconn has had limited effect in changing their practices (although I’ve read that Foxconn will be replacing humans with 1.2 million robots).
I know this blog is dedicated to that goal.
No, that isn’t the question, because I am not willing to accept a world with no privacy. Sure, you and I are part of the generation that, having left footprints in the margarine by belonging to social networks or blogging or whatever, will not have the luxury of privacy. But we haven’t screwed it up yet for children currently under the age of 13, and if we haven’t gotten it right by the time they are college age, there’s always the younger children.

And part of it isn’t just the problem of current information; it’s old information. Are we really going to let people’s preferences and interests at age 15 persist in their profiles? Sure, the fact that I had an enormous crush on Remington Steele now seems cute, but many of my other interests at that age are just plain embarrassing. People grow and change.

If you segregate people too early into “winners” and “losers,” or even “warriors” and “peaceniks,” aren’t you going to trap them in old patterns and prevent them from growing and changing? If you target people based on income, don’t the rich get richer because they get better opportunities over time? If you send people political stuff you think they want to read, don’t they drift further and further from the center over time as they are segmented into groups that largely agree with them?

I want my kids (and everyone’s kids) to have a safe space to try out new ideas and opinions and make mistakes, without being labeled for life. I will accept nothing less. Technically it’s a super-hard problem and it’s going to take years to figure out, but to throw up our hands and say, ok, privacy is dead so just deal with it, is completely unacceptable (on the other hand, we do have to safeguard the current generation, whose teenage opinions and mistakes will always be out there as well).
“How can we, for example, prevent private insurance companies from looking up someone’s DNA or HIV status in order to deny coverage?”
Can we even prevent insurers from denying children and grandchildren coverage based on a parent’s DNA test? If Mom or Dad has #horriblegene, then Bro and Sis each have a 50% chance of carrying #horriblegene, and insurers will definitely try to price that risk into their premiums.
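A back-of-the-envelope version of that pricing logic, with all the numbers invented:

```python
# How an insurer might load a premium on a 50% inheritance risk.
# Every figure below is hypothetical, chosen only to show the arithmetic.

p_inherit = 0.5            # child of a carrier parent (autosomal dominant)
p_illness_if_gene = 0.8    # hypothetical penetrance of #horriblegene
cost_if_ill = 100_000      # hypothetical claim cost
baseline_risk = 0.01       # population risk absent any gene information

expected_cost = (p_inherit * p_illness_if_gene * cost_if_ill
                 + (1 - p_inherit) * baseline_risk * cost_if_ill)
print(expected_cost)  # 40500.0, versus 1000.0 at the baseline rate alone
```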