De-anonymizing what used to be anonymous: NYC taxicabs

October 17, 2014 Cathy O'Neil, mathbabe

Thanks to Artem Kaznatcheev, I learned yesterday about the recent work of Anthony Tockar in exploring the field of anonymization and deanonymization of datasets.

Specifically, he looked at data on 2013 cab rides in New York City, which was released under a FOIL request, and he stalked celebrities Bradley Cooper and Jessica Alba (and discovered that neither of them tipped the cabby). He also stalked a man who went to a slew of NYC titty bars: he found out where the guy lived and even got a picture of him.

Previously, some other civic hackers had identified the cabbies themselves, because the original dataset had scrambled the medallions, but not very well.
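To see why weak scrambling fails, here is a minimal sketch. It assumes the scrambling was an unsalted hash (MD5 is used here as an illustration) and that identifiers follow a short fixed pattern. The pattern below (digit, letter, digit, digit) and the sample ID "7B42" are hypothetical stand-ins, not taken from the dataset. When the space of possible IDs is this small, you can hash every candidate once and reverse any scrambled value by lookup.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def md5_hex(text):
    """Return the hex MD5 digest of an ASCII string (no salt)."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def candidate_ids():
    """Yield every ID in a hypothetical digit-letter-digit-digit format."""
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        yield f"{d1}{letter}{d2}{d3}"


# Hash the whole candidate space once: 10 * 26 * 10 * 10 = 26,000 entries.
lookup = {md5_hex(candidate): candidate for candidate in candidate_ids()}

# A "scrambled" ID as it might appear in a released file (hypothetical value).
scrambled = md5_hex("7B42")

# Reversing it is a single dictionary lookup.
recovered = lookup[scrambled]
```

A secret salt or a random per-ID token would defeat this particular lookup, though it would do nothing about the re-identification from trip details described above.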

The point he was trying to make was that we should not assume that “anonymized” datasets actually protect privacy. Instead we should learn how to use more thoughtful approaches to anonymizing stuff, and he proposes a method called “differential privacy,” which he explains here. It involves adding noise to the data, in a certain way, so that at the end any given person doesn’t risk too much of their own privacy by being included in the dataset versus not being included in the dataset.
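As a rough illustration of the noise-adding idea, here is a minimal sketch of the Laplace mechanism, one standard way to make a counting query differentially private. It is not necessarily the method Tockar uses. The record fields, the sample rides, and the choice of epsilon are all hypothetical. The key fact is that adding or removing one person changes a count by at most 1, so Laplace noise with scale 1/epsilon is enough to mask any single person's presence.

```python
import random


def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale).

    The difference of two independent exponentials with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of the records matching `predicate`.

    A count has sensitivity 1 (one person changes it by at most 1),
    so the noise scale is 1 / epsilon. Smaller epsilon means more
    noise and stronger privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical ride records; the "tip" field is made up for illustration.
rides = [{"tip": 0.0}, {"tip": 2.5}, {"tip": 0.0}, {"tip": 1.0}]

rng = random.Random(0)  # seeded only so the sketch is repeatable
noisy = private_count(rides, lambda r: r["tip"] == 0.0, epsilon=0.5, rng=rng)
```

Note that this answers a query about the data rather than releasing the rides themselves, and that each extra query spends more of the privacy budget. Deciding how to apply it to a given dataset is exactly the part that takes expertise.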

Bottom line, it’s actually pretty involved mathematically, and although I’m a nerd and it doesn’t intimidate me, it does give me pause. Here are a few concerns:

It means that most people, for example the person in charge of fulfilling FOIL requests, will not actually understand the algorithm.
That means that, if there’s a requirement that such a procedure is used, that person will have to use and trust a third party to implement it. This leads to all sorts of problems in itself.
Just to name one, depending on what kind of data it is, you have to implement differential privacy differently. There’s no doubt that a complicated mapping of datatype to methodology will be screwed up when the person doing it doesn’t understand the nuances.
Here’s another: the third party may not be trustworthy and may have created a backdoor.
Or they just might get it wrong, or do something lazy that doesn’t actually work, and they can get away with it because, again, the user is not an expert and cannot accurately evaluate their work.

Altogether I’m imagining that this is at best an expensive solution for very important datasets, and won’t be used for your everyday FOIL requests like taxicab rides unless the culture around privacy changes dramatically.

Even so, super interesting and important work by Anthony Tockar. Also, if you think that’s cool, take a look at my friend Luis Daniel’s work on de-anonymizing the Stop & Frisk data.

Categories: data science, modeling, open source tools, statistics

Comments (11)

FogOfWar

October 17, 2014 at 8:00 am

I was about to correct you and say “It’s FOIA, not FOIL”, but checked, and FOIL is the NY State equivalent of FOIA (pronounced “foi-yah”). Unless you’re a NY State law junkie, most people think of it as a “FOIA request” so I’ll use that interchangeably here.

FOIA requests are pretty boring-but-lordy-we’re-required-by-law-to-do-this-crap stuff when you’re working at counsel’s office for a government institution. Redacting emails to remove individual names isn’t fun or exciting legal work, and it’s nothing you’re going to get rewarded for doing a *great* job at, just get it off your plate and I don’t want it coming back on my desk, thank you.

Hopefully Anthony’s work will get a little more attention to doing these right. Cross post to abovethelaw.com for visibility?

FoW

- Cathy O'Neil, mathbabe
  
  October 17, 2014 at 1:53 pm
  
  Thanks! Yeah I’d love to, good idea.
  
FogOfWar

October 17, 2014 at 8:04 am

…oh, and the “no recorded tip” is more likely a tax dodge than a big-shot celebrity shafting their cabbie. If the passenger paid with cash, or paid the base fare with CC and tipped in cash, then of course the cabbie isn’t going to put the tip in the system because they’ll owe taxes on that money if they do!

FoW

- abekohen
  
  October 17, 2014 at 11:50 am
  
  FoW, you beat me to it.
  
Samuel Roby

October 17, 2014 at 10:35 am

The ride-sharing private corporations keep all that data. And more, about you.

Aaron Roth

October 17, 2014 at 10:43 am

Yep, one current obstacle to using differential privacy is that it takes some expertise to design and analyze (and prove privacy properties of!) the algorithms you want. There is some work trying to make this more accessible though. For example, some of my more practically oriented colleagues here at Penn are working on a programming language called Fuzz (http://privacy.cis.upenn.edu/software.html). Fuzz has the property that any algorithm written in it is certifiably differentially private: the fact that the program compiles is a proof of privacy. This of course doesn’t mean that it’s not still hard to write _good_ algorithms, but at least you can write _private_ algorithms without much expertise in differential privacy itself.

And for those who nevertheless want to develop human expertise in differential privacy, Cynthia Dwork and I now have a (free online!) textbook on the subject. http://www.cis.upenn.edu/~aaroth/privacybook.html (Please excuse the shameless plug)

- Cathy O'Neil, mathbabe
  
  October 17, 2014 at 1:52 pm
  
  Very cool, thanks!
  
  Cathy
  
abekohen

October 17, 2014 at 11:54 am

Of course all this “horrible” identifying data could be used for good, such as invalidating a murderer’s alibi, or catching cabbies who were charging “out-of-town” surcharges on “in-town” rides. But I get your point, Cathy, which is one reason that I think people should not answer Census questionnaires.

Auros

October 17, 2014 at 5:41 pm

It seems like in order to make this work, you need two separate contractors. The “blue team” contractor is responsible for anonymizing the data. The “red team” contractor attempts to break the blue team’s anonymization; they get paid a bonus, taken out of the blue team’s fee, if they succeed.

- Cathy O'Neil, mathbabe
  
  October 17, 2014 at 5:42 pm
  
  Yes. We would do that if we were serious about it.
  
Dan C.

October 19, 2014 at 8:27 am

These are excellent points.
