Big Data’s Disparate Impact

Take a look at this paper by Solon Barocas and Andrew D. Selbst entitled Big Data’s Disparate Impact.

It deals with the question of whether current anti-discrimination law is equipped to handle the kind of unintentional discrimination and digital redlining we see emerging in some “big data” models (and that we suspect are hidden in a bunch more). See for example this post for more on this concept.

The short answer is no, our laws are not equipped.

Here’s the abstract:

This article addresses the potential for disparate impact in the data mining processes that are taking over modern-day business. Scholars and policymakers had, until recently, focused almost exclusively on data mining’s capacity to hide intentional discrimination, hoping to convince regulators to develop the tools to unmask such discrimination. Recently there has been a noted shift in the policy discussions, where some have begun to recognize that unintentional discrimination is a hidden danger that might be even more worrisome. So far, the recognition of the possibility of unintentional discrimination lacks technical and theoretical foundation, making policy recommendations difficult, where they are not simply misdirected. This article provides the necessary foundation about how data mining can give rise to discrimination and how data mining interacts with anti-discrimination law.

The article carefully steps through the technical process of data mining and points to different places within the process where a disproportionately adverse impact on protected classes may result from innocent choices on the part of the data miner. From there, the article analyzes these disproportionate impacts under Title VII. The Article concludes both that Title VII is largely ill equipped to address the discrimination that results from data mining. Worse, due to problems in the internal logic of data mining as well as political and constitutional constraints, there appears to be no easy way to reform Title VII to fix these inadequacies. The article focuses on Title VII because it is the most well developed anti-discrimination doctrine, but the conclusions apply more broadly because they are based on the general approach to anti-discrimination within American law.

I really appreciate this paper, because it’s an area I know almost nothing about: discrimination law and the standards for evidence of discrimination.

Sadly, what this paper explains to me is how very far away we are from anything resembling what we need to actually address the problems. For example, even in this paper, where the writers are well aware that training on historical data can unintentionally codify discriminatory treatment, they still seem to assume that the people who build and deploy models will “notice” this treatment. From my experience working in advertising, that’s not actually what happens. We don’t measure the effects of our models on our users. We only see whether we have gained an edge in terms of profit, which is very different.

Essentially, as modelers, we don’t humanize the people on the other side of the transaction, which prevents us from worrying about discrimination or even being aware of it as an issue. It’s so far from “intentional” that it’s almost a ridiculous accusation to make. Even so, it may well be a real problem and I don’t know how we as a society can deal with it unless we update our laws.

Aunt Pythia’s advice

Quick, get on the bus! Hurry!

Aunt Pythia is gonna be super fast this morning because she’s got crepes to make and apples to pick.

And then many, many apple pies to bake.

Are you ready? Belts buckled? OK great, let’s do this. And afterwards:

please think of something to ask Aunt Pythia at the bottom of the page!

By the way, if you don’t know what the hell Aunt Pythia is talking about, go here for past advice columns and here for an explanation of the name Pythia.


Dear Aunt Pythia,

Now I’m dying to know – what are some Dan Savage answers that you disagree with?? Say, what are your top 3?

An obliging – and curious! – good friend

Dear Ao-ac-gf,

First, let me say I’m glad this is a written word thing and I don’t have to pronounce your name.

Second, I only disagree with Dan Savage on (pretty much) one thing. And he’s a gay man, and without meaning to offend may I say he has typical gay man aesthetics coming from mostly interacting with other men. You see this in fashion as well, which is dominated by gay men.

Which is to say, he’s really judgmental about fatness. And I find it peculiar, coming from a man who is pro-sex and anti-shame on most topics. As is typical of people who are judgy about fatness, he claims it’s coming from a place of worrying about health, which I first of all object to strenuously as a super healthy fat woman, but secondly it just strikes me as almost comically parallel to how people complain about gayness and hide behind some weird argument that it’s for the sake of the gay person’s soul.

UPDATE: please read this totally awesome essay on the subject.

That’s pretty much it. In almost every other way I agree with Dan Savage. And also, I haven’t read his stuff for a while, so who knows, maybe he’s had a total change of heart, and maybe he embraces fat ladies such as myself nowadays (although, not literally, I’m sure).

XOXOX good friend!

Aunt Pythia


Dear Aunt Pythia,

I’m currently in a data quality job that I was promoted into for sheer enthusiasm and work ethic. It’s turned into a data quality/analysis/reporting and visualisation role (can you guess I’m at a non-profit). I’ve taught myself advanced Excel, some data visualisation and how to manage our database since being promoted. I love knowing what all the data shows and being able to explain why certain things are happening. However I want to excel at my job and with no prior training (I’m not even a graduate yet) I find it so stressful as I feel I’m always one step behind.

Currently to improve my skills… (I have your book on my wishlist) I follow your blog and several others in similar fields and I’ve read books on Tableau/Excel/dashboard design and books on how to think statistically. I’m going through the entirety of the maths section on Khan Academy. I’m also studying part time so I will be a graduate soon and I have done some statistics in this course but it’s all been related to psychology experiments (I started the course before being promoted).

Unfortunately no one else in my organisation does anything similar or is any kind of position to train or mentor me. Would you be able to recommend other books/blogs/online courses or even ways of thinking/learning skills that might be useful?

Girl drowning in data

Dear Girl,

Whoa! You rock! Let’s hear it for enthusiasm and work ethic, sister!

And hey, I even have advice: check out the github for my data journalism program this past summer, there’s lots of good stuff there. Also make sure you’ve taken a look at Statistics Done Wrong. And also, the drafts of my book are all on my blog.

Good luck!

Auntie P


Dear Aunt Pythia,

I am fed up with being single, and I am fed up with dating mathematicians, because the aftermath is too awkward. I’d like to try online dating, but I’m too embarrassed to tell my friends. But I feel that I need to tell someone to stay safe. Do you have any suggestions?

Currently Unsure of my Prospects In Dating


OK let me just plug dating mathematicians in spite of the fact that you’ve decided to give up on them. They are actually super nice.

Come to think of it, before I met my husband, I decided on three rules for my next boyfriend and publicly announced them to my friends:

  1. Had to be at least 30 (because younger men were so freaking immature),
  2. Had to love his job (because men who don’t love their job are so freaking insecure)
  3. Couldn’t be a mathematician (because it’s so freaking awkward after breakups)

Then, after I met my mathematician husband and people pointed out my hypocrisy, I’d always say, “two out of three ain’t bad, amIright?”. So in other words, I’m totally fine with your proclamation that you’re done with math people, guys or girls, as long as you are willing to bend rules for the right nerd.

Back to online dating. Yes, I think it makes sense for at least one of your friends to know about your online activities before you start meeting strangers in night clubs. But I don’t really see why that’s embarrassing, maybe because I’m not easily embarrassed, but also because EVERYONE DOES ONLINE DATING. Seriously, I don’t know anyone who hasn’t tried that.

Why don’t you talk to a friend you trust and ask them what they think of online dating, and kind of poke the topic around a bit? I think you will be surprised to learn that it’s very common, and not at all embarrassing. And once you start doing it, with the disclosure to a good friend who will notice if you go missing, please be aware of the problems with online dating that have nothing to do with safety.

Good luck!

Aunt Pythia


Dear Aunt Pythia,

Our daughter has recently started watching way too much Faux News and blaming everything wrong in her life on “the liberals.” Not wanting to damage our relationship with her or our grandkids, my wife and I tend not to respond to her tea-partyish pronouncements. Alas, our silence is characterized as “uncomfortable,” and if we look at one another we’re presumed to be eye-rolling. I am afraid the whole thing may be escalating to the point that the kids start to see us as the villains responsible for the tensions in the air. The alternatives to silence appear to be: responding truthfully, which would probably get us ejected, or feigning agreement (i.e., lying), which we simply will not do. Agreeing honestly with minor details only gets us pressed for our positions on the larger issues, and we’re back to those two choices. Any ideas you have would be welcome.

Virtually Unspeaking Leftish Parents In No-win Exercise


What a foxy sign-off!!!

OK, so this is your daughter, right? Not your daughter-in-law? So presumably you raised her? And presumably she knows all about how leftish you guys are?

If so, it’s a weird situation. My best guess, from way over here in unspeakably leftish territory, is that she has hostility for you two and wants to blame you for her problems but the closest she can get to blaming you is blaming people like you, namely liberals.

Even if I’m wrong, there really does seem to be more than enough blame and hostility to go around in the above description, mostly coming from her, but also being passed around like a hot potato by all concerned. If I were you I’d focus on the underlying hostility, although maybe not talk directly about it with her. Some ideas:

  1. Maybe you could have dinner with just her (or with her husband if he’s around) and talk about how you guys don’t have to agree about everything to get along as a family. Focus on the interactions rather than the details of what you don’t agree about. Try to make a plan with her to avoid hot topics and enjoy your time together. Plan an apple-picking trip!
  2. If that’s too direct, think about what she’s actually accomplishing when she makes “tea-partyish pronouncements”. Does she do this right after something happens to embarrass her or put a spotlight on her vulnerabilities? Is there a pattern to the behaviors? Understanding what gives rise to those moments might help you defuse them. And if you can’t defuse them, it still might help you to know when things are coming up. Plan ahead about what you will say to change the subject.
  3. You can try to address the frustration by giving her lots of love in other ways. In other words, just find things where you guys get along and stick with them. Try to make a habit out of emphasizing common ground. Maybe you all love certain kinds of food or entertainment? Karaoke?
  4. If all those distraction methods fail, I think an articulate discussion of polite (even if strenuous!) disagreement is great for kids. And it shouldn’t ban you from spending time with the kids either, if you keep it relatively civilized.
  5. Here’s what might get you into real trouble: if you ever tell the grandkids what you really think when their mom isn’t around. That will get back to her and she will feel betrayed and might take away your private time with the grandkids. I think the disagreements have to happen out in the open in front of everyone.
  6. Finally, it just might not be possible. If she is on a tear for being hostile and blaming, then that’s what she’s gonna do. Some people are just filled with anger and there’s nothing anyone can do about it. I would just try the other stuff and if they don’t work try to be there for the grandkids, especially when they’re going through puberty.

Good luck, grandpa! I hope this was somewhat helpful.

Aunt Pythia


Dear Aunt Pythia,

CA has just adopted legislation to require that colleges require students to give positive consent before sex. In other words, lack of protest does not constitute consent. The change seems appropriate, but I wonder about the basic structure of the system.

My question: why are schools responsible rather than the police and does this empirically make the situation better? Are there fewer incidents, faster prosecution, more victim support, etc, because the universities are involved or does it function to shield perpetrators from criminal punishment?

Sorry this is only a quasi-sex question.

Sex Questions Unlikely In Near Term



I’m on the verge of making a huge rant about this issue. I’ll probably still do it actually, but yes, yes yes. Here’s an imaginary Q&A I have with myself on a daily basis.

Why are schools responsible? Mostly historical, towns don’t want to have to hire extra police to deal with the nuisance problems (think: vomit everywhere) that proliferate on campus, so schools are like, “we got this!”.

Does this make sense? It does for actual nuisance problems, but not for violent crime. In fact it leads to ridiculous situations where professors of philosophy are expected to decide whether something was a sex act or just really terrible sex by asking whether it’s really possible for someone to be ass-raped without lubrication. Yes, it is.

Why don’t students go straight to the real police when there is a violent crime committed against them? Partly because the campus police are nearby and present, but mostly because the “real” police are not sufficiently responsive to their complaints.

So doesn’t that mean that there are two entirely different systems available to 19-year-old rape victims, depending on whether they happen to be college students or not? Yes, and it’s bullshit, and elitist, although neither system actually works for the victims.

So what should we do? We should require that claims of violent crimes on campuses go straight to the real police and we should also require that real police learn how to do their jobs when it comes to rape, so it’s a fair system for all 19-year-olds.

Aunt Pythia


Please submit your well-specified, fun-loving, cleverly-abbreviated question to Aunt Pythia!

Click here for a form.

Categories: Aunt Pythia

De-anonymizing what used to be anonymous: NYC taxicabs

Thanks to Artem Kaznatcheev, I learned yesterday about the recent work of Anthony Tockar in exploring the field of anonymization and deanonymization of datasets.

Specifically, he looked at data on the 2013 cab rides in New York City, which was provided under a FOIL request, and he stalked celebrities Bradley Cooper and Jessica Alba (and discovered that neither of them tipped the cabby). He also stalked a man who went to a slew of NYC titty bars: found out where the guy lived and even got a picture of him.

Previously, some other civic hackers had identified the cabbies themselves, because the original dataset had scrambled the medallions, but not very well.
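To see why weak scrambling fails, here is a minimal sketch in Python. It assumes, purely for illustration, that the scrambling was an unsalted MD5 hash and that every ID has the form digit-letter-digit-digit (like "7B42"); both the hash choice and the ID format here are assumptions, as are the function names. With so few possible IDs, anyone can hash them all and look the answer up.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def scramble(identifier: str) -> str:
    """Hypothetical 'anonymization': an unsalted MD5 hash of the ID."""
    return hashlib.md5(identifier.encode("ascii")).hexdigest()


def build_lookup() -> dict:
    """Hash every possible ID of the assumed digit-letter-digit-digit form."""
    table = {}
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        candidate = f"{d1}{letter}{d2}{d3}"
        table[scramble(candidate)] = candidate
    return table


lookup = build_lookup()               # only 10 * 26 * 10 * 10 candidates
released_value = scramble("7B42")     # what a published row might contain
recovered = lookup[released_value]    # the original ID, recovered by lookup
```

The table takes a fraction of a second to build, which is the whole problem: hashing hides nothing when the set of possible inputs is small enough to enumerate.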

The point he was trying to make was that we should not assume that “anonymized” datasets actually protect privacy. Instead we should learn how to use more thoughtful approaches to anonymizing stuff, and he proposes a method called “differential privacy,” which he explains here. It involves adding noise to the data, in a certain way, so that at the end any given person doesn’t risk too much of their own privacy by being included in the dataset versus being not included in the dataset.
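For a rough feel of what "adding noise in a certain way" can mean, here is a minimal sketch of the textbook Laplace mechanism applied to a counting query. This is not necessarily the method Tockar uses; the function names, the toy ride records, and the choice of epsilon are all assumptions made up for the example.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the records matching predicate.

    Adding or removing one person changes a count by at most 1, so the
    noise scale is 1 / epsilon. Smaller epsilon means more noise and
    more privacy.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Toy data: hypothetical rides, two of them with no tip.
rides = [{"tip": 0.0}, {"tip": 2.5}, {"tip": 0.0}]
rng = random.Random(0)
noisy_no_tip_count = private_count(rides, lambda r: r["tip"] == 0.0, 1.0, rng)
```

The released number is close to the truth on average, but any single answer is fuzzy enough that it barely changes whether or not one particular ride is in the data. Note that this only covers a simple count; other kinds of queries and data need different treatment.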

Bottom line, it’s actually pretty involved mathematically, and although I’m a nerd and it doesn’t intimidate me, it does give me pause. Here are a few concerns:

  1. It means that most people, for example the person in charge of fulfilling FOIL requests, will not actually understand the algorithm.
  2. That means that, if there’s a requirement that such a procedure is used, that person will have to use and trust a third party to implement it. This leads to all sorts of problems in itself.
  3. Just to name one, depending on what kind of data it is, you have to implement differential privacy differently. There’s no doubt that a complicated mapping of datatype to methodology will be screwed up when the person doing it doesn’t understand the nuances.
  4. Here’s another: the third party may not be trustworthy and may have created a backdoor.
  5. Or they just might get it wrong, or do something lazy that doesn’t actually work, and they can get away with it because, again, the user is not an expert and cannot accurately evaluate their work.

Altogether I’m imagining that this is at best an expensive solution for very important datasets, and won’t be used for your everyday FOIL requests like taxicab rides unless the culture around privacy changes dramatically.

Even so, super interesting and important work by Anthony Tockar. Also, if you think that’s cool, take a look at my friend Luis Daniel’s work on de-anonymizing the Stop & Frisk data.

Bad Paper by Jake Halpern

Yesterday I finished Jake Halpern’s new book, Bad Paper: Chasing Debt From Wall Street To The Underground.

It’s an interesting series of close-up descriptions of the people who have been buying and selling revolving debt since the credit crisis, as well as the actual business of debt collecting. He talks about the very real problem, for debt collectors, of having no proof of debt, of having other people who have stolen your debt trying to collect on it at the same time, and of course the fact that some debt collectors resort to illegal threats and misleading statements to get debtors – or possibly ex-debtors, it’s never entirely clear – to pay up or suffer the consequences. An arms race of quasi-legal and illegal cultural practices.

Halpern does a good job explaining the plight of the debt collectors, including the people hired for the call centers. It’s the poor pitted against the poorer here, a dirty fight where information asymmetry is absolutely essential to the profit margin of any given tier of the system.

Halpern outlines those tiers well, as well as the interesting lingo created by this subculture centered, at least until recently, in Buffalo, New York. People at the top are credit card companies themselves or hedge fund buyers from credit card companies; in other words, people who get “fresh debt” lists in the form of Excel spreadsheets, where the people listed have recently stopped paying and might have some resources to pull. Then there are people who deal in older debt, which is harder to collect on. After that are people who have yet older debt which may or may not be stolen, so other collectors might simultaneously be picking over the carcasses. At the very bottom of the pile, from Halpern’s perspective, come the lawyers. They bring debtors to court and try to garnish wages.

Somewhat buried at the very end of Halpern’s book is some quite useful information for the debtors. So for example, if you ever get dragged to court by a debt collection lawyer,

  1. definitely show up (or else they will just garnish your wages)
  2. ask for proof that they own the debt and how you spent it. They will likely not have such documentation and will dismiss your case.

Overall Bad Paper is a good book, and it explains a lot of interesting and useful information, but from my perspective, being firmly on the side of (most of) the debtors, everyone who gets a copy of the book should also get a copy of Strike Debt’s Debt Resistors’ Operation Manual, which has way more useful information, and even form letters, for the debtor.

As far as real solutions, we see the usual problems: underfunded and impotent regulators in the FTC, the CFPB, and the Attorney General’s office, as well as ridiculously small fines when actually caught that amount to fractions of the profit already made by illegal tactics. Everyone is feasting, even when they don’t find much meat on the bones.

Given how big a problem this is, and how many people are being pursued by debt collectors, you’d think they might set up a system of incentives so lawyers can make money by nailing illegal actions instead of just leveraging outdated information and trying to squeeze poor people out of their paychecks.

The bigger problem, once again, is that so many people are flat broke and largely go into debt for things like emergency expenses. And yes, of course there are people who buy a bunch of things they don’t need and then refuse to pay off their debts – Halpern profiles one such person – but the vast majority of the people we’re talking about are the struggling poor. It would be nice to see our country become a place where we don’t need so much damn debt in the first place; then the scavengers wouldn’t have so many rubbish piles to live off of.

Categories: #OWS, economics, journalism

Upcoming data journalism and data ethics conferences


Today I’m super excited to go to the opening launch party of danah boyd’s Data and Society. Data and Society has a bunch of cool initiatives but I’m particularly interested in their Council for Big Data, Ethics, and Society. They were the people that helped make the Podesta Report on Big Data as good as it was. There will be a mini-conference this afternoon I’m looking forward to very much. Brilliant folks doing great work and talking to each other across disciplinary lines, can’t get enough of that stuff.

This weekend

This coming Saturday I’ll be moderating a panel called Spotlight on Data-Driven Journalism: The job of a data journalist and the impact of computational reporting in the newsroom at the New York Press Club Conference on Journalism. The panelists are going to be great:

  • John Keefe @jkeefe, Sr. editor, data news & J-technology, WNYC
  • Maryanne Murray @lightnosugar, Global head of graphics, Reuters
  • Zach Seward @zseward, Quartz
  • Chris Walker @cpwalker07, Dir., data visualization, Mic News

The full program is available here.

December 12th

In mid-December I’m on a panel myself at the Fairness, Accountability, and Transparency in Machine Learning Conference in Montreal. This conference seems to directly take up the call of the Podesta Report I mentioned above, and seeks to provide further research into the dangers of “encoding discrimination in automated decisions”. Amazing! So glad this is happening and that I get to be part of it. Here are some questions that will be taken up at this one-day conference (more information here):

  • How can we achieve high classification accuracy while eliminating discriminatory biases? What are meaningful formal fairness properties?
  • How can we design expressive yet easily interpretable classifiers?
  • Can we ensure that a classifier remains accurate even if the statistical signal it relies on is exposed to public scrutiny?
  • Are there practical methods to test existing classifiers for compliance with a policy?

What male allies should *really* be doing

Chris Wiggins was kind enough to forward me this article on a recent panel discussion of “Male Allies of Women” at the 2014 Grace Hopper Celebration, which is a big deal conference for women in tech.

Panelists included Facebook CTO Mike Schroepfer, Google’s SVP of Search Alan Eustace, GoDaddy CEO Blake Irving, and Intuit CTO Tayloe Stansbury. The advice was stale and trite and included things like “speak up,” “lean in,” and “get excited about your ideas like men do.”

Yes, I said GoDaddy.

By far the best part was the audience response – I wish I’d been there just for that part.

There was a Bingo game on the phrases that were anticipated.



What male allies should really be doing, step 1

Here’s the thing. If you haven’t seen this video of gamer Anita Sarkeesian speaking at the Feminist Frequency conference (hat tip Josh Vekhter), go take a look. It’s a fantastic and articulate diatribe against sexism and misogyny, and it ends with a super reasonable request of the men in the audience and in the world:

Trust women who say they experience sexism.

What’s amazing to me is how hard this is to hear for men in my life. When I repeated this to a couple of them, they actually said that I didn’t experience the stuff that I had. It was kind of nuts, and I had to point out to them that they were failing on the most basic level.

Yes, it requires empathy, and observation, and yes it sucks, because once you start seeing it you will be disappointed in the world. Tough shit, it’s reality.

What male allies should really be doing, step 2

Once men start trusting the women they love and admire and work with, then the next thing they can do is start acting on that knowledge.

I don’t know how many times I’ve been the target of sexism in front of other men and somehow it’s my job to confront it and deal with it. Men, step the fuck up and, when you see sexism happening, once you can manage that, defend the target and put a stop to it. Speak up and defend your friend, or your wife, or your daughter, or your colleague. Thanks.

Categories: rant

Reverse-engineering Chinese censorship

This recent paper written by Gary King, Jennifer Pan, and Margaret Roberts explores the way social media posts are censored in China. It’s interesting, take a look, or read this article on their work.

Here’s their abstract:

Existing research on the extensive Chinese censorship organization uses observational methods with well-known limitations. We conducted the first large-scale experimental study of censorship by creating accounts on numerous social media sites, randomly submitting different texts, and observing from a worldwide network of computers which texts were censored and which were not. We also supplemented interviews with confidential sources by creating our own social media site, contracting with Chinese firms to install the same censoring technologies as existing sites, and—with their software, documentation, and even customer support—reverse-engineering how it all works. Our results offer rigorous support for the recent hypothesis that criticisms of the state, its leaders, and their policies are published, whereas posts about real-world events with collective action potential are censored.

Interesting that they got so much help from the Chinese to censor their posts. Also keep in mind a caveat from the article:

Yu Xie, a sociologist at the University of Michigan, Ann Arbor, says that although the study is methodologically sound, it overemphasizes the importance of coherent central government policies. Political outcomes in China, he notes, often rest on local officials, who are evaluated on how well they maintain stability. Such officials have a “personal interest in suppressing content that could lead to social movements,” Xie says.

I’m a sucker for reverse-engineering powerful algorithms, even when there are major caveats.

