
When big data goes bad in a totally predictable way

August 19, 2013

Three quick examples this morning in the I-told-you-so category. I’d love to hear Kenneth Neil Cukier explain how “objective” data science is when confronted with this stuff.

1. When an unemployed black woman pretends to be white her job offers skyrocket (Urban Intellectuals, h/t Mike Loukides). Excerpt from the article: “Two years ago, I noticed that Monster.com had added a “diversity questionnaire” to the site.  This gives an applicant the opportunity to identify their sex and race to potential employers.  Monster.com guarantees that this “option” will not jeopardize your chances of gaining employment.  You must answer this questionnaire in order to apply to a posted position—it cannot be skipped.  At times, I would mark off that I was a Black female, but then I thought, this might be hurting my chances of getting employed, so I started selecting the “decline to identify” option instead.  That still had no effect on my getting a job.  So I decided to try an experiment:  I created a fake job applicant and called her Bianca White.”

2. How big data could identify the next felon – or blame the wrong guy (Bloomberg). From the article: “The use of physical characteristics such as hair, eye and skin color to predict future crimes would raise ‘giant red privacy flags’ since they are a proxy for race and could reinforce discriminatory practices in hiring, lending or law enforcement, said Chi Chi Wu, staff attorney at the National Consumer Law Center.”

3. How algorithms magnify misbehavior (the Guardian, h/t Suresh Naidu). From the article: “For one British university, what began as a time-saving exercise ended in disgrace when a computer model set up to streamline its admissions process exposed – and then exacerbated – gender and racial discrimination.”

This is just the beginning, unfortunately.

Categories: data science, modeling
  1. An igyt
    August 19, 2013 at 8:03 am

    What does the first story have to do with Big Data? Maybe I misunderstand Monster, but this sounds to me like old-fashioned racial prejudice by human employers.


    • August 19, 2013 at 8:18 am

      Hard to say. It was presented to her as objective and very mathematical model-y, but we don’t actually know whether it was a model on Monster.com, the people at the other end, or some kind of combination.


  2. Boris
    August 19, 2013 at 10:07 am

    Re the last story, what genius algorithm is able to automatically pick up rich features like “non-European-looking names”? My guess is that (assuming the article’s claim is true) someone actually manually put in “name” as a feature. In that case, the modeler is the one responsible for the discrimination, not the “algorithm” or the “data” — what non-discriminatory information did he(!) expect the name to contain when it was thrown into the model?
    For me, “modelers behaving badly” seems like an easier problem to address than “algorithms behaving badly”: for one, education. Two, this essentially codifies discrimination, which should expose the institution responsible for the models to lawsuits.
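    A synthetic sketch of Boris's point (not from the article or the comment; the data, the choice of scikit-learn, and every coefficient are invented for illustration): if the historical decisions carried a penalty against one group and the modeler feeds in a name-derived group indicator, the fitted model simply learns that penalty back and reproduces it on new applicants.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 20_000

skill = rng.normal(size=n)           # the only legitimate signal
group = rng.integers(0, 2, size=n)   # 1 = "non-European-looking name", say

# Historical admission decisions: driven by skill, but with a penalty on group 1.
logit = 1.5 * skill - 1.0 * group
admitted = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))

# Train on the biased history with the name-derived feature included.
X = np.column_stack([skill, group])
model = LogisticRegression().fit(X, admitted)

print("coefficient on skill:", round(model.coef_[0, 0], 2))  # roughly +1.5
print("coefficient on group:", round(model.coef_[0, 1], 2))  # roughly -1.0: the historical bias, codified
```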


    • Abe Kohen
      August 25, 2013 at 8:34 am

      Boris, you took the words out of my mouth. Amen.


  3. Brad Davis
    August 19, 2013 at 2:04 pm

    I’m not sure that big data is good (or bad); rather, it’s just a tool, and a tool is only as useful as the skill and sincerity of its operator. I’m not a farmer: if you put a hoe in my hand I can mangle the soil a bunch, and I can expend a great deal of effort with it, but not much work of any value actually happens, even though to the untrained eye it looks like I’ve accomplished a great deal (‘he’s really sweaty, he must be working hard!’). On the other hand, if you give it to my brother-in-law, he can get a lot done really quickly and with a fraction of the effort it would take me. I think it’s the same with all tools. The problem is when the users / evangelists fail to understand (either because they are naive or because they have some sort of axe to grind) the shortcomings of the tool and sell the public (and perhaps themselves) on how this new tool is the panacea, and if you dare argue with these ‘Truthtellers’, then you’re simply a member of the old tribe trying to hold back progress.

    To wit: I had a boss (PhD in molecular genetics; my PhD is in population genetics, molecular genetics, and computational biology, with a few dribs and drabs of statistics thrown in for good measure) tell me that I didn’t have to worry about whether or not the data we were working with was normally distributed (it wasn’t; it was often exponentially distributed, although sometimes it would follow a beta distribution) or had any other sorts of biases in it, because having a large amount of data solved all those problems. Apparently she had (mis)learned about the central limit theorem and the law of large numbers at some point and thought that just gathering enough data solved all data analysis problems (we were working with microarray data). I repeatedly tried to explain to her how she was wrong, even going so far as to come up with numerous real and synthetic examples, but to no avail (she also didn’t seem to understand that the law of large numbers applies to having a large number of samples, not to the product of samples times the number of tests run on each of those samples, but I digress).

    She ‘knew’ what she knew, and it didn’t matter that I had specific training and could demonstrate the problem; she had been ‘told’ by someone in the past, and so I was wrong. It didn’t seem to me that she cared about the legitimacy of the data: she had a crutch she could lean on, and she was going to lean on it as hard and as long as she could. She said that the peer review process would ultimately demonstrate the correct answer, so I shouldn’t worry about such things. I just needed to ‘get with the program’.

    In the end I did, but at another workplace, where they not only listen to me but openly ask for my input. Anyway, that’s just my bitter rant for the day.
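    To make the central limit theorem point above concrete, here is a minimal sketch (not part of the original comment; it assumes Python with NumPy and SciPy, and the exponential data is made up). Collecting more raw observations never turns a skewed distribution into a normal one; the theorem speaks to the distribution of sample means, not to the measurements themselves.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# More raw data does not make a skewed distribution normal:
for n in (100, 10_000, 1_000_000):
    data = rng.exponential(scale=2.0, size=n)
    print(f"n = {n:>9}: skewness of the raw data = {stats.skew(data):.2f}")
    # stays near 2 (the skewness of any exponential) no matter how large n gets

# The CLT is about averages of samples, not the observations themselves:
sample_means = rng.exponential(scale=2.0, size=(10_000, 50)).mean(axis=1)
print(f"skewness of means of 50 observations = {stats.skew(sample_means):.2f}")
# much smaller (roughly 2/sqrt(50), about 0.28) and shrinking as the per-sample count grows
```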


  4. David
    August 19, 2013 at 4:00 pm
  5. Cat
    August 19, 2013 at 8:39 pm

    The first article sounded very fishy just from the title, since I’ve never received a job offer without an interview, and I have never tried to impersonate someone of a different race during a face-to-face interview, so I assume it’s very hard to do. After reading the article: the title says ‘offers’, but it seems to me she’s only getting phone screens and interviews.

    Monster has a bunch of metadata that you as the applicant can’t see, but that the recruiters who use it to do resume searches can see. If she had created a new account for her ‘black’ persona at the same time as her ‘white’ persona, it would have been a much better experiment.

    Of course, there are more markers of otherness than a checkbox on Monster’s forms, and those are the more likely cause of the discrimination the OP is facing: a minority-sounding name, attending historically black colleges, or working for minority non-profits, all of which have been documented in the past.

