Big Data’s Disparate Impact


October 20, 2014 Cathy O'Neil, mathbabe

Take a look at this paper by Solon Barocas and Andrew D. Selbst entitled Big Data’s Disparate Impact.

It deals with the question of whether current anti-discrimination law is equipped to handle the kind of unintentional discrimination and digital redlining we see emerging in some “big data” models (and that we suspect are hidden in a bunch more). See for example this post for more on this concept.

The short answer is no, our laws are not equipped.

Here’s the abstract:

This article addresses the potential for disparate impact in the data mining processes that are taking over modern-day business. Scholars and policymakers had, until recently, focused almost exclusively on data mining’s capacity to hide intentional discrimination, hoping to convince regulators to develop the tools to unmask such discrimination. Recently there has been a noted shift in the policy discussions, where some have begun to recognize that unintentional discrimination is a hidden danger that might be even more worrisome. So far, the recognition of the possibility of unintentional discrimination lacks technical and theoretical foundation, making policy recommendations difficult, where they are not simply misdirected. This article provides the necessary foundation about how data mining can give rise to discrimination and how data mining interacts with anti-discrimination law.

The article carefully steps through the technical process of data mining and points to different places within the process where a disproportionately adverse impact on protected classes may result from innocent choices on the part of the data miner. From there, the article analyzes these disproportionate impacts under Title VII. The Article concludes both that Title VII is largely ill equipped to address the discrimination that results from data mining. Worse, due to problems in the internal logic of data mining as well as political and constitutional constraints, there appears to be no easy way to reform Title VII to fix these inadequacies. The article focuses on Title VII because it is the most well developed anti-discrimination doctrine, but the conclusions apply more broadly because they are based on the general approach to anti-discrimination within American law.

I really appreciate this paper, because it covers an area I know almost nothing about: discrimination law and the standards for evidence of discrimination.

Sadly, what this paper explains to me is how very far away we are from anything resembling what we need to actually address the problems. For example, even in this paper, where the writers are well aware that training on historical data can unintentionally codify discriminatory treatment, they still seem to assume that the people who build and deploy models will “notice” this treatment. From my experience working in advertising, that’s not actually what happens. We don’t measure the effects of our models on our users. We only see whether we have gained an edge in terms of profit, which is very different.

Essentially, as modelers, we don’t humanize the people on the other side of the transaction, which prevents us from worrying about discrimination or even being aware of it as an issue. It’s so far from “intentional” that it’s almost a ridiculous accusation to make. Even so, it may well be a real problem and I don’t know how we as a society can deal with it unless we update our laws.
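To make concrete what “measuring the effects of our models on our users” could look like, here is a minimal sketch. It is not from the paper and not anything I have seen run in production: the group labels and counts are made up, the function names are hypothetical, and the 0.8 cutoff is borrowed from the “four-fifths” rule of thumb in U.S. employment guidelines, used here purely as an illustrative threshold. Note that it also assumes you have group membership on record, which in advertising you often don't.

```python
from collections import defaultdict


def selection_rates(records):
    """Return the share of favorable outcomes per group.

    `records` is an iterable of (group, selected) pairs, where `selected`
    is True when the model gave that person the favorable outcome.
    """
    totals = defaultdict(int)
    favorable = defaultdict(int)
    for group, selected in records:
        totals[group] += 1
        if selected:
            favorable[group] += 1
    return {g: favorable[g] / totals[g] for g in totals}


def adverse_impact_ratios(records):
    """Compare each group's selection rate to the highest group's rate."""
    rates = selection_rates(records)
    best = max(rates.values())
    if best == 0:
        # Nobody received the favorable outcome, so the ratio is undefined.
        return {g: None for g in rates}
    return {g: rate / best for g, rate in rates.items()}


# Made-up data: (group label, did the model show this person the offer?)
records = (
    [("A", True)] * 60 + [("A", False)] * 40
    + [("B", True)] * 30 + [("B", False)] * 70
)

ratios = adverse_impact_ratios(records)

# Flag any group whose rate falls below 80% of the best-treated group's rate.
flagged = sorted(g for g, r in ratios.items() if r is not None and r < 0.8)

print(ratios)
print(flagged)
```

A check like this looks only at outcomes, so it says nothing about intent. A low ratio is a reason to investigate, not a legal finding.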

Categories: data science, feedback loop, modeling

Comments (10)

medicalquackblog

October 20, 2014 at 11:17 am

One of my harped on topics as a privacy advocate. There’s also been the World Forum on Privacy. Recently a lawyer wrote about this in the New York Times and he too got it all wrong with the lack of data mechanics logic. It bothered me, as he speaks with me on Twitter, and the parts he did get right came from my blog with no byline. I’m glad others chimed in too on what he wrote, as his solution was off base as far as what data brokers should do and was impossible data-mechanics-wise.

I kept trying to send this lawyer over to my Killer Algorithm page, where I have two of your videos along with others from folks a lot smarter than me. I tried to pick out videos where the layman could understand some of this in my curation. I end up half the time running off reporters that do talk with me as they won’t give the credit. I know it sounds silly, but it’s true, and I’m a former medical records developer that quit writing code a few years ago, so when it comes to medical records I know it from the bottom up, as that’s how it was in the early days before writing code for platforms took over. I just wrote about an insurer, United Healthcare, that sucks in the millennials to write code for them. I call it code for cash, as they want developers to write cheap for them so they can buy it up and incorporate it into some of their current technologies. The young deserve a chance and I hate that fleecing of the younger generation too.

Anyway, back on topic, I do get pretty blatant out there as the phony news makes me mad at times. I harp on this topic all the time, and thus I started my little campaign 3 years ago and pester the heck out of the FTC and a few Congressmen about “licensing” data distributors, as to me it’s database 101: you need an index to identify the group you want to try to regulate, and who are they? So my cause here is to get a law passed requiring all data distributors to buy a license so we know who they are. I’ve dug up even more companies that buy your credit card data and then analyze and score you, so bingo, now we have two things to resell, your data and a score. You get repackaged and resold over and over, and I myself have been a victim of flawed data here, enough to the point where Acxiom decided to take me on one day on Twitter and then they disappeared. I just stated facts and they read every link I gave them, as I went back and checked my web stats:)

There’s absolutely “nothing” to protect people at all, the code runs “hog wild” as I say it, and of course that phrase does draw attention, so it’s part of my routine to use it as well. One of the big problems, I say, is the fact that we keep appointing useless lawyers to navigate these areas, as they focus on verbiage only, and someone better start looking at some code here, as I have now resolved myself to be further blatant and just call it “code hosing”. Even Eric over at Nanex got a kick out of that one, as he covers stock exchanges where he too is talking about the same thing, white collar code hosing crimes if you will:)

Again, with laws on this, licensing to me is step one, as how in the world do you regulate any group without an index? You don’t, and it just keeps rolling. I would think also that if consumers had a reference to look up “licensed” data distributors along with what kind of data they sell, it would serve to clean things up a bit with a little transparency, and additional laws could be created to regulate from there. I’ve posted as well about our Health and Human Services agency doing not a lot more than “lawyering up”, as I have specifics on those too, and yes, they read me back in DC too and find those blog posts, and maybe one day a light bulb will go on, who knows.

Again, I’ve got 3 years into this, writing to yet another useless lawyer at the FTC and people in Congress. I’ve had Senator Schumer on my blog reading as well on some of this, so I just keep putting it out there, as the activity that takes place is a fact, not speculation by any means. Before I became a developer, I spent many years in outside sales, so the nerd side of me looks at the tech end and the sales side says “how are they going to market this latest code hosing” if you will, and so far, sadly, I keep hitting the nail on the head. So again I look at a law requiring a license as database 101 and a way to establish an “index” so we know who they all are. Look at the company Argus making millions ‘scoring’ you; nobody even knows they exist as far as being yet another data seller.

Anyway, a short while back I put up a little campaign and there are lots of links to many of my posts on it worth reading, whether or not folks want to donate. It’s just the right thing to do to inform consumers on what’s really happening out there, and again, a law to require all data sellers to buy a license. Health insurers are in fact driving themselves off a cliff with some of this as they continue to buy, collect and sell more non-relevant risk assessment type data. When it comes down to it, they’re not getting an ROI on all of this either, and it gets reflected of course to us with higher premiums to cover their big data analytics nonsense. What it does do, though, is deny consumers access to something, money, loans, etc., and it’s all ok as they sell this as “intelligence”. And data out of context, well, it’s a weapon that can be used whenever they desire.

http://www.gofundme.com/auyxd0

Zathras

October 20, 2014 at 1:51 pm

These laws weren’t ready for redlining either, for the exact same reason they aren’t ready for data mining techniques now. What happened then? Courts used their common law experience to have the flexibility to allow disparate impact analysis to apply to redlining. And now? Conservative courts have become much less flexible in their analysis. They could have drawn on their common law experience to allow for this flexibility, but now they apply a formalistic analysis that does not permit these results.

n8chz

October 20, 2014 at 6:30 pm

The question of intentional/unintentional discrimination seems straightforward enough. The unintentional discrimination, if any, is against protected classes. The intentional discrimination is against unprotected classes.

- Cathy O'Neil, mathbabe
  
  October 20, 2014 at 6:34 pm
  
  Not sure why you say that. Or if you are kidding.
  
quasihumanist

October 20, 2014 at 11:26 pm

I think part of the problem here is that our moral expectations of businesses have moved back quite a bit over the last 50 years or so.

Consider the following classical ethical dilemma. Racistville has two supermarkets, and each needs close to half of the customers in the town to stay in business. (Most supermarkets have notoriously thin margins.) Three quarters of Racistville decide to boycott Acme Supermarket because they have hired black workers, and vow to stay away until the black workers are fired. What should Acme Supermarket do?

(Legally, Acme is required to keep the black workers, but I’m interested in the moral question supposing the absence of legal requirements. Also, I don’t think there is any argument that the three quarters of Racistville boycotting Acme are not in the wrong.)

Fifty years ago, I get the sense that non-racist people would think Acme has a moral obligation to keep its black employees. Not only that, but non-racist people might try to support Acme in various ways, such as being willing to pay higher prices to support Acme. Also, non-racist people from neighboring towns might have shopped at Acme.

Now, I think most people would just say that Acme should give in. After all, if it didn’t give in, it would just go out of business and the black workers would be out of a job anyway. Also, the management of Acme has a responsibility to its investors to maximize their returns, and not giving in would violate this responsibility.

In effect, the argument is that businesses have a moral responsibility to NOT be moral agents! (which to me points out how ridiculous this argument is.)

Why am I bringing this up in the context of “big data models”?

First, there is the indirect argument that these big data models are serving entities whose sole goal should be maximizing profit, and if maximizing profit means going along with current (not just historical!) discrimination, then so be it.

Second, there is also a direct parallel argument that these big data models are meant to find the truth about what happens in society from some value-neutral standpoint, and if the truth about what happens in society includes some effects from discrimination, then that might reflect badly on society, but the big data model is not supposed to be a moral agent but rather just as moral or immoral as society itself.

quixote

October 21, 2014 at 10:07 am

As you and some of the commenters point out, the real problem is our priorities and assumptions as a society. The simple solution is not to worry about the intricacies of the workings of big data.

Same as in the old redlining days, look at the results. If they’re discriminatory, deal with it. The anti-discrimination regulators at the time didn’t try to redesign the mortgage application forms, hope that better data would solve the problem, and when it didn’t, re-tweak the forms some more.

The solution, if we wanted to apply it, is straightforward. But (enough) people don’t, so it isn’t.

joe

October 21, 2014 at 12:36 pm

In very simple terms, the Fair Credit Reporting Act authorizes regulators to test whether credit scores are correlated with variables such as race, gender and age, regardless of whether these pieces of information are used in the credit scoring models. If the correlation is significant, the scores cannot be used.

I don’t think this standard is used in other areas where data mining is applied. I think it would be wise to pass regulations that apply these standards to data mining activity that can adversely impact minorities. For example, these standards have been applied to credit scoring but not to marketing activities; they should be.

- Cathy O'Neil, mathbabe
  
  October 21, 2014 at 12:47 pm
  
  Joe,
  
  Do regulators really do those tests? Can you send me a reference to that?
  
  Thanks!
  Cathy
  
  - joe
    
    October 21, 2014 at 2:24 pm
    
    Hi Cathy,
    
    Yes, regulators do require these tests. I am not sure how effective the regulators are as the problem of regulatory capture exists in this sphere as well.
    
    Here is a Fed document that details the examination procedures:
    
    fair.pdf
    
    Here is a Fed document that concludes that redlining exists in credit card decisions:
    
    qau0801.pdf
    
    Here is a Consumer Law Center document that concludes that discrimination exists in insurance decisions:
    http://www.docstoc.com/docs/14419765/Credit-Scoring-as-Insurance-Redlining-Deepening-the-Economic
    
    Here are some documents from SAS, a software vendor that helps financial institutions “comply” with the regulatory requirements in home mortgages (HMDA) and credit (CRA):
    
    SASWhitePaper.pdf
    
    mcneil.pdf
    
    Statistics has been used in the above areas since the 60s and hence the regulatory response. While the regulations have not been perfect, they have been helpful. I am suggesting that similar regulations should follow the explosion of data availability in other spheres.
    
    Another area where regulations exist is the participation and consent of human beings in psychological experiments. These regulations should be extended to follow the explosion of such experiments in the social sphere.
    
    Hope this is helpful.
    
    - Cathy O'Neil, mathbabe
      
      October 21, 2014 at 2:33 pm
      
      Holy crap I love you.
      