## Two out of three “fairness” criteria can be satisfied

This is a continuation of a discussion I’ve been having with myself about the various definitions of fairness in scoring systems. Yesterday I mentioned a recent paper entitled *Inherent Trade-Offs in the Fair Determination of Risk Scores* that has a proof of the following statement:

You cannot simultaneously ask for a model to be well-calibrated, to have equal false positive rates for blacks and whites, and to have equal false negative rates unless you are in the presence of equal “base rates” or a perfect predictor.

The good news is that you can ask for two out of three of these. Here’s a picture of a specific example, where I’ve simplified the situation so there are two groups of people being scored, B and W; each person can be labeled either empty or full, and in reality each person is either empty or full. They have different “base rates,” which is to say that in reality, a different proportion of the B group is empty (70%) than of the W group (50%). We insist, moreover, that the labeling scheme is “well-calibrated,” so the right proportion of each group is labeled empty or full. I’ve drawn 10 “perfect representatives” from each group here:

In my picture, I’ve assumed there was some mislabeling: for each group, there’s a full in the empty bin and an empty in the full bin. Because we are assuming the model is well-calibrated, every time we make one kind of mistake we have to make up for it with exactly one mistake of the other kind. In the picture there’s exactly one mistake of each kind for both the W group and the B group, so that’s fine.

Quick calculation: in the picture above, the “false full rate” (the fraction of full labels that are wrong, which we can loosely think of as the “false positive rate”) is 1/3 ≈ 33% for B, since one of its three full labels is wrong, but only 1/5 = 20% for W, since one of its five full labels is wrong, even though each group has only one mislabeled representative per bin.
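To make the arithmetic explicit, here is a minimal Python sketch of the toy example. The counts are the assumed ones from the picture (not real data), and the function name is mine:

```python
# Toy example from the picture: 10 representatives per group, a
# well-calibrated model, and one mislabel of each kind per group.
# These counts are assumptions matching the post, not real data.

def false_full_rate(mislabeled_full, labeled_full):
    """Fraction of 'full' labels that are wrong -- the post's notion of a
    'false positive rate', computed relative to the labels."""
    return mislabeled_full / labeled_full

# B: 70% empty -> 7 labeled empty, 3 labeled full; 1 mislabel each way.
# W: 50% empty -> 5 labeled empty, 5 labeled full; 1 mislabel each way.
ffr_b = false_full_rate(1, 3)  # 1/3, about 33%
ffr_w = false_full_rate(1, 5)  # 1/5 = 20%
print(ffr_b, ffr_w)
```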

Now it’s obvious that, theoretically, the scoring system could adjust the false positive rate for B to match that of W, which would mean having only 3/5 of a representative mislabeled in the full bin (since 1/5 of B’s three full labels is 3/5 of a representative). But again, calibration would then mean we need exactly 3/5 of a representative mislabeled in the empty bin as well.

That’s a false negative rate for B of (3/5)/7 = 3/35 ≈ 8.6% (note it used to be 1/7 ≈ 14.3%). By contrast, the false negative rate for W stays fixed at 1/5 = 20%.

If you think about it, what we’ve done is sacrifice some false negative rate balance for a perfect match on the false positive rate, while keeping the model well-calibrated.
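A quick sketch of that adjustment, under the same assumed counts from the picture (fractional representatives are fine here, since these are just expected counts, and the variable names are mine):

```python
# Force B's "false full rate" to match W's 1/5, then see what calibration
# does to B's false negative ("false empty") rate.
labeled_full_b = 3
labeled_empty_b = 7
target_ffr = 1 / 5                               # W's false full rate

# Mislabels needed in B's full bin to hit the target rate:
mislabeled_full_b = target_ffr * labeled_full_b  # 3/5 of a representative
# Calibration: each mistake of one kind is matched by one of the other.
mislabeled_empty_b = mislabeled_full_b           # also 3/5
fnr_b = mislabeled_empty_b / labeled_empty_b     # 3/35, about 8.6%
print(fnr_b)
```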

Applying this to recidivism scores, we can ask for the high scores to reflect base rates for the populations, and we can ask for similar false positive rates across populations, but we cannot also ask for the false negative rates to be equal. That might be better overall, though, because the harm that comes from an unequal false positive rate (sending someone to jail for longer) is arguably more toxic than an unequal false negative rate, which means certain groups are let off the hook more often than others.

By the way, I want to be clear that I don’t think recidivism risk algorithms should actually be the goal, summed up in this conversation I had with Tom Slee. I’m not even sure why their use is constitutional, to tell the truth. But given that they are in use, I think it makes sense to try to make them as good as possible, and to investigate what “good” means in this context.

Would never have thought a book about mathematics and algorithms would be a “page-turner,” but after seeing your interview on Bloomberg TV, I purchased your Weapons of Math Destruction and devoured it in an evening. I have been aware, like most, of how powerful predictive algorithms have become. (By “powerful” I’m referring to their consequences, not necessarily their predictive analytics.) I taught Criminal Justice and Sociology for nearly 15 years until downsized by the rise of The Adjunct, but during that time I watched with fascination as Compstat took hold and first generation recidivism risk assessments gave way to things like SAQ (Self-Appraisal Questionnaire) and Risk Items Factor Scales, essentially reinforcing the biases already inherent in a racist justice system. I continue to be appalled at how much faith judges, parole boards, and correctional personnel place in these instruments as compared to how little they understand how the process works. They are pleased to be protected by “the score” and happy to remove themselves from any personal or human accountability that might, god forbid, require them to admit to the essential human conditions of relationship and subjectivity.

I tried getting jobs with Citibank, Wells Fargo, and HSBC but never got past the “silly” online personality test. “Red-lighted” as you say. I now know that they were subtly measuring mental health and/or screening for conformity. Today one of the larger social service agencies, serving people with disabilities and special needs, announced to their employees that henceforth they will be contracting with Kronos to conduct pre-hire screening. I have seen the test and believe it to be discriminatory by identifying and screening out those persons with mental health and other possible disabilities, as well as discriminating against many people in our large Hmong, Somali, Native American, African-American, and Elder communities. Are you aware of any successful challenges to the use of this kind of black box hiring? Until enough people demand it, these algorithms will remain unassailable and intimidating to the general public.


Nice. You have done the maths like an epidemiology model but I can feel Arrow’s theorem lurking nearby.


“…the harm that comes from unequal false positive rate – sending someone to jail for longer – is arguably more toxic than an unequal false negative rate, which means certain groups are let off the hook more often than the others.”

I used to ask this question in my college statistics courses (in NYC; “Which is worse: Type I or Type II error in a criminal drug case?”), and I had to stop because the results were so depressing. Over time, larger proportions of the students would agree that jailing innocent people was the better tradeoff versus ever letting a drug user go free by accident.

Sometimes I would insert B. Franklin’s statement as part of a test question to try and rectify this. Not something they’d ever heard of before.


http://www.ajlunited.org/

The Algorithmic Justice League


I’m not convinced that their “calibration” condition is a desideratum of fairness. It asks that the average score assigned to people from group t should equal the probability that someone from group t is positive. This seems reasonable to me only if group membership is the most important consideration. Otherwise, why should an individual’s score necessarily be calibrated to an average associated with a group that might be more of a red herring for whatever is being measured?

I would, however, hope that a scoring system had equal false positive and false negative rates across groups; those two out of three feel much more important.


One thing I notice in your toy example is that, while the false ‘full’ rates are different between B and W, the number of individuals falsely assigned the ‘full’ label is the same. Is this something that we expect to happen every time? And if so, can we re-formulate these fairness criteria in terms of raw counts, rather than rates? The conversation with Tom Slee seems to brush up against this distinction also. Either way, I can’t seem to get all of this quite straight in my head; maybe I need more coffee…
