
ProPublica report: recidivism risk models are racist

May 24, 2016

Yesterday an exciting ProPublica article entitled Machine Bias came out. Written by Julia Angwin, author of Dragnet Nation, and Jeff Larson, data journalist extraordinaire, the piece explains in human terms what it looks like when algorithms are biased.

Specifically, they looked into a class of models I feature in my upcoming book, Weapons of Math Destruction, called “recidivism risk” scoring models. These models score defendants, and the scores are given to judges to help them decide, for example, how long a prison sentence to hand down. Higher recidivism scores are supposed to correlate with a higher likelihood of returning to prison, and people who have been assigned high scores also tend to get sentenced to longer prison terms.

What They Found

Angwin and Larson studied the recidivism risk model called COMPAS. Starting with COMPAS scores for 10,000 criminal defendants in Broward County, Florida, they looked at the difference between who COMPAS predicted would be rearrested and who actually was. This was a direct test of the accuracy of the risk model. The highlights of their results:

  • Black defendants were often predicted to be at a higher risk of recidivism than they actually were. Our analysis found that black defendants who did not recidivate over a two-year period were nearly twice as likely to be misclassified as higher risk compared to their white counterparts (45 percent vs. 23 percent).
  • White defendants were often predicted to be less risky than they were. Our analysis found that white defendants who re-offended within the next two years were mistakenly labeled low risk almost twice as often as black re-offenders (48 percent vs. 28 percent).
  • The analysis also showed that even when controlling for prior crimes, future recidivism, age, and gender, black defendants were 45 percent more likely to be assigned higher risk scores than white defendants.
  • Black defendants were also twice as likely as white defendants to be misclassified as being a higher risk of violent recidivism. And white violent recidivists were 63 percent more likely to have been misclassified as a low risk of violent recidivism, compared with black violent recidivists.
  • The violent recidivism analysis also showed that even when controlling for prior crimes, future recidivism, age, and gender, black defendants were 77 percent more likely to be assigned higher risk scores than white defendants.
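The first two bullets are error rates computed separately for each group: how often people who did not re-offend were labeled higher risk, and how often people who did re-offend were labeled low risk. Here is a minimal sketch of that bookkeeping. The function name, the record layout, and the rows are made up for illustration; this is not ProPublica's code or data.

```python
from collections import Counter

def error_rates_by_group(records):
    """records: iterable of (group, labeled_high_risk, reoffended) tuples.

    Returns {group: (false_positive_rate, false_negative_rate)}, where the
    false positive rate is taken among people who did NOT re-offend and the
    false negative rate among people who DID.
    """
    counts = Counter()
    for group, labeled_high, reoffended in records:
        counts[(group, bool(labeled_high), bool(reoffended))] += 1

    rates = {}
    for group in {key[0] for key in counts}:
        fp = counts[(group, True, False)]   # labeled high, did not re-offend
        tn = counts[(group, False, False)]  # labeled low, did not re-offend
        fn = counts[(group, False, True)]   # labeled low, did re-offend
        tp = counts[(group, True, True)]    # labeled high, did re-offend
        fpr = fp / (fp + tn) if fp + tn else float("nan")
        fnr = fn / (fn + tp) if fn + tp else float("nan")
        rates[group] = (fpr, fnr)
    return rates

# Made-up rows for two hypothetical groups "A" and "B" (not real defendants).
toy_records = (
    [("A", True, False)] * 2 + [("A", False, False)] * 2
    + [("A", True, True)] * 3 + [("A", False, True)] * 1
    + [("B", True, False)] * 1 + [("B", False, False)] * 3
    + [("B", True, True)] * 2 + [("B", False, True)] * 2
)
rates = error_rates_by_group(toy_records)
```

In these made-up rows, group A has a false positive rate of 0.5 and a false negative rate of 0.25, and group B has the reverse, which is the same shape of imbalance the bullets describe.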

Here’s one of their charts (lower scores mean low-risk):

[Image: chart from the ProPublica article (Screen Shot 2016-05-23 at 8.48.00 AM)]

How They Found It

ProPublica is awesome and has the highest standards in data journalism. Which is to say, they published their methodology, including a description of the (paltry) history of other studies that looked into racial differences for recidivism risk scoring methods. They even put the data and the IPython notebook they used for their analysis on GitHub.

They made heavy use of Florida’s open records law to obtain the data for their research, including the original scores, the subsequent arrest records, and the classification of each person’s race. That data allowed them to build their analysis. They tracked both “recidivism” and “violent recidivism” and tracked both the original scores and the error rates. Take a look.

How Important Is This?

This is a triumph for the community of people (like me!) who have been worrying about exactly this kind of thing but who haven’t had hard proof until now. In my book I made multiple arguments for why we should expect this exact result for recidivism risk models, but I didn’t have a report to point to. So, in that sense, it’s extremely useful.

More broadly, it sets the standard for how to do this analysis. The transparency involved is hugely important, because nobody will be able to say they don’t know how these statistics were computed. These are basic questions that every recidivism risk model should be measured against.

What’s Next?

Until now, recidivism risk models have been deployed naively, in judicial systems all across the country, and judges in those systems have been presented with such scores as if they are inherently “fair.”

But now, people deploying these models – and by people I mostly mean Department of Corrections decision-makers – will be under pressure to make sure the models are audited for racism before using them. And they can do this kind of analysis in-house with much less work. I hope they do.

  1. Ben
    May 24, 2016 at 8:19 am

    Suppose that Italians are 90% likely to eat pasta in a given week, and German people only 30%. Take an Italian who did not eat pasta. You can’t blame the algorithm for predicting she was likely to eat pasta.

    You can’t look at a particular subset of the realizations and say the prediction was flawed _for them_. This does not make any sense. The question is whether the prediction algorithm is correct _on average_.

    I am surprised you let this slip through.




    • May 24, 2016 at 8:20 am

      Actually, that’s not the question. The real question is whether we are all equal in front of the law, and whether justice is being served.


      • RTG
        May 26, 2016 at 2:35 pm

        Jumping in a little late to the discussion, but I think your comment hits the nail on the head w.r.t. these models and the use of data science to impact real-world decisions more broadly…and it’s deeply misunderstood by non-practitioners and, unfortunately, practitioners alike. Per the discussion below, it may well be that the model being used predicts averages well, and that reality is just “racist” in the sense that recidivism rates are higher for blacks than whites. No doubt, there are many more factors that explain that than the ones that could be included in any reasonable model (e.g. whether the defendant is the descendant of slaves or of 20th century immigrants…don’t know this for sure, but I’m guessing it matters a lot).

        It would be almost impossible to take into account every factor that accounts for whether a convict will recidivate…so you can’t exactly blame the model for being incomplete. But, knowing that pretty much any model will be incomplete, the onus falls upon the user (and the person advising the user) to determine whether the model serves the intended purpose. If the purpose is to predict crime statistics on average, maybe this is a good model. But, if the purpose is to provide a more fair and unbiased method of sentencing, one that seeks to mitigate the internal and unconscious biases of judges, for example, then this model would seem to fail miserably. Though, actually, we can’t really determine that either. This model may over-predict recidivism for blacks and under-predict it for whites, *and* it may also be more fair than what judges would do in the absence of this model. Given the data, that should be something that can be, ahem, modeled once these models have been in use for long enough. As I think you’ve pointed out with respect to value-added models (VAM) for teachers, though, the incentive to go back and check and maybe improve the model starts to decline once a model comes into use.

        So, to me, the underlying question is always how do we help people better understand the benefits and limitations of data-modeling. I have just finished up a two-day argument with a non-data scientist colleague about why I’m not being overly cautious in refusing to provide projections 3-4 months out without a better model than just arbitrarily fitting a line to the data that kind of looks like it’s moving upward. I was trying to explain to him that any random number I could provide will be no more useful than saying it will most likely be higher than today…and I would be very concerned if the false security of providing a number rather than a qualitative description caused a major business decision to be made that wouldn’t have been made based on a qualitative analysis.


    • May 24, 2016 at 3:31 pm

      But that isn’t the failure of the algorithm. The failure (not accepting your numbers in reality, but only for the sake of conversation) is that the algorithm predicts 99% of Italians to eat pasta, and only 10% of Germans. I can’t figure out how to link to an individual table, but search the document for “makes the opposite mistake among whites” and look at the 2×2 table — the algorithm is biased towards Type I errors for blacks and Type II for whites.


      • Ben
        May 24, 2016 at 4:39 pm

        Dear David,

        This is not correct. Again, take an Italian who did not eat pasta. She was predicted to eat pasta 9/10. Take a German who did not eat pasta; he was predicted 3/10. Yet neither ate pasta. If you want to produce a table like the one you point to, add covariates independent from nationality but predicting pasta-eating, like gender. Then _among ex-post non-pasta eaters_ Italians will have been more frequently classified as high-pasta-eating-risk. The Type I and II errors analogy is misleading as we are not trying to uncover a true, fixed, hidden future-criminal characteristic, but rather quantifying a probability distribution on future crime committing.

        Focusing ex-post on those who turned out not to eat pasta alone will never tell you anything about the correctness of the prediction. The red flag is clear in the description of the data you point to: “Our analysis found that black defendants who did not recidivate over a two-year period were nearly twice as likely to be misclassified as higher risk compared to their white counterparts” (as quoted by Cathy). Translating: “Our analysis found that Italians who did not eat pasta were nearly twice as likely to be counted as ‘high-pasta-eating-risk’ compared to the Germans who did not eat pasta”. Indeed they were.


        What you are saying (the algorithm doing 99% / 10%) is what I thought Cathy (and the article) were saying. Cathy was actually clear in her answer. The problem is not that the algorithm is flawed. You could even say that the issue is that it is too good. I think that the point that Cathy makes is that race should not be allowed as a covariate in the risk model used by a court of justice. And the interest of the article is that it tends to show that race is used (at least indirectly through uncorrelated covariates).

        Given your reaction and mine, I think it is fair to say that Cathy’s post (and probably the article) are somewhat ambiguous.
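The pasta argument in this comment can be put into numbers. Below is a minimal sketch using the 90% and 30% figures from earlier in the thread. The 1/2 cutoff for calling someone "high risk" and all the names in the code are assumptions added for illustration, not part of the original comment. The predicted probabilities are exactly right for each nationality, yet the labels given to people who turned out not to eat pasta differ completely between the two groups.

```python
from fractions import Fraction

# Figures from the thread: chance of eating pasta in a given week.
p_pasta = {"Italian": Fraction(9, 10), "German": Fraction(3, 10)}

# Assumption for this sketch: "high risk" means predicted chance above 1/2.
THRESHOLD = Fraction(1, 2)

def high_risk_share_among_non_eaters(nationality):
    """Share of this nationality's non-pasta-eaters who were labeled high risk."""
    p = p_pasta[nationality]
    labeled_high = p > THRESHOLD   # everyone in the group gets the same label
    non_eaters = 1 - p             # fraction of the group that ate no pasta
    flagged_non_eaters = non_eaters if labeled_high else Fraction(0)
    return flagged_non_eaters / non_eaters

italian = high_risk_share_among_non_eaters("Italian")
german = high_risk_share_among_non_eaters("German")
```

Here `italian` comes out to 1 and `german` to 0: every Italian who skipped pasta had been labeled high risk and no German had, even though the 9/10 and 3/10 predictions match the assumed rates exactly. Whether that gap counts as a flaw is the disagreement in this thread.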


        • Ben
          May 24, 2016 at 4:58 pm

          (In the last sentence of the second to last paragraph, I meant to type “at least indirectly through correlated covariates”, of course. Sorry about that.)


        • May 24, 2016 at 8:57 pm

          Thanks. I understand where you are coming from now, but I am not sure if I agree with you. Certainly, if I agree that everyone has some “probability to offend” correlated with race and that the algorithm’s job is to measure it, then I agree with your point. I feel a little weird about acting as if “probability to offend” were a real quantity, and would rather ask about the final binary decisions: How often are they right, and are the Type I versus Type II errors uncorrelated with things which we think it is wrong to base judgments on?

          That said, I do know that there are tools to evaluate an algorithm which returns probability estimates: See https://en.wikipedia.org/wiki/Scoring_rule for a good guide. I need to think how I would quantify the issue of fairness in terms of a scoring rule.
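One concrete scoring rule from that page is the Brier score: the mean squared gap between the predicted probabilities and what actually happened, where lower is better. A minimal sketch follows, reusing the 90% pasta figure from earlier in this thread. The function name and the ten-person sample are illustrative assumptions, and the score by itself does not settle the fairness question raised above.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes) or not outcomes:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum((p - y) ** 2 for p, y in zip(probabilities, outcomes))
    return total / len(outcomes)

# Hypothetical sample: ten Italians, nine of whom ate pasta this week.
outcomes = [1] * 9 + [0]

# A forecast of 0.9 for everyone versus an uninformative 0.5 for everyone.
calibrated = brier_score([0.9] * 10, outcomes)
coin_flip = brier_score([0.5] * 10, outcomes)
```

On this sample the 0.9 forecast scores about 0.09 and the 0.5 forecast scores 0.25, so the scoring rule rewards the forecast that matches the observed frequency.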

          In the covariate toy problem I gave, the model would actually be both more accurate and fairer (in my sense) if it DID incorporate race as an explicit variable, so that it could use a lower income cut off for blacks than whites.


  2. May 24, 2016 at 10:45 am

    I don’t know if you cover this in your book (I can’t wait to find out!), but what would an audit for race, gender, and class biases look like?


  3. May 24, 2016 at 4:11 pm

    I tried to post this and the site seemed to freeze. My apologies if this double posts.

    Here is a toy model which could explain how this effect could happen without deliberate racism. (Which, just to be clear, doesn’t mean it is okay; it is very much not okay.) Let’s say we have 4000 arrestees. Half are of each race and half of them, uncorrelated with race, are habitual criminals. (That second number is way too high, but it makes the numbers easy.)

    Let’s say that the incomes of these four types of people are uniformly distributed between

    10K-50K for black criminals
    20K-60K for black good guys
    20K-60K for white criminals
    30K-70K for white good guys.

    Our algorithm labels incomes below 40K as high risk and incomes above that as low risk. That means that it labels 2000 people as high risk, who break up as (750, 500, 500, 250) in the above categories. Ignoring race, it is right 62.5% of the time (pretty close to the actual algorithm). It is also right 62.5% of the time when it predicts low risk.

    But, looked at by race, we get the following table of errors

    labeled high risk, but good: 12.5% of whites, 25% of blacks
    labeled low risk, but bad: 25% of whites, 12.5% of blacks.

    \footnote{I am reporting my results as a fraction of all people of that race. I think the paper is using some other denominator, but I am confused as to what. You can see that, for me, the odds of the two types of error add up to 37.5%, the odds of total error. In the paper, the odds of error add up to about 70% although they also say that the odds of total error are 39%. If someone can clear this up, I’d appreciate it.}

    In summary, if both race and criminality affect some measured variable in the same direction, a model based on that variable will be racist, even if race and criminality are uncorrelated.
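The arithmetic in this toy model can be checked with a short script. This is a sketch under the comment's own assumptions (1000 people per group, uniform incomes, a 40K cutoff); the variable names are made up. It also computes the error rates with a second denominator, conditioning on the actual outcome, which is how the ProPublica sentences quoted in the post are phrased ("black defendants who did not recidivate ... were ... misclassified"). That may be the denominator the footnote asks about, since rates defined that way need not add up to the overall error rate.

```python
from fractions import Fraction

PER_GROUP = 1000  # 4000 arrestees split evenly over the four groups
CUTOFF = 40       # incomes (in $K) below this are labeled high risk

# (race, habitual criminal?) -> (lowest, highest) income in $K, uniform
income_range = {
    ("black", True): (10, 50),
    ("black", False): (20, 60),
    ("white", True): (20, 60),
    ("white", False): (30, 70),
}

def labeled_high(lo, hi):
    """Expected head count in a group whose income falls below CUTOFF."""
    share = Fraction(min(max(CUTOFF, lo), hi) - lo, hi - lo)
    return share * PER_GROUP

high = {key: labeled_high(*rng) for key, rng in income_range.items()}
low = {key: PER_GROUP - n for key, n in high.items()}

# Overall accuracy of a "high risk" label, ignoring race.
total_high = sum(high.values())
accuracy_high = (high[("black", True)] + high[("white", True)]) / total_high

errors = {}
for race in ("black", "white"):
    false_pos = high[(race, False)]  # labeled high risk, but good
    false_neg = low[(race, True)]    # labeled low risk, but bad
    errors[race] = {
        # Denominator used in the comment: everyone of that race.
        "fp_of_all": false_pos / (2 * PER_GROUP),
        "fn_of_all": false_neg / (2 * PER_GROUP),
        # Denominator conditioned on the actual outcome.
        "fp_of_good": false_pos / PER_GROUP,
        "fn_of_bad": false_neg / PER_GROUP,
    }
```

With the comment's denominator this reproduces 25% versus 12.5%. Conditioned on outcome, the same counts become 50% versus 25% for "labeled high risk, but good" and 25% versus 50% for "labeled low risk, but bad", and those two rates add up to 75% for each race even though the overall error rate is 37.5%.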


  4. Guest2
    May 24, 2016 at 10:57 pm

    What is the context for COMPAS? What role — if any — does COMPAS have, outside basic research? What if it only inhabits the research domain?

    If not, then just as much attention needs to be given the political and organizational supports COMPAS has, who developed it, where their funding comes from — a complete analysis of the political economic niche that it is located in. And where does COMPAS reside in the reputational space of its rivals and relatives? What is it in competition with? Who is winning and why are they winning? What is its role in the ecology of big data technology, politics, organizations and the judicial system?


  5. May 26, 2016 at 9:00 pm

    It appears to be an algorithm based on personality tests. Think Myers-Briggs. I’m amazed that it’s able to predict 5/8 of the time. Those tests are easily gamed. Perhaps the hardened white criminals are better at gaming the personality test.

