BISG Methodology

I’ve been tooling around with the slightly infamous BISG methodology lately. It’s a simple concept: take a person’s last name and the zip code of their residence, and use the Bayes updating rule to impute the probabilities that the person belongs to various races and ethnicities.
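As a toy illustration of that updating rule, here’s the whole idea in a few lines. Every number and label below is invented for illustration; the real implementation (further down) uses census data.

```python
import numpy as np

# Toy BISG update (all numbers invented for illustration):
#   P(race | name, zip)  is proportional to  P(race | name) * P(zip | race)
races = ["White", "Black", "API", "Hispanic"]
p_race_given_name = np.array([0.05, 0.01, 0.90, 0.04])  # e.g. a common Asian surname
p_zip_given_race = np.array([2e-4, 1e-4, 5e-4, 3e-4])   # tiny numbers: P(this zip | race)

posterior = p_race_given_name * p_zip_given_race
posterior /= posterior.sum()  # normalize so the probabilities sum to 1

for race, p in zip(races, posterior):
    print("%s: %.3f" % (race, p))
```

The geographic evidence nudges the name-based probabilities up or down, and the normalization step turns the products back into a probability distribution.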

The methodology is implemented with the most recent U.S. census data and critically relies on the fact that segregation is widespread in this country, especially between whites and blacks, and that Asian and Hispanic last names are relatively well-defined. It’s not a perfect methodology, of course: it breaks down when people marry across racial lines, when names are shared across races, and especially when people live in diverse neighborhoods.

The BISG methodology came up recently in this article (hat tip Don Goldberg) about the man who invented it and the politics surrounding it. Specifically, it was used by the CFPB to infer disparate impact in auto lending, and the Republicans who side with auto lending lobbyists called it “junk science.” I blogged about this here and, even earlier, here.

Their complaints, I believe, center on the fact that the methodology, being calibrated on the entire U.S. population, isn’t entirely accurate when applied to auto lending, or for that matter to mortgages, which was the CFPB’s “ground truth” testing arena.

And that’s because minorities have less wealth, for a bunch of historically racist reasons. The upshot is that the methodology assumes a random sample of the U.S. population, but what we actually see in auto financing isn’t random.

Which raises the question: why don’t we update the probabilities with the known distribution of auto lending? That’s the thing about Bayes’ Law, we can absolutely do that. And once we did, the Republicans’ complaint would disappear. Please, someone tell me what I’m misunderstanding.
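Here’s a sketch of what that update would look like. Every number below is invented for illustration; the point is only that Bayes’ Law happily accepts a different prior.

```python
import numpy as np

races = ["White", "Black", "API", "Hispanic"]

# Invented numbers throughout. The standard BISG prior is the share of
# each race in the general adult population:
census_prior = np.array([0.64, 0.12, 0.06, 0.18])

# If we know the racial distribution of auto borrowers differs from the
# census, we can swap in that distribution as the prior:
auto_lending_prior = np.array([0.58, 0.16, 0.05, 0.21])

# Likelihood of the observed name-and-zip evidence under each race
# (also invented):
evidence_likelihood = np.array([0.002, 0.010, 0.001, 0.003])

def posterior(prior, likelihood):
    # Bayes' rule: posterior is proportional to prior times likelihood.
    p = prior * likelihood
    return p / p.sum()

print(posterior(census_prior, evidence_likelihood))
print(posterior(auto_lending_prior, evidence_likelihood))
```

Same evidence, different prior, different posterior: the machinery doesn’t care which population you condition on.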

Between you and me, I think the real gripe is something like the so-called voter fraud problem, which is not really a problem statistically, but since examples of mistakes can be found, we imagine they’re widespread. In this case, the “mistake” is a white person being offered restitution for racist auto lending practices. That happens, and it’s a strange problem to have, but it needs to be weighed against not offering restitution to a lot of people who actually deserve it.

Anyhoo, I’m planning to add the code below to github, but I recently purchased a new laptop and haven’t added a public key yet, so I’ll get to it soon. To be clear, the code isn’t perfect, and it only uses zip codes, whereas a more precise implementation would use addresses. I’m supplying it because I couldn’t find a python version online, only Stata or something crazy expensive like that. Even so, I stole their munged census data, which you can too, from this github page.

Also, I can’t seem to get the python spacing to work in WordPress, so this is really pretty terrible, but python users will be able to figure it out until I can get it on github.

%matplotlib inline

import numpy
import matplotlib
from pandas import *
import pylab
pylab.rcParams['figure.figsize'] = 16, 12

# Clean your last names and zip codes.

def get_last_name(fullname):
    parts_list = fullname.split(' ')
    # Drop empty pieces and suffixes until we reach the actual last name.
    while parts_list and parts_list[-1] in ['', ' ', 'Jr', 'III', 'II', 'Sr']:
        parts_list = parts_list[:-1]
    if len(parts_list) == 0:
        return ""
    return parts_list[-1].upper().replace("'", "")

def clean_zip(fullzip):
    try:
        if len(str(fullzip)) < 5:
            return 0
        return int(str(fullzip)[:5])
    except (ValueError, TypeError):
        return 0

Test = read_csv("file.csv")
Test['Name'] = Test['name'].map(get_last_name)
Test['Zip'] = Test['zip'].map(clean_zip)

# Add zip code probabilities. Note these are probabilities of living in a
# specific zip code given that you have a given race. They are extremely
# small numbers.

F = read_stata("zip_over18_race_dec10.dta")
print("read in zip data")

# The last two column names were cut off in the original post; they are
# reconstructed here from the file's naming convention and the trans
# mapping below, and should be checked against the actual .dta columns.
names = ['NH_White_alone', 'NH_Black_alone', 'NH_API_alone', 'NH_AIAN_alone',
         'NH_Mult_Total', 'Hispanic_Total', 'NH_Other_alone']

trans = dict(zip(names, ['White', 'Black', 'API', 'AIAN', 'Mult', 'Hisp', 'Other']))
totals_by_race = [float(F[r].sum()) for r in names]
sum_dict = dict(zip(names, totals_by_race))

# I'll use the generic_vector below when I don't have better name information.

generic_vector = numpy.array(totals_by_race) / numpy.array(totals_by_race).sum()

for r in names:
    F['pct of total %s' % trans[r]] = F[r] / sum_dict[r]

print("ready to add zip probabilities")

def get_zip_probs(zipcode):
    # Zero-pad so that, e.g., 2138 matches the ZCTA5 string "02138".
    G = F[F['ZCTA5'] == str(zipcode).zfill(5)][['pct of total White', 'pct of total Black',
                                                'pct of total API', 'pct of total AIAN',
                                                'pct of total Mult', 'pct of total Hisp',
                                                'pct of total Other']]
    if len(G.values) > 0:
        return numpy.array(G.values[0])
    print("no data for zip =", zipcode)
    # With no zip information, return a flat vector so that the name
    # probabilities alone determine the answer.
    return numpy.array([1.0] * 7)

Test['Prob of zip given race'] = Test['Zip'].map(get_zip_probs)

# Next, compute the probability of each race given a specific name.

Names = read_csv("app_c.csv")

print("read in name data")

def clean_probs(p):
    # Non-numeric entries (e.g. suppressed values) become 0.
    try:
        return float(p)
    except ValueError:
        return 0.0

for cat in ['pctwhite', 'pctblack', 'pctapi', 'pctaian', 'pct2prace', 'pcthispanic']:
    Names[cat] = Names[cat].map(lambda x: clean_probs(x) / 100.0)

Names['pctother'] = Names.apply(lambda row: max(0.0, 1.0 - row['pctwhite'] - row['pctblack']
                                                - row['pctapi'] - row['pctaian']
                                                - row['pct2prace'] - row['pcthispanic']), axis=1)

print("ready to add name probabilities")

def get_name_probs(name):
    G = Names[Names['name'] == name][['pctwhite', 'pctblack', 'pctapi', 'pctaian',
                                      'pct2prace', 'pcthispanic', 'pctother']]
    if len(G.values) > 0:
        return numpy.array(G.values[0])
    return generic_vector

Test['Prob of race given name'] = Test['Name'].map(get_name_probs)

# Finally, use the Bayesian updating formula to compute overall probabilities of each race.

Test['Prod'] = Test['Prob of zip given race'] * Test['Prob of race given name']
Test['Dot'] = Test['Prod'].map(lambda x: x.sum())
Test['Final Probs'] = Test['Prod'] / Test['Dot']

Test['White Prob'] = Test['Final Probs'].map(lambda x: x[0])
Test['Black Prob'] = Test['Final Probs'].map(lambda x: x[1])
Test['API Prob'] = Test['Final Probs'].map(lambda x: x[2])
Test['AIAN Prob'] = Test['Final Probs'].map(lambda x: x[3])
Test['Mult Prob'] = Test['Final Probs'].map(lambda x: x[4])
Test['Hisp Prob'] = Test['Final Probs'].map(lambda x: x[5])
Test['Other Prob'] = Test['Final Probs'].map(lambda x: x[6])


Book Tour Events!

Readers, I’m so happy to announce upcoming public events for my book tour, which starts in 2 weeks! Holy crap!

The details aren’t entirely final, and there may be more events added later, but here’s what we’ve got so far. I hope I see some of you soon!

Events for Cathy O’Neil

Author of

Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy

(Crown; September 6, 2016)

Thursday, September 8


Reading/Signing/Talk with Felix Salmon

Barnes & Noble Upper East Side

150 E 86th St.

New York, NY 10028


Tuesday, September 13


In Conversation Event

Town Hall Seattle

1119 8th Ave.

Seattle, WA 98101 

Wednesday, September 14

Democracy/Citizenship Series

Mechanics’ Institute Library

57 Post St.

San Francisco, CA 94104


Wednesday, September 14

In Conversation with Lianna McSwain

Book Passage

51 Tamal Vista Blvd.

Corte Madera, CA 94925


Thursday, September 15


Privacy.Security.Risk. 2016

San Jose Marriott

301 S. Market Street

San Jose, CA 95113

Tuesday, September 20


In Conversation with Jen Golbeck

Busboys and Poets (w/Politics & Prose)

1025 5th Street NW

Washington, D.C. 20001

Monday, October 3



Harvard Book Store

1256 Mass Ave.

Cambridge, MA 02138

Saturday, October 22nd


Wisconsin Book Festival

Wisconsin Institutes for Discovery

DeLuca Forum

For more information or to schedule an interview contact:
Sarah Breivogel, 212-572-2722, or
Liz Esman, 212-572-6049.


Chicago’s “Heat List” predicts arrests, doesn’t protect people or deter crime

A few months ago I publicly pined for a more scientific audit of the Chicago Police Department’s “Heat List” system. An excerpt from that blogpost:

…the Chicago Police Department uses data mining techniques of social media to determine who is in gangs. Then they arrest scores of people on their lists, and finally they tout the accuracy of their list in part because of the percentage of people who were arrested who were also on their list. I’d like to see a slightly more scientific audit of this system.

Thankfully, my request has officially been fulfilled!

Yesterday I discovered, via Marcos Carreiro on Twitter, a paper entitled Predictions put into practice: a quasi-experimental evaluation of Chicago’s predictive policing pilot, by Priscillia Hunt and John S. Hollywood, published in the Journal of Experimental Criminology.

The paper’s main result upheld my suspicions:

Individuals on the SSL are not more or less likely to become a victim of a homicide or shooting than the comparison group, and this is further supported by city-level analysis. The treated group is more likely to be arrested for a shooting.

Inside the paper, they make the following important observations. First, crime rates have been going down over time, and the “Heat List” system has not affected that trend. An excerpt:

…the statistically significant reduction in monthly homicides predated the introduction of the SSL, and that the SSL did not cause further reduction in the average number of monthly homicides above and beyond the pre-existing trend.

Here’s an accompanying graphic:


This is a really big and important point, one that smart people like Gillian Tett get thrown off by when discussing predictive policing tools. We cannot automatically attribute success to a policing policy when crime is already declining for unrelated reasons.

Next, being on the list doesn’t protect you:

However, once other demographics, criminal history variables, and social network risk have been controlled for using propensity score weighting and doubly-robust regression modeling, being on the SSL did not significantly reduce the likelihood of being a murder or shooting victim, or being arrested for murder.

But it does make it more likely for you to get surveilled by police:

Seventy-seven percent of the SSL subjects had at least one contact card over the year following the intervention, with a mean of 8.6 contact cards, and 60 % were arrested at some point, with a mean of 1.53 arrests. In fact, almost 90 % had some sort of interaction with the Chicago PD (mean = 10.72 interactions) during the year-long observation window. This increased surveillance does appear to be caused by being placed on the SSL. Individuals on SSL were 50 % more likely to have at least one contact card and 39 % more likely to have any interaction (including arrests, contact cards, victimizations, court appearances, etc.) with the Chicago PD than their matched comparisons in the year following the intervention. There was no statistically significant difference in their probability of being arrested or incapacitated (see Table 4). One possibility for this result, however, is that, given the emphasis by commanders to make contact with this group, these differences are due to increased reporting of contact cards for SSL subjects.

And, most importantly, being on the list means you are likely to be arrested for shooting, but it doesn’t cause that to be true:

In other words, the additional contact with police did not result in an increased likelihood for arrests for shooting, that is, the list was not a catalyst for arresting people for shootings. Rather, individuals on the list were people more likely to be arrested for a shooting regardless of the increased contact.

That also comes with an accompanying graphic:


From now on, I’ll refer to Chicago’s “Heat List” as a way for the police to predict their own future harassment and arrest practices.


What is alpha?

Last week on Slate Money I had a disagreement, or at least a lively discussion, with Felix Salmon and Josh Barro on the definition of alpha.

They said it was anything that a portfolio returned above and beyond the market return, given the amount of risk the portfolio was carrying. That’s no different from how wikipedia defines alpha, and I’ve seen it put more or less this way in a lot of places. Thus the confusion.

However, while working as a quant at a hedge fund, I was taught that alpha was the return of a portfolio that was uncorrelated to the market.

It’s a confusing thing to discuss, partly because the concept of “risk” is somewhat self-referential – more on that soon – and partly because we sometimes embed what’s called the capital asset pricing model (CAPM) into our assumptions when we talk about how portfolio returns work.

Let’s start with the following regression, which refers to stock-based portfolios, and which defines alpha:

R_{i, t} - R_f = \alpha + \beta (R_{M, t} - R_f) + \epsilon_t

Now, the term R_f refers to the risk-free rate, in other words how much interest you get on US treasuries. We can approximate it by 0, both because that’s simpler and because it’s actually pretty close to 0 anyway. That cleans up our formula:

R_{i, t} = \alpha + \beta R_{M, t} + \epsilon_t

In this regression, we are fitting the coefficients \alpha and \beta to many instances of time windows where we’ve measured our portfolio’s return R_{i, t} and the market’s return R_{M, t}. Think of the market as the S&P 500 index, and think of the time windows as days.

So first, defining alpha with the above regression does what I claimed it would do: it “picks off” the part of the portfolio’s returns that is correlated to the market and puts it in the beta coefficient, and the rest is left to alpha. If beta is 1, alpha is 0, and the error terms are all zero, you are following the market exactly.

On the other hand, the above formulation also seems to support Felix’s suggestion that alpha is the return that is not accounted for by risk. And that’s true, at least according to the CAPM theory of investing, which says you can’t do better than the market, that you’re rewarded for market risk in a direct way, and that everyone knows this and refuses to take on other, unrewarded risks. In particular, alpha in the above equation should be zero, but anything “extra” you earn beyond the expected market returns would show up as alpha in the above regression.
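To make the regression concrete, here’s a simulated example, with all parameters invented: we generate a portfolio with a known alpha and beta and recover them by ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 2500

# Simulated daily market returns (invented parameters)
market = rng.normal(0.0003, 0.01, n_days)

# A portfolio with true beta 0.5, true daily alpha of 2 basis points,
# plus idiosyncratic noise uncorrelated with the market
portfolio = 0.0002 + 0.5 * market + rng.normal(0, 0.02, n_days)

# Fit R_i = alpha + beta * R_M by least squares
X = np.column_stack([np.ones(n_days), market])
(alpha, beta), *_ = np.linalg.lstsq(X, portfolio, rcond=None)

print("alpha = %.5f, beta = %.2f" % (alpha, beta))
```

Notice that the idiosyncratic noise, which is twice as volatile as the market here, lands entirely in the error terms: the regression never sees it as risk.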

So, are we actually agreeing?

Well, no. The two approaches to defining alpha are very different. In particular, my definition has no reference to CAPM. Say for a moment we don’t believe in CAPM. We can still run the regression above. All we’re doing, when we run that regression, is measuring the extent to which our portfolio’s returns are “explained” by its overlap with the market.

In particular, we do not expect the true risk of our portfolio to be apparent in the above equation. Which brings us to how risk is defined, and it’s weird, because it cannot be directly measured. Instead, we typically infer risk from the volatility – computed as standard deviation – of past returns.

This isn’t a terrible idea, because if something moves around wildly on a daily basis, it would appear to be pretty risky. But it’s also not the greatest idea, as we learned in 2008, because lots of credit instruments like credit default swaps move very little on a daily basis but then suddenly lose tremendous value overnight. So past performance is not always indicative of future performance.

But it’s what we’ve got, so let’s hold on to it for the discussion. The key observation is the following:

The above regression formula only displays the market-correlated risk, and the remaining risk is unmeasured. A given portfolio might have incredibly wild swings in value, but as long as they are uncorrelated to the market, they will be invisible to the above equation, showing up only in the error terms.

Said another way, alpha is not truly risk-adjusted. It’s only market-risk-adjusted.

We might have an investment portfolio with a large alpha and a small beta, and someone who only follows CAPM theory would tell us we’re amazing investors. In fact, hedge funds try to minimize their relationship to market returns – that’s the “hedge” in hedge funds – so they’d want exactly that: a large alpha, a tiny beta, and quite a bit of risk. [One caveat: some people stipulate that a lot of that uncorrelated return is fabricated through sleazy accounting.]

It’s not like I’m alone here: people have long been aware that there’s lots of risk not represented by market risk, from other instrument classes and such. So instead of using a simplistic regression like the one above, people generalize everything in sight and use the Sharpe ratio, which is the ratio of returns (often relative to some benchmark or index) to risk, where risk is measured by more complicated volatility-like computations.
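For concreteness, here’s a minimal version of that computation, assuming daily returns, a zero benchmark, and 252 trading days a year; the return data is invented.

```python
import numpy as np

def sharpe_ratio(daily_returns, benchmark=0.0, periods_per_year=252):
    """Annualized Sharpe ratio: mean excess return over its volatility."""
    excess = np.asarray(daily_returns) - benchmark
    return np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1)

rng = np.random.default_rng(1)
returns = rng.normal(0.0005, 0.01, 1000)  # invented daily returns
print("Sharpe: %.2f" % sharpe_ratio(returns))
```

The gameability is visible right in the denominator: anything that makes measured volatility look smaller, without reducing actual risk, inflates the ratio.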

However, that more general concept is also imperfect, mostly because it’s complicated and highly gameable. Portfolio managers are constantly underestimating the risk they take on, partly because – or entirely because – they can then claim to have a high Sharpe ratio.

How much does this matter? People have a colloquial use for the word alpha that differs from mine, which isn’t surprising. The problem lies in the possibility that people are bragging when they shouldn’t be, especially when they’re hiding risk, and especially when your money is on the line.


The truth about clean swimming pools

There have been a lot of complaints about the Olympic pools turning green and dirty in Rio. People seem worried that the swimmers’ health may be at risk, and so on.

Well, here’s what I learned last month when my family rented a summer house with a pool. Pools that look clean are not clean. They would be better described as “so toxic that algae cannot live in them.”

I know what I’m talking about. One weekend my band was visiting the house, and the pool guy had been missing for 2 weeks straight. This is what the pool looked like:


Album cover, obviously.

Then we added an enormous vat of chemicals, specifically liquid chlorine, and about 24 hours later this is what happened:


It wasn’t easy to recreate this. I had to throw the shark’s tail at Jamie like 5 times because it kept floating away. Also, back of the album, obviously.

Now you might notice that it’s not green anymore, but it’s also not clear. To get to clear, blue water, you need to add yet another tub of some other chemical.

Long story short: don’t be deceived by “clean” pool water. There’s nothing clean about it.

Update: I’m not saying “chemicals are bad,” and please don’t compare me to the – ugh – Food Babe! I’m just saying “clean water” isn’t an appropriate description. It’s not as if it’s pure water, and we pour tons of stuff in to get it to look like that. So yes, algae and germs can be harmful! And yes, chlorine in moderate amounts is not bad for you!


Donald Trump is like a biased machine learning algorithm

Bear with me while I explain.

A quick observation: Donald Trump is not like normal people. In particular, he doesn’t have any principles to speak of that might guide him. No moral compass.

That doesn’t mean he doesn’t have a method. He does, but it’s local rather than global.

Instead of following some hidden but stable agenda, I would suggest Trump’s goal is simply to “not be boring” at Trump rallies. He wants to entertain, and to be the focus of attention at all times. He’s said as much, and it’s consistent with what we know about him. A born salesman.

What that translates to is a constant iterative process whereby he experiments with pushing the conversation this way or that, and he sees how the crowd responds. If they like it, he goes there. If they don’t respond, he never goes there again, because he doesn’t want to be boring. If they respond by getting agitated, that’s a lot better than being bored. That’s how he learns.

A few consequences. First, he’s got biased training data, because the people at his rallies are a particular type of weirdo. That’s one reason he consistently ends up saying things that totally fly within his training set – people at rallies – but rub the rest of the world the wrong way.

Next, because he doesn’t have any actual beliefs, his policy ideas are by construction vague. When he’s forced to say more, he makes them benefit himself, naturally, because he’s also selfish. He’s also entirely willing to switch sides on an issue if the crowd at his rallies seems to enjoy that.

In that sense he’s perfectly objective, as in morally neutral. He just follows the numbers. He could be replaced by a robot that acts on a machine learning algorithm with a bad definition of success – or in his case, a penalty for boringness – and with extremely biased data.
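As a sketch of that robot, here’s an epsilon-greedy learner whose only reward signal is applause from a biased crowd. The topics and applause rates below are entirely invented.

```python
import random

random.seed(0)

topics = ["policy details", "insults", "trade", "conspiracy"]
# Invented numbers: probability the *rally* crowd (the biased training
# set) responds enthusiastically to each topic.
crowd_applause = {"policy details": 0.05, "insults": 0.8,
                  "trade": 0.4, "conspiracy": 0.7}

counts = {t: 0 for t in topics}
wins = {t: 0 for t in topics}

def pick_topic(epsilon=0.1):
    # Mostly exploit what has worked so far; occasionally try something new.
    if random.random() < epsilon or all(c == 0 for c in counts.values()):
        return random.choice(topics)
    return max(topics, key=lambda t: wins[t] / counts[t] if counts[t] else 0)

for _ in range(5000):
    topic = pick_topic()
    counts[topic] += 1
    if random.random() < crowd_applause[topic]:  # the biased reward signal
        wins[topic] += 1

print(max(topics, key=lambda t: counts[t]))
```

The learner has no agenda at all; it just drifts toward whatever its (unrepresentative) audience applauds, which is exactly the failure mode of training a morally neutral algorithm on biased data.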

The reason I bring this up: first of all, it’s a great way of understanding how machine learning algorithms can give us stuff we absolutely don’t want, even though they fundamentally lack prior agendas. Happens all the time, in ways similar to the Donald.

Second, some people actually think there will soon be algorithms that control us, operating “through sound decisions of pure rationality” and that we will no longer have use for politicians at all.

And look, I can understand why people are sick of politicians, and would love them to be replaced with rational decision-making robots. But that scenario means one of three things:

  1. Controlling robots simply get trained by the people’s will and do whatever people want at the moment. Maybe that looks like people voting with their phones or via the chips in their heads. This is akin to direct democracy, and the problems are varied – I was in Occupy after all – but in particular mean that people are constantly weighing in on things they don’t actually understand. That leaves them vulnerable to misinformation and propaganda.
  2. Controlling robots ignore people’s will and just follow their inner agendas. Then the question becomes, who sets that agenda? And how does it change as the world and as culture changes? Imagine if we were controlled by someone from 1000 years ago with the social mores from that time. Someone’s gonna be in charge of “fixing” things.
  3. Finally, it’s possible that the controlling robot would act within a political framework to be somewhat but not completely influenced by a democratic process. Something like our current president. But then getting a robot in charge would be a lot like voting for a president. Some people would agree with it, some wouldn’t. Maybe every four years we’d have another vote, and the candidates would be both people and robots, and sometimes a robot would win, sometimes a person. I’m not saying it’s impossible, but it’s not utopian. There’s no such thing as pure rationality in politics, it’s much more about picking sides and appealing to some people’s desires while ignoring others.

Holy crap – an actual book!

Yo, everyone! The final version of my book now exists, and I have exactly one copy! Here’s my editor, Amanda Cook, holding it yesterday when we met for beers:


Here’s my son holding it:


He’s offered to become a meme in support of book sales.

Here’s the back of the book, with blurbs from really exceptional people:


In other exciting book news, there’s a review by Richard Beales of Reuters Breakingviews, and the book made a list of new releases in Scientific American as well.


I want to apologize in advance for all the book news I’m going to be blogging, tweeting, and otherwise blabbing about. To be clear, I’ve been told it’s my job for the next few months to be a PR person for my book, so I guess that’s what I’m up to. If you come here for ideas and are turned off by cheerleading, feel free to temporarily hate me, and even unsubscribe from whatever feed I’m in for you!

But please buy my book first, available for pre-order now. And feel free to leave an amazing review.

