Citi Bike comes to Columbia

I’m unreasonably excited that Citi Bike has finally expanded to the area where I live, Columbia University. Here’s the situation:

Screen Shot 2016-09-02 at 8.12.16 AM

Specifically, this means I can drop my kid off at school at 110th and Broadway and then bike downtown.

People, this is huge. It means I never have to get on the 1 train at rush hour again! Unless everyone else has the same plan as me, of course.

Categories: Uncategorized

Excerpt of my book in the Guardian!

Wow, people, an excerpt of my book has been published in the Guardian this morning. How exciting is this? I hope you like it, it’s got a fancy graphic:

Screen Shot 2016-09-01 at 7.25.04 AM.png

This is the result of my amazing Random House UK publicity team, who have been busy promoting my book in the UK.

That same team is bringing me to London at the end of September for a book tour, and as part of that I’m excited to announce I’ll be at the How To: Academy on September 27th, talking about my book. Tickets are available here.

Also, if you haven’t gotten enough of Weapons of Math Destruction this morning, take a look at Evelyn Lamb’s review in Scientific American.

Update: the print edition of the Guardian also looks smashing:

Cover photo

Double spread photo


In Time Magazine!

The amazing and talented Rana Foroohar, whom I spoke with on Slate Money not so long ago about her fascinating book, Makers and Takers: the Rise of Finance and the Fall of American Business, has written a fantastic piece about my upcoming book for Time Magazine.

The link is here. Take a look, it’s a great piece.

Also, I was profiled as a math nerd last week by a Bloomberg journalist.


BISG Methodology

I’ve been tooling around with the slightly infamous BISG methodology lately. It’s a simple concept which takes the last name of a person, as well as the zip code of their residence, and imputes the probabilities of that person being of various races and ethnicities using the Bayes updating rule.

The methodology is implemented with the most recent U.S. census data and critically relies on the fact that segregation is widespread in this country, especially among whites and blacks, and that Asian and Hispanic last names are relatively well-defined. It’s not a perfect methodology, of course: it breaks down when people marry people of other races, when names are common to more than one race, and especially when people live in diverse neighborhoods.
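To make the update concrete, here’s a minimal sketch of the Bayes step with two made-up groups. The function name bisg_posterior and all of the numbers are illustrative assumptions, not census figures; the real inputs are the name-based and zip-based probabilities from the census tables.

```python
import numpy as np

def bisg_posterior(p_race_given_name, p_zip_given_race):
    """Combine name-based and zip-based evidence with Bayes' rule.

    P(race | name, zip) is proportional to P(race | name) * P(zip | race),
    assuming name and zip are independent once race is known.
    """
    prior = np.asarray(p_race_given_name, dtype=float)
    likelihood = np.asarray(p_zip_given_race, dtype=float)
    unnormalized = prior * likelihood
    return unnormalized / unnormalized.sum()

# Hypothetical inputs: the name alone is a coin flip between two groups,
# but the zip code is three times as common among the second group.
posterior = bisg_posterior([0.5, 0.5], [0.001, 0.003])
print(posterior)
```

Note that the zip probabilities can be tiny, since they are the chance of living in one particular zip code; only their ratios matter after normalizing.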

The BISG methodology came up recently in this article (hat tip Don Goldberg) about the man who invented it and the politics surrounding it. Specifically, it was recently used by the CFPB to infer disparate impact in auto lending, and the Republicans who side with auto lending lobbyists called it “junk science.” I blogged about this here and, even earlier, here.

Their complaints, I believe, center around the fact that the methodology, being based on the entire U.S. population, isn’t entirely accurate when it comes to auto lending, or for that matter when it comes to mortgages, which was the CFPB’s “ground truth” testing arena.

And that’s because minorities basically have less wealth, due to a bunch of historical racist reasons. The upshot is that this methodology assumes a random sample of the U.S. population, but the people we actually see in auto financing aren’t a random sample.

Which raises the question: why don’t we update the probabilities with the known distribution of auto lending? That’s the thing about Bayes’ law, we can absolutely do that. And once we did that, the Republicans’ complaint would disappear. Please, someone tell me what I’m misunderstanding.
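One possible reading of that extra update, sketched with invented numbers: multiply the BISG probabilities by the rate at which each group shows up as an auto borrower, then renormalize. The function name, the rates, and the assumption that borrowing is independent of name and zip once race is known are all hypothetical, not anything the CFPB has published.

```python
import numpy as np

def reweight_for_population(posterior, p_borrower_given_race):
    """Adjust race probabilities for a non-random subpopulation.

    P(race | name, zip, borrower) is proportional to
    P(race | name, zip) * P(borrower | race), under the assumption that
    being a borrower depends on name and zip only through race.
    """
    weighted = np.asarray(posterior, dtype=float) * np.asarray(p_borrower_given_race, dtype=float)
    return weighted / weighted.sum()

# Hypothetical: a 25/75 split from the general-population update, with the
# first group twice as likely to take out an auto loan as the second.
adjusted = reweight_for_population([0.25, 0.75], [0.2, 0.1])
print(adjusted)
```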

Between you and me, I think the real gripe is something along the lines of the so-called voter fraud problem, which is not really a problem statistically but since examples can be found of mistakes, we might imagine they’re widespread. In this case, the “mistake” is a white person being offered restitution for racist auto lending practices, which happens, and is a strange problem to have, but needs to be compared to not offering restitution to a lot of people who actually deserve it.

Anyhoo, I’m planning to add the below code to github, but I recently purchased a new laptop and I haven’t added a public key yet, so I’ll get to it soon. To be clear, the below code isn’t perfect, and it only uses zip code whereas a more precise implementation would use addresses. I’m supplying this because I didn’t find it online in python, only in STATA or something crazy expensive like that. Even so, I stole their munged census data, which you can too, from this github page.

Also, I can’t seem to get the python spacing to work in WordPress, so this is really pretty terrible, but python users will be able to figure it out until I can get it on github.

%matplotlib inline

import numpy
import matplotlib
from pandas import *
import pylab
pylab.rcParams['figure.figsize'] = 16, 12

#Clean your last names and zip codes.

def get_last_name(fullname):
    parts_list = fullname.split(' ')
    # strip trailing blanks and suffixes such as Jr or III
    while parts_list[-1] in ['', ' ', 'Jr', 'III', 'II', 'Sr']:
        parts_list = parts_list[:-1]
        if len(parts_list)==0:
            return ""
    # census surnames are upper case with no apostrophes
    return parts_list[-1].upper().replace("'", "")

def clean_zip(fullzip):
    if len(str(fullzip))<5:
        return 0
    # keep the first five digits; anything non-numeric becomes 0
    try:
        return int(str(fullzip)[:5])
    except ValueError:
        return 0

Test = read_csv("file.csv")
Test['Name'] = Test['name'].map(lambda x: get_last_name(x))
Test['Zip'] = Test['zip'].map(lambda x: clean_zip(x))

#Add zip code probabilities. Note these are probability of living in a specific zip code given that you have a given race. They are extremely small numbers.

F = read_stata("zip_over18_race_dec10.dta")
print("read in zip data")

names = ['NH_White_alone', 'NH_Black_alone', 'NH_API_alone', 'NH_AIAN_alone', 'NH_Mult_Total', \

trans = dict(zip(names, ['White', 'Black', 'API', 'AIAN', 'Mult', 'Hisp', 'Other']))
totals_by_race = [float(F[r].sum()) for r in names]
sum_dict = dict(zip(names, totals_by_race))

#I’ll use the generic_vector down below when I don’t have better name information

generic_vector = numpy.array(totals_by_race)/numpy.array(totals_by_race).sum()

for r in names:
    F['pct of total %s' %(trans[r])] = F[r]/sum_dict[r]

print("ready to add zip probabilities")

def get_zip_probs(zip_code):
    G = F[F['ZCTA5']==str(zip_code)][['pct of total White', 'pct of total Black', 'pct of total API', \
        'pct of total AIAN', 'pct of total Mult', 'pct of total Hisp', \
        'pct of total Other']]
    if len(G.values)>0:
        return numpy.array(G.values[0])
    print("no data for zip = %s" % zip_code)
    # a vector of ones leaves the name-based probabilities unchanged
    return numpy.array([1.0]*7)

Test['Prob of zip given race'] = Test['Zip'].map(lambda x: get_zip_probs(x))

#Next, compute the probability of each race given a specific name.

Names = read_csv("app_c.csv")

print("read in name data")

def clean_probs(p):
    # entries that aren't numbers are treated as 0
    try:
        return float(p)
    except ValueError:
        return 0.0

for cat in ['pctwhite', 'pctblack', 'pctapi', 'pctaian', 'pct2prace', 'pcthispanic']:
    Names[cat] = Names[cat].map(lambda x: clean_probs(x)/100.0)

Names['pctother'] = Names.apply(lambda row: max(0, 1 - float(row['pctwhite']) - \
    float(row['pctblack']) - float(row['pctapi']) - \
    float(row['pctaian']) - float(row['pct2prace']) - \
    float(row['pcthispanic'])), axis = 1)

print("ready to add name probabilities")

def get_name_probs(name):
    G = Names[Names['name']==name][['pctwhite', 'pctblack', 'pctapi', 'pctaian', 'pct2prace', 'pcthispanic', 'pctother']]
    if len(G.values)>0:
        return numpy.array(G.values[0])
    # unknown surname: fall back to the overall population shares
    return generic_vector

Test['Prob of race given name'] = Test['Name'].map(lambda x: get_name_probs(x))

#Finally, use the Bayesian updating formula to compute overall probabilities of each race.

Test['Prod'] = Test['Prob of zip given race']*Test['Prob of race given name']
Test['Dot'] = Test['Prod'].map(lambda x: x.sum())
Test['Final Probs'] = Test['Prod']/Test['Dot']

Test['White Prob'] = Test['Final Probs'].map(lambda x: x[0])
Test['Black Prob'] = Test['Final Probs'].map(lambda x: x[1])
Test['API Prob'] = Test['Final Probs'].map(lambda x: x[2])
Test['AIAN Prob'] = Test['Final Probs'].map(lambda x: x[3])
Test['Mult Prob'] = Test['Final Probs'].map(lambda x: x[4])
Test['Hisp Prob'] = Test['Final Probs'].map(lambda x: x[5])
Test['Other Prob'] = Test['Final Probs'].map(lambda x: x[6])


Book Tour Events!

Readers, I’m so happy to announce upcoming public events for my book tour, which starts in 2 weeks! Holy crap!

The details aren’t all entirely final, and there may be more events added later, but here’s what we’ve got so far. I hope I see some of you soon!

Events for Cathy O’Neil

Author of Weapons of Math Destruction

How Big Data Increases Inequality and Threatens Democracy

(Crown; September 6, 2016)

Thursday, September 8


Reading/Signing/Talk with Felix Salmon

Barnes & Noble Upper East Side

150 E 86th St.

New York, NY 10028


Tuesday, September 13


In Conversation Event

Town Hall Seattle

1119 8th Ave.

Seattle, WA 98101 

Wednesday, September 14

Democracy/Citizenship Series

Mechanics’ Institute Library

57 Post St.

San Francisco, CA 94104


Wednesday, September 14

In Conversation with Lianna McSwain

Book Passage

51 Tamal Vista Blvd.

Corte Madera, CA 94925


Thursday, September 15


Privacy.Security.Risk. 2016

San Jose Marriott

301 S. Market Street

San Jose, CA 95113

Tuesday, September 20


In Conversation with Jen Golbeck

Busboys and Poets (w/Politics & Prose)

1025 5th Street NW

Washington, D.C. 20001

Monday, October 3



Harvard Book Store

1256 Mass Ave.

Cambridge, MA 02138

Saturday, October 22nd


Wisconsin Book Festival

Wisconsin Institutes for Discovery

DeLuca Forum

For more information or to schedule an interview contact:
Sarah Breivogel, 212-572-2722, or
Liz Esman, 212-572-6049


Chicago’s “Heat List” predicts arrests, doesn’t protect people or deter crime

A few months ago I publicly pined for a more scientific audit of the Chicago Police Department’s “Heat List” system. The excerpt from that blogpost:

…the Chicago Police Department uses data mining techniques of social media to determine who is in gangs. Then they arrest scores of people on their lists, and finally they tout the accuracy of their list in part because of the percentage of people who were arrested who were also on their list. I’d like to see a slightly more scientific audit of this system.

Thankfully, my request has officially been fulfilled!

Yesterday I discovered, via Marcos Carreiro on Twitter, that a paper has been written entitled Predictions put into practice: a quasi-experimental evaluation of Chicago’s predictive policing pilot, written by Priscillia Hunt and John S. Hollywood and published in the Journal of Experimental Criminology.

The paper’s main result upheld my suspicions:

Individuals on the SSL are not more or less likely to become a victim of a homicide or shooting than the comparison group, and this is further supported by city-level analysis. The treated group is more likely to be arrested for a shooting.

Inside the paper, they make the following important observations. First, crime rates have been going down over time, and the “Heat List” system has not affected that trend. An excerpt:

…the statistically significant reduction in monthly homicides predated the introduction of the SSL, and that the SSL did not cause further reduction in the average number of monthly homicides above and beyond the pre-existing trend.

Here’s an accompanying graphic:

Screen Shot 2016-08-18 at 6.14.39 AM.png

This is a really big and important point, one that smart people like Gillian Tett get thrown off by when discussing predictive policing tools. We cannot automatically attribute success to any policing policy in the context of meta-effects.

Next, being on the list doesn’t protect you:

However, once other demographics, criminal history variables, and social network risk have been controlled for using propensity score weighting and doubly-robust regression modeling, being on the SSL did not significantly reduce the likelihood of being a murder or shooting victim, or being arrested for murder.

But it does make it more likely for you to get surveilled by police:

Seventy-seven percent of the SSL subjects had at least one contact card over the year following the intervention, with a mean of 8.6 contact cards, and 60 % were arrested at some point, with a mean of 1.53 arrests. In fact, almost 90 % had some sort of interaction with the Chicago PD (mean = 10.72 interactions) during the year-long observation window. This increased surveillance does appear to be caused by being placed on the SSL. Individuals on SSL were 50 % more likely to have at least one contact card and 39 % more likely to have any interaction (including arrests, contact cards, victimizations, court appearances, etc.) with the Chicago PD than their matched comparisons in the year following the intervention. There was no statistically significant difference in their probability of being arrested or incapacitated (see Table 4). One possibility for this result, however, is that, given the emphasis by commanders to make contact with this group, these differences are due to increased reporting of contact cards for SSL subjects.

And, most importantly, being on the list means you are more likely to be arrested for a shooting, but the list doesn’t cause that to be true:

In other words, the additional contact with police did not result in an increased likelihood for arrests for shooting, that is, the list was not a catalyst for arresting people for shootings. Rather, individuals on the list were people more likely to be arrested for a shooting regardless of the increased contact.

That also comes with an accompanying graphic:

Screen Shot 2016-08-18 at 6.29.13 AM.png

From now on, I’ll refer to Chicago’s “Heat List” as a way for the police to predict their own future harassment and arrest practices.


What is alpha?

Last week on Slate Money I had a disagreement, or at least a lively discussion, with Felix Salmon and Josh Barro on the definition of alpha.

They said it was anything that a portfolio returned above and beyond the market return, given the amount of risk the portfolio was carrying. That’s not different from how Wikipedia defines alpha, and I’ve seen it said in more or less this way in a lot of places. Thus the confusion.

However, while working as a quant at a hedge fund, I was taught that alpha was the return of a portfolio that was uncorrelated to the market.

It’s a confusing thing to discuss, partly because the concept of “risk” is somewhat self-referential – more on that soon – and partly because we sometimes embed what’s called the capital asset pricing model (CAPM) into our assumptions when we talk about how portfolio returns work.

Let’s start with the following regression, which refers to stock-based portfolios, and which defines alpha:

R_{i, t} - R_f = \alpha + \beta (R_{M, t} - R_f) + \epsilon_t

Now, the term R_f refers to the risk-free rate, or in other words how much interest you get on US treasuries, which we can approximate by 0 because it’s easier to ignore it and because it’s actually pretty close to 0 anyway. That cleans up our formula:

R_{i, t} = \alpha + \beta R_{M, t} + \epsilon_t

In this regression, we are fitting the coefficients \alpha and \beta to many instances of time windows where we’ve measured our portfolio’s return R_{i, t} and the market’s return R_{M, t}. Think of market as the S&P500 index, and think of the time windows as days.

So first, defining alpha with the above regression does what I claimed it would do: it “picks off” the part of the portfolio’s returns that is correlated to the market and puts it in the beta coefficient, and the rest is left to alpha. If beta is 1, alpha is 0, and if the error terms are all zero, you are following the market exactly.
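Here’s a rough sketch of what fitting that regression looks like, using synthetic daily returns rather than real data. The “true” alpha and beta, the noise levels, and the variable names are all assumptions chosen for illustration; np.polyfit with degree 1 is just ordinary least squares on a line.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 5000

# Synthetic daily market returns (think of the S&P 500), with the
# risk-free rate taken to be 0 as in the simplified formula.
market = rng.normal(0.0, 0.01, n_days)

# A made-up portfolio: alpha of 0.1% a day, beta of 0.5, plus noise
# that has nothing to do with the market.
true_alpha, true_beta = 0.001, 0.5
portfolio = true_alpha + true_beta * market + rng.normal(0.0, 0.002, n_days)

# Least-squares fit of portfolio = alpha + beta * market.
# np.polyfit returns the slope first, then the intercept.
beta_hat, alpha_hat = np.polyfit(market, portfolio, 1)
print("alpha:", alpha_hat, "beta:", beta_hat)
```

With this much data the fitted coefficients land close to the values that were baked in; with a few months of real returns the estimate of alpha is usually far noisier.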

On the other hand, the above formulation also seems to support Felix’s suggestion that alpha is the return that is not accounted for by risk. The thing is, it’s true, at least according to the CAPM theory of investing, which says you can’t do better than the market, that you’re rewarded for your market risk in a direct way, and that everyone knows this and refuses to take on other, unrewarded risks. In particular, alpha in the above equation should be zero, but anything “extra” that you earn beyond the expected market returns would be represented by alpha in the above regression.

So, are we actually agreeing?

Well, no. The two approaches to defining alpha are very different. In particular, my definition has no reference to CAPM. Say for a moment we don’t believe in CAPM. We can still run the regression above. All we’re doing, when we run that regression, is measuring the extent to which our portfolio’s returns are “explained” by its overlap with the market.

In particular, we do not expect the true risk of our portfolio to be apparent in the above equation. Which brings us to how risk is defined, and it’s weird, because it cannot be directly measured. Instead, we typically infer risk from the volatility – computed as standard deviation – of past returns.

This isn’t a terrible idea, because if something moves around wildly on a daily basis, it would appear to be pretty risky. But it’s also not the greatest idea, as we learned in 2008, because lots of credit instruments like credit default swaps move very little on a daily basis but then suddenly lose tremendous value overnight. So past performance is not always indicative of future performance.

But it’s what we’ve got, so let’s hold on to it for the discussion. The key observation is the following:

The above regression formula only displays the market-correlated risk, and the remaining risk is unmeasured. A given portfolio might have incredibly wild swings in value, but as long as they are uncorrelated to the market, they will be invisible to the above equation, showing up only in the error terms.

Said another way, alpha is not truly risk-adjusted. It’s only market-risk-adjusted.
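A small demonstration of that point, again with invented numbers: two synthetic portfolios that differ only in how much market-uncorrelated noise they carry. The noise is built to be exactly uncorrelated with the market in this sample, which is an idealization, so the regression returns the same alpha and beta for both even though one is far more volatile.

```python
import numpy as np

rng = np.random.default_rng(1)
n_days = 1000
market = rng.normal(0.0, 0.01, n_days)

def orthogonal_noise(scale):
    """Random noise with its sample mean and market component removed."""
    raw = rng.normal(0.0, scale, n_days)
    X = np.column_stack([np.ones(n_days), market])
    coef, *_ = np.linalg.lstsq(X, raw, rcond=None)
    return raw - X @ coef

def fit_alpha_beta(portfolio, market):
    """Least-squares fit of portfolio = alpha + beta * market."""
    beta, alpha = np.polyfit(market, portfolio, 1)
    return alpha, beta

# Same alpha and beta, very different amounts of uncorrelated risk.
calm = 0.001 + 0.2 * market + orthogonal_noise(0.001)
wild = 0.001 + 0.2 * market + orthogonal_noise(0.05)

print(fit_alpha_beta(calm, market), calm.std())
print(fit_alpha_beta(wild, market), wild.std())
```

The extra swings in the second portfolio end up entirely in the error terms, which is exactly where nobody is looking when they quote alpha.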

We might have an investment portfolio with a large alpha and a small beta, and someone who only follows CAPM theory would tell us we’re amazing investors. In fact hedge funds try to minimize their relationship to market returns – that’s the “hedge” in hedge funds – and so they’d want exactly that, a large alpha, a tiny beta, and quite a bit of risk. [One caveat: some people speculate that a lot of that uncorrelated return is fabricated through sleazy accounting.]

It’s not like I am alone here – for a long time people have been aware that there’s lots of risk that’s not represented by market risk – for example, other instrument classes and such. So instead of using a simplistic regression like the one above, people generalize everything in sight and use the Sharpe ratio, which is the ratio of returns (often relative to some benchmark or index) to risks, where risks are measured by more complicated volatility-like computations.
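For reference, the simplest textbook version of that ratio looks something like the sketch below: average excess return divided by the sample standard deviation of excess returns. The function name, the default benchmark of 0, and the two-day example are assumptions for illustration, and real-world versions typically annualize and use fancier risk estimates.

```python
import numpy as np

def sharpe_ratio(returns, benchmark=0.0):
    """Mean excess return divided by the volatility of excess returns.

    `benchmark` can be a single number or a series of benchmark returns
    the same length as `returns`. Volatility is the sample standard
    deviation, which only reflects how returns moved in the past.
    """
    excess = np.asarray(returns, dtype=float) - benchmark
    return excess.mean() / excess.std(ddof=1)

# Two made-up daily returns: 1% and 3%.
example = sharpe_ratio([0.01, 0.03])
print(example)
```

Because the denominator is estimated from past returns, anything that hides volatility from the historical record flatters the ratio.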

However, that more general concept is also imperfect, mostly because it’s complicated and highly gameable. Portfolio managers are constantly underestimating the risk they take on, partly because – or entirely because – they can then claim to have a high Sharpe ratio.

How much does this matter? People have a colloquial use for the word alpha that’s different from my understanding, which isn’t surprising. The problem lies in the possibility that people are bragging when they shouldn’t, especially when they’re hiding risk, and especially especially if your money is on the line.
