## Citi Bike comes to Columbia

I’m unreasonably excited that Citi Bike has finally expanded to the area where I live, Columbia University. Here’s the situation:

Specifically, this means I can drop my kid off at school at 110th and Broadway and then bike downtown.

People, this is huge. It means I never have to get on the 1 train at rush hour again! Unless everyone else has the same plan as me, of course.


## Excerpt of my book in the Guardian!

Wow, people, an excerpt of my book has been published in the Guardian this morning. How exciting is this? I hope you like it, it’s got a fancy graphic:

This is the result of my amazing Random House UK publicity team, who have been busy promoting my book in the UK.

That same team is bringing me to London at the end of September for a book tour, and as part of that I’m excited to announce I’ll be at the How To: Academy on September 27th, talking about my book. Tickets are available here.

Also, if you haven’t gotten enough of Weapons of Math Destruction this morning, take a look at Evelyn Lamb’s review in Scientific American.

Update: the print edition of the Guardian also looks smashing:


## In Time Magazine!

The amazing and talented Rana Foroohar, whom I spoke with on Slate Money not so long ago about her fascinating book, Makers and Takers: the Rise of Finance and the Fall of American Business, has written a fantastic piece about my upcoming book for Time Magazine.

The link is here. Take a look, it’s a great piece.

Also, I was profiled as a math nerd last week by a Bloomberg journalist.


## BISG Methodology

I’ve been tooling around with the slightly infamous BISG methodology lately. It’s a simple concept which takes the last name of a person, as well as the zip code of their residence, and imputes the probabilities of that person being of various races and ethnicities using the Bayes updating rule.
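In symbols, writing $r$ for a race or ethnicity category, $s$ for the surname, and $g$ for the zip code, the update looks like the formula below. The notation is mine, not part of any official write-up, and it assumes surname and zip code are independent once you condition on race, which is the assumption the code further down relies on when it multiplies the two vectors and normalizes.

$P(r \mid s, g) = \frac{P(r \mid s) \, P(g \mid r)}{\sum_{r'} P(r' \mid s) \, P(g \mid r')}$

Here $P(r \mid s)$ comes from the census surname table and $P(g \mid r)$ is the share of all people of race $r$ who live in zip code $g$.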

The methodology is implemented with the most recent U.S. census data and critically relies on the fact that segregation is widespread in this country, especially between whites and blacks, and that Asian and Hispanic last names are relatively well-defined. It’s not a perfect methodology, of course: it breaks down when people marry across racial lines, when names are common to more than one race, and especially when people live in diverse neighborhoods.

The BISG methodology came up recently in this article (hat tip Don Goldberg) about the man who invented it and the politics surrounding it. Specifically, it was recently used by the CFPB to infer disparate impact in auto lending, and the Republicans who side with auto lending lobbyists called it “junk science.” I blogged about this here and, even earlier, here.

Their complaints, I believe, center around the fact that the methodology, being based on the entire U.S. population, isn’t entirely accurate when it comes to auto lending, or for that matter when it comes to mortgages, which was the CFPB’s “ground truth” testing arena.

And that’s because minorities have less wealth on average, for a bunch of historically racist reasons. The upshot is that this methodology assumes a random sample of the U.S. population, whereas the people we actually see in auto financing aren’t a random sample.

Which raises the question, why don’t we update the probabilities with the known distribution of auto lending? That’s the thing about Bayes’ Law, we can absolutely do that. And once we did that, the Republicans’ complaint would disappear. Please, someone tell me what I’m misunderstanding.
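Here’s a minimal sketch of what that kind of re-weighting could look like. It is not anything the CFPB actually does, and every number in it is made up. The function name `reweight_for_subpopulation` and the share vectors are hypothetical, and the whole thing assumes that, within each race, names and zip codes are distributed the same way among borrowers as in the general population, so that only the base rates change.

```python
import numpy as np

RACES = ["White", "Black", "API", "AIAN", "Mult", "Hisp", "Other"]

def reweight_for_subpopulation(posterior, population_shares, subpop_shares):
    """Re-base a BISG posterior from the general population onto a subpopulation.

    Assumption: P(name, zip | race) is identical in both populations, so the
    posterior only needs to be scaled by the ratio of base rates and renormalized.
    """
    posterior = np.asarray(posterior, dtype=float)
    ratio = (np.asarray(subpop_shares, dtype=float)
             / np.asarray(population_shares, dtype=float))
    adjusted = posterior * ratio
    return adjusted / adjusted.sum()

# Made-up numbers, for illustration only (each vector sums to 1).
population_shares = np.array([0.62, 0.12, 0.05, 0.01, 0.02, 0.17, 0.01])
borrower_shares = np.array([0.70, 0.09, 0.05, 0.01, 0.02, 0.12, 0.01])
bisg_posterior = np.array([0.50, 0.30, 0.02, 0.01, 0.02, 0.14, 0.01])

adjusted = reweight_for_subpopulation(bisg_posterior, population_shares, borrower_shares)

for race, before, after in zip(RACES, bisg_posterior, adjusted):
    print("%-6s %.3f -> %.3f" % (race, before, after))
```

Groups that are under-represented among borrowers relative to the population get their probabilities scaled down, and the rest get scaled up, which is the direction of correction the complaint seems to be asking for.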

Between you and me, I think the real gripe is something along the lines of the so-called voter fraud problem, which is not really a problem statistically but since examples can be found of mistakes, we might imagine they’re widespread. In this case, the “mistake” is a white person being offered restitution for racist auto lending practices, which happens, and is a strange problem to have, but needs to be compared to not offering restitution to a lot of people who actually deserve it.

Anyhoo, I’m planning to add the below code to github, but I recently purchased a new laptop and I haven’t added a public key yet, so I’ll get to it soon. To be clear, the below code isn’t perfect, and it only uses zip code whereas a more precise implementation would use addresses. I’m supplying this because I didn’t find it online in python, only in STATA or something crazy expensive like that. Even so, I stole their munged census data, which you can too, from this github page.

Also, I can’t seem to get the python spacing to work in WordPress, so this is really pretty terrible, but python users will be able to figure it out until I can get it on github.

    %matplotlib inline

    import numpy
    import matplotlib
    from pandas import *
    import pylab
    pylab.rcParams['figure.figsize'] = 16, 12

    # Clean your last names and zip codes.

    def get_last_name(fullname):
        # Split on spaces, then strip trailing blanks and generational suffixes.
        parts_list = fullname.split(' ')
        # The "parts_list and" guard avoids an IndexError when nothing is left.
        while parts_list and parts_list[-1] in ['', ' ', 'Jr', 'III', 'II', 'Sr']:
            parts_list = parts_list[:-1]
        if len(parts_list) == 0:
            return ""
        else:
            return parts_list[-1].upper().replace("'", "")

    def clean_zip(fullzip):
        # Keep the first five characters as an integer; return 0 when unusable.
        if len(str(fullzip)) < 5:
            return 0
        else:
            try:
                return int(str(fullzip)[:5])
            except ValueError:
                return 0

    Test['Name'] = Test['name'].map(lambda x: get_last_name(x))
    Test['Zip'] = Test['zip'].map(lambda x: clean_zip(x))

    # Add zip code probabilities. Note these are the probability of living in a specific zip code given that you have a given race. They are extremely small numbers.

    names = ['NH_White_alone', 'NH_Black_alone', 'NH_API_alone', 'NH_AIAN_alone', 'NH_Mult_Total',
             'Hispanic_Total', 'NH_Other_alone']

    trans = dict(zip(names, ['White', 'Black', 'API', 'AIAN', 'Mult', 'Hisp', 'Other']))
    totals_by_race = [float(F[r].sum()) for r in names]
    sum_dict = dict(zip(names, totals_by_race))

    # I'll use the generic_vector down below when I don't have better name information

    generic_vector = numpy.array(totals_by_race)/numpy.array(totals_by_race).sum()

    for r in names:
        F['pct of total %s' % (trans[r])] = F[r]/sum_dict[r]

    def get_zip_probs(zipcode):
        G = F[F['ZCTA5'] == str(zipcode)][['pct of total White', 'pct of total Black', 'pct of total API',
                                          'pct of total AIAN', 'pct of total Mult', 'pct of total Hisp',
                                          'pct of total Other']]
        if len(G.values) > 0:
            return numpy.array(G.values[0])
        else:
            # No census row for this zip: return a flat vector so the name probabilities decide.
            print("no data for zip = %s" % zipcode)
            return numpy.array([1.0]*7)

    Test['Prob of zip given race'] = Test['Zip'].map(lambda x: get_zip_probs(x))

    # Next, compute the probability of each race given a specific name.

    def clean_probs(p):
        # Suppressed or non-numeric entries in the surname table become 0.
        try:
            return float(p)
        except (TypeError, ValueError):
            return 0.0

    for cat in ['pctwhite', 'pctblack', 'pctapi', 'pctaian', 'pct2prace', 'pcthispanic']:
        Names[cat] = Names[cat].map(lambda x: clean_probs(x)/100.0)

    Names['pctother'] = Names.apply(lambda row: max(0, 1 - float(row['pctwhite']) -
                                                    float(row['pctblack']) - float(row['pctapi']) -
                                                    float(row['pctaian']) - float(row['pct2prace']) -
                                                    float(row['pcthispanic'])), axis=1)

    def get_name_probs(name):
        G = Names[Names['name'] == name][['pctwhite', 'pctblack', 'pctapi', 'pctaian', 'pct2prace', 'pcthispanic', 'pctother']]
        if len(G.values) > 0:
            return numpy.array(G.values[0])
        else:
            return generic_vector

    Test['Prob of race given name'] = Test['Name'].map(lambda x: get_name_probs(x))

    # Finally, use the Bayesian updating formula to compute overall probabilities of each race.

    Test['Prod'] = Test['Prob of zip given race']*Test['Prob of race given name']
    Test['Dot'] = Test['Prod'].map(lambda x: x.sum())
    Test['Final Probs'] = Test['Prod']/Test['Dot']

    Test['White Prob'] = Test['Final Probs'].map(lambda x: x[0])
    Test['Black Prob'] = Test['Final Probs'].map(lambda x: x[1])
    Test['API Prob'] = Test['Final Probs'].map(lambda x: x[2])
    Test['AIAN Prob'] = Test['Final Probs'].map(lambda x: x[3])
    Test['Mult Prob'] = Test['Final Probs'].map(lambda x: x[4])
    Test['Hisp Prob'] = Test['Final Probs'].map(lambda x: x[5])
    Test['Other Prob'] = Test['Final Probs'].map(lambda x: x[6])


## Book Tour Events!

Readers, I’m so happy to announce upcoming public events for my book tour, which starts in 2 weeks! Holy crap!

The details aren’t all entirely final, and there may be more events added later, but here’s what we’ve got so far. I hope I see some of you soon!

Events for Cathy O’Neil

Author of
WEAPONS OF MATH DESTRUCTION:

How Big Data Increases Inequality and Threatens Democracy

(Crown; September 6, 2016)

Thursday, September 8

7:00pm

Barnes & Noble Upper East Side

150 E 86th St.

New York, NY 10028

–

Tuesday, September 13

7:30pm

In Conversation Event

Town Hall Seattle

1119 8th Ave.

Seattle, WA 98101

Wednesday, September 14

12:00pm
Democracy/Citizenship Series

Mechanics’ Institute Library

57 Post St.

San Francisco, CA 94104

–

Wednesday, September 14

7:00pm
In Conversation with Lianna McSwain

Book Passage

51 Tamal Vista Blvd.

–

Thursday, September 15

9:00am

Privacy.Security.Risk. 2016

San Jose Marriott

301 S. Market Street

San Jose, CA 95113

Tuesday, September 20

6:30pm

In Conversation with Jen Golbeck

Busboys and Poets (w/Politics & Prose)

1025 5th Street NW

Washington, D.C. 20001

Monday, October 3

7:00pm

Harvard Book Store

1256 Mass Ave.

Cambridge, MA 02138

Saturday, October 22nd

12:00pm

Wisconsin Book Festival

Wisconsin Institutes for Discovery

DeLuca Forum

Sarah Breivogel, 212-572-2722, sbreivogel@penguinrandomhouse.com or
Liz Esman, 212-572-6049, lesman@penguinrandomhouse.com


## Chicago’s “Heat List” predicts arrests, doesn’t protect people or deter crime

A few months ago I publicly pined for a more scientific audit of the Chicago Police Department’s “Heat List” system. The excerpt from that blogpost:

…the Chicago Police Department uses data mining techniques of social media to determine who is in gangs. Then they arrest scores of people on their lists, and finally they tout the accuracy of their list in part because of the percentage of people who were arrested who were also on their list. I’d like to see a slightly more scientific audit of this system.

Thankfully, my request has officially been fulfilled!

Yesterday I discovered, via Marcos Carreiro on Twitter, that a paper has been written entitled Predictions put into practice: a quasi-experimental evaluation of Chicago’s predictive policing pilot, written by Priscillia Hunt and John S. Hollywood and published in the

The paper’s main result upheld my suspicions:

Individuals on the SSL are not more or less likely to become a victim of a homicide or shooting than the comparison group, and this is further supported by city-level analysis. The treated group is more likely to be arrested for a shooting.

Inside the paper, they make the following important observations. First, crime rates have been going down over time, and the “Heat List” system has not affected that trend. An excerpt:

…the statistically significant reduction in monthly homicides predated the introduction of the SSL, and that the SSL did not cause further reduction in the average number of monthly homicides above and beyond the pre-existing trend.

Here’s an accompanying graphic:

This is a really big and important point, one that smart people like Gillian Tett get thrown off by when discussing predictive policing tools. We cannot automatically attribute success to any policing policy in the context of meta-effects.

Next, being on the list doesn’t protect you:

However, once other demographics, criminal history variables, and social network risk have been controlled for using propensity score weighting and doubly-robust regression modeling, being on the SSL did not significantly reduce the likelihood of being a murder or shooting victim, or being arrested for murder.

But it does make it more likely for you to get surveilled by police:

Seventy-seven percent of the SSL subjects had at least one contact card over the year following the intervention, with a mean of 8.6 contact cards, and 60 % were arrested at some point, with a mean of 1.53 arrests. In fact, almost 90 % had some sort of interaction with the Chicago PD (mean = 10.72 interactions) during the year-long observation window. This increased surveillance does appear to be caused by being placed on the SSL. Individuals on SSL were 50 % more likely to have at least one contact card and 39 % more likely to have any interaction (including arrests, contact cards, victimizations, court appearances, etc.) with the Chicago PD than their matched comparisons in the year following the intervention. There was no statistically significant difference in their probability of being arrested or incapacitated (see Table 4). One possibility for this result, however, is that, given the emphasis by commanders to make contact with this group, these differences are due to increased reporting of contact cards for SSL subjects.

And, most importantly, being on the list means you are more likely to be arrested for a shooting, but being on the list doesn’t cause that to be true:

In other words, the additional contact with police did not result in an increased likelihood for arrests for shooting, that is, the list was not a catalyst for arresting people for shootings. Rather, individuals on the list were people more likely to be arrested for a shooting regardless of the increased contact.

That also comes with an accompanying graphic:

From now on, I’ll refer to Chicago’s “Heat List” as a way for the police to predict their own future harassment and arrest practices.


## What is alpha?

Last week on Slate Money I had a disagreement, or at least a lively discussion, with Felix Salmon and Josh Barro on the definition of alpha.

They said it was anything that a portfolio returned above and beyond the market return, given the amount of risk the portfolio was carrying. That’s not different from how Wikipedia defines alpha, and I’ve seen it said in more or less this way in a lot of places. Thus the confusion.

However, while working as a quant at a hedge fund, I was taught that alpha was the return of a portfolio that was uncorrelated to the market.

It’s a confusing thing to discuss, partly because the concept of “risk” is somewhat self-referential – more on that soon – and partly because we sometimes embed what’s called the capital asset pricing model (CAPM) into our assumptions when we talk about how portfolio returns work.

Let’s start with the following regression, which refers to stock-based portfolios, and which defines alpha:

$R_{i, t} - R_f = \alpha + \beta (R_{M, t} - R_f) + \epsilon_t$

Now, the term $R_f$ refers to the risk-free rate, or in other words how much interest you get on US treasuries, which we can approximate by 0 because it’s easier to ignore it and because it’s actually pretty close to 0 anyway. That cleans up our formula:

$R_{i, t} = \alpha + \beta R_{M, t} + \epsilon_t$

In this regression, we are fitting the coefficients $\alpha$ and $\beta$ to many instances of time windows where we’ve measured our portfolio’s return $R_{i, t}$ and the market’s return $R_{M, t}.$ Think of the market as the S&P500 index, and think of the time windows as days.

So first, defining alpha with the above regression does what I claimed it would do: it “picks off” the part of the portfolio’s returns that is correlated to the market and puts it in the beta coefficient, and the rest is left to alpha. If beta is 1, alpha is 0, and the error terms are all zero, you are following the market exactly.
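For anyone who wants to see the regression run, here’s a minimal sketch on synthetic daily returns. The returns are simulated, not real market data, and the “true” alpha and beta are numbers I picked for illustration. It fits $\alpha$ and $\beta$ by ordinary least squares with the risk-free rate set to 0, as above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated daily returns; true_alpha and true_beta are arbitrary illustration values.
n_days = 2000
true_alpha, true_beta = 0.0002, 1.3
market = rng.normal(0.0004, 0.01, n_days)   # stand-in for R_{M,t}
noise = rng.normal(0.0, 0.005, n_days)      # stand-in for epsilon_t
portfolio = true_alpha + true_beta * market + noise  # stand-in for R_{i,t}

# Ordinary least squares fit of portfolio = alpha + beta * market + error.
X = np.column_stack([np.ones(n_days), market])
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(X, portfolio, rcond=None)
residuals = portfolio - (alpha_hat + beta_hat * market)

print("alpha estimate:", alpha_hat)
print("beta estimate:", beta_hat)
print("correlation of residuals with market:", np.corrcoef(residuals, market)[0, 1])
```

By construction the fitted residuals are uncorrelated with the market, which is the sense in which beta has “picked off” the market-correlated part of the returns.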

On the other hand, the above formulation also seems to support Felix’s suggestion that alpha is the return that is not accounted for by risk. The thing is, it’s true, at least according to the CAPM theory of investing, which says you can’t do better than the market, that you’re rewarded for your market risk in a direct way, and that everyone knows this and refuses to take on other, unrewarded risks. In particular, alpha in the above equation should be zero, but anything “extra” that you earn beyond the expected market returns would be represented by alpha in the above regression.

So, are we actually agreeing?

Well, no. The two approaches to defining alpha are very different. In particular, my definition has no reference to CAPM. Say for a moment we don’t believe in CAPM. We can still run the regression above. All we’re doing, when we run that regression, is measuring the extent to which our portfolio’s returns are “explained” by its overlap with the market.

In particular, we do not expect the true risk of our portfolio to be apparent in the above equation. Which brings us to how risk is defined, and it’s weird, because it cannot be directly measured. Instead, we typically infer risk from the volatility – computed as standard deviation – of past returns.

This isn’t a terrible idea, because if something moves around wildly on a daily basis, it would appear to be pretty risky. But it’s also not the greatest idea, as we learned in 2008, because lots of credit instruments like credit default swaps move very little on a daily basis but then suddenly lose tremendous value overnight. So past performance is not always indicative of future performance.
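To make that concrete, here’s a toy sketch with simulated numbers (not real credit default swap data). The 40% overnight loss is an arbitrary figure I chose, and `volatility` is just a hypothetical helper name. It measures risk the standard way, as the sample standard deviation of past daily returns, before and after a single sudden loss.

```python
import numpy as np

def volatility(returns):
    """Sample standard deviation of a series of returns."""
    return float(np.std(returns, ddof=1))

rng = np.random.default_rng(1)

# A year of very quiet daily returns, then one sudden overnight loss.
calm = rng.normal(0.0, 0.001, 250)
with_crash = np.append(calm, -0.40)

vol_before = volatility(calm)
vol_after = volatility(with_crash)

print("volatility measured before the loss:", vol_before)
print("volatility measured after the loss: ", vol_after)
```

Anyone sizing their risk from `vol_before` would have seen an instrument that barely moved, right up until the day it mattered.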

But it’s what we’ve got, so let’s hold on to it for the discussion. The key observation is the following:

The above regression formula only displays the market-correlated risk, and the remaining risk is unmeasured. A given portfolio might have incredibly wild swings in value, but as long as they are uncorrelated to the market, they will be invisible to the above equation, showing up only in the error terms.
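Here’s a small sketch of that observation, again on simulated returns with parameters I made up. Two hypothetical portfolios, `quiet` and `wild`, have the same true alpha (0) and beta (1) and differ only in how much market-uncorrelated noise they carry. The helper `fit_alpha_beta` is my own name for a plain least-squares line fit.

```python
import numpy as np

def fit_alpha_beta(portfolio, market):
    """Least-squares fit of portfolio = alpha + beta * market + error."""
    beta, alpha = np.polyfit(market, portfolio, 1)  # highest degree first
    residuals = portfolio - (alpha + beta * market)
    return alpha, beta, residuals

rng = np.random.default_rng(2)
n_days = 5000
market = rng.normal(0.0, 0.01, n_days)

# Same market exposure, very different amounts of uncorrelated swing.
quiet = 1.0 * market + rng.normal(0.0, 0.001, n_days)
wild = 1.0 * market + rng.normal(0.0, 0.05, n_days)

a_q, b_q, res_q = fit_alpha_beta(quiet, market)
a_w, b_w, res_w = fit_alpha_beta(wild, market)

print("quiet: alpha=%.5f beta=%.3f residual std=%.4f" % (a_q, b_q, res_q.std()))
print("wild:  alpha=%.5f beta=%.3f residual std=%.4f" % (a_w, b_w, res_w.std()))
```

The fitted alphas and betas tell roughly the same story for both portfolios, though the estimates for `wild` are noisier. The only place the extra swings show up is in the spread of the residuals.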