Archive for the ‘data science’ Category

Why should you care about statistical modeling?

One of the major goals of this blog is to let people know how statistical modeling works. My plan is to explain as much as I can in plain English, with as little confusion and as much elucidation as possible at every level, so that every reader can take away at least a basic understanding.

Why? What’s so important about you knowing about what nerds do?

Well, there are different answers. First, you may be interested in it from a purely cerebral perspective; you may yourself be a nerd or a potential nerd. Since it is interesting, and since I suspect there will soon be many more job openings that use this stuff, there's nothing wrong with getting technical; it may come in handy.

But I would argue that even if it’s not intellectually stimulating for you, you should know at least the basics of this stuff, kind of like how we should all know how our government is run and how to conserve energy; kind of a modern civic duty, if you will.

Civic duty? Whaaa?

Here's why. There's an incredible amount of data out there, more than ever before, and certainly more than when I was growing up. I mean, sure, we always kept track of our GDP and the stock market; that's old-school data collection. And marketers and politicians have always experimented with different ads and campaigns and kept track of what does and doesn't work. That's all data too. But the sheer volume of data that we are now collecting about people and behaviors is positively stunning. Just think of it as a huge and exponentially growing data vat.

And with that data comes data analysis. This is a young field. Even though I encourage every nerd out there to consider becoming a data scientist, I know that if a huge number of them agreed to it today, there wouldn't be enough jobs out there for everyone. Even so, there soon will be. Every CEO of an internet startup should be seriously considering hiring a data scientist, if they don't have one already. The power in data mining is immense and it's only growing. And as I said, the field is young but it's growing in sophistication rapidly, for good and for evil.

And that gets me to the evil part, and with it the civic duty part.

I claim two things. First, that statistical modeling can and does get out of hand, which I define as when it starts controlling things in a way that is not intended or understood by the people who built the model (or who use the model, or whose lives are affected by the model). And second, that by staying informed about what models are, what they aren’t, what limits they have and what boundaries need to be enforced, we can, as a society, live in a place which is still data-intensive but reasonable.

As evidence for my first claim, I point you to the credit crisis. In fact finance is a field not so different from others like politics and marketing, except that it is years ahead in terms of data analysis. It was and still is the most data-driven, sophisticated place, where models rule and the people typically stand back passively and watch (and wait for the money to be transferred to their bank accounts). To be sure, it's not the fault of the models. In fact I firmly believe that nobody in the mortgage industry, for example, really believed that the various tranches of the mortgage-backed securities were in fact risk-free; they knew they were just getting rid of the risk with a hefty reward, and they left it at that. And yet the models were run, and their numbers were quoted, and people relied on them in an abstract way at the very least, and defended their AAA ratings because that's what the models said. It was a very good example of models being misapplied in situations that weren't intended or appropriate. The result, as we know, was and still is an economic breakdown, which came when the underlying numbers were revealed to be far, far different from what the models had predicted.

Another example, which I plan to write more about, is the value-added models being used to evaluate school teachers. In some sense this example is actually scarier than the modeling in finance, because in this case we are talking about people being fired based on a model that nobody really understands. Lives are ruined and schools are closed based on the output of an opaque process which even the model's creators do not really comprehend (I have seen a technical white paper for one of the currently used value-added models, and it's my opinion that the writer did not really understand modeling, or, if he did, chose not to explain it).

In summary, we are already seeing how statistical modeling can affect, and already has affected, all of us. And it's only going to get more pervasive. Sometimes it's actually really nice, like when I go to Pandora.com and learn about new bands besides Bright Eyes (is there really any band besides Bright Eyes?!). I'm not trying to stop cool types of modeling! I'm just saying, we wouldn't let a model tell us what to name our kids, or when to have them. We just like models to suggest cool new songs we'd like.

Actually, it’s a fun thought experiment to imagine what kind of things will be modeled in the future. Will we have models for how much insurance you need to pay based on your DNA? Will there be modeling of how long you will live? How much joy you give to the people around you? Will we model your worth? Will other people model those things about you?

I'd like to pause for a moment to mention a philosophical point about what models do. They make best guesses. They don't know anything for sure. In finance, a successful model is a model that makes the right bet 51% of the time. In data science we want to find out who is twice as likely to click a button, but even that subpopulation is still very unlikely to click! In other words, in terms of money, weak correlations and likelihoods pay off. But that doesn't mean they should decide people's fates.

My appeal is this: we need to educate ourselves on how the models around us work so we can spot a runaway model when we see one. We need to assert our right to have power over the models rather than the other way around. And to do that we need to understand how to create them and how to control them. And when we do, we should also demand that any model which does affect us be explained to us in terms we can understand as educated people.

Some R code and a data mining book

I'm very pleased to add some R code which does essentially the same thing as my python code for this post, which was about using Bayesian inference to think about women on boards of directors of S&P companies, and for this post, which was about measuring historical volatility for the S&P index. I have added the code to those respective posts. Hopefully the code will be useful for some of you to start practicing manipulating and visualizing data in the two languages.

Thanks very much to Daniel Krasner for providing the R code!

Also, I wanted to mention a really good book I'm reading about data mining, namely "Data Analysis with Open Source Tools," by Philipp Janert, published by O'Reilly. He wrote it without assuming much mathematics, but in a sophisticated manner. In other words, mathematicians won't mind the lack of explanation of the math, but the good news is he doesn't dumb down the craft of modeling itself. And I like his approach, which is to never complicate things with fancy methods and tools unless you have a very clear grasp of what they will mean and why they're going to improve the situation. In the end this is very similar to the book I would have imagined writing on data analysis, so I'm kind of annoyed that it's already written and so good.

Speaking of O'Reilly, I'll be at their "Strata: Making Data Work" conference next month here in New York; who's going to meet me there? It looks pretty great, and it will be a good chance to meet other people who are as in love with sexy data as I am.

Cool example of Bayesian methods applied to education

My friend Matt DeLand teamed up recently with Jared Chung to enter a data mining hacking contest sponsored by Donors Choose, which is a well-known online charity connecting low-income classrooms across the country to donors who get to choose which projects to support.

Their goal was to figure out how many of the thousands of projects up for funding were directly related to career preparation, and they performed a nifty Bayesian analysis to do it. Turns out it’s less than 1%!

Here’s their report. It’s really well explained in the 5-page pdf, if you have a few minutes.

Speaking of Donors Choose, it was featured at a HackNY Summer Fellows event I went to last week. The Summer Fellows program is essentially like the math camp for high school students that I taught at, except it's a computer camp for college students; same level of nerdy loveliness, though. The event was a showcase for the fantastically nerdy student hackers, and there were some very impressive exhibits.

The hack involving Donors Choose shows a movie, on a big map of the country, of donations traveling from wherever they were given to the classroom that benefits, and played quickly from 2005 or so it really exhibits how fast the concept grew. It's not unlike this visualization of the history of the world through the lens of Wikipedia.

What kind of math nerd job should you have?

Say you’re a math nerd, finishing your Ph.D. or a post-doc, and you’re wondering whether academics is really the place for you. Well I’ve got some advice for you! Actually I will have some advice for you, after you’ve answered a few questions. It’s all about fit. Since I know them best, I will center my questions and my advice around academic math vs. hedge fund quant vs. data scientist at a startup.

By the way, this is the advice I find myself telling people when they ask. It’s supposed to be taken over a beer and with lots of tongue in cheek.

1) What are your vices?

It turns out that the vices of the three jobs we are considering are practically disjoint! If you care about a good fit for your vices, then please pay attention.

NOTE: I am not saying that everyone in these fields has all of these vices! Far from it! It’s more like, if one or more of these vices drives you nuts, then you may get frustrated when you encounter them in these fields.

In academics, the major vices are laziness, envy, and arrogance. It's perhaps true that laziness (at least outside of research) is typically not rewarded until after tenure, but at that point it's pretty much expected, unless you want to be the fool who spends all of his (or her) time writing recommendation letters and actually advising undergraduates. Envy is, of course, a huge deal in academics, because the only actual feedback is in the form of adulating rumor. Finally, arrogance in academics is kind of too obvious to explain.

At a hedge fund, the major vices are greed, covetousness, and arrogance. The number one source of feedback is pay, after all, so it's all about how much you got (and how much your officemate got). Plus the isolation, even inside your own office, can lead to the feeling that you know more (and more interesting and valuable) things than anyone else; thus the arrogance.

Finally, at a startup, the major vices are vanity, impatience, and arrogance. People really care about their image, maybe because they are ready to jump ship and land a better job as soon as they start to smell something bad. Plus it's pretty easy in startups as well to live inside a bubble of self-importance and coolness and buzz. Thus the arrogance. On the flip side of vanity, startups are definitely the sexiest of the three, and the best source by far for good karaoke singers.

Okay it turns out they all have arrogance. Maybe that’s just a property of any job category.

2) What do you care about?

Do you care about titles? Don’t work at a startup.

Do you care about stability? Don’t work at a startup. Actually you might think I’d say don’t work at a hedge fund either, but I’ve found that hedge funds are surprisingly stable, and are full of people who are surprisingly risk averse. Maybe small hedge funds are unstable.

Do you care about feedback? Don’t work in academics.

Do you care about publishing? Don’t work outside academics (it’s sometimes possible to publish outside of academics but it’s not always possible and it’s not always easy).

Do you care about making lots of money? Don’t work in academics. In a startup you make a medium amount of money but there are stock options which may pan out someday, so it’s kind of in between academics and Wall St.

Do you care about being able to decide what you’re working on? Definitely stay in academics.

Do you care about making the world a better place? I’m still working on that one. There really should be a way of doing that if you’re a math nerd. It’s probably not Wall Street.

3) What do you not care about?

If you just like math, and don’t care exactly what kind of math you’re doing, then any of these choices can be really interesting and challenging.

If you don't mind super competitive and ethically questionable atmospheres, then you may really enjoy hedge fund quant work: the modeling is really interesting, the pay is good, and you are part of the world of finance and economics, which leaks into politics as well and is absolutely fascinating.

If you don’t mind getting nearly no vacation days and yet feeling like your job may blow up any minute, you may like working at a startup. The people there are real risk lovers, care about their quality of life (at least at the office!), and know how to throw a great party.

If you don’t mind being relatively isolated mathematically, and have enormous internal motivation and drive, then academics is a pretty awesome job, and teaching is really fun and rewarding. Also academic jobs have lots of flexibility as well as cool things like sabbaticals.

4) What about for women who want kids?

Let’s face it, the tenure clock couldn’t have been set up worse for women who want children. And startups have terrible vacation policies and child-care policies as well; it’s just the nature of living on a Venture Capitalist’s shoestring. So actually I’d say the best place to balance work and life issues is at an established hedge fund or bank, where the maternity policies are good; this is assuming though that your personality otherwise fits well with a Wall St. job. Actually many of the women I’ve met who have left academics for government research jobs (like at NASA or the NSA) are very happy as well.

Historical volatility on the S&P index

In a previous post I described the way people in finance often compute historical volatility in order to try to anticipate future moves in a single stock. I'd like to give a couple of big caveats about this method, as well as a worked example on daily returns of the S&P index, with the accompanying python code. I will use these results in a future post I'm planning about error bars and how people abuse and misuse them.

Two important characteristics of returns

First, market returns in general have fat-tailed distributions; things can seem "quiet" for long stretches of time (longer than any lookback window), during which the sample volatility is a possibly severe underestimate of the "true" standard deviation of the underlying distribution (if that even makes sense; for the sake of this discussion let's assume it does). Then when a fat-tailed event occurs, the sample volatility typically spikes to being an overestimate of the standard deviation of that distribution.

Second, in the markets there is clustering of volatility; another way of saying this is that volatility itself is rather auto-correlated, so even if we can't predict the direction of the return, we can still estimate the size of the return. This is particularly true right after a shock, and there are time series models like ARCH and its cousins that model this phenomenon; they in fact allow you to model an overall auto-correlated volatility, which can be thought of as a scaling for returns, and that allows you to approximate the normalized returns (returns divided by current volatility) as independent, although still not normal (because they are still fat-tailed even after removing the clustered volatility effect). See below for examples of normalized daily S&P returns with various decays.

Example: S&P daily returns

I got this data from Yahoo Finance, where they let you download daily S&P closes since 1950 into an Excel spreadsheet. I could have used some other instrument class, but then the below results would be stronger (especially for things like credit default swaps), not weaker; the S&P, being an index, is already the sum of a bunch of things and tends to be more normal as a result. In other words, the Central Limit Theorem is already taking effect on an intraday basis.

First let’s take a look at the last 3 years of closes, so starting in the summer of 2008:

Next we can look at the log returns for the past 3 years:

Now let’s look at how the historical volatility works out with different decays (decays are numbers less than 1 which you use to downweight old data: see this post for an explanation):

For each choice of the above decays, we can normalize the log returns to try to remove the "volatility clustering":

As we see, the long decay doesn’t do a very good job. In fact, here are the histograms, which are far from normal:

Here’s the python code I used to generate these plots from the data (see also R code below):

#!/usr/bin/env python

import csv
import os
from matplotlib.pylab import *
from numpy import *
from math import *

os.chdir('/Users/cathyoneil/python/sandp/')

# read the daily closes out of the Yahoo Finance csv download
dataReader = csv.DictReader(open('SandP_data.txt', 'rU'), delimiter=',', quotechar='|')

close_list = []
for row in dataReader:
    #print row["Date"], row["Close"]
    close_list.append(float(row["Close"]))
close_list.reverse()
close_array = array(close_list)
close_log_array = array([log(x) for x in close_list])
log_rets = array(diff(close_log_array))
perc_rets = array([exp(x)-1 for x in log_rets])

figure()
plot(close_array[-780:-1], label = "raw closes")
title("S&P closes for the last 3 years")
legend(loc=2)
#figure()
#plot(log_rets, label = "log returns")
#legend()
#figure()
#hist(log_rets, 100, label = "log returns")
#legend()
#figure()
#hist(perc_rets, 100, label = "percentage returns")
#legend()
#show()

# exponentially downweighted volatility estimate; d sets the decay factor 1 - 1/d
def get_vol(d):
    var = 0.0
    lam = 0.0
    var_list = []
    for r in log_rets:
        lam = lam*(1.0-1.0/d) + 1
        var = (1-1.0/lam)*var + (1.0/lam)*r**2
        var_list.append(var)
    return [sqrt(x) for x in var_list]

# volatility with three different decay factors, all on one plot
figure()
for d in [10, 30, 100]:
    plot(get_vol(d)[-780:-1], label = "decay factor %.2f" %(1-1.0/d))
title("Volatility in the S&P in the past 3 years with different decay factors")
legend()

# log returns normalized by the previous day's volatility, one figure per decay
for d in [10, 30, 100]:
    figure()
    these_vols = get_vol(d)
    plot([log_rets[i]/these_vols[i-1] for i in range(len(log_rets) - 780, len(log_rets)-1)], label = "decay %.2f" %(1-1.0/d))
    title("Volatility normalized log returns (last three years)")
    legend()

# raw log returns for comparison
figure()
plot([log_rets[i] for i in range(len(log_rets) - 780, len(log_rets)-1)], label = "raw log returns")
title("Raw log returns (last three years)")

# histograms of the normalized returns, one figure per decay
for d in [10, 30, 100]:
    figure()
    these_vols = get_vol(d)
    normed_rets = [log_rets[i]/these_vols[i-1] for i in range(len(log_rets) - 780, len(log_rets)-1)]
    hist(normed_rets, 100, label = "decay %.2f" %(1-1.0/d))
    title("Histogram of volatility normalized log returns (last three years)")
    legend()

Here’s the R code Daniel Krasner kindly wrote for the same plots:

setwd("/Users/cathyoneil/R")

dataReader <- read.csv("SandP_data.txt", header=T)
close_list <- as.numeric(dataReader$Close)
close_list <- rev(close_list)
close_log_list <- log(close_list)
log_rets <- diff(close_log_list)
perc_rets = exp(log_rets)-1

x11()
plot(close_list[(length(close_list)-779):(length(close_list))], type='l', main="S&P closes for the last 3 years", col='blue')
legend(125, 1300, "raw closes", cex=0.8, col="blue", lty=1)

# exponentially downweighted volatility estimate; d sets the decay factor 1 - 1/d
get_vol <- function(d){
    var = 0
    lam = 0
    var_list <- c()
    for (r in log_rets){
        lam <- lam*(1 - 1/d) + 1
        var = (1 - 1/lam)*var + (1/lam)*r^2
        var_list <- c(var_list, var)
    }
    return (sqrt(var_list))
}

L <- (length(close_list))

# volatility with three different decay factors, all on one plot
x11()
plot(get_vol(10)[(L-779):L], type='l', main="Volatility in the S&P in the past 3 years with different decay factors", col=1)
lines(get_vol(30)[(L-779):L], col=2)
lines(get_vol(100)[(L-779):L], col=3)
legend(550, 0.05, c("decay factor .90", "decay factor .97", "decay factor .99"), cex=0.8, col=c(1,2,3), lty = 1:3)

# log returns normalized by volatility, one panel per decay
x11()
par(mfrow=c(3,1))
plot((log_rets[2:L]/get_vol(10))[(L-779):L], type='l', col=1, lty=1, ylab='')
legend(620, 3, "decay factor .90", cex=0.6, col=1, lty = 1)
plot((log_rets[2:L]/get_vol(30))[(L-779):L], type='l', col=2, lty=2, ylab='')
legend(620, 3, "decay factor .97", cex=0.6, col=2, lty = 2)
plot((log_rets[2:L]/get_vol(100))[(L-779):L], type='l', col=3, lty=3, ylab='')
legend(620, 3, "decay factor .99", cex=0.6, col=3, lty = 3)

# raw log returns for comparison
x11()
plot(log_rets[(L-779):L], type='l', main = "raw log returns", col="blue", ylab='')

# histograms of the normalized returns, one panel per decay
par(mfrow=c(3,1))
hist((log_rets[2:L]/get_vol(10))[(L-779):L], breaks=200, col=1, lty=1, ylab='', xlab='', main='')
legend(2, 15, "decay factor .90", cex=.8, col=1, lty = 1)
hist((log_rets[2:L]/get_vol(30))[(L-779):L], breaks=200, col=2, lty=2, ylab='', xlab='', main='')
legend(2, 40, "decay factor .97", cex=0.8, col=2, lty = 2)
hist((log_rets[2:L]/get_vol(100))[(L-779):L], breaks=200, col=3, lty=3, ylab='', xlab='', main='')
legend(3, 50, "decay factor .99", cex=0.8, col=3, lty = 3)

What is an earnings surprise?

One of my goals for this blog is to provide a minimally watered-down resource for technical but common financial terms. It annoys me when I see technical jargon thrown around in articles without any references.

My audience for a post like this is someone who is somewhat mathematically trained, but not necessarily mathematically sophisticated, and certainly not knowledgeable about finance. I already wrote a similar post about what it means for a statistic to be seasonally adjusted here.

By way of very basic background, publicly traded companies (i.e. companies whose stock you can buy) announce their earnings once a quarter. They each have a different schedule for this, and their stock price often moves drastically after the announcement, depending on whether it's good news or bad news. They usually make their announcement before or after trading hours so that it's more difficult for news to leak and affect the price in weird ways in the minutes before and after the announcement, but even so, most insider trading is centered around knowing and trading on earnings before the official announcement. (Don't do this. It's really easy to trace. There are plenty of other ways to illegally make money on Wall Street that are harder to trace.)

In fact, there’s so much money at stake that there’s a whole squad of “analysts” whose job it is to anticipate earnings announcements. They are supposed to learn lots of qualitative information about the industry and the company and how it’s managed etc. Even so most analysts are pretty bad at forecasting earnings. For that reason, instead of listening to a specific analyst, people sometimes take an average of a bunch of analysts’ opinions in an effort to harness the wisdom of crowds. Unfortunately the opinions of analysts are probably not independent, so it’s not clear how much averaging is really going on.

The bottom line of the above discussion is that the concept of an earnings surprise is really only borderline technical, because it's possible to define it in a super naive, model-free way, namely as the difference between the "consensus among experts" and the actual earnings announcement. However, there's also a way to quantitatively model it, and the model will probably be as good as or better than most analysts' predictions. I will discuss this model now.

[As an aside, if this model works as well as or better than most analysts' opinions, why don't analysts just use this model? One possible answer is that, as an analyst, you only get big payoffs if you make a big, unexpected prediction which turns out to be true; you don't get much credit for being pretty close to right most of the time. In other words you have an incentive to make brash forecasts. One example of this is Meredith Whitney, who got famous for saying in October 2007 that Citigroup would get hosed. Of course it could also be that she's really pretty good at learning about companies.]

An earnings surprise is the difference between the actual earnings, known on day t, and a forecast of the earnings, known on day t-1. So how do we forecast earnings? A simple and reasonable way to start is to use an autoregressive model, which is a fancy way of saying we do a regression that tells us how past earnings announcements can be used as signals to predict future earnings announcements. For example, at first blush we may use the last earnings announcement as a best guess of the coming one. But then we may realize that companies tend to drift in the same direction for some number of quarters (we would find this kind of thing out by pooling data over lots of companies over lots of time), so we would actually care not just about what the last earnings announcement was but also the previous one or two or three. [By the way, this is essentially the same first step I want to use in the diabetes glucose level model, when I use past log levels to predict future log levels.]

The difference between two quarters ago and last quarter gives you a sense of the derivative of the earnings curve, and if you take an alternating sum over the past three you get a sense of the curvature or acceleration of the earnings curve.

It's even possible you'd want to use more than three past data points, but in that case, since the number of coefficients you are regressing is getting big, you'd probably want to place a strong prior on those coefficients in order to reduce the degrees of freedom; otherwise we would be fitting the coefficients to the data too much and we'd expect the model to lose predictive power. I will devote another post to describing how to put a prior on this kind of thing.

Once we have as good a forecast of the earnings knowing past earnings as we can get, we can try adding macroeconomic or industry-specific signals to the model and see if we get better forecasts – such signals would bring up or bring down the earnings for the whole industry. For example, there may be some manufacturing index we could use as a proxy to the economic environment, or we could use the NASDAQ index for the tech environment.

Since there is never enough data for this kind of model, we would pool all the data we had, for all the quarters and all the companies, and run a causal regression to estimate our coefficients. Then we would calculate an earnings forecast for a specific company by plugging in the past few quarterly earnings results for that company.
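
To make this concrete, here is a minimal sketch, in python with numpy, of the kind of pooled autoregressive regression I have in mind. Everything in it is hypothetical: the earnings array is fake data, the choice of three lags is arbitrary, and the ridge penalty is just one simple stand-in for the kind of prior mentioned above.

#!/usr/bin/env python
# Sketch: pooled AR(3) forecast of quarterly earnings, with a ridge penalty as a crude prior.
# The earnings array is fake data, just to show the mechanics.
import numpy as np

np.random.seed(0)
earnings = np.cumsum(np.random.randn(50, 12), axis=1)   # 50 fake companies, 12 quarters each

# pooled regression: predict quarter t from quarters t-1, t-2, t-3, across all companies
X, y = [], []
for company in earnings:
    for t in range(3, len(company)):
        X.append(company[t-3:t][::-1])   # most recent quarter first
        y.append(company[t])
X, y = np.array(X), np.array(y)

# ridge regression: the penalty lam shrinks the coefficients, acting like a simple prior
lam = 1.0
beta = np.linalg.solve(X.T.dot(X) + lam*np.eye(X.shape[1]), X.T.dot(y))

# forecast the next announcement for one company from its last three quarters;
# the earnings surprise on announcement day is then (actual earnings - forecast)
forecast = earnings[0, -3:][::-1].dot(beta)
print beta, forecast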

Glucose Prediction Model: absorption curves and dirty data

In this post I started visualizing some blood glucose data using python, and in this post my friend Daniel Krasner kindly rewrote my initial plots in R.

I am attempting to show how to follow the modeling techniques I discussed here in order to try to predict blood glucose levels. Although I listed a bunch of steps, I'm not going to follow them in exactly the order I wrote them there, even though I tried to list them in more or less the order in which we should at least consider them.

For example, it says first to clean the data. However, until you decide a bit about what your model will be attempting to do, you don’t even know what dirty data really means or how to clean it. On the other hand, you don’t want to wait too long to figure something out about cleaning data. It’s kind of a craft rather than a science. I’m hoping that by explaining the steps the craft will become apparent. I’ll talk more about cleaning the data below.

Next, I suggested you choose in-sample and out-of-sample data sets. In this case I will use all of my data for my in-sample data since I happen to know it’s from last year (actually last spring) so I can always ask my friend to send me more recent data when my model is ready for testing. In general it’s a good idea to use at most two thirds of your data as in-sample; otherwise your out-of-sample test is not sufficiently meaningful (assuming you don’t have that much data, which always seems to be the case).

Next, I want to choose my predictive variables. First, we should try to see how much mileage we can get out of predicting future blood glucose levels with past glucose levels. Keeping in mind that the previous post had us using log levels instead of raw glucose levels, since the distribution of log levels is more normal, we will actually be trying to predict log glucose levels (log levels) knowing past log glucose levels.

One good stare at the data will tell us that more than one past data point will probably be needed, since we see that there are pretty consistent moves upwards and downwards. In other words, there is autocorrelation in the log levels, which is to be expected, but we will want to look at the derivative of the log levels in the near past to predict the future log levels. The derivative can be computed by taking the difference of the most recent log level and the one previous to that.
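
Just to make those predictive variables concrete, here is a minimal sketch, with fake data standing in for the cleaned sensor readings; it builds the most recent log level and its first difference (the derivative) as signals and fits an ordinary least squares regression for the next log level.

#!/usr/bin/env python
# Sketch: lagged log level and its first difference as predictive variables.
# The "levels" series is fake data in place of the real (cleaned) sensor readings.
import numpy as np

np.random.seed(0)
levels = 100 + 20*np.sin(np.arange(200)/10.0) + np.random.randn(200)   # fake glucose levels
log_levels = np.log(levels)

X, y = [], []
for t in range(1, len(log_levels) - 1):
    last = log_levels[t]                     # most recent log level
    deriv = log_levels[t] - log_levels[t-1]  # recent derivative (first difference)
    X.append([1.0, last, deriv])             # constant term plus the two signals
    y.append(log_levels[t+1])                # the next log level, which we want to predict
X, y = np.array(X), np.array(y)

# ordinary least squares fit of the next log level on the signals
beta = np.linalg.lstsq(X, y)[0]
print beta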

Once we have the best model we can with just knowing past log levels, we will want to add reasonable other signals. The most obvious candidates are the insulin intakes and the carb intakes. These are presented as integer values with certain timestamps. Focusing on the insulin for now, if we know when the insulin is taken and how much, we should be able to model how much insulin has been absorbed into the blood stream at any given time, if we know what the insulin absorption curve looks like.

This leads to the question: what does the insulin absorption (rate) curve look like? I've heard that it's pretty much bell-shaped, with a maximum at 1.5 hours from the time of intake, so it looks more or less like a normal distribution's probability density function. It remains to guess what the maximum height should be, but it very likely depends linearly on the amount of insulin that was taken. We also need to guess at the standard deviation, although we have a pretty good head start knowing the 1.5 hours clue.
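
Here is a sketch of what such an absorption curve might look like in code. The 1.5-hour peak comes from the discussion above, but the 0.75-hour standard deviation and the linear scaling by dose are guesses we would have to fit.

#!/usr/bin/env python
# Sketch: bell-shaped insulin absorption rate, peaking 1.5 hours after intake.
# The width (0.75 hours) and the linear scaling by dose are guesses, not fitted values.
from math import exp, sqrt, pi

def insulin_absorption_rate(units, hours_since_intake, peak=1.5, sigma=0.75):
    # normal density centered at the peak time, scaled linearly by the dose
    z = (hours_since_intake - peak)/sigma
    return units*exp(-0.5*z*z)/(sigma*sqrt(2*pi))

# absorption rate of a 4-unit bolus, every half hour for four hours
for half_hours in range(9):
    t = 0.5*half_hours
    print t, insulin_absorption_rate(4.0, t)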

Next, the carb intakes will be similar to the insulin intakes but trickier, since there is more than one type of carb and different types get absorbed at different rates, though all are absorbed by the bloodstream in a vaguely similar way, which is to say like a bell curve. We will have to be pretty careful when we add the carb intake model, since the overall model will probably depend dramatically on our choices.

I’m getting ahead of myself, which is actually kind of good, because we want to make sure our hopeful path is somewhat clear and not too congested with unknowns. But let’s get back to the first step of modeling, which is just using past log glucose levels to predict the next glucose level (we will later try to expand the horizon of the model to predict glucose levels an hour from now).

Looking back at the data, we see gaps and we see crazy values sometimes. Moreover, we see crazy values more often near the gaps. This is probably due to the monitor crapping out near the end of its life and also near the beginning. Actually the weird values at the beginning are easy to take care of: since we are going to work causally, we will know there had been a gap and the data just restarted, so we will know to ignore the values for a while (we will determine how long shortly) until we can trust the numbers. But it's much trickier to deal with crazy values near the end of the monitor's life, since, working causally, we won't be able to look into the future and see that the monitor will die soon. This is a pretty serious dirty data problem, and the regression we plan to run may be overly affected by the crazy crapping-out-monitor values if we don't figure out how to weed them out.
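
As a tiny illustration of that causal "ignore the values for a while after a restart" rule (the easy case above), here is a sketch; the burn-in window of 12 readings (an hour of five-minute data) is a made-up choice we would have to tune.

#!/usr/bin/env python
# Sketch: causally distrust readings for a burn-in window after any gap (or at the start).
# The window of 12 readings (one hour of five-minute data) is a made-up choice.

def flag_trustworthy(readings, burn_in=12):
    # readings is a list of sensor values with None marking gaps;
    # returns a parallel list of booleans, False until burn_in good readings follow a gap
    flags, since_gap = [], 0
    for r in readings:
        if r is None:
            since_gap = 0
            flags.append(False)
        elif since_gap < burn_in:
            since_gap += 1
            flags.append(False)
        else:
            flags.append(True)
    return flags

print flag_trustworthy([100, None, 105, 110, 112], burn_in=2)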

There are two things that may help. First, the monitor also has a data feed which is trying to measure the health of the monitor itself. If this monitor-of-the-monitor is good, it may be exactly what we need to decide, "uh-oh, the monitor is dying, stop trusting the data." The second possible saving grace is that my friend also measured his blood glucose levels manually and input those numbers into the machine, which means we have a way to check the two sets of numbers against each other. Unfortunately he didn't do this every five minutes (well, actually that's a good thing for him), and in particular during the night there were long stretches of time when we don't have any manual measurements.

A final thought on modeling. We've mentioned three sources of signals, namely past blood glucose levels, insulin absorption forecasts, and carbohydrate absorption forecasts. There are a couple of other variables that are known to affect blood glucose levels, namely the time of day and the amount of exercise the person is doing. We won't have access to exercise, but we do have access to timestamps. So it's possible we can incorporate that data into the model as well, once we have some idea of how the glucose is affected by the time of day.

Women on a board of directors: let’s use Bayesian inference

I wanted to show how to perform a “women on the board of directors” analysis using Bayesian inference. What this means is that we need to form a “prior” on what we think the distribution of the answer could be, and then we update our prior with the data available.  In this case we simplify the question we are trying to answer: given that we see a board with 3 women and 7 men (so 10 total), what is the fraction of women available for the board of directors in the general population? The reason we may want to answer this question is that then we can compare the answer to other available answers, derived other ways (say by looking at the makeup of upper level management) and see if there’s a bias.

In order to illustrate Bayesian techniques, I've simplified it further to be a discrete question.  So I've pretended that there are only 11 answers you could possibly have, namely that the fraction of available women (in the population of people qualified to be put on the board of directors) is 0%, 10%, 20%, …, 90%, or 100%.

Moreover, I’ve put the least judgmental prior on the situation, namely that there is an equal chance for any of these 11 possibilities.  Thus the prior distribution is uniform:

We have absolutely no idea what the fraction of qualified women is.

The next step is to update our prior with the available data.  In this case we have the data point that there is a board with 3 women and 7 men.  We are sure that there are some women and some men available, so the updated probability of there being 0% women or 100% women should both be zero (and we will see that this is true).  Moreover, we would expect to see that the most likely fraction will be 30%, and we will see that too.  What Bayesian inference gives us, though, is the relative probabilities of the other possibilities, based on the likelihood of the data under each of them.  So for example if we assume for the moment that 70% of the qualified people are women, what is the likelihood that the board ends up being 3 women and 7 men?  We can compute that as (0.70)^3*(0.30)^7 (technically the likelihood also includes a binomial coefficient counting which 3 of the 10 seats go to women, but it's the same for every hypothesis, so it washes out when we normalize).  We multiply that by 1/11, the probability that 70% is the right answer (according to our prior), to get the "unscaled posterior distribution", or the likelihoods of each possibility.  Here's a graph of these numbers when I do it for all 11 possibilities:

We learn the relative likelihoods of the outcome "3 out of 10" given the various ratios of women

In order to make this a probability distribution we need to make sure the total adds up to 1, so we scale to get the actual posterior distribution:

We scale these to add up to 1

What we observe is, for example, that it’s about twice as likely for 50% of women to be qualified as it is for 10% of women to be qualified, even though those answers are equally distant from the best guess of 30%.  This kind of “confidence of error” is what Bayesian inference is good for.  Also, keep in mind that if we had had a more informed prior the above graph would look different; for example we could use the above graph as a prior for the next time we come across a board of directors.  In fact that’s exactly how this kind of inference is used: iteratively, as we travel forward through time collecting data.  We typically want to start out with a prior that is pretty mild (like the uniform distribution above) so that we aren’t skewing the end results too much, and let the data speak for itself.  In fact priors are typically of the form, “things should vary smoothly”; more on what that could possibly mean in a later post.

Here’s the python code I wrote to make these graphs:

#!/usr/bin/env python
from matplotlib.pylab import *
from numpy import *
# plot prior distribution:
figure()
bar(arange(0,1.1,0.1), array([1.0/11]*11), width = 0.1, label = "prior probability distribution")
xticks(arange(0,1.1,0.1) + 0.05, [str(x) for x in arange(0,1.1,0.1)] )
xlim(0, 1.1)
legend()
show()
# compute likelihoods for each of the 11 possible ratios of women:
likelihoods = []
for x in arange(0, 1.1, 0.1):
    likelihoods.append(x**3*(1-x)**7)
# plot unscaled posterior distribution:
figure()
bar(arange(0,1.1,0.1), array([1.0/11]*11)*array(likelihoods), width = 0.1, label = "unscaled posterior probability distribution")
xticks(arange(0,1.1,0.1) + 0.05, [str(x) for x in arange(0,1.1,0.1)] )
xlim(0, 1.1)
legend()
show()
# plot scaled posterior distribution:
figure()
bar(arange(0,1.1,0.1), array([1.0/11]*11)*array(likelihoods)/sum(array([1.0/11]*11)*array(likelihoods)), width = 0.1, label = "scaled posterior probability distribution")
xticks(arange(0,1.1,0.1) + 0.05, [str(x) for x in arange(0,1.1,0.1)] )
xlim(0, 1.1)
legend()
show()

Here’s the R code that Daniel Krasner wrote for these graphs:

barplot( rep((1/11), 11), width = .1, col="blue", main = "prior probability distribution")
likelihoods = c()
for (x in seq(0, 1.0, by = .1))
    likelihoods = c(likelihoods, (x^3)*((1-x)^7));
barplot(likelihoods, width = .1, col="blue", main = "unscaled posterior probability distribution")
# with a uniform prior, the scaled posterior is (prior * likelihood) divided by its sum
barplot((rep(1/11, 11)*likelihoods)/sum(rep(1/11, 11)*likelihoods), width = .1, col="blue", main = "scaled posterior probability distribution")

Woohoo!

First of all, I changed the theme of the blog, because I am getting really excellent comments from people but I thought it was too difficult to read the comments and to leave comments with the old theme. This way you can just click on the word “Go to comments” or “Leave a comment” which is a bit more self-evident to design-ignorant people like me.  Hope you like it.

Next, I had a bad day today, but I'm very happy to report that something has raised my spirits. Namely, Jake Porway from Data Without Borders and I have been corresponding, and I've offered to talk to prospective NGOs about data: what they should be collecting depending on what kind of studies they want to be able to perform, and how to store and revise it. It looks like it's really going to happen!

In fact his exact words were: I will definitely reach out to you when we’re talking to NPOs / NGOs.

Oh, and by the way, he also says I can blog about our conversations as well as my future conversations with those NGOs (as long as they're cool with it), which will be super interesting.

Oh, yeah.  Can I get a WOOHOO?!?

Women on S&P500 boards of directors

This is a co-post with FogOfWar.

Here's an interesting article about how many boards of directors of S&P500 companies consist entirely of men.  Turns out it's 47.  Well, but we'd expect some number of boards (out of 500) to consist entirely of men even if half of the overall pool of board members were women.  So the natural question arises: what is the most likely actual proportion of women, given this number 47 out of 500?

In fact we know that many people are on multiple boards, but for the sake of this discussion let's assume that there's a line of board seekers standing outside waiting to get in, that we will randomly distribute them to boards as they walk inside, and that we are wondering how many of them are women given that we end up with 47 all-male boards out of 500.  Also, let's assume there are 8 slots per board, which is of course a guess, but we can see how robust that guess is by changing it at the end.

By the way, I can think of two arguments as to why the simplifying assumption that nobody is on multiple boards might skew the results.  On the one hand, we all know it's an old boys' network, so there are a bunch of connections that a few select men enjoy which put them on a bunch of boards, which probably means the average number of boards that a man (who is on at least one board) sits on is pretty large.  On the other hand, it's also well known that, in order to seem diverse and modern, companies are trying to get at least a token woman on their board, and for some reason consider the task of finding a qualified woman really difficult.  Thus I imagine it's quite likely that once a woman has been invited to be on a board, and she's magically dubbed "qualified," approximately 200 other boards will immediately invite that same woman to be on their board ("Oh my god, they've actually found a qualified woman!").  In other words I imagine that the average number of boards a given woman is on, assuming she's on at least one, is probably even higher than for men, so our simplifying assumptions will in the end overestimate the number of women on boards.  But this is just a guess.

Now that I've written that argument down, I realize another reason our calculation below will overestimate women is the concept of tokenism: once a board has one woman, they may think their job is done, so to speak, in the diversity department.  I'm wishing I could really get my hands on the sizes and composition of each board and see how many of them have exactly one woman (and compare that to what you'd expect with random placement).  This could potentially prove (in the sense of providing statistically significant evidence for) a culture of tokenism.  If anyone reading this knows how to get their hands on that data, please write!

Now to the calculation.  Assuming, once more, that each board member is on exactly one board and that there are 8 people (randomly distributed) per board, what is the most likely percentage of overall women given that we are seeing 47 all-male boards out of 500?  This boils down to a biased coin problem (with the two sides labeled "F" and "M" for female and male) where we are looking for the bias.  For each board we flip the coin 8 times and see how many "F"s we get and how many "M"s we get, and that gives us our board.

First, what would the expected number of all-male boards be if the coin is unbiased?  Since expectation is additive and we are modeling the boards as independent, we just need to figure out the probability that one board is all-male and multiply by 500.  But for an unbiased coin that boils down to (1/2)^8  = 0.39%, so after multiplying by 500 we get 1.95, in other words we’d expect 2 all-male boards.  So the numbers are definitely telling us that we should not be expecting 50% women.  What is the most likely number of women then?  In this case we work backwards: we know the answer is 47, so divide that by 500 to get 0.094, and now find the probability p of the biased coin landing on F so that all-maleness has probability 0.094.  This is another way of saying that (1-p)^8 = 0.094, or that 1-p is 0.744, the eighth root of 0.094.  So our best guess is p = 25.6%.  Here’s a table with other numbers depending on the assumed size of the boards:
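
For what it's worth, here is a minimal sketch of how the numbers in that table are computed, solving (1-p)^n = 47/500 for p; the board sizes below are just illustrative choices.

#!/usr/bin/env python
# Sketch: implied fraction of women p solving (1-p)^n = 47/500 for assumed board sizes n.
observed_all_male_fraction = 47.0/500

for n in [6, 8, 10, 12]:
    p = 1 - observed_all_male_fraction**(1.0/n)
    print n, "%.1f%%" % (100*p)   # board size, implied percentage of women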

If anyone reading this has a good sense of the distribution of the size of boards for the S&P500, please write or comment, so I can improve our estimates.

Step 0 Revisited: Doing it in R

June 25, 2011

A nerd friend of mine kindly rewrote my python scripts in R and produced similar looking graphs.  I downloaded R from here, and one thing that's cool is that once it's installed, if you open an R source file (ending in ".R"), an R console pops up automatically and you can just start working.  Here's the code:

gdata <- read.csv('large_data_glucose.csv', header=TRUE)
#We can open a spreadsheet type editor to check out and edit the data:
edit(gdata)
#Since we are interested in the glucose sensor data, column 31, but the name is a bit awkward to deal with, a good thing to do is to change it:
colnames(gdata)[31] <- "GSensor"

#Lets plot the glucose sensor data:
plot(gdata$GSensor, col="darkblue")

#Here's a histogram plot:
hist(gdata$GSensor, breaks=100, col="darkblue")
#and now lets plot the logarithm of the data:
hist(log(gdata$GSensor), breaks=100, col="darkblue")

And here are the plots:

Sensor_Glucose_plot

Sensor_Glucose_histogram

Log_Sensor_Glucose_histogram

One thing my friend mentions is that R automatically skips missing values (whereas we had to deal with them directly in python).  He also mentions that other things can be done in this situation, and to learn more we should check out this site.

R seems to be really good at this kind of thing, that is to say doing the first thing you think of doing with data.  I am wondering how it compares to python when you have to really start cleaning and processing the data before plotting.  We shall see!

Step 0: Installing python and visualizing data

A friend of mine has type 1 diabetes, and lots of data (glucose levels every five minutes) from his monitor.  We've talked on and off about how to model future (as in one hour hence) glucose levels, using information on the current level, insulin intake, and carb intake.  He was kind enough to allow me to work on this project on this blog.  It's an exciting and potentially really useful project, and it will be great to use as an example for each step of the modeling process.

To be clear:  I don’t know if I will be able to successfully model glucose levels (or even better be able to make suggestions for how much insulin or carbs to take in order to keep glucose levels within reasonable levels), but it’s exciting to try and it’s totally worth a try.  I’m counting on you to give me suggestions if I’m being dumb and missing something!

I decided to use python to do my modeling, and I went to this awesomely useful page and followed the instructions to install python and matplotlib on my oldish MacBook. It worked perfectly (thanks, nerd who wrote that page!).

The data file, which contains 3 months of data, is a csv (comma separated values) file, with the first line describing the name of the values in the lines below it:

Index,Date,Time,Timestamp,New Device Time,BG Reading (mg/dL),Linked BG Meter ID,Temp Basal Amount (U/h),Temp Basal Type,Temp Basal Duration (hh:mm:ss),Bolus Type,Bolus Volume Selected (U),Bolus Volume Delivered (U),Programmed Bolus Duration (hh:mm:ss),Prime Type,Prime Volume Delivered (U),Suspend,Rewind,BWZ Estimate (U),BWZ Target High BG (mg/dL),BWZ Target Low BG (mg/dL),BWZ Carb Ratio (grams),BWZ Insulin Sensitivity (mg/dL),BWZ Carb Input (grams),BWZ BG Input (mg/dL),BWZ Correction Estimate (U),BWZ Food Estimate (U),BWZ Active Insulin (U),Alarm,Sensor Calibration BG (mg/dL),Sensor Glucose (mg/dL),ISIG Value,Daily Insulin Total (U),Raw-Type,Raw-Values,Raw-ID,Raw-Upload ID,Raw-Seq Num,Raw-Device Type
1,12/15/10,00:00:00,12/15/10 00:00:00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,28.4,ResultDailyTotal,"AMOUNT=28.4, CONCENTRATION=null",5472682886,50184670,236,Paradigm 522
2,12/15/10,00:04:00,12/15/10 00:04:00,,,,,,,,,,,,,,,,,,,,,,,,,,,120,16.54,,GlucoseSensorData,"AMOUNT=120, ISIG=16.54, VCNTR=null, BACKFILL_INDICATOR=null",5472689886,50184670,4240,Paradigm 522
3,12/15/10,00:09:00,12/15/10 00:09:00,,,,,,,,,,,,,,,,,,,,,,,,,,,116,16.21,,GlucoseSensorData,"AMOUNT=116, ISIG=16.21, VCNTR=null, BACKFILL_INDICATOR=null",5472689885,50184670,4239,Paradigm 522

I made a new directory below my home directory for this file and for the python scripts to live in, and I started up python from the command line inside that directory.  Then I opened emacs (it could have been TextEdit or any other editor you like) to write a simple script to look at my data.

A really easy way of importing this kind of file into python is to use a DictReader.  DictReader is looking for a file formatted exactly as this file is, and it’s easy to use.  I wrote this simple script to take a look at the values in the “Sensor Glucose” field (note there are sometimes gaps and I had to decide what to do in that case):

#!/usr/bin/env python
import csv
from matplotlib.pylab import *
dataReader = csv.DictReader(open('Jason_large_dataset.csv', 'rU'), delimiter=',', quotechar='|')
i=0
datalist = []
for row in dataReader:
    print i, row["Sensor Glucose (mg/dL)"]
    if row["Sensor Glucose (mg/dL)"] == "":
        datalist.append(-1)
    else:
        datalist.append(float(row["Sensor Glucose (mg/dL)"]))
    i+=1
    continue
print min(datalist), max(datalist)
scatter(arange(len(datalist)), datalist)
show()

And this is the picture that popped out:

Taking a quick look at the Glucose levels

I don't know how easy it is to see this, but there are lots of gaps (when there's a gap I plotted a dot at -1, and the line at -1 looks pretty thick).  Moreover, it's clear this data is being kept in a pretty tight range (probably good news for my friend).  Another thing you might notice is that the data looks more likely to be in the lower half of the range than in the upper half.  To get at this we will draw a histogram of the data, but this time we will *not* fill in gaps with a bunch of fake "-1"s since that would throw off the histogram.  Here are the lines I added in the code:

#!/usr/bin/env python
import csv
from matplotlib.pylab import *
dataReader = csv.DictReader(open('Jason_large_dataset.csv', 'rU'), delimiter=',', quotechar='|')
i=0
datalist = []
skip_gaps_datalist = []
for row in dataReader:
    print i, row["Sensor Glucose (mg/dL)"]
    if row["Sensor Glucose (mg/dL)"] == "":
        datalist.append(-1)
    else:
        datalist.append(float(row["Sensor Glucose (mg/dL)"]))
        skip_gaps_datalist.append(float(row["Sensor Glucose (mg/dL)"]))
    i+=1
    continue
print min(datalist), max(datalist)
figure()
scatter(arange(len(datalist)), datalist)
figure()
hist(skip_gaps_datalist, bins = 100)
show()

And this is the histogram that resulted:

This is a pretty skewed, pretty long right-tailed distribution.  Since we know the data is always positive (it’s measuring the presence of something in the blood stream), and since the distribution is skewed, this makes me consider using the log values instead of the actual values.  This is because, as a rule of thumb, it’s better to use variables that are more or less normally distributed.  To picture this I replace one line in my code:

skip_gaps_datalist.append(log(float(row["Sensor Glucose (mg/dL)"])))

And this is the new histogram:

This is definitely more normal.

Next time we will talk more about cleaning this data and what other data we will use for the model.

Data Without Borders

How freaking cool is this?!  I signed up today and wrote to the founder, Jake Porway.  He seems fantastic.  I'm very excited about his project and about how we (meaning you and me, kind reader) can put on our data scientist hats and help NGOs think about what data to collect and how to analyze it once they have it.  Please consider signing up!

Quantitative risk management

June 19, 2011

After the credit crisis hit we all realized that there’s a lot more risk out there than can be described by trailing volatility measures.  Once I decided to leave the hedge fund world, I was thinking about working for the “other side,” namely to help quantify risk and/or work on the side of the regulators.  I applied to the SEC, the New York Fed, and Riskmetrics, a software company which had a good reputation.  I never heard from the Fed, and the SEC didn’t seem to have something for me, but I landed a job at Riskmetrics.

I figured it this way: if you work on risk in a good way, if you make a better risk model, then you can at least argue you are improving the world.  If you are instead making a bad risk model, and you know it, then you're making the world a worse, riskier place.  For example, if you are working for a rating agency and get paid to ignore signs of riskiness, then that would be the not-improving-the-world kind.

I really enjoyed my job, and after some months I was put in charge of "risk methodology," which meant I got to think about how to quantify risk and why.  I worked on our credit default model, which was super interesting, and I got to talk regularly to the head trader of one of the biggest CDS trading desks to understand the details of the market.  In fact many of the biggest hedge funds and banks and pension funds send their portfolios daily to companies such as Riskmetrics to get overnight assessments of the riskiness of their portfolios.  The bottom line is that my job kind of rocked, but it didn't last forever; we were acquired soon after that by a company which didn't offer me the same kind of position, and I left pretty soon.

Here's an article that very clearly articulates some of the problems in the field of quantitative risk.  In my opinion it doesn't go far enough with respect to its last point, or maybe it misses something, where it talks about "forecasting extreme risks."  This refers to the kind of thing that happens in a crisis, when all sorts of people are pulling out of the market at the same time and there are cascading, catastrophic losses.

What gets to me about this is that everyone talks about moments like these as if they can't be modeled, but of course they can be, to a limited extent.  Namely, although we don't know what the next huge crisis will be, there are a few obvious candidates (like Greece, Portugal, Ireland, or the U.S. defaulting on their debt) which we should be keeping an eye on to the best of our quantitative abilities.  Many of the "panic" situations (like the mortgage-backed securities debacle) were pretty obvious risks weeks or months in advance of their occurring, but people just didn't know how to anticipate the consequences.  That's fine for a given individual trader but shouldn't be true for the government.

I think the first step should be to compile a longish list of possible disaster scenarios (including the ones we've already seen happen) and decide what the probability of each scenario is; these probabilities can be updated each week by a crew of economists or what have you.  Secondly, and separately, set up a quantitative model which tries to capture the resulting cascade of consequences that each scenario would create; this would be complicated and involve things like guessing the losses at which hedge funds start liquidating their books, but it should be aided by amassing huge amounts of information about the underlying portfolios of the largest institutions.

In my opinion the regulators have made a huge mistake in the past three years by _not_ insisting on getting the entire portfolio from every major hedge fund and bank every night.  From the above we know this is possible, since those institutions already send their portfolios to Riskmetrics-like software companies, although I've read articles where they claim this would be way too onerous a task.  With that deep information, we could model the effect of each crisis scenario from our list: how would it affect the bond market?  The CDS market?  The model which already exists at quantitative hedge funds now, which measures the impact and decay of trades, is a great start.  Moreover, this model is not impossible to train (i.e. the actual coefficients inside the model's formulas aren't that hard to estimate); in fact it wouldn't be that big a deal if we had as much data as I'm talking about.  To me it's unbelievable that we aren't getting this portfolio information every day (or even intraday) and creating a "systemic impact model," because it would clearly make us better prepared for future events (although not, of course, perfectly prepared), and no hedge fund or bank could argue that we shouldn't be worried; it should be one of the costs of doing business on Wall Street.

Data mining contests

So a friend of mine came over last night; he recently became a data scientist at a New York startup too.  In fact we have an eerie number of things in common, although he only considered working in finance and didn’t actually go through with it.  It was pretty awesome to see him.

He was also pretty into the idea of this blog and making quantitative techniques more open-source and collaborative.  And with that goal in mind he sent me these links:

  1. http://www.heritagehealthprize.com/c/hhp
  2. http://www.kaggle.com/

So what do you guys say?  Should we work on something together on this blog that might actually help the world and/or make us some prize money?  That would be filthy good.

I’m also hoping to get this guy to make a guest post on some quantitative techniques he wants to add to my list.  Please comment if you have more suggestions!  I will start writing about the list topics very soon.

The basics of quantitative modeling

One exciting goal I have for this blog is to articulate the basic methods of quantitative modeling, followed by, hopefully, collaborative real-time examples of how this craft works out in practice.  Today I just want to outline the techniques; in later posts I will go into more detail on one or more of them.

  • Data cleaning: bad data (corrupt) vs. outliers (actual data which have unusual values)
  • In-sample vs. out-of-sample data
  • Predictive variables: choosing which ones, how many, and how to prepare them
  • Exponential down-weighting of “old” data
  • Remaining causal: predictive vs. descriptive modeling
  • Regressions: linear and multivariate, with exponentially down-weighted data (see the sketch after this list)
  • Bayesian priors and how to implement them
  • Open source tools
  • When do you have enough data?
  • When do you have statistically significant results?
  • Visualizing everything
  • General philosophy of avoiding overfitting your model to the data
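To give a flavor of the regression item above, here is a minimal sketch of a linear regression with exponentially down-weighted “old” data.  The half-life, function names, and fake data are my own assumptions for illustration; later posts will get into how you’d actually choose them.

    import numpy as np

    def exp_weights(n, half_life):
        # newest observation gets weight 1; the weight halves every
        # `half_life` observations into the past
        ages = np.arange(n)[::-1]
        return 0.5 ** (ages / half_life)

    def down_weighted_fit(x, y, half_life=50):
        # weighted least squares fit of y ~ a + b*x, solving (X'WX) beta = X'Wy
        w = exp_weights(len(x), half_life)
        X = np.column_stack([np.ones_like(x), x])
        WX = X * w[:, None]
        return np.linalg.solve(X.T @ WX, WX.T @ y)  # [intercept, slope]

    # illustrative usage with fake data
    x = np.linspace(0, 10, 200)
    y = 2.0 * x + 1.0 + np.random.normal(scale=0.5, size=len(x))
    print(down_weighted_fit(x, y))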

For those of you reading this who know a thing or two about being a quant, please do tell me if I’ve missed something.

I can’t wait!

What is seasonal adjustment?

One thing that kind of drives me crazy in economic or business news (which I’m frankly addicted to (which makes me incredibly old and boring)) is the lack of precision exactly when there seems to be some actual data.  At the very moment when you think you’re going to be told the cold hard facts, so you can make up your own mind about whether the economy is still sucking or is finally recovering, you get a pseudo-statistic with a side of caveat.  I make it a point to try to formally separate the true bullshit from the stuff that actually is pretty informative if you know what they are talking about.  I consider “seasonal adjustment” pretty much in the latter category, although there are exceptions (more on that later).

So what does “seasonal adjustment” mean?  Let’s take an example: a common one is home sales.  It’s a well-known fact that people don’t buy as many homes in January and February as they do in May and June, due to some combination of buyers sitting in their houses eating ice cream straight from the Ben & Jerry’s container when it’s cold outside, and sellers being enraged by the dirty snow tracks left on their immaculate rugs during open houses.  So people delay house-hunting until Spring, and they delay house-selling until house-hunting starts (side note: because of this, desperate people getting divorced or being forced to move often have to sell their houses at major discounts, so always do your house-hunting right after a huge blizzard).

Considering the cyclical and predictable nature of home sales, people want to “seasonally adjust” the data so that they can discern a move that is *not* due to the time of year; in other words, they want to detect whether a more macroeconomic issue is affecting home sales, such as a recession or housing glut (or both).  It’s a reasonable approach, but how does it work exactly?

Say you have a bunch of housing data, maybe 20 years of monthly home sales.  You see that every single year the same pattern emerges, more or less.  Then you could, for a given year, compute the average sales per month for that year.  It’s important to compute this average, as we will see, because one golden rule of adjusting data is that the sum of the adjusted data must equal the sum of the original data; otherwise you introduce a problem that’s bigger than the one you’re solving.

Once you have the average sales per month, you figure out (using all 20 years) the typical divergence from that average for each month, expressed as a percentage of that year’s average month.  So for example, January is the worst month for home sales, and in the 20 years of data you see that on average there are 20% fewer home sales in January than in the average month of that year, whereas in June there are typically (in your sample) 15% more sales than in the average month of that year.  Using this historical data, you come up with a number for each month (-20% for January, +15% for June, etc.).  Now I can finally say what “seasonally adjusted” means: it is the rate of sales you would infer for an average month, or for the whole year, given these numbers.  So if we saw 80,000 home sales in January, and our number for January is -20%, then we would report a seasonally adjusted rate of 100,000 sales per month, or 1.2 million sales per year.
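Here is a minimal sketch of that calculation in Python, assuming the historical data come as a 20-by-12 table of monthly sales counts.  The array shape, function names, and example numbers are my own illustration, not anyone’s official adjustment methodology.

    import numpy as np

    def monthly_factors(sales):
        # sales: array of shape (n_years, 12) of monthly home sales counts;
        # returns each month's average percentage deviation from that year's average month
        sales = np.asarray(sales, dtype=float)
        yearly_avg = sales.mean(axis=1, keepdims=True)   # average month, per year
        deviations = sales / yearly_avg - 1.0            # e.g. -0.20 for a weak January
        return deviations.mean(axis=0)                   # average the deviations over the years

    def adjusted_annual_rate(observed, month_index, factors):
        # turn one month's raw sales into a seasonally adjusted annual rate
        adjusted_monthly = observed / (1.0 + factors[month_index])
        return 12 * adjusted_monthly

    # if January (index 0) runs 20% below an average month, then 80,000 January
    # sales annualize to 80,000 / 0.8 * 12 = 1.2 million:
    # factors = monthly_factors(historical_sales)
    # print(adjusted_annual_rate(80000, 0, factors))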

Note that this system of adjustment follows the golden rule at least for the historical data; by the end of each calendar year, we have attributed the correct overall number of sales, spread out over the months.  However, if we start predicting July sales from what we’ve seen of home sales from January through March, taking these adjustments into account, we will also be tacitly assuming an overall number of sales for the year, and the golden rule will probably not hold.  This is just another way of saying that we won’t really know how many home sales have occurred in a given year until the year is over, so duh.  But it’s not hard to believe that knowing these numbers is pretty useful if you want to make a ballpark estimate of the yearly rate of home sales and it’s only March.

A slightly more sophisticated way of doing this, which doesn’t depend as much on the calendar year, is to use the 20 years of data and a rolling 12-month window (i.e. where we add a month at the front and drop a month off the back, and thus always consider 12 consecutive months at a time) to compute the adjustment for each month relative not to the average for the upcoming year, but rather to the average of the past 12 months.  This has the advantage of being a causal model (i.e. a model which only uses data from the past to predict the future – I’ll write a post soon about causal modeling), but it has the disadvantage of not following the golden rule, at least over short periods.  For example, if housing sales are on a slow slide over months and months, this model will consistently lag the trend and overestimate how many homes should be selling.
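A minimal sketch of the rolling-window variant, assuming the data arrive as one flat chronological list of monthly sales (the window length and names are illustrative):

    import numpy as np

    def rolling_deviation(sales, window=12):
        # for each month, express sales relative to the average of the previous
        # `window` months -- a causal baseline, since it only uses past data
        sales = np.asarray(sales, dtype=float)
        out = []
        for t in range(window, len(sales)):
            baseline = sales[t - window:t].mean()
            out.append(sales[t] / baseline - 1.0)
        return np.array(out)

In a falling market the trailing baseline stays too high, which is exactly the lag problem described above.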

The biggest problem with seasonally adjusted numbers is, in my opinion, that the model itself is never described: do we use 20 years of historical data?  3 years?  Do we use a rolling window or calendar years?  Without this kind of information, I’m frankly left wishing they would just frigging show me the raw data and let me decide whether it’s good news or bad news.

 

A few comments have trickled in from friends (over email) who are quants, and I wanted to add them here.

  1. First, any prediction is hard and assumes a model, e.g., that each year is the same, or that each month is the same.  In other words, as soon as you are talking about something being surprisingly anything, you are modeling, even when you don’t think you are.  In fact, most assumptions go unnoticed.  Part of being a good quant is simply being able to list your modeling assumptions.
  2. As we will see when we discuss quant techniques further, a very important property of a model is how many independent data points go into it; this informs the calculation of statistical significance, for example.  The comment is that modeling seasonal adjustment as I’ve described above lowers your “number of independent data points” count by a factor of 12, because you are basically using all 12 months of a year to predict the _next year_, so what looked like 12 data points really becomes only one.  However, you could try to fit a curve with fewer than 12 parameters to the seasonal differences (see the sketch after this list), but then there’s overfitting risk from having chosen a family of curves that looks right.  More on questions like this when we explore the concept of fitting the model to the data, and in particular how many different models you try on a given data set.
  3. The final comment is this: all predictions likely violate the golden rule, but the point is that you at least want one that isn’t biased, so that it matches the rule in expectation.
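To make the second comment concrete, here is a minimal sketch of fitting a three-parameter curve to the 12 monthly seasonal factors, instead of treating each month’s factor as its own free parameter.  The factor values below are invented for illustration.

    import numpy as np

    months = np.arange(12)
    factors = np.array([-0.20, -0.15, -0.05, 0.02, 0.10, 0.15,
                         0.12, 0.08, 0.02, -0.02, -0.03, -0.04])  # made-up monthly factors

    # a*sin(2*pi*m/12) + b*cos(2*pi*m/12) + c is linear in (a, b, c),
    # so an ordinary least squares fit gives a 3-parameter seasonal curve
    X = np.column_stack([np.sin(2 * np.pi * months / 12),
                         np.cos(2 * np.pi * months / 12),
                         np.ones(12)])
    coeffs, *_ = np.linalg.lstsq(X, factors, rcond=None)
    print(coeffs)  # 3 numbers instead of 12 separate monthly factors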

Hello world! [stet]

Welcome to my new “mathbabe” blog!  I’d like to outline my aspirations for this blog, at least as I see it now.

First, I want to share my experiences as a female mathematician, for the sake of young women wanting to know what things are like as a professional woman mathematician.  Second, I want to share my experiences as an academic mathematician and as a quant in finance, and finally as a data scientist in internet advertising.  (Wait, did I say finally?)

I also want to share explicit mathematical and statistical techniques that I’ve learned by doing these jobs.  For some reason being a quant is treated like a closed guild, and I object to that, because these are powerful techniques that are not that difficult to learn and use.

Next I want to share thoughts and news on subjects such as mathematics and science education, open-source software packages, and anything else I want, since after all this is a blog.

Finally, I want to use this venue to explore new subjects using the techniques I have under my belt, and hopefully develop new ones.  I have a few in mind already and I’m really excited by them, and hopefully with time and feedback from readers some progress can be made.  I want to primarily focus on things that will actually help people, or at least have the potential to help people, and which lend themselves to quantitative analysis.

Woohoo!