Archive for the ‘open source tools’ Category

Data Viz

The picture below is a visualization of the complexity of algebra. The vertices are theorems and the edges between theorems are dependencies. Technically the edges should be directed, since if Theorem A depends on Theorem B, we shouldn’t have it the other way around too!

This comes from data mining my husband’s open source Stacks Project; I should admit that, even though I suggested the design of the picture, I didn’t implement it! My husband used graphviz to generate this picture – it puts heavily connected things in the middle and less connected things on the outside. I’ve also used graphviz to visualize the connections in databases (MySQL automatically generates the graph).

Here’s another picture which labels each vertex with a tag. I designed the tag system, which gives each theorem a unique identifier; the hope is that people will be willing to refer to the theorems in the project even though their names and theorem numbers may change (i.e. Theorem 1.3.3 may become Theorem 1.3.4 if someone adds a new result in that section). It’s also directed, showing you dependency (Theorem A points to Theorem B if you need Theorem A to prove Theorem B). This visualizes the results needed to prove Chow’s Lemma:

Some R code and a data mining book

I’m very pleased to add some R code which does essentially the same thing as my python code for this post, which was about using Bayesian inference to thing about women on boards of directors of S&P companies, and for this post, which was about measuring historical volatility for the S&P index. I have added the code to those respective posts. Hopefully the code will be useful for some of you to start practicing manipulating visualizing data in the two languages.

Thanks very much to Daniel Krasner for providing the R code!

Also, I wanted to mention a really good book I’m reading about data mining, namely “Data Analysis with Open Source Tools,” by Phillipp Janert, published by O’Reilly. He wrote it without assuming much mathematics, but in a sophisticated manner. In other words, for people who are mathematicians, the lack of explanation of the math will be fine, but the good news is he doesn’t dumb down the craft of modeling itself. And I like his approach, which is to never complicate stuff with fancy methods and tools unless you have a very clear grasp on what it will mean and why it’s going to improve the situation. In the end this is very similar to the book I would have imagined writing on data analysis, so I’m kind of annoyed that it’s already written and so good.

Speaking of O’Reilly, I’ll be at their “Strata: Making Data Work” conference next month here in New York, who’s going to meet me there? It looks pretty great, and will be a great chance to meet other people who are as in love with sexy data as I am.

Historical volatility on the S&P index

In a previous post I described the way people in finance often compute historical volatility, in order to try to anticipate future moves in a single stock. I’d like to give a couple of big caveats to this method as well as a worked example, namely on daily returns of the S&P index, with the accompanying python code. I will use these results in a future post I’m planning about errorbars and how people abuse and misuse them.

Two important characteristics of returns

First, market returns in general have fat-tailed distributions; things can seem “quiet” for long stretches of time (longer than any lookback window), during which the sample volatility is a possibly severe underestimate of the “true” standard of deviation of the underlying distribution (if that even makes sense – for the sake of this discussion let’s assume it does). Then when a fat-tailed event occurs, the sample volatility typically spikes to being an overestimate of the standard of deviation for that distribution.

Second, in the markets, there is clustering of volatility- another way of saying this is that volatility itself is rather auto-correlated, so even if we can’t predict the direction of the return, we can still estimate the size of the return. This is particularly true right after a shock, and there are time series models like ARCH and its cousins that model this phenomenon; they in fact allow you to model an overall auto-correlated volatility, which can be thought of as scaling for returns, and allows you to then approximate the normalized returns (returns divided by current volatility) as independent, although still not normal (because they are still fat-tailed even after removing the clustered volatility effect). See below for examples of normalized daily S&P returns with various decays.

Example: S&P daily returns

I got this data from Yahoo Finance, where they let you download daily S&P closes since 1950 to an excel spreadsheet. I could have used some other instrument class, but the below results would be stronger (especially for things like credit default swamps), not weaker- the S&P, being an index, is already the sum of a bunch of things and tends to be more normal as a result; in other words, the Central Limit Theorem is already taking effect on an intraday basis.

First let’s take a look at the last 3 years of closes, so starting in the summer of 2008:

Next we can look at the log returns for the past 3 years:

Now let’s look at how the historical volatility works out with different decays (decays are numbers less than 1 which you use to downweight old data: see this post for an explanation):

For each choice of the above decays, we can normalize the log returns. to try to remove the “volatility clustering”:

As we see, the long decay doesn’t do a very good job. In fact, here are the histograms, which are far from normal:

Here’s the python code I used to generate these plots from the data (see also R code below):

#!/usr/bin/env python

import csv
from matplotlib.pylab import *
from numpy import *
from math import *
import os

dataReader = csv.DictReader(open(‘SandP_data.txt’, ‘rU’), delimiter=’,’, quotechar=’|’)

close_list = []
for row in dataReader:
#print row[“Date”], row[“Close”]
close_array = array(close_list)
close_log_array = array([log(x) for x in close_list])
log_rets = array(diff(close_log_array))
perc_rets = array([exp(x)-1 for x in log_rets])

plot(close_array[-780:-1], label = “raw closes”)
title(“S&P closes for the last 3 years”)
#plot(log_rets, label = “log returns”)
#hist(log_rets, 100, label = “log returns”)
#hist(perc_rets, 100, label = “percentage returns”)

def get_vol(d):
var = 0.0
lam = 0.0
var_list = []
for r in log_rets:
lam = lam*(1.0-1.0/d) + 1
var = (1-1.0/lam)*var + (1.0/lam)*r**2
return [sqrt(x) for x in var_list]

for d in [10, 30, 100]:
plot(get_vol(d)[-780:-1], label = “decay factor %.2f” %(1-1.0/d))
title(“Volatility in the S&P in the past 3 years with different decay factors”)
for d in [10, 30, 100]:
these_vols = get_vol(d)
plot([log_rets[i]/these_vols[i-1] for i in range(len(log_rets) – 780, len(log_rets)-1)], label = “decay %.2f” %(1-1.0/d))
title(“Volatility normalized log returns (last three years)”)
plot([log_rets[i] for i in range(len(log_rets) – 780, len(log_rets)-1)], label = “raw log returns”)
title(“Raw log returns (last three years)”)
for d in [10, 30, 100]:
these_vols = get_vol(d)
normed_rets = [log_rets[i]/these_vols[i-1] for i in range(len(log_rets) – 780, len(log_rets)-1)]
hist(normed_rets, 100,label = “decay %.2f” %(1-1.0/d))
title(“Histogram of volatility normalized log returns (last three years)”)

Here’s the R code Daniel Krasner kindly wrote for the same plots:


dataReader <- read.csv(“SandP_data.txt”, header=T)

close_list <- as.numeric(dataReader$Close)

close_list <- rev(close_list)

close_log_list <- log(close_list)

log_rets <- diff(close_log_list)

perc_rets = exp(log_rets)-1


plot(close_list[(length(close_list)-779):(length(close_list))], type=’l’, main=”S&P closes for the last 3 years”, col=’blue’)

legend(125, 1300, “raw closes”, cex=0.8, col=”blue”, lty=1)

get_vol <- function(d){

var = 0


var_list <- c()

for (r in log_rets){

lam <- lam*(1 – 1/d) + 1

var = (1 – 1/lam)*var + (1/lam)*r^2

var_list <- c(var_list, var)


return (sqrt(var_list))


L <- (length(close_list))


plot(get_vol(10)[(L-779):L], type=’l’, main=”Volatility in the S&P in the past 3 years with different decay factors”, col=1)

lines(get_vol(30)[(L-779):L],  col=2)

lines(get_vol(100)[(L-779):L],  col=3)

legend(550, 0.05, c(“decay factor .90”, “decay factor .97″,”decay factor .99”) , cex=0.8, col=c(1,2,3), lty = 1:3)



plot((log_rets[2:L]/get_vol(10))[(L-779):L], type=’l’,  col=1, lty=1, ylab=”)

legend(620, 3, “decay factor .90”, cex=0.6, col=1, lty = 1)

plot((log_rets[2:L]/get_vol(30))[(L-779):L], type=’l’, col=2, lty =2, ylab=”)

legend(620, 3, “decay factor .97”, cex=0.6, col=2, lty = 2)

plot((log_rets[2:L]/get_vol(100))[(L-779):L], type=’l’, col=3, lty =3, ylab=”)

legend(620, 3, “decay factor .99”, cex=0.6, col=3, lty = 3)


plot(log_rets[(L-779):L], type=’l’, main = “raw log returns”, col=”blue”, ylab=”)


hist((log_rets[2:L]/get_vol(10))[(L-779):L],  breaks=200, col=1, lty=1, ylab=”, xlab=”, main=”)

legend(2, 15, “decay factor .90”, cex=.8, col=1, lty = 1)

hist((log_rets[2:L]/get_vol(30))[(L-779):L],  breaks=200, col=2, lty =2, ylab=”,  xlab=”, main=”)

legend(2, 40, “decay factor .97”, cex=0.8, col=2, lty = 2)

hist((log_rets[2:L]/get_vol(100))[(L-779):L],  breaks=200,  col=3, lty =3, ylab=”,  xlab=”, main=”)

legend(3, 50, “decay factor .99”, cex=0.8, col=3, lty = 3)

Glucose Prediction Model: absorption curves and dirty data

In this post I started visualizing some blood glucose data using python, and in this post my friend Daniel Krasner kindly rewrote my initial plots in R.

I am attempting to show how to follow the modeling techniques I discussed here in order to try to predict blood glucose levels. Although I listed a bunch of steps, I’m not going to be following them in exactly the order I wrote there, even though I tried to make them in more or less the order we should at least consider them.

For example, it says first to clean the data. However, until you decide a bit about what your model will be attempting to do, you don’t even know what dirty data really means or how to clean it. On the other hand, you don’t want to wait too long to figure something out about cleaning data. It’s kind of a craft rather than a science. I’m hoping that by explaining the steps the craft will become apparent. I’ll talk more about cleaning the data below.

Next, I suggested you choose in-sample and out-of-sample data sets. In this case I will use all of my data for my in-sample data since I happen to know it’s from last year (actually last spring) so I can always ask my friend to send me more recent data when my model is ready for testing. In general it’s a good idea to use at most two thirds of your data as in-sample; otherwise your out-of-sample test is not sufficiently meaningful (assuming you don’t have that much data, which always seems to be the case).

Next, I want to choose my predictive variables. First, we should try to see how much mileage we can get out of predicting future blood glucose levels with past glucose levels. Keeping in mind that the previous post had us using log levels instead of actual glucose levels, since then the distribution of levels is more normal, we will actually be trying to predict log glucose levels (log levels) knowing past log glucose levels.

One good stare at the data will tell us there’s probably more than one past data point that will be needed, since we see that there is pretty consistent moves upwards and downwards. In other words, there is autocorrelation in the log levels, which is to be expected, but we will want to look at the derivative of the log levels in the near past to predict the future log levels. The derivative can be computed by taking the difference of the most recent log level and the previous one to that.

Once we have the best model we can with just knowing past log levels, we will want to add reasonable other signals. The most obvious candidates are the insulin intakes and the carb intakes. These are presented as integer values with certain timestamps. Focusing on the insulin for now, if we know when the insulin is taken and how much, we should be able to model how much insulin has been absorbed into the blood stream at any given time, if we know what the insulin absorption curve looks like.

This leads to the question of, what does the insulin (rate of) absorption curve look like? I’ve heard that it’s pretty much bell-shaped, with a maximum at 1.5 hours from the time of intake; so it looks more or less like a normal distribution’s probability density function. It remains to guess what the maximum height should be, but it very likely depends linearly on the amount of insulin that was taken. We also need to guess at the standard deviation, although we have a pretty good head start knowing the 1.5 hours clue.

Next, the carb intakes will be similar to the insulin intake but trickier, since there is more than one type of carb and different types get absorbed at different rates, but are all absorbed by the bloodstream in a vaguely similar way, which is to say like a bell curve. We will have to be pretty careful to add the carb intake model, since probably the overall model will depend dramatically on our choices.

I’m getting ahead of myself, which is actually kind of good, because we want to make sure our hopeful path is somewhat clear and not too congested with unknowns. But let’s get back to the first step of modeling, which is just using past log glucose levels to predict the next glucose level (we will later try to expand the horizon of the model to predict glucose levels an hour from now).

Looking back at the data, we see gaps and we see crazy values sometimes. Moreover, we see crazy values more often near the gaps. This is probably due to the monitor crapping out near the end of its life and also near the beginning. Actually the weird values at the beginning are easy to take care of- since we are going to work causally, we will know there had been a gap and the data just restarted, so we we will know to ignore the values for a while (we will determine how long shortly) until we can trust the numbers. But it’s much trickier to deal with crazy values near the end of the monitor’s life, since, working causally, we won’t be able to look into the future and see that the monitor will die soon. This is a pretty serious dirty data problem, and the regression we plan to run may be overly affected by the crazy crapping-out monitor problems if we don’t figure out how to weed them out.

There are two things that may help. First, the monitor also has a data feed which is trying to measure the health of the monitor itself. If this monitor monitor is good, it may be exactly what we need to decide, “uh-oh the monitor is dying, stop trusting the data.” The second possible saving grace is that my friend also measured his blood glucose levels manually and inputted those numbers into the machine, which means we have a way to check the two sets of numbers against each other. Unfortunately he didn’t do this every five minutes (well actually that’s a good thing for him), and in particular during the night there were long gaps of time when we don’t have any manual measurements.

A final thought on modeling. We’ve mentioned three sources of signals, namely past blood glucose levels, insulin absorption forecasts, and carbohydrate absorption forecasts. There are a couple of other variables that are known to effect the blood glucose levels. Namely, the time of day and the amount of exercise that the person is doing. We won’t have access to exercise, but we do have access to timestamps. So it’s possible we can incorporate that data into the model as well, once we have some idea of how the glucose is effected by the time of day.

Women on a board of directors: let’s use Bayesian inference

I wanted to show how to perform a “women on the board of directors” analysis using Bayesian inference. What this means is that we need to form a “prior” on what we think the distribution of the answer could be, and then we update our prior with the data available.  In this case we simplify the question we are trying to answer: given that we see a board with 3 women and 7 men (so 10 total), what is the fraction of women available for the board of directors in the general population? The reason we may want to answer this question is that then we can compare the answer to other available answers, derived other ways (say by looking at the makeup of upper level management) and see if there’s a bias.

In order to illustrate Bayesian techniques, I’ve simplified it further to be a discrete question.  So I’ve pretended that there are only 11 answers you could possible have, namely that the fraction of available women (in the population of people qualified to be put on the board of directors) is 0%, 10%, 20%, …, 90%, or 100%.

Moreover, I’ve put the least judgmental prior on the situation, namely that there is an equal chance for any of these 11 possibilities.  Thus the prior distribution is uniform:

We have absolutely no idea what the fraction of qualified women is.

The next step is to update our prior with the available data.  In this case we have the data point that there a board with 3 women and 7 men.  In this case we are sure that there are some women and some men available, so the updated probability of there being 0% women or 100% women should both be zero (and we will see that this is true).  Moreover, we would expect to see that the most likely fraction will be 30%, and we will see that too.  What Bayesian inference gives to us, though, is the relative probabilities of the other possibilities, based on the likelihood that one of them is true given the data.  So for example if we are assuming for the moment that 70% of the qualified people are women, what is the likelihood that the board ends up being 3 women and 7 men?  We can compute that as (0.70)^3*(0.30)^7.  We multiply that by 1/11, the probability that 70% is the right answer (according to our prior) to get the “unscaled posterior distribution”, or the likelihoods of each possibility.  Here’s a graph of these numbers when I do it for all 11 possibilities:

We learn the relative likelihoods of the outcome "3 out of 10" given the various ratios of women

In order to make this a probability distribution we need to make sure the total adds up to 1, so we scale to get the actual posterior distribution:

We scale these to add up to 1

What we observe is, for example, that it’s about twice as likely for 50% of women to be qualified as it is for 10% of women to be qualified, even though those answers are equally distant from the best guess of 30%.  This kind of “confidence of error” is what Bayesian inference is good for.  Also, keep in mind that if we had had a more informed prior the above graph would look different; for example we could use the above graph as a prior for the next time we come across a board of directors.  In fact that’s exactly how this kind of inference is used: iteratively, as we travel forward through time collecting data.  We typically want to start out with a prior that is pretty mild (like the uniform distribution above) so that we aren’t skewing the end results too much, and let the data speak for itself.  In fact priors are typically of the form, “things should vary smoothly”; more on what that could possibly mean in a later post.

Here’s the python code I wrote to make these graphs:

#!/usr/bin/env python
from matplotlib.pylab import *
from numpy import *
# plot prior distribution:
bar(arange(0,1.1,0.1), array([1.0/11]*11), width = 0.1, label = “prior probability distribution”)
xticks(arange(0,1.1,0.1) + 0.05, [str(x) for x in arange(0,1.1,0.1)] )
xlim(0, 1.1)
# compute likelihoods for each of the 11 possible ratios of women:
likelihoods = []
for x in arange(0, 1.1, 0.1):
# plot unscaled posterior distribution:
bar(arange(0,1.1,0.1), array([1.0/11]*11)*array(likelihoods), width = 0.1, label = “unscaled posterior probability distribution”)
xticks(arange(0,1.1,0.1) + 0.05, [str(x) for x in arange(0,1.1,0.1)] )
xlim(0, 1.1)
# plot scaled posterior distribution:
bar(arange(0,1.1,0.1), array([1.0/11]*11)*array(likelihoods)/sum(array([1.0/11]*11)*array(likelihoods)), width = 0.1, label = “scaled posterior probability distribution”)
xticks(arange(0,1.1,0.1) + 0.05, [str(x) for x in arange(0,1.1,0.1)] )
xlim(0, 1.1)

Here’s the R code that Daniel Krasner wrote for these graphs:

barplot( rep((1/11), 11), width = .1, col=”blue”, main = “prior probability distribution”)
likelihoods = c()
for (x in seq(0, 1.0, by = .1))
    likelihoods = c(likelihoods, (x^3)*((1-x)^7));
barplot(likelihoods, width = .1, col=”blue”, main =  “unscaled posterior probability distribution”)
barplot(likelihoods/sum(seq((1/11), 11)*likelihoods), width = .1, col=”blue”, main =  “scaled posterior probability distribution”)


First of all, I changed the theme of the blog, because I am getting really excellent comments from people but I thought it was too difficult to read the comments and to leave comments with the old theme. This way you can just click on the word “Go to comments” or “Leave a comment” which is a bit more self-evident to design-ignorant people like me.  Hope you like it.

Next, I had a bad day today, but I’m very happy to report that something has raised my spirits. Namely, Jake Porway from Data Without Borders and I have been corresponding, and I’ve offered to talk to prospective NGO’s about data, what they should be collecting depending on what kind of studies they want to be able to perform, and how to store and revise data. It looks like it’s really going to happen!

In fact his exact words were: I will definitely reach out to you when we’re talking to NPOs / NGOs.

Oh, and by the way, he also says I can blog about our conversations together as well as my future conversations with those NGO’s (as long as they’re cool with it), which will be super interesting.

Oh, yeah.  Can I get a WOOHOO?!?

Step 0 Revisited: Doing it in R

June 25, 2011 Comments off

A nerd friend of mine kindly rewrote my python scripts in R and produced similar looking graphs.  I downloaded R from here and one thing that’s cool is that once it’s installed, if you open an R source code (ending with “.R”), an R console pops up automatically and you can just start working.  Here’s the code:

gdata <- read.csv('large_data_glucose.csv', header=TRUE)
#We can open a spreadsheet type editor to check out and edit the data:
#Since we are interested in the glucose sensor data, column 31, but the name is a bit awkward to deal with, a good thing to do is to change it:
colnames(gdata)[31] <- "GSensor"

#Lets plot the glucose sensor data:
plot(gdata$GSensor, col="darkblue")

#Here's a histogram plot:
hist(gdata$GSensor, breaks=100, col="darkblue")
#and now lets plot the logarithm of the data:
hist(log(gdata$GSensor), breaks=100, col="darkblue")

And here are the plots:




One thing my friend mentions is that R automatically skips missing values (whereas we had to deal with them directly in python).  He also mentions that other things can be done in this situation, and to learn more we should check out this site.

R seems to be really good at this kind of thing, that is to say doing the first thing you can think about with data.  I am wondering how it compares to python when you have to really start cleaning and processing the data before plotting.  We shall see!