Lagged autocorrelation plots
I wanted to share with you guys a plot I drew with Python the other night (the code is at the end of the post) using blood glucose data that I’ve talked about previously in this post and that I originally took a look at in this post.
First I want to motivate lagged autocorrelation plots. The idea is, given that you want to forecast something, say in the form of a time series (so a value every day or every ten minutes or whatever), the very first thing you can do is try to use past values to forecast the next value. In other words, you want to squeeze as much juice out of that orange as you can before you start using outside variables to predict future values.
Of course this won’t always work. It will only work, in fact, if there’s some correlation between past values and future values. To estimate how much “signal” there is in such an approach, we plot the correlation between the time series and a copy of itself shifted by various lags. At no (=0) lag, we are comparing a time series to itself so the correlation is perfect (=1). Typically there are a few lags after 0 which show some positive amount of correlation, then it quickly dies out.
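To make the definition concrete, here is a minimal sketch of computing the correlation at each lag. It runs on a made-up series where each value is 0.9 times the previous one plus noise, not on the glucose data, and the function name `lagged_autocorrelation` is just something I picked for illustration.

```python
import numpy as np

def lagged_autocorrelation(x, lag):
    """Pearson correlation between x[t] and x[t + lag]."""
    x = np.asarray(x, dtype=float)
    if lag == 0:
        return 1.0  # a series compared with itself
    # Pair every value with the value `lag` steps later.
    return np.corrcoef(x[:-lag], x[lag:])[0, 1]

# Synthetic series (an assumption for illustration only):
# each value is 0.9 times the previous value plus fresh noise.
rng = np.random.default_rng(0)
noise = rng.normal(size=5000)
series = np.zeros(5000)
for t in range(1, 5000):
    series[t] = 0.9 * series[t - 1] + noise[t]

# Correlation at lags 0 through 49; plotting this list gives
# a lagged autocorrelation plot.
acf = [lagged_autocorrelation(series, k) for k in range(50)]
```

For a series like this the list starts at 1 and decays toward zero as the lag grows, which is the "few lags of signal, then it dies out" shape described above.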
We could also look at correlations between returns of the values, or differences of the values, in various situations. It depends on what you’re really trying to predict: if you’re trying to predict the change in value (which is usually what quants in finance do, since they want to bet on stock market changes for example), probably the latter will make more sense, but if you actually care about the value itself, then it makes sense to compute the raw correlations. In my case, since I’m interested in forecasting the blood glucose levels, which essentially have maxima and minima, I do care about the actual number instead of just the relative change in value.
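Here is a small sketch of why the choice matters. It uses a made-up random walk (not glucose or market data), and the helper name `lag1_corr` is hypothetical. The raw values of a random walk are very highly correlated from one step to the next, while its differences and returns are essentially uncorrelated.

```python
import numpy as np

def lag1_corr(x):
    """Correlation between each value and the next one."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-1], x[1:])[0, 1]

# Synthetic random walk (illustrative assumption): start at 100
# and add small independent steps.
rng = np.random.default_rng(1)
steps = rng.normal(scale=0.1, size=5000)
levels = 100.0 + np.cumsum(steps)

differences = np.diff(levels)         # x[t+1] - x[t]
returns = differences / levels[:-1]   # relative change

raw_corr = lag1_corr(levels)          # close to 1
diff_corr = lag1_corr(differences)    # close to 0
return_corr = lag1_corr(returns)      # close to 0
```

So a high correlation between raw values says the level is predictable from the last level, but says nothing about whether the next change is predictable.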
Depending on what kind of data it is, and how scrutinized it is, and how much money can be made by betting on the next value, the correlations will die out more or less quickly. Note that, for example, if you did this with daily S&P returns and saw a nontrivial positive correlation after 1 lag, so the next day, then you could have a super simple model, namely bet that whatever happened yesterday will happen again today, and you would statistically make money on that model. At the same time, it’s a general fact that as “the market” recognizes and bets on trends, they tend to disappear. This means that such a simple, positive one-day correlation of returns would be “priced in” very quickly and would therefore disappear with new data. This tends to happen a lot with quant models: as the market learns the model, the predictability of things decreases.
However, in cases where there’s less money riding on the patterns, we can generally expect to see more linkage between lagged values. Since nobody is making money betting on blood glucose levels inside someone’s body, I had pretty high hopes for this analysis. Here’s the picture I drew:
What do you see? Basically I want you to see that the correlation is quite high for small lags, then dies down with a small resuscitation near 300. Hey, it turns out that 288 lags equals one day, since the readings are five minutes apart and 24 × 60 / 5 = 288! So this autocorrelation lift is probably indicating a daily cyclicality of blood glucose levels. Here’s a close-up for the first 100 lags:
We can conclude that the correlation seems significant to about 30 lags, and is decaying pretty linearly.
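Going back to the bump near lag 288 in the full plot: here is a minimal sketch, on made-up data, of how a daily cycle shows up in the autocorrelation. I assume one reading every five minutes and a pure sine wave plus noise, which is far simpler than real glucose behavior.

```python
import numpy as np

SAMPLES_PER_DAY = 24 * 60 // 5  # one reading every 5 minutes gives 288

def lagged_autocorrelation(x, lag):
    """Pearson correlation between x[t] and x[t + lag], for lag >= 1."""
    return np.corrcoef(x[:-lag], x[lag:])[0, 1]

# Synthetic data (illustrative assumption): 30 days of a daily
# sine wave with noise on top.
rng = np.random.default_rng(2)
n_days = 30
t = np.arange(n_days * SAMPLES_PER_DAY)
daily_cycle = np.sin(2 * np.pi * t / SAMPLES_PER_DAY)
series = daily_cycle + rng.normal(scale=1.0, size=t.size)

# Half a day apart the cycle is out of phase, a full day apart
# it lines up again.
half_day = lagged_autocorrelation(series, SAMPLES_PER_DAY // 2)
full_day = lagged_autocorrelation(series, SAMPLES_PER_DAY)
```

In this toy series the correlation is negative at half a day and positive again at a full day, which is the kind of lift you would expect a daily rhythm to produce around lag 288.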
This means that we can use the previous 30 lags to predict the next level. Of course we don’t want to let 30 parameters vary independently. That would be crazy and would totally overfit the model to the data. Instead, I’ll talk soon about how to place a prior on those 30 parameters which essentially uses them all but doesn’t let them vary freely, so the overall number of independent variables is closer to 4 or 5 (although it’s hard to be precise).
One last thing: the data I have used for this analysis is still pretty dirty, as I described here. I will do this analysis again once I decide how to try to remove crazy or unreliable readings that tend to happen before the blood glucose monitor dies.
Here’s the python code I used to generate these plots:
#!/usr/bin/env python
import csv
import os
from datetime import datetime

from matplotlib.pylab import *

os.chdir('/Users/cathyoneil/python/diabetes/')
gap_threshold = 12  # not used below

datelist = []            # seconds since the first sensor reading
datalist = []            # glucose readings (mg/dL), aligned with datelist
skip_gaps_datalist = []  # log of each non-missing reading
firstdate = None
i = 0

with open('Jason_large_dataset.csv', newline='') as csvfile:
    dataReader = csv.DictReader(csvfile, delimiter=',', quotechar='|')
    for row in dataReader:
        #print(i, row["Sensor Glucose (mg/dL)"])
        if row["Raw-Type"] != "GlucoseSensorData":
            continue
        if firstdate is None:
            print(i)
            firstdate = datetime.strptime(row["Timestamp"], '%m/%d/%y %H:%M:%S')
        # Skip missing readings so datelist and datalist stay the same length.
        if row["Sensor Glucose (mg/dL)"] != "":
            thisdate = datetime.strptime(row["Timestamp"], '%m/%d/%y %H:%M:%S')
            diffdate = thisdate - firstdate
            datelist.append(diffdate.seconds + 60*60*24*diffdate.days)
            datalist.append(float(row["Sensor Glucose (mg/dL)"]))
            skip_gaps_datalist.append(log(float(row["Sensor Glucose (mg/dL)"])))
        i += 1

print(min(datalist), max(datalist))

##figure()
##scatter(arange(len(datalist)), datalist)
##
##figure()
##hist(skip_gaps_datalist, bins = 100)
##show()

# Map each timestamp (in seconds) to its reading for fast lookups.
readings = dict(zip(datelist, datalist))

def lagged_correlation(g):
    # Correlate each reading with the reading g lags (g * 5 minutes) later,
    # using only timestamps for which both readings exist.
    s1 = []
    s2 = []
    for date in datelist:
        if date + 60*5*g in readings:
            s1.append(readings[date])
            s2.append(readings[date + 60*5*g])
    return corrcoef(s1, s2)[1, 0]

figure()
plot([lagged_correlation(f) for f in range(1, 900)])
show()