Historical volatility on the S&P index
In a previous post I described the way people in finance often compute historical volatility, in order to try to anticipate future moves in a single stock. I’d like to give a couple of big caveats to this method as well as a worked example, namely on daily returns of the S&P index, with the accompanying python code. I will use these results in a future post I’m planning about errorbars and how people abuse and misuse them.
Two important characteristics of returns
First, market returns in general have fattailed distributions; things can seem “quiet” for long stretches of time (longer than any lookback window), during which the sample volatility is a possibly severe underestimate of the “true” standard of deviation of the underlying distribution (if that even makes sense – for the sake of this discussion let’s assume it does). Then when a fattailed event occurs, the sample volatility typically spikes to being an overestimate of the standard of deviation for that distribution.
Second, in the markets, there is clustering of volatility another way of saying this is that volatility itself is rather autocorrelated, so even if we can’t predict the direction of the return, we can still estimate the size of the return. This is particularly true right after a shock, and there are time series models like ARCH and its cousins that model this phenomenon; they in fact allow you to model an overall autocorrelated volatility, which can be thought of as scaling for returns, and allows you to then approximate the normalized returns (returns divided by current volatility) as independent, although still not normal (because they are still fattailed even after removing the clustered volatility effect). See below for examples of normalized daily S&P returns with various decays.
Example: S&P daily returns
I got this data from Yahoo Finance, where they let you download daily S&P closes since 1950 to an excel spreadsheet. I could have used some other instrument class, but the below results would be stronger (especially for things like credit default swamps), not weaker the S&P, being an index, is already the sum of a bunch of things and tends to be more normal as a result; in other words, the Central Limit Theorem is already taking effect on an intraday basis.
First let’s take a look at the last 3 years of closes, so starting in the summer of 2008:
Next we can look at the log returns for the past 3 years:
Now let’s look at how the historical volatility works out with different decays (decays are numbers less than 1 which you use to downweight old data: see this post for an explanation):
For each choice of the above decays, we can normalize the log returns. to try to remove the “volatility clustering”:
As we see, the long decay doesn’t do a very good job. In fact, here are the histograms, which are far from normal:
Here’s the python code I used to generate these plots from the data (see also R code below):
#!/usr/bin/env python
import csv
from matplotlib.pylab import *
from numpy import *
from math import *
import os
os.chdir(‘/Users/cathyoneil/python/sandp/’)
dataReader = csv.DictReader(open(‘SandP_data.txt’, ‘rU’), delimiter=’,’, quotechar=’’)
close_list = []
for row in dataReader:
#print row[“Date”], row[“Close”]
close_list.append(float(row[“Close”]))
close_list.reverse()
close_array = array(close_list)
close_log_array = array([log(x) for x in close_list])
log_rets = array(diff(close_log_array))
perc_rets = array([exp(x)1 for x in log_rets])
figure()
plot(close_array[780:1], label = “raw closes”)
title(“S&P closes for the last 3 years”)
legend(loc=2)
#figure()
#plot(log_rets, label = “log returns”)
#legend()
#figure()
#hist(log_rets, 100, label = “log returns”)
#legend()
#figure()
#hist(perc_rets, 100, label = “percentage returns”)
#legend()
#show()
def get_vol(d):
var = 0.0
lam = 0.0
var_list = []
for r in log_rets:
lam = lam*(1.01.0/d) + 1
var = (11.0/lam)*var + (1.0/lam)*r**2
var_list.append(var)
return [sqrt(x) for x in var_list]
figure()
for d in [10, 30, 100]:
plot(get_vol(d)[780:1], label = “decay factor %.2f” %(11.0/d))
title(“Volatility in the S&P in the past 3 years with different decay factors”)
legend()
for d in [10, 30, 100]:
figure()
these_vols = get_vol(d)
plot([log_rets[i]/these_vols[i1] for i in range(len(log_rets) – 780, len(log_rets)1)], label = “decay %.2f” %(11.0/d))
title(“Volatility normalized log returns (last three years)”)
legend()
figure()
plot([log_rets[i] for i in range(len(log_rets) – 780, len(log_rets)1)], label = “raw log returns”)
title(“Raw log returns (last three years)”)
for d in [10, 30, 100]:
figure()
these_vols = get_vol(d)
normed_rets = [log_rets[i]/these_vols[i1] for i in range(len(log_rets) – 780, len(log_rets)1)]
hist(normed_rets, 100,label = “decay %.2f” %(11.0/d))
title(“Histogram of volatility normalized log returns (last three years)”)
legend()
Here’s the R code Daniel Krasner kindly wrote for the same plots:
setwd(“/Users/cathyoneil/R”)
dataReader < read.csv(“SandP_data.txt”, header=T)
close_list < as.numeric(dataReader$Close)
close_list < rev(close_list)
close_log_list < log(close_list)
log_rets < diff(close_log_list)
perc_rets = exp(log_rets)1
x11()
plot(close_list[(length(close_list)779):(length(close_list))], type=’l’, main=”S&P closes for the last 3 years”, col=’blue’)
legend(125, 1300, “raw closes”, cex=0.8, col=”blue”, lty=1)
get_vol < function(d){
var = 0
lam=0
var_list < c()
for (r in log_rets){
lam < lam*(1 – 1/d) + 1
var = (1 – 1/lam)*var + (1/lam)*r^2
var_list < c(var_list, var)
}
return (sqrt(var_list))
}
L < (length(close_list))
x11()
plot(get_vol(10)[(L779):L], type=’l’, main=”Volatility in the S&P in the past 3 years with different decay factors”, col=1)
lines(get_vol(30)[(L779):L], col=2)
lines(get_vol(100)[(L779):L], col=3)
legend(550, 0.05, c(“decay factor .90″, “decay factor .97″,”decay factor .99″) , cex=0.8, col=c(1,2,3), lty = 1:3)
x11()
par(mfrow=c(3,1))
plot((log_rets[2:L]/get_vol(10))[(L779):L], type=’l’, col=1, lty=1, ylab=”)
legend(620, 3, “decay factor .90″, cex=0.6, col=1, lty = 1)
plot((log_rets[2:L]/get_vol(30))[(L779):L], type=’l’, col=2, lty =2, ylab=”)
legend(620, 3, “decay factor .97″, cex=0.6, col=2, lty = 2)
plot((log_rets[2:L]/get_vol(100))[(L779):L], type=’l’, col=3, lty =3, ylab=”)
legend(620, 3, “decay factor .99″, cex=0.6, col=3, lty = 3)
x11()
plot(log_rets[(L779):L], type=’l’, main = “raw log returns”, col=”blue”, ylab=”)
par(mfrow=c(3,1))
hist((log_rets[2:L]/get_vol(10))[(L779):L], breaks=200, col=1, lty=1, ylab=”, xlab=”, main=”)
legend(2, 15, “decay factor .90″, cex=.8, col=1, lty = 1)
hist((log_rets[2:L]/get_vol(30))[(L779):L], breaks=200, col=2, lty =2, ylab=”, xlab=”, main=”)
legend(2, 40, “decay factor .97″, cex=0.8, col=2, lty = 2)
hist((log_rets[2:L]/get_vol(100))[(L779):L], breaks=200, col=3, lty =3, ylab=”, xlab=”, main=”)
legend(3, 50, “decay factor .99″, cex=0.8, col=3, lty = 3)

August 4, 2011 at 6:57 amSome R code and a data mining book « mathbabe