Updating your big data model

January 27, 2012 Cathy O'Neil, mathbabe

When you are modeling for the sake of real-time decision-making, you have to keep updating your model with new data, ideally in an automated fashion. Things change quickly in the stock market or on the internet, and you don’t want to be making decisions based on last month’s trends.

One of the technical hurdles you need to overcome is the sheer size of the dataset you are using to first train and then update your model. Even after aggregating your data with MapReduce or what have you, you can end up with hundreds of millions of lines of data just from the past day or so, and you’d like to use it all if you can.

The problem is, of course, that over time the accumulation of all that data becomes too unwieldy, and your Python or Matlab or R script, combined with your machine, can’t handle it all, even with a 64-bit setup.

Luckily, with exponential downweighting you can update iteratively: you take your new aggregated data (say a day’s worth), update the model, and then throw that data away altogether. You don’t need to save the data anywhere, and you shouldn’t.

As an example, say you are running a multivariate linear regression. I will ignore Bayesian priors (or, what is an example of the same thing in a different language, regularization terms) for now. Then in order to have an updated coefficient vector $\beta$, you need to update your “covariance matrix” $X^{\tau} X$ and the other term (which must have a good name but I don’t know it) $X^{\tau} y$, and simply compute

$\beta = (X^{\tau} X)^{-1} X^{\tau} y.$
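As a minimal sketch of that formula (the data, the seed, and the helper name `solve_beta` are made up for illustration), the point to notice is that $\beta$ needs only the two small objects $X^{\tau} X$ and $X^{\tau} y$, not the rows of $X$ themselves. The sketch solves the linear system rather than forming the inverse explicitly, which is a common numerical choice and not something the formula requires.

```python
import numpy as np

def solve_beta(xtx, xty):
    """Solve (X^T X) beta = X^T y for beta without forming the inverse."""
    return np.linalg.solve(xtx, xty)

# Made-up data: 1000 rows, 3 signals.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)

xtx = X.T @ X   # 3-by-3, its size does not depend on the number of rows
xty = X.T @ y   # length-3 vector
beta = solve_beta(xtx, xty)
```

Once `xtx` and `xty` exist, `X` and `y` are no longer needed to get `beta`.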

So the problem simplifies to: how can we update $X^{\tau} X$ and $X^{\tau} y$?

As I described in an earlier post, you can use exponential downweighting. Whereas before I was expounding on how useful this method is for helping you care about new data more than old data, today my emphasis is on the other convenience, which is that you can throw away old data after updating your objects of interest.

So in particular, we will follow the general rule in updating an object $T$ that it’s just some part old, some part new:

$T(t+1) = \lambda T(t) + (1-\lambda) T(t, t+1),$

where by $T(t)$ I mean the estimate of the thing $T$ at time $t,$ and by $T(t, t+a)$ I mean the estimate of the thing $T$ given just the data between time $t$ and time $t+a.$
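The rule is a one-liner in code. This is only a sketch; the function name `downweighted_update` is hypothetical, and `old` and `new` can be scalars, vectors, or matrices of matching shape.

```python
import numpy as np

def downweighted_update(old, new, lam):
    """Return lam * old + (1 - lam) * new, i.e. T(t+1) from T(t) and T(t, t+1)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    old = np.asarray(old, dtype=float)
    new = np.asarray(new, dtype=float)
    return lam * old + (1.0 - lam) * new
```

With `lam` close to 1 the old estimate dominates; with `lam` close to 0 the newest chunk of data dominates.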

The speed at which I forget data is determined by my choice of $\lambda,$ and should be determined by the market this model is being used in. For example, currency trading is fast-paced, and long-term bonds not as much. How long does it take the market to forget news or to acclimate to new news? The same kind of consideration should be used in modeling the internet. How quickly do users change their behaviors? This could depend on the season as well: things change quickly right after Christmas shopping season is done compared to the lazy summer months.
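One way to make the choice of $\lambda$ concrete (a parameterization assumed here, not one prescribed above) is to think in half-lives: after $h$ updates a chunk of data has been multiplied by $\lambda^h$, so asking for its weight to halve after $h$ updates gives $\lambda = 0.5^{1/h}.$ The helper name below is hypothetical.

```python
def lambda_from_half_life(half_life_updates):
    """Return the lambda for which a chunk's weight halves after this many updates."""
    if half_life_updates <= 0:
        raise ValueError("half_life_updates must be positive")
    return 0.5 ** (1.0 / half_life_updates)

# Example: daily updates, and data should lose half its weight in 10 days.
lam_10_days = lambda_from_half_life(10)
```

A fast market would get a short half-life and a slow one a long half-life; the right number still has to come from knowing the market.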

Specifically, I want to give an example of this update rule for the covariance matrix $X^{\tau}X,$ which really isn’t a true covariance matrix because I’m not scaling it correctly, but I’ll ignore that because it doesn’t matter for this discussion.

Namely, I claim that after updating $X^{\tau}X$ with the above exponential downweighting rule, I have the covariance matrix of data that was itself exponentially downweighted. This is totally trivial but also kind of important: it means that we are not creating some kind of new animal when we add up covariance matrices this way.

Just to be really dumb, start with a univariate regression example, where we have a single signal $x$ and a single response $y$. Say we get our first signal $x_1$ and our first response $y_1.$ Our first estimate for the covariance matrix is $x_1^2.$

Now we get a new piece of data $(x_2, y_2)$, and we want to downweight the old stuff, so we multiply $x_1$ and $y_1$ by some number $\mu.$ Then our signal vector looks like $[\mu x_1, x_2]$ and the new estimate for the covariance matrix is

$M(2) = \mu^2 x_1^2 + x_2^2 = \mu^2 M(1) + M(1, 2),$

where by $M(t)$ I mean the estimate of the covariance matrix at time $t$ as above. Up to scaling this is the exact form from above, where $\lambda = \frac{\mu^2}{1+\mu^2}.$
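A quick numerical check of the univariate case, as a sketch with hypothetical function names: applying the unnormalized recursion $M \leftarrow \mu^2 M + x^2$ one point at a time should give the same number as downweighting the data directly, with the oldest of $n$ points multiplied by $\mu^{n-1}$, and then summing squares.

```python
import numpy as np

def recursive_m(xs, mu):
    """Apply M <- mu^2 * M + x^2 for each point, oldest first."""
    m = 0.0
    for x in xs:
        m = mu ** 2 * m + x ** 2
    return m

def direct_m(xs, mu):
    """Downweight the data itself, then sum the squares."""
    xs = np.asarray(xs, dtype=float)
    n = len(xs)
    weights = mu ** np.arange(n - 1, -1, -1)  # oldest point gets mu**(n-1)
    return float(np.sum((weights * xs) ** 2))

xs = [1.0, 2.0, 3.0]
m_recursive = recursive_m(xs, mu=0.5)
m_direct = direct_m(xs, mu=0.5)
```

Both give 10.0625 for this input: $0.25^2 \cdot 1 + 0.5^2 \cdot 4 + 9.$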

Things to convince yourself of:

- This works when we move from $n$ pieces of data to $n+1$ pieces of data.
- This works when we move from a univariate regression to a multivariate regression and we’re actually talking about square matrices.
- Same goes for the $X^{\tau} y$ term in the same exact way (except it ends up being a column matrix rather than a square matrix).
- We don’t really have to worry about scaling; this uses the fact that everything in sight is quadratic in $\mu$, the downweighting scalar, and the final product we care about is $\beta =(X^{\tau}X)^{-1} X^{\tau}y,$ where, if we did decide to care about scalars, we would multiply $X^{\tau} y$ by the appropriate scalar but then end up dividing by that same scalar when we find the inverse of $X^{\tau} X.$
- We don’t have to update one data point at a time. We can instead compute the “new part” of the covariance matrix and the other thingy for a whole day’s worth of data, downweight our old estimate of the covariance matrix and other thingy, and then get a new version for both.
- We can also incorporate Bayesian priors into the updating mechanism, although you have to decide whether the prior itself needs to be downweighted or not; this depends on whether the prior is coming from a fading prior belief (like, oh I think the answer is something like this because all the studies that have been done say something kind of like that, but I’d be convinced otherwise if the new model tells me otherwise) or if it’s a belief that won’t be swayed (like, I think newer data is more important, so if I use lagged values of the quarterly earnings of these companies then the more recent earnings are more important and I will penalize the largeness of their coefficients less).
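Several of these points can be checked together in a small sketch. Everything here is made up for illustration: the data, the choice of five daily chunks of 500 rows, $\lambda = 0.8$, starting both objects at zero, and the function name `update_stats`. Each day’s chunk updates $X^{\tau} X$ and $X^{\tau} y$ and is then no longer needed; the chunks are kept in a list only so the result can be compared against a regression on the full downweighted data.

```python
import numpy as np

def update_stats(xtx, xty, X_new, y_new, lam):
    """Downweight the old X^T X and X^T y and mix in the new chunk's parts."""
    xtx = lam * xtx + (1.0 - lam) * (X_new.T @ X_new)
    xty = lam * xty + (1.0 - lam) * (X_new.T @ y_new)
    return xtx, xty

rng = np.random.default_rng(1)
k, lam = 4, 0.8
true_beta = np.array([2.0, -1.0, 0.0, 0.5])

xtx, xty = np.zeros((k, k)), np.zeros(k)   # assumed empty start
chunks = []                                # kept ONLY for the check below
for day in range(5):
    X_day = rng.normal(size=(500, k))
    y_day = X_day @ true_beta + 0.1 * rng.normal(size=500)
    xtx, xty = update_stats(xtx, xty, X_day, y_day, lam)
    chunks.append((X_day, y_day))

beta = np.linalg.solve(xtx, xty)

# Check 1: same answer as least squares on data downweighted row by row,
# where a chunk that is `age` days old is multiplied by sqrt(lam) ** age.
mu = np.sqrt(lam)
ages = [len(chunks) - 1 - d for d in range(len(chunks))]
X_w = np.vstack([mu ** a * X for a, (X, _) in zip(ages, chunks)])
y_w = np.concatenate([mu ** a * y for a, (_, y) in zip(ages, chunks)])
beta_full = np.linalg.lstsq(X_w, y_w, rcond=None)[0]

# Check 2: rescaling both objects by the same scalar leaves beta unchanged.
beta_rescaled = np.linalg.solve(7.0 * xtx, 7.0 * xty)
```

In this setup `beta`, `beta_full`, and `beta_rescaled` agree up to floating-point error, which is the "no new animal" and "scaling doesn’t matter" claims in action.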

End result: we can cut our data up into bite-size chunks our computer can handle, compute our updates, and chuck the data. If we want to maintain some history we can just store the “new parts” of the matrix and column vector per day. Then if we later decide our downweighting was too aggressive or not sufficiently aggressive, we can replay the summation. This is much more storage-efficient than holding on to the whole data set, because it depends only on the number of signals in the model (typically under 200) rather than the number of data points going into the model. So for each day you store a 200-by-200 matrix and a 200-by-1 column vector.
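Replaying the summation could look something like the sketch below. The function name `replay`, the in-memory list of daily parts, and the made-up data (30 days, 3 signals) are assumptions for illustration; in practice the daily parts would be loaded from wherever they were stored.

```python
import numpy as np

def replay(daily_parts, lam):
    """Rebuild beta from stored (xtx_day, xty_day) pairs, oldest first."""
    k = daily_parts[0][1].shape[0]
    xtx, xty = np.zeros((k, k)), np.zeros(k)
    for xtx_day, xty_day in daily_parts:
        xtx = lam * xtx + (1.0 - lam) * xtx_day
        xty = lam * xty + (1.0 - lam) * xty_day
    return np.linalg.solve(xtx, xty)

# Made-up history: keep only the small per-day parts, not the rows.
rng = np.random.default_rng(2)
true_beta = np.array([1.0, 0.5, -0.5])
daily_parts = []
for day in range(30):
    X_day = rng.normal(size=(200, 3))
    y_day = X_day @ true_beta + 0.1 * rng.normal(size=200)
    daily_parts.append((X_day.T @ X_day, X_day.T @ y_day))

beta_slow = replay(daily_parts, lam=0.95)  # long memory
beta_fast = replay(daily_parts, lam=0.5)   # short memory
```

Trying a different $\lambda$ costs one pass over the stored matrices and vectors, with no need to touch the raw data again.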

Comments (2)

Stephen Purpura (@spurpura)

January 27, 2012 at 2:39 pm

I wrote a quick placeholder note to remind myself to write a more detailed post about challenges with this method and changes in variance.

- Cathy O'Neil, mathbabe
  
  January 29, 2012 at 6:48 am
  
  let’s hear it!
  