
Combining priors and downweighting in linear regression

June 4, 2012

This is a continuation of yesterday’s post about understanding priors on linear regression as minimizing penalty functions.

Today I want to talk about how we can pair different kinds of priors with exponential downweighting. There are two different kinds of priors, namely persistent priors and kick-off priors (I think I’m making up these terms, so there may be other official terms for these things).

Persistent Priors

Sometimes you want a prior to persist throughout the life of the model. Most “small coefficients” or “smoothness” priors are like this. In such a situation, you aggregate today’s data (say), which means forming an X^\tau X matrix and an X^\tau y vector for that day, and you add N \cdot \lambda^2 I to today’s X^\tau X every single day, before downweighting your old covariance term and folding in today’s covariance term.
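Here’s a minimal sketch of this persistent-prior update loop. The data generator, the true coefficients, and the values of \lambda and the downweighting constant \gamma are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def daily_batch(n=200, p=3):
    """Simulate one day of data (hypothetical toy generator)."""
    X = rng.normal(size=(n, p))
    y = X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.1, size=n)
    return X, y

lam = 0.1      # persistent prior strength (tuning parameter)
gamma = 0.97   # exponential downweighting constant

A = None  # running covariance term, X^T X plus prior
b = None  # running X^T y term

for day in range(10):
    X, y = daily_batch()
    N = len(y)
    # persistent prior: add N * lambda^2 * I to today's covariance,
    # so the "shrink coefficients toward zero" pressure never decays away
    A_today = X.T @ X + N * lam**2 * np.eye(X.shape[1])
    b_today = X.T @ y
    if A is None:
        A, b = A_today, b_today
    else:
        A = gamma * A + A_today
        b = gamma * b + b_today

beta = np.linalg.solve(A, b)
```

Note that the old prior terms get downweighted by \gamma along with the old data, which is why the prior has to be re-added every day to stay at full strength.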

Kick-Off Priors

Other times you just want your linear regression to start off kind of “knowing” what the expected answer is. In this case you only add the prior terms to the first day’s X^\tau X matrix and X^\tau y vector.


This is confusing so I’m going to work out an example. Let’s say we have a model with two priors: 1) the \beta coefficients should look something like r, and 2) the coefficients should be small. The latter condition is standard, and the former happens sometimes when we have older proxy data we can “pretrain” our model on.

Then on the first day, we find the X(1)^\tau X(1) matrix and X(1)^\tau y(1) vector coming from the data, but we add a prior to make it closer to r:

\beta(1) = (X(1)^\tau X(1) + N(1) \cdot \lambda^2 I)^{-1} (X(1)^\tau y(1) + N(1) \cdot \lambda^2 r).

How should we choose \lambda? Note that if we set \lambda = 0, we have no prior, but on the other hand if we make \lambda absolutely huge, then we’d get \beta = r. This is perfect, since we are trying to attract the solution towards r. So we need to tune \lambda to be somewhere in between those two extremes – this will depend on how much you believe r.
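The two extremes can be checked numerically. This sketch (toy data, made-up true coefficients, and a made-up prior target r) computes \beta(1) for \lambda = 0 and for an enormous \lambda:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 100, 2
X = rng.normal(size=(N, p))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=N)
r = np.array([1.5, -0.5])  # hypothetical prior belief about the coefficients

def beta_with_prior(lam):
    # beta(1) = (X^T X + N lam^2 I)^{-1} (X^T y + N lam^2 r)
    A = X.T @ X + N * lam**2 * np.eye(p)
    b = X.T @ y + N * lam**2 * r
    return np.linalg.solve(A, b)

beta_ols  = beta_with_prior(0.0)   # no prior: plain least squares
beta_huge = beta_with_prior(1e4)   # overwhelming prior: pinned to r
```

With \lambda = 0 you recover ordinary least squares, and with \lambda huge the data is swamped and \beta is essentially r; a believable \lambda sits in between.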

On the second day, we downweight data from the first day, and thus we also downweight the r prior. We probably won’t “remind” the model to be close to r anymore, since the idea is we’ve started off this model as if it had already been training on data from the past, and we don’t remind ourselves of old data except through downweighting.

However, we still want to remind the model to make the coefficients small – in other words a separate prior on the size of coefficients. So in fact, on the first day we will have two priors in effect, one as above and the other a simple prior on the covariance term, namely we add (\lambda')^2 I for some other tuning parameter \lambda'. So actually the first day we compute:

\beta(1) = (X(1)^\tau X(1) + N(1) \cdot \lambda^2 I + N(1) \cdot (\lambda')^2 I)^{-1} (X(1)^\tau y(1) + N(1) \cdot \lambda^2 r).

And just to be really precise, if we denote by \gamma the downweighting constant, on day 2 we will have:

A = X(2)^\tau X(2) + N(2) \cdot (\lambda')^2 I + \gamma[X(1)^\tau X(1) + N(1) \cdot \lambda^2 I + N(1) \cdot (\lambda')^2 I],

B = X(2)^\tau y(2) + \gamma[X(1)^\tau y(1) + N(1) \cdot \lambda^2 r], and

\beta = A^{-1} \cdot B.
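Putting the two days together in code makes the bookkeeping concrete. This is a sketch with made-up data, a made-up prior target r, and arbitrary values for \lambda, \lambda', and \gamma:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 2
r = np.array([1.0, 0.5])   # kick-off prior target (hypothetical)
lam, lam_p = 0.2, 0.05     # kick-off and persistent prior strengths
gamma = 0.95               # downweighting constant

def day_data(n=150):
    """Toy daily data with made-up true coefficients."""
    X = rng.normal(size=(n, p))
    y = X @ np.array([1.2, 0.3]) + rng.normal(scale=0.1, size=n)
    return X, y

# Day 1: both the kick-off prior (toward r) and the persistent prior enter.
X1, y1 = day_data(); N1 = len(y1)
A = X1.T @ X1 + N1 * lam**2 * np.eye(p) + N1 * lam_p**2 * np.eye(p)
B = X1.T @ y1 + N1 * lam**2 * r
beta1 = np.linalg.solve(A, B)

# Day 2: downweight everything from day 1 (the kick-off prior decays with
# it), but re-add today's persistent prior at full strength.
X2, y2 = day_data(); N2 = len(y2)
A = X2.T @ X2 + N2 * lam_p**2 * np.eye(p) + gamma * A
B = X2.T @ y2 + gamma * B
beta2 = np.linalg.solve(A, B)
```

Each subsequent day repeats the day-2 step, so the kick-off prior fades geometrically while the small-coefficients prior stays in force.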

  1. June 5, 2012 at 7:52 pm

    Didn’t know that. Why do you use the downweighting, if I may ask?

    More generally, why do you use regression in financial analysis?

    I speak as an outsider; I can’t imagine why one would want to fit financial data with a smooth function.



    • June 5, 2012 at 7:55 pm

      You always want to downweight older data, because a golden rule of financial modeling is that more recent market data is more relevant. Exponential downweighting more or less simulates the memory of the market.

      You use regression because the signal you’re trying to model is incredibly noisy, and it would be silly to imagine being able to approximate it with something more complex than a linear function.


      • June 5, 2012 at 8:09 pm


        I didn’t understand downweighting because I couldn’t see (still can’t) the need for regression. Are you talking about tick data? In that case, with a linear function, we are talking of just some kind of trend.

        Again, thanks for the response.


        • June 5, 2012 at 8:11 pm

          Why do you say trend? Linear regression helps you decide whether certain events have forecasting power for later events. I’m not restricted to using earlier values of the S&P index to forecast later values of the S&P index. I could use anything to predict anything else.


  2. December 22, 2013 at 2:55 pm

    Thanks for this post. I think it’s exactly the idea I was looking for my problem. If you could give me your thoughts on this I’d really appreciate it.

    The problem is estimating value for NBA players. Each player can be thought of as having a coefficient beta that we are trying to find from the regression. The data for the regression comes from point differential (+/-) for each matchup of 10 players during a possession of a game.

    I was thinking of combining priors: one based on the previous season, which, inferring from your post here, would be the “kick-off” prior. The other prior is from the current season, based on “box score” statistics (which can also be used to give an estimate of player value). In fact, it seems that there could even be another “persistent” prior on the shape of the distribution.

