
An easy way to think about priors on linear regression

June 3, 2012

Every time you add a prior to your multivariate linear regression it’s equivalent to changing the function you’re trying to minimize. Thinking about it this way sometimes makes it easier to understand what’s going on, and it only requires a bit of vector calculus. Of course it’s not the most sophisticated way of thinking about priors, which also have various Bayesian interpretations in terms of assumed distributions on the coefficients and the noise, but it’s handy to have more than one way to look at things.

Plain old vanilla linear regression

Let’s first start with your standard linear regression, where you don’t have a prior. Then you’re trying to find a “best-fit” vector of coefficients \beta for the linear equation y = x \beta. For linear regression, the solution is the \beta that minimizes the sum of the squares of the error terms, namely

\sum_i (y_i - x_i \beta)^2.

Here the various i‘s refer to the different data points.

How do we find the minimum of that? First rewrite it in vector form, where we stack all the different y_i‘s into a big column vector and just call it y, and similarly stack the x_i‘s as the rows of a matrix and call it x. Then we are aiming to minimize (y - x\beta)^\tau (y - x\beta).

Now we appeal to an old calculus idea, namely that we can find the minimum of a convex (bowl-shaped) function by locating where its derivative is zero.

Moreover, the derivative of v^\tau v is just dv^\tau v + v^\tau dv, or in other words  2 \cdot dv^\tau v. In our case this works out to 2 \cdot d(y - x\beta)^\tau (y - x\beta), and since we’re taking the derivative with respect to \beta and x and y are constants, we have d(y - x\beta) = -x, so the derivative is -2 x^\tau (y - x\beta). Setting that equal to zero and ignoring the factor of 2, we get x^\tau x \beta = x^\tau y, or in other words the familiar formula:

\beta = (x^\tau x)^{-1} x^\tau y.
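If it helps to see this in code, here’s a minimal NumPy sketch of that formula. The data (a two-signal x and a noisy y built from a made-up “true” \beta) is invented purely for illustration, and I solve the normal equations with a linear solve rather than forming the inverse explicitly.

import numpy as np

rng = np.random.default_rng(0)
N = 100
x = rng.normal(size=(N, 2))                    # one row per data point, one column per signal
true_beta = np.array([2.0, -1.0])              # made-up coefficients, just for illustration
y = x @ true_beta + 0.1 * rng.normal(size=N)   # noisy observations

# beta = (x^T x)^{-1} x^T y, computed via a linear solve
beta_hat = np.linalg.solve(x.T @ x, x.T @ y)
print(beta_hat)                                # should land close to [2, -1]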

Adding a prior on the variance, or penalizing large coefficients

There are various ways people go about adding a diagonal prior – and various ways people explain why they’re doing it. For the sake of simplicity I’ll use one “tuning parameter” for this prior, called \lambda (but I could let there be a list of different \lambda_j‘s if I wanted) and I’ll focus on how we’re adding a “penalty term” for large coefficients.

In other words, we can think of trying to minimize the following more complicated sum:

\frac{\sum_i (y_i - x_i \beta)^2}{N} + \sum_j \lambda^2 \beta_j^2.

Here the i‘s refer to the different data points (and N is the number of data points), while the j‘s refer to the different \beta coefficients, i.e. to the signals in the regression, of which there are typically way fewer.

When we minimize this, we are simultaneously trying to find a “good fit” in the sense of a linear regression, and trying to find that good fit with small coefficients, since the sum on the right grows larger as the coefficients get bigger. The extent to which we care more about the first goal or the second is just a question about how large \lambda^2 is compared to the variances of the signals x_i. This is why \lambda is sometimes called a tuning parameter. We normalize the left term by N so the solution is robust to adding more data.

How do we minimize that guy? Same idea, where we rewrite it in vector form first:

(y - x \beta)^\tau (y-x\beta)/N + (\lambda I \beta)^\tau (\lambda I \beta)

Again, we set the derivative to zero and ignore the factor of 2 to get:

- x^\tau (y - x \beta)/N + \lambda I^\tau (\lambda I \beta) = 0.

Since I is symmetric, we can simplify to x^\tau x \beta/N + \lambda^2 I \beta = x^\tau y/N, or:

\beta = (x^\tau x/N + \lambda^2 I)^{-1} x^\tau y/N,

which of course can be rewritten as

\beta = (x^\tau x + N \cdot \lambda^2 I)^{-1} x^\tau y.
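In code, the penalized version is the same one-liner with the extra N \cdot \lambda^2 I term added in. This is just a sketch with made-up data; lam below plays the role of \lambda, and its value is an arbitrary illustrative choice.

import numpy as np

rng = np.random.default_rng(0)
N = 100
x = rng.normal(size=(N, 2))
y = x @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=N)

lam = 0.5                                      # tuning parameter, arbitrary for illustration
p = x.shape[1]

# beta = (x^T x + N * lambda^2 * I)^{-1} x^T y
beta_ridge = np.linalg.solve(x.T @ x + N * lam**2 * np.eye(p), x.T @ y)
print(beta_ridge)                              # shrunk toward zero relative to the unpenalized fit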

If you have a prior on the actual values of the coefficients of \beta

Next I want to talk about a slightly fancier version of the same idea, namely when you have some idea of what you think the coefficients of \beta should actually be, maybe because you have some old data or some other study or whatever. Say your prior is that \beta should be something like the vector r, and so you want to penalize not the distance to zero (i.e. the sheer size of the coefficients of \beta) but rather the distance to the vector r. Then we want to minimize:

\frac{\sum_i (y_i - x_i \beta)^2}{N} + \sum_j \lambda^2 (\beta_j - r_j)^2.

We vectorize as

(y - x \beta)^\tau (y-x\beta)/N + (\lambda I (\beta - r))^\tau (\lambda I (\beta - r))

Again, we set the derivative to zero and ignore the factor of 2 to get:

- x^\tau (y - x \beta)/N + \lambda^2 I (\beta - r) = 0,

so we can conclude:

\beta = (x^\tau x/N + \lambda^2 I)^{-1} (x^\tau y/N + \lambda^2 r),

which can be rewritten as

\beta = (x^\tau x + N \cdot \lambda^2 I)^{-1} (x^\tau y + N \cdot \lambda^2 r).
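And the version with a prior vector r is the same computation with the extra N \cdot \lambda^2 r term on the right-hand side. Again this is only a sketch; the data, lam, and r are all made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
N = 100
x = rng.normal(size=(N, 2))
y = x @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=N)

lam = 0.5                                      # tuning parameter, arbitrary for illustration
r = np.array([1.5, 0.0])                       # made-up prior guess for beta
p = x.shape[1]

# beta = (x^T x + N * lambda^2 * I)^{-1} (x^T y + N * lambda^2 * r)
beta_prior = np.linalg.solve(x.T @ x + N * lam**2 * np.eye(p),
                             x.T @ y + N * lam**2 * r)
print(beta_prior)                              # pulled toward r instead of toward zero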


  1. June 5, 2012 at 7:42 pm

    The first modification is Tikhonov regularization.
