The overburdened prior
At my new job I’ve been spending my time editing my book with Rachel Schutt (who is joining me at JRL next week! Woohoo!). It’s called Doing Data Science and it’s based on these notes I took when she taught a class on data science at Columbia last semester. Right now I’m working on the alternating least squares chapter, where we learned from Matt Gattis how to build and optimize a recommendation system. A very cool algorithm.
However, to be honest I’ve started to feel very sorry for the one parameter we call It’s also sometimes referred to as “the prior”.
Let me tell you, the world is asking too much from this little guy, and moreover most of the big-data world is too indifferent to its plight. Let me explain.
First, he’s supposed to reflect an actual prior belief – namely, his size is supposed to reflect a mathematical vision of how big we think the coefficients in our solution should be.
In an ideal world, we would think deeply about this question of size before looking at our training data, and think only about the scale of our data (i.e. the input), the scale of the preferences (i.e. the recommendation system output) and the quality and amount of training data we have, and using all of that, we’d figure out our prior belief on the size or at least the scale of our hoped-for solution.
I’m not statistician, but that’s how I imagine I’d spend my days if I were: thinking through this reasoning carefully, and even writing it down carefully, before I ever start my training. It’s a discipline like any other to carefully state your beliefs beforehand so you know you’re not just saying what the data wants to hear.
as convergence insurance
But then there’s the next thing we ask of our parameter namely we assign him the responsibility to make sure our algorithm converges.
Because our algorithm isn’t a closed form solution, but rather we are discovering coefficients of two separate matrices and , fixing one while we tweak the other, then switching. The algorithm stops when, after a full cycle of fixing and tweaking, none of the coefficients have moved by more than some pre-ordained
The fact that this algorithm will in fact stop is not obvious, and in fact it isn’t always true.
It is (mostly*) true, however, if our little is large enough, which is due to the fact that our above-mentioned imposed belief of size translates into a penalty term, which we minimize along with the actual error term. This little miracle of translation is explained in this post.
And people say that all the time. When you say, “hey what if that algorithm doesn’t converge?” They say, “oh if is big enough it always does.”
But that’s kind of like worrying about your teenage daughter getting pregnant so you lock her up in her room all the time. You’ve solved the immediate problem by sacrificing an even bigger goal.
Because let’s face it, if the prior is too big, then we are sacrificing our actual solution for the sake of conveniently small coefficients and convergence. In the asymptotic limit, which I love thinking about, our coefficients all go to zero and we get nothing at all. Our teenage daughter has run away from home with her do-nothing boyfriend.
By the way, there’s a discipline here too, and I’d suggest that if the algorithm doesn’t converge you might also want to consider reducing your number of latent variables rather than increasing your since you could be asking too much from your training data. It just might not be able to distinguish that many important latent characteristics.
as tuning parameter
Finally, we have one more job for our little , we’re not done with him yet. Actually for some people this is his only real job, because in practice this is how he’s treated. Namely, we optimize him so that our results look good under whatever metric we decide to care about (but it’s probably the mean squared error of preference prediction on a test set (hopefully on a test set!)).
In other words, in reality most of the above nonsense about is completely ignored.
This is one example among many where having the ability to push a button that makes something hard seem really easy might be doing more harm than good. In this case the button says “optimize with respect to “, but there are other buttons that worry me just as much, and moreover there are lots of buttons being built right now that are even more dangerous and allow the users to be even more big-data-blithe.
I’ve said it before and I’ll say it again: you do need to know about inverting a matrix, and other math too, if you want to be a good data scientist.
* There’s a change-of-basis ambiguity that’s tough to get rid of here, since you only choose the number of latent variables, not their order. This doesn’t change the overall penalty term, so you can minimize that with large enough but if you’re incredibly unlucky I can imagine you might bounce between different solutions that differ by a base change. In this case your steps should get smaller, i.e. the amount you modify your matrix each time you go through the algorithm. This is only a theoretical problem by the way but I’m a nerd.