I’ve been enjoying watching Andrew Ng’s video lectures on machine learning. It requires a login to see the videos, but it’s well worth the nuisance. I’ve caught up to the current lecture (although I haven’t done the homework) and it’s been really interesting to learn about the techniques Professor Ng describes to avoid overfitting models.
In particular, he talks about iterative concepts of overfitting and how to avoid them. I will first describe the methods he uses, then I’ll try to make the case that they are insufficient, especially in the case of a weak signal. By “weak signal” I mean anything you’d come across in finance that would actually make money (technically you could define it to mean that the error has the same variance as the response); almost by definition those signals are not very strong (but maybe were in the 1980′s) or they would represent a ridiculous profit opportunity. This post can be seen as a refinement of my earlier post, “Machine Learners are spoiled for data“, which I now realize should have ended “spoiled for signal”.
First I want to define “overfitting”, because I probably mean something different than most people do when they use that term. For me, this means two things. First, that you have a model that is too complex, usually with too many parameters or the wrong kind of parameters, that has been overly trained to your data but won’t have good forecasting ability with new data. This is the standard concept of overfitting- you are modeling noise instead of signal but you don’t know it. The second concept, which is in my opinion even more dangerous, is partly a psychological one, namely that you trust your model too much. It’s not only psychological though, because it also has a quantitative result, namely that the model sucks at forecasting on new data.
How do you avoid overfitting? First, Professor Ng makes the crucial observation that you can’t possibly think that the model you are training will forecast as well on new data as on the data you have trained on. Thus you need to separate “training data” from “testing data”. So far so good.
Next, Professor Ng makes the remark that, if you then train a bunch of different models on the training data, which depend on the number of variables you use for example, then if you measure each model by looking at its performance on the testing data to decide on that parameter, you can no longer expect the resulting model (with that optimized number of parameters) to actually do so extremely well on actually new data, since you’ve now trained your model to the testing data. For that reason he ends up splitting the data into three parts, namely the training data (60%), a so-called validation data set (20%) and finally the true testing set (the last 20%).
I dig it as an idea, this idea of splitting the data into three parts, although it requires you have enough data to think that testing a model on 20% of your data will give you meaningful performance results, which is already impossible when you work in finance, where you have both weak signal and too little data.
But the real problem is that, after you’ve split your data into three parts, you can’t really feel like the third part, the “true test data”, is anything like clean data. Once you’ve started using your validation set to train your data, you may feel like you’ve donated enough to the church, so to speak, and can go out on a sin bender.
Why? Because now the methods that Professor Ng suggests, for example to see how your model is doing in terms of testing for high bias or high variance (I’ll discuss this more below), looks at how the model performs on the test set. This is just one example of a larger phenomenon: training to the test set. If you’ve looked at the results on the test set at all before fixing your model, then the test set is just another part of your training set.
It’s human nature to do it, and that’s why the test set should be taken to a storage closet and locked up, by someone else, until you’ve finished your modeling. Once you have declared yourself done, and you promise you will no longer tweak the results, you should then find the person, their key, and test your model on the test set. If it doesn’t work you give up and try something else. For real.
In terms of weak signals, this is all the more important because it’s so freaking easy to convince yourself there’s signal when there isn’t, especially if there’s cash money involved. It’s super important to have the “test data set”, otherwise known as the out-of-sample data, be kept completely clean and unviolated. In fact there should even be a stipulated statute of limitations on how often you get to go out of sample on that data for any model at all. In other words, you can’t start a new model on the same data once a month until you find something that works, because then you’re essentially training your space of models to that out-of-sample data – you are learning in your head the data and how it behaves. You can’t help it.
One method that Ng suggests is to draw so-called “learning curves” which plot the loss function of the model on the test set and the validation set as a function of the number of data points under consideration. One huge problem with this for weak signals is that the noise would absolutely overwhelm such a loss estimate, and we’d end up looking at two extremely misleading plots, or information-free plots, the only result of which would be that we’ve seen way too much of the test set for comfort.
It seems to me that the method Ng suggests is the direct result of wanting to make the craft of modeling into an algorithm. While I’m not someone who wants to keep things guild-like and closed, I just don’t think that everything is as easy as an algorithm. Sometimes you just need to get used to not knowing something. You can’t test the fuck out of your model til you optimize on every single thing in site, because you will be overfitting your model, and you will have an unrealistic level of confidence in the result. As we know from experience, this could be very bad, or it could just be a huge waste of everyone’s time.