Overfitting

November 12, 2011 Cathy O'Neil, mathbabe

I’ve been enjoying watching Andrew Ng’s video lectures on machine learning. It requires a login to see the videos, but it’s well worth the nuisance. I’ve caught up to the current lecture (although I haven’t done the homework) and it’s been really interesting to learn about the techniques Professor Ng describes to avoid overfitting models.

In particular, he talks about iterative concepts of overfitting and how to avoid them. I will first describe the methods he uses, then I’ll try to make the case that they are insufficient, especially in the case of a weak signal. By “weak signal” I mean anything you’d come across in finance that would actually make money (technically you could define it to mean that the error has roughly the same variance as the response); almost by definition those signals are not very strong (though maybe they were in the 1980s), or they would represent a ridiculous profit opportunity. This post can be seen as a refinement of my earlier post, “Machine Learners are spoiled for data”, which I now realize should have ended “spoiled for signal”.

First I want to define “overfitting”, because I probably mean something different from what most people mean when they use that term. For me, it means two things. First, that you have a model that is too complex, usually with too many parameters or the wrong kind of parameters, that has been overly trained to your data but won’t have good forecasting ability with new data. This is the standard concept of overfitting: you are modeling noise instead of signal but you don’t know it. The second concept, which is in my opinion even more dangerous, is partly a psychological one, namely that you trust your model too much. It’s not only psychological though, because it also has a quantitative result, namely that the model sucks at forecasting on new data.

How do you avoid overfitting? First, Professor Ng makes the crucial observation that you can’t possibly think that the model you are training will forecast as well on new data as on the data you have trained on. Thus you need to separate “training data” from “testing data”. So far so good.

Next, Professor Ng remarks that if you train a bunch of different models on the training data (differing, for example, in the number of variables they use) and then choose among them by their performance on the testing data, you can no longer expect the chosen model to do as well on genuinely new data, since you’ve now trained that choice to the testing data. For that reason he ends up splitting the data into three parts: the training data (60%), a so-called validation data set (20%), and finally the true testing set (the last 20%).
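
A minimal sketch of that 60/20/20 workflow, in Python. The function names, the one-parameter model (a slope through the origin with a ridge-style penalty) and the penalty grid are illustrative assumptions, not anything taken from the lectures. The point is only the division of labor: fit on the training data, choose the penalty on the validation data, and score on the test data exactly once.

```python
import random


def three_way_split(rows, seed=0, train_frac=0.6, val_frac=0.2):
    """Shuffle rows and cut them into train / validation / test pieces.

    Shuffling assumes the rows are exchangeable, which is itself an
    assumption (it is usually false for time series).
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train_frac)
    n_val = round(len(rows) * val_frac)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


def fit_slope(rows, penalty):
    """Least-squares slope through the origin, shrunk by a ridge penalty."""
    sxy = sum(x * y for x, y in rows)
    sxx = sum(x * x for x, y in rows)
    return sxy / (sxx + penalty)


def mse(slope, rows):
    """Mean squared error of the prediction y = slope * x."""
    return sum((y - slope * x) ** 2 for x, y in rows) / len(rows)


def select_and_test(rows, penalties, seed=0):
    """Fit on train, choose the penalty on validation, score once on test."""
    train, val, test = three_way_split(rows, seed)
    fitted = {p: fit_slope(train, p) for p in penalties}
    best = min(penalties, key=lambda p: mse(fitted[p], val))
    return best, mse(fitted[best], test)
```

Calling `select_and_test(rows, [0.0, 1.0, 10.0])` on a list of `(x, y)` pairs returns the chosen penalty and its test loss. The test loss is only an honest estimate if it is computed once, after the choice has been fixed.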

I dig it as an idea, this idea of splitting the data into three parts, although it requires you have enough data to think that testing a model on 20% of your data will give you meaningful performance results, which is already impossible when you work in finance, where you have both weak signal and too little data.

But the real problem is that, after you’ve split your data into three parts, you can’t really feel like the third part, the “true test data”, is anything like clean data. Once you’ve started using your validation set to train your model, you may feel like you’ve donated enough to the church, so to speak, and can go out on a sin bender.

Why? Because now the methods that Professor Ng suggests, for example to see whether your model suffers from high bias or high variance (I’ll discuss this more below), look at how the model performs on the test set. This is just one example of a larger phenomenon: training to the test set. If you’ve looked at the results on the test set at all before fixing your model, then the test set is just another part of your training set.

It’s human nature to do it, and that’s why the test set should be taken to a storage closet and locked up, by someone else, until you’ve finished your modeling. Once you have declared yourself done, and you promise you will no longer tweak the results, you should then find the person, their key, and test your model on the test set. If it doesn’t work you give up and try something else. For real.

In terms of weak signals, this is all the more important because it’s so freaking easy to convince yourself there’s signal when there isn’t, especially if there’s cash money involved. It’s super important to have the “test data set”, otherwise known as the out-of-sample data, be kept completely clean and unviolated. In fact there should even be a stipulated statute of limitations on how often you get to go out of sample on that data for any model at all. In other words, you can’t start a new model on the same data once a month until you find something that works, because then you’re essentially training your space of models to that out-of-sample data – you are learning in your head the data and how it behaves. You can’t help it.
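
To make the danger of going out of sample too often concrete, here is a small simulation sketch in Python. Everything in it is an assumption chosen for illustration: the returns are pure Gaussian noise, each “strategy” is a sequence of coin-flip long/short positions with no signal at all, and the function name is hypothetical. It picks whichever strategy looked best on the reused test set and then scores that same winner on fresh noise.

```python
import random


def best_noise_strategy(n_strategies, n_days, seed=0):
    """Pick the best of many coin-flip strategies on a pure-noise test set.

    Returns (average daily P&L of the winner on the reused test set,
             average daily P&L of that same winner on fresh data).
    """
    rng = random.Random(seed)
    test_returns = [rng.gauss(0, 1) for _ in range(n_days)]
    fresh_returns = [rng.gauss(0, 1) for _ in range(n_days)]

    best_test, best_fresh = float("-inf"), 0.0
    for _ in range(n_strategies):
        # A "strategy" here is just random long/short positions: no signal.
        test_pos = [rng.choice((-1, 1)) for _ in range(n_days)]
        fresh_pos = [rng.choice((-1, 1)) for _ in range(n_days)]
        test_pnl = sum(p * r for p, r in zip(test_pos, test_returns)) / n_days
        fresh_pnl = sum(p * r for p, r in zip(fresh_pos, fresh_returns)) / n_days
        if test_pnl > best_test:
            best_test, best_fresh = test_pnl, fresh_pnl
    return best_test, best_fresh
```

Under these assumptions, the winner among a few hundred strategies will typically look clearly profitable on the reused test set, while its result on the fresh data is just another draw centered on zero. Nothing was learned except the test set itself.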

One method that Ng suggests is to draw so-called “learning curves” which plot the loss function of the model on the test set and the validation set as a function of the number of data points under consideration. One huge problem with this for weak signals is that the noise would absolutely overwhelm such a loss estimate, and we’d end up looking at two extremely misleading plots, or information-free plots, the only result of which would be that we’ve seen way too much of the test set for comfort.
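
A back-of-the-envelope sketch of why noise swamps such a loss estimate, in Python. The setup is an assumption made for illustration: a response y = beta * x + noise, with x of unit variance and Gaussian noise, scored by squared error. Under that setup the true model beats the “predict zero” model by beta squared in expected loss, while a loss estimated on n held-out points has a standard error of sqrt(2) * noise_sd^2 / sqrt(n).

```python
import math


def loss_gap_and_noise(beta, n_points, noise_sd=1.0):
    """Compare the true improvement in loss with the noise in measuring it.

    Assumes y = beta * x + noise, x with unit variance, Gaussian noise.
    Returns (expected loss gap between 'predict zero' and the true model,
             standard error of a squared-error loss estimated on n_points).
    """
    gap = beta ** 2
    std_err = math.sqrt(2) * noise_sd ** 2 / math.sqrt(n_points)
    return gap, std_err


def points_needed(beta, noise_sd=1.0, z=2.0):
    """Held-out points needed before the gap exceeds z standard errors."""
    return (z * math.sqrt(2) * noise_sd ** 2 / beta ** 2) ** 2
```

With `beta = 0.05` (an R-squared of about a quarter of one percent) and 1,000 held-out points, the gap is 0.0025 while the standard error is about 0.045, so any single point on the curve is mostly noise. By this rough count you would need on the order of a million points before the gap stood two standard errors clear.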

It seems to me that the method Ng suggests is the direct result of wanting to make the craft of modeling into an algorithm. While I’m not someone who wants to keep things guild-like and closed, I just don’t think that everything is as easy as an algorithm. Sometimes you just need to get used to not knowing something. You can’t test the fuck out of your model until you’ve optimized on every single thing in sight, because you will be overfitting your model, and you will have an unrealistic level of confidence in the result. As we know from experience, this could be very bad, or it could just be a huge waste of everyone’s time.

Categories: data science

Comments (12)

Pat Burns

November 12, 2011 at 1:30 pm

Cathy,

I agree with your conclusions. I find it hard to believe that we will ever fully conquer overfitting. There are techniques like cross validation that help to limit our self-delusion, but ultimately judgement is still involved.

I did a more pictorial blog post on overfitting a while ago; it is at http://www.portfolioprobe.com/2011/03/28/the-devil-of-overfitting/

- Cathy O'Neil, mathbabe
  
  March 26, 2012 at 4:02 pm
  
  Pat,
  
  Thanks, I’ve actually used this post a bunch of times to explain things to people.
  
  Cathy
  
Roger Witte

November 13, 2011 at 4:50 am

From my experience of modelling transport: there is always insufficient data available to calibrate the models properly, and there is always a conflict between wanting to use all the available data for calibration and wanting an independent validation data set.

I distinguish three cases:
1) You have an underlying theory that is well established and that the model embodies
2) You have an underlying theory that you wish to test
3) Your model is a correlation not an explanation

Overfitting is a big danger in the second and third cases.

Pat Burns

November 13, 2011 at 7:11 am

In my experience finance is always firmly in Roger’s case 3.

Uzair

November 16, 2011 at 10:40 am

Ditto what Pat said — I’ve found cross-validation quite useful for named entity tagging, etc.

Dmitry Zotikov

November 19, 2011 at 2:47 am

Hello,

There is another technique for using the available data efficiently (you have probably heard of it) called k-fold cross-validation: the data is split into k equal pieces, and in each of the k rounds of learning you leave out one piece (1/k of the data) as a test set and train the model on the remaining examples.

Personally, never used it in practice, not sure if it “works well” at all or not.

Yuliya

November 20, 2011 at 5:52 pm

I rewatched the videos, and I’m pretty sure he doesn’t do what you say he does. He only uses the cross-validation set to choose any parameters.

- Cathy O'Neil, mathbabe
  
  November 20, 2011 at 9:07 pm
  
  Wait, did you see the learning curve video?
  
  - Yuliya
    
    November 27, 2011 at 3:59 pm
    
    Yes, I did. He uses his cross-validation set in it (though once he says “or test set”, I’m not sure if he meant it…)
    
    - Cathy O'Neil, mathbabe
      
      November 28, 2011 at 5:11 am
      
      No, I really think he’s using his test set there and comparing it to the validation set! Unless I’m completely wrong. In any case he never said anything about hiding the test set until he fixes his model, so he’s definitely letting you peek at the results for the test set before the end, which is taboo in my philosophy.
      
    - Yuliya
      
      November 29, 2011 at 4:16 pm
      
      The learning curves he’s comparing are for the training set and the cross validation set. The training set grows in size and the cross validation set is constant.
      
      At the end of the video where he introduces the test and cross validation sets, he talks about not touching the test set.
      
    - Cathy O'Neil, mathbabe
      
      November 29, 2011 at 4:43 pm
      
      Yeah I think you’re right. Thanks! I guess my only complaint is that he didn’t seem vehement enough about it.
      