Resampling

December 11, 2011

I’m enjoying reading and learning about agile software development, which is a method of creating software in teams where people focus on short- and medium-term “iterations”, with the end goal in sight but without attempting to map out the entire path to that end goal. It’s an excellent idea, considering how much time businesses can waste on long-term planning that never gets done. And the movement has its own manifesto, which is cool.

The post I read this morning is by Mike Cohn, who seems heavily involved in the agile movement. It’s a good post, with a good idea, and I have just one nerdy pet peeve concerning it.

I’m a huge fan of stealing good ideas from financial modeling and importing them into other realms. For example, I stole the idea of stress testing portfolios and use it to stress test the business itself where I work, replacing scenarios like “the Dow drops 9% in a day” with things like “one of our clients drops out of the auction.”

I’ve also stolen the idea of “resampling” in order to forecast possible future events based on past data. This is particularly useful when the data you’re handling is not normally distributed, and when you have quite a few data points.

To be more precise, say you want to anticipate what will happen with something over the next week (5 days). You have 100 days of daily results in the past, and you think the daily results are more or less independent of each other. Then you can take 5 random days from the past and see how that “artificial week” would look if it happened again. Of course, that’s only one artificial week, and you should do this a bunch of times to get an idea of the kind of weeks you may have coming up.
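
Here’s a minimal sketch in Python of building one such artificial week. The daily numbers are made up, just a stand-in for the 100 days of history, and I’m drawing with replacement, as is standard for this kind of resampling:

    import numpy as np

    rng = np.random.default_rng(0)
    daily_results = rng.lognormal(mean=0.0, sigma=1.0, size=100)  # made-up history

    # one artificial week: 5 random past days, drawn with replacement
    artificial_week = rng.choice(daily_results, size=5, replace=True)
    print(artificial_week.sum())  # one possible weekly outcome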

If you do this 10,000 times and then draw a histogram, you have a pretty good sense of what might happen, assuming of course that the 100 days of historical data is a good representation of what can happen on a daily basis.
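
A sketch of that repetition, again with made-up numbers standing in for the history, summarized with a crude text histogram:

    import numpy as np

    rng = np.random.default_rng(0)
    daily_results = rng.lognormal(mean=0.0, sigma=1.0, size=100)  # made-up history

    # 10,000 artificial weeks, each the sum of 5 resampled days
    weeks = rng.choice(daily_results, size=(10_000, 5), replace=True).sum(axis=1)

    # crude text histogram of the weekly totals
    counts, edges = np.histogram(weeks, bins=20)
    for lo, hi, n in zip(edges[:-1], edges[1:], counts):
        print(f"{lo:6.1f} - {hi:6.1f} | {'#' * int(50 * n / counts.max())}")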

Here comes my pet peeve. In Mike Cohn’s blog post, he goes to the trouble of resampling to get a histogram, i.e. a distribution of fake scenarios, but instead of actually using that distribution to compute a confidence interval, he only computes the average and standard deviation and then replaces the artificial distribution with a normal distribution with those parameters. From his blog:

Armed with 200 simulations of the ten sprints of the project (or ideally even more), we can now answer the question we started with, which is, How much can this team finish in ten sprints? Cells E17 and E18 of the spreadsheet show the average total work finished from the 200 simulations and the standard deviation around that work.

In this case the resampled average is 240 points (in ten sprints) with a standard deviation of 12. This means our single best guess (50/50) of how much the team can complete is 240 points. Knowing that 95% of the time the value will be within two standard deviations we know that there is a 95% chance of finishing between 240 +/- (2*12), which is 216 to 264 points.

What? That’s kind of the whole point of resampling: you can actually get a handle on non-normal distributions!

For example, say in the above setup your daily numbers are skewed and fat-tailed, like a lognormal distribution or something, and say the weekly numbers are just the sum of 5 daily numbers. Then the weekly numbers will also be skewed and fat-tailed, although less so, and the best estimate of a 95% confidence interval is to sort the scenarios, take the 2.5th and 97.5th percentile scenarios, and use those as the endpoints of your interval.
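
To make the contrast concrete, here’s a sketch with made-up lognormal daily numbers: the percentile interval is read straight off the resampled weeks, while the normal-approximation interval just uses their mean and standard deviation.

    import numpy as np

    rng = np.random.default_rng(0)
    daily_results = rng.lognormal(mean=0.0, sigma=1.0, size=100)  # made-up history

    weeks = rng.choice(daily_results, size=(10_000, 5), replace=True).sum(axis=1)

    # (1) resampling-based 95% interval: just read off the percentiles
    lo, hi = np.percentile(weeks, [2.5, 97.5])

    # (2) normal-approximation interval: mean +/- 2 standard deviations
    mean, sd = weeks.mean(), weeks.std()

    print(f"percentile interval:  [{lo:.1f}, {hi:.1f}]")
    print(f"normal approximation: [{mean - 2 * sd:.1f}, {mean + 2 * sd:.1f}]")

With skewed data the two intervals can differ noticeably; the normal version can even put its lower endpoint below anything the data supports.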

The weakness of resampling is the possibility that the data you have isn’t representative of the future. But the strength is that you get to work with an honest-to-goodness distribution and don’t need to resort to assuming things are normally distributed.

  1. December 11, 2011 at 7:24 am

    Hi Cathy–
    Thanks for reading my blog and for writing this post. You are absolutely right and you’ll notice that in my post, the thing I was leading up to didn’t use the standard deviation calculation. I still find it a matter of curiosity that can be addressed with resampling. But, in practice, I don’t use it often. Mostly, though, that’s because I’ve found it easier to explain resampling than to explain how to interpret standard deviation (even though anyone I’ll talk to will remember it from high school or university).
    Thanks again.

    • December 11, 2011 at 7:29 am

      Yes, I agree! One of the huge wins of resampling is that people totally get it. Very few people really understand standard deviation; for them it’s something they’ve memorized.

      Thanks for the comment, and thanks also for your nice blog! I’m enjoying it.

      Cathy

  2. December 12, 2011 at 12:11 am

    Another interesting weakness is shown by the following situation: Suppose we (accurately) believe data to have a specific tail structure, say ~ 1/x^a, then any finite sample will leave a few scattered large data-points, but leave huge gaps (in the tail) with no data. Moreover, our re-sampled data will never be larger than the largest sample point. By presupposing a power-law tail, we achieve a smooth fit that allows for any size data.
