I’m enjoying reading and learning about agile software development, which is a method of creating software in teams where people focus on short- and medium-term “iterations”, with the end goal in sight but without attempting to map out the entire path to that end goal. It’s an excellent idea considering how much time businesses can waste on long-term planning that never gets done. And the movement has its own manifesto, which is cool.
I’m a huge fan of stealing good ideas from financial modeling and importing them into other realms. For example, I stole the idea of stress testing portfolios and use it to stress test the business itself where I work, replacing scenarios like “the Dow drops 9% in a day” with things like “one of our clients drops out of the auction.”
I’ve also stolen the idea of “resampling” in order to forecast possible future events based on past data. This is particularly useful when the data you’re handling is not normally distributed, and when you have quite a few data points.
To be more precise, say you want to anticipate what will happen over the next week (5 days) with some quantity you track. You have 100 days of past daily results, and you think the daily results are more or less independent of each other. Then you can pick 5 of those past days at random and see how that “artificial week” would look if it happened again. Of course, that’s only one artificial week, and you should do that a bunch of times to get an idea of the kind of weeks you may have coming up.
If you do this 10,000 times and then draw a histogram, you have a pretty good sense of what might happen, assuming of course that the 100 days of historical data is a good representation of what can happen on a daily basis.
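Here is a minimal sketch of that procedure in Python, just to make the steps concrete. The function name `resample_weeks`, the seeds, and the made-up lognormal `history` are my own placeholders, not real data. I'm also assuming the 5 days are drawn with replacement, which is the usual bootstrap convention.

```python
import random

def resample_weeks(daily_results, days_per_week=5, n_weeks=10_000, seed=0):
    """Build artificial weeks by summing randomly chosen past days.

    Days are drawn with replacement (an assumption on my part), so the
    same historical day can appear more than once in one artificial week.
    """
    rng = random.Random(seed)
    return [
        sum(rng.choices(daily_results, k=days_per_week))
        for _ in range(n_weeks)
    ]

# Hypothetical stand-in for 100 days of history: skewed, made-up numbers.
_history_rng = random.Random(42)
history = [_history_rng.lognormvariate(0.0, 1.0) for _ in range(100)]

weeks = resample_weeks(history)
print(len(weeks), "artificial weeks, ranging from",
      round(min(weeks), 2), "to", round(max(weeks), 2))
```

The list `weeks` is the thing you would draw a histogram of. Summing the 5 days is only one choice of weekly statistic; swap in whatever weekly number you actually care about.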
Here comes my pet peeve. In Mike Cohn’s blog post, he goes to the trouble of resampling to get a histogram, that is, a distribution of fake scenarios. But instead of using that distribution directly to compute a confidence interval, he computes only its average and standard deviation and then replaces the artificial distribution with a normal distribution with those parameters. From his blog:
Armed with 200 simulations of the ten sprints of the project (or ideally even more), we can now answer the question we started with, which is, How much can this team finish in ten sprints? Cells E17 and E18 of the spreadsheet show the average total work finished from the 200 simulations and the standard deviation around that work.
In this case the resampled average is 240 points (in ten sprints) with a standard deviation of 12. This means our single best guess (50/50) of how much the team can complete is 240 points. Knowing that 95% of the time the value will be within two standard deviations we know that there is a 95% chance of finishing between 240 +/- (2*12), which is 216 to 264 points.
What? This is kind of the whole point of resampling, that you could actually get a handle on non-normal distributions!
For example, say that in the weekly example above your daily numbers are skewed and fat-tailed, like a lognormal distribution or something, and that the weekly numbers are just the sum of 5 daily numbers. Then the weekly numbers will also be skewed and fat-tailed, although less so, and the best estimate of a 95% confidence interval comes from sorting the scenarios and using the 2.5th percentile scenario and the 97.5th percentile scenario as the endpoints of your interval.
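To make the contrast concrete, here is a rough sketch that computes both intervals from the same resampled scenarios. The helper names `percentile_interval` and `normal_interval`, the seed, and the made-up lognormal daily data are all my own assumptions for illustration. The index rounding is one simple convention among several for picking a percentile.

```python
import random
import statistics

def percentile_interval(scenarios, coverage=0.95):
    """Sort the scenarios and read off the two tail percentiles."""
    ordered = sorted(scenarios)
    tail = (1.0 - coverage) / 2.0          # 0.025 for a 95% interval
    last = len(ordered) - 1
    return ordered[round(tail * last)], ordered[round((1.0 - tail) * last)]

def normal_interval(scenarios, n_sd=2.0):
    """The shortcut: pretend the scenarios are normal, use mean +/- 2 sd."""
    mean = statistics.fmean(scenarios)
    sd = statistics.stdev(scenarios)
    return mean - n_sd * sd, mean + n_sd * sd

# Hypothetical skewed daily data, resampled into 10,000 artificial weeks.
rng = random.Random(1)
daily = [rng.lognormvariate(0.0, 1.0) for _ in range(100)]
weekly = [sum(rng.choices(daily, k=5)) for _ in range(10_000)]

print("percentile interval:", percentile_interval(weekly))
print("mean +/- 2 sd:      ", normal_interval(weekly))
```

The normal shortcut is always symmetric around the mean, whereas the percentile interval is free to be lopsided when the scenarios are.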
The weakness of resampling is the possibility that the data you have isn’t representative of the future. But the strength is that you get to work with an honest-to-goodness distribution and don’t need to fall back on assuming things are normally distributed.