What is seasonal adjustment?

June 12, 2011 Cathy O'Neil, mathbabe

One thing that kind of drives me crazy in economic or business news (which I’m frankly addicted to, which makes me incredibly old and boring) is the lack of precision exactly when there seems to be some actual data. At the very moment when you think you’re going to be told what the cold hard facts are, so you can make up your own mind about whether the economy is still sucking or is finally recovering, you get a pseudo-statistic with a side of caveat. I make it a point to try to formally separate the true bullshit from the stuff that actually is pretty informative if you know what they are talking about. I consider “seasonal adjustment” pretty much in the latter category, although there are exceptions (more on that later).

So what does “seasonal adjustment” mean? Let’s take an example: a common one is home sales. It’s a well known fact that people don’t buy as many homes in January and February as they do in May and June, due to some combination of people sitting in their houses eating ice cream straight from the Ben & Jerry’s container when it’s cold outside, and sellers being enraged by the dirty snow tracks on their immaculate rugs during open houses. So people delay house-hunting til spring and they delay house-selling til house-hunting starts (side note: because of this, desperate people getting divorced or being forced to move often have to sell their houses at major discounts, so always do your house-hunting right after a huge blizzard).

Considering the cyclical and predictable nature of home sales, people want to “seasonally adjust” the data so that they can discern a move that is *not* due to the time of the year. In other words, they want to detect whether a more macroeconomic issue is affecting home sales, such as a recession or housing glut (or both). It’s a reasonable approach. How does it work exactly?

Say you have a bunch of housing data, maybe 20 years of monthly home sales. You see that every single year the same pattern emerges, more or less. Then you could, for a given year, compute the average number of sales per month for that year. It’s important to compute this average, as we will see, because one golden rule of adjusting data is that the sum of the adjusted data must equal the sum of the original data, otherwise you introduce a problem that’s bigger than the one you’re solving.

Once you have the average sales per month, you figure out (using all 20 years) the typical divergence from the average that you see for each month, as a percentage of the monthly average for that year. So for example, January is the worst month for home sales, and in the 20 years of data you see that on average there are 20% fewer home sales in January than there are in the average month of that year, whereas in June there are typically (in your sample) 15% more sales than in the average month that year. Using this historical data, you come up with a number for each month (-20% for January, 15% for June, etc.). I can finally say what “seasonally adjusted” means: it is the rate of sales for the average month, or for the year, that the raw figure implies once you undo that month’s typical divergence. So if we saw 80,000 home sales in January, and our number for January is -20%, then we divide 80,000 by (1 - 0.20) and say we have a seasonally adjusted rate of 100,000 sales per month, or 1.2 million sales per year.
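Here is a minimal sketch of that calendar-year recipe in Python. The function names and the toy three-year history are made up for illustration (the post doesn't come with data), and this is a bare-bones version of the idea, not whatever a statistical agency actually runs.

```python
from statistics import mean

def seasonal_factors(sales_by_year):
    """sales_by_year: one 12-element list (Jan..Dec) per calendar year.
    Returns 12 factors; -0.20 means 20% below that year's average month."""
    deviations = []
    for year in sales_by_year:
        avg = mean(year)  # average sales per month for this year
        deviations.append([s / avg - 1.0 for s in year])
    # Typical divergence per calendar month, averaged over all years.
    return [mean(dev[m] for dev in deviations) for m in range(12)]

def adjusted_monthly_rate(raw_sales, factor):
    """Undo the month's typical divergence to get an average-month rate."""
    return raw_sales / (1.0 + factor)

# Made-up history: the same seasonal shape at three different overall levels.
pattern = [0.80, 0.85, 0.95, 1.05, 1.10, 1.15,
           1.10, 1.05, 1.00, 1.00, 0.95, 1.00]
history = [[level * p for p in pattern] for level in (90_000, 100_000, 110_000)]

factors = seasonal_factors(history)                 # January comes out near -0.20
monthly = adjusted_monthly_rate(80_000, factors[0])  # roughly 100,000 per month
annual = 12 * monthly                                # roughly 1.2 million per year
```

Because each month's divergence is measured as a fraction of that year's own average, a boom year and a bust year with the same seasonal shape contribute the same factors.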

Note that this system of adjustment follows the golden rule at least for the historical data; by the end of each calendar year, we have attributed the correct overall number of sales, spread out over the months. However, if we start predicting July sales from what we’ve seen from home sales from January to March, taking into account these adjustments, we will also be tacitly assuming an overall number of sales for the year, and the golden rule will probably not hold. This is just another way to say that we won’t really know how many home sales have occurred in a given year until the year is over, so duh. But it’s not hard to believe that knowing these numbers is pretty useful if you want to make a ballpark estimate of the yearly rate of home sales and it’s only March.
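To make the "it's only March" ballpark concrete, here is one possible sketch. The monthly factors and the first-quarter sales figures are hypothetical, and pooling the observed months by dividing their total by the total of their expected weights is my own assumption about how to combine them; other choices are defensible.

```python
# Hypothetical monthly factors (Jan..Dec). They sum to zero, which is what
# makes the modeled months add back up to the implied yearly total.
factors = [-0.20, -0.15, -0.05, 0.05, 0.10, 0.15,
           0.10, 0.05, 0.00, 0.00, -0.05, 0.00]

def implied_annual_total(observed, factors):
    """observed: sales for the first len(observed) months of the year."""
    weights = [1.0 + f for f in factors[:len(observed)]]
    monthly_rate = sum(observed) / sum(weights)  # average-month rate so far
    return 12 * monthly_rate

def predicted_month(observed, factors, month_index):
    """Predicted sales for a later month (0 = January) given the year so far."""
    monthly_rate = implied_annual_total(observed, factors) / 12
    return monthly_rate * (1.0 + factors[month_index])

first_quarter = [80_000, 85_000, 95_000]              # made-up Jan, Feb, Mar
annual = implied_annual_total(first_quarter, factors)  # tacit yearly total
july = predicted_month(first_quarter, factors, 6)      # July prediction

# Golden-rule check on the model side: the 12 modeled months sum to the total.
modeled_year = sum(predicted_month(first_quarter, factors, m) for m in range(12))
```

The July prediction only makes sense together with the yearly total it tacitly assumes, and that total will almost certainly be off once the real year is in.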

A slightly more sophisticated way of doing this, which doesn’t depend as much on the calendar year, is to use the 20 years of data and a rolling 12 month window (i.e. where we add a month in the front and drop off a month in the back and thus always consider 12 consecutive months at a time) to compute the monthly adjustment for each month relative not to the average for the upcoming year, but rather relative to the average of the 12 past months. This has the advantage of being a causal model (i.e. a model which only uses data in the past to predict the future; I’ll write a post soon about causal modeling) but has the disadvantage of not following the golden rule, at least in a short amount of time. For example, if housing sales are on a slow slide over months and months, the trailing average stays above the current level, so this model will consistently fail to predict how low home sale figures should be.
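A rough sketch of the rolling-window version, again with invented names and invented data. It estimates the factors on a flat stretch of history, then applies them to a series that slides 1% a month, to show the predictions landing too high every single time.

```python
from statistics import mean

def trailing_factors(series, start_month=0):
    """series: monthly sales in time order; start_month: calendar month
    (0 = January) of series[0]. Needs at least 24 months of data.
    Each month is compared to the mean of the 12 months before it."""
    buckets = [[] for _ in range(12)]
    for t in range(12, len(series)):
        trailing_avg = mean(series[t - 12:t])  # only uses past data
        buckets[(start_month + t) % 12].append(series[t] / trailing_avg - 1.0)
    return [mean(b) for b in buckets]

def predict_next(series, factors, start_month=0):
    """Predict the month right after the end of series."""
    next_month = (start_month + len(series)) % 12
    return mean(series[-12:]) * (1.0 + factors[next_month])

pattern = [0.80, 0.85, 0.95, 1.05, 1.10, 1.15,
           1.10, 1.05, 1.00, 1.00, 0.95, 1.00]

# Four flat years: the factors recover the seasonal shape.
flat = [100_000 * pattern[t % 12] for t in range(48)]
factors = trailing_factors(flat)

# Same shape, but the overall level slides 1% every month.
slide = [100_000 * 0.99 ** t * pattern[t % 12] for t in range(36)]
errors = [predict_next(slide[:t], factors) - slide[t] for t in range(12, 36)]
# Every entry of errors is positive: the prediction overshoots each month.
```

The overshoot comes entirely from the trailing average lagging behind the slide; the seasonal shape itself is recovered fine.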

The biggest problem with seasonally adjusted numbers is, in my opinion, that the model itself is never described: do we use 20 years of historical data? 3 years? Do we use a rolling window or calendar years? Without this kind of information, I’m frankly left wondering if you could frigging show me the raw data and let me decide whether it’s good news or bad news.

A few comments have trickled in from friends (over email) who are quants, and I wanted to add them here.

First, any predicting is hard and assumes a model, i.e., each year is the same, or each month is the same. In other words, as soon as you are talking about something being surprisingly anything, you are modeling, even when you don’t think you are. Most assumptions go unnoticed in fact. Part of being a good quant is simply being able to list your modeling assumptions.

Second, as we will see when we discuss quant techniques further, a very important metric of a model is how many independent data points you have going into the model; this informs the calculation of statistical significance, for example. The comment then is that modeling seasonal adjustment as I’ve described above lowers your “number of independent data points” count by a factor of 12, because you are basically using all 12 months of a year to predict the _next year_, so what looked like 12 data points is really becoming only one. However, you could try to fit a curve with fewer than 12 parameters to the seasonal differences, but then there’s overfit from having chosen the family of curves to be one that looks right. More on questions like this when we explore the concept of fitting a model to the data, and in particular on how many different models you try for a given data set.

The final comment is this: all predictions likely violate the golden rule, but the point is you at least want one that isn’t biased, so in expectation it matches the rule.