How do you standardize tests?
I’m reading an interesting book by Douglas Harris about the value-added model movement, called Value-added Measures in Education, available here from Harvard Education Press. Harris goes into a very reasonable critique of how “snapshot” views of students, teachers, and schools are a very poor assessment of teacher ability, since they are absolute measurements rather than changes in knowledge. It’s kind of like comparing the Dow to the S&P and concluding that you should definitely invest in Dow stocks since they are ten times better. When you are trying to gauge learning or profit, it’s all about the return on a test score or an index, not the absolute number.
His goal in the book is to explain how value-added models work, how they measure learning, and how they take into account things like poverty level and other circumstances beyond the control of the school or the teachers. In his introduction he also promises not to be unreasonable about applying the results of these tests beyond where it makes sense. He certainly seems to be a smart guy; smart enough to know about errors and the problems with badly set up incentives (he uses the financial crisis as a model of how not to do it). I’m hopeful!
What I’m interested in talking about today is how the “standardized” gets into standardized testing, because already at this point the mathematical modeling is pretty tricky (and involves lots of choices). There are many ways a test is ultimately standardized, assuming for simplicity that it’s a national test given at many grade levels yearly (pretend it’s an SAT that every grade takes):
- the test is normalized for being harder or easier than it was last year, for each grade’s test separately, and sometimes per question as well,
- the grading is normalized so that a student who learns exactly as much “as is expected” gets the same grade from year to year, and
- the grading is further normalized so that a student who gets 10 more points than expected in 3rd grade is doing as well as if she got 10 extra points in 4th grade.
One way of accomplishing all of the above would be to draw a histogram of raw results per year and per grade and normalize that distribution of raw scores by some standard mean and standard deviation, just as you would make a normal distribution standard, i.e. mean 0 and standard deviation 1. In fact, go ahead and demean it and divide by the standard deviation. That’s the first thing I’d do.
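Here is a minimal sketch of that first-pass normalization, assuming raw scores are grouped by (year, grade). The function name `standardize`, the dictionary layout, and the scores themselves are all made up for illustration; this is not how any real testing organization does it.

```python
from statistics import mean, pstdev

def standardize(raw_scores):
    """Demean the scores and divide by their standard deviation."""
    mu = mean(raw_scores)
    sigma = pstdev(raw_scores)  # population standard deviation of this group
    if sigma == 0:
        raise ValueError("all scores are identical, cannot standardize")
    return [(x - mu) / sigma for x in raw_scores]

# Made-up raw scores, keyed by (year, grade).
raw_by_group = {
    (2011, 3): [40, 50, 60, 70, 80],
    (2011, 4): [55, 60, 65, 70, 75],
}

# Each (year, grade) group is normalized separately.
standardized = {
    group: standardize(scores) for group, scores in raw_by_group.items()
}
```

After this step every group has mean 0 and standard deviation 1, no matter how hard its test was or how much its students actually knew.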
But if you actually do that, then you lose lots of the information you are actually trying to glean. Namely, how could you then conclude whether students are doing better or worse than last year? I’m sure you’ve seen the recent news that SAT scores have fallen this year from last. I guess my question is, how can they tell? If we do something as simple as what I suggested, then the definition of doing as well “as is expected” is that you did “as well as the average person did”, and the average can never fall. But clearly this is not what the SAT people do, since they claim people aren’t doing as well as they used to. So how are they standardizing their test?
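To see the problem concretely, here is a small sketch with made-up numbers: two cohorts where this year's students all score 20 raw points lower than last year's. The helper `zscores` and the score lists are hypothetical.

```python
from statistics import mean, pstdev

def zscores(xs):
    """Per-cohort normalization: subtract the mean, divide by the std dev."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

# Made-up raw scores; this year's cohort is uniformly 20 points lower.
last_year = [480, 500, 520, 540, 560]
this_year = [x - 20 for x in last_year]

z_last = zscores(last_year)
z_this = zscores(this_year)

drop_in_raw_mean = mean(last_year) - mean(this_year)  # visible in raw scores
drop_in_z_mean = mean(z_last) - mean(z_this)          # erased by normalizing
```

The raw means differ by 20 points, but the two normalized lists come out identical, so a statement like "scores fell this year" cannot be made from per-year normalized scores alone.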
It isn’t really explained here or here, but there are clues. Namely, if you give 3rd and 4th graders some of the same questions in a given year, you can infer how much better 4th graders do on those questions than 3rd graders, and use that as a proxy for how to scale between grades (assuming those questions represent the general questions well). Next, you can’t repeat questions that count towards the score between years, because the stakes are too high and people would cheat. Instead you can have ungraded sections with repeated questions, which give you a standard against which to compare between years. In fact the SAT does have ungraded sections, and so did the GRE as I recall, and my guess is this is why.
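Here is one way the repeated-question idea could work, as a rough sketch and only my guess at the flavor of it, not the SAT's actual procedure. It assumes each student has a graded score and a score on the repeated ungraded ("anchor") questions, and that the anchor questions track ability the same way the graded ones do. All names and numbers are invented for illustration.

```python
from statistics import mean, pstdev

def difficulty_shift(graded_old, anchor_old, graded_new, anchor_new):
    """Estimate how many raw points easier (+) or harder (-) the new test is."""
    # Graded points per anchor point, estimated from the old year's spread.
    scale = pstdev(graded_old) / pstdev(anchor_old)
    # Change in cohort ability, measured on the identical anchor questions
    # and converted into graded-score units.
    ability_change = (mean(anchor_new) - mean(anchor_old)) * scale
    # What the new cohort would have averaged on the old test.
    expected_new_mean = mean(graded_old) + ability_change
    # Whatever is left over is attributed to the test itself.
    return mean(graded_new) - expected_new_mean

def to_old_scale(graded_new, shift):
    """Put the new year's scores on the old year's scale."""
    return [score - shift for score in graded_new]

# Made-up data: raw graded scores look identical across the two years,
# but the new cohort does one point worse on the repeated questions.
graded_old = [50, 60, 70, 80, 90]
anchor_old = [5, 6, 7, 8, 9]
graded_new = [50, 60, 70, 80, 90]
anchor_new = [4, 5, 6, 7, 8]

shift = difficulty_shift(graded_old, anchor_old, graded_new, anchor_new)
adjusted_new = to_old_scale(graded_new, shift)
```

In this toy example the raw averages are the same in both years, but the repeated questions suggest the new test was about 10 points easier, so the adjusted average drops from 70 to about 60. The same trick could be applied between grades within a year, using questions shared by 3rd and 4th graders.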
That brings up the question: do all standardized tests have ungraded sections? Is there some other clever way to get around this problem? Also on my mind: how well does standardization work, and what is a way to test it?