How do you standardize tests?



September 19, 2011 Cathy O'Neil, mathbabe

I’m reading an interesting book by Douglas Harris about the value-added model movement, called Value-added Measures in Education, available here from Harvard Education Press. Harris goes into a very reasonable critique of how “snapshot” views of students, teachers, and schools are a very poor assessment of teacher ability, since they are absolute measurements rather than changes in knowledge. It’s kind of like comparing the Dow to the S&P and concluding that you should definitely invest in Dow stocks since they are ten times better. When you are trying to gauge learning or profit, it’s all about the return on a test score or an index, not the absolute number.

His goal in the book is to explain how value-added models work, how they measure learning, and how they take into account things like poverty level and other circumstances beyond the control of the school or the teachers. In his introduction he also promises not to be unreasonable about applying the results of these tests beyond where it makes sense. He certainly seems to be a smart guy; smart enough to know about errors and the problems with badly set up incentives – he uses the financial crisis as a model of how not to do it. I’m hopeful!

Here’s what I am interested in talking about today, which is how the “standardized” gets into standardized testing, because already at this point the mathematical modeling is pretty tricky (and involves lots of choices). There are many ways a test is ultimately standardized, assuming for simplicity that it’s a national test given at many grade levels yearly (pretend it’s an SAT that every grade takes):

the test is normalized for being harder or easier than it was last year, for each grade’s test separately, and sometimes per question as well,
the grading is normalized so that a student who learns exactly as much “as is expected” gets the same grade from year to year, and
the grading is further normalized so that a student who gets 10 more points than expected in 3rd grade is doing as well as if she got 10 extra points in 4th grade.

One way of accomplishing all of the above would be to draw a histogram of raw results per year and per grade and normalize that distribution of raw scores by some standard mean and standard deviation, just as you would make a normal distribution standard, i.e. mean 0 and standard deviation 1. In fact, go ahead and demean it and divide by the standard deviation. That’s the first thing I’d do.
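Here is a minimal sketch of that first step in Python, standardizing each (year, grade) group separately. The score values and the `raw` dictionary are made up for illustration; nothing here is taken from how any real testing agency does it.

```python
from statistics import mean, pstdev

def standardize(raw_scores):
    """Demean and divide by the standard deviation (i.e. compute z-scores)."""
    mu = mean(raw_scores)
    sigma = pstdev(raw_scores)  # population standard deviation of this group
    if sigma == 0:
        raise ValueError("all scores are identical; cannot standardize")
    return [(x - mu) / sigma for x in raw_scores]

# Hypothetical raw scores keyed by (year, grade); the numbers are invented.
raw = {
    (2010, 3): [12, 15, 18, 21, 24],
    (2011, 3): [30, 35, 40, 45, 50],
}

# Each group is normalized against its own mean and standard deviation.
standardized = {key: standardize(scores) for key, scores in raw.items()}
```

After this, every group has mean 0 and standard deviation 1 by construction, whatever the raw scale looked like.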

But if you actually do that, then you lose lots of the information you are actually trying to glean. Namely, how could you then conclude if students are doing better or worse than last year? I’m sure you’ve seen the recent news that SAT scores have fallen this year from last. I guess my question is, how can they tell? If we do something as simple as what I suggested, then the definition of doing as well “as is expected” is that you did “as well as the average person did”. But clearly this is not what the SAT people do, since they claim people aren’t doing as well as they used to. So how are they standardizing their test?
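To make the lost information concrete, here is a small Python sketch with invented scores: this year's cohort scores 20 points lower across the board, and after per-year standardization that gap is gone. The numbers and variable names are assumptions for illustration only.

```python
from statistics import mean, pstdev

def zscores(scores):
    """Standardize one year's scores against that year's own mean and spread."""
    mu, sigma = mean(scores), pstdev(scores)
    return [(x - mu) / sigma for x in scores]

last_year = [480, 500, 520, 540, 560]      # made-up raw scores
this_year = [x - 20 for x in last_year]    # everyone does 20 points worse

# The raw means differ by 20 points...
raw_gap = mean(this_year) - mean(last_year)

# ...but each year's z-scores average to zero, so the gap disappears.
z_gap = mean(zscores(this_year)) - mean(zscores(last_year))
```

So per-year standardization alone can rank students within a year, but it cannot say whether this year's students did better or worse than last year's.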

It isn’t really explained here or here, but there are clues. Namely, if you give 3rd and 4th graders some of the same questions on a given year, then you can infer how much better 4th graders do on those questions than 3rd graders do, and you can use that as a proxy for how to scale between grades (assuming that those questions represent the general questions well). Next, since you can’t repeat questions (at least questions that count towards the score) between years, because the stakes are too high and people would cheat, you can instead have ungraded sections that have repeated questions which give you a standard against which to compare between years. In fact the SAT does have ungraded sections, and so did the GREs as I recall, and my guess is this is why.
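Here is a rough Python sketch of how repeated ungraded questions could be used to compare between years. This is my guess at the simplest possible version, not the SAT's actual procedure: it assumes the ungraded "anchor" section is identical in both years, that a change in anchor scores reflects the cohort rather than the test, and (a strong assumption) that one anchor point is worth one graded point. All names and numbers are hypothetical.

```python
from statistics import mean

def difficulty_shift(graded_old, anchor_old, graded_new, anchor_new):
    """Estimate how much of the change in graded scores is due to the test itself.

    The anchor questions are the same in both years, so the change in anchor
    means is attributed to the cohort. Whatever change in graded means is
    left over is attributed to the new test being harder (negative) or
    easier (positive).
    """
    cohort_change = mean(anchor_new) - mean(anchor_old)
    graded_change = mean(graded_new) - mean(graded_old)
    return graded_change - cohort_change

def adjust(new_scores, shift):
    """Put the new year's graded scores on the old year's scale."""
    return [s - shift for s in new_scores]

# Invented data: graded and anchor scores for the same students in each year.
graded_old, anchor_old = [50, 60, 70], [10, 12, 14]
graded_new, anchor_new = [44, 54, 64], [9, 11, 13]

shift = difficulty_shift(graded_old, anchor_old, graded_new, anchor_new)
adjusted = adjust(graded_new, shift)
```

In this toy data the graded mean fell by 6 points but the anchor mean fell by only 1, so the sketch blames 5 points on a harder test and leaves a 1-point real decline after adjustment. The same idea could in principle be applied between grades, using questions given to both 3rd and 4th graders in the same year.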

That brings up the question, do all standardized tests have ungraded sections? Is there some other clever way to get around this problem? Also on my mind: how well does standardization work, and how would you test it?

Categories: math education, news, rant

Comments (3)

Roger Witte

September 19, 2011 at 8:02 am

I don’t know the answers to any of the questions that you pose, but I am almost certain that the standardisation process leaves insufficient information in the results to be able to make all the various comparisons that we want to make in a meaningful way.

1) We want to compare individual students against individual students in the same year
2) We want to compare individual teachers (or schools) against individual teachers (or schools) in the same year
3-4) as above but across years
5) We want to compare this year’s tests against last year’s tests (to guard against grade inflation)

Mix in the fact that the syllabus evolves over time and you end up with too many unknowns. My suspicion is that the order in which I have listed the comparisons is from most reliable to least reliable and that the reliability of cross-year comparisons decreases in proportion to the number of years difference.

On the other hand the unreliability of these measures does not mean that more reliable measurements are possible; some of these things may be incommensurable.

Nathan Dunfield

September 19, 2011 at 7:56 pm

I don’t know how it’s done now, but back in the 90s when you took the SAT there was one extra section that didn’t count toward your final score (of course they didn’t tell you which one it was, and it could be either a verbal or a math section). The problems in that section were questions they were considering for future exams, and the point was to norm their difficulty against the known body of questions on the real exam. They then used that data to create a new exam which was as hard as the previous one. Lather, rinse, repeat. (I’m sure it was actually more complicated than that, but that was the basic idea as I understood it.)

Bindicap

September 21, 2011 at 1:02 am

Thanks for bringing up the whole VAM issue, which is interesting. A while back, I made a comparison to how medicine only became a real science that clearly improved lives when it started to track good statistics.

I realized applying the same idea to education is harder. The knowledge we want to pass on and goals evolve over time and between cultures, and it’s more complex than objective measures of health. Moreover, the actual outcome you care about in education is years away when the students go into the world. So testing you do now is imperfect and potentially motivates teaching to the test, dropping goals that are hard to test, etc.

Much VAM criticism seems to stem from all that. And then there are arguments that the VAM sometimes is not implemented with proper statistical methodology. E.g., assessments are not stable when calibrated with data from different periods. I would hope this is solvable, and maybe enough can be done, for example, to test which of several alternative approaches to a certain subject is superior for various categories of students.

Anyway, the summary of the book you found sounds like it should be a good take on all this. The standardized testing sounds sensible too. Share what else you learn.
