Are SAT scores going down?
First, it needs to be said that, as I have learned in the book I’m reading, it’s probably a bad idea to make statements about learning based on “cohort-to-cohort comparisons” instead of following actual students over time. In other words, if you compare how well this year’s 3rd graders did on a test to last year’s, the difference could largely be explained by the fact that they are different populations with different demographics. Indeed the College Board, which administers the SAT, explains that scores went down this year because a larger and more diverse pool of kids is taking the test. So that’s encouraging, and it makes you think that the statement “SAT scores went down” is in this case pretty meaningless.
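To make the cohort-comparison problem concrete, here’s a toy illustration (all numbers invented, nothing to do with actual SAT data): every subgroup’s mean score can go *up* while the overall mean goes *down*, purely because the mix of test-takers shifted. This is the classic composition effect (a form of Simpson’s paradox).

```python
# Hypothetical numbers: (mean score, share of test-takers) per subgroup.
year1 = [(520, 0.7), (460, 0.3)]
year2 = [(525, 0.5), (465, 0.5)]  # each group's mean rose 5 points, mix shifted

def overall(groups):
    # Population mean is the share-weighted average of subgroup means.
    return sum(mean * share for mean, share in groups)

print(overall(year1))  # 502.0
print(overall(year2))  # 495.0 -- lower overall, despite every group improving
```

So a headline drop in the overall mean, on its own, tells you almost nothing about whether any actual group of students is learning less.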
But is it meaningless for that reason?
Keep in mind that these are small differences we’re talking about, but with a pretty huge sample size overall. Even so, it would be nice to see some error bars, along with the methodology for computing them.
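A quick back-of-the-envelope calculation (my assumptions, not the College Board’s: a section-score standard deviation of around 110 points and roughly 1.5 million test-takers) shows why the sample size matters: pure sampling error on the mean is tiny, so any honest error bars would have to come mostly from systematic sources.

```python
import math

# Assumed inputs (hypothetical): score SD and number of test-takers.
sd = 110
n = 1_500_000

# Standard error of the mean: sd / sqrt(n).
se = sd / math.sqrt(n)
print(round(se, 3))  # about 0.09 points
```

In other words, with numbers this size, a year-over-year change of a point or two is many sampling standard errors; if the change is still meaningless, it’s because of non-sampling issues like the ones below.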
What I’m really worried about, though, is the “equating” part of the process. That’s the process by which they decide how to compare tests from year to year, mostly by including questions in common across tests that are ungraded. At least that’s my guess; it’s actually not clear from their website.
My first question is: are they keeping track of the errors introduced by the equating process itself? (I find it annoying how often people, when they calculate errors, only account for the very last step of a sketchy multi-step process.) For example, is their equating process so good that they can really tell us, with statistical significance, that American Indians as a group did 2 points worse on the writing test (see this article for numbers like this)? I’m pretty sure that’s a best guess with significant error bars.
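Here’s a sketch of why reporting only the last step’s error is misleading. If independent error sources add in quadrature, the total is dominated by the largest component; with the numbers below (all hypothetical), a standard error of equating around 1.5 points would swamp the tiny sampling error and make a 2-point group difference barely distinguishable from noise.

```python
import math

# Hypothetical error components, in scaled-score points.
sampling_se = 0.1   # tiny, thanks to the huge sample size
equating_se = 1.5   # assumed standard error of the equating step

# Independent errors combine in quadrature.
total_se = math.sqrt(sampling_se**2 + equating_se**2)
print(round(total_se, 2))  # 1.5 -- a 2-point difference is only ~1.3 SEs
```

The point isn’t these particular numbers; it’s that the error bar on the final comparison can’t be smaller than the error bar on the equating step it passed through.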
Additional note: I found this quote in a survey paper on equating methodologies (top of page 519):
Almost all test-equating studies ignore the issue of the standard error of the equating
Second, I’m really worried about the equating process and its error bars for the following reason: the number of repeat testers varies widely by demographic, and also from year to year. How then can we assess performance on the “linking questions” (the questions that are repeated on different tests) if some kids (in fact the kids more likely to be practicing for the test) are seeing them repeatedly? Is that controlled for, and how? Are they removing repeat testers?
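Here’s a toy simulation of the worry (every parameter is invented): if repeat testers get even a small practice boost on the linking questions, then a year with more repeaters will make the anchor items look easier, which would bias any equating that leans on anchor-item performance.

```python
import random

random.seed(0)

def anchor_correct_rate(n, repeat_frac, base_p=0.50, boost=0.10):
    """Simulated fraction correct on an anchor item.

    Hypothetical model: a fresh tester answers correctly with probability
    base_p; a repeat tester (who has seen the item before) gets base_p + boost.
    """
    hits = 0
    for _ in range(n):
        p = base_p + (boost if random.random() < repeat_frac else 0.0)
        hits += random.random() < p
    return hits / n

print(anchor_correct_rate(100_000, 0.20))  # around 0.52
print(anchor_correct_rate(100_000, 0.40))  # around 0.54
```

Nothing about the students’ underlying ability changed between the two runs; only the repeat-tester fraction did. Yet the anchor items look easier in the second run, which is exactly the kind of artifact the equating would need to control for.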
This brings me to my main complaint about all of this. Why is the SAT equating methodology not open source? Isn’t the proprietary “intellectual property” in the test itself? Am I missing a link? I’d really like to take a look. Even better, of course, would be if the methodology were open source in the strong sense (as in, there’s an available script that actually computes the scores starting from raw data) and the data were also available, suitably anonymized.