The “One of Many” Fallacy
I’ve been on book tour for nearly a month now, and I’ve come across a bunch of arguments pushing against my book’s theses. I welcome them, because I want to be informed. So far, though, I haven’t been convinced I made any egregious errors.
Here’s an example of an argument I’ve seen consistently in defense of the teacher value-added model (VAM) scores, and sometimes the recidivism risk scores as well: namely, that a teacher’s VAM score was just “one of many considerations” used to establish that teacher’s overall score. In other words, using something unfair is less unfair if you also use other things that are fair and balance it out.
The obvious irony of the “one of many” argument, setting aside the mathematical objection I make below, is that the VAM was supposed to have a real effect on teachers’ assessments, and that effect was meant to be valuable and objective. So any argument that basically implies it’s okay to use the VAM because it has very little power seems odd and self-defeating.
Sometimes it’s true that a single inconsistent or badly conceived ingredient in an overall score is diluted by the other stronger and fairer assessment constituents. But I’d argue that this is not the case for how teachers’ VAM scores work in their overall teacher evaluations.
Here’s what I learned by researching and talking to people who build teacher scores: most of the other things they use – primarily scores derived from categorical evaluations by principals, teachers, and outside observers – have very little variance. Almost all teachers are considered “acceptable” or “excellent” by those measurements, so they all turn into the same number or numbers when scored. That’s not a lot to work with, if the bottom 60% of teachers have essentially the same score, and you’re trying to locate the worst 2% of teachers.
The VAM was brought in precisely to introduce variance to the overall mix. You introduce numeric VAM scores so that there’s more “spread” between teachers, so you can rank them and you’ll be sure to get teachers at the bottom.
But if those VAM scores are actually meaningless, or at least extremely noisy, then what you have is “spread” without accuracy. And it doesn’t help to mix in the other scores.
In a statistical sense, even if you allow 50% or more of a given teacher’s score to consist of non-VAM information, the VAM score will still dominate the variance of a teacher’s score. Which is to say, the VAM score will account for much more than 50% of the information that separates one teacher’s overall score from another’s.
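To make the variance claim concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration, not real teacher data: VAM scores are drawn uniformly from 0 to 100, the non-VAM score is either 80 or 90 (standing in for “acceptable” and “excellent”), the two are weighted 50/50, and the two components are treated as uncorrelated. The function name `vam_variance_share` is made up for this example.

```python
import random
import statistics


def vam_variance_share(vam, other, vam_weight=0.5):
    """Fraction of the combined score's variance that comes from the VAM term.

    Assumes the combined score is vam_weight * vam + (1 - vam_weight) * other
    and that the two components are uncorrelated (an illustrative assumption).
    """
    w = vam_weight
    var_from_vam = (w ** 2) * statistics.pvariance(vam)
    var_from_other = ((1 - w) ** 2) * statistics.pvariance(other)
    return var_from_vam / (var_from_vam + var_from_other)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 10_000

# Hypothetical VAM scores: lots of spread across the whole 0..100 range.
vam = [rng.uniform(0, 100) for _ in range(n)]

# Hypothetical observation scores: nearly everyone lands on one of two values.
other = [rng.choice([80, 90]) for _ in range(n)]

share = vam_variance_share(vam, other)
print(f"VAM share of variance at 50% weight: {share:.1%}")
```

Under these made-up distributions the VAM term accounts for well over 90% of the variance, even though it is nominally only half of the score. The exact figure depends entirely on the assumed spreads; the point is only that weight and influence are not the same thing.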
An extreme version of this is to imagine that the non-VAM 50% of a teacher’s score is always exactly the same. Denote it by 50, and suppose VAM scores run from 0 to 100. When we average each teacher’s VAM score with 50, the resulting overall scores fall between 25 and 75, instead of between 0 and 100, but besides being squished into a smaller range, they haven’t changed with respect to each other. Their relative rankings, in particular, do not change. So whoever was unlucky enough to get a bad VAM score will still be on the bottom.
This holds true for other choices of “50” as well.
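The extreme case is easy to check by hand, but here is a small Python sketch of it anyway. The five VAM scores are invented, and `blend` and `ranking` are hypothetical helper names; the sketch assumes a 50/50 average with a constant non-VAM score, as described above.

```python
def blend(vam_scores, constant=50.0, vam_weight=0.5):
    """Average each VAM score with a constant non-VAM score."""
    return [vam_weight * v + (1 - vam_weight) * constant for v in vam_scores]


def ranking(scores):
    """Teacher indices ordered from lowest score to highest."""
    return sorted(range(len(scores)), key=lambda i: scores[i])


# Invented VAM scores for five teachers, on a 0..100 scale.
vam_scores = [12.0, 87.0, 45.0, 3.0, 66.0]

blended = blend(vam_scores, constant=50.0)
print(blended)  # [31.0, 68.5, 47.5, 26.5, 58.0]

# The order of teachers is identical before and after blending.
print(ranking(vam_scores) == ranking(blended))  # True

# Other choices of the constant shift the range but leave the order alone.
same_order = all(
    ranking(blend(vam_scores, constant=c)) == ranking(vam_scores)
    for c in (0.0, 50.0, 90.0)
)
print(same_order)  # True
```

The teacher with the VAM score of 3 is at the bottom in every version, which is the whole problem: the constant half of the score contributes nothing to who ends up where.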
A word about recidivism risk scores. It’s true that judges use all sorts of information in determining a defendant’s sentence, or bail, or parole. But if one of the most trusted and most variable of those inputs is flawed – and in this case racist – then a similar argument to the above could be made, and the conclusion would be as follows: the overall effect of using flawed recidivism risk scores is stronger, rather than weaker, than one might expect given their weighting. We have to be more worried about it, not less.