## Correlation does not imply equality

One of the reasons I enjoy my blog is that I get to try out an argument and then see whether readers 1) poke holes in it, 2) misunderstand it, or 3) misunderstand something tangential to it.

Today I’m going to write about an issue of the third kind. Yesterday I talked about how I’d like to see the VAM scores for teachers directly compared to other qualitative scores or other VAM scores so we could see how reliably they regenerate various definitions of “good teaching.”

The idea is this. Many mathematical models are meant to replace a human-made model that is deemed too expensive to work out at scale. Credit scores were like that; take the work out of the individual bankers’ hands and create a mathematical model that does the job consistently well. The VAM was originally intended as such – in-depth qualitative assessments of teachers are expensive, so let’s replace them with a much cheaper option.

So all I’m asking is, how good a replacement is the VAM? Does it generate the same scores as a trusted, in-depth qualitative assessment?

When I made the point yesterday that I haven’t seen anything like that, a few people mentioned studies that show *positive correlations* between the VAM scores and principal scores.

But here’s the key point: *positive correlation does not imply equality.*

Of course sometimes positive correlation is good enough, but sometimes it isn’t. It depends on the context. If you’re a trader that makes thousands of bets a day and your bets are positively correlated with the truth, you make good money.

But on the other side, if I told you that there’s a ride at a carnival that has a positive correlation with not killing children, that wouldn’t be good enough. You’d want the ride to be safe. It’s a higher standard.
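To make the contrast concrete, here is a minimal sketch of the trader's side of it. The numbers are my own assumptions for illustration: each bet wins or loses one unit, and "positively correlated with the truth" is modeled as a 52% chance of winning. The function name is hypothetical.

```python
import random

def fraction_of_profitable_runs(n_bets, p_win, n_runs, seed=0):
    """Simulate n_runs independent runs of n_bets each.

    Every bet pays +1 with probability p_win and -1 otherwise.
    Returns the fraction of runs that end with a strictly positive total.
    """
    rng = random.Random(seed)
    profitable = 0
    for _ in range(n_runs):
        total = sum(1 if rng.random() < p_win else -1 for _ in range(n_bets))
        if total > 0:
            profitable += 1
    return profitable / n_runs

# Assumed edge: 52% win probability (illustrative, not from any real data).
single_bet = fraction_of_profitable_runs(n_bets=1, p_win=0.52, n_runs=500)
many_bets = fraction_of_profitable_runs(n_bets=2000, p_win=0.52, n_runs=500)

print(f"runs that made money with 1 bet:     {single_bet:.0%}")
print(f"runs that made money with 2000 bets: {many_bets:.0%}")
```

A small edge is nearly useless on one bet and very reliable over thousands, because the errors average out. A teacher, like a child on the carnival ride, gets one outcome and no averaging.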

I’m asking that we make sure we are using that second, higher standard when we score teachers, because their jobs are increasingly on the line, so it matters that we get things right. Instead we have a machine that nobody understands that is *positively correlated* with things we do understand. I claim that’s not sufficient.

Let me put it this way. Say your “true value” as a teacher is a number between 1 and 100, and the VAM gives you a noisy approximation of your value, which is 24% correlated with your true value. And say I plot your value against the approximation according to VAM, and I do that for a bunch of teachers, and it looks like this:

So maybe your “true value” as a teacher is 58 but the VAM gave you a zero. That would be more than just frustrating, since the score is taken as an important part of your assessment. You might even lose your job. And you might get a score of zero many years in a row, even if your true score stays at 58. Each additional year makes that less likely, to be sure, but given enough teachers it is bound to happen to a handful of people, just by statistical reasoning, and if it happens to you, you will not think it’s unlikely at all.
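The thought experiment can be sketched in a few lines. This is a simulation under my own simplifying assumptions, not the actual VAM: true values and noise are both normally distributed, scores are in standard units rather than on a 1 to 100 scale, and the function names are made up for illustration.

```python
import math
import random

def simulate_scores(n_teachers, rho, seed=0):
    """Return (true_values, vam_scores) whose correlation is roughly rho.

    Assumption: both quantities are standard normal, and the VAM score is
    the true value scaled by rho plus independent noise.
    """
    rng = random.Random(seed)
    noise_weight = math.sqrt(1.0 - rho * rho)
    true_values, vam_scores = [], []
    for _ in range(n_teachers):
        t = rng.gauss(0.0, 1.0)
        v = rho * t + noise_weight * rng.gauss(0.0, 1.0)
        true_values.append(t)
        vam_scores.append(v)
    return true_values, vam_scores

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def share_of_good_teachers_scored_low(true_values, vam_scores, low_quantile=0.10):
    """Among teachers with above-median true value, the share whose VAM
    score falls in the bottom `low_quantile` of all VAM scores."""
    cutoff = sorted(vam_scores)[int(low_quantile * len(vam_scores))]
    median_true = sorted(true_values)[len(true_values) // 2]
    good = [v for t, v in zip(true_values, vam_scores) if t > median_true]
    return sum(1 for v in good if v < cutoff) / len(good)

true_values, vam_scores = simulate_scores(n_teachers=100_000, rho=0.24)
observed_rho = pearson(true_values, vam_scores)
misjudged = share_of_good_teachers_scored_low(true_values, vam_scores)

print(f"sample correlation: {observed_rho:.3f}")
print(f"above-median teachers scored in the bottom 10%: {misjudged:.1%}")
```

For comparison, if the scores were pure noise, exactly 10% of above-median teachers would land in the bottom tenth. A correlation of 0.24 pulls that share down only part of the way toward zero.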

In fact, if you’re a teacher, you should demand a scoring system that is consistently the same as a system you understand rather than positively correlated with one. If you’re working for a teachers’ union, feel free to contact me about this.

One last thing. I took the above graph from this post. These are actual VAM scores for the same teacher in the same year but for two different classes in the same subject – think 7th grade math and 8th grade math. So neither score represented above is “ground truth” like I mentioned in my thought experiment. But that makes it even more clear that the VAM is an insufficient tool, because it is only 24% correlated *with itself*.
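One way to read that number is through a textbook measurement model. This is my assumption for illustration, not something the VAM's authors state: suppose each class's score $X_i$ is the same underlying teacher effect $T$ plus independent noise $\varepsilon_i$ of equal variance. Then

$$
X_1 = T + \varepsilon_1, \qquad X_2 = T + \varepsilon_2,
$$

$$
\operatorname{corr}(X_1, X_2) = \frac{\operatorname{Var}(T)}{\operatorname{Var}(T) + \operatorname{Var}(\varepsilon)} = 0.24
\quad\Longrightarrow\quad
\operatorname{corr}(X_i, T) = \sqrt{0.24} \approx 0.49 .
$$

In that idealized setup, only about 24% of the variance in a single score reflects anything stable about the teacher, and the other 76% is noise.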

From memory, a vital aspect of the Japanese approach to quality management is strict attention to the process inputs – it is considered more important for workers to follow the correct approach than to achieve the objective, essentially to avoid encouraging workers to cut corners. With teachers teaching a relatively small number of students, it seems it would take only one kid with an undiscovered learning disorder or unreported ‘difficulties at home’ to completely torpedo a great teacher’s result metric, and that alone should be enough to mean a system of evaluating teachers solely via apparent changes in student achievement is deeply flawed.

Let me state upfront that I agree that nobody’s job should hang on a single metric, however calculated, since for no conceivable job can performance really be measured on one dimension. I’m also not American and completely divorced from the practical debate around VAM and teacher evaluation, other than what I’ve read here.

Having said that, your argument seems to privilege qualitative human judgments over ‘mathematical’ models in a way that doesn’t seem justified – in general. Human judges are fallible, and likely to have all kinds of hidden biases. A 1960s bank manager probably took more nuanced factors into account than a 1990s credit score, but those factors probably included things like ‘he went to the right school’, ‘she’s a single mother’ and ‘they’re an interracial couple’. At least if you want to put those into a credit score you have to be explicit about it and the world (or at least, internal whistleblowers) can denounce you.

In a comment to your previous entry you referred to ‘obscuring those decisions behind mathematics’. But why does mathematics have to obscure – why can’t it enlighten?

The best approach for a lot of cases seems to me to be a mathematical model that produces a default rule, with wide human scope to overrule it, with justification.

Andrew,

I actually agree. I think it’s better to understand what the rules are, clearly, and make sure they’re fair in an abstract way. After all, before we had credit scores we had a lot of racism embedded in credit decisions. I do not think that humans are somehow magically perfect at decisions.

However, with the VAM model in particular, it’s a black box which nobody understands – even the Department of Education doesn’t have a view into it because the contract with the VAM modelers is a licensing agreement whereby they never see the source code. So there are no actual default rules to discuss or overrule here – just a big machine which isn’t even particularly good at reproducing human rules, however imperfect.

We can and should demand better.

Cathy

Ok, that makes things clearer: I withdraw my concern. I was just a bit worried that you’d graduated from healthily skeptical to totally disillusioned!

Of course now what we really need is a model to evaluate evaluation methods, that incorporates bias, variance, cost of evaluating performance, cost of making a wrong decision…

Conversely, if your true VAM is zero, but the model gives you, say, scores above 40 for two years in a row, what should the response be?

It’s so hard to imagine what a “true VAM score” really is! 🙂

Sorry, mis-phrased the question: what if your “true-value” is zero (or close to it), and the VAM rates you too high, what should the response be?

I’m still not understanding how any model can compare a teacher in a rich suburban school to a teacher in a poor inner-city school?

Do you really think you can take a “75” from a Fairfax county, VA school and put them in a DC school and they are going to still rate a “75” or the inverse? A “38” in a DC school won’t get rated higher in a NoVa school.

I’m relatively certain the evidence presented in the CA Teachers case showed that student achievement was best predicted by the students and their parents, and that the teachers had little influence. So teacher ratings are going to be too dependent on their students’ demographics, since every variable in the model will depend on the students.

well done, Cathy!! I fear the use (and misuse) of mathematical models by administrators who don’t understand such things.

“So all I’m asking is, how good a replacement is the VAM? Does it generate the same scores as a trusted, in-depth qualitative assessment?”

some good links:

MathBabe, I’m grateful that you are working on this.

VAM is a linear model. It supposedly shows how a single bad teacher can ruin the lives of many students. But this is entirely symmetrical with reasoning showing how a single good teacher can enhance the lives of many students.

US educational policy is made only on the basis of eliminating bad teachers.

If we took into account the probability of error in measurement, a policy of preserving the jobs of the good teachers would look a lot like tenure.

