## The arbitrary punishment of New York teacher evaluations

The Value-Added Model for teachers (VAM), currently in use all over the country, is a terrible scoring system, as I’ve described before. It is approximately a random number generator.

Even so, it’s still in use, mostly because it wields power over the teacher unions. Let me explain why I say this.

Cuomo’s new budget negotiations with the teachers’ union came up with the following rules around teacher tenure, as I understand them (readers, correct me if I’m wrong):

- It will take at least 4 years to get tenure,
- A teacher must get at least 3 “effective” or “highly effective” ratings in those four years,
- A teacher’s yearly rating depends directly on their VAM score: they are not allowed to get an “effective” or “highly effective” rating if their VAM score comes out as “ineffective.”

Now, I’m ignoring everything else about the system, because I want to distill the effect of VAM.

Let’s think through the math of how likely it is that you’d be denied tenure based only on this random number generator. We will assume only that you otherwise get good ratings from your principal and outside observations. Indeed, Cuomo’s big complaint is that 98% of teachers get good ratings, so this is a safe assumption.

My analysis depends on what qualifies as an “ineffective” VAM score, i.e. what the cutoff is. For now, let’s assume that 30% of teachers receive “ineffective” in a given year, because it has to be some number. Later on we’ll see how things change if that assumption is changed.

That means that 30% of the time, a teacher will not be able to receive an “effective” score, no matter how else they behave, and no matter what their principals or outside observations report for a given year.

Think of it as a biased coin flip, and 30% of the time – for any teacher and for any year – it lands on “ineffective”, and 70% of the time it lands on “effective.” We will ignore the other categories because they don’t matter.

How about if you look over a four year period? To avoid getting any “ineffective” coin flips, you’d need to get “effective” every year, which would happen 0.70^4 = 24% of the time. In other words, 76% of the time, you’d get at least one “ineffective” rating just by chance.

But remember, you don’t need to get an “effective” rating for all four years; you are allowed one “ineffective” rating. The chance of exactly one “ineffective” coin flip and three “effective” flips is 4 (1-0.70) 0.70^3 = 41%.

Adding those two scenarios together, it means that 65% of the time, over a four year period, you’d get sufficient VAM scores to receive tenure. But it also means that 35% of the time you wouldn’t, through no fault of your own.
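For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes exactly what the post assumes: four independent years, a 70% chance of “effective” each year, and at most one “ineffective” allowed. The function name `prob_sufficient` is mine, not part of any official formula.

```python
from math import comb


def prob_sufficient(p_effective: float, years: int = 4, allowed_ineffective: int = 1) -> float:
    """Chance of at most `allowed_ineffective` "ineffective" flips
    across `years` independent yearly coin flips."""
    q = 1.0 - p_effective  # chance of "ineffective" in any one year
    return sum(
        comb(years, k) * q**k * p_effective ** (years - k)
        for k in range(allowed_ineffective + 1)
    )


p = 0.70  # assumed yearly chance of an "effective" VAM score
all_effective = p**4               # no "ineffective" flips at all
exactly_one = 4 * (1 - p) * p**3   # exactly one "ineffective" flip
print(f"{all_effective:.0%} {exactly_one:.0%} {prob_sufficient(p):.0%}")
# prints: 24% 41% 65%
```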

This is the political power of a terrible scoring system. More than a third of teachers are being arbitrarily chosen to be punished by this opaque and unaccountable test.

Let’s go back to my assumption that 30% of teachers are deemed “ineffective.” Maybe I got this wrong, and it directly impacts my numbers above. If the overall probability of being deemed “effective” in a given year is p, then the overall chance of getting sufficient VAM scores (at most one “ineffective” in four years) will be

$$p^4 + 4p^3(1-p).$$

So if I got it totally wrong, and 98% of teachers are described as effective by VAM, this would mean almost all teachers (about 99.8%) get sufficient VAM scores.

On the other hand, remember that the reason VAM is being pushed so hard by people is that they don’t like it when evaluation systems think too many people are effective. In fact, they’d rather see arbitrary and random evaluation than see most people get through unscathed.

In other words, it is definitely more than 2% of teachers that are called “ineffective,” but I don’t know the true cutoff.

If anyone knows the true cutoff, please tell me so I can compute anew the percentage of teachers that are arbitrarily being kept from tenure.
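In the meantime, here is a rough sketch of how the answer moves with the cutoff. The cutoffs in the loop are hypothetical values I picked for illustration, not official numbers, and the calculation keeps the same assumptions as above: four independent years with at most one “ineffective” allowed.

```python
def prob_denied(share_ineffective: float) -> float:
    """Chance of two or more "ineffective" VAM ratings in four
    independent years, given the yearly share rated "ineffective"."""
    p = 1.0 - share_ineffective  # yearly chance of "effective"
    return 1.0 - (p**4 + 4 * p**3 * (1.0 - p))


# Hypothetical cutoffs, chosen only to show the trend.
for share in (0.02, 0.05, 0.10, 0.20, 0.30):
    print(f"{share:.0%} ineffective per year -> {prob_denied(share):.1%} denied by chance")
```

Swap in the real cutoff, once someone finds it, to get the corresponding percentage.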

They could instead simply award tenure on a random lottery, after negotiating with the union the number of new awards each year. This would be “fair”, right? (Maybe not fair to the students, but they never matter.) For better or worse, many programs are controlled in this way. For example, some public colleges explicitly select new students randomly from all qualifying students. Many colleges have a system which in effect is highly random, but not acknowledged to be so.

You have as much chance of making headway among *hoi polloi* with this explanation as you will of conveying why Marilyn vos Savant was right about the Monty Hall problem.

I hope you are right.

Most people who listen do come to understand that the right strategy in the Monty Hall problem is to switch, as does basically anyone you can get to play the game a few times.

Let’s not be too pessimistic about or dismissive of the public.

Is there actually a fixed cutoff that’s independent of the test results? And if so, who figures it out, and how? I mean, if there’s some good way of figuring out that some percentage of teachers are ineffective, then — drumroll — shouldn’t we just use that method to figure out who those teachers are, instead of using VAM? A good teacher evaluation system should produce a convincing estimate of the number of ineffective teachers as an output, not assume it as an input.

Right, I mean there has to be, if there are bins, which there are.

In some sense I think the VAM is just a complicated obstacle to hide the precise mechanism for this cutoff.

One of the points that Daniel Koretz makes in his excellent book about standardized testing (my review here: https://mathbabe.org/2013/07/09/measuring-up-by-daniel-koretz/) is how totally arbitrary most binning mechanisms in this area are.

Why is there even tenure? It’s not like we’re protecting academic freedom at the grade school level (the only valid reason for it, the purported reason it exists at the university level). Few other professions have guaranteed employment. If we didn’t have tenure we wouldn’t need all this other convoluted BS.

We have tenure because it makes the job more appealing to the people who are considering teaching as a career. It is not strictly necessary, but if we remove tenure and don’t improve the conditions in other ways, we will have fewer good teachers. Their alternatives will be more pleasant.

That’s already pretty much true given the current climate, of course.

Academic freedom does sometimes matter in secondary education. There was an attempt to fire my high school American history teacher, ostensibly for teaching religion. I only knew about it because I was called in as a witness. It was a ridiculous charge, but without tenure he could have been fired without cause. From what I later heard, it had to do with bad feelings between his wife and someone in the school administration.

Also, remember the Scopes Monkey Trial. History and science are areas where a lot of people want children to learn mythology. Teachers need to be protected against that.

And, as mathbabe indicates, job security is a way to get teachers to accept lower pay than they would get otherwise.

“Think of it as a biased coin flip, and 30% of the time – for any teacher and for any year – it lands on “ineffective”, and 70% of the time it lands on “effective.” We will ignore the other categories because they don’t matter.

How about if you look over a four year period? To avoid getting any “ineffective” coin flips, you’d need to get “effective” every year, which would happen 0.70^4 = 24% of the time. ”

This is assuming independence of the results from year to year. Correlation actually creates a rigorous test for how well the program works. If, for year n, P(ineffective in year n) is approximately the same as P(ineffective in year n, given ineffective in year n-1), then there is independence, and the results are crap. If, on the other hand, they are highly (and positively) correlated, the program would be at least somewhat vindicated.

Yeah, no, it’s not. That’s the first link.

I mean, it’s not 0% correlated, but it’s not highly correlated.

Possibly. Correlation in the same year for different grades might be different from correlation in different years for the same grade.

Rubinstein studied both.

The NYS Commissioner of Education sets the cut-off.

The position of NYS Commissioner of Education is now vacant, after John B. King’s resignation at the end of 2014 to take a job as Secretary Arne Duncan’s chief advisor. http://www.nytimes.com/2014/12/11/nyregion/john-king-new-york-state-education-commissioner-is-leaving-for-federal-post.html

From the NYS budget bill:

THE COMMISSIONER SHALL DETERMINE THE WEIGHTS AND SCORING RANGES FOR THE SUBCOMPONENT OR SUBCOMPONENTS OF THE STUDENT PERFORMANCE CATEGORY THAT SHALL RESULT IN A COMBINED CATEGORY RATING. THE COMMISSIONER SHALL ALSO SET PARAMETERS FOR APPROPRIATE TARGETS FOR STUDENT GROWTH FOR BOTH SUBCOMPONENTS, AND THE DEPARTMENT MUST AFFIRMATIVELY APPROVE AND SHALL HAVE THE AUTHORITY TO DISAPPROVE OR REQUIRE MODIFICATIONS OF DISTRICT PLANS THAT DO NOT SET APPROPRIATE GROWTH TARGETS, INCLUDING AFTER INITIAL APPROVAL. THE COMMISSIONER SHALL SET SUCH WEIGHTS AND PARAMETERS CONSISTENT WITH THE TERMS CONTAINED HEREIN.

Rubinstein’s analysis of the lack of year-to-year correlation of teachers’ scores in VAM:

http://garyrubinstein.teachforus.org/2012/02/28/analyzing-released-nyc-value-added-data-part-2/

The “subcomponents” in the law are (1) mandatory tests and (2) optional tests that the state will develop or approve. There’s another “component” of evaluation, based on observations.

According to this document from NYSED released in December (full slides at http://www.regents.nysed.gov/meetings/2014/December2014/Evaluation.pdf, linked from press release at http://www.nysed.gov/news/2015/2014-preliminary-statewide-evaluation-results-released) the percentage of teachers rated “ineffective” on the growth measure itself has consistently been at around 6% for the last few years (slide 5).

So I think this is not the strongest avenue of attack against the use of VAM in teacher evaluations (especially given that so many others exist). Kirabo Jackson of Northwestern said (to EdWeek) about his own research on using VAM at high school: “We know there are other ways in which we could be spending our energy to improve student outcomes… my takeaway is that this [VAM] is not it.” I think his statement applies far more broadly.

More generally, I wonder if showing the weakness of the model is even the most salient problem with this approach. Why make a whole infrastructure predicated on identifying the 10% (or 5%, or whatever) of the “worst” teachers, when its effects on the other 90% (or 95%, etc.) of teachers are at best distracting, and at worst actively making this much larger group far LESS effective at promoting student learning?

Hi,

Thanks for the link! Actually I think the relevant numbers are on page 5 of these slides, where it breaks down each of the 4 categories. In our case we are looking at anyone who lands in category 3 OR 4, which is to say 16%, 17%, and 16% in the three years represented.

If you work that out, we’re talking about roughly 13% of teachers getting denied tenure purely by dint of this ridiculous coin flip. Given that there are around 70K school teachers in the system, only some fraction of them are untenured at a given time, but more than you might expect because the turnover is terrible. Let’s guesstimate that 5% of teachers are up for tenure in a given year, which still means 3,500 teachers. 13% of that is 455 teachers who have been excluded from tenure simply based on this random number generator alone.

This is why it is unacceptable to have uneducated illiterates running educational systems. By “uneducated illiterates” I mean people who can’t explain what angular momentum is.

Meanwhile, over here in Vermont, the State Board of Education recently put out a “Statement and Resolution on the Appropriate Use of SBAC Standardized Tests and School Accountability” (http://education.vermont.gov/documents/EDU-SBE-031715_SBAC_Resolution.pdf ). It’s worth reading the whole thing, but here’s a taste: “RESOLVED that until empirical studies confirm a sound relationship between performance on the SBAC and critical and valued life outcomes (“college and career – ready”), test results should not be used to make normative and consequential judgments about schools and students…”

Thanks so much for continuing to care and to write about this. Here’s a link to a New Yorker article describing the atmosphere and the actions encouraged by reliance on test scores as the only data on which teachers and schools are evaluated:

http://www.newyorker.com/magazine/2014/07/21/wrong-answer

And let us spend a moment thinking about the God-forsaken school personnel in Atlanta who have been prosecuted under RICO statutes and face not just losing their jobs and paying fines, but serious jail time. Did these teachers and administrators just suddenly decide one day to manipulate scores, or was the pressure to do so overwhelming. . .

Linda,

Yes, thanks for tying this to that ridiculous sentencing. The worst is how the reporting sometimes actually makes it seem like these teachers are the problem with the system.

Cathy

I read this quickly so I may be wrong, but you are assuming independence in your calculations, which seems wrong. If a teacher got an effective eval in the first year, the conditional probability of an effective eval for the second year should not be 0.70 any more; it should be much higher.

It’s true, I’m assuming independence. In fact I think there’s a modest year-to-year correlation, somewhere around 20%. I think, if I took this into account, my measurements would go up a bit, and we’d be getting rid of even more teachers for dubious reasons. If you want to know why the correlation is so low, please read the original post, which links to Gary Rubinstein’s work. That low correlation is the basis of much of my outrage.
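Here is one rough way to put numbers on that, sketched in Python. It is a toy model and rests on assumptions of my own: each year’s rating is either “ineffective” or not, consecutive years follow a two-state Markov chain, and the parameter `corr` is the correlation between consecutive years’ ineffective/not-ineffective labels (which is not necessarily the same thing as the roughly 20% correlation in the underlying scores). The name `prob_denied_correlated` is hypothetical.

```python
from itertools import product


def prob_denied_correlated(share_ineffective: float, corr: float,
                           years: int = 4, allowed: int = 1) -> float:
    """Chance of more than `allowed` "ineffective" years when consecutive
    years follow a two-state Markov chain with lag-one correlation `corr`."""
    q = share_ineffective
    # Transition probabilities chosen so the yearly share stays at q
    # and consecutive labels have correlation `corr`.
    p_bad_after_bad = q + corr * (1 - q)
    p_bad_after_good = q * (1 - corr)
    denied = 0.0
    # Enumerate every possible sequence of years; True means "ineffective".
    for seq in product((False, True), repeat=years):
        prob = q if seq[0] else 1 - q
        for prev, cur in zip(seq, seq[1:]):
            p_bad = p_bad_after_bad if prev else p_bad_after_good
            prob *= p_bad if cur else 1 - p_bad
        if sum(seq) > allowed:
            denied += prob
    return denied


for corr in (0.0, 0.2):
    print(f"corr={corr}: {prob_denied_correlated(0.30, corr):.1%} denied")
```

With `corr=0.0` this reduces to the independent coin flips in the post; under this toy model a correlation of 0.2 nudges the denial rate up slightly rather than down.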
