New Jersey at risk of implementing untested VAM-like teacher evaluation model

May 28, 2013 Cathy O'Neil, mathbabe

This is a guest post by Eugene Stern.

A big reason I love this blog is Cathy’s war on crappy models. She has posted multiple times already about the lousy performance of models that rate teachers based on year-to-year changes in student test scores (for example, read about it here). Much of the discussion focuses on the model used in New York City, but such systems have been, or are being, put in place all over the country. I want to let you know about the version now being considered for use across the river, in New Jersey. Once you’ve heard more, I hope you’ll help me try to stop it.

VAM Background

A little background if you haven’t heard about this before. Because it makes no sense to rate teachers based on students’ absolute grades or test scores (not all students start at the same place each year), the models all compare students’ test scores against some baseline. The simplest thing to do is to compare each student’s score on a test given at the end of the school year against their score on a test given at the end of the previous year. Teachers are then rated based on how much their students’ scores improved over the year.
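
The simple year-over-year comparison can be sketched in a few lines. This is only an illustration: the records, teacher names, and the choice to average raw gains are assumptions, not any district's actual procedure.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (teacher, last year's score, this year's score).
records = [
    ("Teacher A", 580, 600),
    ("Teacher A", 610, 620),
    ("Teacher B", 590, 585),
    ("Teacher B", 640, 655),
]

def average_gain_by_teacher(records):
    """Average year-over-year score change of each teacher's students."""
    gains = defaultdict(list)
    for teacher, prior, current in records:
        gains[teacher].append(current - prior)
    return {teacher: mean(g) for teacher, g in gains.items()}

ratings = average_gain_by_teacher(records)
print(ratings)
```

With these made-up numbers, Teacher A's students gain 15 points on average and Teacher B's gain 5. Nothing in the calculation says why.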

Comparing with the previous year’s score controls for the level at which students start each year, but not for other factors besides the teacher that affect how much they learn. This includes attendance, in-school environment (curriculum, facilities, other students in the class), out-of-school learning (tutoring, enrichment programs, quantity and quality of time spent with parents/caregivers), and potentially much more. Fancier models try to take these into account by comparing each student’s end of year score with a predicted score. The predicted score is based both on the student’s previous score and on factors like those above. Improvement beyond the predicted score is then attributed to the teacher as “value added” (hence the name “value-added models,” or VAM) and turned into a teacher rating in some way, often using percentiles. One such model is used to rate teachers in New York City.
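
As a rough sketch of the "actual minus predicted" idea, here is a toy version. Everything in it is an assumption made for illustration: the records are invented, attendance stands in for all the other factors, and the predicted score is simply the average among students with the same prior score and attendance group. Real value-added models, including New York City's, use far more elaborate predictions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (teacher, prior-year score, attendance group, end-of-year score).
students = [
    ("Teacher A", 600, "high", 620),
    ("Teacher A", 600, "high", 610),
    ("Teacher A", 600, "low", 600),
    ("Teacher B", 600, "high", 615),
    ("Teacher B", 600, "low", 590),
    ("Teacher B", 600, "low", 595),
]

def value_added_by_teacher(students):
    """Mean of (actual - predicted) end-of-year score per teacher.

    Assumption: the predicted score is the average end-of-year score of all
    students sharing the same prior score and attendance group.
    """
    cells = defaultdict(list)
    for _, prior, attendance, current in students:
        cells[(prior, attendance)].append(current)
    predicted = {cell: mean(values) for cell, values in cells.items()}

    residuals = defaultdict(list)
    for teacher, prior, attendance, current in students:
        residuals[teacher].append(current - predicted[(prior, attendance)])
    return {teacher: mean(r) for teacher, r in residuals.items()}

value_added = value_added_by_teacher(students)
print(value_added)
```

In this toy data the raw average gains differ by 10 points (10 for Teacher A, 0 for Teacher B), but once attendance is factored in the gap in "value added" shrinks to about 3.3 points. Which factors go into the prediction changes the answer, which is the sense in which the devil is in the details.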

It’s important to understand that there is no single value-added model, rather a family of them, and that the devil is in the details. Two different teacher rating systems, based on two models of the predicted score, may perform very differently – both across the board, and in specific locations. Different factors may be more or less important depending on where you are. For example, income differences may matter more in a district that provides few basic services, so parents have to pay to get extracurriculars for their kids. And of course the test itself matters hugely as well.

Testing the VAM models

Teacher rating models based on standardized tests have been around for 25 years or so, but two things have happened in the last decade:

Some people started to use the models in formal teacher evaluation, including tenure decisions.
Some (other) people started to test the models.

This did not happen in the order that one would normally like. Wanting to make “data-driven decisions,” many cities and states decided to start rating teachers based on “data” before collecting any data to validate whether that “data” was any good. This is a bit like building a theoretical model of how cancer cells behave, synthesizing a cancer drug in the lab based on the model, distributing that drug widely without any trials, then waiting around to see how many people die from the side effects.

The full body count isn’t in yet, but the models don’t appear to be doing well so far. To look at some analysis of VAM data in New York City, start here and here. Note: this analysis was not done by the city but by individuals who downloaded the data after the city had to make it available because of disclosure laws.

I’m not aware of any study on the validity of NYC’s VAM ratings done by anyone actually affiliated with the city – if you know of any, please tell me. Again, the people preaching data don’t seem willing to actually use data to evaluate the quality of the systems they’re putting in place.

Assuming you have more respect for data than the mucky-mucks, let’s talk about how well the models actually do. Broadly, two ways a model can fail are being biased and being noisy. The point of the fancier value-added models is to try to eliminate bias by factoring in everything other than the teacher that might affect a student’s test score. The trouble is that any serious attempt to do this introduces a bunch of noise into the model, to the degree that the ratings coming out look almost random.

You’d think that a teacher doesn’t go from awful to great or vice versa in one year, but the NYC VAM ratings show next to no correlation in a teacher’s rating from one year to the next. You’d think that a teacher either teaches math well or doesn’t, but the NYC VAM ratings show next to no correlation in a teacher’s rating teaching a subject to one grade and their rating teaching it to another – in the very same year! (Gary Rubinstein’s blog, linked above, documents these examples, and a number of others.) Again, this is one particular implementation of a general class of models, but using such noisy data to make significant decisions about teachers’ careers seems nuts.

What’s happening in New Jersey

With all this as background, let’s turn to what’s happening in New Jersey.

You may be surprised that the version of the model proposed by Chris Christie’s administration (the education commissioner is Christie appointee Chris Cerf, who helped put VAM in place in NYC) is about the simplest possible. There is no attempt to factor out bias by trying to model predicted scores, just a straight comparison between this year’s standardized test score and last year’s. For an overview, see this.

In more detail, the model groups together all students with the same score on last year’s test, and represents each student’s progress by their score on this year’s test, viewed as a percentile across this group. That’s it. A fancier version uses percentiles calculated across all students with the same score in each of the last several years. These can’t be calculated explicitly (you may not find enough students who got exactly the same score in each of the last few years), so they are estimated, using a statistical technique called quantile regression.
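
The simple version can be sketched as follows. The records are invented, and the percentile convention (share of the group scoring strictly lower, plus half the share tied) is an assumption made here; the fancier quantile-regression version is not attempted.

```python
from collections import defaultdict

# Hypothetical records: (student, last year's score, this year's score).
scores = [
    ("s1", 600, 590),
    ("s2", 600, 610),
    ("s3", 600, 630),
    ("s4", 600, 650),
    ("s5", 550, 540),
    ("s6", 550, 580),
]

def growth_percentiles(scores):
    """Percentile of each student's current score among same-prior-score peers.

    Percentile convention (an assumption): share of the group scoring strictly
    below the student, plus half the share tied with them (mid-rank).
    """
    groups = defaultdict(list)
    for _, prior, current in scores:
        groups[prior].append(current)

    result = {}
    for student, prior, current in scores:
        peers = groups[prior]
        below = sum(1 for s in peers if s < current)
        tied = sum(1 for s in peers if s == current)
        result[student] = 100 * (below + 0.5 * tied) / len(peers)
    return result

percentiles = growth_percentiles(scores)
print(percentiles)
```

Note that the only inputs are test scores. Student s5 lands at the 25th percentile of the group that scored 550 last year, and the calculation has no way of knowing anything else about s5.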

By design, both the simple and the fancy version ignore everything about a student except their test scores. As a modeler, or just as a human being, you might find it silly not to distinguish a fourth grader in a wealthy suburb who scored 600 on a standardized test from a fourth grader in the projects with the same score. At least, I don’t know where to find a modeler who doesn’t find it silly, because nobody has bothered to study the validity of using this model to rate teachers. If I’m wrong, please point me to a study.

Politics and SGP

But here we get into the shell game of politics, where rating teachers based on the model is exactly the proposal that lies at the end of an impressive trail of doubletalk. Follow the bouncing ball.

These models, we are told, differ fundamentally from VAM (which is now seen as somewhat damaged goods politically, I suspect). While VAM tried to isolate teacher contribution, these models do no such thing – they are simply measuring student progress from year to year, which, after all, is what we truly care about. The models have even been rebranded with a new name: student growth percentiles, or SGP. SGP is sold as just describing student progress rather than attributing it to teachers – there can’t be any harm in that, right? And nothing that needs validation, either. And because SGP is such a clean methodology – if you’re looking for a data-driven model to use for broad “educational assessment,” don’t get yourself into that whole VAM morass, use SGP instead!

Only before you know it, educational assessment turns into, you guessed it, rating teachers. That’s right: because these models aren’t built to rate teachers, they can focus on the things that really matter (student progress), and thus end up being – wait for it – much better for rating teachers! War is peace, friends. Ignorance is strength.

Creators of SGP

You can find a good discussion of SGP’s and their use in evaluation here, and a lot more from the same author, the impressively prolific Bruce Baker, here. Here’s a response from the creators of SGP. They maintain that information about student growth is useful (duh), and agree that differences in SGP’s should not be attributed to teachers (emphasis mine):

Large-scale assessment results are an important piece of evidence but are not sufficient to make causal claims about school or teacher quality.

SGP and teacher evaluations

But guess what?

The New Jersey Board of Ed and state education commissioner Cerf are putting in place a new teacher evaluation code, to be used this coming academic year and beyond. You can find more details here and here.

Summarizing: for math and English teachers in grades 4-8, 30% of their annual evaluation next year would be mandated by the state to come from those very same SGP’s that, according to their creators, are not sufficient to make causal claims about teacher quality. These evaluations are the primary input in tenure decisions, and can also be used to take away tenure from teachers who receive low ratings.

The proposal is not final, but is fairly far along in the regulatory approval process, and would become final in the next several months. In a recent step in the approval process, the weight given to SGP’s in the overall evaluation was reduced by 5 percentage points, from 35% to 30%. However, the 30% weight applies next year only, and in the future the state could increase the weight to as high as 50%, at its discretion.

Modeler’s Notes

Modeler’s Note #1: the precise weight doesn’t really matter. If the SGP scores vary a lot, and the other components don’t vary very much, SGP scores will drive the evaluation no matter what their weight.
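
A small made-up example shows the effect. The three teachers, the 0-100 scale, and the simple weighted average are all assumptions for illustration, not the state's actual scoring formula.

```python
# Hypothetical teachers: (name, SGP-based score, score from all other components),
# both on an assumed 0-100 scale. SGP scores vary a lot; the others barely vary.
teachers = [
    ("T1", 20, 82),
    ("T2", 50, 80),
    ("T3", 80, 79),
]

def composite(sgp, other, sgp_weight):
    """Weighted average of the SGP score and the other components."""
    return sgp_weight * sgp + (1 - sgp_weight) * other

def ranking(teachers, sgp_weight):
    """Teacher names ordered from highest to lowest composite score."""
    ordered = sorted(
        teachers,
        key=lambda t: composite(t[1], t[2], sgp_weight),
        reverse=True,
    )
    return [name for name, _, _ in ordered]

print(ranking(teachers, 0.30))
print(ranking(teachers, 0.10))
```

The other components rank these teachers T1, T2, T3, while SGP ranks them T3, T2, T1. In this example the composite follows the SGP order at a 30% weight, and still does at 10%.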

Modeler’s Note #2: just reminding you again that this data-driven framework for teacher evaluation is being put in place without any data-driven evaluation of its effectiveness. And that this is a feature, not a bug – SGP has not been tested as an attribution tool because we keep hearing that it’s not meant to be one.

In a slightly ironic twist, commissioner Cerf has responded to criticisms that SGP hasn’t been tested by pointing to a Gates Foundation study of the effectiveness of… value-added models. The study is here. It draws pretty positive conclusions about how well VAM’s work. A number of critics have argued, pretty effectively, that the conclusions are unsupported by the data underlying the study, and that the data actually shows that VAM’s work badly. For a sample, see this. For another example of a VAM-positive study that doesn’t seem to stand up to scrutiny, see this and this.

Modeler’s Role Play #1

Say you were the modeler who had popularized SGP’s. You’ve said that the framework isn’t meant to make causal claims, then you see New Jersey (and other states too, I believe) putting a teacher evaluation model in place that uses SGP to make causal claims, without testing it first in any way. What would you do?

So far, the SGP mavens who told us that “Large-scale assessment results are an important piece of evidence but are not sufficient to make causal claims about school or teacher quality” remain silent about the New Jersey initiative, as far as I know.

Modeler’s Role Play #2

Now you’re you again, and you’ve never heard about SGP’s and New Jersey’s new teacher evaluation code until today. What do you do?

I want you to help me stop this thing. It’s not in place yet, and I hope there’s still time.

I don’t think we can convince the state education department on the merits. They’ve made the call that the new evaluation system is better than the current one or any alternatives they can think of, they’re invested in that decision, and we won’t change their minds directly. But we can make it easier for them to say no than to say yes. They can be influenced – by local school administrators, state politicians, the national education community, activists, you tell me who else. And many of those people will have more open minds. If I tell you, and you tell the right people, and they tell the right people, the chain gets to the decision makers eventually.

I don’t think I could convince Chris Christie, but maybe I could convince Bruce Springsteen if I met him, and maybe Bruce Springsteen could convince Chris Christie.

VAM-anifesto

I thought we could start with a manifesto – a direct statement from the modeling community explaining why this sucks. Directed at people who can influence the politics, and signed by enough experts (let’s get some big names in there) to carry some weight with those influencers.

Can you help? Help write it, sign it, help get other people to sign it, help get it to the right audience. Know someone whose opinion matters in New Jersey? Then let me know, and help spread the word to them. Use Facebook and Twitter if it’ll help. And don’t forget good old email, phone calls, and lunches with friends.

Or, do you have a better idea? Then put it down. Here. The comments section is wide open. Let’s not fall back on criticizing the politicians for being dumb after the fact. Let’s do everything we can to keep them from doing this dumb thing in the first place.

Shame on us if we can’t make this right.

Categories: guest post, math education, modeling

Comments (19)

Abe Kohen

May 28, 2013 at 7:14 am

I hear where you are coming from. My wife’s a NYC teacher (and I’ve been one in the past as well) so I’ve looked at the NYC model before. My reaction to quantifying everything, as they did at DE Shaw, for example, has mellowed over the years. Black-Scholes is a terrible model for options pricing, but it has great utility if used properly within the bounds of its assumptions. Teacher evaluation models should be run over many years WITHOUT punishing teachers who the models say are “bad,” so that the models can be tested and calibrated, rather than testing the teachers. Be that as it may, the current system of evaluation by “observation” by a Principal or Assistant Principal lends itself to pernicious bias and has been shown to be ineffective. Rating teachers, if done properly, can weed out the bad teachers – and there are bad teachers, but IMHO it will not solve the problem of low student performance.

- David Wees
  
  May 28, 2013 at 9:31 am
  
  One problem is that we have finite resources and time to use, and we are putting resources into solving a problem whose contribution to the overall problems in education we do not really know. We do however know of solutions we could be putting our resources and time into that have been shown to work. See Pedro Noguera’s talk here, for example: https://www.youtube.com/watch?v=SxRL-aOoevE
  
  - Abe Kohen
    
    May 28, 2013 at 10:26 am
    
    An hour and 6 minutes video? Any chance you can give a brief summary of his ideas?
    
    - David Wees
      
      May 28, 2013 at 11:06 am
      
      Turn schools into communities with all resources students need to succeed (like health-care, social services, fitness centre, after school care and enrichment programs) in one place, all supervised and coordinated by one institution – the school.
      
      There’s lots of stuff in there, but if you are interested at all in alternatives or extensions to the “fix the bad teachers” model of school reform, I recommend watching his talk.
      
    - Abe Kohen
      
      May 28, 2013 at 11:12 am
      
      1. Not feasible and won’t happen anytime soon (or ever).
      2. Omits the most important part of the equation: parental involvement, responsibility and expectations. I’ve observed a strong correlation between parents showing up for Parent-Teacher Conference and student performance.
      
    - David Wees
      
      May 28, 2013 at 5:53 pm
      
      #2 isn’t happening for many children. Should we consign them to poverty because their parents are incompetent?
      
      #1 is already happening, for many children, but the services are organized by many diverse groups and there is little cohesion to the services offered. In some cases, this disorganization leads to higher costs for society (through our taxes that support these programs) and worse outcomes for the children involved. In fact, if you look at some of the highest-performing countries (European ones especially) you’ll find that they have very similar social safety nets already in place, and lower rates of poverty as a result.
      
    - Abe Kohen
      
      May 28, 2013 at 6:58 pm
      
      The evidence shows that poor Asian immigrant children do well despite poverty. Solving poverty is a noble goal, but in the USA, it doesn’t guarantee better education unless the parents are involved, which remarkably happens for children of Asian immigrants living in poverty. When 90% of parents don’t show up for Parent-Teacher conference, then the goal should be to educate the parents. Satiated ignorant children should not be the goal.
      
    - David Wees
      
      May 28, 2013 at 7:03 pm
      
      “Satiated ignorant children should not be the goal.” That’s a ridiculous re-framing of my argument. Discussion done.
      
    - lindapbrown2013
      
      May 29, 2013 at 11:45 am
      
      This is the approach used in the Harlem Children’s Zone, where health care, preschool, parenting classes and other services are provided by the same entity and connected to the schools.
      
      This being the case, it is extremely interesting that the CEO and presiding spirit of the Harlem Children’s Zone, Geoffrey Canada, is one of the main spokespeople for the group of school reformers who maintain that schools alone can do the job of helping kids escape dire circumstances. Just provide committed and talented teachers and it won’t matter if the child is homeless or hungry or afraid that he’ll be jumped on his way to and from school. Just fix those teachers!
      
    - Abe Kohen
      
      May 29, 2013 at 11:53 am
      
      Yes, it is highly successful. Is it open admissions? Is there parental involvement? Is it scalable?
      
Cynicism

May 28, 2013 at 9:34 am

Isn’t this exactly the point of the post? That the model needs to be tested to see if it does what it says that it does? I.e. that it could potentially play a part in the rating of teachers?

This feels like more evidence that we need a “Reckoner General” ( http://quomodocumque.wordpress.com/2012/03/13/should-we-have-a-reckoner-general/ ) to officially comment on policies such as this one?

JSE

May 28, 2013 at 11:01 am

Eugene, I’m basically with you, but I think it’s a rhetorical mistake to make too much of the quote from the designers of SGP, “Large-scale assessment results are an important piece of evidence but are not sufficient to make causal claims about school or teacher quality.” From the article you linked, it seems pretty clear that a lot of weight is on the word “sufficient” – in other words, they’re rejecting the use of SGP _alone_ to make hiring and firing decisions, but as I read their article they’re A-OK with SGP being used in teacher evaluation as one chunk of a weighted average.

That doesn’t necessarily mean you or I should be A-OK with it, of course! Just that I don’t think what NJ is proposing is actually an off-label use.

- Eugene
  
  May 28, 2013 at 3:59 pm
  
  Jordan: let’s look at what these guys are saying in context.
  
  Pull up their article again, and pull up the note by Bruce Baker that they’re responding to. Bruce says that SGP isn’t suitable (without lots more analysis) for use in evaluation: “Unfortunately, while SGPs are becoming quite popular across states including Massachusetts, Colorado and New Jersey, and SGPs are quickly becoming the basis for teacher effectiveness ratings, there doesn’t appear to be a whole lot of specific research addressing these potential shortcomings of SGPs.” The SGP-ers say, no, you’re confusing the measure and the use. But that only makes sense as a reply if they agree that the use being criticized (SGP’s as the basis of teacher ratings) is actually illegitimate!
  
  Let’s stay out of the swamp of parsing “descriptive” vs. “causal,” or what “the basis” or “sufficient” means. The heart of the matter is that the anti-SGP crowd is criticizing a VERY SPECIFIC use of SGP – real legislative proposals that will affect real human beings. Whatever the pro-SGP response actually means, it EITHER agrees that the specific legislative use is off-label (fuzzy as the label may be), OR doesn’t address the original anti-SGP criticism.
  
  - schoolfinance101
    
    May 29, 2013 at 10:43 am
    
    Exactly! Their response blows my mind, because it is in fact the same individuals who then take these measures on the road selling them to state officials to be used in the way they say they can’t be used. I throw up my hands at this, but applaud this blog for taking a new shot at it.
    
Greg Taylor

May 28, 2013 at 6:15 pm

Merely pointing out that the proposed evaluation model is fatally flawed won’t win this argument. Other teacher evaluation techniques suffer from fatal flaws. To win the battle you’ll need to get behind a reasonable approach for evaluation that’s implementable and motivates teachers to do good in the classroom. Focusing more on the quantity and quality of the work required by students and the feedback they receive might be a good place to start.

FogOfWar

May 28, 2013 at 10:04 pm

A separate tack would be to look further down and trace back?

So, for example. Take a class of 500 graduating seniors in high school. Assume they were each in 25 person classes over 9-12 (HS only to keep it simple) so that’s roughly 80 teachers who have been involved with the cohort. Each of those teachers has taught 5% of the cohort (another small simplifying assumption) and one would expect to see those 25 students fan out along a roughly lognormal distribution of outcomes with roughly the same mean and std. dev. as the overall cohort. If the mean varies consistently over time plus or minus that’s data to indicate the teacher is improving or holding back the students over time.

The advantage is that there are a bunch of specialized tests (in particular the SATs, Regents, APs, etc) that are only administered at the end of HS. They provide a ton of data, and this approach allows that data to be useful and not test the kids every god-damned year.

5% above is wrong–the science teacher teaches more than one science class per year so it might be more like 20-40%, which would give a better base for analysis.

It seems obvious that one year’s data using any methodology is unlikely to be predictive. I’d hope that all of the serious discussions are only looking at 5 year+ aggregations…

$0.02…

FoW

- Abe Kohen
  
  May 29, 2013 at 10:02 am
  
  FoW, in inner city high schools class sizes are more like 35 students following a diffusion process (students entering mid-stream, students leaving at random times). Teachers also come and go, and with 6-9 periods a day that’s a lot more than 80 teachers over a 4 year period. As for SATs, while participation has gone up, many inner city students do not take them. As for NYS Regents, once again many students in the worst performing schools do not take them. I am intrigued by your assumption of log-normality. I recall a report in the NYTimes which showed that the distributions for certain ethnic groups in NYC exhibited extreme leptokurtosis. Also, the 5 year survival rate for new teachers in NYC is not very high.
  
  - FogOfWar
    
    May 29, 2013 at 11:33 am
    
    Good points, all. May well be this is a no-better-or-maybe-worse approach given the data limitations, although all of those points seem like a good job for a statistician to roll up their sleeves. Anyway, just offering a thought.
    
    Lognormal assumption was just because there’s a zero bound and it’s a common curve–I’d hope anyone looking at this for more than 5 minutes would challenge that assumption.
    
    FoW
    
Abe Kohen

May 30, 2013 at 7:15 am

Another data point at:
http://www.cbsnews.com/8301-18563_162-57586766/public-charter-schools-team-up-in-cleveland/

with an interesting observation:

You can read the results in the math scores; up nearly 90 percent, or science, up more than 154 percent.

“Many people will say you have to fix poverty before you can fix education. We believe it is upside down. The only way to fix poverty is to provide our children with a quality education,” said Roskamm.
