Why Chetty’s Value-Added Model studies leave me unconvinced

June 16, 2014

Every now and then when I complain about the Value-Added Model (VAM), people send me links to recent papers written by Raj Chetty, John Friedman, and Jonah Rockoff, like this one entitled Measuring the Impacts of Teachers II: Teacher Value-Added and Student Outcomes in Adulthood or its predecessor Measuring the Impacts of Teachers I: Evaluating Bias in Teacher Value-Added Estimates.

I think I’m supposed to come away impressed, but that’s not what happens. Let me explain.

Their data set of student scores starts in 1989, well before the current value-added teaching climate began. That means teachers weren’t teaching to the test like they are now. Therefore saying that the current VAM works because a retrograded VAM worked in 1989 and the 1990s is like saying I must like blueberry pie now because I used to like pumpkin pie. It’s comparing apples to oranges, or blueberries to pumpkins.

I’m surprised by the fact that the authors don’t seem to make any note of the difference in data quality between pre-VAM and current conditions. They should know all about feedback loops; any modeler should. And there’s nothing like telling teachers they might lose their job to create a mighty strong feedback loop. For that matter, just consider all the cheating scandals in the D.C. area where the stakes were the highest. Now that’s a feedback loop. And by the way, I’ve never said the VAM scores are totally meaningless, but just that they are not precise enough to hold individual teachers accountable. I don’t think Chetty et al address that question.

So we can’t trust old VAM data. But what about recent VAM data? Where’s the evidence that, in this climate of high-stakes testing, this model is anything but random?

If it were a good model, we’d presumably be seeing a comparison of current VAM scores and current other measures of teacher success and how they agree. But we aren’t seeing anything like that. Tell me if I’m wrong; I’ve been looking around and I haven’t seen such comparisons. And I’m sure they’ve been tried; it’s not rocket science to compare VAM scores with other scores.

The lack of such studies reminds me of how we never hear about scientific studies on the results of Weight Watchers. There’s a reason such studies never see the light of day, namely because whenever they do those studies, they decide they’re better off not revealing the results.

And if you’re thinking that it would be hard to know exactly how to rate a teacher’s teaching in a qualitative, trustworthy way, then yes, that’s the point! It’s actually not obvious how to do this, which is the real reason we should never trust a so-called “objective mathematical model” when we can’t even decide on a definition of success. We should have the conversation of what comprises good teaching, and we should involve the teachers in that, and stop relying on old data and mysterious college graduation results 10 years hence. What are current 6th grade teachers even supposed to do about studies like that?

Note I do think educators and education researchers should be talking about these questions. I just don’t think we should punish teachers arbitrarily to have that conversation. We should have a notion of best practices that slowly evolve as we figure out what works in the long-term.

So here’s what I’d love to see, and what would be convincing to me as a statistician: all sorts of qualitative ways of measuring teachers alongside their VAM scores, so that we could compare them and make sure they agree with each other and with themselves over time. In other words, at the very least we should demand an explanation of how some teachers get totally ridiculous and inconsistent scores from one year to the next and from one VAM to the next, even in the same year.
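To make the year-to-year consistency question concrete, here is a minimal sketch. It is a toy simulation, not anyone's actual VAM and not real data: the function names, the normal-noise model, and the sample size are all assumptions made for illustration. It generates two years of made-up scores whose correlation with each other is set to about 0.24 (the self-correlation figure that comes up in the comments below), then checks how often a teacher lands in the same quintile both years.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def quintiles(scores):
    """Assign each score a quintile from 0 (lowest) to 4 (highest)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    result = [0] * n
    for rank, i in enumerate(order):
        result[i] = rank * 5 // n
    return result


def simulate(n_teachers=10000, target_r=0.24, seed=0):
    """Hypothetical model: each year's score is a stable 'quality' term
    plus independent noise, weighted so that the two years correlate at
    roughly target_r. This is an assumption, not how any real VAM works."""
    rng = random.Random(seed)
    signal, noise = sqrt(target_r), sqrt(1 - target_r)
    quality = [rng.gauss(0, 1) for _ in range(n_teachers)]
    year1 = [signal * q + noise * rng.gauss(0, 1) for q in quality]
    year2 = [signal * q + noise * rng.gauss(0, 1) for q in quality]
    r = pearson(year1, year2)
    q1, q2 = quintiles(year1), quintiles(year2)
    same = sum(a == b for a, b in zip(q1, q2)) / n_teachers
    return r, same


if __name__ == "__main__":
    r, same = simulate()
    print(f"year-to-year correlation: {r:.2f}")
    print(f"share of teachers in the same quintile both years: {same:.2f}")
```

Under these assumptions, the share of teachers who stay in the same quintile should come out only modestly above the 20% you would get from pure chance. The same two helper functions could be pointed at real paired scores (VAM against principal ratings, or one VAM against another) if such data were available.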

The way things are now, the scores aren’t sufficiently sound to be used for tenure decisions. They are too noisy. And if you don’t believe me, consider that statisticians and some mathematicians agree.

We need some ground truth, people, and some common sense as well. Instead we’re seeing retired education professors pull statistics out of thin air, and it’s an all-out war of supposed mathematical objectivity against the civil servant.

  1. June 16, 2014 at 9:13 am

    Is there an objective measure by which you would rate teachers? Are you satisfied with the subjective manner in which teachers are currently rated? Is the subjective manner better than all possible objective measures? Is it better than VAM?


    • June 16, 2014 at 9:15 am

      Different people disagree on how to measure teacher effectiveness. What I’m saying is that it is in fact complicated and shouldn’t be replaced by some arbitrary, easily gamed mathematical construct.

      Another way to say it is that we always have some evaluation method with lots of decisions made (either by the model or by people), and obscuring those decisions behind mathematics is not making them go away.


      • June 16, 2014 at 11:39 am

        There is a real problem and that is the low performance of so many students in our public schools. Rather than address the problem with solutions there is a propensity to shoot down every proposal as somehow being evil and a part of class warfare. There are low performing students and there are bad teachers. They both exist. We cannot and should not throw the students out, but we should throw out the bad teachers. Is there any objective measure any of your readers can propose to define a bad teacher, besides “I know one when I see one?” Sure, there is no single value, but does that mean that we as humans, as parents, as society members cannot deal with complex or non-linear problems? Do we think that ignoring the problem will make it go away? Is the “human dignity” of the student any less important than the “human dignity” of a bad teacher?


        • June 16, 2014 at 11:56 am

          Abe,

          All true statements. The temptation to make things efficient at the cost of doing things that actually help is the biggest problem of all.

          By the way, I’m not sure why “I know one when I see one” doesn’t make more sense than some obscure mathematical model.

          Cathy


        • June 16, 2014 at 12:07 pm

          Cathy,

          We as human beings are highly biased. Too often principals define “good teachers” as yes-(wo)men. Bronx Science saw several exoduses under its now-forced-into-retirement principal. Those weren’t “bad teachers” who left. Those were teachers who were bullied by her or by her staff.

          http://www.wnyc.org/story/301834-bronx-science-sees-exodus-of-social-studies-teachers/

          So, perhaps some of us can know one when we see one (I know that both you and I can) but as a measure for retaining or firing teachers, our biases preclude that measure from being effective.

          Abe


  2. June 16, 2014 at 9:25 am

    Cathy, I know you do it for the sake of argument, but you shouldn’t be granting any concessions to an irrational set of values by which the loyal accountants and professional ideologists of an inhumane system (or “economists,” as the euphemism goes) propose to pass judgment on teachers and by extension children.

    It wouldn’t matter if the test performance data corresponded to anything real and were not constructed in a statistically invalid fashion, within the context of an incentive system that rewards profit-seeking homo economicii with grant funding in proportion to their ability to deliver seemingly significant quantifiable results that purport to measure the immeasurable. The resulting corruption of the data is the least of it, an intramural concern for statisticians.

    By what authority do these people presume to speak on the subject? Their acceptance of the idea that results on a standardized test written by non-teachers and administered by commercial enterprises should measure the value of education and its practitioners shows they know nothing about children and do not care. They are merely responding to the incentives provided under the present funding system for social science.

    They support the implicit proposition that the educational system should reduce human beings to rankable future factors in capitalist production. Note that capitalist production requires and will continue to require fewer and fewer people as a proportion of population. Sadly a system that values the non-owners solely as production factors cannot respond with a shorter work week or guaranteed income for all. The system therefore needs more justifications for why people should be declared superfluous and second-class from childhood forward. That is one of the underlying realities behind the current corporate “education reform” campaigns financed by the billionaires.

    The proponents of VAM have chosen sides in the class war waged by the 1% of the 1% against the poor. They chose to be soldiers for Gates, Broad, Bloomberg, Murdoch, Duncan et al. Here is an example of empirical inquiry that exposes the perpetrators of this attack on human dignity, rather than helping to make it more efficient. http://truth-out.org/opinion/item/18442


    • Brandon
      June 16, 2014 at 10:39 pm

      This seems more like a string of politically motivated bumper stickers than amicable, thoughtful discussion.

      “Their acceptance of the idea that results on a standardized test written by non-teachers and administered by commercial enterprises should measure the value of education and its practitioners shows they know nothing about children and do not care.”

      I would have guessed they just disagree, and maybe have reason to.


      • June 16, 2014 at 11:20 pm

        Brandon, your response says nothing about the subject.


        • Brandon
          June 17, 2014 at 12:46 am

          No, but my post down below does. I didn’t even resort to one single conspiracy theory about corporations, or insinuate that someone doesn’t care about the education of our kids simply because I disagree with them! 🙂


  3. June 16, 2014 at 9:28 am

    Who are these people and why are they giving data a bad name?

    All they really prove is just how disconnected an economic theory can get from real life.


  4. cat
    June 16, 2014 at 10:15 am

    I am not sure why people believe there is a measure of teacher ‘effectiveness’. It seems to me that if there really was a single value that could be assigned to a teacher it would mean that students are all the same. They learn the same, they have the same interests, they have the same motivation, they have the same ability to focus/concentrate/study/ad infinitum.

    This line of reasoning sort of strains my ability to believe the people involved are acting in good faith.


  5. LKT
    June 16, 2014 at 1:48 pm

    Who needs peer review? Working papers and white papers are great opinion pieces, especially since they’re usually posted open access, but what is their “Value Added” and have they been subjected to review by others skilled in the art?
    People (economists) seem to make a lot out of y=mx+b, or rather, y=mx+b-previous_score
    A better discussion of VAMs can be found here by Harris, Ingle and Rutledge (2014), in which the authors compare principal ratings of teachers versus the teacher VAM score, and they come to some very interesting conclusions since, as it turns out, many people want many different things from schools and the teachers who teach in them. Also, they actually talk to the principals who are doing the ratings!
    http://aer.sagepub.com/content/51/1/73.abstract


  6. noneya
    June 16, 2014 at 2:13 pm

    You may argue that these are invalid studies, but they do exist: https://www.google.com/search?q=scientific+studies+on+the+results+of+Weight+Watchers


  7. June 16, 2014 at 2:14 pm

    There’s quite a bit more on the facts of the case in the Vergara decision on Diane Ravitch’s blog. See, for instance, this tag.


  8. June 16, 2014 at 2:22 pm
  9. June 16, 2014 at 4:20 pm

    Reblogged this on Art of Teaching Science and commented:
    This is another significant analysis of the use of VAM scores that are being used to make tenure and retention decisions about teachers. If you haven’t read any of Dr. O’Neil’s articles, here is a great one to start with, especially given the Vergara v California tentative decision in Los Angeles.


  10. June 16, 2014 at 6:25 pm

    Let the teachers devise systems for rating economists. If anyone’s helping to destroy society, it’s the latter.


  11. Brandon
    June 16, 2014 at 10:30 pm

    “If it were a good model, we’d presumably be seeing a comparison of current VAM scores and current other measures of teacher success and how they agree. But we aren’t seeing anything like that. Tell me if I’m wrong, I’ve been looking around and I haven’t seen such comparisons. And I’m sure they’ve been tried, it’s not rocket science to compare VAM scores with other scores.”

    In the previous post I did actually link to evidence that the VAM is correlated with other teacher characteristics regarding quality. Again:

    “Moreover, several recent studies have documented that teacher absences have a strong, negative association with student achievement, providing evidence that this association is causal (Clotfelter et al. 2007, Miller et al. forthcoming). Indeed, in other work using Chicago data from a similar time period, I show that a teacher’s absences are negatively associated with principal evaluations of the teacher and with a teacher’s value-added contribution to student achievement (Jacob and Walsh 2009).”

    From Jacob and Walsh:

    “We examine the relationship between the formal ratings that principals give teachers and a variety of observable teacher characteristics, including proxies for productivity. Prior work has shown that principals can differentiate between more and less effective teachers, especially at the tails of the quality distribution, and that subjective evaluations of teachers are strongly correlated with subsequent student achievement. However, whereas prior work has relied on survey data, we consider formal ratings from a setting in which the stakes are reasonably high. We find that the ratings are correlated with an array of teacher qualities including experience for young teachers, education credentials, and teacher absenteeism. Our finding that principals reward qualities of teachers known to be related to student productivity provides reason to be optimistic about policies that would assign more weight to principal evaluations of teachers in career decisions and compensation.”

    However, this isn’t exactly what you argued in your last post. You claimed that removing tenure would cause teacher dismissal based off of noisy, or misleading information. Not only is there a correlation between the VAM and characteristics that drive teacher productivity, but principals take into account many other types of information and again:

    “Prior work has shown that principals can differentiate between more and less effective teachers, especially at the tails of the quality distribution, and that subjective evaluations of teachers are strongly correlated with subsequent student achievement.”

    If your claim is that the VAM is unreliable, that is one thing to show, but it’s entirely another to answer that point.


    • June 17, 2014 at 6:51 am

      I don’t mean “correlations.” That’s much too weak. I want to know if the model is _reliably_ grading teachers the same way that principals do. If I know that the model does that consistently, then we can replace the principals’ score with the VAM score.

      But we don’t have anything like that. Instead we have a weak correlation. Even if the correlation were higher, at the level of 75% or so, it would still need backup from humans before it could be trusted. But the VAM’s correlation _even with itself_ is more like 24%. Not good enough.

      Let me put it this way. If it was your job, and you could get scored with a machine that nobody understood but was 24% correlated with something that makes sense to you, what would you think?
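
      For a rough sense of scale, here is the arithmetic, assuming for illustration that these percentages are Pearson correlations and that one score is predicted from the other with a simple linear fit. Then r² is the share of variance in one score accounted for by the other, and √(1 − r²) is the fraction of the original spread that remains in the prediction error.

      ```latex
      \[
      r = 0.24 \;\Rightarrow\; r^2 = 0.0576 \approx 5.8\%, \qquad \sqrt{1 - r^2} = \sqrt{0.9424} \approx 0.97
      \]
      \[
      r = 0.75 \;\Rightarrow\; r^2 = 0.5625 \approx 56\%, \qquad \sqrt{1 - r^2} = \sqrt{0.4375} \approx 0.66
      \]
      ```

      Under those assumptions, a 24% correlation leaves about 97% of the spread unexplained, and even a 75% correlation leaves about two thirds of it.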

