Eugene Stern: How Value Added Models are Like Turds

May 22, 2017 Cathy O'Neil, mathbabe

This is a guest post by Eugene Stern, originally posted on his blog sensemadehere.wordpress.com.

“Why am I surrounded by statistical illiterates?” — Roger Mexico in Gravity’s Rainbow

Oops, they did it again. This weekend, the New York Times put out this profile of William Sanders, the originator of evaluating teachers using value-added models based on student standardized test results. It is statistically illiterate, uses math to mislead and intimidate, and is utterly infuriating.

Here’s the worst part:

When he began calculating value-added scores en masse, he immediately saw that the ratings fell into a “normal” distribution, or bell curve. A small number of teachers had unusually bad results, a small number had unusually good results, and most were somewhere in the middle.

And later:

Up until his death, Mr. Sanders never tired of pointing out that none of the critiques refuted the central insight of the value-added bell curve: Some teachers are much better than others, for reasons that conventional measures can’t explain.

The implication here is that value added models have scientific credibility because they look like math — they give you a bell curve, you know. That sounds sort of impressive until you remember that the bell curve is also the world’s most common model of random noise. Which is what value added models happen to be.

Just to replace the Times’s name-dropping with some actual math, bell curves are ubiquitous because of the Central Limit Theorem, which says that any variable that adds up many similar-looking but independent factors looks like a bell curve, no matter what the individual factors are. For example, the number of heads you get in 100 coin flips. Each single flip is binary, but when you flip a coin over and over, one flip doesn’t affect the next, and out comes a bell curve. Or how about height? It depends on lots of factors: heredity, diet, environment, and so on, and you get a bell curve again. The Central Limit Theorem is wonderful because it helps explain the world: it tells you why you see bell curves everywhere. It also tells you that random fluctuations that don’t mean anything tend to look like bell curves too.
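
If you'd rather see the coin-flip example than take it on faith, here is a minimal sketch. It assumes a fair coin and Python's standard pseudorandom generator; the function names, the seed, and the 10,000 repetitions are illustrative choices, not anything from the Times profile or from Sanders's models. It repeats the 100-flip experiment many times and prints a crude text histogram of the head counts.

```python
import random
import statistics
from collections import Counter


def heads_in_flips(n_flips, rng):
    """Count heads in n_flips independent fair coin flips."""
    return sum(rng.random() < 0.5 for _ in range(n_flips))


def simulate(n_trials=10_000, n_flips=100, seed=0):
    """Repeat the n_flips experiment n_trials times and return the head counts."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [heads_in_flips(n_flips, rng) for _ in range(n_trials)]


counts = simulate()
mean = statistics.mean(counts)
sd = statistics.pstdev(counts)

# Share of experiments landing within one standard deviation of the mean.
within_one_sd = sum(abs(c - mean) <= sd for c in counts) / len(counts)

print(f"mean heads: {mean:.2f}, standard deviation: {sd:.2f}")
print(f"share within one standard deviation: {within_one_sd:.2f}")

# Crude text histogram: one '#' per 20 experiments at each head count.
tally = Counter(counts)
for k in range(35, 66):
    print(f"{k:3d} {'#' * (tally[k] // 20)}")
```

The mean should land near 50 and the standard deviation near 5, and the histogram bulges in the middle and thins out on both sides. Nothing in the simulation knows anything about skill or quality: the bell shape comes purely from adding up independent coin flips.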

So, just to take another example, if I decided to rate teachers by the size of the turds that come out of their ass, I could wave around a lovely bell-shaped distribution of teacher ratings, sit back, and wait for the Times article about how statistically insightful this is. Because back in the bad old days, we didn’t know how to distinguish between good and bad teachers, but the Turd Size Model™ produces a shiny, mathy-looking distribution — so it must be correct! — and shows us that teacher quality varies for reasons that conventional measures can’t explain.

Or maybe we should just rate news articles based on turd size, so this one could get a Pulitzer.


Comments (12)

Andrew Ekstrom

May 22, 2017 at 12:15 pm

Pulitzer or poop-lister?

Katharine Sawrey

May 22, 2017 at 12:42 pm

OK, I hate the VAM as much as anyone, but I took the VAM premise as, “in the complex world of student performance on a standardized test, we don’t know exactly why students end up at the bottom, but if Mr. HatesTeaching’s students ALWAYS end up on the bottom of these tests, we’re going to assume he’s a crappy teacher.”

Ignoring that we haven’t operationalized “ALWAYS” or legitimized the tests themselves, isn’t that an OK assumption to make?

- Shrikant Kalegaonkar
  
  May 25, 2017 at 9:48 am
  
  If you can confidently isolate Mr. HatesTeaching’s test scores from those of the set of teachers, then you have a potential signal that warrants further investigation, but still cannot conclude he is a crappy teacher until more evidence is collected.
  
  (There are a lot of assumptions here, e.g. that the pool of students between teachers is consistent, that the resources/facilities available to teachers are consistent, that the incentives between teachers are consistent, and countless more!)
  
wgersen

May 22, 2017 at 1:10 pm

Reblogged this on Network Schools – Wayne Gersen and commented:
I didn’t get a chance to react to this on my own blog… and I’m glad I didn’t, because this is not only a better explanation of the flaws in Dr. Sanders’s methodology but also offers a more pungent alternative to using standardized achievement test scores!

Roger Joseph Witte

May 22, 2017 at 1:14 pm

Nice article but it misrepresents the central limit theorem which says that combining lots of independent variables, ALL OF WHICH HAVE FINITE VARIANCE, gives a normal distribution.

The caveat probably doesn’t matter for value added models but in financial modelling, omitting the caveat is equivalent to assuming that risk distributions do not have fat tails.

Shecky R

May 22, 2017 at 1:55 pm

“The Turd Size Model”… gee, the applications seem endless (or, perhaps, bottomless).

howardat58

May 22, 2017 at 2:36 pm

Mr Stern is wrong!
The turd size distribution may show a correlation between size and some critical health problem. This of course does not imply causation, either way, but it still calls for more investigation.

- Lars
  
  May 22, 2017 at 4:09 pm
  
  In my experience, health problems seem to be inversely related to turd size, but that is just anusdotal evidence.
  
pennstatemgl

May 22, 2017 at 3:35 pm

Models are tools, not Gods. Blame it on the Rational Enlightenment, whose thinkers believed determinism was the way to go. Mathematics led to certainty and predictability, which goes to the whole business of stability and order. Who can argue with numbers, hmm? A quantity is what it is. 6 is 6 and not 7, according to the “rules.” This sort of logical thinking has exploded out into the world as a means by which to maintain order.

- MikeM
  
  May 22, 2017 at 4:53 pm
  
  Let’s not go too far in this direction and throw out all such models. Here’s a poem, “Paradox,” by the late mathematician Clarence Wylie, Jr.
  
  Not truth, nor certainty. These I foreswore
  In my novitiate, as young men called
  To holy orders must abjure the world.
  “If…, then…,” this only I assert;
  And my successes are but pretty chains
  Linking twin doubts, for it is vain to ask
  If what I postulate be justified,
  Or what I prove possess the stamp of fact.
  Yet bridges stand, and men no longer crawl
  In two dimensions. And such triumphs stem
  In no small measure from the power of this game,
  Played with the thrice-attenuated shades
  Of things, has over their originals.
  How frail the wand, but how profound the spell!
  
Mel

May 23, 2017 at 8:34 am

The evil spell is cast when the article says “A small number of teachers had unusually bad results.” Were the results bad? Who knows? They were on the low tail of the bell curve …
I got a little enlightenment when I got out of school into industry. School results seemed to be marked on the curve, with a predetermined ratio of the Elect and the Damned. The technical courses at work were different in a couple of ways. They taught less open-ended subjects, and they aimed for a 100% pass rate.
There’s no a priori reason to suppose that the teachers that the VAM assigned to the bottom of the curve weren’t doing perfectly useful work.

Jack Tingle

May 23, 2017 at 9:29 am

I’d feel better about this subject if someone did a good study of the data now that people are paying attention, rather than, “Oh, look, a distinction,” or “Horrors, an erroneous method.” Controlling for parental income & education, overall school/district budget & performance, teacher qualification & ranking, measurements held in suspense & all the other measurements & factors used in similar fields like hardware life management might be good. This is not the first time in history statistics has been used to disentangle complex factors. My impression is that someone came up with a plausible model that didn’t fail on a small scale test, and ran with it.
