Search Results

Keyword: ‘Teacher Value Model’

Value-added model doesn’t find bad teachers, causes administrators to cheat

There’ve been a couple of articles in the past few days about teacher Value-Added Testing that have enraged me.

If you haven’t been paying attention, the Value-Added Model (VAM) is now being used in a majority of the states (source: the Economist):

[Figure: map of which states use value-added models, via the Economist]

But it gives out nearly random numbers, as gleaned from looking at the same teachers with two scores (see this previous post). There’s a 24% correlation between the two numbers. Note that some people are awesome with respect to one score and complete shit on the other score:

[Figure: scatter plot of the same teachers’ two scores]

Final thing you need to know about the model: nobody really understands how it works. It relies on error terms of an error-riddled model. It’s opaque, and no teacher can have their score explained to them in Plain English.

Now, with that background, let’s look into these articles.

First, there’s this New York Times article from yesterday, entitled “Curious Grade for Teachers: Nearly All Pass.” It describes how teachers are now being judged using a (usually) 50/50 combination of classroom observations and VAM scores. This is different from the past, when evaluations were based on classroom observations alone.

What they’ve found is that the percentage of teachers found “effective or better” has stayed high in spite of the new system – the numbers are all over the place but typically between 90 and 99 percent of teachers. In other words, the number of teachers that are fingered as truly terrible hasn’t gone up too much. What a fucking disaster, at least according to the NYTimes, which seems to go out of its way to make its readers understand how very much high school teachers suck.

A few things to say about this.

  1. Given that the VAM is nearly a random number generator, this is good news – it means they are not trusting the VAM scores blindly. Of course, it still doesn’t mean that the right teachers are getting fired, since half of the score is random.
  2. Another point the article mentions is that failing teachers are leaving before the reports come out. We don’t actually know how many teachers are affected by these scores.
  3. Anyway, what is the right number of teachers to fire each year, New York Times? And how did you choose that number? Oh wait, you quoted someone from the Brookings Institution: “It would be an unusual profession that at least 5 percent are not deemed ineffective.” Way to explain things so scientifically! It’s refreshing to know exactly how the army of McKinsey alums approach education reform.
  4. The overall article gives us the impression that if we were really going to do our job and “be tough on bad teachers,” then we’d weight the Value-Added Model way more. But instead we’re being pussies. Wonder what would happen if we weren’t pussies?

The second article explained just that. It also came from the New York Times (h/t Suresh Naidu), and it was the story of a School Chief in Atlanta who took the VAM scores very, very seriously.

What happened next? The teachers cheated wildly, changing the answers on their students’ tests. There was a big cover-up, lots of nasty political pressure, and a lot of good people feeling really bad, blah blah blah. But maybe we can take a step back and think about why this might have happened. Can we do that, New York Times? Maybe it had to do with the $500,000 in “performance bonuses” that the School Chief got for such awesome scores?

Let’s face it, this cheating scandal, and others like it (which may never come to light), was not hard to predict (as I explain in this post). In fact, as a predictive modeler, I’d argue that this cheating problem is the easiest thing to predict about the VAM, considering how it’s being used as an opaque mathematical weapon.

The Value Added Teacher Model Sucks

Today I want you to read this post (hat tip Jordan Ellenberg) written by Gary Rubinstein, which is the post I would have written if I’d had time and had known that they released the actual Value-added Model scores to the public in machine readable format here.

If you’re a total lazy-ass and can’t get yourself to click on that link, here’s a sound bite takeaway: a scatter plot of scores for the same teacher, in the same year, teaching the same subject to kids in different grades. So, for example, a teacher might teach math to 6th graders and to 7th graders and get two different scores; how different are those scores? Here’s how different:

Yeah, so basically random. In fact a correlation of 24%. This is an embarrassment, people, and we cannot let this be how we decide whether a teacher gets tenure or how shamed a person gets in a newspaper article.
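If you want a feel for how a correlation that low can arise, here’s a toy sketch. This is simulated data, not the released scores: it assumes each teacher has a small stable “quality” signal buried in a lot of noise, which is enough to produce a correlation in the low-20s range, much like the published scatter plot.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # hypothetical number of teachers with two scores each

# Toy model: a weak stable signal per teacher, swamped by independent noise.
quality = rng.normal(size=n)
score_a = 0.5 * quality + rng.normal(size=n)  # e.g. the 6th-grade score
score_b = 0.5 * quality + rng.normal(size=n)  # e.g. the 7th-grade score

# Theoretical correlation here is 0.25 / 1.25 = 0.2.
r = np.corrcoef(score_a, score_b)[0, 1]
print(round(r, 2))
```

The point of the sketch: a scatter plot dominated by noise looks exactly like the released one, even when there is some real signal underneath.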

Just imagine if you got publicly humiliated by a model with that kind of noise which was purportedly evaluating your work, which you had no view into and thus you couldn’t argue against.

I’d love to get a meeting with Bloomberg and show him this scatter plot. I might also ask him why, if his administration is indeed so excited about “transparency,” do they release the scores but not the model itself, and why they refuse to release police reports at all.

Eugene Stern: How Value Added Models are Like Turds

This is a guest post by Eugene Stern, originally posted on his blog sensemadehere.wordpress.com.

 

“Why am I surrounded by statistical illiterates?” — Roger Mexico in Gravity’s Rainbow

Oops, they did it again. This weekend, the New York Times put out this profile of William Sanders, the originator of evaluating teachers using value-added models based on student standardized test results. It is statistically illiterate, uses math to mislead and intimidate, and is utterly infuriating.

Here’s the worst part:

When he began calculating value-added scores en masse, he immediately saw that the ratings fell into a “normal” distribution, or bell curve. A small number of teachers had unusually bad results, a small number had unusually good results, and most were somewhere in the middle.

And later:

Up until his death, Mr. Sanders never tired of pointing out that none of the critiques refuted the central insight of the value-added bell curve: Some teachers are much better than others, for reasons that conventional measures can’t explain.

The implication here is that value added models have scientific credibility because they look like math — they give you a bell curve, you know. That sounds sort of impressive until you remember that the bell curve is also the world’s most common model of random noise. Which is what value added models happen to be.

Just to replace the Times’s name dropping with some actual math, bell curves are ubiquitous because of the Central Limit Theorem, which says that any variable that depends on many similar-looking but independent factors looks like a bell curve, no matter what the unrelated factors are. For example, the number of heads you get in 100 coin flips. Each single flip is binary, but when you flip a coin over and over, one flip doesn’t affect the next, and out comes a bell curve. Or how about height? It depends on lots of factors: heredity, diet, environment, and so on, and you get a bell curve again. The central limit theorem is wonderful because it helps explain the world: it tells you why you see bell curves everywhere. It also tells you that random fluctuations that don’t mean anything tend to look like bell curves too.
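The coin-flip example is easy to check numerically. Here’s a quick simulation (assuming numpy): each total is a sum of 100 independent binary outcomes, so the totals pile up in a bell shape around 50 with a spread of about 5.

```python
import numpy as np

rng = np.random.default_rng(42)

# 100 coin flips, repeated 50,000 times.
heads = rng.binomial(n=100, p=0.5, size=50_000)

# Central Limit Theorem in action: mean ~ 50, std ~ sqrt(100 * 0.5 * 0.5) = 5,
# and a histogram of `heads` is bell-shaped.
print(heads.mean())
print(heads.std())
```

Pure randomness, lovely bell curve. That's the whole point: a bell-shaped distribution of ratings tells you nothing about whether the ratings measure anything.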

So, just to take another example, if I decided to rate teachers by the size of the turds that come out of their ass, I could wave around a lovely bell-shaped distribution of teacher ratings, sit back, and wait for the Times article about how statistically insightful this is. Because back in the bad old days, we didn’t know how to distinguish between good and bad teachers, but the Turd Size Model™ produces a shiny, mathy-looking distribution — so it must be correct! — and shows us that teacher quality varies for reasons that conventional measures can’t explain.

Or maybe we should just rate news articles based on turd size, so this one could get a Pulitzer.

Categories: Uncategorized

Teacher growth score “capricious” and “arbitrary”, judge rules

Holy crap, peoples! I’m feeling extremely corroborated this week, what with the ProPublica report on Monday and also the recent judge’s ruling on a teacher’s growth score. OK, so technically the article actually came out last week (hat tip Chris Wiggins), but I only found out about it yesterday.

Growth scores are in the same class of models as Value-added models, and I’ve complained about them at great length in this blog as well as in my upcoming book.

Here’s what happened.  A teacher named Sheri Lederman in Great Neck, New York got a growth score of 14 one year and 1 the next, even though her students did pretty well on state tests in both years.

Lederman decided to sue New York State for her “ineffective rating”, saying it was a problem of the scoring system, not her teaching. Albany Supreme Court justice Roger McDonough got the case and ruled last week.

McDonough decided to vacate her score, describing it as “arbitrary and capricious”. Here are more details on the ruling, taken from the article:

In his ruling, McDonough cited evidence that the statistical method unfairly penalizes teachers with either very high-performing students or very low-performing students. He found that Lederman’s small class size made the growth model less reliable.

He found that high-performing students are unable to show the same growth on the current tests as lower-performing students.

He was troubled by the state’s inability to explain the wide swing in Lederman’s score from year to year, even though her students performed at similar levels.

He was perplexed that the growth model rules define a fixed percentage of teachers as ineffective each year, regardless of whether student performance across the state rose or fell.

This is a great start; hopefully we’ll see fewer growth models being used in the future.

Update: here’s the text of the decision.


Who wants to be a school teacher (or a fruit picker)?

Some of you may have seen the recent New York Times article entitled Teacher Shortages Spur a Nationwide Hiring Scramble (Credentials Optional). As the title indicates, it turns out that not too many people are throwing their hat into the school teacher ring recently. And given the enormous turnover, this is bad news for the profession.

I’ve got a general rule about such headlines that I like to follow. Namely, whenever we hear about a “labor shortage” in a given profession, we should think about four things:

  1. Wages
  2. Conditions on the job
  3. Benefits, including retirement
  4. Cost/ length of training

So for school teachers, we might break it down like this:

  1. Wages – median at around $58K, has been rising a bit ahead of inflation if I’m eyeballing this graph correctly
  2. Conditions on the job – much worse in the past decade due to the Value-Added Model, and other Education Reform measures which remove autonomy and force teachers to teach to the test
  3. Benefits, including tenure and retirement – under relentless fire from gleeful Republican politicians
  4. Cost/ length of training – sizable, which means that it might take the profession quite some time to recover

When you take the above points together, you realize that it’s not a salary thing so much as an environment that has become toxic. A capable person, however earnest, would think twice before entering such an industry. This is particularly true right now, when tenure is on the chopping block but the salary hasn’t risen to compensate for the added risk.

Teachers, as a profession, are not so different from truckers, who I wrote about a couple of weeks ago. We’ve got some skilled workers whose environments have been severely degraded, and whose salaries have not risen in response. Considering the fact that the economy is somewhat better, this means people are unwilling to go get trained and qualify for such jobs. Moreover, there’s a real reason in both industries to avoid lowering the barrier to entry; we don’t want illiterate teachers nor do we want dangerous truckers. The solutions are obvious: either make their lives better or give them more money, or both.

There’s one more profession that’s going through a “labor shortage,” namely fruit pickers (hat tip Tom Adams). This is because we have many fewer Mexicans coming in for work, and Americans are generally unwilling to break their backs for a measly $11.33 per hour median wage. This is somewhat different from the other industries, because there’s really no lower bar for training, and anyone willing to do the work is given a job. There are also no benefits or job security, and obviously conditions are horrendous.

Even so, the solutions are still obvious: make the job better or pay more.


The arbitrary punishment of New York teacher evaluations

The Value-Added Model for teachers (VAM), currently in use all over the country, is a terrible scoring system, as I’ve described before. It is approximately a random number generator.

Even so, it’s still in use, mostly because it wields power over the teacher unions. Let me explain why I say this.

Cuomo’s new budget negotiations with the teacher’s union came up with the following rules around teacher tenure, as I understand them (readers, correct me if I’m wrong):

  1. It will take at least 4 years to get tenure,
  2. A teacher must get at least 3 “effective” or “highly effective” ratings in those four years,
  3. A teacher’s yearly rating depends directly on their VAM score: they are not allowed to get an “effective” or “highly effective” rating if their VAM score comes out as “ineffective.”

Now, I’m ignoring everything else about the system, because I want to distill the effect of VAM.

Let’s think through the math of how likely it is that you’d be denied tenure based only on this random number generator. We will assume only that you otherwise get good ratings from your principal and outside observations. Indeed, Cuomo’s big complaint is that 98% of teachers get good ratings, so this is a safe assumption.

My analysis depends on what qualifies as an “ineffective” VAM score, i.e. what the cutoff is. For now, let’s assume that 30% of teachers receive “ineffective” in a given year, because it has to be some number. Later on we’ll see how things change if that assumption is changed.

That means that 30% of the time, a teacher will not be able to receive an “effective” score, no matter how else they behave, and no matter what their principals or outside observations report for a given year.

Think of it as a biased coin flip, and 30% of the time – for any teacher and for any year – it lands on “ineffective”, and 70% of the time it lands on “effective.” We will ignore the other categories because they don’t matter.

How about if you look over a four year period? To avoid getting any “ineffective” coin flips, you’d need to get “effective” every year, which would happen 0.70^4 = 24% of the time. In other words, 76% of the time, you’d get at least one “ineffective” rating just by chance. 

But remember, you don’t need to get an “effective” rating in all four years; you are allowed one “ineffective” rating. The chance of exactly one “ineffective” coin flip and three “effective” flips is 4 × (1 − 0.70) × 0.70^3 ≈ 41%.

Adding those two scenarios together, it means that 65% of the time, over a four year period, you’d get sufficient VAM scores to receive tenure. But it also means that 35% of the time you wouldn’t, through no fault of your own.

This is the political power of a terrible scoring system. More than a third of teachers are being arbitrarily chosen to be punished by this opaque and unaccountable test.

Let’s go back to my assumption, that 30% of teachers are deemed “ineffective.” Maybe I got this wrong. It directly impacts my numbers above. If the overall probability of being deemed “effective” is p, then the overall chance of getting sufficient VAM scores will be p^4 + 4 p^3 (1-p).
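That formula is easy to play with. Here’s a tiny sketch (keeping in mind that the 30% “ineffective” rate is still just my assumption) that computes the chance of tenure-sufficient VAM scores for a few values of p:

```python
def tenure_prob(p):
    """Chance of at most one 'ineffective' VAM flip in four years,
    treating each year as an independent coin with P(effective) = p."""
    return p**4 + 4 * p**3 * (1 - p)

# p = 0.70 is my assumed cutoff; 0.98 matches Cuomo's complaint about
# observation-based ratings.
for p in (0.70, 0.90, 0.98):
    print(p, round(tenure_prob(p), 2))
```

With p = 0.70 you get about 0.65, i.e. roughly a third of teachers blocked by coin flips; with p = 0.98 almost nobody is.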

So if I got it totally wrong, and 98% of teachers are described as effective by the VAM model, this would mean almost all teachers get sufficient VAM scores.

On the other hand, remember that the reason VAM is being pushed so hard by people is that they don’t like it when evaluation systems think too many people are effective. In fact, they’d rather see arbitrary and random evaluation than see most people get through unscathed.

In other words, it is definitely more than 2% of teachers that are called “ineffective,” but I don’t know the true cutoff.

If anyone knows the true cutoff, please tell me so I can compute anew the percentage of teachers that are arbitrarily being kept from tenure.

Categories: education, rant, statistics

Fairness, accountability, and transparency in big data models

As I wrote about already, last Friday I attended a one day workshop in Montreal called FATML: Fairness, Accountability, and Transparency in Machine Learning. It was part of the NIPS conference for computer science, and there were tons of nerds there, and I mean tons. I wanted to give a report on the day, as well as some observations.

First of all, I am super excited that this workshop happened at all. When I left my job at Intent Media in 2011 with the intention of studying these questions and eventually writing a book about them, they were, as far as I know, on nobody else’s radar. Now, thanks to the organizers Solon and Moritz, there are communities of people, coming from law, computer science, and policy circles, coming together to exchange ideas and strategies to tackle the problems. This is what progress feels like!

OK, so on to what the day contained and my copious comments.

Hannah Wallach

Sadly, I missed the first two talks, and an introduction to the day, because of two airplane cancellations (boo American Airlines!). I arrived in the middle of Hannah Wallach’s talk, the abstract of which is located here. Her talk was interesting, and I liked her idea of having social scientists partnered with data scientists and machine learning specialists, but I do want to mention that, although there’s a remarkable history of social scientists working within tech companies – say at Bell Labs and Microsoft and such – we don’t see that in finance at all, nor does it seem poised to happen. So in other words, we certainly can’t count on social scientists to be on hand when important mathematical models are getting ready for production.

Also, I liked Hannah’s three categories of models: predictive, explanatory, and exploratory. Even though I don’t necessarily think that a given model will fall neatly into one category or the other, they still give you a way to think about what we do when we make models. As an example, we think of recommendation models as ultimately predictive, but they are (often) predicated on the ability to understand people’s desires as made up of distinct and consistent dimensions of personality (like when we use PCA or something equivalent). In this sense we are also exploring how to model human desire and consistency. For that matter I guess you could say any model is at its heart an exploration into whether the underlying toy model makes any sense, but that question is dramatically less interesting when you’re using linear regression.

Anupam Datta and Michael Tschantz

Next up Michael Tschantz reported on work with Anupam Datta that they’ve done on Google profiles and Google ads. They started with Google’s privacy policy, which I can’t find but which claims you won’t receive ads based on things like your health problems. Starting with a bunch of browsers with no cookies, and thinking of each of them as fake users, they did experiments to see what actually happened both to the ads for those fake users and to the Google ad profiles for each of those fake users. They found that, at least sometimes, they did get the “wrong” kind of ad, although whether Google can be blamed or whether the advertiser had broken Google’s rules isn’t clear. Also, they found that fake “women” and “men” (who did not differ by any other variable, including their searches) were offered drastically different ads related to job searches, with men being offered way more ads to get $200K+ jobs, although these were basically coaching sessions for getting good jobs, so again the advertisers could have decided that men are more willing to pay for such coaching.

An issue I enjoyed talking about was brought up in this talk, namely the question of whether such a finding is entirely evanescent or whether we can call it “real.” Since google constantly updates its algorithm, and since ad budgets are coming and going, even the same experiment performed an hour later might have different results. In what sense can we then call any such experiment statistically significant or even persuasive? Also, IRL we don’t have clean browsers, so what happens when we have dirty browsers and we’re logged into gmail and Facebook? By then there are so many variables it’s hard to say what leads to what, but should that make us stop trying?

From my perspective, I’d like to see more research into questions like, of the top 100 advertisers on Google, who saw the majority of the ads? What was the economic, racial, and educational makeup of those users? A similar but different (because of the auction) question would be to reverse-engineer the advertisers’ Google ad targeting methodologies.

Finally, the speakers mentioned a failure on Google’s part of transparency. In your advertising profile, for example, you cannot see (and therefore cannot change) your marriage status, but advertisers can target you based on that variable.

Sorelle Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian

Next up we had Sorelle talk to us about her work with two guys with enormous names. They think about how to make stuff fair, the heart of the question of this workshop.

First, if we included race in a resume-sorting model, we’d probably see negative impact because of historical racism. Even if we removed race but included other attributes correlated with race (say zip code) this effect would remain. And it’s hard to know exactly when we’ve removed the relevant attributes, but one thing these guys did was define that precisely.

Second, say now you have some idea of the categories that are given unfair treatment, what can you do? One thing suggested by Sorelle et al is to first rank people in each category – to assign each person a percentile in their given category – and then to use the “forgetful function” and only consider that percentile. So, if we decided at a math department that we want 40% women graduate students, to achieve this goal with this method we’d independently rank the men and women, and we’d offer enough spots to top women to get our quota and separately we’d offer enough spots to top men to get our quota. Note that, although it comes from a pretty fancy setting, this is essentially affirmative action. That’s not, in my opinion, an argument against it. It’s in fact yet another argument for it: if we know women are systemically undervalued, we have to fight against it somehow, and this seems like the best and simplest approach.

Ed Felten and Josh Kroll

After lunch Ed Felten and Josh Kroll jointly described their work on making algorithms accountable. Basically they suggested a trustworthy and encrypted system of paper trails that would support a given algorithm (doesn’t really matter which) and create verifiable proofs that the algorithm was used faithfully and fairly in a given situation. Of course, we’d really only consider an algorithm to be used “fairly” if the algorithm itself is fair, but putting that aside, this addressed the question of whether the same algorithm was used for everyone, and things like that. In lawyer speak, this is called “procedural fairness.”

So for example, if we thought we could, we might want to run the algorithm that determines punishment for drug offenses through this system, and we might find that the rules are applied differently to different people. This framework would catch that kind of problem, at least ideally.

David Robinson and Harlan Yu

Next up we talked to David Robinson and Harlan Yu about their work in Washington D.C. with policy makers and civil rights groups around machine learning and fairness. These two have been active with civil rights groups and were an important part of both the Podesta Report, which I blogged about here, and also in drafting the Civil Rights Principles of Big Data.

The question of what policy makers understand and how to communicate with them came up several times in this discussion. We decided that, to combat cherry-picked examples we see in Congressional Subcommittee meetings, we need to have cherry-picked examples of our own to illustrate what can go wrong. That sounds bad, but put it another way: people respond to stories, especially to stories with innocent victims that have been wronged. So we are on the look-out.

Closing panel with Rayid Ghani and Foster Provost

I was on the closing panel with Rayid Ghani and Foster Provost, and we each had a few minutes to speak and then there were lots of questions and fun arguments. To be honest, since I was so in the moment during this panel, and also because I was jonesing for a beer, I can’t remember everything that happened.

As I remember, Foster talked about an algorithm he had created that does its best to “explain” the decisions of a complicated black box algorithm. So in real life our algorithms are really huge and messy and uninterpretable, but this algorithm does its part to add interpretability to the outcomes of that huge black box. The example he gave was to understand why a given person’s Facebook “likes” made a black box algorithm predict they were gay: by displaying, in order of importance, which likes added the most predictive power to the algorithm.

[Aside, can anyone explain to me what happens when such an algorithm comes across a person with very few likes? I’ve never understood this very well. I don’t know about you, but I have never “liked” anything on Facebook except my friends’ posts.]

Rayid talked about his work trying to develop a system for teachers to understand which students were at risk of dropping out, and for that system to be fair, and he discussed the extent to which that system could or should be transparent.

Oh yeah, and that reminds me that, after describing my book, we had a pretty great argument about whether credit scoring models should be open source, and what that would mean, and what feedback loops that would engender, and who would benefit.

Altogether a great day, and a fantastic discussion. Thanks again to Solon and Moritz for their work in organizing it.

Chameleon models

September 29, 2014

Here’s an interesting paper I’m reading this morning (hat tip Suresh Naidu) entitled Chameleons: The Misuse of Theoretical Models in Finance and Economics written by Paul Pfleiderer. The paper introduces the useful concept of chameleon models, defined in the following diagram:

[Figure: Pfleiderer’s diagram defining chameleon models]

Pfleiderer provides some examples of chameleon models, and also takes on the Milton Friedman argument that we shouldn’t judge a model by its assumptions but rather by its predictions (personally I think this is largely dependent on the way a model is used; the larger the stakes, the more the assumptions matter).

I like the term, and I think I might use it. I also like the point he makes that it’s really about usage. Most models are harmless until they are used as political weapons. Even the value-added teacher model could be used to identify school systems that need support, although in the current climate of distorted data due to teaching to the test and cheating, I think the signal is probably very slight.


The bad teacher conspiracy

Any time I see an article about the evaluation system for teachers in New York State, I wince. People get it wrong so very often. Yesterday’s New York Times article written by Elizabeth Harris was even worse than usual.

First, her wording. She mentioned a severe drop in student reading and math proficiency rates statewide and attributed it to a change in the test to the Common Core, which she described as “more rigorous.”

The truth is closer to “students were tested on stuff that wasn’t in their curriculum.” And as you can imagine, if you are tested on stuff you didn’t learn, your score will go down (the Common Core has been plagued by a terrible roll-out, and the timing of this test is Exhibit A). Wording like this matters, because Harris is setting up her reader to attribute the falling scores to bad teachers.

Harris ends her piece with a reference to a teacher-tenure lawsuit: ‘In one of those cases, filed in Albany in July, court documents contrasted the high positive teacher ratings with poor student performance, and called the new evaluation system “deficient and superficial.” The suit said those evaluations were the “most highly predictive measure of whether a teacher will be awarded tenure.”’

In other words, Harris is painting a picture of undeserving teachers sneaking into tenure in spite of not doing their job. It’s ironic, because I actually agree with the statement that the new evaluation system is “deficient and superficial,” but in my case I think it is overly punitive to teachers – overly random, really, since it incorporates the toxic VAM model – but in her framing she is implying it is insufficiently punitive.

Let me dumb Harris’s argument down even further: How can we have 26% English proficiency among students and 94% effectiveness among teachers?! Let’s blame the teachers and question the legitimacy of tenure. 

Indeed, after reading the article I felt like looking into whether Harris is being paid by David Welch, the Silicon Valley dude who has vowed to fight teacher tenure nationwide. More likely she just doesn’t understand education and is convinced by simplistic reasoning.

In either case, she clearly needs to learn something about statistics. For that matter, so do other people who drag out this “blame the teacher” line whenever they see poor performance by students.

Because here’s the thing. Beyond obvious issues like switching the content of the tests away from the curriculum, standardized test scores everywhere are hugely dependent on the poverty levels of students. Some data:

[Chart: NAEP scores by state plotted against student poverty]

It’s not just in this country, either:

Considering how many poor kids we have in the U.S., we are actually doing pretty well.


The conclusion is that, unless you think bad teachers have somehow taken over poor schools everywhere and booted out the good teachers, and good teachers have taken over rich schools everywhere and booted out the bad teachers (which is supposed to be impossible, right?), poverty has much more of an effect than teachers.

Just to clarify this reasoning, let me give you another example: we could blame bad journalists for lower rates of newspaper readership at a given paper, but since newspaper readership is going down everywhere we’d be blaming journalists for what is a cultural issue.

Or, we could develop a process by which we congratulate specific policemen for a reduced crime rate, but then we’d have to admit that crime is down all over the country.

I’m not saying there aren’t bad teachers, because I’m sure there are. But by only focusing on rooting out bad teachers, we are ignoring an even bigger and harder problem. And no, it won’t be solved by privatizing and corporatizing public schools. We need to address childhood poverty. Here’s one more visual for the road:

americas-new-race-to-the-top1

Why Chetty’s Value-Added Model studies leave me unconvinced

Every now and then when I complain about the Value-Added Model (VAM), people send me links to recent papers written by Raj Chetty, John Friedman, and Jonah Rockoff, like this one, entitled Measuring the Impacts of Teachers II: Teacher Value-Added and Student Outcomes in Adulthood, or its predecessor, Measuring the Impacts of Teachers I: Evaluating Bias in Teacher Value-Added Estimates.

I think I’m supposed to come away impressed, but that’s not what happens. Let me explain.

Their data set of student scores starts in 1989, well before the current value-added teaching climate began. That means teachers weren't teaching to the test the way they are now. Therefore, saying that the current VAM works because a retroactively applied VAM worked in 1989 and the 1990s is like saying I must like blueberry pie now because I used to like pumpkin pie. It's comparing apples to oranges, or blueberries to pumpkins.

I’m surprised by the fact that the authors don’t seem to make any note of the difference in data quality between pre-VAM and current conditions. They should know all about feedback loops; any modeler should. And there’s nothing like telling teachers they might lose their job to create a mighty strong feedback loop. For that matter, just consider all the cheating scandals in the D.C. area where the stakes were the highest. Now that’s a feedback loop. And by the way, I’ve never said the VAM scores are totally meaningless, but just that they are not precise enough to hold individual teachers accountable. I don’t think Chetty et al address that question.

So we can’t trust old VAM data. But what about recent VAM data? Where’s the evidence that, in this climate of high-stakes testing, this model is anything but random?

If it were a good model, we'd presumably be seeing a comparison of current VAM scores with other current measures of teacher success, showing how well they agree. But we aren't seeing anything like that. Tell me if I'm wrong; I've been looking around and I haven't seen such comparisons. And I'm sure they've been tried, since it's not rocket science to compare VAM scores with other scores.

The lack of such studies reminds me of how we never hear about scientific studies on the results of Weight Watchers. There’s a reason such studies never see the light of day, namely because whenever they do those studies, they decide they’re better off not revealing the results.

And if you’re thinking that it would be hard to know exactly how to rate a teacher’s teaching in a qualitative, trustworthy way, then yes, that’s the point! It’s actually not obvious how to do this, which is the real reason we should never trust a so-called “objective mathematical model” when we can’t even decide on a definition of success. We should have the conversation of what comprises good teaching, and we should involve the teachers in that, and stop relying on old data and mysterious college graduation results 10 years hence. What are current 6th grade teachers even supposed to do about studies like that?

Note I do think educators and education researchers should be talking about these questions. I just don’t think we should punish teachers arbitrarily to have that conversation. We should have a notion of best practices that slowly evolve as we figure out what works in the long-term.

So here’s what I’d love to see, and what would be convincing to me as a statistician. If we see all sorts of qualitative ways of measuring teachers, and see their VAM scores as well, and we could compare them, and make sure they agree with each other and themselves over time. In other words, at the very least we should demand an explanation of how some teachers get totally ridiculous and inconsistent scores from one year to the next and from one VAM to the next, even in the same year.

The way things are now, the scores aren't sufficiently sound to be used for tenure decisions. They are too noisy. And if you don't believe me, consider that statisticians and some mathematicians agree.

We need some ground truth, people, and some common sense as well. Instead we’re seeing retired education professors pull statistics out of thin air, and it’s an all-out war of supposed mathematical objectivity against the civil servant.

Getting rid of teacher tenure does not solve the problem

There’s been a movement to make primary and secondary education run more like a business. Just this week in California, a lawsuit funded by Silicon Valley entrepreneur David Welch led to a judge finding that student’s constitutional rights were being compromised by the tenure system for teachers in California.

The thinking is that tenure removes the possibility of getting rid of bad teachers, and that bad teachers are what is causing the achievement gap between poor kids and well-off kids. So if we get rid of bad teachers, which is easier after removing tenure, then no child will be “left behind.”

The problem is, there’s little evidence for this very real achievement gap problem as being caused by tenure, or even by teachers. So this is a huge waste of time.

As a thought experiment, let’s say we did away with tenure. This basically means that teachers could be fired at will, say through a bad teacher evaluation score.

An immediate consequence of this would be that many of the best teachers would get other jobs. You see, one of the appeals of teaching is getting a comfortable pension at retirement, but if you have no idea when you’re being dismissed, then it makes no sense to put in the 25 or 30 years to get that pension. Plus, what with all the crazy and random value-added teacher models out there, there’s no telling when your score will look accidentally bad one year and you’ll be summarily dismissed.

People with options and skills will seek other opportunities. After all, we wanted to make it more like a business, and that’s what happens when you remove incentives in business!

The problem is you’d still need teachers. So one possibility is to have teachers with middling salaries and no job security. That means lots of turnover among the better teachers as they get better offers. Another option is to pay teachers way more to offset the lack of security. Remember, the only reason teacher salaries have been low historically is that uber competent women like Laura Ingalls Wilder had no other options than being a teacher. I’m pretty sure I’d have been a teacher if I’d been born 150 years ago.

So we either have worse teachers or education doubles in price, both bad options. And, sadly, either way we aren’t actually addressing the underlying issue, which is that pesky achievement gap.

People who want to make schools more like businesses also enjoy measuring things, and one way they like measuring things is through standardized tests like achievement scores. They blame teachers for bad scores and they claim they’re being data-driven.

Here’s the thing though, if we want to be data-driven, let’s start to maybe blame poverty for bad scores instead:

dc-public-schools-poverty-versus-reaching-ach-2010

 

I’m tempted to conclude that we should just go ahead and get rid of teacher tenure so we can wait a few years and still see no movement in the achievement gap. The problem with that approach is that we’ll see great teachers leave the profession and no progress on the actual root cause, which is very likely to be poverty and inequality, hopelessness and despair. Not sure we want to sacrifice a generation of students just to prove a point about causation.

On the other hand, given that David Welch has a lot of money and seems to be really excited by this fight, it looks like we might have no choice but to blame the teachers, get rid of their tenure, see a bunch of them leave, have a surprise teacher shortage, respond either by paying way more or reinstating tenure, and then only then finally gather the data that none of this has helped and very possibly made things worse.

Categories: education, math education, news

Interview with a middle school math teacher on the Common Core

Today’s post is an email interview with Fawn Nguyen, who teaches math at Mesa Union Junior High in southern California. Fawn is on the leadership team for UCSB Mathematics Project that provides professional development for teachers in the Tri-County area. She is a co-founder of the Thousand Oaks Math Teachers’ Circle. In an effort to share and learn from other math teachers, Fawn blogs at Finding Ways to Nguyen Students Over. She also started VisualPatterns.org to help students develop algebraic thinking, and more recently, she shares her students’ daily math talks to promote number sense. When Fawn is not teaching or writing, she is reading posts on mathblogging.org as one of the editors. She sleeps occasionally and dreams of becoming an architect when all this is done.

Importantly for the below interview, Fawn is not being measured via a value-added model. My questions are italicized.

——

I’ve been studying the rhetoric around the mathematics Common Core State Standard (CCSS). So far I’ve listened to Diane Ravitch stuff, I’ve interviewed Bill McCallum, the lead writer of the math CCSS, and I’ve also interviewed Kiri Soares, a New York City high school principal. They have very different views. Interestingly, McCallum distinguished three things: standards, curriculum, and testing. 

What do you think? Do teachers see those as three different things? Or is it a package deal, where all three things rolled into one in terms of how they’re presented?

I can’t speak for other teachers. I understand that the standards are not meant to be the curriculum, but the two are not mutually exclusive either. They can’t be. Standards inform the curriculum. This might be a terrible analogy, but I love food and cooking, so maybe the standards are the major ingredients, and the curriculum is the entrée that contains those ingredients. In the show Chopped on Food Network, the competing chefs must use all 4 ingredients to make a dish – and the prepared foods that end up on the plates differ widely in taste and presentation. We can’t blame the ingredients when the dish is blandly prepared any more than we can blame the standards when the curriculum is poorly written.

Similarly, the standards inform testing. Test items for a certain grade level cover the standards of that grade level. I'm not against testing. I'm against bad tests and a lot of them. By bad, I mean multiple-choice items that require more memorization than actual problem solving. But I'm confident we can create good multiple-choice tests, because realistically a portion of the test needs to be of this type due to costs.

The three – standards, curriculum, and testing – are not a "package deal" in the sense that the same people are not delivering them to us. But they go together; otherwise, what is school mathematics? Funny thing is we have always had the three operating in schools, but somehow the Common Core State Standards (CCSS) seem to get all the blame for the anxieties and costs connected to testing and curriculum development.

As a teacher, what’s good and bad about the CCSS?

I see a lot of good in the CCSS. This set of standards is not perfect, but it’s much better than our state standards. We can examine the standards and see for ourselves that the integrity of the standards holds up to their claims of being embedded with mathematical focus, rigor, and coherence.

Implementation of CCSS means that students and teachers can expect consistency in what is being taught at each grade level across state boundaries. This is a nontrivial effort in addressing equity. This consistency also helps teachers collaborate nationwide, and professional development for teachers will improve and become more relevant and effective.

I can only hope that textbooks will be much better because of the inherent focus and coherence in CCSS. A kid can move from Maine to California and not have to see a different state outline on the textbook, as if he'd taken on a new kind of mathematics in his new school. I went to a textbook publishers' fair recently at our district, and I remain optimistic that better products are already on their way.

We used to have every state create its own assessment; now we have two consortia, PARCC and Smarter Balanced. I've gone through the sample assessments from the latter, and they are far better than the old multiple-choice items of the CST. Kids will have to process the questions at a deeper level to show understanding. This is a good thing.

What is potentially bad about the CCSS is improper implementation, or a lack of implementation altogether. So this boils down to the most important element of the Common Core equation – the teacher. There is no doubt that many teachers, myself included, need sustained professional development to do the job right. And I don't mean just PD in making math more relevant and engaging, or in how many ways we can use technology. More importantly, we need PD in content knowledge.

It is a perverse notion to think that anyone with a college education can teach elementary mathematics. Teaching mathematics requires knowing mathematics. To know a concept is to understand it backward and forward, inside and outside, to recognize it in different forms and structures, to put it into context, to ask questions about it that lead to more questions, to know the mathematics beyond this concept. That reminds me of something a 6th grader said to me just recently, as we were working on our unit on dividing by a fraction: "My elementary teacher lied to me! She said we always get a smaller number when we divide two numbers."

Just because one can make tuna casserole does not make one a chef. (Sorry, I’m hungry.)

What are the good and bad things for kids about testing?

Testing is only good for kids when it helps them learn and become more successful – that is, when the feedback from testing informs the teacher's next moves. Testing has become such a dirty word because we over-test our kids. I'm still in the classroom after 23 years, yet I don't have the answers. I struggle with telling my kids that I value them and their learning, yet at the end of each quarter, the narrative sum of their learning is a letter grade.

Then, in the absence of helping kids learn, testing is bad.

What are the good/bad things for the teachers with all these tests?

Ideally, a good test that measures what it’s supposed to measure should help the teacher and his students. Testing must be done in moderation. Do we really need to test kids at the start of the school year? Don’t we have the results from a few months ago, right before they left for summer vacation? Every test takes time away from learning.

I’m not sure I understand why testing is bad for teachers aside from lost instructional minutes. Again, I can’t speak for other teachers. But I do sense heightened anxiety among some teachers because CCSS is new – and newness causes us to squirm in our seats and doubt our abilities. I don’t necessarily see this as a bad thing. I see it as an opportunity to learn content at a deeper conceptual level and to implement better teaching strategies.

If we look at anything long and hard enough, we are bound to find the good and the bad. I choose to focus on the positives because I can’t make the day any longer and I can’t have fewer than 4 hours of sleep a night. I want to spend my energies working with my administrators, my colleagues, my parents to bring the best I can bring into my classroom.

Is there anything else you’d like to add?

The best things about CCSS for me are not even the standards – they are the 8 Mathematical Practices. These are life-long habits that will serve students well, in all disciplines. They’re equivalent to the essential cooking techniques, like making roux and roasting garlic and braising kale and shucking oysters. Okay, maybe not that last one, but I just got back from New Orleans, and raw oysters are awesome.

I’m excited to continue to share and collaborate with my colleagues locally and online because we now have a common language! We teachers do this very hard work – day in and day out, late into the nights and into the weekends – because we love our kids and we love teaching. But we need to be mathematically competent first and foremost to teach mathematics. I want the focus to always be about the kids and their learning. We start with them; we end with them.

Categories: math, math education

An attempt to FOIL request the source code of the Value-added model

Last November I wrote to the Department of Education to make a FOIL request for the source code for the teacher value-added model (VAM).

Motivation

To explain why I’d want something like this, I think the VAM model sucks and I’d like to explore the actual source code directly. The white paper I got my hands on is cryptically written (take a look!) and doesn’t explain what the actual sensitivity to inputs are, for example. The best way to get at that is the source code.

Plus, since the New York Times and other news outlets published teachers' VAM scores after a long battle and a FOIA request (see details about this here), I figured it's only fair to also publicly release the actual black box that determines those scores.

Indeed, without knowledge of what the model consists of, the VAM scoring regime is little more than a secret set of rules, one with tremendous power over teachers and the teachers' union, and one that also enables the outrageous public shaming described above.

I think teachers deserve better, and I want to illustrate the weaknesses of the model directly on an open models platform.

The FOIL request

Here’s the email I sent to foil@schools.nyc.gov on 11/22/13:

Dear Records Access Officer for the NYC DOE,

I’m looking to get a copy of the source code for the most recent value-added teacher model through a FOIA request. There are various publicly available descriptions of such models, for example here, but I’d like the actual underlying code.

Please tell me if I’ve written to the correct person for this FOIA request, thank you very much.

Best,
Cathy O’Neil

Since my FOIL request

In response to my request, on 12/3/13, 1/6/14, and 2/4/14 I got letters saying stuff was taking a long time since my request was so complicated. Then yesterday I got the following response:
Screen Shot 2014-03-07 at 8.49.57 AM

If you follow the link you’ll get another white paper, this time from 2012-2013, which is exactly what I said I didn’t want in my original request.

I wrote back, not that it’s likely to work, and after reminding them of the text of my original request I added the following:


What you sent me is the newer version of the publicly available description of the model, very much like my link above. I specifically asked for the underlying code. That would be in a programming language like Python, C++, or Java.

Can you come back to me with the actual code? Or whom should I ask?

Thanks very much,
Cathy

It strikes me as strange that it took them more than 3 months to send me a link to a white paper instead of the source code I requested. Plus, I'm not sure what they mean by "SED" – I'm guessing it means these guys – but I'm not sure exactly whom to send a new FOIL request to.

Am I getting the runaround? Any suggestions?

Categories: modeling, statistics

Cool open-source models?

I’m looking to develop my idea of open models, which I motivated here and started to describe here. I wrote the post in March 2012, but the need for such a platform has only become more obvious.

I’m lucky to be working with a super fantastic python guy on this, and the details are under wraps, but let’s just say it’s exciting.

So I’m looking to showcase a few good models to start with, preferably in python, but the critical ingredient is that they’re open source. They don’t have to be great, because the point is to see their flaws and possible to improve them.

  1. For example, I put in a FOIA request a couple of days ago to get the current teacher value-added model from New York City.
  2. A friend of mine, Marc Joffe, has an open source municipal credit rating model. It's not in python, but I'm hopeful we can work with it anyway.
  3. I’m in search of an open source credit scoring model for individuals. Does anyone know of something like that?
  4. They don’t have to be creepy! How about a Nate Silver – style weather model?
  5. Or something that relies on open government data?
  6. Can we get the Reinhart-Rogoff model?

The idea here is to get the model, not necessarily the data (although even better if it can be attached to data and updated regularly). And once we get a model, we’d build interactives with the model (like this one), or at least the tools to do so, so other people could build them.

At its core, the point of open models is this: you don’t really know what a model does until you can interact with it. You don’t know if a model is robust unless you can fiddle with its parameters and check. And finally, you don’t know if a model is best possible unless you’ve let people try to improve it.

New Jersey at risk of implementing untested VAM-like teacher evaluation model

This is a guest post by Eugene Stern.

A big reason I love this blog is Cathy’s war on crappy models. She has posted multiple times already about the lousy performance of models that rate teachers based on year-to-year changes in student test scores (for example, read about it here). Much of the discussion focuses on the model used in New York City, but such systems have been, or are being, put in place all over the country. I want to let you know about the version now being considered for use across the river, in New Jersey. Once you’ve heard more, I hope you’ll help me try to stop it.

VAM Background

A little background if you haven’t heard about this before. Because it makes no sense to rate teachers based on students’ absolute grades or test scores (not all students start at the same place each year), the models all compare students’ test scores against some baseline. The simplest thing to do is to compare each student’s score on a test given at the end of the school year against their score on a test given at the end of the previous year. Teachers are then rated based on how much their students’ scores improved over the year.
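As a sketch, that "simplest thing" amounts to a few lines of Python. The teacher names and scores here are made up purely for illustration:

```python
from collections import defaultdict

def simple_gain_ratings(records):
    """Rate each teacher by the average score gain of their students.

    records: iterable of (teacher, last_year_score, this_year_score) triples.
    """
    gains = defaultdict(list)
    for teacher, prev, curr in records:
        gains[teacher].append(curr - prev)
    # Each teacher's rating is the mean gain across their students.
    return {t: sum(g) / len(g) for t, g in gains.items()}

data = [("A", 500, 520), ("A", 480, 495), ("B", 510, 505), ("B", 530, 540)]
ratings = simple_gain_ratings(data)  # {'A': 17.5, 'B': 2.5}
```

Teacher A's students gained more on average, so A gets a higher rating – and everything besides the two test scores is ignored, which is precisely the problem discussed next.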

Comparing with the previous year’s score controls for the level at which students start each year, but not for other factors beside the teacher that affect how much they learn. This includes attendance, in-school environment (curriculum, facilities, other students in the class), out-of-school learning (tutoring, enrichment programs, quantity and quality of time spent with parents/caregivers), and potentially much more. Fancier models try to take these into account by comparing each student’s end of year score with a predicted score. The predicted score is based both on the student’s previous score and on factors like those above. Improvement beyond the predicted score is then attributed to the teacher as “value added” (hence the name “value-added models,” or VAM) and turned into a teacher rating in some way, often using percentiles. One such model is used to rate teachers in New York City.
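A hedged sketch of that fancier scheme: regress this year's scores on last year's scores plus covariates, then credit each student's residual to the teacher. This shows only the general shape of such models, not any district's actual implementation, and the covariate here is a placeholder:

```python
import numpy as np

def value_added_residuals(prev_scores, covariates, curr_scores):
    """Residuals from a linear prediction of this year's scores.

    prev_scores: shape (n,); covariates: shape (n, k); curr_scores: shape (n,).
    The residual (actual minus predicted) is what these models call
    the teacher's "value added" for that student."""
    X = np.column_stack([np.ones(len(prev_scores)), prev_scores, covariates])
    beta, *_ = np.linalg.lstsq(X, curr_scores, rcond=None)
    return curr_scores - X @ beta

# Tiny demo with made-up numbers: when scores are perfectly predictable
# from last year's score alone, every residual is zero.
prev = np.array([500.0, 480.0, 510.0, 530.0])
poverty = np.array([[0.0], [1.0], [0.0], [1.0]])  # hypothetical covariate
curr = 0.9 * prev + 60                            # exactly linear in prev
res = value_added_residuals(prev, poverty, curr)  # all close to zero
```

A teacher's rating is then some aggregate of their students' residuals, often converted to a percentile.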

It’s important to understand that there is no single value-added model, rather a family of them, and that the devil is in the details. Two different teacher rating systems, based on two models of the predicted score, may perform very differently – both across the board, and in specific locations. Different factors may be more or less important depending on where you are. For example, income differences may matter more in a district that provides few basic services, so parents have to pay to get extracurriculars for their kids. And of course the test itself matters hugely as well.

Testing the VAM models

Teacher rating models based on standardized tests have been around for 25 years or so, but two things have happened in the last decade:

  1. Some people started to use the models in formal teacher evaluation, including tenure decisions.
  2. Some (other) people started to test the models.

This did not happen in the order that one would normally like. Wanting to make “data-driven decisions,” many cities and states decided to start rating teachers based on “data” before collecting any data to validate whether that “data” was any good. This is a bit like building a theoretical model of how cancer cells behave, synthesizing a cancer drug in the lab based on the model, distributing that drug widely without any trials, then waiting around to see how many people die from the side effects.

The full body count isn’t in yet, but the models don’t appear to be doing well so far. To look at some analysis of VAM data in New York City, start here and here. Note: this analysis was not done by the city but by individuals who downloaded the data after the city had to make it available because of disclosure laws.

I’m not aware of any study on the validity of NYC’s VAM ratings done by anyone actually affiliated with the city – if you know of any, please tell me. Again, the people preaching data don’t seem willing to actually use data to evaluate the quality of the systems they’re putting in place.

Assuming you have more respect for data than the mucky-mucks, let’s talk about how well the models actually do. Broadly, two ways a model can fail are being biased and being noisy. The point of the fancier value-added models is to try to eliminate bias by factoring in everything other than the teacher that might affect a student’s test score. The trouble is that any serious attempt to do this introduces a bunch of noise into the model, to the degree that the ratings coming out look almost random.

You’d think that a teacher doesn’t go from awful to great or vice versa in one year, but the NYC VAM ratings show next to no correlation in a teacher’s rating from one year to the next. You’d think that a teacher either teaches math well or doesn’t, but the NYC VAM ratings show next to no correlation in a teacher’s rating teaching a subject to one grade and their rating teaching it to another – in the very same year!  (Gary Rubinstein’s blog, linked above, documents these examples, and a number of others.)  Again, this is one particular implementation of a general class of models, but using such noisy data to make significant decisions about teachers’ careers seems nuts.

What’s happening in New Jersey

With all this as background, let’s turn to what’s happening in New Jersey.

You may be surprised that the version of the model proposed by Chris Christie‘s administration (the education commissioner is Christie appointee Chris Cerf, who helped put VAM in place in NYC) is about the simplest possible. There is no attempt to factor out bias by trying to model predicted scores, just a straight comparison between this year’s standardized test score and last year’s.  For an overview, see this.

In more detail, the model groups together all students with the same score on last year's test, and represents each student's progress by their score on this year's test, viewed as a percentile across this group. That's it. A fancier version uses percentiles calculated across all students with the same scores in each of the last several years. These can't be calculated explicitly (you may not find enough students who got exactly the same score in each of the last few years), so they are estimated, using a statistical technique called quantile regression.
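Here's a minimal sketch of the simple version in Python. This is my reading of the description above, not New Jersey's actual code, and the mid-rank handling of ties is my own choice:

```python
from bisect import bisect_left, bisect_right

def student_growth_percentile(prev_score, curr_score, cohort):
    """Percentile of a student's current score among all students who had
    the same score last year -- the simple SGP described above.

    cohort: list of (last_year_score, this_year_score) pairs statewide."""
    peers = sorted(c for p, c in cohort if p == prev_score)
    if not peers:
        return None
    below = bisect_left(peers, curr_score)
    ties = bisect_right(peers, curr_score) - below
    return 100.0 * (below + 0.5 * ties) / len(peers)  # mid-rank for ties

# Toy cohort: three students scored 400 last year, one scored 500.
cohort = [(400, 410), (400, 420), (400, 430), (500, 505)]
sgp = student_growth_percentile(400, 420, cohort)  # 50.0: middle of the peers
```

The fancier version replaces the exact-match cohort with quantile regression over several years of scores, but the output is the same kind of percentile.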

By design, both the simple and the fancy version ignore everything about a student except their test scores. As a modeler, or just as a human being, you might find it silly not to distinguish between a fourth grader in a wealthy suburb who scored 600 on a standardized test and a fourth grader in the projects with the same score. At least, I don't know where to find a modeler who doesn't find it silly, because nobody has bothered to study the validity of using this model to rate teachers. If I'm wrong, please point me to a study.

Politics and SGP

But here we get into the shell game of politics, where rating teachers based on the model is exactly the proposal that lies at the end of an impressive trail of doubletalk.  Follow the bouncing ball.

These models, we are told, differ fundamentally from VAM (which is now seen as somewhat damaged goods politically, I suspect). While VAM tried to isolate the teacher's contribution, these models do no such thing – they simply measure student progress from year to year, which, after all, is what we truly care about. The models have even been rebranded with a new name: student growth percentiles, or SGP. SGP is sold as merely describing student progress rather than attributing it to teachers – there can't be any harm in that, right? – and therefore as nothing that needs validation, either. And SGP is such a clean methodology that the pitch writes itself: if you're looking for a data-driven model to use for broad "educational assessment," don't get yourself into that whole VAM morass, use SGP instead!

Only before you know it, educational assessment turns into, you guessed it, rating teachers. That’s right: because these models aren’t built to rate teachers, they can focus on the things that really matter (student progress), and thus end up being – wait for it – much better for rating teachers! War is peace, friends. Ignorance is strength.

Creators of SGP

You can find a good discussion of SGP’s and their use in evaluation here, and a lot more from the same author, the impressively prolific Bruce Baker, here.  Here’s a response from the creators of SGP. They maintain that information about student growth is useful (duh), and agree that differences in SGP’s should not be attributed to teachers (emphasis mine):

Large-scale assessment results are an important piece of evidence but are not sufficient to make causal claims about school or teacher quality.

SGP and teacher evaluations

But guess what?

The New Jersey Board of Ed and state education commissioner Cerf are putting in place a new teacher evaluation code, to be used this coming academic year and beyond. You can find more details here and here.

Summarizing: for math and English teachers in grades 4-8, 30% of their annual evaluation next year would be mandated by the state to come from those very same SGP’s that, according to their creators, are not sufficient to make causal claims about teacher quality. These evaluations are the primary input in tenure decisions, and can also be used to take away tenure from teachers who receive low ratings.

The proposal is not final, but it is fairly far along in the regulatory approval process and would become final in the next several months. In a recent step in that process, the weight given to SGP's in the overall evaluation was reduced by five percentage points, from 35% to 30%. However, the 30% weight applies next year only, and in the future the state could increase it to as high as 50%, at its discretion.

Modeler’s Notes

Modeler’s Note #1: the precise weight doesn’t really matter. If the SGP scores vary a lot, and the other components don’t vary very much, SGP scores will drive the evaluation no matter what their weight.

Modeler’s Note #2: just reminding you again that this data-driven framework for teacher evaluation is being put in place without any data-driven evaluation of its effectiveness. And that this is a feature, not a bug – SGP has not been tested as an attribution tool because we keep hearing that it’s not meant to be one.

In a slightly ironic twist, commissioner Cerf has responded to criticisms that SGP hasn’t been tested by pointing to a Gates Foundation study of the effectiveness of… value-added models.  The study is here.  It draws pretty positive conclusions about how well VAM’s work.  A number of critics have argued, pretty effectively, that the conclusions are unsupported by the data underlying the study, and that the data actually shows that VAM’s work badly.  For a sample, see this.  For another example of a VAM-positive study that doesn’t seem to stand up to scrutiny, see this and this.

Modeler’s Role Play #1

Say you were the modeler who had popularized SGP's.  You've said that the framework isn't meant to make causal claims, then you see New Jersey (and other states too, I believe) putting a teacher evaluation model in place that uses SGP to make causal claims, without testing it first in any way. What would you do?

So far, the SGP mavens who told us that “Large-scale assessment results are an important piece of evidence but are not sufficient to make causal claims about school or teacher quality” remain silent about the New Jersey initiative, as far as I know.

Modeler’s Role Play #2

Now you’re you again, and you’ve never heard about SGP’s and New Jersey’s new teacher evaluation code until today.  What do you do?

I want you to help me stop this thing.  It’s not in place yet, and I hope there’s still time.

I don’t think we can convince the state education department on the merits.  They’ve made the call that the new evaluation system is better than the current one or any alternatives they can think of, they’re invested in that decision, and we won’t change their minds directly.  But we can make it easier for them to say no than to say yes.  They can be influenced – by local school administrators, state politicians,  the national education community, activists, you tell me who else.  And many of those people will have more open minds.  If I tell you, and you tell the right people, and they tell the right people, the chain gets to the decision makers eventually.

I don’t think I could convince Chris Christie, but maybe I could convince Bruce Springsteen if I met him, and maybe Bruce Springsteen could convince Chris Christie.

VAM-anifesto

I thought we could start with a manifesto – a direct statement from the modeling community explaining why this sucks. Directed at people who can influence the politics, and signed by enough experts (let’s get some big names in there) to carry some weight with those influencers.

Can you help? Help write it, sign it, help get other people to sign it, help get it to the right audience. Know someone whose opinion matters in New Jersey? Then let me know, and help spread the word to them. Use Facebook and Twitter if it’ll help. And don’t forget good old email, phone calls, and lunches with friends.

Or, do you have a better idea? Then put it down. Here. The comments section is wide open. Let’s not fall back on criticizing the politicians for being dumb after the fact.  Let’s do everything we can to keep them from doing this dumb thing in the first place.

Shame on us if we can’t make this right.

Modeling in Plain English

I’ve been enjoying my new job at Johnson Research Labs, where I spend a majority of the time editing my book with my co-author Rachel Schutt. It’s called Doing Data Science (now available for pre-purchase at Amazon), and it’s based on these notes I took last semester at Rachel’s Columbia class.

Recently I’ve been working on Brian Dalessandro‘s chapter on logistic regression. Before getting into the brass tacks of that algorithm, which is especially useful when you are trying to predict a binary outcome (i.e. a 0 or 1 outcome like “will click on this ad”), Brian discusses some common constraints to models.

The one that’s particularly interesting to me is what he calls “interpretability”. His example of an interpretability constraint is really good: it turns out that credit card companies have to be able to explain to people why they’ve been rejected. Brain and I tracked down the rule to this FTC website, which explains the rights of consumers who own credit cards. Here’s an excerpt where I’ve emphasized the key sentences:

You Also Have The Right To…

  • Have credit in your birth name (Mary Smith), your first and your spouse’s last name (Mary Jones), or your first name and a combined last name (Mary Smith Jones).
  • Get credit without a cosigner, if you meet the creditor’s standards.
  • Have a cosigner other than your spouse, if one is necessary.
  • Keep your own accounts after you change your name, marital status, reach a certain age, or retire, unless the creditor has evidence that you’re not willing or able to pay.
  • Know whether your application was accepted or rejected within 30 days of filing a complete application.
  • Know why your application was rejected. The creditor must tell you the specific reason for the rejection or that you are entitled to learn the reason if you ask within 60 days. An acceptable reason might be: “your income was too low” or “you haven’t been employed long enough.” An unacceptable reason might be “you didn’t meet our minimum standards.” That information isn’t specific enough.
  • Learn the specific reason you were offered less favorable terms than you applied for, but only if you reject these terms. For example, if the lender offers you a smaller loan or a higher interest rate, and you don’t accept the offer, you have the right to know why those terms were offered.
  • Find out why your account was closed or why the terms of the account were made less favorable, unless the account was inactive or you failed to make payments as agreed.

The result of this rule is that credit card companies must use simple models, probably decision trees, to make their rejection decisions.
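Here's a sketch of why simple models satisfy that rule: every rejection path through a shallow decision tree reads off as a specific reason. The features and thresholds below are made up for illustration, not taken from any real lender:

```python
# A toy, tree-like credit decision: each rejection comes bundled with
# a specific, FTC-acceptable reason. (Thresholds are hypothetical.)

def credit_decision(income, months_employed):
    if income < 20000:
        return ("reject", "your income was too low")
    if months_employed < 12:
        return ("reject", "you haven't been employed long enough")
    return ("accept", None)

decision, reason = credit_decision(income=15000, months_employed=30)
print(decision, reason)
```

A random forest averaging hundreds of such trees might be more accurate, but there is no single path to read a reason off of, which is exactly the interpretability problem.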

It’s a new way to think about modeling choice, to be sure. It doesn’t necessarily make for “better” decisions from the point of view of the credit card company: random forests, a generalization of decision trees, are known to be more accurate, but are arbitrarily more complicated to explain.

So it matters what you’re optimizing for, and in this case the regulators have decided we’re optimizing for interpretability rather than accuracy. I think this is appropriate, given that consumers are at the mercy of these decisions and relatively powerless to act against them (although the FTC site above gives plenty of advice to people who have been rejected, mostly about how to raise their credit scores).

Three points to make about this. First, I’m reading the Bankers New Clothes, written by Anat Admati and Martin Hellwig (h/t Josh Snodgrass), which is absolutely excellent – I’m planning to write up a review soon. One thing they explain very clearly is the cost of regulation (specifically, higher capital requirements) from the bank’s perspective versus from the taxpayer’s perspective, and how it genuinely seems “expensive” to a bank but is actually cost-saving to the general public. I think the same thing could be said above for the credit card interpretability rule.

Second, it makes me wonder what else one could regulate in terms of Plain English modeling. For example, what would happen if we added that requirement to, say, the teacher value-added model? Would we get much-needed feedback to teachers like, "You don't have enough student participation"? Oh wait, no. The model only looks at student test scores, so would only be able to give the following kind of feedback: "You didn't raise scores enough. Teach to the test more."

In other words, what I like about the “Modeling in Plain English” idea is that you have to be able to first express and second back up your reasons for making decisions. It may not lead to ideal accuracy on the part of the modeler but it will lead to much greater clarity on the part of the modeled. And we could do with a bit more clarity.

Finally, what about online loans? Do they have any such interpretability rule? I doubt it. In fact, if I’m not wrong, they can use any information they can scrounge up about someone to decide on who gets a loan, and they don’t have to reveal their decision-making process to anyone. That seems unreasonable to me.


Should the U.S. News & World Reports college ranking model be open source?

I had a great time giving my “Weapons of Math Destruction” talk in San Diego, and the audience was fantastic and thoughtful.

One question that someone asked was whether the US News & World Reports college ranking model should be forced to be open sourced – wouldn’t that just cause colleges to game the model?

First of all, colleges are already widely gaming the model and have been for some time. And that gaming is a distraction and has been heading colleges in directions away from good instruction, which is a shame.

And if you suggest that they change the model all the time to prevent this, then you’ve got an internal model of this model that needs adjustment. They might be tinkering at the edges but overall it’s quite clear what’s going into the model: namely, graduation rates, SAT scores, number of Ph.D’s on staff, and so on. The exact percentages change over time but not by much.

The impact that this model has had on education and how universities apportion resources has been profound. Academic papers have been written on the law school version of this story.

Moreover, the tactics that US News & World Reports uses to enforce their dominance of the market are bullying, as you can learn from the President of Reed College, which refuses to be involved.

Back to the question. Just as opening up all data is neither reasonable nor desirable (first, there are serious privacy issues; second, certain groups have natural advantages in exploiting openly shared resources), opening up all models is similarly problematic.

However, certain data should surely be open: for example, the laws of our country, that we are all responsible to know, should be freely available to us (something that Aaron Swartz understood and worked towards). How can we be held responsible for laws we can’t read?

Similarly, public-facing models, such as credit scoring models and teacher value-added models, should absolutely be open and accessible to the public. If I’m being judged and measured and held accountable by some model in my daily life as a citizen, that has real impact on how my future will unfold, then I should know how that process works.

And if you complain about the potential gaming of those public-facing models, I’d answer: if they are gameable then they shouldn’t be used, considering the impact they have on so many people’s lives. Because a gameable model is a weak model, with proxies that fail.

Another way to say this is we should want someone to “game” the credit score model if it means they pay their bills on time every month (I wrote about this here).

Back to the US News & World Report model. Is it public facing? I’m no lawyer but I think a case can be made that it is, and that the public’s trust in this model makes it a very important model indeed. Evidence can be gathered by measuring  the extent to which colleges game the model, which they only do because the public cares so much about the rankings.

Even so, what difference would that make, to open it up?

In an ideal world, where the public is somewhat savvy about what models can and cannot do, opening up the US News & World Reports college ranking model would result in people losing faith in it. They’d realize that it’s no more valuable than an opinion from a highly vocal uncle of theirs who is obsessed with certain metrics and blind to individual eccentricities and curriculums that may be a perfect match for a non-conformist student. It’s only one opinion among many, and not to be religiously believed.

But this isn’t an ideal world, and we have a lot of work to do to get people to understand models as opinions in this sense, and to get people to stop trusting them just because they’re mathematically presented.

An AMS panel to examine public math models?

On Saturday I gave a talk at the AGNES conference to a room full of algebraic geometers.  After introducing myself and putting some context around my talk, I focused on a few models:

  • VaR,
  • VAM,
  • Credit scoring,
  • E-scores (online version of credit scores), and
  • The h-score model (I threw this in for the math people and because it’s an egregious example of a gameable model).

I wanted to formalize the important and salient properties of a model, and I came up with this list:

  • Name – note the name often gives off a whiff of political manipulation by itself
  • Underlying model – regression? decision tree?
  • Underlying assumptions – normal distribution of market returns?
  • Input/output – dirty data?
  • Purported/political goal – how is it actually used vs. how its advocates claim they’ll use it?
  • Evaluation method – every model should come with one. Not every model does. A red flag.
  • Gaming potential – how does being modeled cause people to act differently?
  • Reach – how universal and impactful is the model and its gaming?
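The list above could be written down as a structure. Here's one possible "model audit card" (the field names are mine, and the VAM entries just summarize claims made elsewhere in this post):

```python
from dataclasses import dataclass

@dataclass
class ModelAuditCard:
    """One field per salient property from the list above."""
    name: str                 # does the name itself spin the model?
    underlying_model: str     # regression? decision tree?
    assumptions: str          # e.g. normally distributed returns
    data_quality: str         # dirty inputs or outputs?
    purported_vs_actual: str  # claimed use vs. actual use
    evaluation_method: str    # "none" is a red flag
    gaming_potential: str     # how does being scored change behavior?
    reach: str                # how many people does it touch?

vam = ModelAuditCard(
    name="Value-Added Model (sounds like it measures added value)",
    underlying_model="relies on error terms of an error-riddled model",
    assumptions="test scores are a good proxy for teacher quality",
    data_quality="noisy annual student test scores",
    purported_vs_actual="teacher feedback vs. tenure and firing decisions",
    evaluation_method="none",
    gaming_potential="teach to the test",
    reach="a majority of U.S. states",
)
print(vam.evaluation_method)  # the red flag
```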

In the case of VAM, it doesn’t have an evaluation method. There’s been no way for teachers to know if the model that they get scored on every year is doing a good job, even as it’s become more and more important in tenure decisions (the Chicago strike was largely related to this issue, as I posted here).

Here was my plea to the mathematical audience: this is being done in the name of mathematics. The authority that math is given by our culture, which is enormous and possibly not deserved, is being manipulated by people with vested interests.

So when the objects of modeling, the people and the teachers who get these scores, ask how those scores were derived, they’re often told “it’s math and you wouldn’t understand it.”

That’s outrageous, and mathematicians shouldn’t stand for it. We have to get more involved, as a community, with how mathematics is wielded on the population.

On the other hand, I wouldn’t want mathematicians as a group to get co-opted by these special interest groups either and become shills for the industry. We don’t want to become economists, paid by this campaign or that to write papers in favor of their political goals.

To this end, someone in the audience suggested the AMS might want to publish a book of ethics for mathematicians, akin to the ethical guidelines published for psychologists and lawyers. His idea is that it would be case-study based, which seems pretty standard. I want to give this some more thought.

We want to make ourselves available to understand high impact, public facing models to ensure they are sound mathematically, have reasonable and transparent evaluation methods, and are very high quality in terms of proven accuracy and understandability if they are used on people in high stakes situations like tenure.

One suggestion someone in the audience came up with is to have a mathematician “mechanical turk” service where people could send questions to a group of faceless mathematicians. Although I think it’s an intriguing idea, I’m not sure it would work here. The point is to investigate so-called math models that people would rather no mathematician laid their eyes on, whereas mechanical turks only answer questions someone else comes up with.

In other words, there’s a reason nobody has asked the opinion of the mathematical community on VAM. They are using the authority of mathematics without permission.

Instead, I think the math community should form something like a panel, maybe housed inside the American Mathematical Society (AMS), that trolls for models with the following characteristics:

  • high impact – people care about these scores for whatever reason
  • large reach – city-wide or national
  • claiming to be mathematical – so the opinion of the mathematical community matters, or should.

After finding such a model, the panel should publish a thoughtful, third-party analysis of its underlying mathematical soundness. Even just one per year would have a meaningful effect if the models were chosen well.

As I said to someone in the audience (which was amazingly receptive and open to my message), it really wouldn’t take very long for a mathematician to understand these models well enough to have an opinion on them, especially if you compare it to how long it would take a policy maker to understand the math. Maybe a week, with the guidance of someone who is an expert in modeling.

So in other words, being a member of such a “public math models” panel could be seen as a community service job akin to being an editor for a journal: real work but not something that takes over your life.

Now’s the time to do this, considering the explosion of models on everything in sight, and I believe mathematicians are the right people to take it on, considering they know how to admit they’re wrong.

Tell me what you think.

What is a model?

September 28, 2012

I’ve been thinking a lot recently about mathematical models and how to explain them to people who aren’t mathematicians or statisticians. I consider this increasingly important as more and more models are controlling our lives, such as:

  • employment models, which help large employers screen through applications,
  • political ad models, which allow political groups to personalize their ads,
  • credit scoring models, which allow consumer product companies and loan companies to screen applicants, and,
  • if you’re a teacher, the Value-Added Model.
  • See more models here and here.

It’s a big job, to explain these, because the truth is they are complicated – sometimes overly so, sometimes by construction.

The truth is, though, you don’t really need to be a mathematician to know what a model is, because everyone uses internal models all the time to make decisions.

For example, you intuitively model everyone’s appetite when you cook a meal for your family. You know that one person loves chicken (but hates hamburgers), while someone else will only eat the pasta (with extra cheese). You even take into account that people’s appetites vary from day to day, so you can’t be totally precise in preparing something – there’s a standard error involved.

To explain modeling at this level, then, you just need to imagine that you’ve built a machine that knows all the facts that you do and knows how to assemble them together to make a meal that will approximately feed your family. If you think about it, you’ll realize that you know a shit ton of information about the likes and dislikes of all of your family members, because you have so many memories of them grabbing seconds of the asparagus or avoiding the string beans.

In other words, it would actually be incredibly hard to give a machine enough information about all the food preferences of all your family members, and yourself, along with constraints like not serving too much junky food while making sure everyone has something they like, etc.

So what would you do instead? You’d probably give the machine broad categories of likes and dislikes: this one likes meat, this one likes bread and pasta, this one always drinks lots of milk and puts nutella on everything in sight. You’d dumb it down for the sake of time, in other words. The end product, the meal, may not be perfect but it’s better than no guidance at all.
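The dumbed-down machine might look something like this. Everything here is invented for the example, and notice what's lost: the nutella rule doesn't fit into the broad categories at all.

```python
# A deliberately crude family-meal model: broad like/dislike categories
# only. Idiosyncratic rules ("puts nutella on everything") don't fit.
preferences = {
    "alice": {"meat": 1, "pasta": 0, "vegetables": 0},
    "bob":   {"meat": 0, "pasta": 1, "vegetables": 0},
    "carol": {"meat": 1, "pasta": 0, "vegetables": 1},
}

def pick_meal(prefs, dishes):
    # Score each dish by how many family members like it; serve the best.
    scores = {d: sum(p.get(d, 0) for p in prefs.values()) for d in dishes}
    return max(scores, key=scores.get)

meal = pick_meal(preferences, ["meat", "pasta", "vegetables"])
print(meal)
```

The result is better than no guidance at all, but it will never notice that the meal it picked is the one someone drowns in nutella.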

That’s getting closer to what real-world modeling for people is like. And the conclusion is right too- you aren’t expecting your model to do a perfect job, because you only have a broad outline of the true underlying facts of the situation.

Plus, when you’re modeling people, you have to a priori choose the questions to ask, which will probably come in the form of “does he/she like meat?” instead of “does he/she put nutella on everything in sight?”; in other words, the important but idiosyncratic rules won’t even be seen by a generic one-size-fits-everything model.

Finally, those generic models are hugely scaled: sometimes there's really only one out there, being used everywhere, and its flaws are compounded that many times over because of its reach.

So, say you’ve got a CV with a spelling error. You’re trying to get a job, but the software that screens for applicants automatically rejects you because of this spelling error. Moreover, the same screening model is used everywhere, and you therefore don’t get any interviews because of this one spelling error, in spite of the fact that you’re otherwise qualified.

I’m not saying this would happen – I don’t know how those models actually work, although I do expect points against you for spelling errors. My point is there’s some real danger in using such models on a very large scale that we know are simplified versions of reality.

One last thing. The model fails in the example above, because the qualified person doesn’t get a job. But it fails invisibly; nobody knows exactly how it failed or even that it failed. Moreover, it only really fails for the applicant who doesn’t get any interviews. For the employer, as long as some qualified applicants survive the model, they don’t see failure at all.

Why are the Chicago public school teachers on strike?

September 14, 2012

The issues of pay and testing

My friend and fellow HCSSiM 2012 staff member P.J. Karafiol explains some important issues in a Chicago Sun Times column entitled “Hard facts behind union, board dispute.”

P.J. is a Chicago public school math teacher, has two kids in the CPS system, and is a graduate of that system. So I think he is qualified to speak on the issues.

He first explains that CPS teachers are paid less than those in the suburbs, which means, among other things, that it's hard to keep good teachers. Next, he explains that, although it is difficult to argue against merit pay, the value-added model that Rahm Emanuel wants to count for half of teachers' evaluations is deeply flawed.

He then points out that, even if you trust the models, the number of teachers the model purports to identify as bad is so high that taking action on that result by firing them all would cause a huge problem – there’s a certain natural rate of finding and hiring good replacement teachers in the best of times, and these are not the best of times.

He concludes with this:

Teachers in Chicago are paid well initially, but face rising financial incentives to move to the suburbs as they gain experience and proficiency. No currently-existing “value added” evaluation system yields consistent, fair, educationally sound results. And firing bad teachers won’t magically create better ones to take their jobs.

To make progress on these issues, we have to figure out a way to make teaching in the city economically viable over the long-term; to evaluate teachers in a way that is consistent and reasonable, and that makes good sense educationally; and to help struggling teachers improve their practice. Because at base, we all want the same thing: classes full of students eager to be learning from their excellent, passionate teachers.

Test anxiety

Ultimately this crappy model, and the power that it wields, creates a culture of test anxiety for teachers and principals as well as for students. As Eric Zorn (grandson of mathematician Max Zorn) writes in the Chicago Tribune (h/t P.J. Karafiol):

The question: But why are so many presumptively good teachers also afraid? Why has the role of testing in teacher evaluations been a major sticking point in the public schools strike in Chicago?

The short answer: Because student test scores provide unreliable and erratic measurements of teacher quality. Because studies show that from subject to subject and from year to year, the same teacher can look alternately like a golden apple and a rotting fig.

Zorn quotes extensively from Math for America President John Ewing’s article in Notices of the American Mathematical Society:

Analyses of (value-added model) results have led researchers to doubt whether the methodology can accurately identify more and less effective teachers. (Value-added model) estimates have proven to be unstable across statistical models, years and classes that teachers teach.

One study found that across five large urban districts, among teachers who were ranked in the top 20 percent of effectiveness in the first year, fewer than a third were in that top group the next year, and another third moved all the way down to the bottom 40 percent.

Another found that teachers’ effectiveness ratings in one year could only predict from 4 percent to 16 percent of the variation in such ratings in the following year.

The politics behind the test

I agree that the value-added model (VAM) is deeply flawed; I’ve blogged about it multiple times, for example here.

The way I see it, VAM is a prime example of the way that mathematics is used as a weapon against normal people – in this case, teachers, principals, and schools. If you don’t see my logic, ask yourself this:

Why would an overly complex, unproven, and very crappy model be so protected by politicians?

There’s really one reason, namely it serves a political function, not a mathematical one. And that political function is to maintain control over the union via a magical box that nobody completely understands (including the politicians, but it serves their purposes in spite of this) and therefore nobody can argue against.

This might seem ridiculous when you have examples like this one from the Washington Post (h/t Chris Wiggins), in which a devoted and beloved math teacher named Ashley received a ludicrously low VAM score.

I really like the article: it was written by Sean C. Feeney, Ashley’s principal at The Wheatley School in New York State and president of the Nassau County High School Principals’ Association. Feeney really tries to understand how the model works and how it uses data.

Feeney uncovers the crucial facts that, on the one hand nobody understands how VAM works at all, and that, on the other, the real reason it’s being used is for the political games being played behind the scenes (emphasis mine):

Officials at our State Education Department have certainly spent countless hours putting together guides explaining the scores. These documents describe what they call an objective teacher evaluation process that is based on student test scores, takes into account students’ prior performance, and arrives at a score that is able to measure teacher effectiveness. Along the way, the guides are careful to walk the reader through their explanations of Student Growth Percentiles (SGPs) and a teacher’s Mean Growth Percentile (MGP), impressing the reader with discussions and charts of confidence ranges and the need to be transparent about the data. It all seems so thoughtful and convincing! After all, how could such numbers fail to paint an accurate picture of a teacher’s effectiveness?

(One of the more audacious claims of this document is that the development of this evaluative model is the result of the collaborative efforts of the Regents Task Force on Teacher and Principal Effectiveness. Those of us who know people who served on this committee are well aware that the recommendations of the committee were either rejected or ignored by State Education officials.)

Feeney wasn’t supposed to do this. He wasn’t supposed to assume he was smart enough to understand the math behind the model. He wasn’t supposed to realize that these so-called “guides to explain the scores” actually represent the smoke being blown into the eyes of educators for the purposes of dismembering what’s left of the power of teachers’ unions in this country.

If he were better behaved, he would have bowed to the authority of the inscrutable, i.e. mathematics, and assumed that his prize math teacher must have had flaws that he, as her principal, just hadn't seen before.

Weapons of Math Destruction

Politicians have created a WMD (Weapon of Math Destruction) in VAM; it's the equivalent of owning an Uzi factory when you're fighting a war against people with pointy sticks.

It’s not the only WMD out there, but it’s a pretty powerful one, and it’s doing outrageous damage to our educational system.

If you don’t know what I mean by WMD, let me help out: one way to spot a WMD is to look at the name versus the underlying model and take note of discrepancies. VAM is a great example of this:

  • The name “Value-Added Model” makes us think we might learn how much a teacher brings to the class above and beyond, say, rote memorization.
  • In fact, if you look carefully you will see that the model measures nothing of the sort: it measures teaching to the test, and with error bars so enormous that the noise almost completely obliterates even that "teaching to the test" signal.

Nobody wants crappy teachers in the system, but vilifying well-meaning and hard-working professionals and subjecting them to random but high-stakes testing is not the solution, it’s pure old-fashioned scapegoating.

The political goal of the national VAM movement is clear: take control of education and make sure teachers know their place as the servants of the system, with no job security and no respect.

No wonder the Chicago public school teachers are on strike. I would be too.