Search Results

Keyword: ‘value added model’

Eugene Stern: How Value Added Models are Like Turds

This is a guest post by Eugene Stern, originally posted on his blog


“Why am I surrounded by statistical illiterates?” — Roger Mexico in Gravity’s Rainbow

Oops, they did it again. This weekend, the New York Times put out this profile of William Sanders, the originator of evaluating teachers using value-added models based on student standardized test results. It is statistically illiterate, uses math to mislead and intimidate, and is utterly infuriating.

Here’s the worst part:

When he began calculating value-added scores en masse, he immediately saw that the ratings fell into a “normal” distribution, or bell curve. A small number of teachers had unusually bad results, a small number had unusually good results, and most were somewhere in the middle.

And later:

Up until his death, Mr. Sanders never tired of pointing out that none of the critiques refuted the central insight of the value-added bell curve: Some teachers are much better than others, for reasons that conventional measures can’t explain.

The implication here is that value added models have scientific credibility because they look like math — they give you a bell curve, you know. That sounds sort of impressive until you remember that the bell curve is also the world’s most common model of random noise. Which is what value added models happen to be.

Just to replace the Times’s name dropping with some actual math, bell curves are ubiquitous because of the Central Limit Theorem, which says that any variable that depends on many similar-looking but independent factors looks like a bell curve, no matter what the unrelated factors are. For example, the number of heads you get in 100 coin flips. Each single flip is binary, but when you flip a coin over and over, one flip doesn’t affect the next, and out comes a bell curve. Or how about height? It depends on lots of factors: heredity, diet, environment, and so on, and you get a bell curve again. The central limit theorem is wonderful because it helps explain the world: it tells you why you see bell curves everywhere. It also tells you that random fluctuations that don’t mean anything tend to look like bell curves too.

So, just to take another example, if I decided to rate teachers by the size of the turds that come out of their ass, I could wave around a lovely bell-shaped distribution of teacher ratings, sit back, and wait for the Times article about how statistically insightful this is. Because back in the bad old days, we didn’t know how to distinguish between good and bad teachers, but the Turd Size Model™ produces a shiny, mathy-looking distribution — so it must be correct! — and shows us that teacher quality varies for reasons that conventional measures can’t explain.

Or maybe we should just rate news articles based on turd size, so this one could get a Pulitzer.

Categories: Uncategorized

Why Chetty’s Value-Added Model studies leave me unconvinced

Every now and then when I complain about the Value-Added Model (VAM), people send me links to recent papers written Raj Chetty, John Friedman, and Jonah Rockoff like this one entitled Measuring the Impacts of Teachers II: Teacher Value-Added and Student Outcomes in Adulthood or its predecessor Measuring the Impacts of Teachers I: Evaluating Bias in Teacher Value-Added Estimates.

I think I’m supposed to come away impressed, but that’s not what happens. Let me explain.

Their data set for students scores start in 1989, well before the current value-added teaching climate began. That means teachers weren’t teaching to the test like they are now. Therefore saying that the current VAM works because an retrograded VAM worked in 1989 and the 1990’s is like saying I must like blueberry pie now because I used to like pumpkin pie. It’s comparing apples to oranges, or blueberries to pumpkins.

I’m surprised by the fact that the authors don’t seem to make any note of the difference in data quality between pre-VAM and current conditions. They should know all about feedback loops; any modeler should. And there’s nothing like telling teachers they might lose their job to create a mighty strong feedback loop. For that matter, just consider all the cheating scandals in the D.C. area where the stakes were the highest. Now that’s a feedback loop. And by the way, I’ve never said the VAM scores are totally meaningless, but just that they are not precise enough to hold individual teachers accountable. I don’t think Chetty et al address that question.

So we can’t trust old VAM data. But what about recent VAM data? Where’s the evidence that, in this climate of high-stakes testing, this model is anything but random?

If it were a good model, we’d presumably be seeing a comparison of current VAM scores and current other measures of teacher success and how they agree. But we aren’t seeing anything like that. Tell me if I’m wrong, I’ve been looking around and I haven’t seen such comparisons. And I’m sure they’ve been tried, it’s not rocket science to compare VAM scores with other scores.

The lack of such studies reminds me of how we never hear about scientific studies on the results of Weight Watchers. There’s a reason such studies never see the light of day, namely because whenever they do those studies, they decide they’re better off not revealing the results.

And if you’re thinking that it would be hard to know exactly how to rate a teacher’s teaching in a qualitative, trustworthy way, then yes, that’s the point! It’s actually not obvious how to do this, which is the real reason we should never trust a so-called “objective mathematical model” when we can’t even decide on a definition of success. We should have the conversation of what comprises good teaching, and we should involve the teachers in that, and stop relying on old data and mysterious college graduation results 10 years hence. What are current 6th grade teachers even supposed to do about studies like that?

Note I do think educators and education researchers should be talking about these questions. I just don’t think we should punish teachers arbitrarily to have that conversation. We should have a notion of best practices that slowly evolve as we figure out what works in the long-term.

So here’s what I’d love to see, and what would be convincing to me as a statistician. If we see all sorts of qualitative ways of measuring teachers, and see their VAM scores as well, and we could compare them, and make sure they agree with each other and themselves over time. In other words, at the very least we should demand an explanation of how some teachers get totally ridiculous and inconsistent scores from one year to the next and from one VAM to the next, even in the same year.

The way things are now, the scores aren’t sufficiently sound be used for tenure decisions. They are too noisy. And if you don’t believe me, consider that statisticians and some mathematicians agree.

We need some ground truth, people, and some common sense as well. Instead we’re seeing retired education professors pull statistics out of thin air, and it’s an all-out war of supposed mathematical objectivity against the civil servant.

An attempt to FOIL request the source code of the Value-added model

Last November I wrote to the Department of Education to make a FOIL request for the source code for the teacher value-added model (VAM).


To explain why I’d want something like this, I think the VAM model sucks and I’d like to explore the actual source code directly. The white paper I got my hands on is cryptically written (take a look!) and doesn’t explain what the actual sensitivity to inputs are, for example. The best way to get at that is the source code.

Plus, since the New York Times and other news outlets published teacher’s VAM scores after a long battle and a FOIA request (see details about this here), I figured it’s only fair to also publicly release the actual black box which determines those scores.

Indeed without knowledge of what the model consists of, the VAM scoring regime is little more than a secret set of rules, with tremendous power over teachers and the teacher union, and also incorporates outrageous public shaming as described above.

I think teachers deserve better, and I want to illustrate the weaknesses of the model directly on an open models platform.

The FOIL request

Here’s the email I sent to on 11/22/13:

Dear Records Access Officer for the NYC DOE,

I’m looking to get a copy of the source code for the most recent value-added teacher model through a FOIA request. There are various publicly available descriptions of such models, for example here, but I’d like the actual underlying code.

Please tell me if I’ve written to the correct person for this FOIA request, thank you very much.

Cathy O’Neil

Since my FOIL request

In response to my request, on 12/3/13, 1/6/14, and 2/4/14 I got letters saying stuff was taking a long time since my request was so complicated. Then yesterday I got the following response:
Screen Shot 2014-03-07 at 8.49.57 AM

If you follow the link you’ll get another white paper, this time from 2012-2013, which is exactly what I said I didn’t want in my original request.

I wrote back, not that it’s likely to work, and after reminding them of the text of my original request I added the following:

What you sent me is the newer version of the publicly available description of the model, very much like my link above. I specifically asked for the underlying code. That would be in a programming language like python or C++ or java.

Can you to come back to me with the actual code? Or who should I ask?

Thanks very much,

It strikes me as strange that it took them more than 3 months to send me a link to a white paper instead of the source code as I requested. Plus I’m not sure what they mean by “SED” but I’m guessing it means these guys, but I’m not sure of exactly who to send a new FOIL request.

Am I getting the runaround? Any suggestions?

Categories: modeling, statistics

Value-added model doesn’t find bad teachers, causes administrators to cheat

There’ve been a couple of articles in the past few days about teacher Value-Added Testing that have enraged me.

If you haven’t been paying attention, the Value-Added Model (VAM) is now being used in a majority of the states (source: the Economist):

Screen Shot 2013-03-31 at 7.31.53 AM

But it gives out nearly random numbers, as gleaned from looking at the same teachers with two scores (see this previous post). There’s a 24% correlation between the two numbers. Note that some people are awesome with respect to one score and complete shit on the other score:


Final thing you need to know about the model: nobody really understands how it works. It relies on error terms of an error-riddled model. It’s opaque, and no teacher can have their score explained to them in Plain English.

Now, with that background, let’s look into these articles.

First, there’s this New York Times article from yesterday, entitled “Curious Grade for Teachers: Nearly All Pass”. In this article, it describes how teachers are nowadays being judged using a (usually) 50/50 combination of classroom observations and VAM scores. This is different from the past, which was only based on classroom observations.

What they’ve found is that the percentage of teachers found “effective or better” has stayed high in spite of the new system – the numbers are all over the place but typically between 90 and 99 percent of teachers. In other words, the number of teachers that are fingered as truly terrible hasn’t gone up too much. What a fucking disaster, at least according to the NYTimes, which seems to go out of its way to make its readers understand how very much high school teachers suck.

A few things to say about this.

  1. Given that the VAM is nearly a random number generator, this is good news – it means they are not trusting the VAM scores blindly. Of course, it still doesn’t mean that the right teachers are getting fired, since half of the score is random.
  2. Another point the article mentions is that failing teachers are leaving before the reports come out. We don’t actually know how many teachers are affected by these scores.
  3. Anyway, what is the right number of teachers to fire each year, New York Times? And how did you choose that number? Oh wait, you quoted someone from the Brookings Institute: “It would be an unusual profession that at least 5 percent are not deemed ineffective.” Way to explain things so scientifically! It’s refreshing to know exactly how the army of McKinsey alums approach education reform.
  4. The overall article gives us the impression that if we were really going to do our job and “be tough on bad teachers,” then we’d weight the Value-Added Model way more. But instead we’re being pussies. Wonder what would happen if we weren’t pussies?

The second article explained just that. It also came from the New York Times (h/t Suresh Naidu), and it was a the story of a School Chief in Atlanta who took the VAM scores very very seriously.

What happened next? The teachers cheated wildly, changing the answers on their students’ tests. There was a big cover-up, lots of nasty political pressure, and a lot of good people feeling really bad, blah blah blah. But maybe we can take a step back and think about why this might have happened. Can we do that, New York Times? Maybe it had to do with the $500,000 in “performance bonuses” that the School Chief got for such awesome scores?

Let’s face it, this cheating scandal, and others like it (which may never come to light), was not hard to predict (as I explain in this post). In fact, as a predictive modeler, I’d argue that this cheating problem is the easiest thing to predict about the VAM, considering how it’s being used as an opaque mathematical weapon.

The Value Added Teacher Model Sucks

Today I want you to read this post (hat tip Jordan Ellenberg) written by Gary Rubinstein, which is the post I would have written if I’d had time and had known that they released the actual Value-added Model scores to the public in machine readable format here.

If you’re a total lazy-ass and can’t get yourself to click on that link, here’s a sound bite takeaway: a scatter plot of scores for the same teacher, in the same year, teaching the same subject to kids in different grades. So, for example, a teacher might teach math to 6th graders and to 7th graders and get two different scores; how different are those scores? Here’s how different:

Yeah, so basically random. In fact a correlation of 24%. This is an embarrassment, people, and we cannot let this be how we decide whether a teacher gets tenure or how shamed a person gets in a newspaper article.

Just imagine if you got publicly humiliated by a model with that kind of noise which was purportedly evaluating your work, which you had no view into and thus you couldn’t argue against.

I’d love to get a meeting with Bloomberg and show him this scatter plot. I might also ask him why, if his administration is indeed so excited about “transparency,” do they release the scores but not the model itself, and why they refuse to release police reports at all.

How the Value-Added Model sucks

One way people’s trust of mathematics is being abused by crappy models is through the Value-Added Model, or VAM, which is actually a congregation of models introduced nationally to attempt to assess teachers and schools and their influence on the students.

I have a lot to say about the context in which we decide to apply a mathematical model to something like this, but today I’m planning to restrict myself to complaints about the actual model. Some of these complaints are general but some of them are specific to the way the one in New York is set up (still a very large example).

The general idea of a VAM is that teachers are rewarded for bringing up their students’ test scores more than expected, given a bunch of context variables (like their poverty and last year’s test scores).

The very first question one should ask is, how good is the underlying test the kids are taking? This is famously a noisy answer, depending on how much sleep and food the kids got that day, and, with respect to the content, depends more on memory than on deep knowledge. Another way of saying this is that, if a student does a mediocre job on the test, it could be because they are learning badly at their school, or that they didn’t eat breakfast, or it could be that the teachers they have are focusing more on other things like understanding the reasons for the scientific method and creating college-prepared students by focusing on skills of inquiry rather than memorization.

This brings us to the next problem with VAM, which is a general problem with test-score cultures, namely that it is possible to teach to the test, which is to say it’s possible for teachers to chuck out their curriculums and focus their efforts on the students doing well on the test (which in middle school would mean teaching only math and English). This may be an improvement for some classrooms but in general is not.

People’s misunderstanding of this point gets to the underlying problem of skepticism of our teachers’ abilities and goals- can you imagine if, at your job, you were mistrusted so much that everyone thought it would be better if you were just given a series of purely rote tasks to do instead of using your knowledge of how things should be explained or introduced or how people learn? It’s a fact that teachers and schools that don’t teach to the test are being punished for this under the VAM system. And it’s also a fact that really good, smart teachers who would rather be able to use their pedagogical chops in an environment where they are being respected leave public schools to get away from this culture.

Another problem with the New York VAM is the way tenure is set up. The system of tenure is complex in its own right, and I personally have issues with it (and with the system of tenure in general), but in any case here’s the way it works now. New teachers are technically given three years to create a portfolio for tenure- but the VAM results of the third year don’t come back in time, which means the superintendent looking at a given person’s tenure folder only sees two years of scores, and one of them is the first year, where the person was completely inexperienced.

The reason this matters is that, depending on the population of kids that new teacher was dealing with, more or less of the year could have been spent learning how to manage a classroom. This is an effect that overall could be corrected for by a model but there’s no reason to believe was. In other words, the overall effect of teaching to kids who are difficult to manage in a classroom could be incorporated into a model but the steep learning curve of someone’s first year would be much harder to incorporate. Indeed I looked at the VAM technical white paper and didn’t see anything like that (although since the paper was written for the goal of obfuscation that doesn’t prove anything).

For a middle school teacher, the fact that they have only two years of test scores (and one year of experienced scores) going into a tenure decision really matters. Technically the breakdown of weights for their overall performance is supposed to be 20% VAM, 20% school-wide assessment, and 60% “subjective” performance evaluation, as in people coming to their classroom and taking notes. However, the superintendent in charge of looking at the folders has about 300 folders to look at in 2 weeks (an estimate), and it’s much easier to look at test scores than to read pages upon pages of written assessment. So the effective weighting scheme is measurably different, although hard to quantify.

One other unwritten rule: if the school the teacher is at gets a bad grade, then that teacher’s chances of tenure can be zero, even if their assessment is otherwise good. This is more of a political thing than anything else, in that Bloomberg doesn’t want to say that a “bad” school had a bunch of tenures go through. But it means that the 20/20/60 breakdown is false in a second way, and it also means that the “school grade” isn’t an independent assessment of the teachers’ grades- and the teachers get double punished for teaching at a school that has a bad grade.

That brings me to the way schools are graded. Believe it or not the VAM employs a binning system when they correct for poverty, which is measured in terms of the percentage of the student population that gets free school lunches. The bins are typically small ranges of percentages, say 20-25%, but the highest bin is something like 45% and higher. This means that a school with 90% of kids getting free school lunch is expected to perform on tests similarly to a school with half that many kids with unstable and distracting home lives. This penalizes the schools with the poorest populations, and as we saw above penalized the teachers at those schools, by punishing them for when the school gets a bad grade. It’s my opinion that there should never be binning in a serious model, for reasons just like this. There should always be a continuous function that is fit to the data for the sake of “correcting” for a given issue.

Moreover, as a philosophical issue, these are the very schools that the whole testing system was created to help (does anyone remember that testing was originally set up to help identify kids who struggle in order to help them?), but instead we see constant stress on their teachers, failed tenure bids, and the resulting turnover in staff is exactly the opposite of helping.

This brings me to a crucial complaint about VAM and the testing culture, namely that the emphasis put on these tests, which we’ve seen is noisy at best, reduces the quality of life for the teachers and the schools and the students to such an extent that there is no value added by the value added model!

If you need more evidence of this please read this article, which describes the rampant cheating on test in Atlanta, Georgia and which is in my opinion a natural consequence of the stress that tests and VAM put on school systems.

One last thing- a political one. There is idiosyncratic evidence that near elections, students magically do better on tests so that candidates can talk about how great their schools are. With that kind of extra variance added to the system, how can teachers and school be expected to reasonably prepare their curriculums?

Next steps: on top of the above complaints, I’d say the worst part of the VAM is actually that nobody really understands it. It’s not open source so nobody can see how the scores are created, and the training data is also not available, so nobody can argue with the robustness of the model either. It’s not even clear what a measurement of success is, and whether anyone is testing the model for success. And yet the scores are given out each year, with politicians adding their final bias, and teachers and schools are expected to live under this nearly random system that nobody comprehends. Things can and should be better than this. I will talk in another blog post about how they should be improved.

Fairness, accountability, and transparency in big data models

As I wrote about already, last Friday I attended a one day workshop in Montreal called FATML: Fairness, Accountability, and Transparency in Machine Learning. It was part of the NIPS conference for computer science, and there were tons of nerds there, and I mean tons. I wanted to give a report on the day, as well as some observations.

First of all, I am super excited that this workshop happened at all. When I left my job at Intent Media in 2011 with the intention of studying these questions and eventually writing a book about them, they were, as far as I know, on nobody’s else’s radar. Now, thanks to the organizers Solon and Moritz, there are communities of people, coming from law, computer science, and policy circles, coming together to exchange ideas and strategies to tackle the problems. This is what progress feels like!

OK, so on to what the day contained and my copious comments.

Hannah Wallach

Sadly, I missed the first two talks, and an introduction to the day, because of two airplane cancellations (boo American Airlines!). I arrived in the middle of Hannah Wallach’s talk, the abstract of which is located here. Her talk was interesting, and I liked her idea of having social scientists partnered with data scientists and machine learning specialists, but I do want to mention that, although there’s a remarkable history of social scientists working within tech companies – say at Bell Labs and Microsoft and such – we don’t see that in finance at all, nor does it seem poised to happen. So in other words, we certainly can’t count on social scientists to be on hand when important mathematical models are getting ready for production.

Also, I liked Hannah’s three categories of models: predictive, explanatory, and exploratory. Even though I don’t necessarily think that a given model will fall neatly into one category or the other, they still give you a way to think about what we do when we make models. As an example, we think of recommendation models as ultimately predictive, but they are (often) predicated on the ability to understand people’s desires as made up of distinct and consistent dimensions of personality (like when we use PCA or something equivalent). In this sense we are also exploring how to model human desire and consistency. For that matter I guess you could say any model is at its heart an exploration into whether the underlying toy model makes any sense, but that question is dramatically less interesting when you’re using linear regression.

Anupam Datta and Michael Tschantz

Next up Michael Tschantz reported on work with Anupam Datta that they’ve done on Google profiles and Google ads. The started with google’s privacy policy, which I can’t find but which claims you won’t receive ads based on things like your health problems. Starting with a bunch of browsers with no cookies, and thinking of each of them as fake users, they did experiments to see what actually happened both to the ads for those fake users and to the google ad profiles for each of those fake users. They found that, at least sometimes, they did get the “wrong” kind of ad, although whether Google can be blamed or whether the advertiser had broken Google’s rules isn’t clear. Also, they found that fake “women” and “men” (who did not differ by any other variable, including their searches) were offered drastically different ads related to job searches, with men being offered way more ads to get $200K+ jobs, although these were basically coaching sessions for getting good jobs, so again the advertisers could have decided that men are more willing to pay for such coaching.

An issue I enjoyed talking about was brought up in this talk, namely the question of whether such a finding is entirely evanescent or whether we can call it “real.” Since google constantly updates its algorithm, and since ad budgets are coming and going, even the same experiment performed an hour later might have different results. In what sense can we then call any such experiment statistically significant or even persuasive? Also, IRL we don’t have clean browsers, so what happens when we have dirty browsers and we’re logged into gmail and Facebook? By then there are so many variables it’s hard to say what leads to what, but should that make us stop trying?

From my perspective, I’d like to see more research into questions like, of the top 100 advertisers on Google, who saw the majority of the ads? What was the economic, racial, and educational makeup of those users? A similar but different (because of the auction) question would be to reverse-engineer the advertisers’ Google ad targeting methodologies.

Finally, the speakers mentioned a failure on Google’s part of transparency. In your advertising profile, for example, you cannot see (and therefore cannot change) your marriage status, but advertisers can target you based on that variable.

Sorelle Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian

Next up we had Sorelle talk to us about her work with two guys with enormous names. They think about how to make stuff fair, the heart of the question of this workshop.

First, if we included race in, a resume sorting model, we’d probably see negative impact because of historical racism. Even if we removed race but included other attributes correlated with race (say zip code) this effect would remain. And it’s hard to know exactly when we’ve removed the relevant attributes, but one thing these guys did was define that precisely.

Second, say now you have some idea of the categories that are given unfair treatment, what can you do? One thing suggested by Sorelle et al is to first rank people in each category – to assign each person a percentile in their given category – and then to use the “forgetful function” and only consider that percentile. So, if we decided at a math department that we want 40% women graduate students, to achieve this goal with this method we’d independently rank the men and women, and we’d offer enough spots to top women to get our quota and separately we’d offer enough spots to top men to get our quota. Note that, although it comes from a pretty fancy setting, this is essentially affirmative action. That’s not, in my opinion, an argument against it. It’s in fact yet another argument for it: if we know women are systemically undervalued, we have to fight against it somehow, and this seems like the best and simplest approach.

Ed Felten and Josh Kroll

After lunch Ed Felton and Josh Kroll jointly described their work on making algorithms accountable. Basically they suggested a trustworthy and encrypted system of paper trails that would support a given algorithm (doesn’t really matter which) and create verifiable proofs that the algorithm was used faithfully and fairly in a given situation. Of course, we’d really only consider an algorithm to be used “fairly” if the algorithm itself is fair, but putting that aside, this addressed the question of whether the same algorithm was used for everyone, and things like that. In lawyer speak, this is called “procedural fairness.”

So for example, if we thought we could, we might want to turn the algorithm for punishment for drug use through this system, and we might find that the rules are applied differently to different people. This algorithm would catch that kind of problem, at least ideally.

David Robinson and Harlan Yu

Next up we talked to David Robinson and Harlan Yu about their work in Washington D.C. with policy makers and civil rights groups around machine learning and fairness. These two have been active with civil rights group and were an important part of both the Podesta Report, which I blogged about here, and also in drafting the Civil Rights Principles of Big Data.

The question of what policy makers understand and how to communicate with them came up several times in this discussion. We decided that, to combat cherry-picked examples we see in Congressional Subcommittee meetings, we need to have cherry-picked examples of our own to illustrate what can go wrong. That sounds bad, but put it another way: people respond to stories, especially to stories with innocent victims that have been wronged. So we are on the look-out.

Closing panel with Rayid Ghani and Foster Provost

I was on the closing panel with Rayid Ghani and Foster Provost, and we each had a few minutes to speak and then there were lots of questions and fun arguments. To be honest, since I was so in the moment during this panel, and also because I was jonesing for a beer, I can’t remember everything that happened.

As I remember, Foster talked about an algorithm he had created that does its best to “explain” the decisions of a complicated black box algorithm. So in real life our algorithms are really huge and messy and uninterpretable, but this algorithm does its part to add interpretability to the outcomes of that huge black box. The example he gave was to understand why a given person’s Facebook “likes” made a black box algorithm predict they were gay: by displaying, in order of importance, which likes added the most predictive power to the algorithm.

[Aside, can anyone explain to me what happens when such an algorithm comes across a person with very few likes? I’ve never understood this very well. I don’t know about you, but I have never “liked” anything on Facebook except my friends’ posts.]

Rayid talked about his work trying to develop a system for teachers to understand which students were at risk of dropping out, and for that system to be fair, and he discussed the extent to which that system could or should be transparent.

Oh yeah, and that reminds me that, after describing my book, we had a pretty great argument about whether credit scoring models should be open source, and what that would mean, and what feedback loops that would engender, and who would benefit.

Altogether a great day, and a fantastic discussion. Thanks again to Solon and Moritz for their work in organizing it.

Chameleon models

September 29, 2014 13 comments

Here’s an interesting paper I’m reading this morning (hat tip Suresh Naidu) entitled Chameleons: The Misuse of Theoretical Models in Finance and Economics written by Paul Pfleiderer. The paper introduces the useful concept of chameleon models, defined in the following diagram:

Screen Shot 2014-09-29 at 8.46.46 AM


Pfleiderer provides some examples of chameleon models, and also takes on the Milton Friedman argument that we shouldn’t judge a model by its assumptions but rather by its predictions (personally I think this is largely dependent on the way a model is used; the larger the stakes, the more the assumptions matter).

I like the term, and I think I might use it. I also like the point he makes that it’s really about usage. Most models are harmless until they are used as political weapons. Even the value-added teacher model could be used to identify school systems that need support, although in the current climate of distorted data due to teaching to the test and cheating, I think the signal is probably very slight.

Categories: economics, modeling

Cool open-source models?

I’m looking to develop my idea of open models, which I motivated here and started to describe here. I wrote the post in March 2012, but the need for such a platform has only become more obvious.

I’m lucky to be working with a super fantastic python guy on this, and the details are under wraps, but let’s just say it’s exciting.

So I’m looking to showcase a few good models to start with, preferably in python, but the critical ingredient is that they’re open source. They don’t have to be great, because the point is to see their flaws and possible to improve them.

  1. For example, I put in a FOIA request a couple of days ago to get the current teacher value-added model from New York City.
  2. A friends of mine, Marc Joffe, has an open source municipal credit rating model. It’s not in python but I’m hopeful we can work with it anyway.
  3. I’m in search of an open source credit scoring model for individuals. Does anyone know of something like that?
  4. They don’t have to be creepy! How about a Nate Silver – style weather model?
  5. Or something that relies on open government data?
  6. Can we get the Reinhart-Rogoff model?

The idea here is to get the model, not necessarily the data (although even better if it can be attached to data and updated regularly). And once we get a model, we’d build interactives with the model (like this one), or at least the tools to do so, so other people could build them.

At its core, the point of open models is this: you don’t really know what a model does until you can interact with it. You don’t know if a model is robust unless you can fiddle with its parameters and check. And finally, you don’t know if a model is best possible unless you’ve let people try to improve it.

New Jersey at risk of implementing untested VAM-like teacher evaluation model

This is a guest post by Eugene Stern.

A big reason I love this blog is Cathy’s war on crappy models. She has posted multiple times already about the lousy performance of models that rate teachers based on year-to-year changes in student test scores (for example, read about it here). Much of the discussion focuses on the model used in New York City, but such systems have been, or are being, put in place all over the country. I want to let you know about the version now being considered for use across the river, in New Jersey. Once you’ve heard more, I hope you’ll help me try to stop it.

VAM Background

A little background if you haven’t heard about this before. Because it makes no sense to rate teachers based on students’ absolute grades or test scores (not all students start at the same place each year), the models all compare students’ test scores against some baseline. The simplest thing to do is to compare each student’s score on a test given at the end of the school year against their score on a test given at the end of the previous year. Teachers are then rated based on how much their students’ scores improved over the year.

Comparing with the previous year’s score controls for the level at which students start each year, but not for other factors beside the teacher that affect how much they learn. This includes attendance, in-school environment (curriculum, facilities, other students in the class), out-of-school learning (tutoring, enrichment programs, quantity and quality of time spent with parents/caregivers), and potentially much more. Fancier models try to take these into account by comparing each student’s end of year score with a predicted score. The predicted score is based both on the student’s previous score and on factors like those above. Improvement beyond the predicted score is then attributed to the teacher as “value added” (hence the name “value-added models,” or VAM) and turned into a teacher rating in some way, often using percentiles. One such model is used to rate teachers in New York City.

It’s important to understand that there is no single value-added model, rather a family of them, and that the devil is in the details. Two different teacher rating systems, based on two models of the predicted score, may perform very differently – both across the board, and in specific locations. Different factors may be more or less important depending on where you are. For example, income differences may matter more in a district that provides few basic services, so parents have to pay to get extracurriculars for their kids. And of course the test itself matters hugely as well.

Testing the VAM models

Teacher rating models based on standardized tests have been around for 25 years or so, but two things have happened in the last decade:

  1. Some people started to use the models in formal teacher evaluation, including tenure decisions.
  2. Some (other) people started to test the models.

This did not happen in the order that one would normally like. Wanting to make “data-driven decisions,” many cities and states decided to start rating teachers based on “data” before collecting any data to validate whether that “data” was any good. This is a bit like building a theoretical model of how cancer cells behave, synthesizing a cancer drug in the lab based on the model, distributing that drug widely without any trials, then waiting around to see how many people die from the side effects.

The full body count isn’t in yet, but the models don’t appear to be doing well so far. To look at some analysis of VAM data in New York City, start here and here. Note: this analysis was not done by the city but by individuals who downloaded the data after the city had to make it available because of disclosure laws.

I’m not aware of any study on the validity of NYC’s VAM ratings done by anyone actually affiliated with the city – if you know of any, please tell me. Again, the people preaching data don’t seem willing to actually use data to evaluate the quality of the systems they’re putting in place.

Assuming you have more respect for data than the mucky-mucks, let’s talk about how well the models actually do. Broadly, two ways a model can fail are being biased and being noisy. The point of the fancier value-added models is to try to eliminate bias by factoring in everything other than the teacher that might affect a student’s test score. The trouble is that any serious attempt to do this introduces a bunch of noise into the model, to the degree that the ratings coming out look almost random.

You’d think that a teacher doesn’t go from awful to great or vice versa in one year, but the NYC VAM ratings show next to no correlation in a teacher’s rating from one year to the next. You’d think that a teacher either teaches math well or doesn’t, but the NYC VAM ratings show next to no correlation in a teacher’s rating teaching a subject to one grade and their rating teaching it to another – in the very same year!  (Gary Rubinstein’s blog, linked above, documents these examples, and a number of others.)  Again, this is one particular implementation of a general class of models, but using such noisy data to make significant decisions about teachers’ careers seems nuts.

What’s happening in New Jersey

With all this as background, let’s turn to what’s happening in New Jersey.

You may be surprised that the version of the model proposed by Chris Christie‘s administration (the education commissioner is Christie appointee Chris Cerf, who helped put VAM in place in NYC) is about the simplest possible. There is no attempt to factor out bias by trying to model predicted scores, just a straight comparison between this year’s standardized test score and last year’s.  For an overview, see this.

In more detail, the model groups together all students with the same score on last year’s test, and represents each student’s progress by their score on this year’s test, viewed as a percentile across this group. That’s it. A fancier version uses percentiles calculated across all students with the same score in each of the last several years. These can’t be calculated explicitly (you may not find enough students that got exactly the same score each the last few years), so they are estimated, using a statistical technique called quantile regression.

By design, both the simple and the fancy version ignore everything about a student except their test scores. As a modeler, or just as a human being, you might find it silly not to distinguish between a fourth grader in a wealthy suburb who scored 600 on a standardized test from a fourth grader in the projects with the same score. At least, I don’t know where to find a modeler who doesn’t find it silly, because nobody has bothered to study the validity of using this model to rate teachers. If I’m wrong, please point me to a study.

Politics and SGP

But here we get into the shell game of politics, where rating teachers based on the model is exactly the proposal that lies at the end of an impressive trail of doubletalk.  Follow the bouncing ball.

These models, we are told, differ fundamentally from VAM (which is now seen as somewhat damaged goods politically, I suspect). While VAM tried to isolate teacher contribution, these models do no such thing – they are simply measuring student progress from year to year, which, after all, is what we truly care about. The models have even been rebranded with a new name: student growth percentiles, or SGP. SGP is sold as just describing student progress rather than attributing it to teachers, there can’t be any harm in that, right? – and nothing that needs validation, either. And because SGP is such a clean methodology – if you’re looking for a data-driven model to use for broad “educational assessment,” don’t get yourself into that whole VAM morass, use SGP instead!

Only before you know it, educational assessment turns into, you guessed it, rating teachers. That’s right: because these models aren’t built to rate teachers, they can focus on the things that really matter (student progress), and thus end up being – wait for it – much better for rating teachers! War is peace, friends. Ignorance is strength.

Creators of SGP

You can find a good discussion of SGP’s and their use in evaluation here, and a lot more from the same author, the impressively prolific Bruce Baker, here.  Here’s a response from the creators of SGP. They maintain that information about student growth is useful (duh), and agree that differences in SGP’s should not be attributed to teachers (emphasis mine):

Large-scale assessment results are an important piece of evidence but are not sufficient to make causal claims about school or teacher quality.

SGP and teacher evaluations

But guess what?

The New Jersey Board of Ed and state education commissioner Cerf are putting in place a new teacher evaluation code, to be used this coming academic year and beyond. You can find more details here and here.

Summarizing: for math and English teachers in grades 4-8, 30% of their annual evaluation next year would be mandated by the state to come from those very same SGP’s that, according to their creators, are not sufficient to make causal claims about teacher quality. These evaluations are the primary input in tenure decisions, and can also be used to take away tenure from teachers who receive low ratings.

The proposal is not final, but is fairly far along in the regulatory approval process, and would become final in the next several months. In a recent step in the approval process, the weight given to SGP’s in the overall evaluation was reduced by 5%, from 35%. However, the 30% weight applies next year only, and in the future the state could increase the weight to as high as 50%, at its discretion.

Modeler’s Notes

Modeler’s Note #1: the precise weight doesn’t really matter. If the SGP scores vary a lot, and the other components don’t vary very much, SGP scores will drive the evaluation no matter what their weight.

Modeler’s Note #2: just reminding you again that this data-driven framework for teacher evaluation is being put in place without any data-driven evaluation of its effectiveness. And that this is a feature, not a bug – SGP has not been tested as an attribution tool because we keep hearing that it’s not meant to be one.

In a slightly ironic twist, commissioner Cerf has responded to criticisms that SGP hasn’t been tested by pointing to a Gates Foundation study of the effectiveness of… value-added models.  The study is here.  It draws pretty positive conclusions about how well VAM’s work.  A number of critics have argued, pretty effectively, that the conclusions are unsupported by the data underlying the study, and that the data actually shows that VAM’s work badly.  For a sample, see this.  For another example of a VAM-positive study that doesn’t seem to stand up to scrutiny, see this and this.

Modeler’s Role Play #1

Say you were the modeler who had popularized SGP’s.  You’ve said that the framework isn’t meant to make causal claims, then you see New Jersey (and other states too, I believe) putting a teaching evaluation model in place that uses SGP to make causal claims, without testing it first in any way. What would you do?

So far, the SGP mavens who told us that “Large-scale assessment results are an important piece of evidence but are not sufficient to make causal claims about school or teacher quality” remain silent about the New Jersey initiative, as far as I know.

Modeler’s Role Play #2

Now you’re you again, and you’ve never heard about SGP’s and New Jersey’s new teacher evaluation code until today.  What do you do?

I want you to help me stop this thing.  It’s not in place yet, and I hope there’s still time.

I don’t think we can convince the state education department on the merits.  They’ve made the call that the new evaluation system is better than the current one or any alternatives they can think of, they’re invested in that decision, and we won’t change their minds directly.  But we can make it easier for them to say no than to say yes.  They can be influenced – by local school administrators, state politicians,  the national education community, activists, you tell me who else.  And many of those people will have more open minds.  If I tell you, and you tell the right people, and they tell the right people, the chain gets to the decision makers eventually.

I don’t think I could convince Chris Christie, but maybe I could convince Bruce Springsteen if I met him, and maybe Bruce Springsteen could convince Chris Christie.


I thought we could start with a manifesto – a direct statement from the modeling community explaining why this sucks. Directed at people who can influence the politics, and signed by enough experts (let’s get some big names in there) to carry some weight with those influencers.

Can you help? Help write it, sign it, help get other people to sign it, help get it to the right audience. Know someone whose opinion matters in New Jersey? Then let me know, and help spread the word to them. Use Facebook and Twitter if it’ll help. And don’t forget good old email, phone calls, and lunches with friends.

Or, do you have a better idea? Then put it down. Here. The comments section is wide open. Let’s not fall back on criticizing the politicians for being dumb after the fact.  Let’s do everything we can to keep them from doing this dumb thing in the first place.

Shame on us if we can’t make this right.

Modeling in Plain English

I’ve been enjoying my new job at Johnson Research Labs, where I spend a majority of the time editing my book with my co-author Rachel Schutt. It’s called Doing Data Science (now available for pre-purchase at Amazon), and it’s based on these notes I took last semester at Rachel’s Columbia class.

Recently I’ve been working on Brian Dalessandro‘s chapter on logistic regression. Before getting into the brass tacks of that algorithm, which is especially useful when you are trying to predict a binary outcome (i.e. a 0 or 1 outcome like “will click on this ad”), Brian discusses some common constraints to models.

The one that’s particularly interesting to me is what he calls “interpretability”. His example of an interpretability constraint is really good: it turns out that credit card companies have to be able to explain to people why they’ve been rejected. Brain and I tracked down the rule to this FTC website, which explains the rights of consumers who own credit cards. Here’s an excerpt where I’ve emphasized the key sentences:

You Also Have The Right To…

  • Have credit in your birth name (Mary Smith), your first and your spouse’s last name (Mary Jones), or your first name and a combined last name (Mary Smith Jones).
  • Get credit without a cosigner, if you meet the creditor’s standards.
  • Have a cosigner other than your spouse, if one is necessary.
  • Keep your own accounts after you change your name, marital status, reach a certain age, or retire, unless the creditor has evidence that you’re not willing or able to pay.
  • Know whether your application was accepted or rejected within 30 days of filing a complete application.
  • Know why your application was rejected. The creditor must tell you the specific reason for the rejection or that you are entitled to learn the reason if you ask within 60 days. An acceptable reason might be: “your income was too low” or “you haven’t been employed long enough.” An unacceptable reason might be “you didn’t meet our minimum standards.” That information isn’t specific enough.
  • Learn the specific reason you were offered less favorable terms than you applied for, but only if you reject these terms. For example, if the lender offers you a smaller loan or a higher interest rate, and you don’t accept the offer, you have the right to know why those terms were offered.
  • Find out why your account was closed or why the terms of the account were made less favorable, unless the account was inactive or you failed to make payments as agreed.

The result of this rule is that credit card companies must use simple models, probably decision trees, to make their rejection decisions.

It’s a new way to think about modeling choice, to be sure. It doesn’t necessarily make for “better” decisions from the point of view of the credit card company: random forests, a generalization of decision trees, are known to be more accurate, but are arbitrarily more complicated to explain.

So it matters what you’re optimizing for, and in this case the regulators have decided we’re optimizing for interpretability rather than accuracy. I think this is appropriate, given that consumers are at the mercy of these decisions and relatively powerless to act against them (although the FTC site above gives plenty of advice to people who have been rejected, mostly about how to raise their credit scores).

Three points to make about this. First, I’m reading the Bankers New Clothes, written by Anat Admati and Martin Hellwig (h/t Josh Snodgrass), which is absolutely excellent – I’m planning to write up a review soon. One thing they explain very clearly is the cost of regulation (specifically, higher capital requirements) from the bank’s perspective versus from the taxpayer’s perspective, and how it genuinely seems “expensive” to a bank but is actually cost-saving to the general public. I think the same thing could be said above for the credit card interpretability rule.

Second, it makes me wonder what else one could regulate in terms of plain english modeling. For example, what would happen if we added that requirement to, say, the teacher value-added model? Would we get much-needed feedback to teachers like, “You don’t have enough student participation”? Oh wait, no. The model only looks at student test scores, so would only be able to give the following kind of feedback: “You didn’t raise scores enough. Teach to the test more.”

In other words, what I like about the “Modeling in Plain English” idea is that you have to be able to first express and second back up your reasons for making decisions. It may not lead to ideal accuracy on the part of the modeler but it will lead to much greater clarity on the part of the modeled. And we could do with a bit more clarity.

Finally, what about online loans? Do they have any such interpretability rule? I doubt it. In fact, if I’m not wrong, they can use any information they can scrounge up about someone to decide on who gets a loan, and they don’t have to reveal their decision-making process to anyone. That seems unreasonable to me.

Categories: data science, modeling, rant

Should the U.S. News & World Reports college ranking model be open source?

I had a great time giving my “Weapons of Math Destruction” talk in San Diego, and the audience was fantastic and thoughtful.

One question that someone asked was whether the US News & World Reports college ranking model should be forced to be open sourced – wouldn’t that just cause colleges to game the model?

First of all, colleges are already widely gaming the model and have been for some time. And that gaming is a distraction and has been heading colleges in directions away from good instruction, which is a shame.

And if you suggest that they change the model all the time to prevent this, then you’ve got an internal model of this model that needs adjustment. They might be tinkering at the edges but overall it’s quite clear what’s going into the model: namely, graduation rates, SAT scores, number of Ph.D’s on staff, and so on. The exact percentages change over time but not by much.

The impact that this model has had on education and how universities apportion resources has been profound. Academic papers have been written on the law school version of this story.

Moreover, the tactics that US News & World Reports uses to enforce their dominance of the market are bullying, as you can learn from the President of Reed College, which refuses to be involved.

Back to the question. Just as I realize that opening up all data is not reasonable or desirable, because first of all there are serious privacy issues but second of all certain groups have natural advantages to openly shared resources, it is also true that opening up all models is similarly problematic.

However, certain data should surely be open: for example, the laws of our country, that we are all responsible to know, should be freely available to us (something that Aaron Swartz understood and worked towards). How can we be held responsible for laws we can’t read?

Similarly, public-facing models, such as credit scoring models and teacher value-added models, should absolutely be open and accessible to the public. If I’m being judged and measured and held accountable by some model in my daily life as a citizen, that has real impact on how my future will unfold, then I should know how that process works.

And if you complain about the potential gaming of those public-facing models, I’d answer: if they are gameable then they shouldn’t be used, considering the impact they have on so many people’s lives. Because a gameable model is a weak model, with proxies that fail.

Another way to say this is we should want someone to “game” the credit score model if it means they pay their bills on time every month (I wrote about this here).

Back to the US News & World Report model. Is it public facing? I’m no lawyer but I think a case can be made that it is, and that the public’s trust in this model makes it a very important model indeed. Evidence can be gathered by measuring  the extent to which colleges game the model, which they only do because the public cares so much about the rankings.

Even so, what difference would that make, to open it up?

In an ideal world, where the public is somewhat savvy about what models can and cannot do, opening up the US News & World Reports college ranking model would result in people losing faith in it. They’d realize that it’s no more valuable than an opinion from a highly vocal uncle of theirs who is obsessed with certain metrics and blind to individual eccentricities and curriculums that may be a perfect match for a non-conformist student. It’s only one opinion among many, and not to be religiously believed.

But this isn’t an ideal world, and we have a lot of work to do to get people to understand models as opinions in this sense, and to get people to stop trusting them just because they’re mathematically presented.

An AMS panel to examine public math models?

On Saturday I gave a talk at the AGNES conference to a room full of algebraic geometers.  After introducing myself and putting some context around my talk, I focused on a few models:

  • VaR,
  • VAM,
  • Credit scoring,
  • E-scores (online version of credit scores), and
  • The h-score model (I threw this in for the math people and because it’s an egregious example of a gameable model).

I wanted to formalize the important and salient properties of a model, and I came up with this list:

  • Name – note the name often gives off a whiff of political manipulation by itself
  • Underlying model – regression? decision tree?
  • Underlying assumptions – normal distribution of market returns?
  • Input/output – dirty data?
  • Purported/political goal – how is it actually used vs. how its advocates claim they’ll use it?
  • Evaluation method – every model should come with one. Not every model does. A red flag.
  • Gaming potential – how does being modeled cause people to act differently?
  • Reach – how universal and impactful is the model and its gaming?

In the case of VAM, it doesn’t have an evaluation method. There’s been no way for teachers to know if the model that they get scored on every year is doing a good job, even as it’s become more and more important in tenure decisions (the Chicago strike was largely related to this issue, as I posted here).

Here was my plea to the mathematical audience: this is being done in the name of mathematics. The authority that math is given by our culture, which is enormous and possibly not deserved, is being manipulated by people with vested interests.

So when the objects of modeling, the people and the teachers who get these scores, ask how those scores were derived, they’re often told “it’s math and you wouldn’t understand it.”

That’s outrageous, and mathematicians shouldn’t stand for it. We have to get more involved, as a community, with how mathematics is wielded on the population.

On the other hand, I wouldn’t want mathematicians as a group to get co-opted by these special interest groups either and become shills for the industry. We don’t want to become economists, paid by this campaign or that to write papers in favor of their political goals.

To this end, someone in the audience suggested the AMS might want to publish a book of ethics for mathematicians akin to the ethical guidelines that are published for the society of pyschologists and lawyers. His idea is that it would be case-study based, which seems pretty standard. I want to give this some more thought.

We want to make ourselves available to understand high impact, public facing models to ensure they are sound mathematically, have reasonable and transparent evaluation methods, and are very high quality in terms of proven accuracy and understandability if they are used on people in high stakes situations like tenure.

One suggestion someone in the audience came up with is to have a mathematician “mechanical turk” service where people could send questions to a group of faceless mathematicians. Although I think it’s an intriguing idea, I’m not sure it would work here. The point is to investigate so-called math models that people would rather no mathematician laid their eyes on, whereas mechanical turks only answer questions someone else comes up with.

In other words, there’s a reason nobody has asked the opinion of the mathematical community on VAM. They are using the authority of mathematics without permission.

Instead, I think the math community should form something like a panel, maybe housed inside the American Mathematical Society (AMS), that trolls for models with the following characteristics:

  • high impact – people care about these scores for whatever reason
  • large reach – city-wide or national
  • claiming to be mathematical – so the opinion of the mathematical community matters, or should,

After finding such a model, the panel should publish a thoughtful, third-party analysis of its underlying mathematical soundness. Even just one per year would have a meaningful effect if the models were chosen well.

As I said to someone in the audience (which was amazingly receptive and open to my message), it really wouldn’t take very long for a mathematician to understand these models well enough to have an opinion on them, especially if you compare it to how long it would take a policy maker to understand the math. Maybe a week, with the guidance of someone who is an expert in modeling.

So in other words, being a member of such a “public math models” panel could be seen as a community service job akin to being an editor for a journal: real work but not something that takes over your life.

Now’s the time to do this, considering the explosion of models on everything in sight, and I believe mathematicians are the right people to take it on, considering they know how to admit they’re wrong.

Tell me what you think.

What is a model?

September 28, 2012 9 comments

I’ve been thinking a lot recently about mathematical models and how to explain them to people who aren’t mathematicians or statisticians. I consider this increasingly important as more and more models are controlling our lives, such as:

  • employment models, which help large employers screen through applications,
  • political ad models, which allow political groups to personalize their ads,
  • credit scoring models, which allow consumer product companies and loan companies to screen applicants, and,
  • if you’re a teacher, the Value-Added Model.
  • See more models here and here.

It’s a big job, to explain these, because the truth is they are complicated – sometimes overly so, sometimes by construction.

The truth is, though, you don’t really need to be a mathematician to know what a model is, because everyone uses internal models all the time to make decisions.

For example, you intuitively model everyone’s appetite when you cook a meal for your family. You know that one person loves chicken (but hates hamburgers), while someone else will only eat the pasta (with extra cheese). You even take into account that people’s appetites vary from day to day, so you can’t be totally precise in preparing something – there’s a standard error involved.

To explain modeling at this level, then, you just need to imagine that you’ve built a machine that knows all the facts that you do and knows how to assemble them together to make a meal that will approximately feed your family. If you think about it, you’ll realize that you know a shit ton of information about the likes and dislikes of all of your family members, because you have so many memories of them grabbing seconds of the asparagus or avoiding the string beans.

In other words, it would be actually incredibly hard to give a machine enough information about all the food preferences for all your family members, and yourself, along with the constraints of having not too much junky food, but making sure everyone had something they liked, etc. etc.

So what would you do instead? You’d probably give the machine broad categories of likes and dislikes: this one likes meat, this one likes bread and pasta, this one always drinks lots of milk and puts nutella on everything in sight. You’d dumb it down for the sake of time, in other words. The end product, the meal, may not be perfect but it’s better than no guidance at all.

That’s getting closer to what real-world modeling for people is like. And the conclusion is right too- you aren’t expecting your model to do a perfect job, because you only have a broad outline of the true underlying facts of the situation.

Plus, when you’re modeling people, you have to a priori choose the questions to ask, which will probably come in the form of “does he/she like meat?” instead of “does he/she put nutella on everything in sight?”; in other words, the important but idiosyncratic rules won’t even be seen by a generic one-size-fits-everything model.

Finally, those generic models are hugely scaled- sometimes there’s really only one out there, being used everywhere, and its flaws are compounded that many times over because of its reach.

So, say you’ve got a CV with a spelling error. You’re trying to get a job, but the software that screens for applicants automatically rejects you because of this spelling error. Moreover, the same screening model is used everywhere, and you therefore don’t get any interviews because of this one spelling error, in spite of the fact that you’re otherwise qualified.

I’m not saying this would happen – I don’t know how those models actually work, although I do expect points against you for spelling errors. My point is there’s some real danger in using such models on a very large scale that we know are simplified versions of reality.

One last thing. The model fails in the example above, because the qualified person doesn’t get a job. But it fails invisibly; nobody knows exactly how it failed or even that it failed. Moreover, it only really fails for the applicant who doesn’t get any interviews. For the employer, as long as some qualified applicants survive the model, they don’t see failure at all.

Open Models (part 2)

In my first post about open models, I argued that something needs to be done but I didn’t really say what.

This morning I want to outline how I see an open model platform working, although I won’t be able to resist  mentioning a few more reasons we urgently need this kind of thing to happen.

The idea is for the platform to have easy interfaces both for modelers and for users. I’ll tackle these one at a time.


Say I’m a modeler. I just wrote a paper on something that used a model, and I want to open source my model so that people can see how it works. I go to this open source platform and I click on “new model”. It asks for source code, as well as which version of which open source language (and exactly which packages) it’s written in. I feed it the code.

It then asks for the data and I either upload the data or I give it a url which tells the platform the location of the data. I also need to explain to the platform exactly how to transform the data, if at all, to prepare it for feeding to the model. This may require code as well.

Next, I specify the extent to which the data needs to stay anonymous (hopefully not at all, but sometimes in the case of medical data or something, I need to place security around the data). These anonymity limits will translate into the kinds of visualizations and results that can be requested by users but not the overall model’s aggregated results.

Finally, I specify which parameters in my model were obvious “choices” (like tuning parameters, or prior strengths, or thresholds I chose for cleaning data). This is helpful but not necessary, since other people will be able to come along later and add things. Specifically, they might try out new things like how many signals to use, which ones to use, and how to normalize various signals.

That’s it, I’m done, and just to be sure I “play” the model and make sure that the results jive with my published paper. There’s a suite of visualization tools and metrics of success built into the model platform for me to choose from which emphasize the good news for my model. I’ve created an instance of my model which is available for anyone to take a look at. This alone would be major progress, and the technology already exists for some languages.


Now say I’m a user. First of all, I want to be able to retrain the model and confirm the results, or see a record that this has already been done.

Next, I want to be able to see how the model predicts a given set of input data (that I supply). Specifically, if I’m a teacher and this is the open-sourced value added teacher model, I’d like to see how my score would have varied if I’d had 3 fewer students or they had had free school lunches or if I’d been teaching in a different district. If there were a bunch of different models, I could see what scores my data would have produced in different cities or different years in my city. This is a good start for a robustness test for such models.

If I’m also a modeler, I’d like to be able to play with the model itself. For example, I’d like to tweak the choices that have been made by the original modeler and retrain the model, seeing how different the results are. I’d like to be able to provide new data, or a new url for data, along with instructions for using the data, to see how this model would fare on new training data. Or I’d like to think of this new data as updating the model.

This way I get to confirm the results of the model, but also see how robust the model is under various conditions. If the overall result holds only when you exclude certain outliers and have a specific prior strength, that’s not good news.

I can also change the model more fundamentally. I can make a copy of the model, and add another predictor from the data or from new data, and retrain the model and see how this new model performs. I can change the way the data is normalized. I can visualize the results in an entirely different way. Or whatever.

Depending on the anonymity constraints of the original data, there are things I may not be able to ask as a user. However, most aggregated results should be allowed. Specifically, the final model with its coefficients.


As a user, when I play with a model, there is an anonymous record kept of what I’ve done, which I can choose to put my name on. On the one hand this is useful for users because if I’m a teacher, I can fiddle with my data and see how my score changes under various conditions, and if it changes radically, I have a way of referencing this when I write my op-ed in the New York Times. If I’m a scientist trying to make a specific point about some published result, there’s a way for me to reference my work.

On the other hand this is useful for the original modelers, because if someone comes along and improves my model, then I have a way of seeing how they did it. This is a way to crowdsource modeling.

Note that this is possible even if the data itself is anonymous, because everyone in sight could just be playing with the model itself and only have metadata information.

More on why we need this

First, I really think we need a better credit rating system, and so do some guys in Europe. From the New York Times article (emphasis mine):

Last November, the European Commission proposed laws to regulate the ratings agencies, outlining measures to increase transparency, to reduce the bloc’s dependence on ratings and to tackle conflicts of interest in the sector.

But it’s not just finance that needs this. The entirety of science publishing is in need of more transparent models. From the nature article’s abstract:

Scientific communication relies on evidence that cannot be entirely included in publications, but the rise of computational science has added a new layer of inaccessibility. Although it is now accepted that data should be made available on request, the current regulations regarding the availability of software are inconsistent. We argue that, with some exceptions, anything less than the release of source programs is intolerable for results that depend on computation. The vagaries of hardware, software and natural language will always ensure that exact reproducibility remains uncertain, but withholding code increases the chances that efforts to reproduce results will fail.

Finally, the field of education is going through a revolution, and it’s not all good. Teachers are being humiliated and shamed by weak models, which very few people actually understand. Here’s what the teacher’s union has just put out to prove this point:

Creepy model watch

I really feel like I can’t keep up with all of the creepy models coming out and the news articles about them, so I think I’ll just start making a list. I would appreciate readers adding to my list in the comment section. I think I’ll move this to a separate page on my blog if it comes out nice.

  1. I recently blogged about a model that predicts student success in for-profit institutions, which I claim is really mostly about student debt and default,
  2. but here’s a model which actually goes ahead and predicts default directly, it’s a new payday-like loan model. Oh good, because the old payday models didn’t make enough money or something.
  3. Of course there’s the teacher value-added model which I’ve blogged about multiple times, most recently here. And here’s a paper I’d like everyone to read before they listen to anyone argue one way or the other about the model (h/t Joshua Batson). The abstract is stunning: Recently, educational researchers and practitioners have turned to value-added models to evaluate teacher performance. Although value-added estimates depend on the assessment used to measure student achievement, the importance of outcome selection has received scant attention in the literature. Using data from a large, urban school district, I examine whether value-added estimates from three separate reading achievement tests provide similar answers about teacher performance. I find moderate-sized rank correlations, ranging from 0.15 to 0.58, between the estimates derived from different tests. Although the tests vary to some degree in content, scaling, and sample of students, these factors do not explain the differences in teacher effects. Instead, test timing and measurement error contribute substantially to the instability of value-added estimates across tests. Just in case that didn’t come through, they are saying that the results of the teacher value-added test scores are very very noisy.
  4. That reminds me, credit scoring models are old but very very creepy, wouldn’t you agree? What’s in them that they want to conceal them?
  5. Did you read about how Target predicts pregnancy? Extremely creepy.
  6. I’m actually divided about whether it’s the creepiest though, because I think the sheer enormity of information that Facebook collects about us is the most depressing thing of all.

Before I became a modeler, I wasn’t personally offended by the idea that people could use my information. I thought, I’ve got nothing to hide, and in fact maybe it will make my life easier and more efficient for the machine to know me and my habits.

But here’s how I think now that I’m a modeler and I see how this stuff gets made and I see how it gets applied. That we are each giving up our data, and it’s so easy to do we don’t think about it, and it’s being used to funnel people into success or failure in a feedback loop. And the modelers, the people responsible for creating these things and implementing them, are always already the successes, they are educated and are given good terms on their credit cards and mortgages because they have a nifty high tech job. So the makers get to think of how much easier and more convenient their lives are now that the models see how dependable they are as consumers.

But when there are funnels, there’s always someone who gets funneled down.

Think about how it works with insurance. The idea of insurance is to pool people so that when one person gets sick, the medical costs for that person are paid from the common fund. Everyone pays a bit so it doesn’t break the bank.

But if we have really good information, we begin to see how likely people are to get sick. So we can stratify the pool. Since I almost never get sick, and when I do it’s just strep throat, I get put into a very nice pool with other people who never get sick, and we pay very very little and it works out great for us. But other people have worse luck of the DNA draw and they get put into the “pretty sick” pool and their premium gets bigger as their pool gets sicker until they are really sick and the premium is actually unaffordable. We are left with a system where the people who need insurance the most can’t be part of the system anymore. Too much information ruins the whole idea of insurance and pooled risk.

I think modern modeling is analogous. When people offer deals, they can first check to see if the people they are offering deals are guaranteed to pay back everything. In other words, the businesses (understandably) want to make very certain they are going to profit from each and every customer, and they are getting more and more able to do this. That’s great for customers with perfect credit scores, and it makes it easier for people with perfect credit scores to keep their perfect credit scores, because they are getting the best deals.

But for people with bad credit scores, they get the rottenest deals, which makes a larger and larger percentage of their takehome pay (if they even get a job considering their credit scores) go towards fees and high interest rates. This of course creates an environment in which it’s difficult to improve their credit score- so they default and their credit score gets worse instead of better.

So there you have it, a negative feedback loop and a death spiral of modeling.

Categories: data science

Let them game the model

One of the most common reasons I hear for not letting a model be more transparent is that, if they did that, then people would game the model. I’d like to argue that that’s exactly what they should do, and it’s not a valid argument against transparency.

Take as an example the Value-added model for teachers. I don’t think there’s any excuse for this model to be opaque: it is widely used (all of New York City public middle and high schools for example), the scores are important to teachers, especially when they are up for tenure, and the community responds to the corresponding scores for the schools by taking their kids out or putting their kids into those schools. There’s lots at stake.

Why would you not want this to be transparent? Don’t we usually like to know how to evaluate our performance on the job? I’d like to know it if being 4 minutes late to work was a big deal, or if I need to stay late on Tuesdays in order to be perceived as working hard. In other words, given that it’s high stakes it’s only fair to let people know how they are being measured and, thus, how to “improve” with respect to that measurement.

Instead of calling it “gaming the model”, we should see it as improving our scores, which, if it’s a good model, should mean being better teachers (or whatever you’re testing). If you tell me that when someone games the model, they aren’t actually becoming a better teacher, then I’d say that means your model needs to improve, not the teacher. Moreover, if that’s true, then without transparency or with transparency, in either case, you’re admitting that the model doesn’t measure the right thing. At least when it’s transparent the problems are more obvious and the modelers have more motivation to make the model measure the right thing.

Another example: credit scoring. Why are these models closed? They affect everyone all the time. How is Visa or Mastercard winning if they don’t tell us what we need to do to earn a good credit card interest rate? What’s the worst thing that could happen, that we are told explicitly that we need to pay our bills on time? I don’t see it. Unless the models are using something devious, like people’s race or gender, in which case I’d understand why they’d want to hide that model. I suspect they aren’t, because that would be too obvious, but I also suspect they might be using other kinds of inputs (like zip codes) that are correlated to race and/ or gender. That’s the kind of thing that argues for transparency, not against it. When a model is as important as credit scores are, I don’t see an argument for opacity.

Why should you care about statistical modeling?

One of the major goals of this blog is to let people know how statistical modeling works. My plan is to explain as much as I can in simple plain English, with the least amount of confusion, and the maximum amount of elucidation at every possible level, so every reader can take at least a basic understanding away.

Why? What’s so important about you knowing about what nerds do?

Well, there are different answers. First, you may be interested in it from a purely cerebral perspective – you may yourself be a nerd or a potential nerd. Since it is interesting, and since there will be I suspect many more job openings coming soon that use this stuff, there’s nothing wrong with getting technical; it may come in handy.

But I would argue that even if it’s not intellectually stimulating for you, you should know at least the basics of this stuff, kind of like how we should all know how our government is run and how to conserve energy; kind of a modern civic duty, if you will.

Civic duty? Whaaa?

Here’s why. There’s an incredible amount of data out there, more than every before, and certainly more than when I was growing up. I mean, sure, we always kept track of our GDP and the stock market, that’s old school data collection. And marketers and politicians have always experimented with different ads and campaigns and kept track of what does and what doesn’t work. That’s all data too. But the sheer volume of data that we are now collecting about people and behaviors is positively stunning. Just think of it as a huge and exponentially growing data vat.

And with that data comes data analysis. This is a young field. Even though I encourage every nerd out there to consider becoming a data scientist, I know that if a huge number of them agreed to it today, there wouldn’t be enough jobs out there for everyone. Even so, there will be, and very soon. Each CEO of each internet startup should be seriously considering hiring a data scientist, if they don’t have one already. The power in data mining is immense and it’s only growing. And as I said, the field is young but it’s growing in sophistication rapidly, for good and for evil.

And that gets me to the evil part, and with it the civic duty part.

I claim two things. First, that statistical modeling can and does get out of hand, which I define as when it starts controlling things in a way that is not intended or understood by the people who built the model (or who use the model, or whose lives are affected by the model). And second, that by staying informed about what models are, what they aren’t, what limits they have and what boundaries need to be enforced, we can, as a society, live in a place which is still data-intensive but reasonable.

To give evidence to my first claim, I point you to the credit crisis. In fact finance is a field which is not that different from others like politics and marketing, except that it is years ahead in terms of data analysis. It was and still is the most data-driven, sophisticated place where models rule and the people typically stand back passively and watch (and wait for the money to be transferred to their bank accounts). To be sure, it’s not the fault of the models. In fact I firmly believe that nobody in the mortgage industry, for example, really believed that the various tranches of the mortgage backed securities were in fact risk-free; they knew they were just getting rid of the risk with a hefty reward and they left it at that. And yet, the models were run, and their numbers were quoted, and people relied on them in an abstract way at the very least, and defended their AAA ratings because that’s what the models said. It was a very good example of models being misapplied in situations that weren’t intended or appropriate. The result, as we know, was and still is an economic breakdown when the underlying numbers were revealed to be far far different than the models had predicted.

Another example, which I plan to write more about, is the value-added models being used to evaluate school teachers. In some sense this example is actually more scary than the example of modeling in finance, in that in this case, we are actually talking about people being fired based on a model that nobody really understands. Lives are ruined and schools are closed based on the output of an opaque process which even the model’s creators do not really comprehend (I have seen a technical white paper of one of the currently used value-added models, and it’s my opinion that the writer did not really understand modeling or at best tried not to explain it if he did).

In summary, we are already seeing how statistical modeling can and has affected all of us. And it’s only going to get more omnipresent. Sometimes it’s actually really nice, like when I go to and learn about new bands besides Bright Eyes (is there really any band besides Bright Eyes?!). I’m not trying to stop cool types of modeling! I’m just saying, we wouldn’t let a model tell us what to name our kids, or when to have them. We just like models to suggest cool new songs we’d like.

Actually, it’s a fun thought experiment to imagine what kind of things will be modeled in the future. Will we have models for how much insurance you need to pay based on your DNA? Will there be modeling of how long you will live? How much joy you give to the people around you? Will we model your worth? Will other people model those things about you?

I’d like to take a pause just for a moment to mention a philosophical point about what models do. They make best guesses. They don’t know anything for sure. In finance, a successful model is a model that makes the right bet 51% of the time. In data science we want to find out who is twice as likely to click a button- but that subpopulation is still very unlikely to click! In other words, in terms of money, weak correlations and likelihoods pay off. But that doesn’t mean they should decide peoples’ fates.

My appeal is this: we need to educate ourselves on how the models around us work so we can spot one that’s a runaway model. We need to assert our right to have power over the models rather than the other way around. And to do that we need to understand how to create them and how to control them. And when we do, we should also demand that any model which does affect us needs to be explained to us in terms we can understand as educated people.

Guest Post: A Discussion Of PARCC Testing

This is a guest post by Eugene Stern, who writes a blog at Sense Made Here, and Kristin Wald, who writes a blog at This Unique* Weblog. Crossposted on their blogs as well.

Today’s post is a discussion of education reform, standardized testing, and PARCC with my friend Kristin Wald, who has been extremely kind to this blog. Kristin taught high school English in the NYC public schools for many years. Today her kids and mine go to school together in Montclair. She has her own blog that gets orders of magnitude more readers than I do.

ES: PARCC testing is beginning in New Jersey this month. There’s been lots of anxiety and confusion in Montclair and elsewhere as parents debate whether to have their kids take the test or opt out. How do you think about it, both as a teacher and as a parent?

KW: My simple answer is that my kids will sit for PARCC. However, and this is where is gets grainy, that doesn’t mean I consider myself a cheerleader for the exam or for the Common Core curriculum in general.

In fact, my initial reaction, a few years ago, was to distance my children from both the Common Core and PARCC. So much so that I wrote to my child’s principal and teacher requesting that no practice tests be administered to him. At that point I had only peripherally heard about the issues and was extending my distaste for No Child Left Behind and, later, Race to the Top. However, despite reading about and discussing the myriad issues, I still believe in change from within and trying the system out to see kinks and wrinkles up-close rather than condemning it full force.


ES: Why did you dislike NCLB and Race to the Top? What was your experience with them as a teacher?

KW: Back when I taught in NYC, there was wiggle room if students and schools didn’t meet standards. Part of my survival as a teacher was to shut my door and do what I wanted. By the time I left the classroom in 2007 we were being asked to post the standards codes for the New York State Regents Exams around our rooms, similar to posting Common Core standards all around. That made no sense to me. Who was this supposed to be for? Not the students – if they’re gazing around the room they’re not looking at CC RL.9-10 next to an essay hanging on a bulletin board. I also found NCLB naïve in its “every child can learn it all” attitude. I mean, yes, sure, any child can learn. But kids aren’t starting out at the same place or with the same support. And anyone who has experience with children who have not had the proper support up through 11th grade knows they’re not going to do well, or even half-way to well, just because they have a kickass teacher that year.

Regarding my initial aversion to Common Core, especially as a high school English Language Arts teacher, the minimal appearance of fiction and poetry was disheartening. We’d already seen the slant in the NYS Regents Exam since the late 90’s.

However, a couple of years ago, a friend asked me to explain the reason The Bluest Eye, with its abuse and rape scenes, was included in Common Core selections, so I took a closer look. Basically, a right-wing blogger had excerpted lines and scenes from the novel to paint it as “smut” and child pornography, thus condemning the entire Common Core curriculum. My response to my friend ended up as “In Defense of The Bluest Eye.”

That’s when I started looking more closely at the Common Core curriculum. Learning about some of the challenges facing public schools around the country, I had to admit that having a required curriculum didn’t seem like a terrible idea. In fact, in a few cases, the Common Core felt less confining than what they’d had before. And you know, even in NYC, there were English departments that rarely taught women or minority writers. Without a strong leader in a department, there’s such a thing as too much autonomy. Just like a unit in a class, a school and a department should have a focus, a balance.

But your expertise is Mathematics, Eugene. What are your thoughts on the Common Core from that perspective?

ES: They’re a mix. There are aspects of the reforms that I agree with, aspects that I strongly disagree with, and then a bunch of stuff in between.

The main thing I agree with is that learning math should be centered on learning concepts rather than procedures. You should still learn procedures, but with a conceptual underpinning, so you understand what you’re doing. That’s not a new idea: it’s been in the air, and frustrating some parents, for 50 years or more. In the 1960’s, they called it New Math.

Back then, the reforms didn’t go so well because the concepts they were trying to teach were too abstract – too much set theory, in a nutshell, at least in the younger grades. So then there was a retrenchment, back to learning procedures. But these things seem to go in cycles, and now we’re trying to teach concepts better again. This time more flexibly, less abstractly, with more examples. At least that’s the hope, and I share that hope.

I also agree with your point about needing some common standards defining what gets taught at each grade level. You don’t want to be super-prescriptive, but you need to ensure some kind of consistency between schools. Otherwise, what happens when a kid switches schools? Math, especially, is such a cumulative subject that you really need to have some big picture consistency in how you teach it.


ES: What I disagree with is the increased emphasis on standardized testing, especially the raised stakes of those tests. I want to see better, more consistent standards and curriculum, but I think that can and should happen without putting this very heavy and punitive assessment mechanism on top of it.

KW: Yes, claiming to want to assess ability (which is a good thing), but then connecting the results to a teacher’s effectiveness in that moment is insincere evaluation. And using a standardized test not created by the teacher with material not covered in class as a hard percentage of a teacher’s evaluation makes little sense. I understand that much of the exam is testing critical thinking, ability to reason and use logic, and so on. It’s not about specific content, and that’s fine. (I really do think that’s fine!) Linking teacher evaluations to it is not.

Students cannot be taught to think critically in six months. As you mentioned about the spiraling back to concepts, those skills need to be revisited again and again in different contexts. And I agree, tests needn’t be the main driver for raising standards and developing curriculum. But they can give a good read on overall strengths and weaknesses. And if PARCC is supposed to be about assessing student strengths and weaknesses, it should be informing adjustments in curriculum.

On a smaller scale, strong teachers and staffs are supposed to work as a team to influence the entire school and district with adjusted curriculum as well. With a wide reach like the Common Core, a worrying issue is that different parts of the USA will have varying needs to meet. Making adjustments for all based on such a wide collection of assessments is counterintuitive. Local districts (and the principals and teachers in them) need to have leeway with applying them to best suit their own students.

Even so, I do like some things about data driven curricula. Teachers and school administrators are some of the most empathetic and caring people there are, but they are still human, and biases exist. Teachers, guidance counselors, administrators can’t help but be affected by personal sympathies and peeves. Having a consistent assessment of skills can be very helpful for those students who sometimes fall through the cracks. Basically, standards: yes. Linking scores to teacher evaluation: no.

ES: Yes, I just don’t get the conventional wisdom that we can only tell that the reforms are working, at both the individual and group level, through standardized test results. It gives us some information, but it’s still just a proxy. A highly imperfect proxy at that, and we need to have lots of others.

I also really like your point that, as you’re rolling out national standards, you need some local assessment to help you see how those national standards are meeting local needs. It’s a safeguard against getting too cookie-cutter.

I think it’s incredibly important that, as you and I talk, we can separate changes we like from changes we don’t. One reason there’s so much noise and confusion now is that everything – standards, curriculum, testing – gets lumped together under “Common Core.” It becomes this giant kitchen sink that’s very hard to talk about in a rational way. Testing especially should be separated out because it’s fundamentally an issue of process, whereas standards and curriculum are really about content.

You take a guy like Cuomo in New York. He’s trying to increase the reliance on standardized tests in teacher evaluations, so that value added models based on test scores count for half of a teacher’s total evaluation. And he says stuff like this: “Everyone will tell you, nationwide, the key to education reform is a teacher evaluation system.” That’s from his State of the State address in January. He doesn’t care about making the content better at all. “Everyone” will tell you! I know for a fact that the people spending all their time figuring out at what grade level kids should start to learn about fractions aren’t going tell you that!

I couldn’t disagree with that guy more, but I’m not going to argue with him based on whether or not I like the problems my kids are getting in math class. I’m going to point out examples, which he should be well aware of by now, of how badly the models work. That’s a totally different discussion, about what we can model accurately and fairly and what we can’t.

So let’s have that discussion. Starting point: if you want to use test scores to evaluate teachers, you need a model because – I think everyone agrees on this – how kids do on a test depends on much more than how good their teacher was. There’s the talent of the kid, what preparation they got outside their teacher’s classroom, whether they got a good night’s sleep the night before, and a good breakfast, and lots of other things. As well as natural randomness: maybe the reading comprehension section was about DNA, and the kid just read a book about DNA last month. So you need a model to break out the impact of the teacher. And the models we have today, even the most state-of-the-art ones, can give you useful aggregate information, but they just don’t work at that level of detail. I’m saying this as a math person, and the American Statistical Association agrees. I’ve written about this here and here and here and here.

Having student test results impact teacher evaluations is my biggest objection to PARCC, by far.

KW: Yep. Can I just cut and paste what you’ve said? However, for me, another distasteful aspect is how technology is tangled up in the PARCC exam.


ES: Let me tell you the saddest thing I’ve heard all week. There’s a guy named Dan Meyer, who writes very interesting things about math education, both in his blog and on Twitter. He put out a tweet about a bunch of kids coming into a classroom and collectively groaning when they saw laptops on every desk. And the reason was that they just instinctively assumed they were either about to take a test or do test prep.

That feels like such a collective failure to me. Look, I work in technology, and I’m still optimistic that it’s going to have a positive impact on math education. You can use computers to do experiments, visualize relationships, reinforce concepts by having kids code them up, you name it. The new standards emphasize data analysis and statistics much more than any earlier standards did, and I think that’s a great thing. But using computers primarily as a testing tool is an enormous missed opportunity. It’s like, here’s the most amazing tool human beings have ever invented, and we’re going to use it primarily as a paperweight. And we’re going to waste class time teaching kids exactly how to use it as a paperweight. That’s just so dispiriting.

KW: That’s something that hardly occurred to me. My main objection to hosting the PARCC exam on computers – and giving preparation homework and assignments that MUST be done on a computer – is the unfairness inherent in accessibility. It’s one more way to widen the achievement gap that we are supposed to be minimizing. I wrote about it from one perspective here.

I’m sure there are some students who test better on a computer, but the playing field has to be evenly designed and aggressively offered. Otherwise, a major part of what the PARCC is testing is how accurately and quickly children use a keyboard. And in the aggregate, the group that will have scores negatively impacted will be children with less access to the technology used on the PARCC. That’s not an assessment we need to test to know. When I took the practice tests, I found some questions quite clear, but others were difficult not for content but in maneuvering to create a fraction or other concept. Part of that can be solved through practice and comfort with the technology, but then we return to what we’re actually testing.

ES: Those are both great points. The last thing you want to do is force kids to write math on a computer, because it’s really hard! Math has lots of specialized notation that’s much easier to write with pencil and paper, and learning how to write math and use that notation is a big part of learning the subject. It’s not easy, and you don’t want to put artificial obstacles in kids’ way. I want kids thinking about fractions and exponents and what they mean, and how to write them in a mathematical expression, but not worrying about how to put a numerator above a denominator or do a superscript or make a font smaller on a computer. Plus, why in the world would you limit what kids can express on a test to what they can input on a keyboard? A test is a proxy already, and this limits what it can capture even more.

I believe in using technology in education, but we’ve got the order totally backwards. Don’t introduce the computer as a device to administer tests, introduce it as a tool to help in the classroom. Use it for demos and experiments and illustrating concepts.

As far as access and fairness go, I think that’s another argument for using the computer as a teaching tool rather than a testing tool. If a school is using computers in class, then at least everyone has access in the classroom setting, which is a start. Now you might branch out from there to assignments that require a computer. But if that’s done right, and those assignments grow in an organic way out of what’s happening in the classroom, and they have clear learning value, then the school and the community are also morally obligated to make sure that everyone has access. If you don’t have a computer at home, and you need to do computer-based homework, then we have to get you computer access, after school hours, or at the library, or what have you. And that might actually level the playing field a bit. Whereas now, many computer exercises feel like they’re primarily there to get kids used to the testing medium. There isn’t the same moral imperative to give everybody access to that.

I really want to hear more about your experience with the PARCC practice tests, though. I’ve seen many social media threads about unclear questions, both in a testing context and more generally with the Common Core. It sounds like you didn’t think it was so bad?

KW: Well, “not so bad” in that I am a 45 year old who was really trying to take the practice exam honestly, but didn’t feel stressed about the results. However, I found the questions with fractions confusing in execution on the computer (I almost gave up), and some of the questions really had to be read more than once. Now, granted, I haven’t been exposed to the language and technique of the exam. That matters a lot. In the SAT, for example, if you don’t know the testing language and format it will adversely affect your performance. This is similar to any format of an exam or task, even putting together an IKEA nightstand.

There are mainly two approaches to preparation, and out of fear of failing, some school districts are doing hardcore test preparation – much like SAT preparation classes – to the detriment of content and skill-based learning. Others are not altering their classroom approaches radically; in fact, some teachers and parents have told me they hardly notice a difference. My unscientific observations point to a separation between the two that is lined in Socio-Economic Status. If districts feel like they are on the edge or have a lot to lose (autonomy, funding, jobs), if makes sense that they would be reactionary in dealing with the PARCC exam. Ironically, schools that treat the PARCC like a high-stakes test are the ones losing the most.

Opting Out

KW: Despite my misgivings, I’m not in favor of “opting out” of the test. I understand the frustration that has prompted the push some districts are experiencing, but there have been some compromises in New Jersey. I was glad to see that the NJ Assembly voted to put off using the PARCC results for student placement and teacher evaluations for three years. And I was relieved, though not thrilled, that the percentage of PARCC results to be used in teacher evaluations was lowered to 10% (and now put off). I still think it should not be a part of teacher evaluations, but 10% is an improvement.

Rather than refusing the exam, I’d prefer to see the PARCC in action and compare honest data to school and teacher-generated assessments in order to improve the assessment overall. I believe an objective state or national model is worth having; relying only on teacher-based assessment has consistency and subjective problems in many areas. And that goes double for areas with deeply disadvantaged students.

ES: Yes, NJ seems to be stepping back from the brink as far as model-driven teacher evaluation goes. I think I feel the same way you do, but if I lived in NY, where Cuomo is trying to bump up the weight of value added models in evaluations to 50%, I might very well be opting out.

Let me illustrate the contrast – NY vs. NJ, more test prep vs. less — with an example. My family is good friends with a family that lived in NYC for many years, and just moved to Montclair a couple months ago. Their older kid is in third grade, which is the grade level where all this testing starts. In their NYC gifted and talented public school, the test was this big, stressful thing, and it was giving the kid all kinds of test anxiety. So the mom was planning to opt out. But when they got to Montclair, the kid’s teacher was much more low key, and telling the kids not to worry. And once it became lower stakes, the kid wanted to take the test! The mom was still ambivalent, but she decided that here was an opportunity for her kid to get used to tests without anxiety, and that was the most important factor for her.

I’m trying to make two points here. One: whether or not you opt out depends on lots of factors, and people’s situations and priorities can be very different. We need to respect that, regardless of which way people end up going. Two: shame on us, as grown ups, for polluting our kids’ education with our anxieties! We need to stop that, and that extends both to the education policies we put in place and how we collectively debate those policies. I guess what I’m saying is: less noise, folks, please.

KW: Does this very long blog post count as noise, Eugene? I wonder how this will be assessed? There are so many other issues – private profits from public education, teacher autonomy in high performing schools, a lack of educational supplies and family support, and so on. But we have to start somewhere with civil and productive discourse, right? So, thank you for having the conversation.

ES: Kristin, I won’t try to predict anyone else’s assessment, but I will keep mine low stakes and say this has been a pleasure!

I’m writing a book called Weapons of Math Destruction

I’m incredibly excited to announce that I am writing a book called Weapons of Math Destruction for Random House books, with my editor Amanda Cook. There will also be a subtitle which we haven’t decided on yet.

Here’s how this whole thing went down. First I met my amazing book agent Jay Mandel from William Morris though my buddy Jordan Ellenberg. As many of you know, Jordan is also writing a book but it’s much farther along in the process and has already passed the editing phase. Jordan’s book is called How Not To Be Wrong and it’s already available for pre-order on Amazon.

Anyhoo, Jay spent a few months with me telling me how to write a book proposal, and it was a pretty substantial undertaking actually and required more than just an outline. It was like a short treatment of all the chapters but then two chapters pretty filled in, including the first, and as you know the first is kind of like an advertisement for the whole rest of the book.

Then, once that proposal was ready, Jay started what he hoped would be a bidding war for the proposal among publishers. He had a whole list of people he talked to from all over the place in the publishing world.

What actually happened though was Amanda Cook from Crown Publishing, which is part of Random House, was the first person who was interested enough to talk to me about it, and then we hit it off really well, and she made a pre-emptive offer for the book so the full on bidding war didn’t end up needing to happen. And then just last week she announced the deal in what’s called the Publisher’s Marketplace, which is for people inside publishing to keep abreast of the deals and news. The actual link is here, but it’s behind a pay wall, so Amanda got me a screen shot:


If that font is too small, it says something like this:

Harvard math Ph.D., former Wall Street quant, and advisor to the Occupy movement Cathy O’Neil’s WEAPONS OF MATH DESTRUCTION, arguing that mathematical modeling has become a pervasive and destructive force in society—in finance, education, medicine, politics, and the workplace—and showing how current models exacerbate inequality and endanger democracy and how we might rein them in, to Amanda Cook at Crown in a pre-empt by Jay Mandel at William Morris Endeavor (NA).

So as you can tell I’m incredibly excited about the book, and I have tons of ideas about it, but of course I’d love my readers to weigh in on crucial examples of models and industries that you think might get overlooked.

Please, post a comment or send me an email (located on my About page) with your favorite example of a family of models (Value Added Model for teachers is already in!) or a specific model (Value-at-Risk model in finance in already!) that is illustrative of feedback loops, or perverted incentives, or creepy modeling, or some such concept that you imagine I’ll be writing about (or should be!). Thanks so much for your input!

One last thing. I’m aiming to finish the writing part by next Spring, and then the book is actually released about 9 months later. It takes a while. I’m super glad I have had the experience of writing a technical book with O’Reilly as well as the homemade brew Occupy Finance with my Occupy group so I know at least some of the ropes, but even so this is a bit more involved.


Categories: data science, modeling, news