Last November I wrote to the Department of Education to make a FOIL request for the source code for the teacher value-added model (VAM).
To explain why I’d want something like this: I think the VAM sucks, and I’d like to explore the actual source code directly. The white paper I got my hands on is cryptically written (take a look!) and doesn’t explain, for example, what the model’s sensitivities to its inputs are. The best way to get at that is the source code.
Plus, since the New York Times and other news outlets published teachers’ VAM scores after a long battle and a FOIA request (see details about this here), I figured it’s only fair to also publicly release the actual black box that determines those scores.
Indeed, without knowledge of what the model consists of, the VAM scoring regime is little more than a secret set of rules with tremendous power over teachers and the teachers’ union, one that also enables the outrageous public shaming described above.
I think teachers deserve better, and I want to illustrate the weaknesses of the model directly on an open models platform.
The FOIL request
Here’s the email I sent to email@example.com on 11/22/13:
Dear Records Access Officer for the NYC DOE,
I’m looking to get a copy of the source code for the most recent value-added teacher model through a FOIA request. There are various publicly available descriptions of such models, for example here, but I’d like the actual underlying code.
Please tell me if I’ve written to the correct person for this FOIA request, thank you very much.
Since my FOIL request
In response to my request, on 12/3/13, 1/6/14, and 2/4/14 I got letters saying stuff was taking a long time since my request was so complicated. Then yesterday I got the following response:
If you follow the link you’ll get another white paper, this time from 2012-2013, which is exactly what I said I didn’t want in my original request.
I wrote back, not that it’s likely to work, and after reminding them of the text of my original request I added the following:
What you sent me is the newer version of the publicly available description of the model, very much like my link above. I specifically asked for the underlying code. That would be in a programming language like python or C++ or java.
Can you come back to me with the actual code? Or who should I ask?
Thanks very much,
It strikes me as strange that it took them more than three months to send me a link to a white paper instead of the source code I requested. Plus I’m not sure what they mean by “SED” (my guess is these guys), or exactly where I should send a new FOIL request.
Am I getting the runaround? Any suggestions?
Tonight I’ll be giving a talk at the NYC Open Data Meetup, organized by Vivian Zhang. I’ll be discussing my essay from last year entitled On Being a Data Skeptic, as well as my Doing Data Science book. I believe there are still spots left if you’d like to attend. The details are as follows:
When: Thursday, March 6, 2014, 7:00 PM to 9:00 PM
- 6:15pm: Doors Open for pizza and casual networking
- 7:00pm: Workshop begins
- 8:30pm: Audience Q&A
A while back I was talking to some math people about how credit default swaps (CDSs), by their very nature, contain risk that is generally speaking undetectable with standard risk models like Value-at-Risk (VaR).
It occurred to me then that I could put it another way: perhaps credit default swaps were deliberately created by someone who knew all about the standard risk models and wanted to game the system. VaR was commercialized in the mid-1990s and CDSs existed around the same time, but they didn’t take off until a decade or so later, after VaR had become widespread, which makes the theory hard to prove without knowing the actors.
For that matter, it is reasonable to assume something less deliberate occurred: a bunch of weird instruments were created, and those that hid risk best thrived, an evolutionary version of the same theory.
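To see why a CDS-like payoff slips past VaR, here’s a toy sketch with made-up numbers: a protection seller collects a small premium almost every day and takes a huge loss very rarely. Historical 95% VaR, computed from the day-by-day P&L, reports essentially no risk.

```python
import random

random.seed(0)

# Simulate daily P&L for a seller of credit protection (CDS-like payoff):
# a $1 premium on most days, a rare $500 blow-up. All numbers are invented
# for illustration.
def cds_like_pnl(n_days, premium=1.0, crash_loss=-500.0, crash_prob=0.002):
    return [crash_loss if random.random() < crash_prob else premium
            for _ in range(n_days)]

def historical_var(pnl, confidence=0.95):
    """95% VaR: the loss threshold exceeded on only 5% of days."""
    losses = sorted(-x for x in pnl)  # losses as positive numbers, ascending
    idx = int(confidence * len(losses))
    return losses[min(idx, len(losses) - 1)]

pnl = cds_like_pnl(10_000)
print("worst day:", min(pnl))
print("95% VaR:", historical_var(pnl))
```

Because the blow-ups happen on far fewer than 5% of days, the 95th percentile of the loss distribution lands on an ordinary premium-collecting day, so VaR reports a small gain even though the worst day wipes out more than a year of premiums.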
I was reminded recently of this conspiracy theory when Joe Burns talked to my Occupy group last Sunday about his recent book, Reviving the Strike. He talked about the history of strikes as a tool of leverage, and how much less frequently we’ve seen large-scale, industry-wide strikes. He made the point that the legality of strikes has historically been uncorrelated with the existence of strikes: strikers cannot necessarily wait for the legal system to catch up with the needs of the worker. Sometimes strikers need to exert pressure on legislation.
Anyhoo, one question that came up in Q&A was: in this world of subsidiaries and franchises, how can workers strike against the upper management that controls the actual big money? After all, McDonald’s workers work for franchisees who are often not well-off. The real money lives in the mother company but is legally isolated from the franchises.
Similarly, with Walmart, there are massive numbers of workers that don’t work directly for Walmart but do work in the massive supply chain network set up and run by Walmart. They would like to hold Walmart responsible for their working conditions. How does that work?
It seems like the same VaR/CDS story as above. Namely, the legal structure of McDonald’s and Walmart almost seems deliberately set up to shield the mother company from legal responsibility to disgruntled workers. So maybe first you had the legal system, and then lawyers structured the supply chain and its workforce so that striking workers could only strike against powerless figures, especially in the McDonald’s case (since Walmart has plenty of workers working directly for the mother company as well).
Last couple of points. First, only long-term, powerful enterprises can go to the trouble of gaming such large systems. It’s an artifact of the age of the corporation.
And finally, I feel like it’s hard to combat. We could try to improve our risk models or legal system, but that would probably make them even more complicated, which in turn gives massive corporations more ways to game them. Not to be a cynic, but I don’t see a solution besides somehow separately sidestepping our personal risk exposure to these problems.
I heard an NPR report yesterday with Emily Steel, reporter from the Financial Times, about what kind of attributes make you worth more to advertisers. She has developed an ingenious online calculator here, which you should go play with.
As you can see it cares about things like whether you’re about to have a kid or are a new parent, as well as if you’ve got some disease where the industry for that disease is well-developed in terms of predatory marketing.
For example, you can bump up your worth to $0.27 from the standard $0.0007 if you’re obese, and add another $0.10 if you admit to being the type to buy weight-loss products. And of course data warehouses can only get that much money for your data if they know about your weight, which they may not if you don’t buy weight-loss products.
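The additive logic of a calculator like this can be sketched in a few lines. Only the two weight-related figures come from the numbers quoted above; the base value is the quoted $0.0007, and every other premium below is a hypothetical stand-in (the real FT calculator uses many more categories).

```python
# Base value of an average person's data, per the figure quoted above.
BASE_VALUE = 0.0007

# Per-attribute premiums. The first two are loosely based on the quoted
# numbers; "expecting_child" is a purely hypothetical example.
PREMIUMS = {
    "obese": 0.27 - BASE_VALUE,         # quoted: worth jumps to $0.27
    "buys_weight_loss_products": 0.10,  # quoted: another $0.10
    "expecting_child": 0.12,            # hypothetical
}

def data_worth(attributes):
    """Base value plus a premium for each attribute the tracker knows."""
    return BASE_VALUE + sum(PREMIUMS.get(a, 0.0) for a in attributes)

print(round(data_worth(["obese", "buys_weight_loss_products"]), 4))  # 0.37
```

The point of the additive structure is that each new fact the data warehouse learns about you has a price tag, which is why the industry works so hard to learn them.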
The calculator doesn’t know everything, and you can experiment with how much it does know, but some of the default assumptions are that it knows my age, gender, education level, and ethnicity. That’s plenty of information to, say, build an unregulated version of a credit score that bypasses the Equal Credit Opportunity Act.
I am back from Berkeley where I attended a couple of hours of conversations about MOOCs last Friday up at MSRI.
It was a panel discussion given mostly by math and stats people who themselves run MOOCs, and I was wondering if the people who are involved have a better sense of the side effects and feedback loops involved in the process. After all, I’m claiming that the MOOC Revolution will lead to the end of math research, and I wanted to be proven wrong.
Unfortunately, I left feeling like I have even more evidence that my fears will be realized.
I think the critical moment came when Ani Adhikari spoke. Professor Adhikari is in the second semester of giving her basic stats MOOC, and from how she described it, she is incredibly good at it. There’s a social-network aspect of the class that seems to be going really well; she says she spends 30 minutes to an hour a day on it herself, interacting with students. I think she said 28,000 students took it her first semester, in addition to her in-class students at Berkeley. I know and respect Professor Adhikari personally, as I taught for her at the Berkeley Mills summer program for women many years ago. I know how devoted she is to good teaching.
Even so, she lost me late in the discussion when she explained that EdX, the platform which hosts her stats MOOC, wanted to offer her class three times a year without her participation. She said something to the effect that MOOC professors had to be “extra vigilant” about this outrageous idea and guard against it at all costs.
After all, she said, at the end of the day the MOOC videos are something like a fancy textbook, and we don’t hand out textbooks and claim they are courses, so we by the same token cannot hand out MOOC videos (and presumably the social networks associated with them) and claim they are courses.
When I pressed her in the Q&A session as to how exactly she was going to remain vigilant against this threat, she said she has a legal contract with EdX that prevented them from offering the course without her approval.
And I’m happy for her and her great contract, but here are two questions for her and for the community.
First, how long until someone in math or stats makes a kick-ass MOOC and doesn’t remember to get that air-tight legal contract? Or ends up in an actual legal battle with EdX and realizes their lawyers are not as expensive as EdX’s? Or believes that “information should be free” and makes the MOOC with the express intention of letting it be replayed forever?
Second, how much sense does it make to claim that you and your presence are super critical to the success of a MOOC if 28,000 people took this class and you interacted with them at most one hour a day? An hour spread over 28,000 students comes to roughly an eighth of a second of attention per student per day. Can you possibly claim that the average student benefited from your presence? It seems to me that the value proposition for the average MOOC student is very similar whether you are there or not.
Overall the impression I got from the speakers, who were mostly MOOC evangelists and involved with MOOCs themselves, was that they loved MOOCs because MOOCs were working for them. They weren’t looking much beyond that point to side effects.
There was one exception, namely Susan Holmes, who listed some side effects of MOOCs including a decreased need for math Ph.D.’s. Unfortunately the conversation didn’t dwell on this, though, and it happened at the very end of the day.
Here’s what I’d like to see: a conversation at MSRI about the future of math research funding in the context of MOOCs and a reduced NSF, where hopefully we come up with something besides “Jim Simons”. It’s extra ironic that the conversation, if it happens, would be held in the Simons Theater.
I’m in Berkeley this week, where I gave two talks (here are my slides from Monday’s talk on recommendation engines, and here are my slides from Tuesday’s talk on modeling) and I’ve been hanging out with math nerds and college friends and enjoying the amazing food and cafe scene. This is the freaking life, people.
Here’s what’s been on my mind lately: the urgent need for good data journalism. If you read this Washington Post blog by Max Fisher you will get at one important angle of the problem. The article talks about the need for journalists to be competent in basic statistics and exploratory data analysis to do reasonable reporting on data, in this case the state of journalistic freedoms.
And you might think that, as long as journalists report on other stuff that’s not data heavy, they’re safe. But I’d argue that the proliferation of data is leaking into all corners of our culture, and basic data and computing literacy is becoming increasingly vital to the job of journalism.
Here’s what I’m not saying (a la Miss Disruption): learn to code, journalists, and everything will be cool. To be clear, having data skills is necessary but not sufficient.
So it’s more like, if you don’t learn to code, and even more importantly if you don’t learn to be skeptical of the models and the data, then you will have yet another obstacle between you and the truth.
Here’s one way to think about it. A few days ago I wrote a post about different ways to define and regulate discriminatory acts. On the one hand you have acts or processes that are “effectively discriminatory” and on the other you have acts or processes that are “intentionally discriminatory.”
In this day and age, we have complicated, opaque, and proprietary models: in other words, a perfect hiding place for bad intentions. It would be idiotic for someone with the intention of being discriminatory to do so outright. It’s much easier to embed such a thing in an opaque model where it will seem unintentional and will probably never be discovered at all.
But how is an investigative journalist even going to approach that? The first thing they need is to arm themselves with the right questions and the right attitude. And it wouldn’t hurt if they or their team could perform a test on the data and algorithm as well.
I’m not saying that we’re going to suddenly have do-everything superhuman journalists. Just as the list of job requirements for data scientists is outrageously long and nobody can be expert at everything, we will have to form teams of journalists that, as a whole, have lots of computing and investigative expertise.
The alternative is that the models go unchallenged, which is a really bad idea.
Here’s a perfect example of what I think needs to happen more: when ProPublica reverse-engineered Obama’s political messaging model.
There’s a wicked irony when it comes to many privacy advocates.
They are often narrowly focused on their own individual privacy issues, but when it comes down to it they are typically super-educated, well-off nerds with few revolutionary thoughts. In other words, the very people obsessing over their privacy are people who are not particularly vulnerable to the predatory attacks of either the NSA or the private companies that make use of private data.
Let me put it this way. Say I’m a data scientist working at a predatory credit card firm, building a segmentation model to target the most profitable customers: those who ring up balances and pay only the minimums every month, sometimes paying late and accruing extra fees. If, while profiling a user, I notice an ad blocker or some other signal of privacy concerns, chances are that becomes a wealth indicator and I leave them alone. The mere presence of privacy concerns signals that this person isn’t worth pursuing with my manipulative scheme.
If you don’t believe me, take a look at a recent Slate article written by Cyrus Nemati and entitled Take My Data Please: How I learned to stop worrying and love a less private internet.
In it he describes how he used to be privacy-obsessed, for no better reason than that he liked to stick up a middle finger to those who would collect his data. I think that article should have been called something like, Well-educated white guy was a privacy freak until he realized he didn’t have to be because he’s a well-educated white guy.
He concludes that he really likes how well customized things are to his particular personality, and that shucks, we should all just appreciate the web and stop fretting.
But here’s the thing: the problem isn’t that companies are using his information to screw Cyrus Nemati. The problem is that the most vulnerable people, the very people who should be concerned with privacy but aren’t, are the ones getting tracked, mined, and screwed.
In other words, it’s silly for certain people to be scrupulously careful about their private data if they are the types of people who get great credit card offers and have a stable well-paid job and are generally healthy. I include myself in this group. I do not prevent myself from being tracked, because I’m not at serious risk.
And I’m not saying nothing can go wrong for those people, including me. Things can, especially if they suddenly lose their jobs or they have kids with health problems or something else happens which puts them into a special category. But generally speaking those people with enough time on their hands and education to worry about these things are not the most vulnerable people.
I hereby challenge Cyrus Nemati to seriously consider who should be concerned about their data being collected, and how we as a society are going to address their concerns. Recent legislation in California is a good start for kids, and I’m glad to see the New York Times editors asking for more.