I am always amazed by my Occupy group, and yesterday’s meeting was no exception. We decided to look into redefining the poverty line, and although the conversation took a moving and deeply philosophical turn, I’ll probably only have time to talk about the nuts and bolts of formulas this morning.
In the early 1960′s, it was noted that poor families spent about a third of their money on food. To build an “objective” measure of poverty, then, they decided to measure the cost of an “economic food budget” for a family of that size and then multiply that cost by 3.
Does that make sense anymore?
Well, no. Food has gotten a lot cheaper since 1964, and other stuff hasn’t. According to the following chart, which I got from The Atlantic, poor families now spend about one sixth of their money on food:
Now if you think about it, the formula should be more like “economic food budget” * 6, which would effectively double all the numbers.
Does this matter? Well, yes. Various programs like Medicare and Medicaid determine eligibility based on poverty. Also, the U.S. census measures poverty in our country using this yardstick. If we double those numbers we will be seeing a huge surge in the official numbers.
Not that we’d be capturing everyone even then. The truth is, in some locations, like New York, rent is so high that the formula would likely be needing even more adjustment. Although food is expensive too, so maybe the base “economic food budget” would simply need adjusting.
As usual the key questions are, what are we accomplishing with such a formula, and who is “we”?
There’s a good New York Times article by Todd Balf entitled The Story Behind the SAT Overhaul (hat tip Chris Wiggins).
In it is described the story of the new College Board President David Coleman, and how he decided to deal with the biggest problem with the SAT: namely, that it was pretty easy to prepare for the test, and the result was that richer kids did better, having more resources – both time and money – to prepare.
Here’s a visual from another NY Times blog on the issue:
Here’s my summary of the story.
At this point the SAT serves mainly to sort people by income. It’s no longer an appropriate way to gauge “IQ” as it was supposed to be when it was invented. Not to mention that colleges themselves have been playing a crazy game with respect to gaming the US News & World Reports college ranking model via their SAT scores. So it’s one feedback loop feeding into another.
How can we deal with this? One way is to stop using it. The article describes some colleges that have made SAT scores optional. They have not suffered, and they have more diversity.
But since the College Board makes their livelihood by testing people, they were never going to just shut down. Instead they’ve decided to explicitly make the SAT about content knowledge that they think high school students should know to signal college readiness.
And that’s good, but of course one can still prepare for that test. And since they’re acknowledging that now, they’re trying to set up the prep to make it more accessible, possibly even “free”.
But here’s the thing, it’s still online, and it still involves lots of time and attention, which still saps resources. I predict we will still see incredible efforts towards gaming this new model, and it will still break down by income, although possibly not quite as much, and possibly we will be training our kids to get good at slightly more relevant stuff.
I would love to see more colleges step outside the standardized testing field altogether.
Last November I wrote to the Department of Education to make a FOIL request for the source code for the teacher value-added model (VAM).
To explain why I’d want something like this, I think the VAM model sucks and I’d like to explore the actual source code directly. The white paper I got my hands on is cryptically written (take a look!) and doesn’t explain what the actual sensitivity to inputs are, for example. The best way to get at that is the source code.
Plus, since the New York Times and other news outlets published teacher’s VAM scores after a long battle and a FOIA request (see details about this here), I figured it’s only fair to also publicly release the actual black box which determines those scores.
Indeed without knowledge of what the model consists of, the VAM scoring regime is little more than a secret set of rules, with tremendous power over teachers and the teacher union, and also incorporates outrageous public shaming as described above.
I think teachers deserve better, and I want to illustrate the weaknesses of the model directly on an open models platform.
The FOIL request
Here’s the email I sent to firstname.lastname@example.org on 11/22/13:
Dear Records Access Officer for the NYC DOE,
I’m looking to get a copy of the source code for the most recent value-added teacher model through a FOIA request. There are various publicly available descriptions of such models, for example here, but I’d like the actual underlying code.
Please tell me if I’ve written to the correct person for this FOIA request, thank you very much.
Since my FOIL request
In response to my request, on 12/3/13, 1/6/14, and 2/4/14 I got letters saying stuff was taking a long time since my request was so complicated. Then yesterday I got the following response:
If you follow the link you’ll get another white paper, this time from 2012-2013, which is exactly what I said I didn’t want in my original request.
I wrote back, not that it’s likely to work, and after reminding them of the text of my original request I added the following:
What you sent me is the newer version of the publicly available description of the model, very much like my link above. I specifically asked for the underlying code. That would be in a programming language like python or C++ or java.
Can you to come back to me with the actual code? Or who should I ask?
Thanks very much,
It strikes me as strange that it took them more than 3 months to send me a link to a white paper instead of the source code as I requested. Plus I’m not sure what they mean by “SED” but I’m guessing it means these guys, but I’m not sure of exactly who to send a new FOIL request.
Am I getting the runaround? Any suggestions?
Yesterday a couple of people sent me this article about mysterious deaths at JP Morgan. There’s no known connection between them, but maybe it speaks to some larger problem?
I don’t think so. A little back-of-the-envelope calculation tells me it’s not at all impressive, and this is nothing but media attention turned into conspiracy theory with the usual statistics errors.
Here are some numbers. We’re talking about 3 suicides over 3 weeks. According to wikipedia, JP Morgan has 255,000 employees, and also according to wikipedia, the U.S. suicide rate for men is 19.2 per 100,000 per year, and for women is 5.5. The suicide rates for Hong Kong and the UK, where two of the suicides took place, are much higher.
Let’s eyeball the overall rate at 19 since it’s male dominated and since may employees are overseas in higher-than-average suicide rate countries.
Since 3 weeks is about 1/17th of a year, we’d expect to see about 19/17 suicides per year per 100,000 employees, and seince we have 255,000 employees, that means about 19/17*2.55 = 2.85 suicides in that time. We had three.
This isn’t to say we’ve heard about all the suicides, just that we expect to see about one suicide a week considering how huge JP Morgan is. So let’s get over this, it’s normal. People commit suicide pretty regularly.
It’s very much like how we heard all about suicides at Foxconn, but then heard that the suicide rate at Foxconn is lower than the general Chinese population.
There is a common statistical problem called the clustering illusion, whereby actually random events look clustered sometimes. Here’s a 2-dimensional version of the clustering illusion:
Actually my calculation above points to something even dumber, which is that we expected 2.85 suicides and we saw 3, so it’s not even a proven cluster. Although it could be, because again we probably didn’t hear about all of them. Maybe it’s a cluster of “really obvious jump-from-a-building” suicides.
And I’m not saying JP Morgan is a nice place to work. I feel suicidal just thinking about working there myself. But I don’t want us to jump to any statistically unsupported conclusions.
A fascinating and timely study just came out about the “Stand Your Ground” laws. It was written by Cheng Cheng and Mark Hoekstra, and is available as a pdf here, although I found out about in a Reuters column written by Hoekstra. Here’s a longish but crucial excerpt from that column:
It is fitting that much of this debate has centered on Florida, which enacted its law in October of 2005. Florida provides a case study for this more general pattern. Homicide rates in Florida increased by 8 percent from the period prior to passing the law (2000-04) to the period after the law (2006-10).By comparison, national homicide rates fell by 6 percent over the same time period. This is a crude example, but it illustrates the more general pattern that exists in the homicide data published by the FBI.
The critical question for our research is whether this relative increase in homicide rates was caused by these laws. Several factors lead us to believe that laws are in fact responsible. First, the relative increase in homicide rates occurred in adopting states only after the laws were passed, not before. Moreover, there is no history of homicide rates in adopting states (like Florida) increasing relative to other states. In fact, the post-law increase in homicide rates in states like Florida was larger than any relative increase observed in the last 40 years. Put differently, there is no evidence that states like Florida just generally experience increases in homicide rates relative to other states, even when they don’t pass these laws.
We also find no evidence that the increase is due to other factors we observe, such as demographics, policing, economic conditions, and welfare spending. Our results remain the same when we control for these factors. Along similar lines, if some other factor were driving the increase in homicides, we’d expect to see similar increases in other crimes like larceny, motor vehicle theft and burglary. We do not. We find that the magnitude of the increase in homicide rates is sufficiently large that it is unlikely to be explained by chance.
In fact, there is substantial empirical evidence that these laws led to more deadly confrontations. Making it easier to kill people does result in more people getting killed.
If you take a look at page 33 of the paper, you’ll see some graphs of the data. Here’s a rather bad picture of them but it might give you the idea:
That red line is the same in each plot and refers to the log homicide rate in states without the Stand Your Ground law. The blue lines are showing how the log homicide rates looked for states that enacted such a law in a given year. So there’s a graph for each year.
In 2009 there’s only one “treatment” state, namely Montana, which has a population of 1 million, less than one third of one percent of the country. For that reason you see much less stable data. The authors did different analyses, sometimes weighted by population, which is good.
I have to admit, looking at these plots, the main thing I see in the data is that, besides Montana, we’re talking about states that have a higher homicide rate than usual, which could potentially indicate a confounding condition, and to address that (and other concerns) they conducted “falsification tests,” which is to say they studied whether crimes unrelated to Stand Your Ground type laws – larceny and motor vehicle theft – went up at the same time. They found that the answer is no.
The next point is that, although there seem to be bumps for 2005, 2006, and 2008 for the two years after the enactment of the law, there doesn’t for 2007 and 2009. And then even those states go down eventually, but the point is they don’t go down as much as the rest of the states without the laws.
It’s hard to do this analysis perfectly, with so few years of data. The problem is that, as soon as you suspect there’s a real effect, you’d want to act on it, since it directly translates into human deaths. So your natural reaction as a researcher is to “collect more data” but your natural reaction as a citizen is to abandon these laws as ineffective and harmful.
When I emailed my mom last month to tell her the awesome news about the book I’m writing she emailed me back the following:
i.e, A modern-day How to Lie with Statistics (1954), avail on Amazon
for $9.10. Love, Mom
That was her whole email. She’s never been very verbose, in person or electronically. Too busy hacking.
Even so, she gave me enough to go on, and I bought the book and recently read it. It was awesome and I recommend it to anyone who hasn’t read it – or read it recently. It’s a quick read and available as a free pdf download here.
The goal of the book is to demonstrate all the ways marketers, journalists, accountants, and sometimes even statisticians can bias your interpretation of statistical facts or even just confuse you into thinking something is true when it’s not. It’s illustrated as well, which is fun and often funny.
The author does things like talk about how you can present graphs to be very misleading – my favorite, because it happens to be my pet peeve, is the “growth chart” where the y-axis goes from 1400 to 1402 so things look like they’ve grown a huge amount because “0″ isn’t represented anywhere. Or of course the chart that has no numbers at all so you don’t know what you’re looking at.
There are a few things that don’t translate: so for example, he has a big thing about how people say “average” but they don’t specify whether they mean “arithmetic mean” or “median.” Nowadays this is taken to mean the former (am I wrong?).
And also, it’s fascinating to see how culture has changed – many of his examples that involve race would be very different nowadays, and issues around women, and the idea that you could run a randomized experiment to give half the people polio vaccines and withhold them from the other half, when polio is a real threat that leaves children paralyzed, is really strange.
Also, many of the examples – there are hundreds – refer to the Great Depression and the recovery since then, and the assumptions are bizarrely different in 1954 than you see in 2014 (and I’d guess different than how it will be in 2024 but I hope I’m wrong). Specifically, it seems that many of the lies that people are propagating with statistics are to downplay their profits so as to not seem excessive. Can you imagine?!
One of the reasons I read this book, of course, was to see if my book really is a modern version of that one. And I have to say that many of the issues do not translate, but some of them do, in interesting ways.
Even the reason that many of them don’t is kind of interesting: in the age of big data, we often don’t even see charts of data so how can we be misled by them? In other words, the presumption is that the data is so big as to be inaccessible. Google doesn’t bother showing us the numbers. Plus they don’t have to since we use their services anyway.
The most transferrable tips on how to lie with statistics probably stem from discussions on the following topics:
- Selection bias (things like, of the people who responded to our poll, they are all happy with our service)
- Survivorship bias (things like, companies that have been in the S&P for 30 years have great stock performance)
- Confusing people about topic A by discussing a related but not directly relevant topic B. This is described in the book as a “semi-attached figure”
The last one is the most relevant, I believe. In the age of big data, and partly because the data is “too big” to take a real look at, we spend an amazing amount of time talking about how a model is measuring something we care about (teachers’ value, or how good a candidate is for a job) when in fact the model is doing something quite different (test scores, demographic data).
If we were aware of those discrepancies we’d have way more skepticism, but we’re intimidated by the size of the data and the complexity of the models.
A final point. For the most part that crucial big data issue of complexity isn’t addressed in the book. It kind of makes me pine for the olden days, except not really if I’m black, a woman, or at risk of being exposed to polio.
UPDATES: First, my bad for not understanding that, at the time, the polio vaccine wasn’t known to work, or even be harmful, so of course there were trials. I was speaking from the perspective of the present day when it seems obvious that it works. For that matter I’m not even sure it was the particular vaccine that ended up working that was being tested.
Second, I showed my mom this post and her response was perfect:
Glad you liked it! Love, Mom
My friend Jordan Ellenberg sent me an article yesterday entitled Coin-flip judgement of psychopathic prisoners’ risk.
It was written by Seena Fazel, a researcher at the department of psychiatry at Oxford, and it concerns his research into the currently used predictive risk models for violence, repeat offense, and the like, which are supposedly tailored to people who have mental disorders like psychopathy.
Turns out there are a lot of these models, and they’re in use today in a bunch of countries. I did not know that. And they’re not just being used as extra, “good to know” information, but rather as a tool to assess important decisions for the prisoner. From the article:
Many US states use such tools to assess sexual offending risk and to help decide whether to exercise their powers to detain sexual offenders indefinitely after a prison term ends.
In England and Wales, these tools are part of the admission criteria for centres that treat people with dangerous and severe personality disorders. Outside North America, Europe and Australasia, similar approaches are increasingly popular, particularly in clinical settings, and there has been a steady growth of research from middle-income countries, such as China, documenting their use.
Also turns out, according to a meta-analysis done by Fazel, that these models don’t work very well, especially for the highest risk most violent population. And what’s super troubling is, as Fazel says, “In practice, the high false-positive rate probably means that some offenders spend longer in prison and secure hospital than their true risk would suggest.”
Talk about creepy.
This seems to be yet another example of a mathematical obfuscation and intimidation that gives people a false sense of having a good tool at hand. From the article:
Of course, sensible clinicians and judges take into account factors other than the findings of these instruments, but their misuse does complicate the picture. Some have argued that the veneer of scientific respectability surrounding such methods may lead to over-reliance on their findings, and that their complexity is difficult for the courts. Beyond concerns about public protection, liberty and costs of extended detention, there are worries that associated training and administration may divert resources from treatment.
The solution? Get people to acknowledge that the tools suck, and have a more transparent method of evaluating them. In this case, according to Fazel, it’s the researchers who are over-estimating the power of their models. But especially where it involves incarceration and the law, we have to maintain an adherence to a behavior-based methodology. It doesn’t make sense to put people in jail an extra 10 years because a crappy model said so.
This is a case, in my opinion, for an open model with a closed black box data set. The data itself is extremely sensitive and protected, but the model itself should be scrutinized.
A few of you may have read this recent New York TImes op-ed (hat tip Suresh Naidu) by economist Raj Chetty entitled “Yes, Economics is a Science.” In it he defends the scienciness of economics by comparing it to the field of epidemiology. Let’s focus on these three sentences in his essay, which for me are his key points:
I’m troubled by the sense among skeptics that disagreements about the answers to certain questions suggest that economics is a confused discipline, a fake science whose findings cannot be a useful basis for making policy decisions.
That view is unfair and uninformed. It makes demands on economics that are not made of other empirical disciplines, like medicine, and it ignores an emerging body of work, building on the scientific approach of last week’s winners, that is transforming economics into a field firmly grounded in fact.
Chetty is conflating two issues in his first sentence. The first is whether economics can be approached as a science, and the second is whether, if you are an honest scientist, you push as hard as you can to implement your “results” as public policy. Because that second issue is politics, not science, and that’s where people like myself get really pissed at economists, when they treat their estimates as facts with no uncertainty.
In other words, I’d have no problem with economists if they behaved like the people in the following completely made-up story based on the infamous Reinhart-Rogoff paper with the infamous excel mistake.
Two guys tried to figure what public policy causes GDP growth by using historical data. They collected their data and did some analysis, and they later released both the spreadsheet and the data by posting them on their Harvard webpages. They also ran the numbers a few times with slightly different countries and slightly different weighting schemes and explained in their write-up that got different answers depending on the initial conditions, so therefore they couldn’t conclude much at all, because the error bars are just so big. Oh well.
You see how that works? It’s called science, and it’s not what economists are known to do. It’s what we all wish they’d do though. Instead we have economists who basically get paid to write papers pushing for certain policies.
Next, let’s talk about Chetty’s comparison of economics with medicine. It’s kind of amazing that he’d do this considering how discredited epidemiology is at this point, and how truly unscientific it’s been found to be, for essentially exactly the same reasons as above – initial conditions, even just changing which standard database you use for your tests, switch the sign of most of the results in medicine. I wrote this up here based on a lecture by David Madigan, but there’s also a chapter in my new book with Rachel Schutt based on this issue.
To briefly summarize, Madigan and his colleagues reproduce a bunch of epidemiological studies and come out with incredible depressing “sensitivity” results. Namely, that the majority of “statistically significant findings” change sign depending on seemingly trivial initial condition changes that the authors of the original studies often didn’t even explain.
So in other words, Chetty defends economics as “just as much science” as epidemiology, which I would claim is in the category “not at all a science.” In the end I guess I’d have to agree with him, but not in a good way.
Finally, let’s be clear: it’s a good thing that economists are striving to be scientists, when they are. And it’s of course a lot easier to do science in microeconomic settings where the data is plentiful than it is to answer big, macro-economic questions where we only have a few examples.
Even so, it’s still a good thing that economists are asking the hard questions, even when they can’t answer them, like what causes recessions and what determines growth. It’s just crucial to remember that actual scientists are skeptical, even of their own work, and don’t pretend to have error bars small enough to make high-impact policy decisions based on their fragile results.
There’s an interesting debate described in this essay, Wrong Answer: the case against Algebra II, by Nicholson Baker (hat tip Nicholas Evangelos) around the requirement of algebra II to go to college. I’ll do my best to summarize the positions briefly. I’m making some of the pro-side up since it wasn’t well-articulated in the article.
On the pro-algebra side, we have the argument that learning algebra II promotes abstract thinking. It’s the first time you go from thinking about ratios of integers to ratios of polynomial functions, and where you consider the geometric properties of these generalized fractions. It is a convenient litmus test for even more abstraction: sure, it’s kind of abstract, but on the other hand you can also for the most part draw pictures of what’s going on, to keep things concrete. In that sense you might see it as a launching pad for the world of truly abstract geometric concepts.
Plus, doing well in algebra II is a signal for doing well in college and in later life. Plus, if we remove it as a requirement we might as well admit we’re dumbing down college: we’re giving the message that you can be a college graduate even if you can’t do math beyond adding fractions. And if that’s what college means, why have college? What happened to standards? And how is this preparing our young people to be competitive on a national or international scale?
On the anti-algebra side, we see a lot of empathy for struggling and suffering students. We see that raising so-called standards only gives them more suffering but no more understanding or clarity. And although we’re not sure if that’s because the subject is taught badly or because the subject is inherently unappealing or unattainable, it’s clear that wishful thinking won’t close this gap.
Plus, of course doing well in algebra II is a signal for doing well in college, it’s a freaking prerequisite for going to college. We might as well have embroidery as a prerequisite and then be impressed by all the beautiful piano stool covers that result. Finally, the standards aren’t going up just because we’re training a new generation in how to game a standardized test in an abstract rote-memorization skill of formulas and rules. It’s more like learning student’s capacity for drudgery.
OK, so now I’m going to make comments.
While it’s certainly true that, in the best of situations, the content of algebra II promotes abstract and logical thinking, it’s easy for me to believe, based on my very small experience in the matter that, it’s much more often taught poorly, and the students are expected to memorize formulas and rules. This makes it easier to test but doesn’t add to anyone’s love for math, including people who actually love math.
Speaking of my experience, it’s an important issue. Keep in mind that asking the population of mathematicians what they think of removing a high school class is asking for trouble. This is a group of people who pretty much across the board didn’t have any problems whatsoever with the class in question and sailed through it, possibly with a teacher dedicated to teaching honors students. They likely can’t remember much about their experience, and if they can it probably wasn’t bad.
Plus, removing a math requirement, any math requirement, will seem to a mathematician like an indictment of their field as not as important as it used to be to the world, which is always a bad thing. In other words, even if someone’s job isn’t directly on the line with this issue of algebra II, which it undoubtedly is for thousands of math teachers and college teachers, then even so it’s got a slippery slope feel, and pretty soon we’re going to have math departments shrinking over this.
In other words, it shouldn’t surprised anyone that we have defensive and unsympathetic mathematicians on one side who cannot understand the arguments of the empathizers on the other hand.
Of course, it’s always a difficult decision to remove a requirement. It’s much easier to make the case for a new one than to take one away, except of course for the students who have to work for the ensuing credentials.
And another thing, not so long ago we’d hear people say that women don’t need education at all, or that peasants don’t need to know how to read. Saying that a basic math course should become and elective kind of smells like that too if you want to get histrionic about things.
For myself, I’m willing to get rid of all of it, all the math classes ever taught, at least as a thought experiment, and then put shit back that we think actually adds value. So I still think we all need to know our multiplication tables and basic arithmetic, and even basic algebra so we can deal with an unknown or two. But from then on it’s all up in the air. Abstract reasoning is great, but it can be done in context just as well as in geometry class.
And, coming as I now do from data science, I don’t see why statistics is never taught in high school (at least in mine it wasn’t, please correct me if I’m wrong). It seems pretty clear we can chuck trigonometry out the window, and focus on getting the average high school student up to the point of scientific literacy that she can read a paper in a medical journal and understand what the experiment was and what the results mean. Or at the very least be able to read media reports of the studies and have some sense of statistical significance. That’d be a pretty cool goal, to get people to be able to read the newspaper.
So sure, get rid of algebra II, but don’t stop there. Think about what is actually useful and interesting and mathematical and see if we can’t improve things beyond just removing one crappy class.
Someone asked me a math question the other day and I had fun figuring it out. I thought it would be nice to write it down.
So here’s the problem. You are getting to see sample data and you have to infer the underlying distribution. In fact you happen to know you’re getting draws – which, because I’m a basically violent person, I like to think of as throws of a dart – from a uniform distribution from 0 to some unknown and you need to figure out what is. All you know is your data, so in particular you know how many dart throws you’ve gotten to see so far. Let’s say you’ve seen draws.
In other words, given what’s your best guess for ?
First, in order to simplify, note that all that really matters in terms of the estimate of is what is and how big is.
Next, note you might as well assume that and you just don’t know it yet.
With this set-up, you’ve rephrased the question like this: if you throw darts at the interval , then where do you expect the right-most dart – the maximum – to land?
It’s obvious from this phrasing that, as goes to infinity, you can expect a dart to get closer and closer to 1. Moreover, you can look at the simplest case, where and since the uniform distribution is symmetric, you can see the answer is 1/2. Then you might guess the overall answer, which depends on and goes to 1 as goes to infinity, might be . It makes intuitive sense, but how do you prove that?
Start with a small case where you know the answer. For we just need to know what the expected value of is, and since there’s one dart, the max is just itself, which is to say we need to compute a simple integral to find the expected value (note it’s coming in handy here that I’ve normalized the interval from 0 to 1 so I don’t have to divide by the width of the interval):
and we recover what we already know. In the next case, we need to integrate over two variables (same comment here, don’t have to divide by area of the 1×1 square base):
If you think about it, though, and play symmetric parts in this matter, so you can assume without loss of generality that is bigger, as long as we only let range between 0 and and then multiply the end result by 2:
But that simplifies to:
Let’s do the general case. It’s an n-fold integral over the maximum of all darts, and again without loss of generality is the maximum as long as we remember to multiply the whole thing by . We end up computing:
But this collapses to:
To finish the original question, take the maximum value in your collection of draws and multiply it by the plumping factor to get a best estimate of the parameter
My buddy Jordan Ellenberg just came out with a fantastic piece in Slate entitled “The Case of the Missing Zeroes: An astonishing act of statistical chutzpah in the Indiana schools’ grade-changing scandal.”
Here are the leading sentences of the piece:
Florida Education Commissioner Tony Bennett resigned Thursday amid claims that, in his former position as superintendent of public instruction in Indiana, he manipulated the state’s system for evaluating school performance. Bennett, a Republican who created an A-to-F grading protocol for Indiana schools as a way to promote educational accountability, is accused of raising the mark for a school operated by a major GOP donor.
Jordan goes on to explain exactly what happened and how that manipulation took place. Turns out it was a pretty outrageous and easy-to-understand lie about missing zeroes which didn’t make any sense. You should read the whole thing, Jordan is a great writer and his fantasy about how he would deal with a student trying the same scam in his calculus class is perfect.
A few comments to make about this story overall.
- First of all, it’s another case of a mathematical model being manipulated for political reasons. It just happens to be a really simple mathematical model in this case, namely a weighted average of scores.
- In other words, the lesson learned for corrupt politicians in the future may well to be sure the formulae are more complicated and thus easier to game.
- Or in other words, let’s think about other examples of this kind of manipulation, where people in power manipulate scores after the fact for their buddies. Where might it be happening now? Look no further than the Value-Added Model for teachers and schools, which literally nobody understands or could prove is being manipulated in any given instance.
- Taking a step further back, let’s remind ourselves that educational accountability models in general are extremely ripe for gaming and manipulation due to their high stakes nature. And the question of who gets the best opportunity to manipulate their scores is, as shown in this example of the GOP-donor-connected school, often a question of who has the best connections.
- In other words, I wonder how much the system can be trusted to give us a good signal on how well schools actually teach (at least how well they teach to the test).
- And if we want that signal to be clear, maybe we should take away the high stakes and literally measure it, with no consequences. Then, instead of punishing schools with bad scores, we could see how they need help.
- The conversation doesn’t profit from our continued crazy high expectations and fundamental belief in the existence of a silver bullet, the latest one being the Kipp Charter Schools – read this reality check if you’re wondering what I’m talking about (hat tip Jordan Ellenberg).
- As any statistician could tell you, any time you have an “educational experiment” involving highly motivated students, parents, and teachers, it will seem like a success. That’s called selection bias. The proof of the pudding lies in the scaling up of the method.
- We need to think longer term and consider how we’re treating good teachers and school administration who have to live under arbitrary and unfair systems. They might just leave.
This is a guest post from Jordan Ellenberg, a professor of mathematics at the University of Wisconsin. Jordan’s book, How Not To Be Wrong, comes out in May 2014. It is crossposted from his blog, Quomodocumque, and tweeted about at @JSEllenberg.
Cathy posted some cool data yesterday coming from the new visualization features of the magnificent Stacks Project. Summary: you can make a directed graph whose vertices are the 10,445 tagged assertions in the Stacks Project, and whose edges are logical dependency. So this graph (hopefully!) doesn’t have any directed cycles. (Actually, Cathy tells me that the Stacks Project autovomits out any contribution that would create a logical cycle! I wish LaTeX could do that.)
Given any assertion v, you can construct the subgraph G_v of vertices which are the terminus of a directed path starting at v. And Cathy finds that if you plot the number of vertices and number of edges of each of these graphs, you get something that looks really, really close to a line.
Why is this so? Does it suggest some underlying structure? I tend to say no, or at least not much — my guess is that in some sense it is “expected” for graphs like this to have this sort of property.
Because I am trying to get strong at sage I coded some of this up this morning. One way to make a random directed graph with no cycles is as follows: start with N edges, and a function f on natural numbers k that decays with k, and then connect vertex N to vertex N-k (if there is such a vertex) with probability f(k). The decaying function f is supposed to mimic the fact that an assertion is presumably more likely to refer to something just before it than something “far away” (though of course the stack project is not a strictly linear thing like a book.)
Here’s how Cathy’s plot looks for a graph generated by N= 1000 and f(k) = (2/3)^k, which makes the mean out-degree 2 as suggested in Cathy’s post.
Pretty linear — though if you look closely you can see that there are really (at least) a couple of close-to-linear “strands” superimposed! At first I thought this was because I forgot to clear the plot before running the program, but no, this is the kind of thing that happens.
Is this because the distribution decays so fast, so that there are very few long-range edges? Here’s how the plot looks with f(k) = 1/k^2, a nice fat tail yielding many more long edges:
My guess: a random graph aficionado could prove that the plot stays very close to a line with high probability under a broad range of random graph models. But I don’t really know!
Update: Although you know what must be happening here? It’s not hard to check that in the models I’ve presented here, there’s a huge amount of overlap between the descendant graphs; in fact, a vertex is very likely to be connected all but c of the vertices below it for a suitable constant c.
I would guess the Stacks Project graph doesn’t have this property (though it would be interesting to hear from Cathy to what extent this is the case) and that in her scatterplot we are not measuring the same graph again and again.
It might be fun to consider a model where vertices are pairs of natural numbers and (m,n) is connected to (m-k,n-l) with probability f(k,l) for some suitable decay. Under those circumstances, you’d have substantially less overlap between the descendant trees; do you still get the approximately linear relationship between edges and nodes?
I wrote a post three months ago talking about how we don’t need better models but we need to stop lying with our models. My first example was municipal debt and how various towns and cities are in deep debt partly because their accounting for future pension obligations allows them to be overly optimistic about their investments and underfund their pension pots.
This has never been more true than it is right now, and as this New York Times Dealbook article explains, was a major factor in Detroit’s bankruptcy filing this past week. But don’t make any mistake: even in places where they don’t end up declaring bankruptcy, something is going to shake out because of these broken models, and it isn’t going to be extra money for retired civil servants.
It all comes down to wanting to avoid putting required money away and hiring quants (in this case actuaries) to make that seem like it’s mathematically acceptable. It’s a form of mathematical control fraud. From the article:
When a lender calculates the value of a mortgage, or a trader sets the price of a bond, each looks at the payments scheduled in the future and translates them into today’s dollars, using a commonplace calculation called discounting. By extension, it might seem that an actuary calculating a city’s pension obligations would look at the scheduled future payments to retirees and discount them to today’s dollars.
But that is not what happens. To calculate a city’s pension liabilities, an actuary instead projects all the contributions the city will probably have to make to the pension fund over time. Many assumptions go into this projection, including an assumption that returns on the investments made by the pension fund will cover most of the plan’s costs. The greater the average annual investment returns, the less the city will presumably have to contribute. Pension plan trustees set the rate of return, usually between 7 percent and 8 percent.
In addition, actuaries “smooth” the numbers, to keep big swings in the financial markets from making the pension contributions gyrate year to year. These methods, actuarial watchdogs say, build a strong bias into the numbers. Not only can they make unsustainable pension plans look fine, they say, but they distort the all-important instructions actuaries give their clients every year on how much money to set aside to pay all benefits in the future.
One caveat: if the pensions have actually been making between 7 percent and 8 percent on their investments every year then all is perhaps well. But considering that they typically invest in bonds, not stocks – which is a good thing – we’re likely seeing much smaller returns than that, which means their yearly contributions to the local pension plans are in dire straits.
What’s super interesting about this article is that it goes into the action on the ground inside the Actuary community, since their reputations are at stake in this battle:
A few years ago, with the debate still raging and cities staggering through the recession, one top professional body, the Society of Actuaries, gathered expert opinion and realized that public pension plans had come to pose the single largest reputational risk to the profession. A Public Plans Reputational Risk Task Force was convened. It held some meetings, but last year, the matter was shifted to a new body, something called the Blue Ribbon Panel, which was composed not of actuaries but public policy figures from a number of disciplines. Panelists include Richard Ravitch, a former lieutenant governor of New York; Bradley Belt, a former executive director of the Pension Benefit Guaranty Corporation; and Robert North, the actuary who shepherds New York City’s five big public pension plans.
I’m not sure what happened here, but it seems like a bunch of people in a profession, the actuaries, got worried that they were being used by politicians, and decided to investigate, but then that initiative got somehow replaced by a bunch of politicians. I’d love to talk to someone on the inside about this.
This is a guest post by Eugene Stern.
Now that I have kids in school, I’ve become a lot more familiar with high-stakes testing, which is the practice of administering standardized tests with major consequences for students who take them (you have to pass to graduate), their teachers (who are often evaluated based on standarized test results), and their school districts (state funding depends on test results). To my great chagrin, New Jersey, where I live, is in the process of putting such a teacher evaluation system in place (for a lot more detail and criticism, see here).
The excellent John Ewing pointed me to a pretty comprehensive survey of standardized testing called “Measuring Up,” by Harvard Ed School prof Daniel Koretz, who teaches a course there about this stuff. If you have any interest in the subject, the book is very much worth your time. But in case you don’t get to it, or just to whet your appetite, here are my top 10 takeaways:
Believe it or not, most of the people who write standardized tests aren’t idiots. Building effective tests is a difficult measurement problem! Koretz makes an analogy to political polling, which is a good reminder that a test result is really a sample from a distribution (if you take multiple versions of a test designed to measure the same thing, you won’t do exactly the same each time), and not an absolute measure of what someone knows. It’s also a good reminder that the way questions are phrased can matter a great deal.
The reliability of a test is inversely related to the standard deviation of this distribution: a test is reliable if your score on it wouldn’t vary very much from one instance to the next. That’s a function of both the test itself and the circumstances under which people take it. More reliability is better, but the big trade-off is that increasing the sophistication of the test tends to decrease reliability. For example, tests with free form answers can test for a broader range of skills than multiple choice, but they introduce variability across graders, and even the same person may grade the same test differently before and after lunch. More sophisticated tasks also take longer to do (imagine a lab experiment as part of a test), which means fewer questions on the test and a smaller cross-section of topics being sampled, again meaning more noise and less reliability.
A complementary issue is bias, which is roughly about people doing better or worse on a test for systematic reasons outside the domain being tested. Again, there are trade-offs: the more sophisticated the test, the more extraneous skills beyond those being tested it may be bringing in. One common way to weed out such questions is to look at how people who score the same on the overall test do on each particular question: if you get variability you didn’t expect, that may be a sign of bias. It’s harder to do this for more sophisticated tests, where each question is a bigger chunk of the overall test. It’s also harder if the bias is systematic across the test.
Beyond the (theoretical) distribution from which a single student’s score is a sample, there’s also the (likely more familiar) distribution of scores across students. This depends both on the test and on the population taking it. For example, for many years, students on the eastern side of the US were more likely to take the SAT than those in the west, where only students applying to very selective eastern colleges took the test. Consequently, the score distributions were very different in the east and the west (and average scores tended to be higher in the west), but this didn’t mean that there was bias or that schools in the west were better.
The shape of the score distribution across students carries important information about the test. If a test is relatively easy for the students taking it, scores will be clustered to the right of the distribution, while if it’s hard, scores will be clustered to the left. This matters when you’re interpreting results: the first test is worse at discriminating among stronger students and better at discriminating among weaker ones, while the second is the reverse.
The score distribution across students is an important tool in communicating results (you may not know right away what a score of 600 on a particular test means, but if you hear it’s one standard deviation above a mean of 500, that’s a decent start). It’s also important for calibrating tests so that the results are comparable from year to year. In general, you want a test to have similar means and variances from one year to the next, but this raises the question of how to handle year-to-year improvement. This is particularly significant when educational goals are expressed in terms of raising standardized test scores.
If you think in terms of the statistics of test score distributions, you realize that many of those goals of raising scores quickly are deluded. Koretz has a good phrase for this: the myth of the vanishing variance. The key observation is that test score distributions are very wide, on all tests, everywhere, including countries that we think have much better education systems than we do. The goals we set for student score improvement (typically, a high fraction of all students taking a test several years from now are supposed to score above some threshold) imply a great deal of compression at the lower end of this distribution – compression that has never been seen in any country, anywhere. It sounds good to say that every kid who takes a certain test in four years will score as proficient, but that corresponds to a score distribution with much less variance than you’ll ever see. Maybe we should stop lying to ourselves?
Koretz is highly critical of the recent trend to report test results in terms of standards (e.g., how many students score as “proficient”) instead of comparisons (e.g., your score is in the top 20% of all students who took the test). Standards and standard-based reporting are popular because it’s believed that American students’ performance as a group is inadequate. The idea is that being near the top doesn’t mean much if the comparison group is weak, so instead we should focus on making sure every student meets an absolute standard needed for success in life. There are three (at least) problems with this. First, how do you set a standard – i.e., what does proficient mean, anyway? Koretz gives enough detail here to make it clear how arbitrary the standards are. Second, you lose information: in the US, standards are typically expressed in terms of just four bins (advanced, proficient, partially proficient, basic), and variation inside the bins is ignored. Third, even standards-based reporting tends to slide back into comparisons: since we don’t know exactly what proficient means, we’re happiest when our school, or district, or state places ahead of others in the fraction of students classified as proficient.
Koretz’s other big theme is score inflation for high-stakes tests: if everyone is evaluated based on test scores, everyone has an incentive to get those scores up, whether or not that actually has much correlation with learning. If you remember anything from the book or from this post, remember this phrase: sawtooth pattern. The idea is that when a new high-stakes standardized test appears, average scores start at some base level, go up quickly as people figure out how to game the test, then plateau. If the test is replaced with another, the same thing happens: base, rapid growth, plateau. Repeat ad infinitum. Koretz and his collaborators did a nice experiment in which they went back to a school district in which one high-stakes test had been replaced with another and administered the first test several years later. Now that teachers weren’t teaching to the first test, scores on it reverted back to the original base level. Moral: score inflation is real, pervasive, and unavoidable, unless we bite the bullet and do away with high-stakes tests.
While Koretz is sympathetic toward test designers, who live the complexity of standardized testing every day, he is harsh on those who (a) interpret and report on test results and (b) set testing and education policy, without taking that complexity into account. Which, as he makes clear, is pretty much everyone who reports on results and sets policy.
If you think it’s a good idea to make high-stakes decisions about schools and teachers based on standardized test results, Koretz’s book offers several clear warnings.
First, we should expect any high-stakes test to be gamed. Worse yet, the more reliable tests, being more predictable, are probably easier to game (look at the SAT prep industry).
Second, the more (statistically) reliable tests, by their controlled nature, cover only a limited sample of the domain we want students to learn. Tests trying to cover more ground in more depth (“tests worth teaching to,” in the parlance of the last decade) will necessarily have noisier results. This noise is a huge deal when you realize that high-stakes decisions about teachers are made based on just two or three years of test scores.
Third, a test that aims to distinguish “proficiency” will do a worse job of distinguishing students elsewhere in the skills range, and may be largely irrelevant for teachers whose students are far away from the proficiency cut-off. (For a truly distressing example of this, see here.)
With so many obstacles to rating schools and teachers reliably based on standardized test scores, is it any surprise that we see results like this?
I’m psyched to see Suresh Naidu tonight in the first Data Skeptics Meetup. He’s talking about Political Uses and Abuses of Data and his abstract is this:
While a lot has been made of the use of technology for election campaigns, little discussion has focused on other political uses of data. From targeting dissidents and tax-evaders to organizing protests, the same datasets and analytics that let data scientists do prediction of consumer and voter behavior can also be used to forecast political opponents, mobilize likely leaders, solve collective problems and generally push people around. In this discussion, Suresh will put this in a 1000 year government data-collection perspective, and talk about how data science might be getting used in authoritarian countries, both by regimes and their opponents.
Given the recent articles highlighting this kind of stuff, I’m sure the topic will provoke a lively discussion – my favorite kind!
Unfortunately the Meetup is full but I’d love you guys to give suggestions for more speakers and/or more topics.
At first glance, data miners inside governments, start-ups, corporations, and political campaigns are all doing basically the same thing. They’ll all need great engineering infrastructure, good clean data, a working knowledge of statistical techniques and enough domain knowledge to get things done.
We’ve seen recent articles that are evidence for this statement: Facebook data people move to the NSA or other government agencies easily, and Obama’s political campaign data miners have launched a new data mining start-up. I am a data miner myself, and I could honestly work at any of those places – my skills would translate, if not my personality.
I do think there are differences, though, and here I’m not talking about ethics or trust issues, I’m talking about pure politics.
Namely, the world of data mining is divided into two broad categories: people who want to cause things to happen and people who want to prevent things from happening.
I know that sounds incredibly vague, so let me give some examples.
In start-ups, irrespective of what you’re actually doing (what you’re actually doing is probably incredibly banal, like getting people to click on ads), you feel like you’re the first person ever to do it, at least on this scale, or at least with this dataset, and that makes it technically challenging and exciting.
Or, even if you’re not the first, at least what you’re creating or building is state-of-the-art and is going to be used to “disrupt” or destroy lagging competition. You feel like a motherfucker, and it feels great!
The same thing can be said for Obama’s political data miners: if you read this article, you’ll know they felt like they’d invented a new field of data mining, and a cult along with it, and it felt great! And although it’s probably not true that they did something all that impressive technically, in any case they did a great job of applying known techniques to a different data set, and they got lots of people to allow access to their private information based on their trust of Obama, and they mined the fuck out of it to persuade people to go out and vote and to go out and vote for Obama.
Now let’s talk about corporations. I’ve worked in enough companies to know that “covering your ass” is a real thing, and can overwhelm a given company’s other goals. And the larger the company, the more the fear sets in and the more time is spent covering one’s ass and less time is spent inventing and staying state-of-the-art. If you’ve ever worked in a place where it takes months just to integrate two different versions of SalesForce you know what I mean.
Those corporate people have data miners too, and in the best case they are somewhat protected from the conservative, risk averse, cover-your-ass atmosphere, but mostly they’re not. So if you work for a pharmaceutical company, you might spend your time figuring out how to draw up the numbers to make them look good for the CEO so he doesn’t get axed.
In other words, you spend your time preventing something from happening rather than causing something to happen.
Finally, let’s talk about government data miners. If there’s one thing I learned when I went to the State Department Tech@State “Moneyball Diplomacy” conference a few weeks back, it’s that they are the most conservative of all. They spend their time worrying about a terrorist attack and how to prevent it. It’s all about preventing bad things from happening, and that makes for an atmosphere where causing good things to happen takes a rear seat.
I’m not saying anything really new here; I think this stuff is pretty uncontroversial. Maybe people would quibble over when a start-up becomes a corporation (my answer: mostly they never do, but certainly by the time of an IPO they’ve already done it). Also, of course, there are ass-coverers in start-ups and there are risk-takers in corporation and maybe even in government, but they don’t dominate.
If you think through things in this light, it makes sense that Obama’s data miners didn’t want to stay in government and decided to go work on advertising stuff. And although they might have enough clout and buzz to get hired by a big corporation, I think they’ll find it pretty frustrating to be dealing with the cover-my-ass types that will hire them. It also makes sense that Facebook, which spends its time making sure no other social network grows enough to compete with it, works so well with the NSA.
1. If you want to talk ethics, though, join me on Monday at Suresh Naidu’s Data Skeptics Meetup where he’ll be talking about Political Uses and Abuses of Data.
I’m happy to say that the book I’m writing with Rachel Schutt called Doing Data Science is officially out for early review. That means a few chapters which we’ve deemed “ready” have been sent to some prominent people in the field to see what they think. Thanks, prominent and busy people!
It also means that things are (knock on wood) wrapping up on the editing side. I’m cautiously optimistic that this book will be a valuable resource for people interested in what data scientists do, especially people interested in switching fields. The range of topics is broad, which I guess means that the most obvious complaint about the book will be that we didn’t cover things deeply enough, and perhaps that the level of pre-requisite assumptions is uneven. It’s hard to avoid.
Thanks to my awesome editor Courtney Nash over at O’Reilly for all her help!
And by the way, we have an armadillo on our cover, which is just plain cool:
An article in yesterday’s Science Times explained that limiting the salt in your diet doesn’t actually improve health, and could in fact be bad for you. That’s a huge turn-around for a public health rule that has run very deep.
How can this kind of thing happen?
Well, first of all epidemiologists use crazy models to make predictions on things, and in this case what happened was they saw a correlation between high blood pressure and high salt intake, and they saw a separate correlation between high blood pressure and death, and so they linked the two.
Trouble is, while very low salt intake might lower blood pressure a little bit, it also for what ever reason makes people die a wee bit more often.
As this Scientific American article explains, that “little bit” is actually really small:
Over the long-term, low-salt diets, compared to normal diets, decreased systolic blood pressure (the top number in the blood pressure ratio) in healthy people by 1.1 millimeters of mercury (mmHg) and diastolic blood pressure (the bottom number) by 0.6 mmHg. That is like going from 120/80 to 119/79. The review concluded that “intensive interventions, unsuited to primary care or population prevention programs, provide only minimal reductions in blood pressure during long-term trials.” A 2003 Cochrane review of 57 shorter-term trials similarly concluded that “there is little evidence for long-term benefit from reducing salt intake.”
Moreover, some people react to changing their salt intake with higher, and some with lower blood pressure. Turns out it’s complicated.
I’m a skeptic, especially when it comes to epidemiology. None of this surprises me, and I don’t think it’s the last bombshell we’ll be hearing. But this meta-analysis also might have flaws, so hold your breath for the next pronouncement.
One last thing – they keep saying that it’s too expensive to do this kind of study right, but I’m thinking that by now they might realize the real cost of not doing it right is a loss of the public’s trust in medical research.
I recently read an article off the newsstand called The Rise of Big Data.
It was written by Kenneth Neil Cukier and Viktor Mayer-Schoenberger and it was published in the May/June 2013 edition of Foreign Affairs, which is published by the Council on Foreign Relations (CFR). I mention this because CFR is an influential think tank, filled with powerful insiders, including people like Robert Rubin himself, and for that reason I want to take this view on big data very seriously: it might reflect the policy view before long.
And if I think about it, compared to the uber naive view I came across last week when I went to the congressional hearing about big data and analytics, that would be good news. I’ll write more about it soon, but let’s just say it wasn’t everything I was hoping for.
At least Cukier and Mayer-Schoenberger discuss their reservations regarding “big data” in this article. To contrast this with last week, it seemed like the only background material for the hearing, at least for the congressmen, was the McKinsey report talking about how sexy data science is and how we’ll need to train an army of them to stay competitive.
So I’m glad it’s not all rainbows and sunshine when it comes to big data in this article. Unfortunately, whether because they’re tied to successful business interests, or because they just haven’t thought too deeply about the dark side, their concerns seem almost token, and their examples bizarre.
The article is unfortunately behind the pay wall, but I’ll do my best to explain what they’ve said.
First they discuss the concept of datafication, and their example is how we quantify friendships with “likes”: it’s the way everything we do, online or otherwise, ends up recorded for later examination in someone’s data storage units. Or maybe multiple storage units, and maybe for sale.
They formally define later in the article as a process:
… taking all aspect of life and turning them into data. Google’s augmented-reality glasses datafy the gaze. Twitter datafies stray thoughts. LinkedIn datafies professional networks.
Datafication is an interesting concept, although as far as I can tell they did not coin the word, and it has led me to consider its importance with respect to intentionality of the individual.
Here’s what I mean. We are being datafied, or rather our actions are, and when we “like” someone or something online, we are intending to be datafied, or at least we should expect to be. But when we merely browse the web, we are unintentionally, or at least passively, being datafied through cookies that we might or might not be aware of. And when we walk around in a store, or even on the street, we are being datafied in an completely unintentional way, via sensors or Google glasses.
This spectrum of intentionality ranges from us gleefully taking part in a social media experiment we are proud of to all-out surveillance and stalking. But it’s all datafication. Our intentions may run the gambit but the results don’t.
They follow up their definition in the article, once they get to it, with a line that speaks volumes about their perspective:
Once we datafy things, we can transform their purpose and turn the information into new forms of value
But who is “we” when they write it? What kinds of value do they refer to? As you will see from the examples below, mostly that translates into increased efficiency through automation.
So if at first you assumed they mean we, the American people, you might be forgiven for re-thinking the “we” in that sentence to be the owners of the companies which become more efficient once big data has been introduced, especially if you’ve recently read this article from Jacobin by Gavin Mueller, entitled “The Rise of the Machines” and subtitled “Automation isn’t freeing us from work — it’s keeping us under capitalist control.” From the article (which you should read in its entirety):
In the short term, the new machines benefit capitalists, who can lay off their expensive, unnecessary workers to fend for themselves in the labor market. But, in the longer view, automation also raises the specter of a world without work, or one with a lot less of it, where there isn’t much for human workers to do. If we didn’t have capitalists sucking up surplus value as profit, we could use that surplus on social welfare to meet people’s needs.
The big data revolution and the assumption that N=ALL
According to Cukier and Mayer-Schoenberger, the Big Data revolution consists of three things:
- Collecting and using a lot of data rather than small samples.
- Accepting messiness in your data.
- Giving up on knowing the causes.
They describe these steps in rather grand fashion, by claiming that big data doesn’t need to understand cause because the data is so enormous. It doesn’t need to worry about sampling error because it is literally keeping track of the truth. The way the article frames this is by claiming that the new approach of big data is letting “N = ALL”.
But here’s the thing, it’s never all. And we are almost always missing the very things we should care about most.
So for example, as this InfoWorld post explains, internet surveillance will never really work, because the very clever and tech-savvy criminals that we most want to catch are the very ones we will never be able to catch, since they’re always a step ahead.
Even the example from their own article, election night polls, is itself a great non-example: even if we poll absolutely everyone who leaves the polling stations, we still don’t count people who decided not to vote in the first place. And those might be the very people we’d need to talk to to understand our country’s problems.
Indeed, I’d argue that the assumption we make that N=ALL is one of the biggest problems we face in the age of Big Data. It is, above all, a way of excluding the voices of people who don’t have the time or don’t have the energy or don’t have the access to cast their vote in all sorts of informal, possibly unannounced, elections.
Those people, busy working two jobs and spending time waiting for buses, become invisible when we tally up the votes without them. To you this might just mean that the recommendations you receive on Netflix don’t seem very good because most of the people who bother to rate things are Netflix are young and have different tastes than you, which skews the recommendation engine towards them. But there are plenty of much more insidious consequences stemming from this basic idea.
Another way in which the assumption that N=ALL can matter is that it often gets translated into the idea that data is objective. Indeed the article warns us against not assuming that:
… we need to be particularly on guard to prevent our cognitive biases from deluding us; sometimes, we just need to let the data speak.
And later in the article,
In a world where data shape decisions more and more, what purpose will remain for people, or for intuition, or for going against the facts?
This is a bitch of a problem for people like me who work with models, know exactly how they work, and know exactly how wrong it is to believe that “data speaks”.
I wrote about this misunderstanding here, in the context of Bill Gates, but I was recently reminded of it in a terrifying way by this New York Times article on big data and recruiter hiring practices. From the article:
“Let’s put everything in and let the data speak for itself,” Dr. Ming said of the algorithms she is now building for Gild.
If you read the whole article, you’ll learn that this algorithm tries to find “diamond in the rough” types to hire. A worthy effort, but one that you have to think through.
Why? If you, say, decided to compare women and men with the exact same qualifications that have been hired in the past, but then, looking into what happened next you learn that those women have tended to leave more often, get promoted less often, and give more negative feedback on their environments, compared to the men, your model might be tempted to hire the man over the woman next time the two showed up, rather than looking into the possibility that the company doesn’t treat female employees well.
In other words, ignoring causation can be a flaw, rather than a feature. Models that ignore causation can add to historical problems instead of addressing them. And data doesn’t speak for itself, data is just a quantitative, pale echo of the events of our society.
Some cherry-picked examples
One of the most puzzling things about the Cukier and Mayer-Schoenberger article is how they chose their “big data” examples.
One of them, the ability for big data to spot infection in premature babies, I recognized from the congressional hearing last week. Who doesn’t want to save premature babies? Heartwarming! Big data is da bomb!
But if you’re going to talk about medicalized big data, let’s go there for reals. Specifically, take a look at this New York Times article from last week where a woman traces the big data footprints, such as they are, back in time after receiving a pamphlet on living with Multiple Sclerosis. From the article:
Now she wondered whether one of those companies had erroneously profiled her as an M.S. patient and shared that profile with drug-company marketers. She worried about the potential ramifications: Could she, for instance, someday be denied life insurance on the basis of that profile? She wanted to track down the source of the data, correct her profile and, if possible, prevent further dissemination of the information. But she didn’t know which company had collected and shared the data in the first place, so she didn’t know how to have her entry removed from the original marketing list.
Two things about this. First, it happens all the time, to everyone, but especially to people who don’t know better than to search online for diseases they actually have. Second, the article seems particularly spooked by the idea that a woman who does not have a disease might be targeted as being sick and have crazy consequences down the road. But what about a woman is actually is sick? Does that person somehow deserve to have their life insurance denied?
The real worries about the intersection of big data and medical records, at least the ones I have, are completely missing from the article. Although they did mention that “improving and lowering the cost of health care for the world’s poor” inevitable will lead to “necessary to automate some tasks that currently require human judgment.” Increased efficiency once again.
To be fair, they also talked about how Google tried to predict the flu in February 2009 but got it wrong. I’m not sure what they were trying to say except that it’s cool what we can try to do with big data.
Also, they discussed a Tokyo research team that collects data on 360 pressure points with sensors in a car seat, “each on a scale of 0 to 256.” I think that last part about the scale was added just so they’d have more numbers in the sentence – so mathematical!
And what do we get in exchange for all these sensor readings? The ability to distinguish drivers, so I guess you’ll never have to share your car, and the ability to sense if a driver slumps, to either “send an alert or atomatically apply brakes.” I’d call that a questionable return for my investment of total body surveillance.
Big data, business, and the government
Make no mistake: this article is about how to use big data for your business. It goes ahead and suggests that whoever has the biggest big data has the biggest edge in business.
Of course, if you’re interested in treating your government office like a business, that’s gonna give you an edge too. The example of Bloomberg’s big data initiative led to efficiency gain (read: we can do more with less, i.e. we can start firing government workers, or at least never hire more).
As for regulation, it is pseudo-dealt with via the discussion of market dominance. We are meant to understand that the only role government can or should have with respect to data is how to make sure the market is working efficiently. The darkest projected future is that of market domination by Google or Facebook:
But how should governments apply antitrust rules to big data, a market that is hard to define and is constantly changing form?
In particular, no discussion of how we might want to protect privacy.
Big data, big brother
I want to be fair to Cukier and Mayer-Schoenberger, because they do at least bring up the idea of big data as big brother. Their topic is serious. But their examples, once again, are incredibly weak.
Should we find likely-to-drop-out boys or likely-to-get-pregnant girls using big data? Should we intervene? Note the intention of this model would be the welfare of poor children. But how many models currently in production are targeting that demographic with that goal? Is this in any way at all a reasonable example?
Here’s another weird one: they talked about the bad metric used by US Secretary of Defense Robert McNamara in the Viet Nam War, namely the number of casualties. By defining this with the current language of statistics, though, it gives us the impression that we could just be super careful about our metrics in the future and: problem solved. As we experts in data know, however, it’s a political decision, not a statistical one, to choose a metric of success. And it’s the guy in charge who makes that decision, not some quant.
If you end up reading the Cukier and Mayer-Schoenberger article, please also read Julie Cohen’s draft of a soon-to-be published Harvard Law Review article called “What Privacy is For” where she takes on big data in a much more convincing and skeptical light than Cukier and Mayer-Schoenberger were capable of summoning up for their big data business audience.
I’m actually planning a post soon on Cohen’s article, which contains many nuggets of thoughtfulness, but for now I’ll simply juxtapose two ideas surrounding big data and innovation, giving Cohen the last word. First from the Cukier and Mayer-Schoenberger article:
Big data enables us to experiment faster and explore more leads. These advantages should produce more innovation
Second from Cohen, where she uses the term “modulation” to describe, more or less, the effect of datafication on society:
When the predicate conditions for innovation are described in this way, the problem with characterizing privacy as anti-innovation becomes clear: it is modulation, not privacy, that poses the greater threat to innovative practice. Regimes of pervasively distributed surveillance and modulation seek to mold individual preferences and behavior in ways that reduce the serendipity and the freedom to tinker on which innovation thrives. The suggestion that innovative activity will persist unchilled under conditions of pervasively distributed surveillance is simply silly; it derives rhetorical force from the cultural construct of the liberal subject, who can separate the act of creation from the fact of surveillance. As we have seen, though, that is an unsustainable fiction. The real, socially-constructed subject responds to surveillance quite differently—which is, of course, exactly why government and commercial entities engage in it. Clearing the way for innovation requires clearing the way for innovative practice by real people, by preserving spaces within which critical self-determination and self-differentiation can occur and by opening physical spaces within which the everyday practice of tinkering can thrive.
I wanted to give this advice today just in case it’s useful to someone. It’s basically the way I went about reinventing myself from being a quant in finance to being a data scientist in the tech scene.
In other words, many of the same skills but not all, and many of the same job description elements but not all.
The truth is, I didn’t even know the term “data scientist” when I started my job hunt, so for that reason I think it’s possibly good and useful advice: if you follow it, you may end up getting a great job you don’t even know exists right now.
Also, I used this advice yesterday on my friend who is trying to reinvent himself, and he seemed to find it useful, although time will tell how much – let’s see if he gets a new job soon!
- Write a list of things you like about jobs: learning technical stuff, managing people, whatever floats your boat.
- Next, write a list of things you don’t like: being secretive, no vacation, office politics, whatever. Some people hate working with “dumb people” but some people can’t stand “arrogant people”. It makes a huge difference actually.
- Next, write a list of skills you have: python, basic statistics, math, managing teams, smelling a bad deal, stuff like that. This is probably the most important list, so spend some serious time on it.
- Finally, write a list of skills you don’t have that you wish you did: hadoop, knowing when to stop talking, stuff like that.
Once you have your lists, start going through LinkedIn by cross-searching for your preferred city and a keyword from one of your lists (probably the “skills you have” list).
Every time you find a job that you think you’d like to have, take note of what skills it lists that you don’t have, the name of the company, and your guess on a scale of 1-10 of how much you’d like the job into a spreadsheet or at least a file. This last part is where you use the “stuff I like” and “stuff I don’t like” lists.
And when you’ve done this for a long time, like you made it your job for a few hours a day for at least a few weeks, then do some wordcounts on this file, preferably using a command line script to add to the nerdiness, to see which skills you’d need to get which jobs you’d really like.
Note LinkedIn is not an oracle: it doesn’t have every job in the world (although it might have most jobs you could ever get), and the descriptions aren’t always accurate.
For example, I think companies often need managers of software engineers, but they never advertise for managers of software engineers. They advertise for software engineers, and then let them manage if they have the ability to, and sometimes even if they don’t. But even in that case I think it makes sense: engineers don’t want to be managed by someone they think isn’t technical, and the best way to get someone who is definitely technical is just to get another engineer.
In other words, sometimes the “job requirements” data on LInkedIn dirty, but it’s still useful. And thank god for LinkedIn.
Next, make sure your LinkedIn profile is up-to-date and accurate, and that your ex-coworkers have written letters for you and endorsed you for your skills.
Finally, buy a book or two to learn the new skills you’ve decided to acquire based on your research. I remember bringing a book on Bayesian statistics to my interview for a data scientist. I wasn’t all the way through the book, and my boss didn’t even know enough to interview me on that subject, but it didn’t hurt him to see that I was independently learning stuff because I thought it would be useful, and it didn’t hurt to be on top of that stuff when I started my new job.
What I like about this is that it looks for jobs based on what you want rather than what you already know you can do. It’s in some sense the dual method to what people usually do.