A tiny article was recently published in The Cap Times (hat tip Jordan Ellenberg) describing a big data model that claims to help filter and rank school teachers based on their ability to raise student test scores. I guess it’s a kind of pre-VAM filtering system, and if you thought it was hard to imagine a more vile model than the VAM, here you go. The article mentioned that the Madison School Board was deliberating on whether to spend $273K on this model.
One of the teachers in the district wrote her concerns about this model in her blog and then there was a debate at the school board meeting, and a journalist covered the meeting, so we know about it. But it was a close call, and this one could have easily slipped under the radar, or at least my radar.
Even so, now I know about it, and once I looked at the website of the company promoting this model, I found links to an article where they name a customer, for example the Charlotte-Mecklenburg School District of North Carolina. They claim they only filter applications using their tool; they don’t make hiring decisions. Cold comfort for the people whose applications were screened out by some random black-box algorithm.
I wonder how many of the teachers applying to that district knew their application was being filtered through such a model? I’m going to guess none. For that matter, there are all sorts of application screening algorithms being regularly used of which applicants are generally unaware.
It’s just one example of the dark matter of big data. And by that I mean the enormous and growing clusters of big data models that are only inadvertently detectable by random small-town or small-city budget meeting journalism, or word-of-mouth reports coming out of conferences or late-night drinking parties with VC’s.
The vast majority of big data dark matter is still there in the shadows. You can only guess at its existence and its usage. Since the models themselves are proprietary, and are generally deployed secretly, there’s no reason for the public to be informed.
Let me give you another example, this time speculative, but not at all unlikely.
Namely, big data health models arising from the quantified self movement data. This recent Wall Street Journal article entitled Can Data From Your Fitbit Transform Medicine? articulated the issue nicely:
Consumer wearables fall into a regulatory gray area. Health-privacy laws that prevent the commercial use of patient data without consent don’t apply to the makers of consumer devices. “There are no specific rules about how those vendors can use and share data,” said Deven McGraw, a partner in the health-care practice at Manatt, Phelps, and Phillips LLP.
The key is that phrase “regulatory gray area”; it should make you think “big data dark matter lives here”.
When you have unprotected data that can be used as a proxy for HIPAA-protected medical data, there’s no reason it won’t be. So anyone who stands to benefit from knowing health-related information about you – think future employers who might help pay for future insurance claims – will be interested in using big data dark matter models gleaned from this kind of unregulated data.
To be sure, most people who wear Fitbits nowadays are athletic types trying to improve their 5K times. But the article explained that the medical profession is on the verge of suggesting a much larger population of patients use such devices. So it could get ugly real fast.
Secret big data models aren’t new, of course. I remember a friend of mine working for a credit card company a few decades ago. Her job was to model which customers to offer subprime credit cards to, and she was specifically told to target those customers who would end up paying the most in fees. But it’s become much much easier to do this kind of thing with the proliferation of so much personal data, including social media data.
I’m interested in the dark matter, partly as research for my book, and I’d appreciate help from my readers in trying to spot it when it pops up. For example, I remember being told that a certain kind of online credit score is used to keep people on hold for customer service longer, but now I can’t find a reference to it anywhere. We should really compile a list at the boundaries of this dark matter. Please help! And if you don’t feel comfortable commenting, my email address is on the About page.
One of the reasons I enjoy my blog is that I get to try out an argument and then see whether readers 1) poke holes in my argument, 2) misunderstand my argument, or 3) misunderstand something tangential to my argument.
Today I’m going to write about an issue of the third kind. Yesterday I talked about how I’d like to see the VAM scores for teachers directly compared to other qualitative scores or other VAM scores so we could see how reliably they regenerate various definitions of “good teaching.”
The idea is this. Many mathematical models are meant to replace a human judgment process that is deemed too expensive to carry out at scale. Credit scores were like that: take the work out of the individual bankers’ hands and create a mathematical model that does the job consistently well. The VAM was originally intended as such – in-depth qualitative assessments of teachers are expensive, so let’s replace them with a much cheaper option.
So all I’m asking is, how good a replacement is the VAM? Does it generate the same scores as a trusted, in-depth qualitative assessment?
When I made the point yesterday that I haven’t seen anything like that, a few people mentioned studies that show positive correlations between the VAM scores and principal scores.
But here’s the key point: positive correlation does not imply equality.
Of course sometimes positive correlation is good enough, but sometimes it isn’t. It depends on the context. If you’re a trader who makes thousands of bets a day and your bets are positively correlated with the truth, you make good money.
But on the other side, if I told you that there’s a ride at a carnival that has a positive correlation with not killing children, that wouldn’t be good enough. You’d want the ride to be safe. It’s a higher standard.
I’m asking that we make sure we use that second, higher standard when we score teachers, because their jobs are increasingly on the line, so it matters that we get things right. Instead we have a machine that nobody understands that is positively correlated with things we do understand. I claim that’s not sufficient.
Let me put it this way. Say your “true value” as a teacher is a number between 1 and 100, and the VAM gives you a noisy approximation of your value, which is 24% correlated with your true value. And say I plot your value against the approximation according to VAM, and I do that for a bunch of teachers, and it looks like this:
So maybe your “true value” as a teacher is 58 but the VAM gave you a zero. That would be more than frustrating, since the score is taken as an important part of your assessment. You might even lose your job. And you might get a score of zero many years in a row, even if your true score stays at 58. That’s increasingly unlikely, to be sure, but given enough teachers it is bound to happen to a handful of people, just by statistical reasoning, and if it happens to you, you will not think it’s unlikely at all.
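To make the thought experiment concrete, here’s a minimal simulation. Everything in it is made up – these are hypothetical teachers, not actual VAM data: we generate “true values” between 1 and 100, build a score that is about 24% correlated with them, and count how many above-average teachers land in the bottom decile of the score anyway.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # hypothetical teachers

# Made-up "true values" between 1 and 100.
true = rng.uniform(1, 100, n)

# Construct a score that is about 24% correlated with the truth:
# standardize the true values, then mix in independent noise with
# weights chosen so the correlation comes out to rho.
rho = 0.24
z_true = (true - true.mean()) / true.std()
score = rho * z_true + np.sqrt(1 - rho**2) * rng.standard_normal(n)

corr = np.corrcoef(true, score)[0, 1]

# Teachers whose true value is above average but whose score
# lands in the bottom decile anyway.
unlucky = int(np.sum((true > 50) & (score < np.percentile(score, 10))))

print(f"correlation: {corr:.2f}, good teachers in bottom decile: {unlucky}")
```

In a run like this, a few hundred above-average teachers typically end up in the bottom decile of the score – which is exactly the gap between “positively correlated” and “the same.”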
In fact, if you’re a teacher, you should demand a scoring system that is consistently the same as a system you understand rather than positively correlated with one. If you’re working for a teachers’ union, feel free to contact me about this.
One last thing. I took the above graph from this post. These are actual VAM scores for the same teacher in the same year but for two different classes in the same subject – think 7th grade math and 8th grade math. So neither score represented above is “ground truth” like I mentioned in my thought experiment. But that makes it even more clear that the VAM is an insufficient tool, because it is only 24% correlated with itself.
Every now and then when I complain about the Value-Added Model (VAM), people send me links to recent papers written by Raj Chetty, John Friedman, and Jonah Rockoff, like this one entitled Measuring the Impacts of Teachers II: Teacher Value-Added and Student Outcomes in Adulthood or its predecessor Measuring the Impacts of Teachers I: Evaluating Bias in Teacher Value-Added Estimates.
I think I’m supposed to come away impressed, but that’s not what happens. Let me explain.
Their data set of student scores starts in 1989, well before the current value-added teaching climate began. That means teachers weren’t teaching to the test like they are now. Therefore saying that the current VAM works because a retroactively applied VAM worked on data from 1989 and the 1990’s is like saying I must like blueberry pie now because I used to like pumpkin pie. It’s comparing apples to oranges, or blueberries to pumpkins.
I’m surprised by the fact that the authors don’t seem to make any note of the difference in data quality between pre-VAM and current conditions. They should know all about feedback loops; any modeler should. And there’s nothing like telling teachers they might lose their job to create a mighty strong feedback loop. For that matter, just consider all the cheating scandals in the D.C. area where the stakes were the highest. Now that’s a feedback loop. And by the way, I’ve never said the VAM scores are totally meaningless, but just that they are not precise enough to hold individual teachers accountable. I don’t think Chetty et al address that question.
So we can’t trust old VAM data. But what about recent VAM data? Where’s the evidence that, in this climate of high-stakes testing, this model is anything but random?
If it were a good model, we’d presumably be seeing comparisons of current VAM scores with other current measures of teacher success, showing how well they agree. But we aren’t seeing anything like that. Tell me if I’m wrong; I’ve been looking around and I haven’t seen such comparisons. And I’m sure they’ve been tried – it’s not rocket science to compare VAM scores with other scores.
The lack of such studies reminds me of how we never hear about scientific studies on the results of Weight Watchers. There’s a reason such studies never see the light of day, namely because whenever they do those studies, they decide they’re better off not revealing the results.
And if you’re thinking that it would be hard to know exactly how to rate a teacher’s teaching in a qualitative, trustworthy way, then yes, that’s the point! It’s actually not obvious how to do this, which is the real reason we should never trust a so-called “objective mathematical model” when we can’t even decide on a definition of success. We should have the conversation of what comprises good teaching, and we should involve the teachers in that, and stop relying on old data and mysterious college graduation results 10 years hence. What are current 6th grade teachers even supposed to do about studies like that?
Note I do think educators and education researchers should be talking about these questions. I just don’t think we should punish teachers arbitrarily to have that conversation. We should have a notion of best practices that slowly evolve as we figure out what works in the long-term.
So here’s what I’d love to see, and what would be convincing to me as a statistician: all sorts of qualitative ways of measuring teachers, alongside their VAM scores, so we could compare them and make sure they agree with each other and with themselves over time. In other words, at the very least we should demand an explanation of how some teachers get totally ridiculous and inconsistent scores from one year to the next and from one VAM to the next, even in the same year.
We need some ground truth, people, and some common sense as well. Instead we’re seeing retired education professors pull statistics out of thin air, and it’s an all-out war of supposed mathematical objectivity against the civil servant.
This is a great book. It’s well written, clear, and it focuses on important issues. I did not check all of the claims made by the data but, assuming they hold up, the book makes two hugely important points which hopefully everyone can understand and debate, even if we don’t all agree on what to do about them.
First, the authors explain the insufficiency of monetary policy to get the country out of recession. Second, they suggest a new way to structure debt.
To explain these points, the authors do something familiar to statisticians: they think about distributions rather than averages. So rather than talking about how much debt there was, or how much the average house price fell, they talk about who was in debt, where they lived, and which houses lost value. And they make each point carefully, using the natural experiments inherent in our cities – differences in things like available land and income – to try to tease out causation.
Their first main point is this: the financial system works against poor people (“borrowers”) much more than rich people (“lenders”) in times of crisis, and the response to the financial crisis exacerbated this discrepancy.
The crisis fell on poor people much more heavily: they were wiped out by the plummeting housing prices, whereas rich people just lost a bit of their wealth. Then the government stepped in and protected creditors and shareholders but didn’t renegotiate debt, which protected lenders but not borrowers. This is a large reason we are seeing so much increasing inequality and why our economy is stagnant. They make the case that we should have bailed out homeowners not only because it would have been fair but because it would have been helpful economically.
The authors looked into what actually caused the Great Recession, and they come to a startling conclusion: the banking crisis was an effect, rather than a cause, of enormous household debt and consumer pull-back. Their narrative goes like this: people ran up debt, then started to pull back, and as a result the banking system collapsed, since it was utterly dependent on ever-increasing debt. Moreover, the financial system did a very poor job of figuring out how to allocate capital, and the people who made those loans were not adequately punished, whereas the people who got those loans were more than reasonably punished.
About half of the run-up of household debt was explained by home equity extraction, where people took out money from their home to spend on stuff. This is partly due to the fact that, in the meantime, wages were stagnant and home equity was a big thing and was hugely available.
But the authors also made the case that, even so, the bubble wasn’t directly caused by rising home valuations but by securitization and “financial innovation,” which made investors believe they were buying safe products that were in fact toxic. In their words, securities are invented to exploit “neglected risks” (my experience working at a financial risk firm absolutely agrees with this; whenever you hear the phrase “financial innovation,” please interpret it to mean “an instrument whose risk hides somewhere in the creases that investors are not yet aware of”).
They make the case that debt access by itself elevates prices and builds bubbles. In other words, it was the sausage factory itself, producing AAA-rated ABS CDOs, that grew the bubble.
Next, they talk about what works and what doesn’t, given this distributional way of looking at the household debt crisis. Specifically, monetary policy is insufficient, since it works through the banks, which are unwilling to lend to poor people who are already underwater, and only rich people benefit from cheap money and inflated markets. Even at its most extreme, the Fed can at most avoid deflation; it cannot really help create inflation, which is what debtors need.
Fiscal policy, which is to say things like helicopter money drops or added government jobs, paid by taxpayers, is better but it makes the wrong people pay – high income earners vs. high wealth owners – and isn’t as directly useful as debt restructuring, where poor people get a break and it comes directly from rich people who own the debt.
There are obstacles to debt restructuring, which are mostly political. Politicians are impotent in times of crisis, as we’ve seen, so instead of waiting forever for that to happen, we need a new kind of debt contract that automatically gets restructured in times of crisis. Such a new-fangled contract would make the financial system actually spread out risk better. What would that look like?
The authors give two examples, for mortgages and student debt. The student debt example is pretty simple: how quickly you need to pay back your loans depends in part on how many jobs there are when you graduate. The idea is to cushion the borrower somewhat from macro-economic factors beyond their control.
Next, for mortgages, they propose something they call the shared-responsibility mortgage. The idea here is to have, say, a 30-year mortgage as usual, but if houses in your area lose value, your principal and monthly payments go down commensurately. So if there’s a 30% drop, your payments go down 30%. To compensate the lenders for this loss-sharing, the borrowers also share the upside: 5% of capital gains are given to the lenders in the case of a refinancing.
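Here’s a sketch of the mechanics, with my own made-up numbers and function names rather than the authors’ exact contract terms: the payment scales down with the local house-price index (but never up), and the lender takes a 5% cut of any capital gain.

```python
def adjusted_payment(base_payment: float, index_now: float,
                     index_at_origination: float) -> float:
    """Monthly payment under a shared-responsibility mortgage (sketch).

    If the local house-price index falls below its level at origination,
    the payment falls proportionally; it never rises above the base.
    """
    ratio = index_now / index_at_origination
    return base_payment * min(ratio, 1.0)

def lender_upside(capital_gain: float, share: float = 0.05) -> float:
    """The lender's 5% share of any capital gain at refinancing."""
    return max(capital_gain, 0.0) * share

# A 30% drop in the local index cuts a $2,000 payment to $1,400;
# a $50,000 gain at refinancing sends $2,500 to the lender.
payment = adjusted_payment(2000.0, 70.0, 100.0)
upside = lender_upside(50_000.0)
print(round(payment, 2), round(upside, 2))
```

Note the asymmetry built into `min(ratio, 1.0)`: the borrower gets down-side protection, while the lender’s compensation comes only from the small up-side share.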
In the case of a recession, the creditors take losses but the overall losses are smaller because we avoid the foreclosure feedback loops. It also acts as a form of stimulus to the borrowers, who are more likely to spend money anyway.
If we had had such mortgage contracts in the Great Recession, the authors estimate that it would have been worth a stimulus of $200 billion, which would have in turn meant fewer jobs lost and many fewer foreclosures and a smaller decline of housing prices. They also claim that shared-responsibility mortgages would prevent bubbles from forming in the first place, because of the fear of creditors that they would be sharing in the losses.
A few comments. First, as a modeler, I am absolutely sure that once my monthly mortgage payment depends directly on a price index, that index is going to be manipulated. The same goes for a college graduate whose loan repayment schedule depends on a jobs index. And depending on how well that manipulation works, it could be a disaster.
Second, it is interesting to me that the authors make no mention of the fact that, for many forms of debt, restructuring is already a typical response. Certainly for commercial mortgages, people renegotiate their principal all the time. We can address the issue of how easy it is to negotiate principal directly by talking about standards in contracts.
Having said that, I like the idea of a contract that makes restructuring automatic and doesn’t rely on bypassing the very real organizational and political frictions that we see today.
Let me put it this way. If we saw debt contracts being written like this, where borrowers really did have down-side protection, then the people of our country might start actually feeling like the financial system was working for them rather than against them. I’m not holding my breath for this to actually happen.
I am now part of the administrative bloat over at Columbia. I am non-faculty administration, tasked with directing a data journalism program. The program is great, and I’m not complaining about my job. But I will be honest, it makes me uneasy.
Although I’m in the Journalism School, which is in many ways separated from the larger university, I now have a view into how things got so bloated. And how they might stay that way, as well: it’s not clear that, at the end of my 6-month gig, on September 16th, I could hand my job over to any existing person at the J-School. They might have to replace me, or keep me on, with a real live full-time person in charge of this program.
There are good and less good reasons for that, but overall I think there exists a pretty sound argument for such a person to run such a program and to keep it good and intellectually vibrant. That’s another thing that makes me uneasy, although many administrative positions have less of an easy sell attached to them.
I was reminded of this fact of my current existence when I read this recent New York Times article about the administrative bloat in hospitals. From the article:
And studies suggest that administrative costs make up 20 to 30 percent of the United States health care bill, far higher than in any other country. American insurers, meanwhile, spent $606 per person on administrative costs, more than twice as much as in any other developed country and more than three times as much as many, according to a study by the Commonwealth Fund.
A comprehensive study published by the Delta Cost Project in 2010 reported that between 1998 and 2008, America’s private colleges increased spending on instruction by 22 percent while increasing spending on administration and staff support by 36 percent. Parents who wonder why college tuition is so high and why it increases so much each year may be less than pleased to learn that their sons and daughters will have an opportunity to interact with more administrators and staffers— but not more professors.
There are similarities and there are differences between the university and the medical situations.
A similarity is that people really want to be educated, and people really need to be cared for, and administrations have grown up around these basic facts. At each stage they seem to be adding something either seemingly productive or vitally needed to contain the complexity of the existing machine, but in the end you have enormous behemoths of organizations that are much too complex and much too expensive. As a reality check on whether that’s necessary, take a look at hospitals in Europe, or take a look at our own university system a few decades ago.
And that also points out a critical difference: the health care system is ridiculously complicated in this country, and in some sense you need all these people just to navigate it for a hospital. And ObamaCare made that worse, not better, even though it also has good aspects in terms of coverage.
The university system, by contrast, made itself complicated; it wasn’t externally forced into complexity, unless you count the U.S. News & World Report gaming that seems inescapable.
You might have heard about the recent study entitled Higher social class predicts increased unethical behavior. In it, the authors figure out seven ways to measure the extent to which rich people are bigger assholes than poor people, a plan that works brilliantly every time.
What they term “unethical behavior” comes down to stuff like cutting off people and cars in an intersection, cheating in a game, and even stealing candy from a baby.
The authors also show that rich people are more likely to think of greed as good, and that attitude is sufficient to explain their feelings of entitlement. Another way of saying this is that, once you “account for greed feelings,” being rich doesn’t make you more likely to cheat.
I’d like to go one step further and ask, why do rich people think greed is good? A couple of things come to mind.
First, rich people rarely get arrested, and even when they are arrested, their experiences are very different and much less likely to end up with a serious sentence. Specifically, the fees are not onerous for the rich, and fancier lawyers do better jobs for the rich (by the way, in Finland, speeding tickets are on a sliding scale depending on the income of the perpetrator). It’s easy to think greed is good if you never get punished for cheating.
Second, rich people are examples of current or legacy winners in the current system, and that feeling that they have won leaks onto other feelings of entitlement. They have faith in the system to keep them from having to deal with consequences because so far so good.
Finally, some people deliberately judge that they can afford to be assholes. They are insulated from depending on other people because they have money. Who needs friends when you have resources?
Of course, not all rich people are greed-is-good obsessed assholes. But there are some that specialize in it. They call themselves Libertarians. PayPal founder Peter Thiel is one of their heroes.
Here’s some good news: some of those people intend to sail off on a floating country. Thiel is helping fund this concept. The only problem is, they all are so individualistic it’s hard for them to agree on ground rules and, you know, a process by which to decide things (don’t say government!).
This isn’t a new idea, but for some reason it makes me very happy. I mean, wouldn’t you love it if a good fraction of the people who cut you off in traffic got together and decided to leave town? I’m thinking of donating to that cause. Do they have a Kickstarter yet?
I gave a talk to the invitation-only NYC CTO Club a couple of weeks ago about my fears about big data modeling, namely:
- that big data modeling is discriminatory,
- that big data modeling increases inequality, and
- that big data modeling threatens democracy.
I had three things on my “to do” list for the audience of senior technologists, namely:
- test internal, proprietary models for discrimination,
- help regulators like the CFPB develop reasonable audits, and
- get behind certain models being transparent and publicly accessible, including credit scoring, teacher evaluations, and political messaging models.
Given the provocative nature of my talk, I was pleasantly surprised by the positive reception I was given. Those guys were great – interactive, talkative, and very thoughtful. I think it helped that I wasn’t trying to sell them something.
Even so, I shouldn’t have been surprised when one of them followed up with me to talk about a possible business model for “fairness audits.” The idea is that, what with the recent bad press about discrimination in big data modeling (some of the audience had actually worked with the Podesta team), there will likely be a business advantage to being able to claim that your models are fair. So someone should develop those tests that companies can take. Quick, someone, monetize fairness!
One reason I think this might actually work – and more importantly, be useful – is that I focused on “effects-based” discrimination testing, which means treating the model like a black box, feeding it different inputs, and seeing what outputs it gives. In other words, I want to give a resume-sorting algorithm resumes with similar qualifications but different races. An algorithmically induced randomized experiment, if you will.
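Here’s a toy sketch of what such an effects-based audit could look like. Everything in it is hypothetical – the biased model, the profiles, and the flagging threshold are all invented for illustration: we treat the scorer as a black box, flip only the protected attribute, and measure the resulting gap in scores.

```python
def audit_black_box(score_fn, base_profiles, attribute, values,
                    threshold=0.05):
    """Effects-based audit sketch: for each base profile, vary only the
    protected attribute and record the spread in the model's scores.

    Returns the average score gap and whether it exceeds a (made-up)
    flagging threshold. A real audit would use real applications and
    a proper statistical test, not a fixed cutoff.
    """
    gaps = []
    for profile in base_profiles:
        scores = [score_fn(dict(profile, **{attribute: v})) for v in values]
        gaps.append(max(scores) - min(scores))
    avg_gap = sum(gaps) / len(gaps)
    return avg_gap, avg_gap > threshold

# A toy "resume sorter" that (wrongly) penalizes one group.
def biased_model(profile):
    score = 0.5 + 0.1 * profile["years_experience"]
    if profile["race"] == "B":
        score -= 0.2
    return score

profiles = [{"years_experience": y, "race": "A"} for y in range(5)]
gap, flagged = audit_black_box(biased_model, profiles, "race", ["A", "B"])
print(gap, flagged)  # average gap of about 0.2, flagged
```

The point is that the auditor never needs to see inside `biased_model`; the discrimination shows up purely in the input-output behavior.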
From the business perspective, a test that allows a model to remain a black box feels safe, because it does not require true transparency, and allows the “secret sauce” to remain secret.
One thing, though. I don’t think it makes much sense to have a proprietary model for fairness auditing. In fact the way I was imagining this was to develop an open-source audit model that the CFPB could use. What I don’t want, and what would be worse than nothing, is for some private company to develop a proprietary “fairness audit” model that we can’t trust and that claims to solve the very real problems listed above.
Update: something like this is already happening for privacy compliance in the big data world (hat tip David Austin).