
## Salt it up, baby!

An article in yesterday’s Science Times explained that limiting the salt in your diet doesn’t actually improve health, and could in fact be bad for you. That’s a huge turn-around for a public health rule that has run very deep.

How can this kind of thing happen?

Well, first of all epidemiologists use crazy models to make predictions on things, and in this case what happened was they saw a correlation between high blood pressure and high salt intake, and they saw a separate correlation between high blood pressure and death, and so they linked the two.

Trouble is, while very low salt intake might lower blood pressure a little bit, it also for whatever reason makes people die a wee bit more often.

As this Scientific American article explains, that “little bit” is actually really small:

Over the long-term, low-salt diets, compared to normal diets, decreased systolic blood pressure (the top number in the blood pressure ratio) in healthy people by 1.1 millimeters of mercury (mmHg) and diastolic blood pressure (the bottom number) by 0.6 mmHg. That is like going from 120/80 to 119/79. The review concluded that “intensive interventions, unsuited to primary care or population prevention programs, provide only minimal reductions in blood pressure during long-term trials.” A 2003 Cochrane review of 57 shorter-term trials similarly concluded that “there is little evidence for long-term benefit from reducing salt intake.”

Moreover, some people react to changing their salt intake with higher, and some with lower blood pressure. Turns out it’s complicated.
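A toy sketch makes the problem with chaining two correlations concrete. Everything below is made up for illustration: the variable names, the coefficients, and the "everything else" factor are assumptions, not epidemiological estimates. It only shows that salt can correlate positively with blood pressure, and blood pressure positively with risk, while salt and risk correlate negatively.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000

# Hypothetical standardized quantities, invented for this sketch.
salt = [rng.gauss(0, 1) for _ in range(n)]
other = [rng.gauss(0, 1) for _ in range(n)]  # everything else that raises blood pressure

# Blood pressure rises with both salt and the other factors.
bp = [s + o for s, o in zip(salt, other)]
# Risk rises with the other factors but falls slightly with salt.
risk = [o - 0.3 * s for s, o in zip(salt, other)]

r_salt_bp = pearson(salt, bp)
r_bp_risk = pearson(bp, risk)
r_salt_risk = pearson(salt, risk)

print(f"salt vs blood pressure: {r_salt_bp:+.2f}")
print(f"blood pressure vs risk: {r_bp_risk:+.2f}")
print(f"salt vs risk:           {r_salt_risk:+.2f}")
```

The first two correlations come out positive and the third negative, which is exactly the pattern that linking the two observed correlations would miss.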

I’m a skeptic, especially when it comes to epidemiology. None of this surprises me, and I don’t think it’s the last bombshell we’ll be hearing. But this meta-analysis also might have flaws, so hold your breath for the next pronouncement.

One last thing – they keep saying that it’s too expensive to do this kind of study right, but I’m thinking that by now they might realize the real cost of not doing it right is a loss of the public’s trust in medical research.

Categories: modeling, statistics

## SEC Roundtable on credit rating agency models today

I’ve discussed the broken business model that is the credit rating agency system in this country on a few occasions. It directly contributed to the opacity and fraud in the MBS market and to the ensuing financial crisis, for example. And in this post and then this one, I suggest that someone should start an open source version of credit rating agencies. Here’s my explanation:

The system of credit ratings undermines the trust of even the most fervently pro-business entrepreneur out there. The models are knowingly gamed by both sides, and it’s clearly both corrupt and important. It’s also a bipartisan issue: Republicans and Democrats alike should want transparency when it comes to modeling downgrades, at the very least so they can argue against the results in a factual way. There’s no reason I can see why there shouldn’t be broad support for a rule to force the ratings agencies to make their models publicly available. In other words, this isn’t a political game that would score points for one side or the other.

Well, it wasn’t long before Marc Joffe, who had started an open source credit rating agency, contacted me and came to my Occupy group to explain his plan, which I blogged about here. That was almost a year ago.

Today the SEC is going to have something they’re calling a Credit Ratings Roundtable. This is in response to an amendment that Senator Al Franken put on Dodd-Frank which requires the SEC to examine the credit rating industry. From their webpage description of the event:

The roundtable will consist of three panels:

• The first panel will discuss the potential creation of a credit rating assignment system for asset-backed securities.
• The second panel will discuss the effectiveness of the SEC’s current system to encourage unsolicited ratings of asset-backed securities.
• The third panel will discuss other alternatives to the current issuer-pay business model in which the issuer selects and pays the firm it wants to provide credit ratings for its securities.

Marc is going to be one of something like 9 people in the third panel. He wrote this op-ed piece about his goal for the panel, a key excerpt being the following:

Section 939A of the Dodd-Frank Act requires regulatory agencies to replace references to NRSRO ratings in their regulations with alternative standards of credit-worthiness. I suggest that the output of a certified, open source credit model be included in regulations as a standard of credit-worthiness.

Just to be clear: the current problem is that not only is there widespread gaming, but there’s also a near monopoly by the “big three” credit rating agencies, and for whatever reason that monopoly status has been incredibly well protected by the SEC. They don’t grant “NRSRO” status to credit rating agencies unless the given agency can produce something like 10 letters from clients who will vouch for them providing credit ratings for at least 3 years. You can see why this is a hard business to break into.

The Roundtable was covered yesterday in the Wall Street Journal as well: Ratings Firms Steer Clear of an Overhaul - an unfortunate title if you are trying to be optimistic about the event today. From the WSJ article:

Mr. Franken’s amendment requires the SEC to create a board that would assign a rating firm to evaluate structured-finance deals or come up with another option to eliminate conflicts.

While lawsuits filed against S&P in February by the U.S. government and more than a dozen states refocused unflattering attention on the bond-rating industry, efforts to upend its reliance on issuers have languished, partly because of a lack of consensus on what to do.

I’m just kind of amazed that, given how dirty and obviously broken this industry is, we can’t do better than this. SEC, please start doing your job. How could allowing an open-source credit rating agency hurt our country? How could it make things worse?

## WSJ: “When Your Boss Makes You Pay for Being Fat”

Going along with the theme of shaming which I took up yesterday, there was a recent Wall Street Journal article called “When Your Boss Makes You Pay for Being Fat” about new ways employers are trying to “encourage healthy living”, or otherwise described, “save money on benefits”. From the article:

Until recently, Michelin awarded workers automatic \$600 credits toward deductibles, along with extra money for completing health-assessment surveys or participating in a nonbinding “action plan” for wellness. It adopted its stricter policy after its health costs spiked in 2012.

Now, the company will reward only those workers who meet healthy standards for blood pressure, glucose, cholesterol, triglycerides and waist size—under 35 inches for women and 40 inches for men. Employees who hit baseline requirements in three or more categories will receive up to \$1,000 to reduce their annual deductibles. Those who don’t qualify must sign up for a health-coaching program in order to earn a smaller credit.

• This policy combines the critical characteristics of shaming, namely 1) a complete lack of empathy and 2) the shifting of blame for a problem entirely onto one segment of the population even though the “obesity epidemic” is a poorly understood cultural phenomenon.
• To the extent that there may be push-back against this or similar policies inside the workplace, there will be very little to stop employers from not hiring fat people in the first place.
• Or for that matter, what’s going to stop employers from using people’s full medical profiles (note: by this I mean the unregulated online profile that Acxiom and other companies collect about you and then sell to employers or advertisers for medical stuff – not the official medical records which are regulated) against them in the hiring process? Who owns the new-fangled health analytics models anyway?
• We do that already to poor people by basing their acceptance on credit scores.

Categories: data science, modeling

## E-discovery and the public interest

Today I want to bring up a few observations and concerns I have about the emergence of a new field in machine learning called e-discovery. It’s the algorithmic version of discovery, so I’ll start there.

Discovery is part of the process in a lawsuit where relevant documents are selected, pored over, and then handed to the other side. Nowadays, of course, there are more and more documents, almost all electronic, typically including lots of e-mails.

If you’re talking about a big lawsuit, there could be literally millions of documents to wade through, which is incredibly expensive and time-consuming for humans to do. Enter the algorithm.

With advances in Natural Language Processing (NLP), a machine algorithm can sort emails or documents by topic (after getting the documents into machine-readable form, cleaning, and deduping) and can in general do a pretty good job of figuring out whether a given email is “relevant” to the case.

And this is already happening – the Wall Street Journal recently reported that the Justice Department allowed e-discovery for a case involving the merger of two beer companies. From the article:

With the blessing of the Justice Department’s antitrust division, the lawyers loaded the documents into a program and manually reviewed a batch to train the software to recognize relevant documents. The manual review was repeated until the Justice Department and Constellation were satisfied that the program could accurately predict relevance in the rest of the documents. Lawyers for Constellation and Crown Imports used software developed by kCura Corp., which lists the Justice Department as a client.

In the end, Constellation and Crown Imports turned over hundreds of thousands of documents to antitrust investigators.
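The workflow in that excerpt (hand-label a batch, train, then let the software predict relevance for the rest) can be sketched with a tiny naive Bayes text classifier. This is only an illustration under loud assumptions: the documents are invented, the function names `train` and `predict` are hypothetical, and nothing here reflects how kCura's software actually works.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(labeled_docs):
    """labeled_docs: list of (text, is_relevant) pairs from the manual review."""
    counts = {True: Counter(), False: Counter()}
    n_docs = {True: 0, False: 0}
    for text, label in labeled_docs:
        counts[label].update(tokenize(text))
        n_docs[label] += 1
    return counts, n_docs

def predict(model, text):
    """Return True if the document looks relevant under naive Bayes."""
    counts, n_docs = model
    vocab_size = len(set(counts[True]) | set(counts[False]))
    scores = {}
    for label in (True, False):
        total_words = sum(counts[label].values())
        score = math.log(n_docs[label] / sum(n_docs.values()))
        for word in tokenize(text):
            # Add-one smoothing so unseen words don't zero out the score.
            score += math.log((counts[label][word] + 1) / (total_words + vocab_size))
        scores[label] = score
    return scores[True] > scores[False]

# Invented batch that lawyers have reviewed by hand.
reviewed = [
    ("pricing strategy for the beer merger", True),
    ("market share after the merger closes", True),
    ("import pricing and distribution agreement", True),
    ("lunch on friday anyone", False),
    ("office holiday party schedule", False),
    ("parking garage closed on friday", False),
]

model = train(reviewed)
for doc in ["merger pricing memo", "friday lunch schedule"]:
    print(doc, "->", "relevant" if predict(model, doc) else "not relevant")
```

In practice the manual step would be repeated, checking predictions against fresh hand-labeled samples, until both sides trust the accuracy. Note that whatever the reviewers label as "relevant" is all the model ever learns.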

Here are some of my questions/concerns:

• These algorithms are typically not open source – companies like kCura make good money doing these jobs.
• That means that they could be wrong, possibly in subtle ways.
• Or maybe not so subtle ways: maybe they’ve been trained to find documents that are both “relevant” and “positive” for a given side.
• In any case, the laws of this country will increasingly depend on a black box algorithm that is not accessible to the average citizen.
• Is that in the public’s interest?
• Is that even constitutional?

## The NYC Data Skeptics Meetup

One thing I’m super excited about at work is the new NYC Data Skeptics Meetup we’re organizing. Here’s the description of our mission:

The hype surrounding Big Data and Data Science is at a fever pitch with promises to solve the world’s business and social problems, large and small. How accurate or misleading is this message? How is it helping or damaging people, and which people? What opportunities exist for data nerds and entrepreneurs that examine the larger issues with a skeptical view?

This Meetup focuses on mathematical, ethical, and business aspects of data from a skeptical perspective. Guest speakers will discuss the misuse of and best practices with data, common mistakes people make with data and ways to avoid them, how to deal with intentional gaming and politics surrounding mathematical modeling, and taking into account the feedback loops and wider consequences of modeling. We will take deep dives into models in the fields of Data Science, statistics, finance, economics, healthcare, and public policy.

This is an independent forum and open to anyone sharing an interest in the larger use of data. Technical aspects will be discussed, but attendees do not need to have a technical background.

A few things:

• I wouldn’t blame you for not joining until we have a confirmed speaker, so please suggest speakers for us! I have a bunch of people in mind I’d absolutely love to see but I’d love more ideas. And I’m thinking broadly here – of course data scientists and statisticians and economists, but also lawyers, sociologists, or anyone who works with data or the effects of data.
• If you are skeptical of the need for yet another data-oriented Meetup (or other regular meeting), please think about it this way: there are not that many currently active groups which aren’t afraid to go into the technical weeds and also aren’t obsessed with a simplistic, sound bite business take-away. But please tell me if I’m wrong, I’d love to reach out to people doing similar things.
• Suggest a better graphic for our Meetup than our current portrait of Isaac Asimov.

Categories: data science, modeling

## The rise of big data, big brother

I recently read an article off the newsstand called The Rise of Big Data.

It was written by Kenneth Neil Cukier and Viktor Mayer-Schoenberger and it was published in the May/June 2013 edition of Foreign Affairs, which is published by the Council on Foreign Relations (CFR). I mention this because CFR is an influential think tank, filled with powerful insiders, including people like Robert Rubin himself, and for that reason I want to take this view on big data very seriously: it might reflect the policy view before long.

And if I think about it, compared to the uber naive view I came across last week when I went to the congressional hearing about big data and analytics, that would be good news. I’ll write more about it soon, but let’s just say it wasn’t everything I was hoping for.

At least Cukier and Mayer-Schoenberger discuss their reservations regarding “big data” in this article. To contrast this with last week, it seemed like the only background material for the hearing, at least for the congressmen, was the McKinsey report talking about how sexy data science is and how we’ll need to train an army of them to stay competitive.

So I’m glad it’s not all rainbows and sunshine when it comes to big data in this article. Unfortunately, whether because they’re tied to successful business interests, or because they just haven’t thought too deeply about the dark side, their concerns seem almost token, and their examples bizarre.

The article is unfortunately behind the paywall, but I’ll do my best to explain what they’ve said.

Datafication

First they discuss the concept of datafication, and their example is how we quantify friendships with “likes”: it’s the way everything we do, online or otherwise, ends up recorded for later examination in someone’s data storage units. Or maybe multiple storage units, and maybe for sale.

They formally define it later in the article as a process:

… taking all aspects of life and turning them into data. Google’s augmented-reality glasses datafy the gaze. Twitter datafies stray thoughts. LinkedIn datafies professional networks.

Datafication is an interesting concept, although as far as I can tell they did not coin the word, and it has led me to consider its importance with respect to intentionality of the individual.

Here’s what I mean. We are being datafied, or rather our actions are, and when we “like” someone or something online, we are intending to be datafied, or at least we should expect to be. But when we merely browse the web, we are unintentionally, or at least passively, being datafied through cookies that we might or might not be aware of. And when we walk around in a store, or even on the street, we are being datafied in a completely unintentional way, via sensors or Google glasses.

This spectrum of intentionality ranges from us gleefully taking part in a social media experiment we are proud of to all-out surveillance and stalking. But it’s all datafication. Our intentions may run the gamut but the results don’t.

They follow up their definition in the article, once they get to it, with a line that speaks volumes about their perspective:

Once we datafy things, we can transform their purpose and turn the information into new forms of value

But who is “we” when they write it? What kinds of value do they refer to? As you will see from the examples below, mostly that translates into increased efficiency through automation.

So if at first you assumed they mean we, the American people, you might be forgiven for re-thinking the “we” in that sentence to be the owners of the companies which become more efficient once big data has been introduced, especially if you’ve recently read this article from Jacobin by Gavin Mueller, entitled “The Rise of the Machines” and subtitled “Automation isn’t freeing us from work — it’s keeping us under capitalist control.” From the article (which you should read in its entirety):

In the short term, the new machines benefit capitalists, who can lay off their expensive, unnecessary workers to fend for themselves in the labor market. But, in the longer view, automation also raises the specter of a world without work, or one with a lot less of it, where there isn’t much for human workers to do. If we didn’t have capitalists sucking up surplus value as profit, we could use that surplus on social welfare to meet people’s needs.

The big data revolution and the assumption that N=ALL

According to Cukier and Mayer-Schoenberger, the Big Data revolution consists of three things:

1. Collecting and using a lot of data rather than small samples.
2. Accepting messiness in your data.
3. Giving up on knowing the causes.

They describe these steps in rather grand fashion, by claiming that big data doesn’t need to understand cause because the data is so enormous. It doesn’t need to worry about sampling error because it is literally keeping track of the truth. The way the article frames this is by claiming that the new approach of big data is letting “N = ALL”.

But here’s the thing, it’s never all. And we are almost always missing the very things we should care about most.

So for example, as this InfoWorld post explains, internet surveillance will never really work, because the very clever and tech-savvy criminals that we most want to catch are the very ones we will never be able to catch, since they’re always a step ahead.

Even the example from their own article, election night polls, is itself a great non-example: even if we poll absolutely everyone who leaves the polling stations, we still don’t count people who decided not to vote in the first place. And those might be the very people we’d need to talk to to understand our country’s problems.

Indeed, I’d argue that the assumption we make that N=ALL is one of the biggest problems we face in the age of Big Data. It is, above all, a way of excluding the voices of people who don’t have the time or don’t have the energy or don’t have the access to cast their vote in all sorts of informal, possibly unannounced, elections.

Those people, busy working two jobs and spending time waiting for buses, become invisible when we tally up the votes without them. To you this might just mean that the recommendations you receive on Netflix don’t seem very good because most of the people who bother to rate things on Netflix are young and have different tastes than you, which skews the recommendation engine towards them. But there are plenty of much more insidious consequences stemming from this basic idea.
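The arithmetic behind that skew is easy to sketch. The groups, head counts, rating rates and scores below are all invented for illustration; the point is only that an average over the people who show up in the data can differ a lot from the average over everyone.

```python
# Each row: (label, number of people, share who bother to rate, rating they would give).
# All numbers are hypothetical.
groups = [
    ("young, lots of free time", 1000, 0.40, 4.5),
    ("busy, working two jobs", 1000, 0.05, 2.0),
]

def average_rating(groups, only_raters):
    """Average rating, weighted either by raters only or by the whole population."""
    total_weight = 0.0
    total_score = 0.0
    for _label, people, rate_share, rating in groups:
        weight = people * rate_share if only_raters else people
        total_weight += weight
        total_score += weight * rating
    return total_score / total_weight

observed = average_rating(groups, only_raters=True)   # what the database "sees"
true_avg = average_rating(groups, only_raters=False)  # what N=ALL would really mean

print(f"average among people who rated: {observed:.2f}")
print(f"average across everyone:        {true_avg:.2f}")
```

With these made-up numbers the recorded average is about 4.22 while the population average is 3.25, even though every single recorded rating is accurate.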

Another way in which the assumption that N=ALL can matter is that it often gets translated into the idea that data is objective. Indeed the article warns us against failing to assume exactly that:

… we need to be particularly on guard to prevent our cognitive biases from deluding us; sometimes, we just need to let the data speak.

And later in the article,

In a world where data shape decisions more and more, what purpose will remain for people, or for intuition, or for going against the facts?

This is a bitch of a problem for people like me who work with models, know exactly how they work, and know exactly how wrong it is to believe that “data speaks”.

I wrote about this misunderstanding here, in the context of Bill Gates, but I was recently reminded of it in a terrifying way by this New York Times article on big data and recruiter hiring practices. From the article:

“Let’s put everything in and let the data speak for itself,” Dr. Ming said of the algorithms she is now building for Gild.

If you read the whole article, you’ll learn that this algorithm tries to find “diamond in the rough” types to hire. A worthy effort, but one that you have to think through.

Why? Say you compared women and men with the exact same qualifications who have been hired in the past, and then, looking into what happened next, you learned that those women have tended to leave more often, get promoted less often, and give more negative feedback on their environments than the men. Your model might be tempted to hire the man over the woman the next time the two showed up, rather than looking into the possibility that the company doesn’t treat female employees well.

In other words, ignoring causation can be a flaw, rather than a feature. Models that ignore causation can add to historical problems instead of addressing them. And data doesn’t speak for itself: data is just a quantitative, pale echo of the events of our society.
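Here is a minimal sketch of that failure mode, assuming a made-up hiring history and a deliberately naive "model" that just scores candidates by the historical retention rate of people who look like them. The records, field names and the `fit_retention_rates` helper are hypothetical and are not meant to describe Gild's algorithm or anyone else's.

```python
from collections import defaultdict

# Invented history: equally qualified hires, but the women left more often.
history = (
    [{"gender": "M", "qualified": True, "stayed": True}] * 80
    + [{"gender": "M", "qualified": True, "stayed": False}] * 20
    + [{"gender": "F", "qualified": True, "stayed": True}] * 55
    + [{"gender": "F", "qualified": True, "stayed": False}] * 45
)

def fit_retention_rates(records, features):
    """Score each feature combination by the share of past hires who stayed."""
    stayed = defaultdict(int)
    seen = defaultdict(int)
    for record in records:
        key = tuple(record[f] for f in features)
        seen[key] += 1
        stayed[key] += record["stayed"]
    return {key: stayed[key] / seen[key] for key in seen}

model = fit_retention_rates(history, ["gender", "qualified"])
score_m = model[("M", True)]
score_f = model[("F", True)]

print(f"predicted retention, qualified man:   {score_m:.2f}")
print(f"predicted retention, qualified woman: {score_f:.2f}")
```

The model prefers the man (0.80 versus 0.55) at identical qualifications. It has no way to tell whether the gap comes from the candidates or from how the company treated them, which is the causal question the data alone can't answer.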

Some cherry-picked examples

One of the most puzzling things about the Cukier and Mayer-Schoenberger article is how they chose their “big data” examples.

One of them, the ability for big data to spot infection in premature babies, I recognized from the congressional hearing last week. Who doesn’t want to save premature babies? Heartwarming! Big data is da bomb!

But if you’re going to talk about medicalized big data, let’s go there for reals. Specifically, take a look at this New York Times article from last week where a woman traces the big data footprints, such as they are, back in time after receiving a pamphlet on living with Multiple Sclerosis. From the article:

Now she wondered whether one of those companies had erroneously profiled her as an M.S. patient and shared that profile with drug-company marketers. She worried about the potential ramifications: Could she, for instance, someday be denied life insurance on the basis of that profile? She wanted to track down the source of the data, correct her profile and, if possible, prevent further dissemination of the information. But she didn’t know which company had collected and shared the data in the first place, so she didn’t know how to have her entry removed from the original marketing list.

Two things about this. First, it happens all the time, to everyone, but especially to people who don’t know better than to search online for diseases they actually have. Second, the article seems particularly spooked by the idea that a woman who does not have a disease might be targeted as being sick and have crazy consequences down the road. But what about a woman who actually is sick? Does that person somehow deserve to have their life insurance denied?

The real worries about the intersection of big data and medical records, at least the ones I have, are completely missing from the article. They did mention, though, that “improving and lowering the cost of health care for the world’s poor” will inevitably make it “necessary to automate some tasks that currently require human judgment.” Increased efficiency once again.

To be fair, they also talked about how Google tried to predict the flu in February 2009 but got it wrong. I’m not sure what they were trying to say except that it’s cool what we can try to do with big data.

Also, they discussed a Tokyo research team that collects data on 360 pressure points with sensors in a car seat, “each on a scale of 0 to 256.” I think that last part about the scale was added just so they’d have more numbers in the sentence – so mathematical!

And what do we get in exchange for all these sensor readings? The ability to distinguish drivers, so I guess you’ll never have to share your car, and the ability to sense if a driver slumps, to either “send an alert or automatically apply brakes.” I’d call that a questionable return for my investment of total body surveillance.

Big data, business, and the government

Of course, if you’re interested in treating your government office like a business, that’s gonna give you an edge too. In their example, Bloomberg’s big data initiative led to efficiency gains (read: we can do more with less, i.e. we can start firing government workers, or at least never hire more).

As for regulation, it is pseudo-dealt with via the discussion of market dominance. We are meant to understand that the only role government can or should have with respect to data is how to make sure the market is working efficiently. The darkest projected future is that of market domination by Google or Facebook:

But how should governments apply antitrust rules to big data, a market that is hard to define and is constantly changing form?

In particular, no discussion of how we might want to protect privacy.

Big data, big brother

I want to be fair to Cukier and Mayer-Schoenberger, because they do at least bring up the idea of big data as big brother. Their topic is serious. But their examples, once again, are incredibly weak.

Should we find likely-to-drop-out boys or likely-to-get-pregnant girls using big data? Should we intervene? Note the intention of this model would be the welfare of poor children. But how many models currently in production are targeting that demographic with that goal? Is this in any way at all a reasonable example?

Here’s another weird one: they talked about the bad metric used by US Secretary of Defense Robert McNamara in the Viet Nam War, namely the number of casualties. By defining this with the current language of statistics, though, it gives us the impression that we could just be super careful about our metrics in the future and: problem solved. As we experts in data know, however, it’s a political decision, not a statistical one, to choose a metric of success. And it’s the guy in charge who makes that decision, not some quant.

Innovation

If you end up reading the Cukier and Mayer-Schoenberger article, please also read Julie Cohen’s draft of a soon-to-be published Harvard Law Review article called “What Privacy is For” where she takes on big data in a much more convincing and skeptical light than Cukier and Mayer-Schoenberger were capable of summoning up for their big data business audience.

I’m actually planning a post soon on Cohen’s article, which contains many nuggets of thoughtfulness, but for now I’ll simply juxtapose two ideas surrounding big data and innovation, giving Cohen the last word. First from the Cukier and Mayer-Schoenberger article:

Big data enables us to experiment faster and explore more leads. These advantages should produce more innovation

Second from Cohen, where she uses the term “modulation” to describe, more or less, the effect of datafication on society:

When the predicate conditions for innovation are described in this way, the problem with characterizing privacy as anti-innovation becomes clear: it is modulation, not privacy, that poses the greater threat to innovative practice. Regimes of pervasively distributed surveillance and modulation seek to mold individual preferences and behavior in ways that reduce the serendipity and the freedom to tinker on which innovation thrives. The suggestion that innovative activity will persist unchilled under conditions of pervasively distributed surveillance is simply silly; it derives rhetorical force from the cultural construct of the liberal subject, who can separate the act of creation from the fact of surveillance. As we have seen, though, that is an unsustainable fiction. The real, socially-constructed subject responds to surveillance quite differently—which is, of course, exactly why government and commercial entities engage in it. Clearing the way for innovation requires clearing the way for innovative practice by real people, by preserving spaces within which critical self-determination and self-differentiation can occur and by opening physical spaces within which the everyday practice of tinkering can thrive.

## Big data and surveillance

You know how, every now and then, you hear someone throw out a statistic that implies almost all of the web is devoted to porn?

Well, that turns out to be a myth, which you can read more about here - although once upon a time it was kind of true, before women started using the web in large numbers and before there was Netflix streaming.

Here’s another myth along the same lines which I think might actually be true: almost all of big data is devoted to surveillance.

Of course, data is data, and you could define “surveillance” broadly (say as “close observation”), to make the above statement a tautology. To what extent is Google’s data, collected about you, a surveillance database, if they only use it to tailor searches and ads?

On the other hand, something that seems unthreatening now can become creepy soon: recall the NSA whistleblower who last year described how the government stores an enormous amount of the “electronic communications” in this country to keep close tabs on us.

The past

Back in 2011, computerworld.com published an article entitled “Big data to drive a surveillance society” that makes the case that there is a natural competition among corporations with large databases to collect more data, have it more interconnected (knowing not only a person’s shopping habits but also their location and age, say) and have the analytics work faster, even real-time, so they can peddle their products faster and better than the next guy.

Of course, not everyone agrees to talk about this “natural competition”. From the article:

Todd Papaioannou, vice president of cloud architecture at Yahoo, said instead of thinking about big data analytics as a weapon that empowers corporate Big Brothers, consumers should regard it as a tool that enables a more personalized Web experience.

“If someone can deliver a more compelling, relevant experience for me as a consumer, then I don’t mind it so much,” he said.

Thanks for telling us consumers how great this is, Todd. Later in the same article Todd says, “Our approach is not to throw any data away.”

The present

Fast forward to 2013, when defence contractor Raytheon is reported to have a new piece of software, called Riot, which is cutting-edge in the surveillance department.

The name Riot refers to “Rapid Information Overlay Technology” and it can locate individuals by longitude and latitude, using cell phone data, and make predictions as well, using data scraped from Facebook, Twitter, and Foursquare. A video explains how they do it. From the op-ed:

The possibilities for RIOT are hideous at consumer level. This really is the stalker’s dream technology. There’s also behavioural analysis to predict movements in the software. That’s what Big Data can do, and if it’s not foolproof, there are plenty of fools around the world to try it out on.

US employers, who have been creating virtual Gulags of surveillance for employees with much less effective technology, will love this. “We know what you do” has always been a working option for coercion. The fantastic levels of paranoia involved in the previous generations of surveillance technology will be truly gratified by RIOT.

The future

Lest we think that our children are not as affected by such stalking software, since they don’t spend as much time on social media and often don’t have cellphones, you should also be aware that educational data is now being collected about individual learners in the U.S. at an enormous scale and with very little oversight.

This report from educationnewyork.com (hat tip Matthew Cunningham-Cook) explains recent changes in privacy laws for children, which happen to coincide with how much data is being collected (tons) and how much money is in the analysis of that data (tons):

Schools are a rich source of personal information about children that can be legally and illegally accessed by third parties. With incidences of identity theft, database hacking, and sale of personal information rampant, there is an urgent need to protect students’ rights under FERPA and raise awareness of aspects of the law that may compromise the privacy of students and their families.

In 2008 and 2011, amendments to FERPA gave third parties, including private companies, increased access to student data. It is significant that in 2008, the amendments to FERPA expanded the definitions of “school officials” who have access to student data to include “contractors, consultants, volunteers, and other parties to whom an educational agency or institution has outsourced institutional services or functions it would otherwise use employees to perform.” This change has the effect of increasing the market for student data.

There are lots of contractors and consultants, for example inBloom, and they are slightly less concerned about data privacy issues than you might be:

inBloom has stated that it “cannot guarantee the security of the information stored … or that the information will not be intercepted when it is being transmitted.”

The article ends with this:

The question is: Should we compromise and endanger student privacy to support a centralized and profit-driven education reform initiative? Given this new landscape of an information and data free-for-all, and the proliferation of data-driven education reform initiatives like CommonCore and huge databases of student information, we’ve arrived at a time when once a child enters a public school, their parents will never again know who knows what about their children and about their families. It is now up to individual states to find ways to grant students additional privacy protections.

No doubt about it: our children are well on their way to being the most stalked generation.

One of the reasons I’m writing this post today is that I’m on a train to D.C. to sit in a Congressional hearing where Congressmen will ask “big data experts” questions about big data and analytics. The announcement is here, and I’m hoping to get into it.

The experts present are from IBM, the NSF, and North Carolina State University. I’m wondering how they got picked and what their incentives are. If I get in I will write a follow-up post on what happened.

Here’s what I hope happens. First, I hope it’s made clear that anonymization doesn’t really work with large databases. Second, I hope it’s clear that there’s no longer a very clear dividing line between sensitive data and nonsensitive data – you’d be surprised how much can be inferred about your sensitive data using only nonsensitive data.

Next, I hope it’s clear that the very people who should be worried the most about their data being exposed and freely available are the ones who don’t understand the threat. This means that merely saying that people should protect their data more is utterly insufficient.

Next, we should understand what policies already in place look like in Europe:

Finally, we should focus not only on the collection of data, but on the usage of data. Just because you have a good idea of my age, race, education level, income, and HIV status doesn’t mean you should be able to use that information against me whenever you want.

In particular, it should not be legal for companies that provide loans or insurance to use whatever information they can buy from Acxiom about you. Such decisions should be allowed to draw only on a highly regulated set of data.

Categories: data science, modeling

## How much math do scientists need to know?

I’m catching up with reading the “big data news” this morning (via Gil Press) and I came across this essay by E. O. Wilson called “Great Scientist ≠ Good at Math”. In it, he argues that most of the successful scientists he knows aren’t good at math, and he doesn’t see why people get discouraged from being scientists just because they suck at math.

Here’s an important excerpt from the essay:

Over the years, I have co-written many papers with mathematicians and statisticians, so I can offer the following principle with confidence. Call it Wilson’s Principle No. 1: It is far easier for scientists to acquire needed collaboration from mathematicians and statisticians than it is for mathematicians and statisticians to find scientists able to make use of their equations.

Given that he’s written many papers with mathematicians and statisticians, then, he is not claiming that math itself is not part of great science, just that he hasn’t been the one that supplied the mathy bits. I think this is really key.

And it resonates with me: I’ve often said that the cool thing about working on a data science team in industry, for example, is that different people bring different skills to the table. I might be an expert on some machine learning algorithms, while someone else will be a domain expert. The problem requires both skill sets, and perhaps no one person has all that knowledge. Teamwork kinda rocks.

Another thing he exposes with Wilson’s Principle No. 1, though, which doesn’t resonate with me, is a general lack of understanding of what mathematicians are actually trying to accomplish with “their equations”.

It is a common enough misconception to think of the quant as a guy with a bunch of tools but no understanding or creativity. I’ve complained about that before on this blog. But when it comes to professional mathematicians, presumably including his co-authors, a prominent scientist such as Wilson should realize that they are doing creative things inside the realm of mathematics simply for the sake of understanding mathematics.

Mathematicians, as a group, are not sitting around wishing someone could “make use of their equations.” For one thing, they often don’t even think about equations. And for another, they often think about abstract structures with no goal whatsoever of connecting them back to, say, how ants live in colonies. And that’s cool and beautiful too, and it’s not a failure of the system. That’s just math.

I’m not saying it wouldn’t be fun for mathematicians to spend more time thinking about applied science. I think it would be fun for them, actually. Moreover, as the next few years and decades unfold, we might very well see a large-scale shrinkage in math departments and basic research money, which could force the issue.

And, to be fair, there are probably some actual examples of mathy-statsy people who are thinking about equations that are supposed to relate to the real world but don’t. Those guys should learn to be better communicators and pair up with colleagues who have great data. In my experience, this is not a typical situation.

One last thing. The danger in ignoring the math yourself, if you’re a scientist, is that you probably aren’t that great at knowing the difference between someone who really knows math and someone who can throw around terminology. You can’t catch charlatans, in other words. And, given that scientists do need real math and statistics to do their research, this can be a huge problem if your work ends up being meaningless because your team got the math wrong.

Categories: modeling, news, statistics

## Global move to austerity based on mistake in Excel

As Rortybomb reported yesterday on the Roosevelt Institute blog (hat tip Adam Obeng), a recent paper written by Thomas Herndon, Michael Ash, and Robert Pollin looked into replicating the results of an economics paper originally written by Carmen Reinhart and Kenneth Rogoff entitled Growth in a Time of Debt.

The original Reinhart and Rogoff paper had concluded that public debt loads greater than 90 percent of GDP consistently reduce GDP growth, a “fact” which has been widely used. However, the more recent paper finds problems. Here’s the abstract:

Herndon, Ash and Pollin replicate Reinhart and Rogoff and find that coding errors, selective exclusion of available data, and unconventional weighting of summary statistics lead to serious errors that inaccurately represent the relationship between public debt and GDP growth among 20 advanced economies in the post-war period. They find that when properly calculated, the average real GDP growth rate for countries carrying a public-debt-to-GDP ratio of over 90 percent is actually 2.2 percent, not -0.1 percent as published in Reinhart and Rogoff. That is, contrary to RR, average GDP growth at public debt/GDP ratios over 90 percent is not dramatically different than when debt/GDP ratios are lower.

The authors also show how the relationship between public debt and GDP growth varies significantly by time period and country. Overall, the evidence we review contradicts Reinhart and Rogoff’s claim to have identified an important stylized fact, that public debt loads greater than 90 percent of GDP consistently reduce GDP growth.

1) We should always have the data and code for published results.

The way the authors Herndon, Ash and Pollin managed to replicate the results was that they personally requested the Excel spreadsheets from Reinhart and Rogoff. Given how politically useful and important this result has been (see Rortybomb’s explanation of this), it’s kind of a miracle that they released the spreadsheet. Indeed that’s the best part of this story from a scientific viewpoint.

2) The data and code should be open source.

One cool thing is that now you can actually download the data – there’s a link at the bottom of this page. I did this and was happy to have a bunch of csv files and some (open source) R code which presumably recovers the Excel spreadsheet mistakes. I also found some .dta files, which seem to be Stata’s proprietary file type, which is annoying, but then I googled and it seems like you can use R to turn .dta files into csv files. It’s still weird that they wrote code in R but saved files in Stata.

3) These mistakes are easy to make and they’re mostly not considered mistakes.

Let’s talk about the “mistakes” the authors found. First, they excluded certain time periods for certain countries, specifically right after World War II. Second, they chose certain “non-standard” weightings for the various countries they considered. Finally, they accidentally excluded certain rows from their calculation.

Only that last one is considered a mistake by modelers. The others are modeling choices, and they happen all the time. Indeed it’s impossible not to make such choices. Who’s to say that you have to use standard country weightings? Why? How much data do you actually need to consider? Why?

[Aside: I'm sure there are proprietary trading models running right now in hedge funds that anticipate how other people weight countries in standard ways and bet accordingly. In that sense, using standard weightings might be a stupid thing to do. But in any case validating a weighting scheme is extremely difficult. In the end you're trying to decide how much various countries matter in a certain light, and the answer is often that your country matters the most to you.]
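To make the weighting point concrete, here is a minimal sketch of how two defensible averaging schemes can disagree on the same data. The growth figures and country names below are invented for illustration, and neither function is claimed to be the exact calculation used in either paper.

```python
# Hypothetical growth figures (percent) for country-years with debt/GDP over 90%.
# These numbers are invented for illustration only.
growth_by_country = {
    "A": [2.5, 2.4, 2.6, 2.3, 2.7, 2.5, 2.4, 2.6],  # many high-debt years
    "B": [-7.0],                                     # a single high-debt year
    "C": [1.0, 1.2],
}

def mean(values):
    return sum(values) / len(values)

def country_weighted_mean(data):
    """Average each country first, then weight every country equally."""
    return mean([mean(years) for years in data.values()])

def country_year_weighted_mean(data):
    """Pool all country-years and weight every observation equally."""
    pooled = [g for years in data.values() for g in years]
    return mean(pooled)

by_country = country_weighted_mean(growth_by_country)
by_country_year = country_year_weighted_mean(growth_by_country)
print(f"equal weight per country:      {by_country:.2f}")
print(f"equal weight per country-year: {by_country_year:.2f}")
```

With these made-up numbers the first average comes out negative and the second positive, because country B's single bad year counts as a third of the answer in one scheme and as one observation out of eleven in the other. Neither choice is a bug, which is exactly why the choice needs to be reported.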

4) We need to actually consider other modeling possibilities.

It’s not a surprise, to economists anyway, that after you include more post-WWII years of data, which we all know to be high debt and high growth years worldwide, you get a substantively different answer. Excluding these data points is just as much a political decision as a modeling decision.

In the end the only reasonable way to proceed is to describe your choices, and your reasoning, and the result, but also consider other “reasonable” choices and report the results there too. And if you don’t like the answer, or don’t want to do the work, at the very least you need to provide your code and data and let other people check how your result changes with different “reasonable” choices.

Once the community of economists (and other data-centric fields) starts doing this, we will all realize that our so-called “objective results” utterly depend on such modeling decisions, and are about as variable as our own opinions.

5) And this is an easy model.

Think about how many modeling decisions and errors are in more complicated models!

Categories: modeling, news

## War of the machines, college edition

A couple of people have sent me this recent essay (hat tip Leon Kautsky) written by Elijah Mayfield on the education technology blog e-Literate, described on their About page as “a hobby weblog about educational technology and related topics that is maintained by Michael Feldstein and written by Michael and some of his trusted colleagues in the field of educational technology.”

Mayfield’s essay is entitled “Six Ways the edX Announcement Gets Automated Essay Grading Wrong”. He’s referring to the recent announcement, which was written about in the New York Times last week, about how professors will soon be replaced by computers in grading essays. He claims they got it all wrong and there’s nothing to worry about.

I wrote about this idea too, in this post, and he hasn’t addressed my complaints at all.

First, Mayfield’s points:

• Journalists sensationalize things.
• The machine is identifying things in the essays that are associated with good writing vs. bad writing, much like it might learn to distinguish pictures of ducks from pictures of houses.
• It’s actually not that hard to find the duck and has nothing to do with “creativity” (look for webbed feet).
• If the machine isn’t sure it can spit back the essay to the professor to read (if the professor is still employed).
• The machine doesn’t necessarily reward big vocabulary words, except when it does.
• You’d need thousands of training examples (essays on a given subject) to make this actually work.
• What’s really wonderful is that a student can get all his or her many drafts graded instantaneously, which no professor would be willing to do.

Here’s where I’ll start, with this excerpt from near the end:

“Can machine learning grade essays?” is a bad question. We know, statistically, that the algorithms we’ve trained work just as well as teachers for churning out a score on a 5-point scale. We know that occasionally it’ll make mistakes; however, more often than not, what the algorithms learn to do are reproduce the already questionable behavior of humans. If we’re relying on machine learning solely to automate the process of grading, to make it faster and cheaper and enable access, then sure. We can do that.

OK, so we know that the machine can grade essays written for human consumption pretty accurately. But it hasn’t had to deal with essays written for machine consumption yet. There’s major room for gaming here, and only a matter of time before there’s a competing algorithm to build a great essay. I even know how to train that algorithm. Email me privately and we can make a deal on profit-sharing.

And considering that students will be able to get their drafts graded as many times as they want, as Mayfield advertised, this will only be easier. If I build an essay that I think should game the machine, by putting in lots of (relevant) long vocabulary words and erudite phrases, then I can always double check by having the system give me a grade. If it doesn’t work, I’ll try again.

And the essays built this way won’t get caught via the fraud detection software that finds plagiarism, because any good essay-builder will only steal smallish phrases.
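The draft-and-resubmit loop described above can be sketched with a toy example. Everything here is an assumption made for illustration: `toy_grade` is a stand-in that rewards long words and nothing else (real graders use many more features), and `SWAPS` is a hypothetical substitution table.

```python
# Toy stand-in for an automated grader: it rewards long words and nothing else.
# This scoring rule is an assumption, not how any real grading system works.
def toy_grade(essay):
    words = essay.split()
    if not words:
        return 0.0
    long_words = sum(1 for w in words if len(w) >= 8)
    return 5.0 * long_words / len(words)  # score on a 0-5 scale

# Hypothetical substitutions an essay-builder might try.
SWAPS = {
    "use": "utilization",
    "big": "substantial",
    "show": "demonstrate",
    "help": "facilitate",
}

def game_the_grader(essay, target, grade=toy_grade):
    """Swap in one long word at a time, asking the grader again after each swap."""
    words = essay.split()
    score = grade(" ".join(words))
    queries = 1
    for i, word in enumerate(words):
        if score >= target:
            break  # good enough, stop editing
        if word in SWAPS:
            words[i] = SWAPS[word]
            score = grade(" ".join(words))
            queries += 1
    return " ".join(words), score, queries

draft = "we use big data to show results and help people"
final, final_score, n_queries = game_the_grader(draft, target=1.5)
print(final, final_score, n_queries)
```

Note that the loop never checks whether the result still reads as English. It only needs the grader's answer, which is what unlimited instant regrading hands over for free.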

One final point. The fact that the machine-learning grading algorithm only works when it’s been trained on thousands of essays points to yet another depressing trend: large-scale classes with the same exact assignments every semester so last year’s algorithm can be used, in the name of efficiency.

But that means last year’s essay-building algorithm can be used as well. Pretty soon it will just be a war of the machines.

Categories: data science, modeling, musing, news

## A public-facing math panel

I’m returning from two full days of talking to mathematicians and applied mathematicians at Cornell. I was really impressed with the people I met there – thoughtful, informed, and inquisitive – and with the kind reception they gave me.

I gave an “Oliver Talk” which was joint with the applied math colloquium on Thursday afternoon. The goal of my talk was to convince mathematicians that there’s a very bad movement underway whereby models are being used against people, in predatory ways, and in the name of mathematics. I turned some people off, I think, by my vehemence, but then again it’s hard not to get riled up about this stuff, because it’s creepy and I actually think there’s a huge amount at stake.

One thing I did near the end of my talk was bring up (and recruit for) the idea of a panel of mathematicians which defines standards for public-facing models and vets the current crop.

The first goal of such a panel would be to define mathematical models, with a description of “best practices” when modeling people, including things like anticipating impact, gaming, and feedback loops of models, and asking for transparent and ongoing evaluation methods, as well as having minimum standards for accuracy.

The second goal of the panel would be to choose specific models that are in use and measure the extent to which they pass the standards of the above best practices rubric.

So the teacher value-added model, I’d expect, would fail in that it doesn’t have an evaluation method, at least not one that is made public, nor does it seem to have any accuracy standards, even though it’s widely used and is high impact.

I’ve had some pretty amazing mathematicians already volunteer to be on such a panel, which is encouraging. What’s cool is that I think mathematicians, as a group, are really quite ethical and can probably make their voices heard and trusted if they set their minds to it.

Categories: math, modeling

## New creepy model: job hiring software

Warmup: Automatic Grading Models

Before I get to my main take-down of the morning, let me warm up with an appetizer of sorts: have you been hearing a lot about new models that automatically grade essays?

Does it strike you that there’s something wrong with that idea but you don’t know what it is?

Here’s my take. While it’s true that it’s possible to train a model to grade essays similarly to what a professor now does, that doesn’t mean we can introduce automatic grading – at least not if the students in question know that’s what we’re doing.

There’s a feedback loop, whereby if the students know their essays will be automatically graded, then they will change what they’re doing to optimize for good automatic grades rather than, say, a cogent argument.

For example, a student might download a grading app themselves (wouldn’t you?) and run their essay through the machine until it gets a great grade. Not enough long words? Put them in! No need to make sure the sentences make sense, because the machine doesn’t understand grammar!

This is, in fact, a great example where people need to take into account the (obvious when you think about them) feedback loops that their models will enter in actual use.

Job Hiring Models

Now on to the main course.

In this week’s Economist there is an essay about the new widely-used job hiring software and how awesome it is. It’s so efficient! It removes the biases of those pesky recruiters! Here’s an excerpt from the article:

The problem with human-resource managers is that they are human. They have biases; they make mistakes. But with better tools, they can make better hiring decisions, say advocates of “big data”.

So far “the machine” has made observations such as:

• Good if candidate uses browser you need to download like Chrome.
• Not as bad as one might expect to have a criminal record.
• Neutral on job hopping.
• Great if you live nearby.
• Good if you are on Facebook.
• Bad if you’re on Facebook and every other social networking site as well.

Now, I’m all for learning to fight against our biases and hire people that might not otherwise be given a chance. But I’m not convinced that this will happen that often – the people using the software can always train the model to include their biases and then point to the machine and say “The machine told me to do it”. True.

What I really object to, however, is the accumulating amount of data that is being collected about everyone by models like this.

It’s one thing for an algorithm to take my CV in and note that I misspelled my alma mater, but it’s a different thing altogether to scour the web for my online profile trail (via Acxiom, for example), to look up my credit score, and maybe even to see my persistence score as measured by my past online education activities (soon available for your 7-year-old as well!).

As a modeler, I know how hungry the model can be. It will ask for all of this data and more. And it will mean that nothing you’ve ever done wrong, no fuck-up that you wish to forget, will ever be forgotten. You can no longer reinvent yourself.

Forget mobility, forget the American Dream, you and everyone else will be funneled into whatever job and whatever life the machine has deemed you worthy of. WTF.

Categories: data science, modeling, rant

## Hey WSJ, don’t blame unemployed disabled people for the crap economy

This morning I’m being driven crazy by this article in yesterday’s Wall Street Journal entitled “Workers Stuck in Disability Stunt Economic Recovery.”

Even the title makes the underlying goal of the article crystal clear: the lazy disabled workers are to blame for the crap economy. Lest you are unconvinced that anyone could make such an unreasonable claim of causation, here’s a tasty excerpt from the article that spells it out:

Economic growth is driven by the number of workers in an economy and by their productivity. Put simply, fewer workers usually means less growth.

Since the recession, more people have gone on disability, on net, than new workers have joined the labor force. Mr. Feroli estimated the exodus to disability costs 0.6% of national output, equal to about \$95 billion a year.

“The greater cost is their long-term dependency on transfers from the federal government,” Mr. Autor said, “placing strain on the soon-to-be exhausted Social Security Disability trust fund.”

The underlying model here, then, is that there’s a bunch of people who have the choice between going on disability or “joining the labor force” and they’ve all chosen to go on disability. I wonder where their evidence is that people really have that choice, considering the unemployment numbers and participation rate numbers we see nowadays.

For example, the unemployment rate for youths is now 22.9%, and the participation rate for them has gone from 59.2% in December 2007, to 54.5% today. This is probably not because so many kids under the age of 25 are disabled, I suspect. If you look at the overall labor participation rate, it’s dropped from 66.0% in December 2007 to 63.3% in March 2013. Most of the people who have left the work force are also not disabled. They’ve been discouraged for some other mysterious reason. I’m gonna go ahead and guess it’s because they can’t find a job.

This leads me to ask the following question of the journalists LESLIE SCISM and JON HILSENRATH who wrote the article: Where is your evidence of causation??

Here’s another example from the article of a seriously fucked-up understanding of cause and effect:

With overall participation down, the labor force—a measure of people working and people looking for work—is barely growing.

They consistently paint the picture whereby people decide to stop working, and then yucky things happen, in this case the labor force stops growing. Damn those lazy people.

They even bring in a fancy word from physics to describe the problem, namely hysteresis. Now, they didn’t understand or correctly define the term, but it doesn’t really matter, because the point of using a fancy term from physics was not to add to the clarity of the argument but rather to impress.

The goal here is, in fact, that if enough economists use sophisticated language to describe the various effects, we will all be able to blame people with bad backs, making \$13.6K per year, for why our economy sucks, rather than the rich assholes in finance who got us into this mess and are currently buying \$2 million personal offices instead of going to jail.

Just to be clear, that’s \$1,130 a month, which I guess represents so enticing a lifestyle that the people currently enjoying it ‘are “pretty unlikely to want to forfeit economic security for a precarious job market”‘ according to M.I.T. economist David Autor. I’d love to have David Autor spell out, for us, exactly what’s economically secure about that kind of monthly check.

The rest of the article is in large part a description of how people get onto SSDI, insinuating that the people currently on it are not really all that disabled or worthy of living high on the hog, and are in any case never ever leaving.

How’s this for a slightly different take on the situation: there are of course some people who are faking something, that’s always the case. But in general, the people on SSDI need to be there, and before the recession might have had the kind of employers who kept them on even though they often called in sick, out of loyalty and kindness, because they didn’t want to fire them. But when the recession struck those employers had to cut them off, or they went out of business completely. Now those people can’t find work and don’t have many options. In other words, the recession caused the SSDI program to grow. That doesn’t mean it caused a bunch of people to get sick, but it does mean that sick people are more dependent on SSDI because there are fewer options.

By the way, read the comments of this article, there are some really good ones (“What were people with injuries and no high-value job skills to do? Is the number of people in the social security disability program the problem or the symptom?”) as well as some really outrageous ones (‘The current situation makes the picture of the “Welfare Queen” of the 1980s look like an honest citizen’).

Categories: modeling, news, rant

## Bob Fischer talks about climate modeling at Occupy today

I’m really excited to be going to the pre-meeting talk of my Occupy group today. We’re having a talk by Bob Fischer, who is a post-doc at NASA GISS, a laboratory focused largely on climate change up here in the Columbia neighborhood.

He is coming to talk with us about his work investigating the long-term behavior of ice sheets in a changing climate. Before joining GISS, Dr. Fischer was a quant on Wall Street, working on statistical arbitrage, trade execution, simulation/modeling platforms, signal development, and options trading. I met him when we were both students at math camp in 1988, but we reconnected this past summer at the reunion.

The actual title of his talk is “The History of CO2: Past, Present and Future” and it’s open to the public, so please come if you can (it’s at 2:00 pm in room 409 here but more details are here).

After Bob, we’ll be having our usual Occupy meeting. Topics this week include our plans for a Citigroup and HSBC picket later this month, our panel submissions to the Left Forum in June, our plans for May Day, and continued work on writing a book modeled after the Debt Resistor’s Operations Manual.

Housekeeping – RSS feed for mathbabe broken, possibly fixed

I’ve been trying to address the problem people have been having with their RSS feed for mathbabe. Thanks to my nerd-girl friend Jennifer Rubinovitz, I’ve changed some of my WordPress settings and now I can view all of my posts when I open up RSSOwl. But in order for your reader to get caught up I have a feeling you’ll need to somehow refresh it or maybe get rid of mathbabe and then re-subscribe. I’ll update as I learn more (please tell me what’s working for you!).

Categories: #OWS, modeling

## Guest post by Julia Evans: How I got a data science job

This is a guest post by Julia Evans. Julia is a data scientist & programmer who lives in Montréal. She spends her free time these days playing with data and running events for women who program or want to. She just started a Montréal chapter of pyladies to teach programming, and co-organizes a monthly meetup called Montréal All-Girl Hack Night for women who are developers.

I asked mathbabe a question a few weeks ago saying that I’d recently started a data science job without having too much experience with statistics, and she asked me to write something about how I got the job. Needless to say I’m pretty honoured to be a guest blogger here. Hopefully this will help someone!

Last March I decided that I wanted a job playing with data, since I’d been playing with datasets in my spare time for a while and I really liked it. I had a BSc in pure math, a MSc in theoretical computer science and about 6 months of work experience as a programmer developing websites. I’d taken one machine learning class and zero statistics classes.

In October, I left my web development job with some savings and no immediate plans to find a new job. I was thinking about doing freelance web development. Two weeks later, someone posted a job posting to my department mailing list looking for a “Junior Data Scientist”. I wrote back and said basically “I have a really strong math background and am a pretty good programmer”. This email included, embarrassingly, the sentence “I am amazing at math”. They said they’d like to interview me.

The interview was a lunch meeting. I found out that the company (Via Science) was opening a new office in my city, and was looking for people to be the first employees at the new office. They work with clients to make predictions based on their data.

My interviewer (now my manager) asked me about my role at my previous job (a little bit of everything — programming, system administration, etc.), my math background (lots of pure math, but no stats), and my experience with machine learning (one class, and drawing some graphs for fun). I was asked how I’d approach a digit recognition problem and I said “well, I’d see what people do to solve problems like that, and I’d try that”.

I also talked about some data visualizations I’d worked on for fun. They were looking for someone who could take on new datasets and be independent and proactive about creating models, figuring out what is the most useful thing to model, and getting more information from clients.

I got a call back about a week after the lunch interview saying that they’d like to hire me. We talked a bit more about the work culture, starting dates, and salary, and then I accepted the offer.

So far I’ve been working here for about four months. I work with a machine learning system developed inside the company (there’s a paper about it here). I’ve spent most of my time working on code to interface with this system and make it easier for us to get results out of it quickly. I alternate between working on this system (using Java) and using Python (with the fabulous IPython Notebook) to quickly draw graphs and make models with scikit-learn to compare our results.

I like that I have real-world data (sometimes, lots of it!) where there’s not always a clear question or direction to go in. I get to spend time figuring out the relevant features of the data or what kinds of things we should be trying to model. I’m beginning to understand what people say about data-wrangling taking up most of their time. I’m learning some statistics, and we have a weekly Friday seminar series where we take turns talking about something we’ve learned in the last few weeks or introducing a piece of math that we want to use.

Overall I’m really happy to have a job where I get data and have to figure out what direction to take it in, and I’m learning a lot.

## K-Nearest Neighbors: dangerously simple

I spend my time at work nowadays thinking about how to start a company in data science. Since there are tons of companies now collecting tons of data, and they don’t know what to do with it, nor who to ask, part of me wants to design (yet another) dumbed-down “analytics platform” so that business people can import their data onto the platform, and then perform simple algorithms themselves, without even having a data scientist to supervise.

After all, a good data scientist is hard to find. Sometimes you don’t even know if you want to invest in this whole big data thing, you’re not sure the data you’re collecting is all that great or whether the whole thing is just a bunch of hype. It’s tempting to bypass professional data scientists altogether and try to replace them with software.

I’m here to say, it’s not clear that’s possible. Even the simplest algorithm, like k-Nearest Neighbor (k-NN), can be naively misused by someone who doesn’t understand it well. Let me explain.

Say you have a bunch of data points, maybe corresponding to users on your website. They have a bunch of attributes, and you want to categorize them based on their attributes. For example, they might be customers that have spent various amounts of money on your product, and you can put them into “big spender”, “medium spender”, “small spender”, and “will never buy anything” categories.

What you really want, of course, is a way of anticipating the category of a new user before they’ve bought anything, based on what you know about them when they arrive, namely their attributes. So the problem is, given a user’s attributes, what’s your best guess for that user’s category?

Let’s use k-Nearest Neighbors. Let k be 5 and say there’s a new customer named Monica. Then the algorithm searches for the 5 customers closest to Monica, i.e. most similar to Monica in terms of attributes, and sees what categories those 5 customers were in. If 4 of them were “medium spenders” and 1 was “small spender”, then your best guess for Monica is “medium spender”.
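To show just how little there is to it, here is a minimal sketch of that majority vote in plain Python. The customer list, the two-attribute vectors, and the function name `knn_predict` are all made up for illustration, and “closest” here means Euclidean distance on the raw numbers, which is exactly the default choice worth being suspicious of.

```python
import math
from collections import Counter

def knn_predict(labeled_points, query, k=5):
    """Guess the category of `query` by majority vote among its k nearest labeled points."""
    # "Nearest" here is plain Euclidean distance on the raw attribute vectors.
    by_distance = sorted(labeled_points, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical customers: (attribute vector, category)
customers = [
    ((1.0, 1.0), "medium spender"),
    ((1.2, 0.9), "medium spender"),
    ((0.8, 1.1), "medium spender"),
    ((1.1, 1.3), "medium spender"),
    ((0.9, 0.7), "small spender"),
    ((5.0, 5.0), "big spender"),
    ((5.2, 4.8), "big spender"),
]

monica = (1.0, 1.0)
guess = knn_predict(customers, monica, k=5)
print(guess)  # 4 of the 5 nearest are medium spenders, so: medium spender
```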

Holy shit, that was simple! Mathbabe, what’s your problem?

The devil is all in the detail of what you mean by close. And to make things trickier, as in deceptively easy, there are default choices you could make (and which you would make) which would probably be totally stupid. Namely, the raw numbers and Euclidean distance.

So, for example, say your customer attributes were: age, salary, and number of previous visits to your website. Don’t ask me how you know your customer’s salary, maybe you bought info from Acxiom.

So in terms of attribute vectors, Monica’s might look like:

$(22.0, 55000.0, 0.0)$

And the nearest neighbor to Monica might look like:

$(75.0, 54000.0, 35.0)$

In other words, because you’re including the raw salary numbers, you are thinking of Monica, who is 22 and new to the site, as close to a 75-year old who comes to the site a lot. The salary, being of a much larger scale, is totally dominating the distance calculation. You might as well have only that one attribute and scrap the others.
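You can check how lopsided this is by splitting the squared Euclidean distance into its three pieces. This is just arithmetic on the two vectors above; the variable names are mine.

```python
import math

monica = (22.0, 55000.0, 0.0)      # age, salary, previous visits
neighbor = (75.0, 54000.0, 35.0)

# Squared gap per attribute: age, salary, visits
squared_gaps = [(a - b) ** 2 for a, b in zip(monica, neighbor)]
distance = math.sqrt(sum(squared_gaps))
salary_share = squared_gaps[1] / sum(squared_gaps)

print(squared_gaps)                                         # [2809.0, 1000000.0, 1225.0]
print(f"distance = {distance:.1f}")                         # distance = 1002.0
print(f"salary share of squared distance = {salary_share:.3f}")  # 0.996
```

A 53-year age gap and 35 visits together account for well under 1% of the squared distance.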

Note: you would not necessarily think about this problem if you were just pressing a big button on a dashboard called “k-NN me!”

Of course, it gets trickier. Even if you measured salary in thousands (so Monica would now be given the attribute vector $(22.0, 55.0, 0.0)$) you still don’t know if that’s the right scaling. In fact, if you think about it, the algorithm’s results completely depend on how you scale these numbers, and there’s almost no way to reasonably visualize it even, to do it by hand, if you have more than 4 attributes.
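Here is a small sketch of how the answer flips with the units. The “retiree” is the neighbor from above; the “peer” is a hypothetical second customer I made up, a 23-year-old with one previous visit and a salary of 58000. Nothing about the customers changes between the two runs, only the scale on the salary column.

```python
import math

def nearest(query, candidates, scale):
    """Return the name of the candidate closest to `query` after multiplying each attribute by `scale`."""
    def rescale(vector):
        return tuple(x * s for x, s in zip(vector, scale))
    q = rescale(query)
    return min(candidates, key=lambda name: math.dist(rescale(candidates[name]), q))

monica = (22.0, 55000.0, 0.0)  # age, salary in dollars, previous visits
candidates = {
    "retiree": (75.0, 54000.0, 35.0),  # the neighbor from the text
    "peer": (23.0, 58000.0, 1.0),      # hypothetical: similar age, also nearly new to the site
}

raw_choice = nearest(monica, candidates, scale=(1.0, 1.0, 1.0))
thousands_choice = nearest(monica, candidates, scale=(1.0, 0.001, 1.0))
print(raw_choice, thousands_choice)  # retiree peer
```

Neither answer is “the” right one; the point is only that the choice of units decides it.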

Another problem is redundancy – if you have a bunch of attributes that are essentially redundant, i.e. that are highly correlated to each other, then including them all is tantamount to multiplying the scale of that factor.
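The extreme case makes this concrete: if an attribute appears as several exact copies, Euclidean distance behaves as if that attribute had been scaled up by the square root of the number of copies. Real redundant attributes are only highly correlated, not identical, so treat this as the idealized version of the problem. The two points below are made up.

```python
import math

a = (3.0, 10.0)
b = (5.0, 13.0)

# Include the first attribute four times, as if four redundant columns measured the same thing.
a_dup = (a[0],) * 4 + (a[1],)
b_dup = (b[0],) * 4 + (b[1],)

# Alternatively, keep one copy but multiply it by sqrt(4) = 2.
a_scaled = (2.0 * a[0], a[1])
b_scaled = (2.0 * b[0], b[1])

d_plain = math.dist(a, b)
d_dup = math.dist(a_dup, b_dup)
d_scaled = math.dist(a_scaled, b_scaled)
print(d_plain, d_dup, d_scaled)  # the last two agree (both 5, up to rounding)
```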

Another problem is that not all your attributes are even numbers; some of them are strings. You might think you can solve this by using 0’s and 1’s, but in the case of k-NN, that becomes just another scaling problem.
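To see why the 0/1 trick is just another scaling decision, here is a sketch with one numeric attribute (age) and one string attribute (a hypothetical "yes"/"no" membership flag). The only thing that changes between the two runs is which number stands in for "yes".

```python
import math

def encode(customer, member_weight):
    """Turn (age, membership string) into a numeric vector; `member_weight` is the number standing in for "yes"."""
    age, is_member = customer
    return (age, member_weight if is_member == "yes" else 0.0)

def nearest(query, candidates, member_weight):
    """Name of the candidate closest to `query` under the chosen encoding."""
    q = encode(query, member_weight)
    return min(candidates, key=lambda name: math.dist(encode(candidates[name], member_weight), q))

new_user = (30.0, "yes")
candidates = {
    "same_age_non_member": (31.0, "no"),
    "older_member": (40.0, "yes"),
}

with_0_1 = nearest(new_user, candidates, member_weight=1.0)
with_0_100 = nearest(new_user, candidates, member_weight=100.0)
print(with_0_1, with_0_100)  # same_age_non_member older_member
```

Coding membership as 0/1 quietly declares that being a member matters about as much as one year of age.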

One way around this might be to first use some kind of dimension-reducing algorithm, like PCA, to figure out what attribute combinations to actually use from the get-go. That’s probably what I’d do.
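As a rough sketch of what that looks like, the code below standardizes three made-up attributes (salary and spending in thousands, which track each other closely, plus number of visits) and finds the first principal component by power iteration on the covariance matrix. Power iteration is my stand-in to keep the example dependency-free; in practice you would reach for a library PCA routine. Note that standardizing first is itself a scaling choice, so this moves the problem rather than dissolving it.

```python
import math

def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def covariance(rows):
    """Covariance matrix of rows whose columns already have mean 0."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(d)] for i in range(d)]

def first_component(cov, iterations=200):
    """Leading eigenvector of `cov`, found by repeated multiplication and renormalization."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical customers: (salary in thousands, spending in thousands, previous visits)
data = [
    (40.0, 20.0, 8.0),
    (55.0, 27.0, 1.0),
    (70.0, 36.0, 5.0),
    (85.0, 42.0, 9.0),
    (100.0, 51.0, 2.0),
    (115.0, 57.0, 6.0),
]

pc1 = first_component(covariance(standardize(data)))
print([round(x, 2) for x in pc1])
```

The first component loads almost equally on salary and spending and barely at all on visits, which is the algorithm’s way of telling you those two columns are one attribute counted twice.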

But that means you’re using a fancy algorithm in order to use a completely stupid algorithm. Not that there’s anything wrong with that, but it indicates the basic problem, which is that doing data analysis carefully is actually pretty hard and maybe should be done by professionals, or at least under the supervision of one.

Categories: data science, modeling

## We don’t need more complicated models, we need to stop lying with our models

The financial crisis has given rise to a series of catastrophes related to mathematical modeling.

Time after time you hear people speaking in baffled terms about mathematical models that somehow didn’t warn us in time, that were too complicated to understand, and so on. If you have somehow missed such public displays of throwing the model (and quants) under the bus, stay tuned below for examples.

A common response to these problems is to call for those models to be revamped, to add features that will cover previously unforeseen issues, and generally speaking, to make them more complex.

For a person like myself, who gets paid to “fix the model,” it’s tempting to do just that, to assume the role of the hero who is going to set everything right with a few brilliant ideas and some excellent training data.

Unfortunately, reality is staring me in the face, and it’s telling me that we don’t need more complicated models.

If I go to the trouble of fixing up a model, say by adding counterparty risk considerations, then I’m implicitly assuming the problem with the existing models is that they’re being used honestly but aren’t mathematically up to the task.

But this is far from the case – most of the really enormous failures of models are explained by people lying. Before I give three examples of the “big models failing because someone is lying” phenomenon, let me add one more important thing.

Namely, if we replace okay models with more complicated models, as many people are suggesting we do, without first addressing the lying problem, it will only allow people to lie even more. This is because the complexity of a model itself is an obstacle to understanding its results, and more complex models allow more manipulation.

Example 1: Municipal Debt Models

Many municipalities are in shit tons of problems with their muni debt. This is in part because of the big banks taking advantage of them, but it’s also in part because they often lie with models.

Specifically, they know what their obligations for pensions and school systems will be in the next few years, and in order to pay for all that, they use a model which estimates how well their savings will pay off in the market, or however they’ve invested their money. But they use vastly over-exaggerated numbers in these models, because that way they can minimize the amount of money to put into the pool each year. The result is that pension pools are being systematically and vastly under-funded.
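To see how much leverage the assumed return has, here is a back-of-the-envelope sketch using the standard formula for a level annual payment that grows to a target amount. Every number in it (the 100 million obligation, the 20-year horizon, the 8% and 4% returns) is hypothetical and chosen only to illustrate the mechanism, not taken from any actual municipality.

```python
def required_contribution(target, assumed_return, years):
    """Level end-of-year payment that grows to `target` after `years` at `assumed_return`."""
    growth = (1.0 + assumed_return) ** years
    return target * assumed_return / (growth - 1.0)

target = 100.0  # hypothetical obligation, in millions, due in 20 years
optimistic = required_contribution(target, assumed_return=0.08, years=20)
cautious = required_contribution(target, assumed_return=0.04, years=20)

print(f"assume 8%: pay {optimistic:.2f} million a year")
print(f"assume 4%: pay {cautious:.2f} million a year")
print(f"shortfall per year if 4% is what actually happens: {cautious - optimistic:.2f} million")
```

Plugging in the rosier return makes this year’s budget look better, and the gap only shows up years later.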

Example 2: Wealth Management

I used to work at Riskmetrics, where I saw first-hand how people lie with risk models. But that’s not the only thing I worked on. I also helped out building an analytical wealth management product. This software was sold to banks, and was used by professional “wealth managers” to help people (usually rich people, but not mega-rich people) plan for retirement.

We had a bunch of bells and whistles in the software to impress the clients – Monte Carlo simulations, fancy optimization tools, and more. But in the end, the banks and their wealth managers put in their own market assumptions when they used it. Specifically, they put in the forecast market growth for stocks, bonds, alternative investing, etc., as well as the assumed volatility of those categories and indeed the entire covariance matrix representing how correlated the market constituents are to each other.

The result is this: no matter how honest I would try to be with my modeling, I had no way of preventing the model from being misused and misleading to the clients. And it was indeed misused: wealth managers put in absolutely ridiculous assumptions of fantastic returns with vanishingly small risk.

Example 3: JP Morgan’s Whale Trade

I saved the best for last. JP Morgan’s actions around their \$6.2 billion trading loss, the so-called “Whale Loss”, were investigated recently by a Senate Subcommittee. This is an excerpt (page 14) from the resulting report, which is well worth reading in full:

While the bank claimed that the whale trade losses were due, in part, to a failure to have the right risk limits in place, the Subcommittee investigation showed that the five risk limits already in effect were all breached for sustained periods of time during the first quarter of 2012. Bank managers knew about the breaches, but allowed them to continue, lifted the limits, or altered the risk measures after being told that the risk results were “too conservative,” not “sensible,” or “garbage.” Previously undisclosed evidence also showed that CIO personnel deliberately tried to lower the CIO’s risk results and, as a result, lower its capital requirements, not by reducing its risky assets, but by manipulating the mathematical models used to calculate its VaR, CRM, and RWA results. Equally disturbing is evidence that the OCC was regularly informed of the risk limit breaches and was notified in advance of the CIO VaR model change projected to drop the CIO’s VaR results by 44%, yet raised no concerns at the time.

I don’t think there could be a better argument explaining why new risk limits and better VaR models won’t help JPM or any other large bank. The manipulation of existing models is what’s really going on.

Just to be clear on the models and modelers as scapegoats, even in the face of the above report, please take a look at minute 1:35:00 of the C-SPAN coverage of former CIO head Ina Drew’s testimony when she’s being grilled by Senator Carl Levin (hat tip Alan Lawhon, who also wrote about this issue here).

Ina Drew firmly shoves the quants under the bus, pretending to be surprised by the failures of the models even though, considering she’d been at JP Morgan for 30 years, she might know just a thing or two about how VaR can be manipulated. Why hasn’t Sarbanes-Oxley been used to put that woman in jail? She’s not even at JP Morgan anymore.

Stick around for a few minutes in the testimony after Levin’s done with Drew, because he’s on a roll and it’s awesome to watch.

Categories: finance, modeling, news, rant, statistics

## Value-added model doesn’t find bad teachers, causes administrators to cheat

There’ve been a couple of articles in the past few days about teacher Value-Added Testing that have enraged me.

If you haven’t been paying attention, the Value-Added Model (VAM) is now being used in a majority of the states (source: the Economist):

But it gives out nearly random numbers, as gleaned from looking at the same teachers with two scores (see this previous post). There’s a 24% correlation between the two numbers. Note that some people are awesome with respect to one score and complete shit on the other score:

Final thing you need to know about the model: nobody really understands how it works. It relies on error terms of an error-riddled model. It’s opaque, and no teacher can have their score explained to them in Plain English.

Now, with that background, let’s look into these articles.

First, there’s this New York Times article from yesterday, entitled “Curious Grade for Teachers: Nearly All Pass”. The article describes how teachers are nowadays being judged using a (usually) 50/50 combination of classroom observations and VAM scores. This is different from the past, when evaluations were based only on classroom observations.

What they’ve found is that the percentage of teachers found “effective or better” has stayed high in spite of the new system – the numbers are all over the place but typically between 90 and 99 percent of teachers. In other words, the number of teachers that are fingered as truly terrible hasn’t gone up too much. What a fucking disaster, at least according to the NYTimes, which seems to go out of its way to make its readers understand how very much high school teachers suck.

1. Given that the VAM is nearly a random number generator, this is good news – it means they are not trusting the VAM scores blindly. Of course, it still doesn’t mean that the right teachers are getting fired, since half of the score is random.
2. Another point the article mentions is that failing teachers are leaving before the reports come out. We don’t actually know how many teachers are affected by these scores.
3. Anyway, what is the right number of teachers to fire each year, New York Times? And how did you choose that number? Oh wait, you quoted someone from the Brookings Institute: “It would be an unusual profession that at least 5 percent are not deemed ineffective.” Way to explain things so scientifically! It’s refreshing to know exactly how the army of McKinsey alums approach education reform.
4. The overall article gives us the impression that if we were really going to do our job and “be tough on bad teachers,” then we’d weight the Value-Added Model way more. But instead we’re being pussies. Wonder what would happen if we weren’t pussies?

The second article explained just that. It also came from the New York Times (h/t Suresh Naidu), and it was the story of a School Chief in Atlanta who took the VAM scores very very seriously.

What happened next? The teachers cheated wildly, changing the answers on their students’ tests. There was a big cover-up, lots of nasty political pressure, and a lot of good people feeling really bad, blah blah blah. But maybe we can take a step back and think about why this might have happened. Can we do that, New York Times? Maybe it had to do with the \$500,000 in “performance bonuses” that the School Chief got for such awesome scores?

Let’s face it, this cheating scandal, and others like it (which may never come to light), was not hard to predict (as I explain in this post). In fact, as a predictive modeler, I’d argue that this cheating problem is the easiest thing to predict about the VAM, considering how it’s being used as an opaque mathematical weapon.

## Leila Schneps is a mystery writer!

I’m back! I missed you guys bad.

My experience with Seattle in the last 8 days has convinced me of something I rather suspected, namely I’m a huge New York snob and can’t exist happily anywhere else. I will spare you the details (they have to do with cars, subways, and being an asshole pedestrian) but suffice it to say, glad to be home.

Just a few caveats on complaining about my vacation:

1. I enjoyed visiting the University of Washington and giving the math colloquium there as well as a “Math Day” talk where I showed kids the winning strategy for Nim (as well as other impartial two-player games) following my notes from last summer.
2. I enjoyed reading Leon and Becky’s guest posts. Thanks guys!
3. And then there was the time spent with my darling family. Of course, goes without saying, it’s always magical to get to the point where your kids have invented a whole new language of insults after you’ve outlawed certain words: “Shut your fidoodle, you syncopathic lardle!”

Of all the topics I want to write about today, I’ve decided to go with the most immediate and surprising one: Leila Schneps is now a mystery writer! How cool is that? She’s written a book with her daughter, Math on Trial: How Numbers Get Used and Abused in the Courtroom, currently in stock and available on Amazon. And she wrote an op-ed for the New York Times talking about it (hat tip Chris Wiggins).

I know Leila from having been her grad student assistant at the GWU Summer Program for Women in Math the first year it existed, in 1995. She taught undergrads about Galois cohomology and interpreted elements of $H^1$ as twists and elements of $H^2$ as obstructions and then had them do a bunch of examples for homework with me. It was pretty awesome, and I learned a ton. Leila is also a regular and fantastic commenter on mathbabe.

I love the premise of the book she’s written. She finds a bunch of historical examples where mathematics is used in trials to the detriment of justice, and people get unfairly jailed (or, less often, let free). From the op-ed (emphasis mine):

Decades ago, the Harvard law professor Laurence H. Tribe wrote a stinging denunciation of the use of mathematics at trial, saying that the “overbearing impressiveness” of numbers tends to “dwarf” other evidence. But we neither can nor should throw math out of the courtroom. Advances in forensics, which rely on data analysis for everything from gunpowder to DNA, mean that quantitative methods will play an ever more important role in judicial deliberations.

The challenge is to make sure that the math behind the legal reasoning is fundamentally sound. Good math can help reveal the truth. But in inexperienced hands, math can become a weapon that impedes justice and destroys innocent lives.

Go Leila!

Categories: math, modeling, women in math

## Hackprinceton

March 25, 2013

He-Yo

This Friday, I’ll be participating at HackPrinceton.

My team will be training an EEG to recognize yes and no thoughts for particular electromechanical devices and creating a general human brain interface (HBI) architecture.

We’ll be working on allowing you to turn on your phone and navigate various menus with your mind!

There’s lots of cool swag and prizes – the best being jobs at Google and Microsoft. Everyone on the team has experience in the field,* but of course the more the merrier and you’re welcome no matter what you bring (or don’t bring!) to the table.

If you’re interested, email leon.kautsky@gmail.com ASAP!

*So far we’ve got a math Ph.D., a mech engineer, some CS/Operations Research guys and while my field is finance I picked up some neuro/machine learning along the way. If you have nothing to do for the next three days and want to learn something specifically for this competition, I recommend checking out my personal favorites: neurofocus.com, frontiernerds.com or neurogadget.com.