### Archive

Archive for the ‘data science’ Category

## WSJ: “When Your Boss Makes You Pay for Being Fat”

Going along with the theme of shaming which I took up yesterday, there was a recent Wall Street Journal article called “When Your Boss Makes You Pay for Being Fat” about new ways employers are trying to “encourage healthy living”, or otherwise described, “save money on benefits”. From the article:

Until recently, Michelin awarded workers automatic \$600 credits toward deductibles, along with extra money for completing health-assessment surveys or participating in a nonbinding “action plan” for wellness. It adopted its stricter policy after its health costs spiked in 2012.

Now, the company will reward only those workers who meet healthy standards for blood pressure, glucose, cholesterol, triglycerides and waist size—under 35 inches for women and 40 inches for men. Employees who hit baseline requirements in three or more categories will receive up to \$1,000 to reduce their annual deductibles. Those who don’t qualify must sign up for a health-coaching program in order to earn a smaller credit.

• This policy combines the critical characteristics of shaming, namely 1) a complete lack of empathy and 2) the shifting of blame for a problem entirely onto one segment of the population even though the “obesity epidemic” is a poorly understood cultural phenomenon.
• To the extent that there may be push-back against this or similar policies inside the workplace, there will be very little to stop employers from not hiring fat people in the first place.
• Or for that matter, what’s going to stop employers from using people’s full medical profiles (note: by this I mean the unregulated online profile that Acxiom and other companies collect about you and then sell to employers or advertisers for medical stuff – not the official medical records which are regulated) against them in the hiring process? Who owns the new-fangled health analytics models anyway?
• We do that already to poor people by basing their acceptance on credit scores.
Categories: data science, modeling

## E-discovery and the public interest (part 2)

Yesterday I wrote this short post about my concerns about the emerging field of e-discovery. As usual the comments were amazing and informative. By the end of the day yesterday I realized I needed to make a much more nuanced point here.

Namely, I see a tacit choice being made, probably by judges or court-appointed “experts”, on how machine learning is used in discovery, and I think that the field could get better or worse. I think we need to urgently discuss this matter, before we wander into a crazy place.

And to be sure, the current discovery process is fraught with opacity and human judgment, so complaining about those features being present in a machine learning version of discovery is unreasonable – the question is whether it’s better or worse than the current system.

Making it worse: private code, opacity

The way I see it, if we allow private companies to build black box machines that we can’t peer into, nor keep track of as they change versions, then we’ll never know why a given set of documents was deemed “relevant” in a given case. We can’t, for example, check to see if the code was modified to be more friendly to a given side.

Competition for clients is a healthy response to this new revenue source. Beyond that, though, the resulting feedback loop will likely be a negative one, whereby private companies use the cheapest version they can get away with to achieve the best results (for their clients) that they can argue for.

Making it better: open source code, reproducibility

What we should be striving for is to use only open source software, saved in a repository so we can document exactly what happened with a given corpus and a given version of the tools. There will still be an industry in cleaning the data and feeding in the documents, training the algorithm (whilst documenting how that works), and interpreting the results. Data scientists will still get paid.

In other words, instead of asking for interpretability, which is a huge ask considering the massive scale of the work being done, we should, at the very least, be able to ask for reproducibility of the e-discovery, as well as transparency in the code itself.

Why reproducibility? Then we can go back in time, or rather scholars can, and test how things might have changed if a different version of the code were used, for example. This could create a feedback loop crucial to improving the code itself over time, and to improving best practices for using that code.
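To make the reproducibility idea a bit more concrete, here’s a minimal sketch in Python of the kind of record you’d want to save alongside each e-discovery run: a fingerprint of the corpus plus the exact tool version and settings. The tool name, version number, and parameters below are made up for illustration; no actual e-discovery product is being described.

```python
import hashlib
import json


def corpus_fingerprint(documents):
    """Return a SHA-256 digest over the documents, independent of their order."""
    digests = sorted(
        hashlib.sha256(doc.encode("utf-8")).hexdigest() for doc in documents
    )
    return hashlib.sha256("".join(digests).encode("utf-8")).hexdigest()


def make_manifest(documents, tool_name, tool_version, parameters):
    """Bundle what someone would need to rerun the same review later."""
    return {
        "corpus_sha256": corpus_fingerprint(documents),
        "document_count": len(documents),
        "tool": tool_name,
        "tool_version": tool_version,
        "parameters": parameters,
    }


# Hypothetical toy corpus, tool name, and settings, for illustration only.
docs = ["Re: pricing call on Tuesday", "Lunch order", "Draft merger term sheet"]
manifest = make_manifest(docs, "open-review-tool", "1.4.2", {"threshold": 0.5})
print(json.dumps(manifest, indent=2, sort_keys=True))
```

If the corpus or the tool version changes, the manifest changes with it, which is what lets a scholar later ask "what would a different version have done with these same documents?"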

## E-discovery and the public interest

Today I want to bring up a few observations and concerns I have about the emergence of a new field in machine learning called e-discovery. It’s the algorithmic version of discovery, so I’ll start there.

Discovery is part of the process in a lawsuit where relevant documents are selected, pored over, and then handed to the other side. Nowadays, of course, there are more and more documents, almost all electronic, typically including lots of e-mails.

If you’re talking about a big lawsuit, there could be literally millions of documents to wade through, which is incredibly expensive and time-consuming for humans to do. Enter the algorithm.

With advances in Natural Language Processing (NLP), a machine algorithm can sort emails or documents by topic (after getting the documents into machine-readable form, cleaning, and deduping) and can in general do a pretty good job of figuring out whether a given email is “relevant” to the case.

And this is already happening – the Wall Street Journal recently reported that the Justice Department allowed e-discovery for a case involving the merger of two beer companies. From the article:

With the blessing of the Justice Department’s antitrust division, the lawyers loaded the documents into a program and manually reviewed a batch to train the software to recognize relevant documents. The manual review was repeated until the Justice Department and Constellation were satisfied that the program could accurately predict relevance in the rest of the documents. Lawyers for Constellation and Crown Imports used software developed by kCura Corp., which lists the Justice Department as a client.

In the end, Constellation and Crown Imports turned over hundreds of thousands of documents to antitrust investigators.
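For readers who haven’t seen this kind of thing, here’s a toy sketch in Python of the "manually review a batch, then let the software predict the rest" idea, using a tiny naive Bayes classifier. To be clear, this is my own illustration under simplifying assumptions: I have no idea what method kCura’s software actually uses, and the "documents" below are invented.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train(labeled_docs):
    """labeled_docs: list of (text, is_relevant) pairs reviewed by hand."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def log_score(text, label, model):
    """Log of (class prior) x (smoothed word likelihoods) for one label."""
    word_counts, doc_counts, vocab = model
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    for word in tokenize(text):
        if word in vocab:  # words never seen in training are ignored
            # Add-one smoothing so an unseen word/label pair is not fatal.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
    return score


def is_relevant(text, model):
    return log_score(text, True, model) > log_score(text, False, model)


# Hypothetical hand-reviewed batch (invented for illustration).
reviewed = [
    ("merger pricing strategy for beer distribution", True),
    ("antitrust concerns about the merger pricing", True),
    ("lunch order for friday", False),
    ("office party friday afternoon", False),
]
model = train(reviewed)
print(is_relevant("notes on merger pricing", model))
print(is_relevant("friday lunch plans", model))
```

Even in this toy version, notice that whoever picks and labels the training batch decides what "relevant" means for everything else.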

Here are some of my questions/ concerns:

• These algorithms are typically not open source – companies like kCura make good money doing these jobs.
• That means that they could be wrong, possibly in subtle ways.
• Or maybe not so subtle ways: maybe they’ve been trained to find documents that are both “relevant” and “positive” for a given side.
• In any case, the laws of this country will increasingly depend on a black box algorithm that is not accessible to the average citizen.
• Is that in the public’s interest?
• Is that even constitutional?

## The NYC Data Skeptics Meetup

One thing I’m super excited about at work is the new NYC Data Skeptics Meetup we’re organizing. Here’s the description of our mission:

The hype surrounding Big Data and Data Science is at a fever pitch with promises to solve the world’s business and social problems, large and small. How accurate or misleading is this message? How is it helping or damaging people, and which people? What opportunities exist for data nerds and entrepreneurs that examine the larger issues with a skeptical view?

This Meetup focuses on mathematical, ethical, and business aspects of data from a skeptical perspective. Guest speakers will discuss the misuse of and best practices with data, common mistakes people make with data and ways to avoid them, how to deal with intentional gaming and politics surrounding mathematical modeling, and taking into account the feedback loops and wider consequences of modeling. We will take deep dives into models in the fields of Data Science, statistics, finance, economics, healthcare, and public policy.

This is an independent forum and open to anyone sharing an interest in the larger use of data. Technical aspects will be discussed, but attendees do not need to have a technical background.

A few things:

• I wouldn’t blame you for not joining until we have a confirmed speaker, so please suggest speakers for us! I have a bunch of people in mind I’d absolutely love to see but I’d love more ideas. And I’m thinking broadly here – of course data scientists and statisticians and economists, but also lawyers, sociologists, or anyone who works with data or the effects of data.
• If you are skeptical of the need for yet another data-oriented Meetup (or other regular meeting), please think about it this way: there are not that many currently active groups which aren’t afraid to go into the technical weeds and also aren’t obsessed with a simplistic, sound bite business take-away. But please tell me if I’m wrong, I’d love to reach out to people doing similar things.
• Suggest a better graphic for our Meetup than our current portrait of Isaac Asimov.
Categories: data science, modeling

## The rise of big data, big brother

I recently read an article off the newsstand called The Rise of Big Data.

It was written by Kenneth Neil Cukier and Viktor Mayer-Schoenberger and it was published in the May/June 2013 edition of Foreign Affairs, which is published by the Council on Foreign Relations (CFR). I mention this because CFR is an influential think tank, filled with powerful insiders, including people like Robert Rubin himself, and for that reason I want to take this view on big data very seriously: it might reflect the policy view before long.

And if I think about it, compared to the uber naive view I came across last week when I went to the congressional hearing about big data and analytics, that would be good news. I’ll write more about it soon, but let’s just say it wasn’t everything I was hoping for.

At least Cukier and Mayer-Schoenberger discuss their reservations regarding “big data” in this article. To contrast this with last week, it seemed like the only background material for the hearing, at least for the congressmen, was the McKinsey report talking about how sexy data science is and how we’ll need to train an army of data scientists to stay competitive.

So I’m glad it’s not all rainbows and sunshine when it comes to big data in this article. Unfortunately, whether because they’re tied to successful business interests, or because they just haven’t thought too deeply about the dark side, their concerns seem almost token, and their examples bizarre.

The article is unfortunately behind the pay wall, but I’ll do my best to explain what they’ve said.

Datafication

First they discuss the concept of datafication, and their example is how we quantify friendships with “likes”: it’s the way everything we do, online or otherwise, ends up recorded for later examination in someone’s data storage units. Or maybe multiple storage units, and maybe for sale.

They formally define it later in the article as a process:

… taking all aspects of life and turning them into data. Google’s augmented-reality glasses datafy the gaze. Twitter datafies stray thoughts. LinkedIn datafies professional networks.

Datafication is an interesting concept, although as far as I can tell they did not coin the word, and it has led me to consider its importance with respect to intentionality of the individual.

Here’s what I mean. We are being datafied, or rather our actions are, and when we “like” someone or something online, we are intending to be datafied, or at least we should expect to be. But when we merely browse the web, we are unintentionally, or at least passively, being datafied through cookies that we might or might not be aware of. And when we walk around in a store, or even on the street, we are being datafied in a completely unintentional way, via sensors or Google glasses.

This spectrum of intentionality ranges from us gleefully taking part in a social media experiment we are proud of to all-out surveillance and stalking. But it’s all datafication. Our intentions may run the gamut but the results don’t.

They follow up their definition in the article, once they get to it, with a line that speaks volumes about their perspective:

Once we datafy things, we can transform their purpose and turn the information into new forms of value

But who is “we” when they write it? What kinds of value do they refer to? As you will see from the examples below, mostly that translates into increased efficiency through automation.

So if at first you assumed they mean we, the American people, you might be forgiven for re-thinking the “we” in that sentence to be the owners of the companies which become more efficient once big data has been introduced, especially if you’ve recently read this article from Jacobin by Gavin Mueller, entitled “The Rise of the Machines” and subtitled “Automation isn’t freeing us from work — it’s keeping us under capitalist control.” From the article (which you should read in its entirety):

In the short term, the new machines benefit capitalists, who can lay off their expensive, unnecessary workers to fend for themselves in the labor market. But, in the longer view, automation also raises the specter of a world without work, or one with a lot less of it, where there isn’t much for human workers to do. If we didn’t have capitalists sucking up surplus value as profit, we could use that surplus on social welfare to meet people’s needs.

The big data revolution and the assumption that N=ALL

According to Cukier and Mayer-Schoenberger, the Big Data revolution consists of three things:

1. Collecting and using a lot of data rather than small samples.
2. Accepting messiness in your data.
3. Giving up on knowing the causes.

They describe these steps in rather grand fashion, by claiming that big data doesn’t need to understand cause because the data is so enormous. It doesn’t need to worry about sampling error because it is literally keeping track of the truth. The way the article frames this is by claiming that the new approach of big data is letting “N = ALL”.

But here’s the thing, it’s never all. And we are almost always missing the very things we should care about most.

So for example, as this InfoWorld post explains, internet surveillance will never really work, because the very clever and tech-savvy criminals that we most want to catch are the very ones we will never be able to catch, since they’re always a step ahead.

Even the example from their own article, election night polls, is itself a great non-example: even if we poll absolutely everyone who leaves the polling stations, we still don’t count people who decided not to vote in the first place. And those might be the very people we’d need to talk to to understand our country’s problems.

Indeed, I’d argue that the assumption we make that N=ALL is one of the biggest problems we face in the age of Big Data. It is, above all, a way of excluding the voices of people who don’t have the time or don’t have the energy or don’t have the access to cast their vote in all sorts of informal, possibly unannounced, elections.

Those people, busy working two jobs and spending time waiting for buses, become invisible when we tally up the votes without them. To you this might just mean that the recommendations you receive on Netflix don’t seem very good because most of the people who bother to rate things on Netflix are young and have different tastes than you, which skews the recommendation engine towards them. But there are plenty of much more insidious consequences stemming from this basic idea.
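Here’s a back-of-the-envelope version of that skew in Python. All the numbers are invented: I’m assuming two groups with different tastes and very different likelihoods of bothering to rate a movie, and comparing the average the recommendation engine sees with the average across everyone.

```python
# Hypothetical population, invented for illustration:
# (group, share of population, share of group who rate, average rating they'd give)
groups = [
    ("young, lots of free time", 0.30, 0.50, 4.5),
    ("working two jobs", 0.70, 0.02, 2.0),
]


def true_average(groups):
    """Average rating if literally everyone were counted (the real N = ALL)."""
    return sum(share * rating for _, share, _, rating in groups)


def observed_average(groups):
    """Average rating among only the people who actually submit ratings."""
    weights = [share * rate for _, share, rate, _ in groups]
    weighted_sum = sum(
        weight * rating for weight, (_, _, _, rating) in zip(weights, groups)
    )
    return weighted_sum / sum(weights)


print(f"average over everyone:   {true_average(groups):.2f}")
print(f"average over raters only: {observed_average(groups):.2f}")
```

With these made-up numbers the data you actually collect says the movie is great, while most of the population would disagree. The dataset looks like "all" only because the missing people never show up in it.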

Another way in which the assumption that N=ALL can matter is that it often gets translated into the idea that data is objective. Indeed, the article warns us against failing to assume exactly that:

… we need to be particularly on guard to prevent our cognitive biases from deluding us; sometimes, we just need to let the data speak.

And later in the article,

In a world where data shape decisions more and more, what purpose will remain for people, or for intuition, or for going against the facts?

This is a bitch of a problem for people like me who work with models, know exactly how they work, and know exactly how wrong it is to believe that “data speaks”.

I wrote about this misunderstanding here, in the context of Bill Gates, but I was recently reminded of it in a terrifying way by this New York Times article on big data and recruiter hiring practices. From the article:

“Let’s put everything in and let the data speak for itself,” Dr. Ming said of the algorithms she is now building for Gild.

If you read the whole article, you’ll learn that this algorithm tries to find “diamond in the rough” types to hire. A worthy effort, but one that you have to think through.

Why? Suppose you compare women and men with the exact same qualifications who have been hired in the past, and then, looking into what happened next, you learn that those women have tended to leave more often, get promoted less often, and give more negative feedback on their environments than the men. Your model might then be tempted to hire the man over the woman the next time the two showed up, rather than looking into the possibility that the company doesn’t treat female employees well.

In other words, ignoring causation can be a flaw, rather than a feature. Models that ignore causation can add to historical problems instead of addressing them. And data doesn’t speak for itself, data is just a quantitative, pale echo of the events of our society.
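A toy sketch in Python shows how easily this happens. The hiring history below is invented, and the "model" is deliberately naive (it just multiplies a qualification score by the historical retention rate for the candidate’s gender). I’m not claiming this is how Gild or anyone else builds their models; the point is only what a causation-blind model does with this kind of history.

```python
from collections import defaultdict

# Hypothetical hiring history, invented for illustration:
# (gender, qualification score, stayed at least two years)
history = [
    ("F", 8, False), ("F", 8, False), ("F", 8, True), ("F", 8, False),
    ("M", 8, True), ("M", 8, True), ("M", 8, False), ("M", 8, True),
]


def retention_by_gender(history):
    """Fraction of past hires in each group who stayed."""
    stayed = defaultdict(int)
    total = defaultdict(int)
    for gender, _, stayed_flag in history:
        total[gender] += 1
        stayed[gender] += int(stayed_flag)
    return {gender: stayed[gender] / total[gender] for gender in total}


def score(candidate, rates):
    """Naive score: qualification weighted by the group's past retention."""
    gender, qualification = candidate
    return qualification * rates[gender]


rates = retention_by_gender(history)
candidates = [("F", 8), ("M", 8)]  # identical qualifications
best = max(candidates, key=lambda c: score(c, rates))
print(rates)
print("model prefers:", best)
```

Nothing in the data distinguishes "women are worse hires" from "this company drives women out", so the model quietly bakes the second problem into its recommendations.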

Some cherry-picked examples

One of the most puzzling things about the Cukier and Mayer-Schoenberger article is how they chose their “big data” examples.

One of them, the ability for big data to spot infection in premature babies, I recognized from the congressional hearing last week. Who doesn’t want to save premature babies? Heartwarming! Big data is da bomb!

But if you’re going to talk about medicalized big data, let’s go there for reals. Specifically, take a look at this New York Times article from last week where a woman traces the big data footprints, such as they are, back in time after receiving a pamphlet on living with Multiple Sclerosis. From the article:

Now she wondered whether one of those companies had erroneously profiled her as an M.S. patient and shared that profile with drug-company marketers. She worried about the potential ramifications: Could she, for instance, someday be denied life insurance on the basis of that profile? She wanted to track down the source of the data, correct her profile and, if possible, prevent further dissemination of the information. But she didn’t know which company had collected and shared the data in the first place, so she didn’t know how to have her entry removed from the original marketing list.

Two things about this. First, it happens all the time, to everyone, but especially to people who don’t know better than to search online for diseases they actually have. Second, the article seems particularly spooked by the idea that a woman who does not have a disease might be targeted as being sick and have crazy consequences down the road. But what about a woman who actually is sick? Does that person somehow deserve to have their life insurance denied?

The real worries about the intersection of big data and medical records, at least the ones I have, are completely missing from the article. Although they did mention that “improving and lowering the cost of health care for the world’s poor” will inevitably make it “necessary to automate some tasks that currently require human judgment.” Increased efficiency once again.

To be fair, they also talked about how Google tried to predict the flu in February 2009 but got it wrong. I’m not sure what they were trying to say except that it’s cool what we can try to do with big data.

Also, they discussed a Tokyo research team that collects data on 360 pressure points with sensors in a car seat, “each on a scale of 0 to 256.” I think that last part about the scale was added just so they’d have more numbers in the sentence – so mathematical!

And what do we get in exchange for all these sensor readings? The ability to distinguish drivers, so I guess you’ll never have to share your car, and the ability to sense if a driver slumps, to either “send an alert or automatically apply brakes.” I’d call that a questionable return for my investment of total body surveillance.

Big data, business, and the government

Of course, if you’re interested in treating your government office like a business, that’s gonna give you an edge too. The example of Bloomberg’s big data initiative led to efficiency gains (read: we can do more with less, i.e. we can start firing government workers, or at least never hire more).

As for regulation, it is pseudo-dealt with via the discussion of market dominance. We are meant to understand that the only role government can or should have with respect to data is how to make sure the market is working efficiently. The darkest projected future is that of market domination by Google or Facebook:

But how should governments apply antitrust rules to big data, a market that is hard to define and is constantly changing form?

In particular, no discussion of how we might want to protect privacy.

Big data, big brother

I want to be fair to Cukier and Mayer-Schoenberger, because they do at least bring up the idea of big data as big brother. Their topic is serious. But their examples, once again, are incredibly weak.

Should we find likely-to-drop-out boys or likely-to-get-pregnant girls using big data? Should we intervene? Note the intention of this model would be the welfare of poor children. But how many models currently in production are targeting that demographic with that goal? Is this in any way at all a reasonable example?

Here’s another weird one: they talked about the bad metric used by US Secretary of Defense Robert McNamara in the Vietnam War, namely the number of casualties. Framing this in the current language of statistics, though, gives us the impression that we could just be super careful about our metrics in the future and: problem solved. As we experts in data know, however, it’s a political decision, not a statistical one, to choose a metric of success. And it’s the guy in charge who makes that decision, not some quant.

Innovation

If you end up reading the Cukier and Mayer-Schoenberger article, please also read Julie Cohen’s draft of a soon-to-be published Harvard Law Review article called “What Privacy is For” where she takes on big data in a much more convincing and skeptical light than Cukier and Mayer-Schoenberger were capable of summoning up for their big data business audience.

I’m actually planning a post soon on Cohen’s article, which contains many nuggets of thoughtfulness, but for now I’ll simply juxtapose two ideas surrounding big data and innovation, giving Cohen the last word. First from the Cukier and Mayer-Schoenberger article:

Big data enables us to experiment faster and explore more leads. These advantages should produce more innovation

Second from Cohen, where she uses the term “modulation” to describe, more or less, the effect of datafication on society:

When the predicate conditions for innovation are described in this way, the problem with characterizing privacy as anti-innovation becomes clear: it is modulation, not privacy, that poses the greater threat to innovative practice. Regimes of pervasively distributed surveillance and modulation seek to mold individual preferences and behavior in ways that reduce the serendipity and the freedom to tinker on which innovation thrives. The suggestion that innovative activity will persist unchilled under conditions of pervasively distributed surveillance is simply silly; it derives rhetorical force from the cultural construct of the liberal subject, who can separate the act of creation from the fact of surveillance. As we have seen, though, that is an unsustainable fiction. The real, socially-constructed subject responds to surveillance quite differently—which is, of course, exactly why government and commercial entities engage in it. Clearing the way for innovation requires clearing the way for innovative practice by real people, by preserving spaces within which critical self-determination and self-differentiation can occur and by opening physical spaces within which the everyday practice of tinkering can thrive.

## Big data and surveillance

You know how, every now and then, you hear someone throw out a statistic that implies almost all of the web is devoted to porn?

Well, that turns out to be a myth, which you can read more about here - although once upon a time it was kind of true, before women started using the web in large numbers and before there was Netflix streaming.

Here’s another myth along the same lines which I think might actually be true: almost all of big data is devoted to surveillance.

Of course, data is data, and you could define “surveillance” broadly (say as “close observation”), to make the above statement a tautology. To what extent is Google’s data, collected about you, a surveillance database, if they only use it to tailor searches and ads?

On the other hand, something that seems unthreatening now can become creepy soon: recall the NSA whistleblower who last year described how the government stores an enormous amount of the “electronic communications” in this country to keep close tabs on us.

The past

Back in 2011, computerworld.com published an article entitled “Big data to drive a surveillance society”, which makes the case that there is a natural competition among corporations with large databases to collect more data, have it more interconnected (knowing not only a person’s shopping habits but also their location and age, say) and have the analytics work faster, even real-time, so they can peddle their products faster and better than the next guy.

Todd Papaioannou, vice president of cloud architecture at Yahoo, said instead of thinking about big data analytics as a weapon that empowers corporate Big Brothers, consumers should regard it as a tool that enables a more personalized Web experience.

“If someone can deliver a more compelling, relevant experience for me as a consumer, then I don’t mind it so much,” he said.

Thanks for telling us consumers how great this is, Todd. Later in the same article Todd says, “Our approach is not to throw any data away.”

The present

Fast forward to 2013, when defence contractor Raytheon is reported to have a new piece of software, called Riot, which is cutting-edge in the surveillance department.

The name Riot refers to “Rapid Information Overlay Technology” and it can locate individuals by longitude and latitude, using cell phone data, and make predictions as well, using data scraped from Facebook, Twitter, and Foursquare. A video explains how they do it. From the op-ed:

The possibilities for RIOT are hideous at consumer level. This really is the stalker’s dream technology. There’s also behavioural analysis to predict movements in the software. That’s what Big Data can do, and if it’s not foolproof, there are plenty of fools around the world to try it out on.

US employers, who have been creating virtual Gulags of surveillance for employees with much less effective technology, will love this. “We know what you do” has always been a working option for coercion. The fantastic levels of paranoia involved in the previous generations of surveillance technology will be truly gratified by RIOT.

The future

Lest we think that our children are not as affected by such stalking software, since they don’t spend as much time on social media and often don’t have cellphones, you should also be aware that educational data is now being collected about individual learners in the U.S. at an enormous scale and with very little oversight.

This report from educationnewyork.com (hat tip Matthew Cunningham-Cook) explains recent changes in privacy laws for children, which happen to coincide with how much data is being collected (tons) and how much money is in the analysis of that data (tons):

Schools are a rich source of personal information about children that can be legally and illegally accessed by third parties. With incidences of identity theft, database hacking, and sale of personal information rampant, there is an urgent need to protect students’ rights under FERPA and raise awareness of aspects of the law that may compromise the privacy of students and their families.

In 2008 and 2011, amendments to FERPA gave third parties, including private companies, increased access to student data. It is significant that in 2008, the amendments to FERPA expanded the definitions of “school officials” who have access to student data to include “contractors, consultants, volunteers, and other parties to whom an educational agency or institution has outsourced institutional services or functions it would otherwise use employees to perform.” This change has the effect of increasing the market for student data.

There are lots of contractors and consultants, for example inBloom, and they are slightly less concerned about data privacy issues than you might be:

inBloom has stated that it “cannot guarantee the security of the information stored … or that the information will not be intercepted when it is being transmitted.”

The article ends with this:

The question is: Should we compromise and endanger student privacy to support a centralized and profit-driven education reform initiative? Given this new landscape of an information and data free-for-all, and the proliferation of data-driven education reform initiatives like CommonCore and huge databases of student information, we’ve arrived at a time when once a child enters a public school, their parents will never again know who knows what about their children and about their families. It is now up to individual states to find ways to grant students additional privacy protections.

No doubt about it: our children are well on their way to being the most stalked generation.

One of the reasons I’m writing this post today is that I’m on a train to D.C. to sit in a Congressional hearing where Congressmen will ask “big data experts” questions about big data and analytics. The announcement is here, and I’m hoping to get into it.

The experts present are from IBM, the NSF, and North Carolina State University. I’m wondering how they got picked and what their incentives are. If I get in I will write a follow-up post on what happened.

Here’s what I hope happens. First, I hope it’s made clear that anonymization doesn’t really work with large databases. Second, I hope it’s clear that there’s no longer a very clear dividing line between sensitive data and nonsensitive data – you’d be surprised how much can be inferred about your sensitive data using only nonsensitive data.
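To see why stripping names off a database doesn’t make it anonymous, here’s a small sketch in Python of a linkage attack: matching "anonymized" records against a public list using fields that look harmless on their own. Every record, name, and field below is invented, and the choice of zip code, birth year, and sex as the linking fields is just an assumption for the example.

```python
# Hypothetical "anonymized" health records: names removed, other fields kept.
health_records = [
    {"zip": "10027", "birth_year": 1972, "sex": "F", "diagnosis": "diabetes"},
    {"zip": "10027", "birth_year": 1985, "sex": "M", "diagnosis": "asthma"},
    {"zip": "11201", "birth_year": 1972, "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical public list with names attached (say, a marketing file).
public_records = [
    {"name": "Alice Example", "zip": "10027", "birth_year": 1972, "sex": "F"},
    {"name": "Bob Example", "zip": "10027", "birth_year": 1985, "sex": "M"},
    {"name": "Carol Example", "zip": "11201", "birth_year": 1990, "sex": "F"},
]

# Fields that are "nonsensitive" individually but identifying in combination.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def key(record):
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def reidentify(health_records, public_records):
    """Attach names to health records whose quasi-identifiers match uniquely."""
    names_by_key = {}
    for person in public_records:
        names_by_key.setdefault(key(person), []).append(person["name"])
    matches = []
    for record in health_records:
        names = names_by_key.get(key(record), [])
        if len(names) == 1:  # a unique match pins the record to one person
            matches.append((names[0], record["diagnosis"]))
    return matches


for name, diagnosis in reidentify(health_records, public_records):
    print(f"{name}: {diagnosis}")
```

Two of the three "anonymous" records get a name attached in this toy example, using nothing but fields most people wouldn’t think of as sensitive.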

Next, I hope it’s clear that the very people who should be worried the most about their data being exposed and freely available are the ones who don’t understand the threat. This means that merely saying that people should protect their data more is utterly insufficient.

Next, we should understand what policies already in place look like in Europe.

Finally, we should focus not only on the collection of data, but also on the usage of data. Just because you have a good idea of my age, race, education level, income, and HIV status doesn’t mean you should be able to use that information against me whenever you want.

In particular, it should not be legal for companies that provide loans or insurance to use whatever information they can buy from Acxiom about you. It should be a highly regulated set of data that allows for such decisions.

Categories: data science, modeling

## How to reinvent yourself, nerd version

I wanted to give this advice today just in case it’s useful to someone. It’s basically the way I went about reinventing myself from being a quant in finance to being a data scientist in the tech scene.

In other words, many of the same skills but not all, and many of the same job description elements but not all.

The truth is, I didn’t even know the term “data scientist” when I started my job hunt, so for that reason I think it’s possibly good and useful advice: if you follow it, you may end up getting a great job you don’t even know exists right now.

Also, I used this advice yesterday on my friend who is trying to reinvent himself, and he seemed to find it useful, although time will tell how much – let’s see if he gets a new job soon!

Here goes.

• Write a list of things you like about jobs: learning technical stuff, managing people, whatever floats your boat.
• Next, write a list of things you don’t like: being secretive, no vacation, office politics, whatever. Some people hate working with “dumb people” but some people can’t stand “arrogant people”. It makes a huge difference actually.
• Next, write a list of skills you have: python, basic statistics, math, managing teams, smelling a bad deal, stuff like that. This is probably the most important list, so spend some serious time on it.
• Finally, write a list of skills you don’t have that you wish you did: hadoop, knowing when to stop talking, stuff like that.

Once you have your lists, start going through LinkedIn by cross-searching for your preferred city and a keyword from one of your lists (probably the “skills you have” list).

Every time you find a job that you think you’d like to have, record it in a spreadsheet or at least a file: the skills it lists that you don’t have, the name of the company, and your guess on a scale of 1-10 of how much you’d like the job. This last part is where you use the “stuff I like” and “stuff I don’t like” lists.

And when you’ve done this for a long time, like you made it your job for a few hours a day for at least a few weeks, then do some wordcounts on this file, preferably using a command line script to add to the nerdiness, to see which skills you’d need to get which jobs you’d really like.
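Here is a rough sketch of what that wordcount step might look like in Python. The file layout (one job per line, tab-separated: company, 1-10 score, comma-separated missing skills), the function name, and the sample notes are all my own assumptions for illustration, so adapt them to however you actually kept your notes.

```python
from collections import Counter

def count_missing_skills(lines, min_score=7):
    """Count how often each missing skill shows up among jobs scored >= min_score.

    Each line is assumed to look like: company<TAB>score<TAB>skill1, skill2, ...
    (a hypothetical layout for the notes file).
    """
    counts = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        company, score, skills = parts
        if int(score) < min_score:
            continue  # only count jobs you'd really like
        for skill in skills.split(","):
            skill = skill.strip().lower()
            if skill:
                counts[skill] += 1
    return counts

# Made-up example notes
notes = [
    "Acme Analytics\t9\thadoop, sql",
    "DataCo\t8\thadoop, bayesian statistics",
    "BoringBank\t3\tcobol",
]

for skill, n in count_missing_skills(notes).most_common():
    print(skill, n)
```

The skills at the top of that list are the ones standing between you and the jobs you scored highest.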

Note LinkedIn is not an oracle: it doesn’t have every job in the world (although it might have most jobs you could ever get), and the descriptions aren’t always accurate.

For example, I think companies often need managers of software engineers, but they never advertise for managers of software engineers. They advertise for software engineers, and then let them manage if they have the ability to, and sometimes even if they don’t. But even in that case I think it makes sense: engineers don’t want to be managed by someone they think isn’t technical, and the best way to get someone who is definitely technical is just to get another engineer.

In other words, sometimes the “job requirements” data on LinkedIn is dirty, but it’s still useful. And thank god for LinkedIn.

Next, make sure your LinkedIn profile is up-to-date and accurate, and that your ex-coworkers have written letters for you and endorsed you for your skills.

Finally, buy a book or two to learn the new skills you’ve decided to acquire based on your research. I remember bringing a book on Bayesian statistics to my interview for a data scientist job. I wasn’t all the way through the book, and my boss didn’t even know enough to interview me on that subject, but it didn’t hurt for him to see that I was independently learning stuff because I thought it would be useful, and it didn’t hurt to be on top of that stuff when I started my new job.

What I like about this is that it looks for jobs based on what you want rather than what you already know you can do. It’s in some sense the dual method to what people usually do.

## War of the machines, college edition

A couple of people have sent me this recent essay (hat tip Leon Kautsky) written by Elijah Mayfield on the education technology blog e-Literate, described on their About page as “a hobby weblog about educational technology and related topics that is maintained by Michael Feldstein and written by Michael and some of his trusted colleagues in the field of educational technology.”

Mayfield’s essay is entitled “Six Ways the edX Announcement Gets Automated Essay Grading Wrong”. He’s referring to the recent announcement, which was written about in the New York Times last week, about how professors will soon be replaced by computers in grading essays. He claims they got it all wrong and there’s nothing to worry about.

First, Mayfield’s points:

• Journalists sensationalize things.
• The machine is identifying things in the essays that are associated with good writing vs. bad writing, much like it might learn to distinguish pictures of ducks from pictures of houses.
• It’s actually not that hard to find the duck and has nothing to do with “creativity” (look for webbed feet).
• If the machine isn’t sure it can spit back the essay to the professor to read (if the professor is still employed).
• The machine doesn’t necessarily reward big vocabulary words, except when it does.
• You’d need thousands of training examples (essays on a given subject) to make this actually work.
• What’s really wonderful is that a student can get all his or her many drafts graded instantaneously, which no professor would be willing to do.

Here’s where I’ll start, with this excerpt from near the end:

“Can machine learning grade essays?” is a bad question. We know, statistically, that the algorithms we’ve trained work just as well as teachers for churning out a score on a 5-point scale.  We know that occasionally it’ll make mistakes; however, more often than not, what the algorithms learn to do are reproduce the already questionable behavior of humans. If we’re relying on machine learning solely to automate the process of grading, to make it faster and cheaper and enable access, then sure. We can do that.

OK, so we know that the machine can grade essays written for human consumption pretty accurately. But it hasn’t had to deal with essays written for machine consumption yet. There’s major room for gaming here, and only a matter of time before there’s a competing algorithm to build a great essay. I even know how to train that algorithm. Email me privately and we can make a deal on profit-sharing.

And considering that students will be able to get their drafts graded as many times as they want, as Mayfield advertised, this will only be easier. If I build an essay that I think should game the machine, by putting in lots of (relevant) long vocabulary words and erudite phrases, then I can always double check by having the system give me a grade. If it doesn’t work, I’ll try again.

And the essays built this way won’t get caught via the fraud detection software that finds plagiarism, because any good essay-builder will only steal smallish phrases.

One final point. The fact that the machine-learning grading algorithm only works when it’s been trained on thousands of essays points to yet another depressing trend: large-scale classes with the same exact assignments every semester so last year’s algorithm can be used, in the name of efficiency.

But that means last year’s essay-building algorithm can be used as well. Pretty soon it will just be a war of the machines.

Categories: data science, modeling, musing, news

## New creepy model: job hiring software

Before I get to my main take-down of the morning, let me warm up with an appetizer of sorts: have you been hearing a lot about new models that automatically grade essays?

Does it strike you that there’s something wrong with that idea but you don’t know what it is?

Here’s my take. While it’s true that it’s possible to train a model to grade essays similarly to what a professor now does, that doesn’t mean we can introduce automatic grading – at least not if the students in question know that’s what we’re doing.

There’s a feedback loop, whereby if the students know their essays will be automatically graded, then they will change what they’re doing to optimize for good automatic grades rather than, say, a cogent argument.

For example, a student might download a grading app themselves (wouldn’t you?) and run their essay through the machine until it gets a great grade. Not enough long words? Put them in! No need to make sure the sentences make sense, because the machine doesn’t understand grammar!

This is, in fact, a great example where people need to take into account the (obvious when you think about them) feedback loops that their models will enter in actual use.

Job Hiring Models

Now on to the main course.

In this week’s Economist there is an essay about the new widely-used job hiring software and how awesome it is. It’s so efficient! It removes the biases of those pesky recruiters! Here’s an excerpt from the article:

The problem with human-resource managers is that they are human. They have biases; they make mistakes. But with better tools, they can make better hiring decisions, say advocates of “big data”.

So far “the machine” has made observations such as:

• Good if candidate uses browser you need to download like Chrome.
• Not as bad as one might expect to have a criminal record.
• Neutral on job hopping.
• Great if you live nearby.
• Good if you are on Facebook.
• Bad if you’re on Facebook and every other social networking site as well.

Now, I’m all for learning to fight against our biases and hire people that might not otherwise be given a chance. But I’m not convinced that this will happen that often – the people using the software can always train the model to include their biases and then point to the machine and say “The machine told me to do it”. True.

What I really object to, however, is the accumulating amount of data that is being collected about everyone by models like this.

It’s one thing for an algorithm to take my CV in and note that I misspelled my alma mater, but it’s a different thing altogether to scour the web for my online profile trail (via Acxiom, for example), to look up my credit score, and maybe even to see my persistence score as measured by my past online education activities (soon available for your 7-year-old as well!).

As a modeler, I know how hungry the model can be. It will ask for all of this data and more. And it will mean that nothing you’ve ever done wrong, no fuck-up that you wish to forget, will ever be forgotten. You can no longer reinvent yourself.

Forget mobility, forget the American Dream, you and everyone else will be funneled into whatever job and whatever life the machine has deemed you worthy of. WTF.

Categories: data science, modeling, rant

## Guest post by Julia Evans: How I got a data science job

This is a guest post by Julia Evans. Julia is a data scientist & programmer who lives in Montréal. She spends her free time these days playing with data and running events for women who program or want to. She just started a Montréal chapter of pyladies to teach programming, and co-organizes a monthly meetup called Montréal All-Girl Hack Night for women who are developers.

I asked mathbabe a question a few weeks ago saying that I’d recently started a data science job without having too much experience with statistics, and she asked me to write something about how I got the job. Needless to say I’m pretty honoured to be a guest blogger here. Hopefully this will help someone!

Last March I decided that I wanted a job playing with data, since I’d been playing with datasets in my spare time for a while and I really liked it. I had a BSc in pure math, an MSc in theoretical computer science and about 6 months of work experience as a programmer developing websites. I’d taken one machine learning class and zero statistics classes.

In October, I left my web development job with some savings and no immediate plans to find a new job. I was thinking about doing freelance web development. Two weeks later, someone posted a job posting to my department mailing list looking for a “Junior Data Scientist”. I wrote back and said basically “I have a really strong math background and am a pretty good programmer”. This email included, embarrassingly, the sentence “I am amazing at math”. They said they’d like to interview me.

The interview was a lunch meeting. I found out that the company (Via Science) was opening a new office in my city, and was looking for people to be the first employees at the new office. They work with clients to make predictions based on their data.

My interviewer (now my manager) asked me about my role at my previous job (a little bit of everything — programming, system administration, etc.), my math background (lots of pure math, but no stats), and my experience with machine learning (one class, and drawing some graphs for fun). I was asked how I’d approach a digit recognition problem and I said “well, I’d see what people do to solve problems like that, and I’d try that”.

I also talked about some data visualizations I’d worked on for fun. They were looking for someone who could take on new datasets and be independent and proactive about creating models, figuring out what is the most useful thing to model, and getting more information from clients.

I got a call back about a week after the lunch interview saying that they’d like to hire me. We talked a bit more about the work culture, starting dates, and salary, and then I accepted the offer.

So far I’ve been working here for about four months. I work with a machine learning system developed inside the company (there’s a paper about it here). I’ve spent most of my time working on code to interface with this system and make it easier for us to get results out of it quickly. I alternate between working on this system (using Java) and using Python (with the fabulous IPython Notebook) to quickly draw graphs and make models with scikit-learn to compare our results.

I like that I have real-world data (sometimes, lots of it!) where there’s not always a clear question or direction to go in. I get to spend time figuring out the relevant features of the data or what kinds of things we should be trying to model. I’m beginning to understand what people say about data-wrangling taking up most of their time. I’m learning some statistics, and we have a weekly Friday seminar series where we take turns talking about something we’ve learned in the last few weeks or introducing a piece of math that we want to use.

Overall I’m really happy to have a job where I get data and have to figure out what direction to take it in, and I’m learning a lot.

## K-Nearest Neighbors: dangerously simple

I spend my time at work nowadays thinking about how to start a company in data science. Since there are tons of companies now collecting tons of data, and they don’t know what to do with it, nor who to ask, part of me wants to design (yet another) dumbed-down “analytics platform” so that business people can import their data onto the platform, and then perform simple algorithms themselves, without even having a data scientist to supervise.

After all, a good data scientist is hard to find. Sometimes you don’t even know if you want to invest in this whole big data thing: you’re not sure the data you’re collecting is all that great or whether the whole thing is just a bunch of hype. It’s tempting to bypass professional data scientists altogether and try to replace them with software.

I’m here to say, it’s not clear that’s possible. Even the simplest algorithm, like k-Nearest Neighbor (k-NN), can be naively misused by someone who doesn’t understand it well. Let me explain.

Say you have a bunch of data points, maybe corresponding to users on your website. They have a bunch of attributes, and you want to categorize them based on their attributes. For example, they might be customers that have spent various amounts of money on your product, and you can put them into “big spender”, “medium spender”, “small spender”, and “will never buy anything” categories.

What you really want, of course, is a way of anticipating the category of a new user before they’ve bought anything, based on what you know about them when they arrive, namely their attributes. So the problem is, given a user’s attributes, what’s your best guess for that user’s category?

Let’s use k-Nearest Neighbors. Let k be 5 and say there’s a new customer named Monica. Then the algorithm searches for the 5 customers closest to Monica, i.e. most similar to Monica in terms of attributes, and sees what categories those 5 customers were in. If 4 of them were “medium spenders” and 1 was “small spender”, then your best guess for Monica is “medium spender”.
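To make that concrete, here is a minimal sketch of the voting step in Python. The function name, the two-attribute customers, and their categories are made up for illustration, and the distance is plain Euclidean distance on whatever numbers you hand it.

```python
import math
from collections import Counter

def knn_predict(new_point, labeled_points, k=5):
    """Guess a category by majority vote among the k nearest labeled points.

    labeled_points is a list of (attribute_vector, category) pairs.
    Distance is plain Euclidean distance on the raw attribute values.
    """
    by_distance = sorted(
        labeled_points,
        key=lambda pair: math.dist(new_point, pair[0]),
    )
    nearest_labels = [category for _, category in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up customers with two attributes on comparable scales
customers = [
    ((1.0, 1.0), "medium spender"),
    ((1.2, 0.9), "medium spender"),
    ((0.8, 1.1), "medium spender"),
    ((1.1, 1.3), "medium spender"),
    ((1.4, 1.0), "small spender"),
    ((5.0, 5.0), "big spender"),
    ((6.0, 5.5), "big spender"),
]

monica = (1.0, 1.1)
print(knn_predict(monica, customers, k=5))
```

In this toy data Monica’s 5 nearest neighbors are 4 “medium spenders” and 1 “small spender”, so the vote comes out “medium spender”.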

Holy shit, that was simple! Mathbabe, what’s your problem?

The devil is all in the detail of what you mean by close. And to make things trickier, as in deceptively easy, there are default choices you could make (and probably would make) which would probably be totally stupid. Namely, the raw numbers, and Euclidean distance.

So, for example, say your customer attributes were: age, salary, and number of previous visits to your website. Don’t ask me how you know your customer’s salary, maybe you bought info from Acxiom.

So in terms of attribute vectors, Monica’s might look like:

$(22.0, 55000.0, 0.0)$

And the nearest neighbor to Monica might look like:

$(75.0, 54000.0, 35.0)$

In other words, because you’re including the raw salary numbers, you are thinking of Monica, who is 22 and new to the site, as close to a 75-year-old who comes to the site a lot. The salary, being of a much larger scale, is totally dominating the distance calculation. You might as well have only that one attribute and scrap the others.
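You can check this with a few lines of Python. The two vectors for Monica and the 75-year-old are the ones above; the third customer, a 23-year-old with one previous visit and a salary \$2,000 higher, is a hypothetical I added for comparison.

```python
import math

# Attribute order: age, salary, previous visits
monica = (22.0, 55000.0, 0.0)
retiree = (75.0, 54000.0, 35.0)   # the "nearest neighbor" from the text
peer = (23.0, 57000.0, 1.0)       # hypothetical customer, added for comparison

d_retiree = math.dist(monica, retiree)
d_peer = math.dist(monica, peer)

# Fraction of the squared distance to the retiree that comes from salary alone
salary_share = (monica[1] - retiree[1]) ** 2 / d_retiree ** 2

print(round(d_retiree, 1), round(d_peer, 1), round(salary_share, 3))
```

The 75-year-old comes out roughly twice as close to Monica as the 23-year-old does, and more than 99% of that squared distance is salary.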

Note: you would not necessarily think about this problem if you were just pressing a big button on a dashboard called “k-NN me!”

Of course, it gets trickier. Even if you measured salary in thousands (so Monica would now be given the attribute vector $(22.0, 55.0, 0.0)$) you still don’t know if that’s the right scaling. In fact, if you think about it, the algorithm’s results depend completely on how you scale these numbers, and there’s almost no way to reasonably visualize it, or even to do it by hand, if you have more than 4 attributes.
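Here is a small sketch of how much the answer can move when only the units change. Monica and the 75-year-old are from the example above; the second candidate (a 23-year-old with one visit and a \$57,000 salary) and the helper names are my own assumptions.

```python
import math

def scaled_distance(a, b, scales):
    """Euclidean distance after dividing each attribute by its scale."""
    return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, scales)))

# Attribute order: age, salary in dollars, previous visits
monica = (22.0, 55000.0, 0.0)
candidates = {
    "retiree": (75.0, 54000.0, 35.0),
    "peer": (23.0, 57000.0, 1.0),   # hypothetical second customer
}

def nearest(scales):
    """Name of the candidate closest to Monica under the given scaling."""
    return min(
        candidates,
        key=lambda name: scaled_distance(monica, candidates[name], scales),
    )

print(nearest((1.0, 1.0, 1.0)))      # salary in raw dollars
print(nearest((1.0, 1000.0, 1.0)))   # salary in thousands of dollars
```

With raw dollars the nearest neighbor is the 75-year-old; with salary in thousands it is the 23-year-old. Same data, different units, different answer, and nothing in the algorithm tells you which one is right.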

Another problem is redundancy – if you have a bunch of attributes that are essentially redundant, i.e. that are highly correlated to each other, then including them all is tantamount to multiplying the scale of that factor.
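A toy illustration of the redundancy point, with made-up numbers: if the same attribute effectively appears three times, its contribution to the Euclidean distance gets multiplied by the square root of 3, exactly as if you had rescaled it.

```python
import math

# Attribute order: age, salary in thousands (made-up values)
a = (22.0, 55.0)
b = (30.0, 55.0)   # differs from a only in age, by 8 years

# Pretend age is recorded three times under different names (perfectly redundant columns)
a_redundant = (a[0], a[0], a[0], a[1])
b_redundant = (b[0], b[0], b[0], b[1])

d_plain = math.dist(a, b)
d_redundant = math.dist(a_redundant, b_redundant)

print(d_plain, d_redundant, d_redundant / d_plain)
```

Real attributes are rarely perfectly redundant, but highly correlated ones push the distance in the same direction.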

Another problem is that not all your attributes are even numbers; some of them are strings. You might think you can solve this by using 0’s and 1’s, but in the case of k-NN, that becomes just another scaling problem.
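To see why 0’s and 1’s are just another scaling choice, here is a sketch with a hypothetical string attribute (browser) encoded as indicators. The attribute, the customers, and the `weight` parameter are all assumptions of mine for illustration.

```python
import math

BROWSERS = ["chrome", "firefox", "safari"]  # hypothetical string attribute

def encode(age, browser, weight=1.0):
    """Turn (age, browser) into numbers: age as-is, browser as weighted 0/1 indicators."""
    return [age] + [weight if browser == b else 0.0 for b in BROWSERS]

def distance(u, v, weight=1.0):
    """Euclidean distance between two (age, browser) customers after encoding."""
    return math.dist(encode(*u, weight=weight), encode(*v, weight=weight))

monica = (22.0, "chrome")
same_browser_older = (24.0, "chrome")
same_age_other_browser = (22.0, "firefox")

for w in (1.0, 10.0):
    print(
        w,
        distance(monica, same_browser_older, weight=w),
        distance(monica, same_age_other_browser, weight=w),
    )
```

With plain 0/1 indicators, using a different browser counts as less than a two-year age gap; encode the same thing as 0/10 and it counts as about a fourteen-year gap. Which of those is right is a modeling decision, not something the encoding decides for you.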

One way around this might be to first use some kind of dimension-reducing algorithm, like PCA, to figure out what attribute combinations to actually use from the get-go. That’s probably what I’d do.

But that means you’re using a fancy algorithm in order to use a completely stupid algorithm. Not that there’s anything wrong with that, but it indicates the basic problem, which is that doing data analysis carefully is actually pretty hard and maybe should be done by professionals, or at least under the supervision of one.

Categories: data science, modeling

There’ve been a couple of articles in the past few days about teacher Value-Added Testing that have enraged me.

If you haven’t been paying attention, the Value-Added Model (VAM) is now being used in a majority of the states (source: the Economist):

But it gives out nearly random numbers, as gleaned from looking at the same teachers with two scores (see this previous post). There’s a 24% correlation between the two numbers. Note that some people are awesome with respect to one score and complete shit on the other score:

Final thing you need to know about the model: nobody really understands how it works. It relies on error terms of an error-riddled model. It’s opaque, and no teacher can have their score explained to them in Plain English.

Now, with that background, let’s look into these articles.

First, there’s this New York Times article from yesterday, entitled “Curious Grade for Teachers: Nearly All Pass”. The article describes how teachers are nowadays being judged using a (usually) 50/50 combination of classroom observations and VAM scores. This is different from the past, when evaluations were based only on classroom observations.

What they’ve found is that the percentage of teachers found “effective or better” has stayed high in spite of the new system – the numbers are all over the place but typically between 90 and 99 percent of teachers. In other words, the number of teachers that are fingered as truly terrible hasn’t gone up too much. What a fucking disaster, at least according to the NYTimes, which seems to go out of its way to make its readers understand how very much high school teachers suck.

1. Given that the VAM is nearly a random number generator, this is good news – it means they are not trusting the VAM scores blindly. Of course, it still doesn’t mean that the right teachers are getting fired, since half of the score is random.
2. Another point the article mentions is that failing teachers are leaving before the reports come out. We don’t actually know how many teachers are affected by these scores.
3. Anyway, what is the right number of teachers to fire each year, New York Times? And how did you choose that number? Oh wait, you quoted someone from the Brookings Institution: “It would be an unusual profession that at least 5 percent are not deemed ineffective.” Way to explain things so scientifically! It’s refreshing to know exactly how the army of McKinsey alums approach education reform.
4. The overall article gives us the impression that if we were really going to do our job and “be tough on bad teachers,” then we’d weight the Value-Added Model way more. But instead we’re being pussies. Wonder what would happen if we weren’t pussies?

The second article explained just that. It also came from the New York Times (h/t Suresh Naidu), and it was the story of a School Chief in Atlanta who took the VAM scores very very seriously.

What happened next? The teachers cheated wildly, changing the answers on their students’ tests. There was a big cover-up, lots of nasty political pressure, and a lot of good people feeling really bad, blah blah blah. But maybe we can take a step back and think about why this might have happened. Can we do that, New York Times? Maybe it had to do with the \$500,000 in “performance bonuses” that the School Chief got for such awesome scores?

Let’s face it, this cheating scandal, and others like it (which may never come to light), was not hard to predict (as I explain in this post). In fact, as a predictive modeler, I’d argue that this cheating problem is the easiest thing to predict about the VAM, considering how it’s being used as an opaque mathematical weapon.

## Guest Post SuperReview Part III of VI: The Occupy Handbook Part I and a little Part II: Where We Are Now

Whattup.

Moving on from Lewis’ cute Bloomberg column reprint, we come to the next essay in the series:

Indefatigable pair Paul Krugman and Robin Wells (KW hereafter) contribute one of the several original essays in the book, but the content ought to be familiar if you read the New York Times, know something about economics or practice finance. Paul Krugman is prolific, and it isn’t hard to be prolific when you have to rewrite essentially the same column every week; question, are there other columnists who have been so consistently right yet have failed to propose anything that the polity would adopt? Political failure notwithstanding, Krugman leaves gems in every paragraph for the reader new to all this. The title “The Widening Gyre” comes from an apocalyptic William Butler Yeats poem. In this case, Krugman and Wells tackle the problem of why the government responded so poorly to the crisis. In their words:

By 2007, America was about as unequal as it had been on the eve of the Great Depression – and sure enough, just after hitting this milestone, we lunged into the worst slump since the Depression. This probably wasn’t a coincidence, although economists are still working on trying to understand the linkages between inequality and vulnerability to economic crisis.

Here, however, we want to focus on a different question: why has the response to crisis been so inadequate? Before financial crisis struck, we think it’s fair to say that most economists imagined that even if such a crisis were to happen, there would be a quick and effective policy response [editor's note: see Kautsky et al 2016 for a partial explanation]. In 2003 Robert Lucas, the Nobel laureate and then president of the American Economic Association, urged the profession to turn its attention away from recessions to issues of longer-term growth. Why? Because he declared, the “central problem of depression-prevention has been solved, for all practical purposes, and has in fact been solved for many decades.”

Famous last words from Professor Lucas. Nevertheless, the curious failure to apply what was once the conventional wisdom on a useful scale intrigues me for two reasons. First, most political scientists suggest that democracy, versus authoritarian system X, leads to better outcomes for two reasons.

1. Distributional – you get a nicer distribution of wealth (possibly more productivity for complicated macro reasons); economics suggests that since people are mostly envious and poor people have rapidly increasing utility in wealth, democracy’s tendency to share the wealth better maximizes some stupid social welfare criterion (typically, Kaldor-Hicks efficiency).

2. Information – democracy is a better information aggregation system than dictatorship and an expanded polity makes better decisions beyond allocation of produced resources. The polity must be capable of learning and intelligent OR vote randomly if uninformed for this to work. While this is the original rigorous justification for democracy (first formalized in the 1800s by French rationalists), almost no one who studies these issues today believes one-person one-vote democracy better aggregates information than all other systems at a national level. “Well Leon,” some knave comments, “we don’t live in a democracy, we live in a Republic with a president…so shouldn’t a small group of representatives better be able to make social-welfare maximizing decisions?” Short answer: strong no, and US Constitutionalism has some particularly nasty features when it comes to political decision-making.

Second, KW suggest that the presence of extreme wealth inequalities acts like a democracy-disabling virus at the national level. According to KW, extreme wealth inequalities perpetuate themselves in a way that undermines both “nice” features of a democracy when it comes to making regulatory and budget decisions.* Thus, to get better economic decision-making from our elected officials, a good intermediate step would be to make our tax system more progressive or expand Medicare or Social Security or…Well, we have a lot of good options here. Of course, for mathematically minded thinkers, this raises the following question: if we could enact so-called progressive economic policies to cure our political crisis, why haven’t we done so already? What can/must change for us to do so in the future? While I believe that the answer to this question is provided by another essay in the book, let’s take a closer look at KW’s explanation of how wealth inequality throws sand into the gears of our polity. They propose four explanations; the numbering scheme is mine:

1. The most likely explanation of the relationship between inequality and polarization is that the increased income and wealth of a small minority has, in effect bought the allegiance of a major political party…Needless to say, this is not an environment conducive to political action.

2. It seems likely that this persistence [of financial deregulation] despite repeated disasters had a lot to do with rising inequality, with the causation running in both directions. On the one side the explosive growth of the financial sector was a major source of soaring incomes at the very top of the income distribution. On the other side, the fact that the very rich were the prime beneficiaries of deregulation meant that as this group gained power - simply because of its rising wealth - the push for deregulation intensified. These impacts of inequality on ideology did not end in 2008…[they] left us incapacitated in the face of crisis.

3. Conservatives have always seen [Keynesian economics] as the thin edge of the wedge: concede that the government can play a useful role in fighting slumps, and the next thing you know we’ll be living under socialism.

4. [Krugman paraphrasing Kalecki] Every widening of state activity is looked upon by business with suspicion, but the creation of employment by government spending has a special aspect which makes the opposition particularly intense. Under a laissez-faire system the level of employment depends to a great extent on the so-called state of confidence….This gives capitalists a powerful indirect control over government policy: everything which may shake the state of confidence must be avoided because it would cause an economic crisis.

All of these are true to an extent. Two are related to the features of a particular policy position that conservatives don’t like (countercyclical spending) and their cost will dissipate if the economy improves. Isn’t it the case that most proponents and beneficiaries of financial liberalization are Democrats? (Wall Street mostly supported Obama in 08 and barely supported Romney in 12 despite Romney giving the house away). In any case, while KW aren’t big on solutions they certainly have a strong grasp of the problem.

Take a Stand: Sit In by Philip Dray

As the railroad strike of 1877 had led eventually to expanded workers’ rights, so the Greensboro sit-in of February 1, 1960, helped pave the way for passage of the Civil Rights Act of 1964 and the Voting Rights Act of 1965. Both movements remind us that not all successful protests are explicit in their message and purpose; they rely instead on the participants’ intuitive sense of justice. [28]

I’m not the only author to have taken note of this passage as particularly important, but I am the only author who found the passage significant and did not start ranting about so-called “natural law.” Chronicling the (hitherto unknown-to-me) history of the Great Upheaval, Dray does a great job relating some important moments in left protest history to the OWS history. This is actually an extremely important essay and I haven’t given it the time it deserves. If you read three essays in this book, include this in your list.

Inequality and Intemperate Policy by Raghuram Rajan (no URL, you’ll have to buy the book)

Rajan’s basic ideas are the following. Inequality has gotten out of control:

Deepening income inequality has been brought to the forefront of discussion in the United States. The discussion tends to center on the Croesus-like income of John Paulson, the hedge fund manager who made a killing in 2008 betting on a financial collapse and netted over \$3 billion, about seventy-five-thousand times the average household income. Yet a more worrying, everyday phenomenon that confronts most Americans is the disparity in income growth rates between a manager at the local supermarket and the factory worker or office assistant. Since the 1970s, the wages of the former, typically workers at the ninetieth percentile of the wage distribution in the United States, have grown much faster than the wages of the latter, the typical median worker.

But American political ideologies typically rule out the most direct responses to inequality (i.e. redistribution). The result is a series of stop-gap measures that do long-run damage to the economy (as defined by sustainable and rising income levels and full employment), but temporarily boost the consumption level of lower classes:

It is not surprising then, that a policy response to rising inequality in the United States in the 1990s and 2000s – whether carefully planned or chosen as the path of least resistance – was to encourage lending to households, especially but not exclusively low-income ones, with the government push given to housing credit just the most egregious example. The benefit – higher consumption – was immediate, whereas paying the inevitable bill could be postponed into the future. Indeed, consumption inequality did not grow nearly as much as income inequality before the crisis. The difference was bridged by debt. Cynical as it may seem, easy credit has been used as a palliative by successive administrations that have been unable to address the deeper anxieties of the middle class directly. As I argue in my book Fault Lines, “Let them eat credit” could well summarize the mantra of the political establishment in the go-go years before the crisis.

Why should you believe Raghuram Rajan? Because he’s one of the few guys who called the first crisis and tried to warn the Fed.

A solid essay providing a more direct link between income inequality and bad policy than KW do.

The 5 percent’s [consisting of the seven million Americans who, in 1934, were sixty-five and older] protests coalesced as the Townsend movement, launched by a sinewy midwestern farmer’s son and farm laborer turned California physician. Francis Townsend was a World War I veteran who had served in the Army Medical Corps. He had an ambitious, and impractical plan for a federal pension program. Although during its heyday in the 1930s the movement failed to win enactment of its [editor's note: insane] program, it did play a critical role in contemporary politics. Before Townsend, America understood the destitution of its older generations only in abstract terms; Townsend’s movement made it tangible. “It is no small achievement to have opened the eyes of even a few million Americans to these facts,” Bruce Bliven, editor of the New Republic observed. “If the Townsend Plan were to die tomorrow and be completely forgotten as miniature golf, mah-jongg, or flinch [editor's note: everything old is new again], it would still have left some sedimented flood marks on the national consciousness.” Indeed, the Townsend movement became the catalyst for the New Deal’s signal achievement, the old-age program of Social Security. The history of its rise offers a lesson for the Occupy movement in how to convert grassroots enthusiasm into a potent political force – and a warning about the limitations of even a nationwide movement.

Does the author live up to the promises of this paragraph? Is the whole essay worth reading? Does FDR give in to the people’s demands and pass Social Security?!

Hidden in Plain Sight by Gillian Tett (no URL, you’ll have to buy the book)

This is a great essay. I’m going to outsource the review and analysis to:

http://beyoubesure.com/2012/10/13/generation-lost-lazy-afraid/

because it basically sums up my thoughts. You all, go read it.

If you know nothing about Wall Street, then the essay is worth reading, otherwise skip it. There are two common ways to write a bad article in financial journalism. First, you can try to explain tiny index price movements via news articles from that day/week/month. “Shares in the S&P moved up on good news in Taiwan today,” that kind of nonsense. While the news and price movements might be worth knowing for their own sake, these articles are usually worthless because no journalist really knows who traded and why (theorists might point out even if the journalists did know who traded to generate the movement and why, it’s not clear these articles would add value – theorists are correct).

The other way, the Cassidy! way, is to ask some subgroup of American finance what they think about other subgroups in finance. High frequency traders think iBankers are dumb and overpaid, but HFT, on the other hand, provides an extremely valuable service – keeping ETFs cheap, providing liquidity and keeping shares at the right level. iBankers think prop-traders add no value, but that without iBanking M&A services, American manufacturing/farmers/whatever would cease functioning. Low speed prop-traders think that HFT just extracts cash from dumb money, but prop-traders are the reddest-blooded American capitalists, taking the right risks and bringing knowledge into the markets. Insurance hates hedge funds, hedge funds hate the bulge bracket, the bulge bracket hates the ratings agencies, who hate insurance and on and on.

You can spit out dozens of articles about these catty and tedious rivalries (invariably claiming that financial sector X, rivals for institutional cash with Y, “adds no value”) and learn nothing about finance. Cassidy writes the article taking the iBankers’ side and surprises no one (this was originally published as an article in The New Yorker).

Ms. McLean holds immense talent. It was always pretty obvious that the bottom twenty-percent, i.e. the vast majority of subprime loan recipients, who are generally poor at planning, were using mortgages to get quick cash rather than buy houses. Regulators and high finance, after resisting for a good twenty years, gave in for reasons explained in Rajan’s essay.

Against Political Capture by Daron Acemoglu (sorry I couldn’t find a URL, for this original essay you’ll have to buy the book).

A legit essay by a future Nobelist in Econ. Read it.

I first came to this country in 1967. I have been either a crypto-anthropologist or professional anthropologist for most of the intervening years. Still, because I came here with an interest in India and took the path of least resistance in choosing to retain India as my principal ethnographic referent, I have always been reluctant to offer opinions about life in these United States.

His instincts were correct. The essay reads like an old man complaining about how bad the weather is these days. Skip it.

Editor Byrne has amazing powers of persuasion, or a lot of authors have had some essays in the desk-drawer they were waiting for an opportunity to publish. In any case, Rogoff and Reinhart (RR hereafter) have summed up a couple hundred studies and two of their books in a single executive summary and given it to whoever buys The Occupy Handbook. Value. RR are Republicans and the essay appears to be written in good faith (unlike some people *cough* Tyler Cowen and Veronique de Rugy *cough*). RR do a great job discovering and presenting stylized facts about financial crises past and present. What to expect next? A couple national defaults and maybe a hyperinflation or two.

Government As Tough Love by Robert Shiller as interviewed by Brandon Adams (buy the book)!

Shiller has always been ahead of the curve. In 1981, he wrote a cornerstone paper in behavioral finance at a time when the field was in its embryonic stages. In the early 1990s, he noticed insufficient attention was paid to real estate values, despite their overwhelming importance to personal wealth levels; this led him to create, along with Karl E. Case, the Case-Shiller index – now the Case-Shiller Home Prices Indices. In March 2000**, Shiller published Irrational Exuberance, arguing that U.S. stocks were substantially overvalued and due for a tumble. [Editor's note: what Brandon Adams fails to mention, but what's surely relevant is that Shiller also called the subprime bubble and re-released Irrational Exuberance in 2005 to sound the alarms a full three years before The Subprime Solution]. In 2008, he published The Subprime Solution, which detailed the origins of the housing crisis and suggested innovative policy responses for dealing with the fallout. These days, one of his primary interests is neuroeconomics, a field that relates economic decision-making to brain function as measured by fMRIs.

Shiller is basically a champ and you should listen to him.

Shiller was disappointed but not surprised when governments bailed out banks in extreme fashion while leaving the contracts between banks and homeowners unchanged. He said, of Hank Paulson, “As Treasury secretary, he presented himself in a very sober and collected way…he did some bailouts that benefited Goldman Sachs, among others. And I can imagine that they were well-meaning, but I don’t know that they were totally well-meaning, because the sense of self-interest is hard to clean out of your mind.”

Shiller understates everything.

And so, we close our discussion of part I. Moving on to part II:

In Ms. Byrne’s own words:

Part 2, “Where We Are Now,” which covers the present, both in the United States and abroad, opens with a piece by the anthropologist David Graeber. The world of Madison Avenue is far from the beliefs of Graeber, an anarchist, but it’s Graeber who arguably (he says he didn’t do it alone) came up with the phrase “We Are the 99 percent.” As Bloomberg Businessweek pointed out in October 2011, during month two of the Occupy encampments that Graeber helped initiate and three months after the publication of his Debt: The First 5,000 Years, “David Graeber likes to say that he had three goals for the year: promote his book, learn to drive, and launch a worldwide revolution. The first is going well, the second has proven challenging and the third is looking up.” Graeber’s counterpart in Chile can loosely be said to be Camila Vallejo, the college undergraduate, pictured on page 219, who, at twenty-three, brought the country to a standstill. The novelist and playwright Ariel Dorfman writes about her and about his own self-imposed exile from Chile, and his piece is followed by an entirely different, more quantitative treatment of the subject. This part of the book also covers the indignados in Spain, who before Occupy began, “occupied” the public squares of Madrid and other cities – using, as the basis for their claim that the parks could legally be slept in, a thirteenth-century right granted to shepherds who moved, and still move, their flocks annually.

In other words, we’re in “Occupy is the hero we deserve, but not the hero we need” territory here.

*Addendum 1: Some have suggested that it’s not the wealth inequality that ought to be reduced, but the democratic elements of our system. California’s terrible decision-making resulting from its experiments with direct democracy notwithstanding, I would like to stay in the realm of the sane.

**Addendum 2: Yes, Shiller managed to get the book published the week before the crash. Talk about market timing.

## Guest Post SuperReview Part I of VI: The Occupy Handbook

Whassup.

It has become a truism that as the amount of news and information generated per moment continues to grow, so too does the value of aggregation, curation and editing. A point less commonly made is that these aggregators are often limited by time, in the sense that, whatever the topic, the value of news for the median reader decays extremely rapidly. Some extremists even claim that it’s useless to read the newspaper, so rapidly do things change. The forty-eight-hour news cycle, in addition to destroying context, has made it impossible for both reporters and viewers to learn from history. See “Is News Memoryless?” (Kautsky et al. 2014).

A more promising approach to news aggregation (for those who read the news with purpose) is to organize pieces by subject and publish those articles in a book. Paul Krugman did this for himself in The Great Unraveling, bundling selected columns from 1999 to 2003 into a single book, with chapters organized by subject and proceeding chronologically. While the rise and rise of Krugman’s real-time blogging virtually guarantees he’ll never make such an effort again, a more recent try came from uber-journalist Michael Lewis in Panic!: The Story of Modern Financial Insanity. Financial journalists’ myopic perspective at any given point in time makes financial column compilations of years past particularly fun(ny) to read.

Nothing is staler than yesterday’s Wall Street Journal (financial news spoils quickly) and reading WSJ or Barron’s pieces from 10 to 20 years ago is just painful.

The title PANIC: The story of modern financial insanity led me to believe the book was about the current crises. The book does say, in very, very fine print “Edited by” Michael Lewis.

-Fritz Krieger, Amazon Reviewer and chief scientist at ISIS

Unfortunately, some philistines became angry in 2008 when they insta-purchased a book called Panic! by Michael Lewis and to their horror, discovered that it contained information about prior financial crises, the nerve of the author to bring us historical perspective, even worse…some of that perspective relating to nations other than the ole’ US of A.

As the more alert readers have noted, almost nothing in the book concerns the 2008 Credit Meltdown, but instead this is merely a collection of news clippings and old magazine articles about past financial crises. You might as well visit a chiropodist’s office and offer them a couple of bucks for their old magazines.

Granted, the articles are by some of today’s finest and most celebrated journalists (although some of the news clippings are unsigned), but do you really want to read more about the 1987 crash or the 1997 collapse of the Thai Baht?

Perhaps you do, but whoever threw this book together wasn’t very particular about the articles chosen. Page 193 reprints an article from “Barron’s” of March, 2000 in which Jack Willoughby presents a long list of Internet companies that he considered likely to run out of cash by 2001. “Some can raise more funds through stock and bond offerings,” he warns. “Others will be forced to go out of business. It’s Darwinian capitalism at work.” True, many of the companies he listed did go belly-up, but on his list of the doomed are
[..]Amazon.com

- Someone named Keith Otis Edwards

Perhaps because I was abroad for both the initial disaster and the entire Occupation of Zuccotti Park, both events have held my attention. So it is with a mixture of hope and apprehension that I picked up Princeton alum Janet Byrne’s The Occupy Handbook from the public library. The Occupy Handbook is a collection of essays written from 2010 to 2011 by an assortment of first- and second-rate authors that attempt to: show what Wall Street does and what it did that led to the most recent crash, explain why our policy apparatus was paralyzed in response to the crash, describe how OWS arose and how it compared with concurrent international movements and prior social movements in the US, and perhaps most importantly, provide policy solutions for the 99% in finance and economics. Janet Byrne begins with a heartfelt introduction:

One fall morning I stood outside the Princeton Club, on West 43rd Street in Manhattan. Occupy Wall Street, which I had visited several times as a sympathetic outsider, has passed its one month anniversary, and I thought the movement might be usefully analyzed by economists and financial writers whose pieces I would commission and assemble into a book that was analytical and – this was what really interested me – prescriptive. I’d been invited to breakfast to talk about the idea with a Princeton Club member and had arrived early out of nervousness.

It seemed a strange place to be discussing the book. I tried the idea out on a young bellhop…

And so it continues. The book is divided into three parts. Part I, broadly speaking, tries to give some economic background on the crash and the ensuing political instability that the crash engendered, up to the first occupation of Zuccotti Park. Part II, broadly speaking, describes the events in Zuccotti Park and around the world as they were in those critical months of fall 2011. Part III, broadly speaking, prescribes solutions to the current depression. I say broadly speaking because, as you will see, several essays appear to be in the wrong part and, in the worst cases, in the wrong book.

## Data science code of conduct, Evgeny Morozov

I’m going on an 8-day long trip to Seattle with my family this morning and I’m taking the time off from mathbabe. But don’t fret! I have a crack team of smartypants skeptics who are writing for me while I’m gone. I’m very much looking forward to seeing what Leon and Becky come up with.

In the meantime, I’ll leave you with two things I’m reading today.

First, a proposed Data Science Code of Professional Conduct. I don’t know anything about the guys at Rose Business Technologies who wrote it except that they’re from Boulder, Colorado and have had lots of fancy consulting gigs. But I am really enjoying their proposed Data Science Code. An excerpt from the code after they define their terms:

(c)  A data scientist shall rate the quality of evidence and disclose such rating to client to enable client to make informed decisions. The data scientist understands that evidence may be weak or strong or uncertain and shall take reasonable measures to protect the client from relying and making decisions based on weak or uncertain evidence.

(d) If a data scientist reasonably believes a client is misusing data science to communicate a false reality or promote an illusion of understanding, the data scientist shall take reasonable remedial measures, including disclosure to the client, and including, if necessary, disclosure to the proper authorities. The data scientist shall take reasonable measures to persuade the client to use data science appropriately.

(e)  If a data scientist knows that a client intends to engage, is engaging or has engaged in criminal or fraudulent conduct related to the data science provided, the data scientist shall take reasonable remedial measures, including, if necessary, disclosure to the proper authorities.

(f) A data scientist shall not knowingly:

1. fail to use scientific methods in performing data science;
2. fail to rank the quality of evidence in a reasonable and understandable manner for the client;
3. claim weak or uncertain evidence is strong evidence;
4. misuse weak or uncertain evidence to communicate a false reality or promote an illusion of understanding;
5. fail to rank the quality of data in a reasonable and understandable manner for the client;
6. claim bad or uncertain data quality is good data quality;
7. misuse bad or uncertain data quality to communicate a false reality or promote an illusion of understanding;
8. fail to disclose any and all data science results or engage in cherry-picking;

Second, my favorite new Silicon Valley curmudgeon is named Evgeny Morozov, and he recently wrote an opinion column in the New York Times. It’s wonderfully cynical and makes me feel like I’m all sunshine and rainbows in comparison – a rare feeling for me! Here’s an excerpt (h/t Chris Wiggins):

Facebook’s Mark Zuckerberg concurs: “There are a lot of really big issues for the world that need to be solved and, as a company, what we are trying to do is to build an infrastructure on top of which to solve some of these problems.” As he noted in Facebook’s original letter to potential investors, “We don’t wake up in the morning with the primary goal of making money.”

Such digital humanitarianism aims to generate good will on the outside and boost morale on the inside. After all, saving the world might be a price worth paying for destroying everyone’s privacy, while a larger-than-life mission might convince young and idealistic employees that they are not wasting their lives tricking gullible consumers to click on ads for pointless products. Silicon Valley and Wall Street are competing for the same talent pool, and by claiming to solve the world’s problems, technology companies can offer what Wall Street cannot: a sense of social mission.

Categories: data science

## Modeling in Plain English

I’ve been enjoying my new job at Johnson Research Labs, where I spend a majority of the time editing my book with my co-author Rachel Schutt. It’s called Doing Data Science (now available for pre-purchase at Amazon), and it’s based on these notes I took last semester at Rachel’s Columbia class.

Recently I’ve been working on Brian Dalessandro‘s chapter on logistic regression. Before getting into the brass tacks of that algorithm, which is especially useful when you are trying to predict a binary outcome (i.e. a 0 or 1 outcome like “will click on this ad”), Brian discusses some common constraints to models.

The one that’s particularly interesting to me is what he calls “interpretability”. His example of an interpretability constraint is really good: it turns out that credit card companies have to be able to explain to people why they’ve been rejected. Brian and I tracked down the rule to this FTC website, which explains the rights of consumers who own credit cards. Here’s an excerpt where I’ve emphasized the key sentences:

#### You Also Have The Right To…

• Have credit in your birth name (Mary Smith), your first and your spouse’s last name (Mary Jones), or your first name and a combined last name (Mary Smith Jones).
• Get credit without a cosigner, if you meet the creditor’s standards.
• Have a cosigner other than your spouse, if one is necessary.
• Keep your own accounts after you change your name, marital status, reach a certain age, or retire, unless the creditor has evidence that you’re not willing or able to pay.
• Know whether your application was accepted or rejected within 30 days of filing a complete application.
• Know why your application was rejected. The creditor must tell you the specific reason for the rejection or that you are entitled to learn the reason if you ask within 60 days. An acceptable reason might be: “your income was too low” or “you haven’t been employed long enough.” An unacceptable reason might be “you didn’t meet our minimum standards.” That information isn’t specific enough.
• Learn the specific reason you were offered less favorable terms than you applied for, but only if you reject these terms. For example, if the lender offers you a smaller loan or a higher interest rate, and you don’t accept the offer, you have the right to know why those terms were offered.
• Find out why your account was closed or why the terms of the account were made less favorable, unless the account was inactive or you failed to make payments as agreed.

The result of this rule is that credit card companies must use simple models, probably decision trees, to make their rejection decisions.

It’s a new way to think about modeling choice, to be sure. It doesn’t necessarily make for “better” decisions from the point of view of the credit card company: random forests, a generalization of decision trees, are known to be more accurate, but are arbitrarily more complicated to explain.

So it matters what you’re optimizing for, and in this case the regulators have decided we’re optimizing for interpretability rather than accuracy. I think this is appropriate, given that consumers are at the mercy of these decisions and relatively powerless to act against them (although the FTC site above gives plenty of advice to people who have been rejected, mostly about how to raise their credit scores).
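To make the constraint concrete, here is a minimal sketch of a decision procedure that can always state its reasons. It is a flat rule list rather than a fitted decision tree, and it is not how any actual card issuer works: the `Applicant` fields, the thresholds, and the third reason are all made up for illustration. The first two reason strings are the "acceptable" examples from the FTC excerpt above.

```python
from dataclasses import dataclass


@dataclass
class Applicant:
    annual_income: float   # dollars per year (hypothetical field)
    months_employed: int   # hypothetical field
    credit_score: int      # hypothetical field


# Each rule pairs a plain-English reason with a test that is True when the
# applicant FAILS it. Thresholds are invented for this example.
RULES = [
    ("your income was too low", lambda a: a.annual_income < 25_000),
    ("you haven't been employed long enough", lambda a: a.months_employed < 12),
    ("your credit score was too low", lambda a: a.credit_score < 620),
]


def decide(applicant):
    """Return (approved, reasons), where reasons lists every failed rule."""
    reasons = [text for text, fails in RULES if fails(applicant)]
    return (len(reasons) == 0, reasons)


approved, reasons = decide(
    Applicant(annual_income=18_000, months_employed=30, credit_score=700)
)
print(approved, reasons)
```

Every rejection comes with a specific reason by construction, which is exactly what a vote among hundreds of trees cannot offer without extra work.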

Three points to make about this. First, I’m reading The Bankers’ New Clothes, written by Anat Admati and Martin Hellwig (h/t Josh Snodgrass), which is absolutely excellent – I’m planning to write up a review soon. One thing they explain very clearly is the cost of regulation (specifically, higher capital requirements) from the bank’s perspective versus from the taxpayer’s perspective, and how it genuinely seems “expensive” to a bank but is actually cost-saving to the general public. I think the same thing could be said above for the credit card interpretability rule.

Second, it makes me wonder what else one could regulate in terms of plain English modeling. For example, what would happen if we added that requirement to, say, the teacher value-added model? Would we get much-needed feedback to teachers like, “You don’t have enough student participation”? Oh wait, no. The model only looks at student test scores, so would only be able to give the following kind of feedback: “You didn’t raise scores enough. Teach to the test more.”

In other words, what I like about the “Modeling in Plain English” idea is that you have to be able to first express and second back up your reasons for making decisions. It may not lead to ideal accuracy on the part of the modeler but it will lead to much greater clarity on the part of the modeled. And we could do with a bit more clarity.

Finally, what about online loans? Do they have any such interpretability rule? I doubt it. In fact, if I’m not wrong, they can use any information they can scrounge up about someone to decide on who gets a loan, and they don’t have to reveal their decision-making process to anyone. That seems unreasonable to me.

Categories: data science, modeling, rant

## Data audits and data strategies

There are lots of start-up companies out there that want to have a data team, because they heard somewhere that they should leverage big data, but they don’t know what it really means, what they can expect from such a team, or how to get started. They also don’t really know how to hire qualified people, or what qualifications to look for.

Finally, they often don’t know what kinds of questions are answerable through data, nor what data they should be collecting to answer those questions. So even if they did manage to hire a data scientist or a data team, those guys might be literally sitting on their hands for six months until they have enough data to start work.

It’s a common situation and could end up a big waste of time and money. What these companies need is something I like to call a “data audit” followed by a “data strategy”.

Data Audit

First things first. Do you actually need a data team? Is your company a data science company or is it a traditional-style company that happens to collect data? It would be a waste of resources to form a data team you don’t need. There’s no reason every single company needs to consider itself part of the big data revolution just to be cool.

Here’s how you tell. Let’s say that, as of now, you’re using incoming data to monitor and report on what’s happening with the business and to keep tabs on various indicators to make sure things aren’t going to hell. Absolutely every company should do this, but it honestly could be set up by a good data analyst working closely with the end-users, i.e. the business peeps.

What are the high-level goals of using data in the business? In particular, if you could really know how customers or clients were interacting with your product, would you change the product to respond to the data? Because that feedback loop is the hallmark of a true data science engine (versus data analytics).

Here are some extreme examples to give you an idea of what I’m talking about. If you make shoes, then you need data to see how sales are and which shoes are getting sold faster so you can kick up production in certain areas. You need to see how sales are seasonal so you know to stop making quite so many shoes at a certain point in the deep of winter. But that’s about it, and you should be able to make do with data analysis.

If, on the other hand, you are building a recommendation engine, say for music, then you need to constantly refresh and improve your recommendation model. Your model is your product, and you need a data team.

Not all examples are this easy. Sometimes you can use new kinds of data models to improve your product even if it seems somewhat traditional, depending on how much data you are able to collect about how your clients use your product. It all depends on what kinds of questions you are asking and what data you have access to. Of course, you might want to go out and collect data that you hadn’t bothered to do before, which could bring you from the first category to the second.

Say you decide you really are a data science company, or want to be one. What’s next?

Pose a bunch of questions you think you’ll need to answer and a bunch of data you think should be useful to answer them.

The heart of a data audit is a (preliminary) plan for choosing, collecting, and storing data, as well as figuring out the initial shape of the data pipeline and infrastructure. Do you store data in the cloud? Is it unstructured or do you set up some overnight jobs to put stuff into some type of database? Do you aggregate data and throw some stuff away, or do you keep absolutely everything?

The most important issue above is whether you’re collecting enough data. Truth be told, you could probably throw it all into an unstructured pile on S3 for now and figure out pipelines later. It might not be the best way to do it but if you are short on time and attention, it’s possible, and storage is cheap. But make sure you’re collecting the right stuff!

You’d be surprised how many startups want to ask good questions about their customers to improve their product, and have gone to some trouble to figure out what those questions are, but don’t bother to collect the relevant information. They might do things like count the number of users, or collect a timestamp for whenever a user logs in, but they don’t actually keep track of the interaction. It’s essential that you collect pertinent information if you want to use this data to check things are working or to predict people’s desires or needs.

So if you think customers might be all ditching your site at critical moments, then definitely tag their departure as well as their arrival, and keep track of where they were and what they were doing when they bailed.

Note I’m not necessarily being creepy here. You definitely want to know how people interact with your product and your site, and it doesn’t need to be personal information you’re collecting about your users. It could be kept aggregate. You could find out that 45% of people leave your site when you ask them for their phone number, and then you might decide it’s not worth it to do that.
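Here’s a rough sketch of what “tag their departure” can buy you, assuming you log one row per page view with nothing more personal than an anonymous session id. The event log, the page names, and the `exit_rates` helper are all hypothetical; the point is only that the aggregate answer falls out of very little data.

```python
from collections import Counter

# Hypothetical event log: (session_id, page), in time order within a session.
events = [
    ("s1", "landing"), ("s1", "signup"), ("s1", "phone_number"),
    ("s2", "landing"), ("s2", "signup"), ("s2", "phone_number"), ("s2", "done"),
    ("s3", "landing"), ("s3", "signup"),
    ("s4", "landing"), ("s4", "signup"), ("s4", "phone_number"),
]


def exit_rates(events, final_page="done"):
    """For each page, the share of sessions that visited it and ended there.

    Sessions ending on final_page completed the flow and are not counted
    as exits.
    """
    last_page = {}     # session -> most recent page seen
    visits = Counter() # page -> number of distinct sessions that saw it
    seen = set()
    for session, page in events:
        last_page[session] = page
        if (session, page) not in seen:
            seen.add((session, page))
            visits[page] += 1
    exits = Counter(p for p in last_page.values() if p != final_page)
    return {page: exits[page] / visits[page] for page in visits}


rates = exit_rates(events)
for page, rate in rates.items():
    print(f"{page}: {rate:.0%} of sessions left here")
```

Note that the output is per page, not per person, so the analysis stays aggregate even though the raw log is keyed by session.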

Speaking of creepy, another critical thing to consider during your data audit is privacy controls and encryption methods. Are you saving data legally? Are you protecting it legally? Are you informing your users appropriately about how and what data will be stored? Are you planning to remain consistent with your stated privacy policy? Do you respect people’s “Do Not Track” option?

At the end of a data audit, you might still have only a vague idea of what exactly you can do with your data, but you should have a bunch of possible ideas, as well as guesses at what kind of attributes would contribute to the kind of behavior you’re considering tracking.

Then, after you start collecting high-quality data and figuring out the basic questions you care about, you will probably have to wait a few weeks or months to start training and implementing your models. This is a good time to make sure your data infrastructure is in place and doesn’t have major bugs.

Data Strategy

Ok, now you’ve collected lots of data and you also have a bunch of questions you think may be answerable. It’s time to prioritize your questions and form a plan. For each question on your list, you’ll need to think about the following issues:

• Is it a monitor or an algorithm?
• Is it short-term, one-time analysis or should you set it up as a dashboard?
• How much data will you need to train the model?
• What is your expectation of the signal in the data you’re collecting?
• How useful will the results of the model be considering the range of signal and the quality of the answer?
• Do you need to go find proxy data? Should you start now?
• Which algorithms should you consider?
• Is it scalable?
• Can you do a baby version first or does it only make sense to go deep?
• Can you do a simpler version of it that’s much cheaper to build?
• How long will it probably take to train?
• How fast can it update?
• Will it be a pain to integrate it into the realtime system?
• What are the costs if it doesn’t work?
• What are the costs of not trying it? What else could you be doing with that time?
• How is the feedback loop expected to work?
• What is the impact of this model on the users?
• What is the impact of this model on the world at large? This is especially important if you’re creepy. Don’t be creepy.

Also, you need a team to build your models. How do you hire? Who do you hire? Some of these answers depend on your above plan. If there’s a lot of realtime updating for your models you’ll need more data engineers and fewer pure modelers. If you need excellent-looking results from your work you’ll need more data viz nerds.

You should consider hiring a consultant just to interview for you. It’s really hard to interview for data scientists if nobody is an expert in data science, and you might end up with someone who knows how to sound smart but can’t build anything. Or you could end up with someone who can build anything but has no idea what their choices really mean.

The ultimate goal at the end of a data audit and strategy is to end up with a reasonable expectation of what having a data science team will accomplish, how long it will take, how deep an investment it is, and how to do it.

Categories: data science, modeling

## Team Turnstile: how do NYC neighborhoods recover from extreme weather events?

I wanted to give you the low-down on a data hackathon I participated in this weekend, which was sponsored by the NYU Institute for Public Knowledge on the topic of climate change and social information. We were assigned teams and given a very broad mandate. We had only 24 hours to do the work, so it had to be simple.

Our team consisted of Venky Kannan, Tom Levine, Eric Schles, Aaron Schumacher, Laura Noren, Stephen Fybish, and me.

We decided to think about the effects of super storms on different neighborhoods. In particular, we set out to measure the recovery time of subway ridership in various neighborhoods, using census information to characterize those neighborhoods. Our project was inspired by this “nofarehikes” map of New York which tries to measure the impact of a fare hike on the different parts of New York. Here’s a copy of our final slides.

Also, it’s not directly related to climate change, but rather rests on the assumption that with climate change comes more frequent extreme weather events, which seems to be an existing myth (please tell me if the evidence is or isn’t there for that myth).

We used three data sets: subway ridership by turnstile, which only exists since May 2010, the census of 2010 (which is kind of out of date but things don’t change that quickly) and daily weather observations from NOAA.

Using the weather map and relying on some formal definitions while making up some others, we came up with a timeline of extreme weather events:

Then we looked at subway daily ridership to see the effect of the storms or the recovery from the storms:

We broke it down to individual stations. Here’s a closeup around Sandy:

Then we used the census tracts to understand wealth in New York:

And of course we had to know which subway stations were in which census tracts. This isn’t perfect because we didn’t have time to assign “empty” census tracts to some nearby subway station. There are on the order of 2,000 census tracts but only on the order of 800 subway stations. But again, 24 hours isn’t a lot of time, even to build clustering algorithms.
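
For what it’s worth, here is a minimal sketch of the step we skipped: assigning every census tract, including the “empty” ones, to its nearest subway station. This is not what we built at the hackathon. The function names, the use of one latitude/longitude point per tract, and the coordinates below are all made up for illustration.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))


def assign_tracts_to_stations(tracts, stations):
    """Map each tract id to the id of the nearest station.

    Both arguments are dicts of id -> (lat, lon). Each tract is
    represented by a single point (an assumption; say, its centroid).
    """
    assignment = {}
    for tract_id, (tlat, tlon) in tracts.items():
        assignment[tract_id] = min(
            stations,
            key=lambda s: haversine_km(tlat, tlon, *stations[s]),
        )
    return assignment


# Hypothetical coordinates, not real tract or station data.
stations = {
    "station_a": (40.7527, -73.9772),
    "station_b": (40.6782, -73.9442),
}
tracts = {
    "tract_1": (40.7580, -73.9855),
    "tract_2": (40.6700, -73.9500),
    "tract_3": (40.7000, -73.9600),
}
assignment = assign_tracts_to_stations(tracts, stations)
```

With roughly 2,000 tracts and 800 stations the brute-force comparison is cheap enough, so no spatial index is needed. Nearest-by-distance is a crude proxy for which station people actually use, which is another caveat to add to the list.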

Finally, we attempted to put the data together to measure which neighborhoods have longer-than-expected recovery times after extreme weather events. This is our picture:

Interestingly, it looks like the neighborhoods of Manhattan are most impacted by severe weather events, which is not in line with our prior [Update: I don't think we actually computed the impact on a given resident, but rather just the overall change in rate of ridership versus normal. An impact analysis would take into account the relative wealth of the neighborhoods and would probably look very different].
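
To make “recovery time” concrete, here is one possible definition, sketched under assumptions: take the median daily ridership over the two weeks before the event as “normal”, and count the days until ridership first gets back to 90% of that. The window, the threshold, and the function name are my choices for illustration and not necessarily what our team computed; the numbers are made up.

```python
from statistics import median


def recovery_days(daily_counts, event_index, baseline_days=14, threshold=0.9):
    """Days after the event until ridership first reaches threshold * baseline.

    daily_counts: list of daily ridership counts for one station.
    event_index:  position in the list of the day the event hit.
    Returns None if ridership never recovers within the series.
    """
    if event_index <= 0:
        raise ValueError("need at least one pre-event day for a baseline")
    start = max(0, event_index - baseline_days)
    # "Normal" ridership: median of the days just before the event.
    baseline = median(daily_counts[start:event_index])
    for offset, count in enumerate(daily_counts[event_index:]):
        if count >= threshold * baseline:
            return offset
    return None


# Made-up series: two normal weeks, then a storm and a gradual return.
counts = [1000] * 14 + [100, 300, 600, 850, 950, 1000]
days = recovery_days(counts, event_index=14)
```

This ignores the weekday/weekend cycle, so a real version would want to compare each day against the same day of the week in the baseline period. It also measures change in ridership relative to normal, not the impact on a given resident.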

There are tons of caveats; I’ll mention only a few here:

• We didn’t have time to measure the extent to which the recovery time took longer because the subway stopped versus other reasons people might not use the subway. But our data is good enough to do this.
• Our data might have been overwhelmingly biased by Sandy. We’d really like to do this with much longer-term data, but the granular subway ridership data has not been available for long. But the good news is we can do this from now on.
• We didn’t have bus data at the same level, which is a huge part of whether someone can get to work, especially in the outer boroughs. This would have been great and would have given us a clearer picture.
• When someone can’t get to work, do they take a car service? How much does that cost? We’d love to have gotten our hands on the alternative ways people got to work and how that would impact them.
• In general we’d have liked to measure the impact relative to each neighborhood’s median salary.
• We would also have loved to have measured the extent to which each neighborhood consisted of salary versus hourly wage earners to further understand how a loss of transportation would translate into an impact on income.

## Unintended Consequences of Journal Ranking

I just read this paper, written by Björn Brembs and Marcus Munafò and entitled “Deep Impact: Unintended consequences of journal rank”. It was recently posted on the Computer Science arXiv (h/t Jordan Ellenberg).

I’ll give you a rundown on what it says, but first I want to applaud the fact that it was written in the first place. We need more studies like this, which examine the feedback loop of modeling at a societal level. Indeed this should be an emerging scientific or statistical field of study in its own right, considering how many models are being set up and deployed on the general public.

Here’s the abstract:

Much has been said about the increasing bureaucracy in science, stifling innovation, hampering the creativity of researchers and incentivizing misconduct, even outright fraud. Many anecdotes have been recounted, observations described and conclusions drawn about the negative impact of impact assessment on scientists and science. However, few of these accounts have drawn their conclusions from data, and those that have typically relied on a few studies. In this review, we present the most recent and pertinent data on the consequences that our current scholarly communication system has had on various measures of scientific quality (such as utility/citations, methodological soundness, expert ratings and retractions). These data confirm previous suspicions: using journal rank as an assessment tool is bad scientific practice. Moreover, the data lead us to argue that any journal rank (not only the currently-favored Impact Factor) would have this negative impact. Therefore, we suggest that abandoning journals altogether, in favor of a library-based scholarly communication system, will ultimately be necessary. This new system will use modern information technology to vastly improve the filter, sort and discovery function of the current journal system.

The key points in the paper are as follows:

• There’s a growing importance of science and trust in science
• There’s also a growing rate (x20 from 2000 to 2010) of retractions, with scientific misconduct cases growing even faster to become the majority of retractions (to an overall rate of 0.02% of published papers)
• There’s a larger and growing “publication bias” problem – in other words, an increasing unreliability of published findings
• One problem: initial “strong effects” get published in high-ranking journals, but subsequent “weak results” (which are probably more reasonable) are published in low-ranking journals
• The formal “Impact Factor” (IF) metric for rank is highly correlated to “journal rank”, defined below.
• There’s a higher incidence of retraction in high-ranking (measured through “high IF”) journals.
• “A meta-analysis of genetic association studies provides evidence that the extent to which a study over-estimates the likely true effect size is positively correlated with the IF of the journal in which it is published”
• Can the higher retraction rate in high-rank journals be explained by the higher visibility of those journals? They think not. Journal rank is a bad predictor of future citations, for example. [mathbabe inserts her opinion: this part needs more argument.]
• “…only the most highly selective journals such as Nature and Science come out ahead over unselective preprint repositories such as ArXiv and RePEc”
• Are there other measures of excellence that would correlate with IF? Methodological soundness? Reproducibility? No: “In fact, the level of reproducibility was so low that no relationship between journal rank and reproducibility could be detected.”
• More about Impact Factor: The IF is a metric for the number of citations to articles in a journal (the numerator), normalized by the number of articles in that journal (the denominator). Sounds good! But:
• For a given journal, IF is not calculated but is negotiated – the publisher can (and does) exclude certain articles (but not citations). Even retroactively!
• The IF is also not reproducible – errors are found and left unexplained.
• Finally, IF is likely skewed by the fat-tailedness of citations (certain articles get lots, most get few). Wouldn’t a more robust measure be given by the median?
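
To see why the median question matters, here is a toy example with made-up citation counts for one hypothetical journal. The per-article mean below is a simplified stand-in for an IF-style ratio (citations divided by articles), not the official calculation.

```python
from statistics import mean, median

# Made-up citation counts: nine ordinary articles and one blockbuster.
citations = [0, 0, 1, 1, 2, 2, 3, 4, 5, 182]

# IF-style average: total citations divided by number of articles.
mean_citations = mean(citations)
# Robust alternative: the citation count of the "typical" article.
median_citations = median(citations)

# The same journal without its single most-cited article.
without_top = sorted(citations)[:-1]
mean_without_top = mean(without_top)
median_without_top = median(without_top)
```

Here one article moves the mean from 2 to 20, while the median stays at 2 either way. With fat-tailed citations, the mean mostly tells you about the handful of blockbusters rather than about the typical article in the journal.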

Conclusion

1. Journal rank is a weak to moderate predictor of scientific impact
2. Journal rank is a moderate to strong predictor of both intentional and unintentional scientific unreliability
3. Journal rank is expensive, delays science and frustrates researchers
4. Journal rank as established by IF violates even the most basic scientific standards, but predicts subjective judgments of journal quality

Long-term Consequences

• “IF generates an illusion of exclusivity and prestige based on an assumption that it will predict subsequent impact, which is not supported by empirical data.”
• “Systemic pressures on the author, rather than increased scrutiny on the part of the reader, inflate the unreliability of much scientific research. Without reform of our publication system, the incentives associated with increased pressure to publish in high-ranking journals will continue to encourage scientists to be less cautious in their conclusions (or worse), in an attempt to market their research to the top journals.”
• “It is conceivable that, for the last few decades, research institutions world-wide may have been hiring and promoting scientists who excel at marketing their work to top journals, but who are not necessarily equally good at conducting their research. Conversely, these institutions may have purged excellent scientists from their ranks, whose marketing skills did not meet institutional requirements. If this interpretation of the data is correct, we now have a generation of excellent marketers (possibly, but not necessarily also excellent scientists) as the leading figures of the scientific enterprise, constituting another potentially major contributing factor to the rise in retractions. This generation is now in charge of training the next generation of scientists, with all the foreseeable consequences for the reliability of scientific publications in the future.”

The authors suggest that we need a new kind of publishing platform. I wonder what they’d think of the Episciences Project.

## Poseurs should not own the backlash against data science poseurs

I’ve noticed a recent trend in coverage of data science. Namely, there’s backlash against the hype and the over-promising, intentional or not, of data science and data scientists. People are beginning to develop smell tests for big data and raise incredulous eyebrows at certain claims.

This is a good thing. We data scientists should welcome the backlash, first because it’s inevitable, and second because it allows us to have a much-needed conversation about how to behave and what is reasonable to claim or even hope for with respect to big data. There is a poseur problem in big data, after all.

But, fellow data nerds, let’s take this as a cue to start an internal discussion about data science skepticism. Let’s make sure that it’s coming from our community, or at least the surrounding technical community, rather than from yet another set of poseurs who don’t actually know what data is and would only serve to lampoon and discredit our emerging field rather than improve it. We should be the ones leading the charge and admitting when we’re full of shit. We need to own the backlash.

Let me give you an example. A serious data scientist friend of mine recently got asked to be interviewed as part of a conversation on data science skepticism. After thinking hard about what her contribution could be, she wrote back to accept the offer, but was then told she was “off the hook” because they’d found someone else who was “perfect for the assignment.” It turned out to be a journalist who had previously interviewed her. That was his credential for this conversation.

But how can you actually have informed skepticism if you are not yourself an expert?

Another example. David Brooks recently wrote a column wherein he declared himself a data science skeptic and then followed that up by referring to no fewer than eight random statistical studies that made no coherent sense and had no overall point. My conclusion: this is the wrong man to lead the charge against poseurs in data science.

If we are going to rebel against big data soundbites, let’s not do it in soundbites. Instead, let’s talk to people on the inside, who see specific problems in the field and are willing to talk openly about them.

I liked the recent Strata talk by Kate Crawford entitled “Untangling Algorithmic Illusions from Reality in Big Data” (h/t Alan Fekete) which discusses bias in data using very concrete examples, and asks us to examine the objectivity of our “facts”.

For example, she talked about a smart phone app that finds potholes in Boston and reports them to the City, and how on the one hand it was cool but on the other it would mean that, if naively applied, richer neighborhoods like Lincoln would get better services than Roxbury. She explained an important point: data analysis is not objective, which most people know. But often the data itself is not either – it was collected in a certain way with particular selection biases.

We need more conversations like this or else we will be leaving a hole which will be filled with loud, uninformed skeptics who would be right to raise the alarm.

One last thing. I’m aware that tons of people, especially serious academic statisticians and computer scientists, criticize data scientists for a totally different reason, namely that we are overly self-promoting (although academics have their own status plays).

But I don’t apologize for that. The truth is, a data scientist is a hybrid between a business person and a researcher. And this is a good thing, not a bad thing: it means the world gets direct access to the modeler, and can challenge any hyperbolic claims by asking for details, rather than having to go through a marketing person who acts (usually quite poorly) as a nerd interpreter. I for one would rather represent my work directly to the world (and be called a self-promoter) than be kept in the back room.

Categories: data science, rant