Archive for the ‘data science’ Category

Algorithmic Accountability Reporting: On the Investigation of Black Boxes

Tonight I’m going to be on a panel over at Columbia’s Journalism School called Algorithmic Accountability Reporting: On the Investigation of Black Boxes. It’s being organized by Nick Diakopoulos, Tow Fellow and previous guest blogger on mathbabe. You can sign up to come here and it will also be livestreamed.

The other panelists are Scott Klein from ProPublica and Clifford Stein from Columbia. I’m super excited to meet them.

Unlike some panel discussions I’ve been on, where the panelists talk about some topic they choose for a few minutes each and then there are questions, this panel will be centered around a draft of a paper coming from the Tow Center at Columbia. First Nick will present the paper and then the panelists will respond to it. Then there will be Q&A.

I wish I could share it with you but it doesn’t seem publicly available yet. Suffice it to say it has many elements in common with Nick’s guest post on raging against the algorithms, and its overall goal is to understand how investigative journalism should handle a world filled with black box algorithms.

Super interesting stuff, and I’m looking forward to tonight, even if it means I’ll miss the New Day New York rally in Foley Square tonight.

Categories: data science, modeling

“People analytics” embeds old cultural problems in new mathematical models

Today I’d like to discuss a recent article from the Atlantic entitled “They’re watching you at work” (hat tip Deb Gieringer).

In the article they describe what they call “people analytics,” which refers to the new suite of managerial tools meant to help find and evaluate employees of firms. The first generation of this stuff happened in the 1950’s, and relied on stuff like personality tests. It didn’t seem to work very well and people stopped using it.

But maybe this new generation of big data models can be super useful? Maybe they will give us an awesome way of throwing away people who won’t work out more efficiently and keeping those who will?

Here’s an example from the article. Royal Dutch Shell sources ideas for “business disruption” and wants to know which ideas to look into. There’s an app for that, apparently, written by a Silicon Valley start-up called Knack.

Specifically, Knack had a bunch of the ideamakers play a video game, and they presumably also were given training data on which ideas historically worked out. Knack developed a model and was able to give Royal Dutch Shell a template for which ideas to pursue in the future based on the personality of the ideamakers.

From the perspective of Royal Dutch Shell, this represents a huge timesaving. But from my perspective it means that whatever process the dudes at Royal Dutch Shell developed for vetting their ideas has now been effectively set in stone, at least for as long as the algorithm is being used.

I’m not saying they won’t save time, they very well might. I’m saying that, whatever their process used to be, it’s now embedded in an algorithm. So if they gave preference to a certain kind of arrogance, maybe because the people in charge of vetting identified with that, then the algorithm has encoded it.

One consequence is that they might very well pass on really excellent ideas that happened to have come from a modest person – no discussion necessary on what kind of people are being invisibly ignored in such a set-up. Another consequence is that they will believe their process is now objective because it’s living inside a mathematical model.

The article compares this to the “blind auditions” for orchestras example, where people are kept behind a curtain so that the listeners don’t give extra consideration to their friends. Famously, the consequence of blind auditions has been way more women in orchestras. But that’s an extremely misleading comparison to the above algorithmic hiring software, and here’s why.

In the blind auditions case, the people measuring the musician’s ability have committed themselves to exactly one clean definition of readiness for being a member of the orchestra, namely the sound of the person playing the instrument. And they accept or deny someone, sight unseen, based solely on that evaluation metric.

Whereas with the idea-vetting process above, the training data consisted of “previous winners,” who presumably had to go through a series of meetings and convince everyone in the meeting that their idea had merit, and that they could manage the team to try it out, and all sorts of other things. Their success relied, in other words, on a community’s support of their idea and their ability to command that support.

In other words, imagine that, instead of listening to someone playing trombone behind a curtain, their evaluation metric was to compare a given musician to other musicians that had already played in a similar orchestra and, just to make it super success-based, had made first seat.

Then you’d have a very different selection criterion, and a very different algorithm. It would be based on all sorts of personality issues, and community bias and buy-in issues. In particular you’d still have way more men.

The fundamental difference here is one of transparency. In the blind auditions case, everyone agrees beforehand to judge on a single transparent and appealing dimension. In the black box algorithms case, you’re not sure what you’re judging things on, but you can see when a candidate comes along that is somehow “like previous winners.”

One of the most frustrating things about this industry of hiring algorithms is how unlikely it is to actively fail. It will save time for its users, since after all computers can efficiently throw away “people who aren’t like people who have succeeded in your culture or process” once they’ve been told what that means.

The most obvious consequence of using this model, for the companies that use it, is that they’ll get more and more people just like the people they already have. And that’s surprisingly unnoticeable for people in such companies.

My conclusion is that these algorithms don’t make things objective, they make things opaque. And they embed our old cultural problems in new mathematical models, giving us a false badge of objectivity.

Categories: data science, modeling, rant

Cool open-source models?

I’m looking to develop my idea of open models, which I motivated here and started to describe here. I wrote the post in March 2012, but the need for such a platform has only become more obvious.

I’m lucky to be working with a super fantastic python guy on this, and the details are under wraps, but let’s just say it’s exciting.

So I’m looking to showcase a few good models to start with, preferably in python, but the critical ingredient is that they’re open source. They don’t have to be great, because the point is to see their flaws and possibly improve them.

  1. For example, I put in a FOIA request a couple of days ago to get the current teacher value-added model from New York City.
  2. A friend of mine, Marc Joffe, has an open source municipal credit rating model. It’s not in python but I’m hopeful we can work with it anyway.
  3. I’m in search of an open source credit scoring model for individuals. Does anyone know of something like that?
  4. They don’t have to be creepy! How about a Nate Silver – style weather model?
  5. Or something that relies on open government data?
  6. Can we get the Reinhart-Rogoff model?

The idea here is to get the model, not necessarily the data (although even better if it can be attached to data and updated regularly). And once we get a model, we’d build interactives with the model (like this one), or at least the tools to do so, so other people could build them.

At its core, the point of open models is this: you don’t really know what a model does until you can interact with it. You don’t know if a model is robust unless you can fiddle with its parameters and check. And finally, you don’t know if a model is best possible unless you’ve let people try to improve it.

Twitter and its modeling war

I often talk about the modeling war, and I usually mean the one where the modelers are on one side and the public is on the other. The modelers are working hard trying to convince or trick the public into clicking or buying or consuming or taking out loans or buying insurance, and the public is on the other, barely aware that they’re engaging in anything at all resembling a war.

But there are plenty of other modeling wars that are being fought by two sides which are both sophisticated. To name a couple, Anonymous versus the NSA and Anonymous versus itself.

Here’s another, and it’s kind of bland but pretty simple: Twitter bots versus Twitter.

This war arose from the fact that people care about how many followers someone on Twitter has. It’s a measure of a person’s influence, albeit a crappy one for various reasons (and not just because it’s being gamed).

The high impact of the follower count means it’s in a wannabe celebrity’s best interest to juice their follower numbers, which introduces the idea of fake twitter accounts to game the model. This is an industry in itself, and an associated arms race of spam filters to get rid of them. The question is, who’s winning this arms race and why?

Twitter has historically made some strides in finding and removing such fake accounts with the help of some modelers who actually bought the services of a spammer and looked carefully at what their money bought them. Recently though, at least according to this WSJ article, it looks like Twitter has spent less energy pursuing the spammers.

That raises the question, why? After all, Twitter theoretically has a lot at stake. Namely, its reputation, because if everyone knows how gamed the system is, they’ll stop trusting it. On the other hand, that argument only really holds if people have something else to use instead as a better proxy of influence.

Even so, considering that Twitter has a bazillion dollars in the bank right now, you’d think they’d spend a few hundred thousand a year to prevent their reputation from being too tarnished. And maybe they’re doing that, but the spammers seem to be happily working away in spite of that.

And judging from my experience on Twitter recently, there are plenty of spammers actively degrading the user experience. That brings up my final point, which is that the lack-of-competition argument at some point gives way to the “I don’t want to be spammed” user experience argument. At some point, if Twitter doesn’t maintain standards, people will just not spend time on Twitter, and its proxy of influence will fall out of favor for that more fundamental reason.

Categories: data science, modeling

Crisis Text Line: Using Data to Help Teens in Crisis

This morning I’m helping out at a datadive event set up by DataKind (apologies to Aunt Pythia lovers).

The idea is that we’re analyzing metadata around a texting hotline for teens in crisis. We’re trying to see if we can use the information we have on these texts (timestamps, character length, topic – which is most often suicide – and outcome reported by both the texter and the counselor) to help the counselors improve their responses.

For example, right now counselors can be in up to 5 conversations at a time – is that too many? Can we figure that out from the data? Is there too much waiting between texts? Other questions are listed here.

Our “hackpad” is located here, and will hopefully be updated like a wiki with results and visuals from the exploration of our group. It looks like we have a pretty amazing group of nerds over here looking into this (mostly python users!), and I’m hopeful that we will be helping the good people at Crisis Text Line.

There is no “market solution” for ethics

We saw what happened in finance with self-regulation and ethics. Let’s prepare for the exact same thing in big data.


Remember back in the 1970’s through the 1990’s, the powers that were decided that we didn’t need to regulate banks because “they” wouldn’t put “their” best interests at risk? And then came the financial crisis, and most recently Alan Greenspan’s admission that he’d got it kinda wrong but not really.

Let’s look at what the “self-regulated market” in derivatives has bestowed upon us. We’ve got a bunch of captured regulators and a huge group of bankers who insist on keeping derivatives opaque so that they can charge clients bigger fees, not to mention that they insist on not having fiduciary duties to their clients, and oh yes, they’d like to continue to bet depositors’ money on those derivatives. They wrote the regulation themselves for that one. And this is after they blew up the world and got saved by the taxpayers.

Given that the banks write the regulations, it’s arguably still kind of a self-regulated market in finance. So we can see how ethics has been and is faring in such a culture.

The answer is, not well. Just in case the last 5 years of news articles weren’t enough to persuade you of this fact, here’s what NY Fed Chief Dudley had to say recently about big banks and the culture of ethics, from this Huffington Post article:

“Collectively, these enhancements to our current regime may not solve another important problem evident within some large financial institutions — the apparent lack of respect for law, regulation and the public trust,” he said.

“There is evidence of deep-seated cultural and ethical failures at many large financial institutions,” he continued. “Whether this is due to size and complexity, bad incentives, or some other issues is difficult to judge, but it is another critical problem that needs to be addressed.”

Given that my beat is now more focused on the big data community and less on finance, mostly since I haven’t worked in finance for almost 2 years, this kind of stuff always makes me wonder how ethics is faring in the big data world, which is, again, largely self-regulated.

Big data

According to this ComputerWorld article, things are pretty good. I mean, there are the occasional snafus – unappreciated sensors or unreasonable zip code gathering examples – but the general idea is that, as long as you have a transparent data privacy policy, you’ll be just fine.

Examples of how awesome “transparency” is in these cases vary from letting people know what cookies are being used (BlueKai), to promising not to share certain information between vendors (Retention Science), to allowing customers a limited view into their profiling by Acxiom, the biggest consumer information warehouse. Here’s what I assume a typical reaction might be to this last one.

Wow! I know a few things Acxiom knows about me, but probably not all! How helpful. I really trust those guys now.

Not a solution

What’s great about letting customers know exactly what you’re doing with their data is that you can then turn around and complain that customers don’t understand or care about privacy policies. In any case, it’s on them to evaluate and argue their specific complaints. Which of course they don’t do, because they can’t possibly do all that work and have a life, and if they really care they just boycott the product altogether. The result in any case is a meaningless, one-sided conversation where the tech company only hears good news.

Oh, and you can also declare that customers are just really confused and don’t even know what they want:

In a recent Infosys global survey, 39% of the respondents said that they consider data mining invasive. And 72% said they don’t feel that the online promotions or emails they receive speak to their personal interests and needs.

Conclusion: people must want us to collect even more of their information so they can get really really awesome ads.

Finally, if you make the point that people shouldn’t be expected to be data mining and privacy experts to use the web, the issue of a “market solution for ethics” is raised.

“The market will provide a mechanism quicker than legislation will,” he says. “There is going to be more and more control of your data, and more clarity on what you’re getting in return. Companies that insist on not being transparent are going to look outdated.”

Back to ethics

What we’ve got here is a repeat problem. The goal of tech companies is to make money off of consumers, just as the goal of banks is to make money off of investors (and taxpayers as a last resort).

Given how much these incentives clash, the experts on the inside have figured out a way of continuing to do their thing, make money, and at the same time, keeping a facade of the consumer’s trust. It’s really well set up for that since there are so many technical terms and fancy math models. Perfect for obfuscation.

If tech companies really did care about the consumer, they’d help set up reasonable guidelines and rules on these issues, which could easily be turned into law. Instead they send lobbyists to water down any and all regulation. They’ve even recently created a new superPAC for big data (h/t Matt Stoller).

And although it’s true that policy makers are totally ignorant of the actual issues here, that might be because of the way big data professionals talk down to them and keep them ignorant. It’s obvious that tech companies are desperate for policy makers to stay out of any actual informed conversation about these issues, never mind the public.


There never has been, nor will there ever be, a market solution for ethics so long as the basic incentives between the public and an industry are so misaligned. The public needs to be represented somehow, and without rules and regulations, and without leverage of any kind, that will not happen.

Categories: data science, finance, modeling

How do you know when you’ve solved your data problem?

I’ve been really impressed by how consistently people have gone to read my post “K-Nearest Neighbors: dangerously simple,” which I wrote back in April. Here’s a timeline of hits on that post:


Stats for “K-Nearest Neighbors: dangerously simple.” I’ve actually gotten more hits recently.

I think the interest in this post is that people like having myths debunked, and are particularly interested in hearing how even the simple things they thought they understood are possibly wrong, or at least more complicated than they’d been assuming. Either that or it’s just got a real catchy name.

Anyway, since I’m still getting hits on that post, I’m also still getting comments, and just this morning I came across a new comment by someone who calls herself “travelingactuary”. Here it is:

My understanding is that CEOs hate technical details, but do like results. So, they wouldn’t care if you used K-Nearest Neighbors, neural nets, or one that you invented yourself, so long as it actually solved a business problem for them. I guess the problem everyone faces is, if the business problem remains, is it because the analysis was lacking or some other reason? If the business is ‘solved’ is it actually solved or did someone just get lucky? That being so, if the business actually needs the classifier to classify correctly, you better hire someone who knows what they’re doing, rather than hoping the software will do it for you.

Presumably you want to sell something to Monica, and the next n Monicas who show up. If your model finds a whole lot of big spenders who then don’t, your technophobe CEO is still liable to think there’s something wrong.

I think this comment brings up the right question, namely knowing when you’ve solved your data problem, with K-Nearest Neighbors or whichever algorithms you’ve chosen to use. Unfortunately, it’s not that easy.

Here’s the thing, it’s almost never possible to tell if a data problem is truly solved. I mean, it might be a business problem where you go from losing money to making money, and in that sense you could say it’s been “solved.” But in terms of modeling, it’s very rarely a binary thing.

Why do I say that? Because, at least in my experience, it’s rare that you could possibly hope for high accuracy when you model stuff, even if it’s a classification problem. Most of the time you’re trying to achieve something better than random, some kind of edge. Often an edge is enough, but it’s nearly impossible to know if you’ve gotten the biggest edge possible.

For example, say you’re binning people who come to your site into three equally sized groups, as “high spenders,” “medium spenders,” and “low spenders.” So if the model were random, you’d expect a third to be put into each group, and someone who ends up as a big spender would be equally likely to be in any of the three bins.

Next, say you make a model that’s better than random. How would you know that? You can measure it by comparing your model to the random one, in other words by seeing how much better than random you do. So if someone who ends up being a big spender is three times more likely to have been labeled a big spender than a low spender, and twice as likely as a medium spender, you know your model is “working.”

You’d use those numbers, 3x and 2x, as a way of measuring the edge your model is giving you. You might care about other related numbers more, like whether pegged low spenders are actually low spenders. It’s up to you to decide what it means that the model is working. But even when you’ve done that carefully, and set up a daily updated monitor, the model itself still might not be optimal, and you might still be losing money.
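To make that edge measurement concrete, here’s a minimal sketch in Python. Everything here is made up for illustration – the visitor counts, the random labels standing in for a real classifier, and the variable names – so the lifts come out near 1 rather than the 3x and 2x of a model that’s working:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative only: 3,000 site visitors, each with a model label and an
# eventual true spending tier (0 = low, 1 = medium, 2 = high). A real
# model's labels would come from the classifier, not a random draw.
n = 3000
predicted = rng.integers(0, 3, size=n)
actual = rng.integers(0, 3, size=n)

# Among the people who turned out to be high spenders, how often did the
# model label them high vs. medium vs. low?
labels_of_high_spenders = predicted[actual == 2]
counts = np.bincount(labels_of_high_spenders, minlength=3)

# The 3x and 2x edge numbers are these ratios. Since this toy "model" is
# random, both hover around 1, i.e. no edge at all.
lift_vs_low = counts[2] / counts[0]
lift_vs_medium = counts[2] / counts[1]
print(lift_vs_low, lift_vs_medium)
```

Swapping in your actual model’s labels for `predicted` turns this into the edge monitor described above.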

In other words, you can be a bad modeler or a good modeler, and either way when you try to solve a specific problem you won’t really know if you did the best possible job you could have, or someone else could have with their different tools and talents.

Even so, there are standards that good modelers should follow. First and most importantly, you should always set up a model monitor to keep track of the quality of the model and see how it fares over time.  Because why? Because second, you should always assume that, over time, your model will degrade, even if you are updating it regularly or even automatically. It’s of course good to know how crappy things are getting so you don’t have a false sense of accomplishment.
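As a minimal illustration of such a monitor – the function name, the 30-observation window, and the toy degrading classifier are all my own assumptions, not anything from a particular system – a rolling hit rate over time looks like this:

```python
import numpy as np

def monitor_model(y_true, y_pred, window=30):
    """Rolling accuracy of a classifier over time: a daily model monitor
    in miniature. Inputs must already be ordered by time."""
    hits = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
    # Trailing-window mean of accuracy; plot this to watch for decay.
    kernel = np.ones(window) / window
    return np.convolve(hits, kernel, mode="valid")

# Toy usage: a model whose accuracy drifts from 90% down to 60%.
rng = np.random.default_rng(1)
n = 300
truth = rng.integers(0, 2, size=n)
p_correct = np.linspace(0.9, 0.6, n)
correct = rng.random(n) < p_correct
preds = np.where(correct, truth, 1 - truth)

curve = monitor_model(truth, preds)
print(curve[0], curve[-1])  # early accuracy exceeds late accuracy
```

The downward-sloping curve is exactly the degradation you should assume is happening, made visible.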

Keep in mind that just because it’s getting worse doesn’t mean you can easily start over again and do better. But at least you can try, and you will know when it’s worth a try. So, that’s one thing that’s good about admitting your inability to finish anything.

On to the political aspect of this issue. If you work for a CEO who absolutely hates ambiguity – and CEOs are trained to hate ambiguity, as well as trained to never hesitate – and if that CEO wants more than anything to think their data problem has been “solved,” then you might be tempted to argue that you’ve done a phenomenal job just to make her happy. But if you’re honest, you won’t say that, because it ain’t true.

Ironically and for these reasons, some of the most honest data people end up looking like crappy scientists because they never claim to be finished doing their job.

Categories: data science, modeling

The private-data-for-services trade fallacy

I had a great time at Harvard Wednesday giving my talk (prezi here) about modeling challenges. The audience was fantastic and truly interdisciplinary, and they pushed back and challenged me in a great way. I’m glad I went and I’m glad Tess Wise invited me.

One issue that came up is something I want to talk about today, because I hear it all the time and it’s really starting to bug me.

Namely, the fallacy that people, especially young people, are “happy to give away their private data in order to get the services they love on the internet”. The actual quote came from the IBM guy on the congressional subcommittee panel on big data, which I blogged about here (point #7), but I’ve started to hear that reasoning more and more often from people who insist on side-stepping the issue of data privacy regulation.

Here’s the thing. It’s not that people don’t click “yes” on those privacy forms. They do click yes, and I acknowledge that. The real problem is that people generally have no clue what it is they’re trading.

In other words, this idea of an omniscient market participant with perfect information making a well-informed trade, which we’ve already seen is not the case in the actual market, is doubly or triply not the case when you think about young people giving away private data for the sake of a phone app.

Just to be clear about what these market participants don’t know, I’ll make a short list:

  • They probably don’t know that their data is aggregated, bought, and sold by Acxiom, which they’ve probably never heard of.
  • They probably don’t know that Facebook and other social media companies sell stuff about them even if their friends don’t see it and even though it’s often “de-identified”. Think about this next time you sign up for a service like “Bang With Friends,” which works through Facebook.
  • They probably don’t know how good algorithms are getting at identifying de-identified information.
  • They probably don’t know how this kind of information is used by companies to profile users who ask for credit or try to get a job.

Conclusion: people are ignorant of what they’re giving away to play Candy Crush Saga[1]. And whatever it is they’re giving away, it’s something way far in the future that they’re not worried about right now. In any case it’s not a fair trade by any means, and we should stop referring to it as such.

What is it instead? I’d say it’s a trick. A trick which plays on our own impulses and short-sightedness and possibly even a kind of addiction to shiny toys in the form of candy. If you give me your future, I’ll give you a shiny toy to play with right now. People who click “yes” are not signaling that they’ve thought deeply about the consequences of giving their data away, and they are certainly not making the definitive political statement that we don’t need privacy regulation.

1. I actually don’t know the data privacy rules for Candy Crush and can’t seem to find them, for example here. Please tell me if you know what they are.

Categories: data science, modeling, rant

Harvard Applied Statistics workshop today

I’m on an Amtrak train to Boston today to give a talk in the Applied Statistics workshop at Harvard, which is run out of the Harvard Institute for Quantitative Social Science. I was kindly invited by Tess Wise, a Ph.D. student in the Department of Government at Harvard who is organizing this workshop.

My title is “Data Skepticism in Industry” but as I wrote the talk (link to my prezi here) it transformed a bit and now it’s more about the problems not only for data professionals inside industry but for the public as well. So I talk about creepy models and how there are multiple longterm feedback loops having a degrading effect on culture and democracy in the name of short-term profits. 

Since we’re on the subject of creepy, my train reading this morning is this book entitled “Murdoch’s Politics,” which talks about how Rupert Murdoch lives by design in the center of all things creepy. 

Categories: data science, modeling

Disorderly Conduct with Alexis and Jesse #OWS


So there’s a new podcast called Disorderly Conduct which “explores finance without a permit” and is hosted by Alexis Goldstein, whom I met through her work on Occupy the SEC, and Jesse Myerson, an activist and a writer.

I was recently a very brief guest on their “In the Weeds” feature, where I was asked to answer the question, “What is the single best way to rein in the power of Wall Street?” in three minutes. The answers given by:

  1. me,
  2. The Other 98% organizer Nicole Carty (@nacarty),
  3. contributing writer David Dayen (@ddayen),
  4. Americans for Financial Reform Policy Director Marcus Stanley (@MarcusMStanley), and
  5. Marxist militant José Martín (@sabokitty)

can be found here or you can download the episode here.

Occupy Finance video series

We’ve been having our Occupy Finance book club meetings every Sunday, and although our group has decided not to record them, a friend of our group and a videographer in her own right, Donatella Barbarella, has started to interview the authors and post them on YouTube. The first few interviews have made their way to the interwebs:

  1. Linda talking about Chapter 1: Financialization and the 99%.
  2. Me talking about Chapter 2: the Bailout
  3. Tamir talking about Chapter 3: How banks work

Doing Data Science now out!

O’Reilly is releasing the book today. I can’t wait to see a hard copy!! And when I say “hard copy,” keep in mind that all of O’Reilly’s books are soft cover.

Categories: #OWS, data science, finance

*Doing Data Science* now available on Kindle!

My book with Rachel Schutt is now available on Kindle. I’ve tested this by buying it myself and looking at it on my computer’s so-called cloud reader.

Here’s the good news. It is actually possible to do this, and it’s satisfying to see!

Here’s the bad news. The Kindle reader doesn’t render LaTeX well, or for that matter many of the various fonts we use for various reasons. The result is a pretty comical display of formatting inconsistency. In particular, whenever a formula comes up it might seem like we’re

screaming about it

and often the quoted passages come in

very very tiny indeed.

I hope it’s readable. If you prefer less comical formatting, the hard copy edition is coming out on October 22nd, next Tuesday.

Next, a word about the book’s ranking. Amazon has this very generous way of funneling down into categories sufficiently so that the ranking of a given book looks really high. So right now I can see this on the book’s page:

but for a while, before yesterday, it took a few more iterations of digging to get to single digits, so it was more like:

But, you know, I’ll take what I can get to be #1! It’s all about metrics!!!

One last thing, which is that the full title is now “Doing Data Science: Straight Talk from the Frontline” and for the record, I wanted the full title to be something more like “Doing Data Science: the no bullshit approach” but for some reason I was overruled. Whatevs.

Categories: data science

Cumulative covariance plots

One thing I do a lot when I work with data is figure out how to visualize my signals, especially with respect to time.

Lots of things change over time – relationships between variables, for example – and it’s often crucial to get deeply acquainted with how exactly that works with your in-sample data.

Say I am trying to predict “y”: for a data point at time t, we’ll say we try to predict y(t). I’ll take an “x”, a variable that is expected to predict “y”, and demean both series, hopefully in a causal way, renaming them x’ and y’. Then, making sure I’ve ordered everything with respect to time, I’ll plot the cumulative sum of the products x’(t) * y’(t).

In the case that both x'(t) and y'(t) have the both sign – so they’re both bigger than average or they’re both smaller than average, this product is positive, and otherwise it’s negative. So if you plot the cumulative sum, you get an upwards trend if things are positively correlated and downwards trend if things are negatively correlated. If you think about it, you are computing the numerator of the correlation function, so it is indeed just an unscaled version of total correlation.

Plus, since you ordered everything by time first, you can see how the relationship between these variables evolved over time.

Also, in the case that you are working with financial models, you can make a simplifying assumption that both x and y are pretty well demeaned already (especially at short time scales) and this gives you the cumulative PnL plot of your model. In other words, it tells you how much money your model is making.
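Here’s a minimal Python sketch of that recipe. The causal demeaning here uses the running mean of strictly earlier observations, and the synthetic correlated series is my own illustrative choice – neither is meant as the one true way to do it:

```python
import numpy as np

def causal_demean(s):
    # Demean each point by the mean of strictly earlier observations,
    # so no future data leaks into the adjustment.
    s = np.asarray(s, dtype=float)
    past_mean = np.zeros_like(s)
    past_mean[1:] = np.cumsum(s)[:-1] / np.arange(1, len(s))
    return s - past_mean

def cumulative_covariance(x, y):
    # The unscaled correlation numerator described above: with both
    # series already ordered by time, demean causally, then take the
    # cumulative sum of the products x'(t) * y'(t).
    return np.cumsum(causal_demean(x) * causal_demean(y))

# Toy usage: two series that are positively correlated by construction,
# so the cumulative sum should trend upward over time.
rng = np.random.default_rng(2)
x = rng.standard_normal(500)
y = 0.5 * x + rng.standard_normal(500)
curve = cumulative_covariance(x, y)
print(curve[-1])  # large and positive for positively correlated series
```

Plotting `curve` against your timestamps is what exposes regime changes – or, as in the story below, data that was extracted with a bias at either end.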

So I was doing this exercise of plotting the cumulative covariance with some data the other day, and I got a weird picture. It kind of looked like a “U” plot: it went down dramatically at the beginning, then was pretty flat but trending up, then it went straight up at the end. It ended up not quite as high as it started, which is to say that in terms of straight-up overall correlation, I was calculating something negative but not very large.

But what could account for that U-shape? After some time I realized that the data had been extracted from the database in such a way that, after ordering my data by date, it was hugely biased in the beginning and at the end, in different directions, and that this was unavoidable, and the picture helped me determine exactly which data to exclude from my set.

After getting rid of the biased data at the beginning and the end, I concluded that I had a positive correlation here, even though if I’d trusted the overall “dirty” correlation I would have thought it was negative.

This is good information, and confirmed my belief that it’s always better to visualize data over time than it is to believe one summary statistic like correlation.

Categories: data science, modeling

Data Skeptic post

I wrote a blog post for O’Reilly’s website to accompany my essay, On Being a Data Skeptic. Here’s an excerpt:

I left finance pretty disgusted with the whole thing, and because I needed to make money and because I’m a nerd, I pretty quickly realized I could rebrand myself a “data scientist” and get a pretty cool job, and that’s what I did. Once I started working in the field, though, I was kind of shocked by how positive everyone was about the “big data revolution” and the “power of data science.”

Not to underestimate the power of data––it’s clearly powerful! And big data has the potential to really revolutionize the way we live our lives for the better––or sometimes not. It really depends.

From my perspective, this was, in tenor if not in the details, the same stuff we’d been doing in finance for a couple of decades and that fields like advertising were slow to pick up on. And, also from my perspective, people needed to be way more careful and skeptical of their powers than they currently seem to be. Because whereas in finance we need to worry about models manipulating the market, in data science we need to worry about models manipulating people, which is in fact scarier. Modelers, if anything, have a bigger responsibility now than ever before.

Categories: data science, finance, modeling

Guest post: Rage against the algorithms

This is a guest post by Nick Diakopoulos, a Tow Fellow at the Columbia University Graduate School of Journalism where he is researching the use of data and algorithms in the news. You can find out more about his research and other projects on his website or by following him on Twitter. Crossposted from engenhonetwork with permission from the author.


How can we know the biases of a piece of software? By reverse engineering it, of course.

When was the last time you read an online review about a local business or service on a platform like Yelp? Of course you want to make sure the local plumber you hire is honest, or that even if the date is a dud, at least the restaurant isn’t lousy. A recent survey found that 76 percent of consumers check online reviews before buying, so a lot can hinge on a good or bad review. Such sites have become so important to local businesses that it’s not uncommon for scheming owners to hire shills to boost themselves or put down their rivals.

To protect users from getting duped by fake reviews Yelp employs an algorithmic review reviewer which constantly scans reviews and relegates suspicious ones to a “filtered reviews” page, effectively de-emphasizing them without deleting them entirely. But of course that algorithm is not perfect, and it sometimes de-emphasizes legitimate reviews and leaves actual fakes intact—oops. Some businesses have complained, alleging that the filter can incorrectly remove all of their most positive reviews, leaving them with a lowly one- or two-stars average.

This is just one example of how algorithms are becoming ever more important in society, for everything from search engine personalization, discrimination, defamation, and censorship online, to how teachers are evaluated, how markets work, how political campaigns are run, and even how something like immigration is policed. Algorithms, driven by vast troves of data, are the new power brokers in society, both in the corporate world as well as in government.

They have biases like the rest of us. And they make mistakes. But they’re opaque, hiding their secrets behind layers of complexity. How can we deal with the power that algorithms may exert on us? How can we better understand where they might be wronging us?

Transparency is the vogue response to this problem right now. The big “open data” transparency-in-government push that started in 2009 was largely the result of an executive memo from President Obama. And of course corporations are on board too; Google publishes a biannual transparency report showing how often they remove or disclose information to governments. Transparency is an effective tool for inculcating public trust and is even the way journalists are now trained to deal with the hole where mighty Objectivity once stood.

But transparency knows some bounds. For example, though the Freedom of Information Act facilitates the public’s right to relevant government data, it has no legal teeth for compelling the government to disclose how that data was algorithmically generated or used in publicly relevant decisions (extensions worth considering).

Moreover, corporations have self-imposed limits on how transparent they want to be, since exposing too many details of their proprietary systems may undermine a competitive advantage (trade secrets), or leave the system open to gaming and manipulation. Furthermore, whereas transparency of data can be achieved simply by publishing a spreadsheet or database, transparency of an algorithm can be much more complex, resulting in additional labor costs both in creation as well as consumption of that information—a cognitive overload that keeps all but the most determined at bay. Methods for usable transparency need to be developed so that the relevant aspects of an algorithm can be presented in an understandable way.

Given the challenges to employing transparency as a check on algorithmic power, a new and complementary alternative is emerging. I call it algorithmic accountability reporting. At its core it’s really about reverse engineering—articulating the specifications of a system through a rigorous examination drawing on domain knowledge, observation, and deduction to unearth a model of how that system works.

As interest grows in understanding the broader impacts of algorithms, this kind of accountability reporting is already happening in some newsrooms, as well as in academic circles. At the Wall Street Journal a team of reporters probed e-commerce platforms to identify instances of potential price discrimination in dynamic and personalized online pricing. By polling different websites they were able to spot several that were adjusting prices dynamically based on the location of the person visiting the site. At the Daily Beast, reporter Michael Keller dove into the iPhone spelling correction feature to help surface patterns of censorship and see which words, like “abortion,” the phone wouldn’t correct if they were misspelled. In my own investigation for Slate, I traced the contours of the editorial criteria embedded in search engine autocomplete algorithms. By collecting hundreds of autocompletions for queries relating to sex and violence I was able to ascertain which terms Google and Bing were blocking or censoring, uncovering mistakes in how these algorithms apply their editorial criteria.

All of these stories share a more or less common method. Algorithms are essentially black boxes, exposing an input and output without betraying any of their inner organs. You can’t see what’s going on inside directly, but if you vary the inputs in enough different ways and pay close attention to the outputs, you can start piecing together some likeness for how the algorithm transforms each input into an output. The black box starts to divulge some secrets.
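To make the method concrete, here is a toy sketch in Python. The "system" and its hidden pricing rule are entirely made up for illustration; the point is only that we never look inside `black_box` – we just sweep inputs and compare outputs:

```python
# A stand-in for some opaque system: we can call it, but we can't read it.
def black_box(price, zip_code):
    # Hidden rule (unknown to the investigator): a discount for some regions.
    return price * (0.9 if zip_code.startswith("1") else 1.0)

# Vary one input at a time and record the outputs.
observations = []
for zip_code in ["10027", "10013", "94105", "60601"]:
    observations.append((zip_code, black_box(100.0, zip_code)))

# Grouping outputs by input feature exposes the pattern: prices vary by location.
discounted = {z for z, p in observations if p < 100.0}
# discounted == {"10027", "10013"}
```

A real investigation does exactly this at scale, against a live system, with enough input variation to rule out noise – but the logic is the same.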

Algorithmic accountability is also gaining traction in academia. At Harvard, Latanya Sweeney has looked at how online advertisements can be biased by the racial association of names used as queries. When you search for “black names” as opposed to “white names” ads using the word “arrest” appeared more often for online background check service Instant Checkmate. She thinks the disparity in the use of “arrest” suggests a discriminatory connection between race and crime. Her method, as with all of the other examples above, does point to a weakness though: Is the discrimination caused by Google, by Instant Checkmate, or simply by pre-existing societal biases? We don’t know, and correlation does not equal intention. As much as algorithmic accountability can help us diagnose the existence of a problem, we have to go deeper and do more journalistic-style reporting to understand the motivations or intentions behind an algorithm. We still need to answer the question of why.

And this is why it’s absolutely essential to have computational journalists not just engaging in the reverse engineering of algorithms, but also reporting and digging deeper into the motives and design intentions behind algorithms. Sure, it can be hard to convince companies running such algorithms to open up in detail about how their algorithms work, but interviews can still uncover details about larger goals and objectives built into an algorithm, better contextualizing a reverse-engineering analysis. Transparency is still important here too, as it adds to the information that can be used to characterize the technical system.

Despite the fact that forward thinkers like Larry Lessig have been writing for some time about how code is a lever on behavior, we’re still in the early days of developing methods for holding that code and its influence accountable. “There’s no conventional or obvious approach to it. It’s a lot of testing or trial and error, and it’s hard to teach in any uniform way,” noted Jeremy Singer-Vine, a reporter and programmer who worked on the WSJ price discrimination story. It will always be a messy business with lots of room for creativity, but given the growing power that algorithms wield in society it’s vital to continue to develop, codify, and teach more formalized methods of algorithmic accountability. In the absence of new legal measures, it may just provide a novel way to shed light on such systems, particularly in cases where transparency doesn’t or can’t offer much clarity.

New Essay, On Being a Data Skeptic, now out

It is available here and is based on a related essay written by Susan Webber entitled “Management’s Great Addiction: It’s time we recognized that we just can’t measure everything.” It is being published by O’Reilly as an e-book.

No, I don’t know who that woman is looking skeptical on the cover. I wish they’d asked me for a picture of a skeptical person; I think my 11-year-old son would’ve done a better job.

Categories: data science, modeling, musing

Sometimes, The World Is Telling You To Polish Up Your LinkedIn Profile

September 27, 2013 9 comments

The above title was stolen verbatim from an excellent essay by Dan Milstein on the Hut 8 Labs blog (hat tip Deane Yang). The actual title of the essay is “No Deadlines For You! Software Dev Without Estimates, Specs or Other Lies”

He wrote the essay about how, as an engineer, you can both make yourself invaluable to your company and avoid meaningless and arbitrary deadlines on your projects. So, he’s an engineer, but the advice he gives is surprisingly close to the advice I was trying to give on Monday night when I spoke at the Columbia Data Science Society (my slides are here, by the way). More on that below.

Milstein is an engaging writer. He wrote a book called Coding, Fast and Slow, which I now feel like reading just because I enjoy his insights and style. Here’s a small excerpt:

Let’s say you’ve started at a new job, leading a small team of engineers. On your first day, an Important Person comes by your desk. After some welcome-to-the-business chit chat, he/she hands you a spec. You look it over—it describes a new report to add to the company’s product. Of course, like all specs, it’s pretty vague, and, worse, it uses some jargon you’ve heard around the office, but haven’t quite figured out yet.

You look up from the spec to discover that the Important Person is staring at you expectantly: “So, <Your Name>, do you think you and your team can get that done in 3 months?”

What do you do?

Here are some possible approaches (all of which I’ve tried… and none of which has ever worked out well):

  • Immediately try to flesh out the spec in more detail

“How are we summing up this number? Is this piece of data required? What does <jargon word> mean, here, exactly?”

  • Stall, and take the spec to your new team

“Hmm. Hmm. Hmmmmmmmm. Do you think, um, Bob (that’s his name, right?) has the best handle on these kinds of things?”

  • Give the spec a quick skim, and then listen to the seductive voice of System I

“Sure, yeah, 3 months sounds reasonable” (OMG, I wish this wasn’t something I’ve done SO MANY TIMES).

  • Push back aggressively

“I read this incredibly convincing blog post about how it’s impossible to commit to deadlines for software projects, sorry, I just can’t do that.”

He then goes on to write that very blog post. In it, he explains what you should do, which is to learn why the project has been planned in the first place, and what the actual business question is, so you have full context for your job and you know what it means to the company for this to succeed or fail.

The way I say this, regularly, to aspiring data scientists I run into, is that you are often given a data science question that’s been filtered from a business question, through a secondary person who has some idea that they’ve molded that business question into a “mathematical question,” and they want you to do the work of answering that question, under some time constraint and resource constraints that they’ve also picked out of the air.

But often that process has perverted the original aims – often because proxies have magically appeared in the place of the original objects of interest – and it behooves a data scientist who doesn’t want to be working on the wrong problem to go to the original source and verify that their work is answering a vital business question, that they’re optimizing for the right thing, and that they understand the actual constraints (like deadlines but also resources) rather than the artificial constraints made up by whoever is in charge of telling the nerds what to do.

In other words, I suggest that each data scientist “becomes part business person,” and talks to the business owner of the given problem directly until they’re sure they know what needs to get done with data.

Milstein has a bunch of great tips on how to go through with this process, including:

  1. Counting on people’s enjoyment of hearing their own ideas repeated and their fears understood.
  2. Using a specific template when talking to Important People, namely: a) “I’m going to echo that back, to make sure I understand,” b) echo it back, c) “Do I have that right?”
  3. Always thinking about and discussing your work in terms of risks and information for the business – things like, you need this information to address this risk. The point is that your work always stays relevant to the business people while you do your technical thing, which means always keeping a finger on the pulse of the business problem.
  4. Framing choices for the Important Person in terms of clear trade-offs of risk, investment, and completion. This engages the business in your process in a completely understandable way.
  5. Finally, if your manager doesn’t let you talk directly to the Important People in the business, and you can’t convince your manager to change his or her mind, then you might wanna polish up your LinkedIn profile, because otherwise you are fated to work on failed projects. Great advice.

Categories: data science

A Code of Conduct for data scientists from the Bellagio Fellows

September 25, 2013 3 comments

The 2013 PopTech & Rockefeller Foundation Bellagio Fellows – Kate Crawford, Patrick Meier, Claudia Perlich, Amy Luers, Gustavo Faleiros and Jer Thorp – yesterday published “Seven Principles for Big Data and Resilience Projects” on Patrick Meier’s blog iRevolution.

Although they claim that these principles are meant as “best practices for resilience building projects that leverage Big Data and Advanced Computing,” I think they’re more general than that (although I’m not sure exactly what a resilience building project is) and I really like them. They are looking for public comments too. Go to the post for the full description of each, but here is a summary:

1. Open Source Data Tools

Wherever possible, data analytics and manipulation tools should be open source, architecture independent and broadly prevalent (R, python, etc.).

2. Transparent Data Infrastructure

Infrastructure for data collection and storage should operate based on transparent standards to maximize the number of users that can interact with the infrastructure.

3. Develop and Maintain Local Skills

Make “Data Literacy” more widespread. Leverage local data labor and build on existing skills.

4. Local Data Ownership

Use Creative Commons and licenses that state that data is not to be used for commercial purposes.

5. Ethical Data Sharing

Adopt existing data sharing protocols like the ICRC’s (2013). Permission for sharing is essential. How the data will be used should be clearly articulated. An opt-in approach should be the preference wherever possible, and the ability for individuals to remove themselves from a data set after it has been collected must always be an option.

6. Right Not To Be Sensed

Local communities have a right not to be sensed. Large scale city sensing projects must have a clear framework for how people are able to be involved or choose not to participate.

7. Learning from Mistakes

Big Data and Resilience projects need to be open to face, report, and discuss failures.

Upcoming talks

September 23, 2013 4 comments

Tomorrow evening I’m meeting with the Columbia Data Science Society – who, as I understand it, are mostly engineers – and talking to them about “how to think like a data scientist”.

On October 11th I’ll be in D.C. sitting on a panel discussion organized by the Americans for Financial Reform. It’s part of a day-long event on the topic of transparency in financial regulation. The official announcement isn’t out yet but I’ll post it here as soon as I can. I’ll be giving my two cents on what mathematical tools can do and cannot do with respect to this stuff.

On October 16th I’ll again be in D.C. giving a talk in the MAA Distinguished Lecture Series to a mostly high school math teacher audience. My talk is entitled, “Start Your Own Netflix”.

Finally, I’m going to Harvard on October 30th to give a talk in their Applied Statistics Workshop series. I haven’t figured out exactly what I’m talking about but it will be something nerdy and skeptical.

Categories: data science, finance

I’d like you to eventually die

September 22, 2013 24 comments

Google has formally thrown their hat into the “rich people should never die” arena, with an official announcement of their new project called Calico, “a new company that will focus on health and well-being, in particular the challenge of aging and associated diseases”. Their plan is to use big data and genetic research to avoid aging.

I saw this coming when they hired Ray Kurzweil. Here’s an excerpt from my post:

A few days ago I read a New York Times interview of Ray Kurzweil, who thinks he’s going to live forever and also claims he will cure cancer if and when he gets it (his excuse for not doing it in his spare time now: “Well, I mean, I do have to pick my priorities. Nobody can do everything.”). He also just got hired at Google.

Here’s the thing. We need people to die. Our planet cannot sustain all the people currently alive as well as all the people who are going to someday be born. Just not gonna happen. Plus, it would be a ridiculously boring place to live. Think about how boring it is already for young people to be around old people. I bore myself around my kids, and I’m only 30 years older than they are.

And yes, it’s tragic when someone we love actually becomes one of those people whose time has come, especially if they’re young and especially if it seemed preventable. For that matter, I’m all for figuring out how to improve the quality of life for people.

But the idea that we’re going to figure out how to keep alive a bunch of super rich advertising executives just doesn’t seem right – because, let’s face it, there will have to be a way to choose who lives and who dies, and I know who is at the top of that list – and I for one am not on board with the plan. Larry Page, Tim Cook, and Ray Kurzweil: I’d really like it if you eventually died.

On the other hand, I’m not super worried about this plan coming through either. Big data can do a lot but it’s not going to make people live forever. Or let’s say it another way: if they can use big data to make people live forever, they can also use big data to convince me that super special rich white men living in Silicon Valley should take up resources and airtime for the rest of eternity.

Categories: data science, rant

The bursting of the big data bubble

September 20, 2013 42 comments

It’s been a good ride. I’m not gonna lie, it’s been a good time to be a data whiz, a quant-turned-data scientist. I get lots of attention and LinkedIn emails just for my title and my math Ph.D., and it’s flattering. But all of that is going to change, starting now.

You see, there are some serious headwinds. They started a while ago but they’re picking up speed, and the magical wave of hype propelling us forward is giving way. I can tell, I’ve got a nose for sinking ships and sailing metaphors.

First, the hype and why it’s been so strong.

It seems like data and the ability to use data is the secret sauce in so many of the big success stories. Look at Google. They managed to think of the entire web as their data source, and have earned quite a bit of respect and advertising money for their chore of organizing it like a huge-ass free library for our benefit. That took some serious data handling and modeling know-how.

We humans are pretty good at detecting patterns, so after a few companies made it big with the secret data sauce, we inferred that, when you take a normal tech company and sprinkle on data, you get the next Google.

Next, a few reasons it’s unsustainable

Most companies don’t have the data that Google has, and can never hope to cash in on stuff at the scale of the ad traffic that Google sees. Even so, there are lots of smaller but real gains that lots of companies – but not all – could potentially realize if they collected the right kind of data and had good data people helping them.

Unfortunately, this process rarely actually happens the right way, often because the business people ask their data people the wrong questions to begin with, and since they think of their data people as little more than pieces of software – data in, magic out – they don’t get their data people sufficiently involved with working on something that data can address.

Also, since there are absolutely no standards for what constitutes a data scientist, and anyone who’s taken a machine learning class at college can claim to be one, the data scientists walking around often have no clue how to actually form the right questions to ask anyway. They are lopsided data people, and only know how to answer already well-defined questions like the ones that Kaggle comes up with. That’s less than half of what a good data scientist does, but people have no idea what a good data scientist does.

Plus, it’s super hard to accumulate hard evidence that you have a crappy data science team. If you’ve hired one or more unqualified data scientists, how can you tell? They still might be able to implement crappy models which don’t answer the right question, but in order to see that you’d need to also have a good data scientist who implements a better solution to the right question. But you only have one. It’s a counterfactual problem.

Here’s what I see happening. People have invested some real money in data, and they’ve gotten burned with a lack of medium-term results. Now they’re getting impatient for proof that data is an appropriate place to invest what little money their VC’s have offered them. That means they want really short-term results, which means they’re lowballing data science expertise, which means they only attract people who’ve taken one machine learning class and fancy themselves experts.

In other words, data science expertise has been commodified, and it’s a race to the bottom. Who will solve my business-critical data problem on a short-term consulting basis for less than $5000? Less than $4000?

What’s next?

There really is a difference between A) crude models that someone constructs not really knowing what they’re doing and B) thoughtful models which gain an edge along the margin. It requires someone who actually knows what they’re doing to get the latter kind of model. But most people are unaware of even the theoretical difference between type A and type B models, nor would they recognize which type they’ve got once they get one.

Even so, over time, type B models outperform type A models, and if you care enough about the marginal edge between the two types, say because you’re in a competitive environment, then you will absolutely need type B to make money. And by the way, if you don’t care about that marginal edge, then by all means you should use a type A solution. But you should at least know the difference and make that choice deliberately.
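Here is a toy illustration of that marginal edge, with numbers I made up for the purpose. The "type A" model ignores the predictor entirely; the "type B" model actually uses the small but real signal, and out of sample it wins by a margin – not a miracle:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
y = 0.2 * x + rng.normal(size=n)  # a weak but real signal

train, test = slice(0, 1000), slice(1000, n)

# Type A: a crude model that predicts the historical mean, ignoring x.
pred_a = np.full(1000, y[train].mean())

# Type B: a model that actually exploits the (small) edge in x.
beta = np.cov(x[train], y[train])[0, 1] / x[train].var(ddof=1)
pred_b = y[train].mean() + beta * (x[test] - x[train].mean())

mse_a = np.mean((y[test] - pred_a) ** 2)
mse_b = np.mean((y[test] - pred_b) ** 2)
# mse_b comes out modestly lower than mse_a: an edge along the margin
```

If you’re in a competitive environment, that modest gap is the whole ballgame; if you’re not, type A may be a perfectly deliberate choice.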

My forecast is that, once the hype wave of big data is dead and gone, there will emerge reasonable standards of what a data scientist should actually be able to do, and moreover a standard of when and how to hire a good one. It’ll be a rubric, and possibly some tests, of both problem solving and communication.

Personally, I’m looking forward to a more reasonable and realistic vision of how data and data expertise can help with things. I might have to change my job title, but I’m used to it.

Categories: data science
