Archive for the ‘data science’ Category

When big data goes bad in a totally predictable way

Three quick examples this morning in the I-told-you-so category. I’d love to hear Kenneth Neil Cukier explain how “objective” data science is when confronted with this stuff.

1. When an unemployed black woman pretends to be white her job offers skyrocket (Urban Intellectuals, h/t Mike Loukides). Excerpt from the article: “Two years ago, I noticed that had added a “diversity questionnaire” to the site.  This gives an applicant the opportunity to identify their sex and race to potential employers. guarantees that this “option” will not jeopardize your chances of gaining employment.  You must answer this questionnaire in order to apply to a posted position—it cannot be skipped.  At times, I would mark off that I was a Black female, but then I thought, this might be hurting my chances of getting employed, so I started selecting the “decline to identify” option instead.  That still had no effect on my getting a job.  So I decided to try an experiment:  I created a fake job applicant and called her Bianca White.”

2. How big data could identify the next felon – or blame the wrong guy (Bloomberg). From the article: “The use of physical characteristics such as hair, eye and skin color to predict future crimes would raise ‘giant red privacy flags’ since they are a proxy for race and could reinforce discriminatory practices in hiring, lending or law enforcement, said Chi Chi Wu, staff attorney at the National Consumer Law Center.”

3. How algorithms magnify misbehavior (the Guardian, h/t Suresh Naidu). From the article: “For one British university, what began as a time-saving exercise ended in disgrace when a computer model set up to streamline its admissions process exposed – and then exacerbated – gender and racial discrimination.”

This is just the beginning, unfortunately.

Categories: data science, modeling

What’s the difference between big data and business analytics?

I offend people daily. People tell me they do “big data” and that they’ve been doing big data for years. Their argument is that they’re doing business analytics on a larger and larger scale, so surely by now it must be “big data”.


There’s an essential difference between true big data techniques, as actually performed at surprisingly few firms but exemplified by Google, and the human-intervention data-driven techniques referred to as business analytics.

No matter how big the data you use is, at the end of the day, if you’re doing business analytics, you have a person looking at spreadsheets or charts or numbers, making a decision after possibly a discussion with 150 other people, and then tweaking something about the way the business is run.

If you’re really doing big data, then those 150 people probably get fired or laid off, or even more likely are never hired in the first place, and the computer is programmed to update itself via an optimization method.

That’s not to say it doesn’t also spit out monitoring charts and numbers, and it’s not to say no person takes a look every now and then to make sure the machine is humming along, but there’s no point at which the algorithm waits for human intervention.

In other words, in a true big data setup, the human has stepped outside the machine and lets the machine do its thing. That means, of course, that it takes way more to set up that machine in the first place, and probably people make huge mistakes all the time in doing this, but sometimes they don’t. Google search got pretty good at this early on.
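To make the “machine does its thing” idea concrete, here’s a minimal sketch – all names and numbers are hypothetical, not any particular firm’s system – of an online model that adjusts a weight after every single observation via stochastic gradient descent, with no human looking at a chart in between:

```python
import random

# Hypothetical sketch: an online click model that updates itself on every
# observation via stochastic gradient descent -- no human in the loop.

def sgd_update(weight, feature, clicked, lr=0.1):
    """One online update: nudge the weight toward observed behavior."""
    prediction = weight * feature          # crude linear score
    error = prediction - clicked           # positive if we over-predicted
    return weight - lr * error * feature   # standard SGD step

random.seed(0)
weight = 0.0
for _ in range(1000):
    feature = 1.0                                     # trivially constant feature
    clicked = 1.0 if random.random() < 0.3 else 0.0   # true click rate: 30%
    weight = sgd_update(weight, feature, clicked)

# The weight drifts toward the true click rate without anyone reviewing it.
print(round(weight, 2))
```

A real system would run this over thousands of features and still emit monitoring charts, but the point is that the update loop itself never waits for human intervention.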

So with a business analytics setup we might keep track of the number of site visitors and a few sales metrics so we can later try (and fail) to figure out whether a specific email marketing campaign had the intended effect.

But in a big data setup it’s typically much more microscopic and detail-oriented, collecting everything it can, maybe 1,000 attributes of a single customer, and figuring out what that customer is likely to do next time, how much they’ll spend, and the magic question, whether there will even be a next time.

So the first thing I offend people about is that they’re not really part of the “big data revolution”. And the second thing is that, usually, their job is potentially up for grabs by an algorithm.

Categories: data science, modeling

Larry Summers and the Lending Club

So here’s something potential Fed Chair Larry Summers is involved with: a company called Lending Club, which runs a money-lending system that cuts out the middleman, namely banks.

Specifically, people looking for money come to the site and tell their stories, and try to get loans. The investors invest in whichever loans look good to them, for however much money they want. For a perspective on the risks and rewards of this kind of peer-to-peer lending operation, look at this Wall Street Journal article which explains things strictly from the investor’s point of view.

A few red flags go up for me as I learn more about Lending Club.

First, from this NYTimes article, “The company [Lending Club] itself is not regulated as a bank. But it has teamed up with a bank in Utah, one of the states that allows banks to charge high interest rates, and that bank is overseen by state regulators and the Federal Deposit Insurance Corporation.”

I’m not sure how the FDIC is involved exactly, but the Utah connection is good for something, namely allowing high interest rates. According to the same article, 37% of loans carry APRs between 19% and 29%.

Next, Summers is referred to in that article as being super concerned about consumers’ ability to pay back the loans. But I wonder how someone is supposed to be both desperate enough to take a 25% APR loan and also able to pay back the money. This sounds like loan sharking to me.
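For a sense of scale – using made-up loan terms, since the article doesn’t give any – here’s the standard amortized-payment arithmetic for a $10,000 loan at 25% APR over three years:

```python
# Back-of-the-envelope arithmetic with hypothetical numbers: the standard
# amortized monthly payment on a $10,000 loan at 25% APR over 3 years.

def monthly_payment(principal, apr, years):
    r = apr / 12                             # monthly interest rate
    n = years * 12                           # number of payments
    return principal * r / (1 - (1 + r) ** -n)

payment = monthly_payment(10_000, 0.25, 3)
total = payment * 36
print(round(payment, 2), round(total, 2))
```

That works out to roughly $400 a month and over $4,000 in interest on a $10,000 loan – a lot to ask of someone desperate enough to take the loan in the first place.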

Probably what bothers me most, though, is that Lending Club, in addition to using credit scores and income when it has that information, also scores people asking for loans with a proprietary model which is, as you guessed, unregulated. Specifically, if it’s anything like ZestFinance, it could use signals more correlated with being uneducated and/or poor than with the willingness or ability to pay back loans.

By the way, I’m not saying this concept is bad for everyone – there are probably winners on the side of the loanees: they might get a loan they otherwise couldn’t, or better terms, or a more bespoke contract. I’m more worried about the idea of this becoming the new normal of how money changes hands, and how that would affect people already squeezed out of the system.

I’d love your thoughts.

Categories: data science, finance, modeling

Should lawmakers use algorithms?

Here is an idea I’ve been hearing floated around the big data/tech community: embedding algorithms into law.

The argument for is pretty convincing on its face: Google has gotten its algorithms to work better and better over time by optimizing correctly and using tons of data. To some extent we can think of their business strategies and rules as a kind of “internal regulation”. So why don’t we take a page out of that book and improve our laws and specifically our regulations with constant feedback loops and big data?

No algos in law

There are some concerns I have right off the bat about this concept, putting aside the hugely self-serving dimension of it.

First of all, we would be adding opacity – of the mathematical modeling kind – to an already opaque system of law. It’s hard enough to read the legalese in a credit card contract without there also being a black box algorithm to make it impossible.

Second of all, whereas the incentives in Google are often aligned with the algorithm “working better”, whatever that means in any given case, the incentives of the people who write laws often aren’t.

So, for example, financial regulation is largely written by lobbyists. If you gave them a new tool, that of adding black box algorithms, then you could be sure they would use it to further obfuscate what is already a hopelessly complicated set of rules, and on top of it they’d be sure to measure the wrong thing and optimize to something random that would not interfere with their main goal of making big bets.

Right now lobbyists are used so heavily in part because they understand the complexity of their industries better than the lawmakers themselves. In other words, they actually add value in a certain way (besides in the monetary way). Adding black boxes would exacerbate this asymmetric-information problem, which is a terrible idea.

Third, I’m worried about the “black box” part of algorithms. There’s a strange assumption among modelers that you have to make algorithms secret or else people will game them. But as I’ve said before, if people can game your model, that just means your model sucks, and specifically that your proxies are not truly behavior-based.

So for a law against shoplifting, say, you can’t have an embedded model which uses the proxy of “looking furtive and having bulges in your clothes.” You actually need proof that someone stole something.

If you think about that example for a moment, it’s absolutely not appropriate to use poor proxies in law, nor is it appropriate to have black boxes at all – we should all know what our laws are. This is true for regulation as well, since it’s after all still law which affects how people are expected to behave.

And by the way, what counts as a black box is to some extent in the eye of the beholder. It wouldn’t be enough to have the source code available, since that’s only accessible to a very small subset of the population.

Instead, anyone who is expected to follow a law should also be able to read and understand it. That’s why the CFPB is trying to get credit card contracts written in plain English. Similarly, regulation should be written so that the employees of the regulator in question can understand it, and that means you shouldn’t have to have a Ph.D. in a quantitative field and know Python.

Algos as tools

Here’s where algorithms may help, although it is still tricky: not in the law itself but in the implementation of the law. So it makes sense that the SEC has algorithms trying to catch insider trading – in fact it’s probably the only way for them to attempt to catch the bad guys. For that matter they should have many more algorithms to catch other kinds of bad guys, for example to catch people with suspicious accounting or consistently optimistic ratings.

In this case proxies are reasonable, but on the other hand it doesn’t translate into law but rather into a ranking of workflow for the people at the regulatory agency. In other words the SEC should use algorithms to decide which cases to pursue and on what timeframe.
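A toy sketch of what that “ranking of workflow” might look like – the case records and the suspicion score here are entirely made up, not any real SEC model:

```python
# Hypothetical sketch of "algorithm as workflow ranking, not as law":
# case records get a made-up suspicion score, and the regulator works
# the queue from the top down. This decides what to investigate first,
# not who is guilty.

cases = [
    {"id": "A", "trades_before_announcements": 2,  "avg_return": 0.04},
    {"id": "B", "trades_before_announcements": 11, "avg_return": 0.31},
    {"id": "C", "trades_before_announcements": 7,  "avg_return": 0.18},
]

def suspicion_score(case):
    # toy proxy: unusually timed trades, weighted by how well they did
    return case["trades_before_announcements"] * case["avg_return"]

queue = sorted(cases, key=suspicion_score, reverse=True)
print([c["id"] for c in queue])  # the order in which to pursue cases
```

The output is an ordering, not a verdict – the humans at the agency still decide what actually happened.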

Even so, there are plenty of reasons to worry. One could view the “Stop & Frisk” strategy in New York as following an algorithm as well, namely to stop young men in high-crime areas that have “furtive motions”. This algorithm happens to single out many innocent black and Latino men.

Similarly, some of the highly touted New York City open data projects amount to figuring out that if you focus on looking for building code violations in high-crime areas, then you get a better hit rate. Again, the consequence of using the algorithm is that poor people are targeted at a higher rate for all sorts of crimes (key quote from the article: “causation is for other people”).

Think about this asymptotically: if you live in a nice neighborhood, the limited police force and inspection agencies never check you out since their algorithms have decided the probability of bad stuff happening is too low to bother. If, on the other hand, you are poor and live in a high-crime area, you get checked out daily by various inspectors, who bust you for whatever.

Said this way, it kind of makes sense that white kids smoke pot at the same rate as black kids but are almost never busted for it.

There are ways to partly combat this problem, as I’ve described before, by using randomization.
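One version of that randomization, sketched with hypothetical hit rates: with some probability epsilon, inspect a neighborhood chosen uniformly at random instead of always following the model’s predictions.

```python
import random

# Hypothetical sketch of randomized inspections (an epsilon-greedy policy):
# with probability epsilon, pick a neighborhood uniformly at random instead
# of always following the model's predicted hit rate.

def choose_neighborhood(predicted_hit_rate, epsilon=0.2, rng=random):
    neighborhoods = list(predicted_hit_rate)
    if rng.random() < epsilon:
        # exploration: every neighborhood gets checked sometimes
        return rng.choice(neighborhoods)
    # exploitation: follow the model's highest predicted hit rate
    return max(neighborhoods, key=predicted_hit_rate.get)

rates = {"uptown": 0.02, "midtown": 0.05, "downtown": 0.30}
random.seed(1)
picks = [choose_neighborhood(rates) for _ in range(1000)]
print(picks.count("downtown") / 1000)
```

The exploration step guarantees that the nice neighborhoods still get checked occasionally, which both limits the feedback loop and gives the model unbiased data about places it would otherwise never see.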


It seems to me that we can’t have algorithms directly embedded in laws, because of their highly opaque nature together with commonly misaligned incentives. They might be useful as tools for regulators, but regulators who choose to use internal algorithms need to carefully check that those algorithms don’t have unreasonable and biased consequences, which is really hard.

Categories: data science, finance, modeling

PyData talk today

Not much time because I’m giving a keynote talk at the PyData 2013 conference in Cambridge today, which is being held at the Microsoft NERD conference center.

It’s gonna be videotaped so I’ll link to that when it’s ready.

My title is “Storytelling With Data” but for whatever reason on the schedule handed out yesterday the name had been changed to “Scalable Storytelling With Data”. I’m thinking of addressing this name change in my talk – one of the points of the talk, in fact, is that with great tools, we don’t need to worry too much about the scale.

Plus, since it’s Sunday morning, I’m going to make an effort to tie my talk into an Old Testament story, which is totally bizarre since I’m not at all religious, but for some reason it feels right. Please wish me luck.

The Stop and Frisk sleight of hand

I’m finishing up an essay called “On Being a Data Skeptic” in which I catalog different standard mistakes people make with data – sometimes unintentionally, sometimes intentionally.

It occurred to me, as I wrote it, and as I read the various press conferences with departing mayor Bloomberg and Police Commissioner Raymond Kelly when they addressed the Stop and Frisk policy, that they are guilty of making one of these standard mistakes. Namely, they use a sleight of hand with respect to the evaluation metric of the policy.

Recall that an evaluation metric for a model is the way you decide whether the model works. So if you’re predicting whether someone would like a movie, you should go back and check whether your recommendations were good, and revise your model if not. It’s a crucial part of the model, and a poor choice for it can have dire consequences – you could end up optimizing to the wrong thing.
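For the movie example, the evaluation metric could be as simple as precision and recall on a held-out period – the titles below are hypothetical:

```python
# A minimal sketch of an evaluation metric: compare the model's movie
# recommendations against what the user actually liked in a held-out
# period. The titles are hypothetical.

recommended = {"Alien", "Heat", "Clue", "Big"}
actually_liked = {"Alien", "Big", "Fargo"}

hits = recommended & actually_liked
precision = len(hits) / len(recommended)   # what fraction of our picks landed
recall = len(hits) / len(actually_liked)   # what fraction of their tastes we found
print(precision, recall)
```

If precision stays low, you revise the model; the key is that somebody actually agreed, in advance, on what number to go back and check.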

[Aside: as I've complained about before, the Value Added Model for teachers doesn't have an evaluation method of record, which is a very bad sign indeed about the model. And that's a Bloomberg brainchild as well.]

So what am I talking about?

Here’s the model: stopping and frisking suspicious-looking people in high-crime areas will improve the safety and well-being of the city as a whole.

Here’s Bloomberg/Kelly’s evaluation method: the death rate by murder has gone down in New York during the policy. However, that rate is highly variable and depends just as much on whether there’s a crack epidemic going on as on anything else. Or maybe it’s improved medical care. Truth is, people don’t really know. In any case, ascribing credit for the plunging death rate to Stop and Frisk is a tenuous causal argument. Plus, even though Stop and Frisk events have decreased drastically recently, we haven’t seen the murder rate shoot up.

Here’s another possible evaluation method: trust in the police. And considering that 400,000 innocent black and Latino New Yorkers were stopped last year under this policy (here are more stats), versus fewer than 50,000 whites, and that most of them were young men, it stands to reason that the average young minority male feels less trust toward police than the average young white male. In fact, this is an amazing statistic put together by the NYCLU from 2011:

The number of stops of young black men exceeded the entire city population of young black men (168,126 as compared to 158,406).

If I’m a black guy I have an expectation of getting stopped and frisked at least once per year. How does that make me trust cops?
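That expectation is just the NYCLU numbers above divided through:

```python
# The NYCLU statistic as arithmetic: stops of young black men divided by
# the city's population of young black men.

stops = 168_126
population = 158_406
stops_per_person = stops / population
print(round(stops_per_person, 2))  # about 1.06 stops per person per year
```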

Let’s choose an evaluation method closer to what we can actually control, and let’s optimize to it.

Update: a guest columnist fills in for David Brooks, hopefully not for the last time, and gives us his take on Kelly, Obama, and racial profiling.

Categories: data science, modeling, rant

The creepy mindset of online credit scoring

Usually I like to think through abstract ideas – thought experiments, if you will – and not get too personal. I make exceptions for certain macroeconomists who are already public figures, but most of the time that’s it.

Here’s a new category of people I’ll call out by name: CEOs who defend creepy models using the phrase “People will trade their private information for economic value.”

That’s a quote from Douglas Merrill, CEO of ZestFinance, taken from this video of a recent data conference in Berkeley (hat tip Rachel Schutt). It was a panel discussion, the putative topic of which was something like “Attacking the structure of everything”, whatever that’s supposed to mean (I’m guessing it has something to do with being proud of “disrupting shit”).

Do you know the feeling you get when you’re with someone who’s smart, articulate, who probably buys organic eggs from a nice farmer’s market, but who doesn’t show an ounce of sympathy for people who aren’t successful entrepreneurs? When you’re with someone who has benefitted so entirely and so consistently from the system that they have an almost religious belief that the system is perfect and they’ve succeeded through merit alone?

It’s something in between the feeling that maybe they’re just naive because they’ve led such a blessed life, and the feeling that maybe they’re actually incapable of human empathy – I don’t know which, because it’s never been tested.

That’s the creepy feeling I get when I hear Douglas Merrill speak, but it actually started earlier, when I got the following email almost exactly one year ago via LinkedIn:

Hi Catherine,

Your profile looked interesting to me.

I’m seeking stellar, creative thinkers like you, for our team in Hollywood, CA. If you would consider relocating for the right opportunity, please read on.

You will use your math wizardry to develop radically new methods for data access, manipulation, and modeling. The outcome of your work will result in game-changing software and tools that will disrupt the credit industry and better serve millions of Americans.

You would be working alongside people like Douglas Merrill – the former CIO of Google – along with a handful of other ex-Googlers and Capital One folks. More info can be found on our LinkedIn company profile or at

At ZestFinance we’re bringing social responsibility to the consumer loan industry.

Do you have a few moments to talk about this? If you are not interested, but know someone else who might be a fit, please send them my way!

I hope to hear from you soon. Thank you for your time.


Wow, let’s “better serve millions of Americans” through manipulation of their private data, and then let’s call it being socially responsible! And let’s work with Capital One which is known to be practically a charity.


Message to ZestFinance: “getting rich with predatory lending” doesn’t mean “being socially responsible” unless you have a really weird definition of that term.

Going back to the video, I have a few more tasty quotes from Merrill:

  1. First when he’s describing how he uses personal individual information scraped from the web: “All data is credit data.”
  2. Second, when he’s comparing ZestFinance to FICO credit scoring: “Context is developed by knowing thousands of things about you. I know you as a person, not just you via five or six variables.”

I’d like to remind people that, in spite of the creepiness here, and the fact that his business plan is a death spiral of modeling, everything this guy is talking about is totally legal. And as I said in this post, I’d like to see some pushback to guys like Merrill as well as to the NSA.

Categories: data science, rant

On being a data science skeptic: due out soon

A few months ago, at the end of January, I wrote a post about Bill Gates’s naive views on the objectivity of data. One of the commenters, “CitizensArrest,” asked me to take a look at a related essay written by Susan Webber entitled “Management’s Great Addiction: It’s time we recognized that we just can’t measure everything.”

Webber’s essay is really excellent, not to mention impressively prescient considering it was published in 2006, before the credit crisis. The format of the essay is simple: it brings up and explains various dangers in the context of measurement and modeling of business data, and calls for finding a space in business for skepticism. What an idea! Imagine if that had actually happened in finance when it should have back in 2006.

Please go read her essay, it’s short.

Recently, when O’Reilly asked me to write an essay, I thought back to this short piece and decided to use it as a template for explaining why I think there’s a just-as-desperate need for skepticism in 2013 here in the big data world as there was back then in finance.

Whereas most of Webber’s essay talks about people blindly accepting numbers as true, objective, precise, and important, and the related tragic consequences, I’ve added a small wrinkle to this discussion: I’m also concerned about people who underestimate the power of data.

Most of this disregard for unintended consequences is blithe and unintentional (and some of it isn’t), but even so it can be hugely damaging, especially to the individuals being modeled: think foreclosed homes due to crappy housing-related models in the past, and think creepy models and the death spiral of modeling for the present and future.

Anyhoo, I’m actively writing it now, and it’ll be coming out soon. Stay tuned!

Categories: data science, finance, modeling

How to be wrong

My friend Josh Vekhter sent me this blog post written by someone who calls herself celandine13 and tutors students with learning disabilities.

In the post, she reframes the concept of mistake or “being bad at something” as often stemming from some fundamental misunderstanding or poor procedure:

Once you move it to “you’re performing badly because you have the wrong fingerings,” or “you’re performing badly because you don’t understand what a limit is,” it’s no longer a vague personal failing but a causal necessity.  Anyone who never understood limits will flunk calculus.  It’s not you, it’s the bug.

This also applies to “lazy.”  Lazy just means “you’re not meeting your obligations and I don’t know why.”  If it turns out that you’ve been missing appointments because you don’t keep a calendar, then you’re not intrinsically “lazy,” you were just executing the wrong procedure.  And suddenly you stop wanting to call the person “lazy” when it makes more sense to say they need organizational tools.

And she wants us to stop with the labeling and get on with the understanding of why the mistake was made and addressing that, like she does when she tutors students. She even singles out certain approaches she considers to be flawed from the start:

This is part of why I think tools like Knewton, while they can be more effective than typical classroom instruction, aren’t the whole story.  The data they gather (at least so far) is statistical: how many questions did you get right, in which subjects, with what learning curve over time?  That’s important.  It allows them to do things that classroom teachers can’t always do, like estimate when it’s optimal to review old material to minimize forgetting.  But it’s still designed on the error model. It’s not approaching the most important job of teachers, which is to figure out why you’re getting things wrong — what conceptual misunderstanding, or what bad study habit, is behind your problems.  (Sometimes that can be a very hard and interesting problem.  For example: one teacher over many years figured out that the grammar of Black English was causing her students to make conceptual errors in math.)

On the one hand I like the reframing: it’s always good to see knee-jerk reactions become more contemplative, and it’s always good to see people trying to help rather than trying to blame. In fact, one of my tenets of real life is that mistakes will be made, and it’s not the mistake that we should be anxious about but how we act to fix the mistake that exposes who we are as people.

I would, however, like to take issue with her anti-example in the case of Knewton, which is an online adaptive learning company. Full disclosure: I interviewed with Knewton before I took my current job, and I like the guys who work there. But, I’d add, I like them partly because of the healthy degree of skepticism they take with them to their jobs.

What the blogwriter celandine13 is pointing out, correctly, is that understanding causality is pretty awesome when you can do it. If you can figure out why someone is having trouble learning something, and if you can address that underlying issue, then fixing the consequences of that issue gets a ton easier. Agreed, but I have three points to make:

  1. First, a non-causal data mining engine such as Knewton will also stumble upon a way to fix the underlying problem by dint of having a ton of data and noting that, say, people who failed a calculus test did much better after having limits explained to them in a certain way. This is much like how Google’s spellcheck engine works: by keeping track of previous spelling errors, not by mind-reading how people think about spelling.
  2. Second, it’s not always easy to find the underlying cause of bad testing performance, even if you’re looking for it directly. I’m not saying it’s fruitless – tutors I know are incredibly good at that – but there’s room for both “causality detectives” and tons of smart data mining in this field.
  3. Third, it’s definitely not always easy to address the underlying cause of bad test performance. If you find out that the grammar of Black English affects students’ math test scores, what do you do about it?
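The spellcheck analogy in point 1 can be sketched in a few lines – the correction log below is made up, and real engines are vastly more sophisticated, but the principle of counting observed fixes rather than modeling minds is the same:

```python
from collections import Counter

# A sketch of the spellcheck analogy: no theory of why people misspell,
# just counts of corrections observed in a (hypothetical) log.

correction_log = [
    ("teh", "the"), ("teh", "the"), ("teh", "ten"),
    ("recieve", "receive"), ("recieve", "receive"),
]

counts = {}
for typo, fix in correction_log:
    counts.setdefault(typo, Counter())[fix] += 1

def suggest(typo):
    """Return the most frequently observed fix, or the input unchanged."""
    if typo not in counts:
        return typo
    return counts[typo].most_common(1)[0][0]

print(suggest("teh"), suggest("recieve"), suggest("hello"))
```

No causal understanding anywhere, yet with enough logged data the suggestions get good – which is exactly the point about non-causal engines stumbling onto fixes.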

Having said all that, I’d like to once more agree with the underlying message that a mistake is first and foremost a signal rather than a reflection of someone’s internal thought processes. The more we think of mistakes as learning opportunities, the faster we learn.

Who stays off the data radar?

Last night’s Data Skeptics Meetup talk by Suresh Naidu was great, as I suspected it would be. I’m not going to be able to cover everything he talked about (a discussion is forming here as well) but I’ll touch on a few things related to my chosen topic for the day, namely who stays off the data radar.

In his talk Suresh discussed the history of governments tracking people with data, which more or less until recently was the history of the census. The issue of trust or lack thereof that people have in being classified and tracked has been central since the get-go, and with it the understanding by the data collectors that people respond differently to data collection when they anticipate it being used against them.

Among other examples he mentioned the efforts of the U.S. Census Bureau to stay independent (specifically, away from any kind of tax decisions) in order to be trusted, but then turning around during wartime and using census tracts to put Japanese-Americans into internment camps.

It made me wonder, who distrusts data collection so much that they manage to stay off the data radar?

Suresh gave quite a few examples of people who did this out of fear of persecution or what have you, and because, at least in the example of the Domesday Book, once land ownership was written down it was somehow “more official and objective” than anything else, which of course resulted in some people getting screwed out of their land.

It’s not just a historical problem, of course: it’s still true that certain populations, especially illegal immigrant populations, are afraid of how the census will be used and go undercounted. Who can say when the census might start being used to deport illegal immigrants?

As a kind of anti-example, he mentioned that the congressional reapportionment that should have followed the 1920 census was essentially canceled because the South knew that so many ex-slaves were moving north that its representation in government was growing weak. I say anti-example because in this case it wasn’t out of distrust, to avoid detection; it was a savvy political move, to remain looking large.

What about the modern version of government tracking? In this case, of course, it’s not just census data, but anything else the NSA happens to collect about us. I’m no expert (tell me if you know data on this) but I will hazard a guess on who avoids being tracked:

  1. Old people who don’t have computers and never have,
  2. Members of hacking group Anonymous who know how it works and how to bypass the system, and
  3. People who have worked or are now working at the NSA.

Of course there are a few other rare people that just happen to care enough about privacy to educate themselves on how to avoid being tracked. But it’s hard to do, obviously.

Let me soften the requirements a bit – instead of staying off the radar completely, who makes it really hard to find them?

If you’re talking about individuals, I’d start with this answer: politicians. In my work with Peter Darche and Lee Drutman from the Sunlight Foundation (blog post coming soon!) trying to follow money in politics, it’s amazed me time and time again how difficult it’s been to put together the political events for a given politician – events that are individually publicly recorded but seemingly intentionally siloed, so that it’s extremely difficult to assemble a narrative. Thanks to Peter’s recent efforts, and the Sunlight Foundation’s long-term efforts, we’re getting to the point where we can do this, but it’s been a data munging problem from hell.

If you’re generalizing to entities and corporations, then the “making data collection hard” award should probably go to the corporations with hundreds of subsidiaries all over the world which now don’t even need to be reported on tax forms.

Funny how the very people who know the most about how data can be used are paranoid about being tracked.

Categories: data science

Tonight: first Data Skeptics Meetup, Suresh Naidu

I’m psyched to see Suresh Naidu tonight in the first Data Skeptics Meetup. He’s talking about Political Uses and Abuses of Data and his abstract is this:

While a lot has been made of the use of technology for election campaigns, little discussion has focused on other political uses of data. From targeting dissidents and tax-evaders to organizing protests, the same datasets and analytics that let data scientists do prediction of consumer and voter behavior can also be used to forecast political opponents, mobilize likely leaders, solve collective problems and generally push people around. In this discussion, Suresh will put this in a 1000 year government data-collection perspective, and talk about how data science might be getting used in authoritarian countries, both by regimes and their opponents.

Given the recent articles highlighting this kind of stuff, I’m sure the topic will provoke a lively discussion – my favorite kind!

Unfortunately the Meetup is full but I’d love you guys to give suggestions for more speakers and/or more topics.

The politics of data mining

At first glance, data miners inside governments, start-ups, corporations, and political campaigns are all doing basically the same thing. They’ll all need great engineering infrastructure, good clean data, a working knowledge of statistical techniques and enough domain knowledge to get things done.

We’ve seen recent articles that are evidence for this statement: Facebook data people move to the NSA or other government agencies easily, and Obama’s political campaign data miners have launched a new data mining start-up. I am a data miner myself, and I could honestly work at any of those places – my skills would translate, if not my personality.

I do think there are differences, though, and here I’m not talking about ethics or trust issues, I’m talking about pure politics[1].

Namely, the world of data mining is divided into two broad categories: people who want to cause things to happen and people who want to prevent things from happening.

I know that sounds incredibly vague, so let me give some examples.

In start-ups, irrespective of what you’re actually doing (what you’re actually doing is probably incredibly banal, like getting people to click on ads), you feel like you’re the first person ever to do it, at least on this scale, or at least with this dataset, and that makes it technically challenging and exciting.

Or, even if you’re not the first, at least what you’re creating or building is state-of-the-art and is going to be used to “disrupt” or destroy lagging competition. You feel like a motherfucker, and it feels great[2]!

The same thing can be said for Obama’s political data miners: if you read this article, you’ll know they felt like they’d invented a new field of data mining, and a cult along with it, and it felt great! And although it’s probably not true that they did something all that impressive technically, in any case they did a great job of applying known techniques to a different data set, and they got lots of people to allow access to their private information based on their trust of Obama, and they mined the fuck out of it to persuade people to go out and vote and to go out and vote for Obama.

Now let’s talk about corporations. I’ve worked in enough companies to know that “covering your ass” is a real thing, and can overwhelm a given company’s other goals. And the larger the company, the more the fear sets in and the more time is spent covering one’s ass and less time is spent inventing and staying state-of-the-art. If you’ve ever worked in a place where it takes months just to integrate two different versions of Salesforce you know what I mean.

Those corporate people have data miners too, and in the best case they are somewhat protected from the conservative, risk averse, cover-your-ass atmosphere, but mostly they’re not. So if you work for a pharmaceutical company, you might spend your time figuring out how to draw up the numbers to make them look good for the CEO so he doesn’t get axed.

In other words, you spend your time preventing something from happening rather than causing something to happen.

Finally, let’s talk about government data miners. If there’s one thing I learned when I went to the State Department Tech@State “Moneyball Diplomacy” conference a few weeks back, it’s that they are the most conservative of all. They spend their time worrying about a terrorist attack and how to prevent it. It’s all about preventing bad things from happening, and that makes for an atmosphere where causing good things to happen takes a back seat.

I’m not saying anything really new here; I think this stuff is pretty uncontroversial. Maybe people would quibble over when a start-up becomes a corporation (my answer: mostly they never do, but certainly by the time of an IPO they’ve already done it). Also, of course, there are ass-coverers in start-ups and there are risk-takers in corporations and maybe even in government, but they don’t dominate.

If you think through things in this light, it makes sense that Obama’s data miners didn’t want to stay in government and decided to go work on advertising stuff. And although they might have enough clout and buzz to get hired by a big corporation, I think they’ll find it pretty frustrating to be dealing with the cover-my-ass types that will hire them. It also makes sense that Facebook, which spends its time making sure no other social network grows enough to compete with it, works so well with the NSA.

1. If you want to talk ethics, though, join me on Monday at Suresh Naidu’s Data Skeptics Meetup where he’ll be talking about Political Uses and Abuses of Data.

2. This is probably why start-up guys are so arrogant.

Guest post, The Vortex: A Cookie Swapping Game for Anti-Surveillance

This is a guest post by Rachel Law, a conceptual artist, designer and programmer living in Brooklyn, New York. She recently graduated from the Parsons MFA in Design & Technology. Her practice is centered around social myths and how technology facilitates the creation of new communities. Currently she is writing a book with McKenzie Wark called W.A.N.T., about new ways of analyzing networks and debunking ‘mapping’.

Let’s start with a timely question. How would you like to be able to change how you are identified by online networks? We’ll talk more about how you’re currently identified below, but for now just imagine having control over that process for once – how would that feel? Vortex is something I’ve invented that will try to make that happen.

Namely, Vortex is a data management game that allows players to swap cookies, change IPs and disguise their locations. Through play, individuals experience how their browser changes in real time when different cookies are equipped. Vortex is a proof of concept that illustrates how network collisions in gameplay expose contours of a network determined by consumer behavior.

What happens when users are allowed to swap cookies?

These cookies, placed by marketers to track behavioral patterns, are stored on our personal devices from mobile phones to laptops to tablets, as a symbolic and data-driven signifier of who we are. In other words, to the eyes of the database, the cookies are us. They are our identities, controlling the way we use, browse and experience the web.  Depending on cookie type, they might follow us across multiple websites, save entire histories about how we navigate and look at things and pass this information to companies while still living inside our devices.

If we have the ability to swap cookies, the debate on privacy shifts from relying on corporations to follow regulations to empowering users by giving them the opportunity to manage how they want to be perceived by the network.

What are cookies?

The corporate technological ability to track customers and piece together entire personal histories is a recent development. While there are several ways of doing so, the most common and prevalent method is the HTTP cookie. Invented in 1994 by a computer programmer, Lou Montulli, HTTP cookies were originally created for the shopping cart system, as a way for the computer to store the current state of the session – i.e. how many items existed in the cart – without overloading the company’s server. These session histories were saved inside each user’s computer or individual device, where companies accessed and updated consumer history constantly as a form of ‘internet history’. Information such as where you clicked, how you clicked, what you clicked first, and your general purchasing history and preferences was saved in your browsing history and accessed by companies through cookies.
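Montulli’s session-state idea is easy to see in miniature with Python’s standard library. This is just an illustrative sketch – the cookie name, value and lifetime are made up, not taken from any particular shopping site:

```python
# Sketch of the cookie mechanism described above: the shopping-cart
# state travels in HTTP headers and lives on the user's machine,
# so the company's server doesn't have to remember the session.
from http.cookies import SimpleCookie

# Server side: put the cart state into a Set-Cookie response header.
cookie = SimpleCookie()
cookie["cart_items"] = "3"                       # state stored with the user
cookie["cart_items"]["path"] = "/"
cookie["cart_items"]["max-age"] = 7 * 24 * 3600  # persists for a week
print(cookie.output())  # the Set-Cookie header the browser will store

# On a later request the browser echoes the Cookie header back, and
# the server reconstructs the cart without having kept any state.
incoming = SimpleCookie()
incoming.load("cart_items=3")
print(incoming["cart_items"].value)  # 3
```

Third-party tracking builds on this same primitive; the difference is only that the object setting the cookie is served from a domain other than the page you think you’re visiting.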

Cookies were originally deployed without the general public’s knowledge, until the Financial Times published an article on February 12th, 1996 about how they were made and utilized on websites without user knowledge. This revelation led to a public outcry over privacy issues, especially since data was being gathered without the knowledge or consent of users. In addition, corporations had access to information stored on personal computers, since the cookie sessions were stored on your computer and not their servers.

At the center of the debate was the issue of third-party cookies, also known as “persistent” or “tracking” cookies. When you are browsing a webpage, there may be components on the page that are hosted on a different server or domain. These external objects can then set cookies on your device when the page loads or when you click an image, link or article. They are used by advertising and media-mining corporations to track users across multiple sites, to garner more knowledge about a user’s browsing patterns and create more specific and targeted advertising.

The Wall Street Journal ran an article on how Mac users were being unfairly targeted by the travel site Orbitz with advertisements that were 13% more expensive than those shown to PC users. The New York Times followed it up with a similar article in November 2012 about how such data is collected and re-sold to advertisers. These advertisers analyze users’ buying habits to create micro-categories, where personal experiences are tailored to maximize potential profits.

What does that mean for us?

The current state of today’s internet is no longer the same as the carefree 90s of ‘internet democracy’ and utopian ‘cyberspace’. Media-mining exploits invasive technologies such as IP tracking, geo-locating and cookies to create specific advertisements targeted to individuals. Browsing is now determined by your consumer profile – what you see, hear and the feeds you receive are tailored from your friends’ lists, emails, online purchases etc. The ‘Internet’ does not exist. Instead, it is many overlapping filter bubbles which selectively curate us into data objects to be consumed and purchased by advertisers.

This information, though anonymous, is built up over time and used to track and trace an individual’s history – sometimes spanning an entire lifetime. Who you are and what your real name is are irrelevant at the overall scale of collected data, which depersonalizes and dehumanizes you into nothing but a list of numbers on a spreadsheet.

The superstore Target provides a useful case study for data profiling in its use of statisticians on its marketing teams. In 2002, Target realized that when a couple is expecting a child, the way they shop and purchase products changes. But it needed a tool to be able to see and take advantage of the pattern. So it asked mathematicians to come up with algorithms to identify behavioral patterns that would indicate a newly expectant mother and push direct marketing materials her way. In a public relations fiasco, Target sent maternity and infant care advertisements to a household, inadvertently revealing that a teenage daughter was pregnant before she had told her parents.

This build-up of information creates a ‘database of ruin’: marketers and advertisers end up knowing more about your life and predictive patterns than any other single entity. Databases can predict whether you’re expecting, or when you’ve moved, or what stage of life or income level you’re at… information that you have no control over – where it goes, who is reading it or how it is being used. More importantly, these databases have collected enough information to know secrets such as a family history of illness, criminal or drug records, or other private information that could cause harm to the individual if released – without ever needing to know his or her name.

This leaves us with two terrifying possibilities:

  1. Corporate databases hold information about you, your family and friends that you have zero control over, including sensitive information such as health and criminal/drug records, and that information is bought and re-sold to other companies for profit maximization.
  2. New forms of discrimination arise where your buying/consumer habits determine which level of internet you can access, or what kind of internet you can experience. This discrimination is so insidious because it happens at the user-account level, which you cannot see unless you have access to other people’s accounts.

What can Vortex do, and where can I download a copy?

Vortex lives in the browser, so it can manage both pseudo-identities (invented ones) as well as ‘real’ identities shared with you by other users. These identity profiles are created by mining websites for cookies, swapping them with friends, and arranging and re-arranging them to create new experiences. By swapping identities, you are essentially ‘disguised’ as someone else – the network or website will not be able to recognize you. The idea is that being completely anonymous is difficult, but being someone else and hiding behind misinformation is easy.

This does not mean a death knell for online shopping or e-commerce industries. For instance, if a user decides to go shoe-shopping for summer, he or she could equip the browser with the cookies most associated with shopping, shoes and summer. Targeted advertising becomes a targeted choice for both advertisers and users. Advertisers will not have to worry about misinterpreting or mis-targeting inappropriate advertisements (i.e. showing tampon advertisements to a boyfriend who happened to borrow his girlfriend’s laptop), and at the same time users can choose what kind of advertisements they want to see. (Summer is coming, so maybe it’s time to load up all those cookies linked to shoes and summer and beaches and see what websites have to offer; or disable cookies completely if you hate summer apparel.)
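Since Vortex’s code isn’t public yet, here is only a hypothetical sketch of the swapping mechanic: treat an identity as a bag of cookies that can be exported, shared and equipped. All names and the file format below are invented for illustration.

```python
# Hypothetical sketch of cookie-profile swapping (not Vortex's actual
# code): treat an identity as a mapping of domains to cookie values,
# export it as a file, and "equip" a profile someone shared with you.
import json

def export_profile(cookies, path):
    """Save a cookie profile (domain -> cookie dict) as a shareable file."""
    with open(path, "w") as f:
        json.dump(cookies, f, indent=2)

def import_profile(path):
    """Load an exported profile, ready to be equipped in the browser."""
    with open(path) as f:
        return json.load(f)

# A friend shares a summer-shoe-shopping identity; equipping it changes
# what the ad networks think they know about you.
friend_profile = {"ads.example.com": {"interests": "shoes,summer,beaches"}}
export_profile(friend_profile, "summer_shopper.json")

equipped = import_profile("summer_shopper.json")
print(equipped["ads.example.com"]["interests"])  # shoes,summer,beaches
```

The real thing would also need to write the equipped cookies into the browser’s own cookie store, which is the hard, browser-specific part this sketch leaves out.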

Currently the game is a working prototype/demo. The code is licensed under Creative Commons and will be available on GitHub by the end of summer. I am trying to get funding to make it free, safe and easy to use, but right now I’m broke from grad school, and a proper back-end still needs to be built so that accounts can be created safely and cannot be intercepted. If you have any questions on technical specs, or an interest in collaborating to make it happen – I’m particularly looking for people versed in python/mongodb – please email me:

Where’s the outrage over private snooping?

There’s been a tremendous amount of hubbub recently surrounding the data collection and data mining that the NSA has been discovered to be doing.

For me what’s weird is that so many people are up in arms about what our government knows about us but not, seemingly, about what private companies know about us.

I’m not suggesting that we should be sanguine about the NSA program – it’s outrageous, and it’s outrageous that we didn’t know about it. I’m glad it’s come out into the open and I’m glad it’s spawned an immediate and public debate about the citizen’s rights to privacy. I just wish that debate extended to privacy in general, and not just the right to be anonymous with respect to the government.

What gets to me are the countless articles that make a big deal of Facebook or Google sharing private information directly with the government, while never mentioning that Acxiom buys and sells from Facebook on a daily basis much more specific and potentially damning information about people (most people in this country) than the metadata that the government purports to have.

Of course, we really don’t have any idea what the government has or doesn’t have. Let’s assume they are also an Acxiom customer, for that matter, which stands to reason.

It begs the question, at least to me, of why we distrust the government with our private data but we trust private companies with our private data. I have a few theories, tell me if you agree.

Theory 1: people think about worst case scenarios, not probabilities

When the government is spying on you, worst case you get thrown into jail or Guantanamo Bay for no good reason, left to rot. That’s horrific but not, for the average person, very likely (although, of course, a world where that does become likely is exactly what we want to prevent by having some concept of privacy).

When private companies are spying on you, they don’t have the power to put you in jail. They do increasingly have the power, however, to deny you a job, a student loan, a mortgage, and life insurance. And, depending on who you are, those things are actually pretty likely.

Theory 2: people think private companies are only after our money

Private companies who hold our private data are only profit-seeking, so the worst thing they can do is try to get us to buy something, right? I don’t think so, as I pointed out above. But maybe people think so in general, and that’s why we’re not outraged about how our personal data and profiles are used all the time on the web.

Theory 3: people are more afraid of our rights being taken away than good things not happening to them

As my friend Suresh pointed out to me when I discussed this with him, people hold on to what they have (constitutional rights) and they fear those things being taken away (by the government). They spend less time worrying about what they don’t have (a house) and how they might be prevented from getting it (by having a bad e-score).

So even though private snooping can (and increasingly does) close all sorts of options for peoples’ lives, if they don’t think about them, they don’t notice. It’s hard to know why you get denied a job, especially if you’ve been getting worse and worse credit card terms and conditions over the years. In general it’s hard to notice when things don’t happen.

Theory 4: people think the government protects them from bad things, but who’s going to protect them from the government?

This I totally get, but the fact is the U.S. government isn’t protecting us from data collectors, and has even recently gotten together with Facebook and Google to prevent the European Union from enacting pretty good privacy laws. Let’s not hold our breath for them to understand what’s at stake here.

(Updated) Theory 5: people think they can opt out of private snooping but can’t opt out of being a citizen

Two things. First, can you really opt out? You can clear your cookies and not be on gmail and not go on Facebook and Acxiom will still track you. Believe it.

Second, I’m actually not worried about you (you reader of mathbabe), or myself for that matter. I’m not getting denied a mortgage any time soon. It’s the people who don’t know to protect themselves, who don’t know to opt out, that I worry about – they’re the ones who will get down-scored and funneled into bad options.

Theory 6: people just haven’t thought about it enough to get pissed

This is the one I’m hoping for.

I’d love to see this conversation expand to include privacy in general. What’s so bad about asking for data about ourselves to be automatically forgotten, say by Verizon, if we’ve paid our bills and 6 months have gone by? What’s so bad about asking for any personal information about us to have a similar time limit? I for one do not wish mistakes my children make when they’re impetuous teenagers to haunt them when they’re trying to start a family.

Categories: data science, rant

Moneyball Diplomacy

I’m on a train again to D.C. to attend a conference on how to use big data to enhance U.S. diplomacy and development.

I’ll be on a panel in the afternoon called Diving Into Data, which has the following blurb attached to it:

Facebook processes over 500 terabytes of data each day. More than a half billion tweets are sent daily. And so the volume of data grows. Much of this data is superfluous and is of little value to foreign policy and development experts. But a portion does contain significant information and the challenge is how to find and make use of that data. What will a rigorous economic analysis of this data reveal and how could the findings be effectively applied? Looking beyond real-time awareness and some of the other well-known uses of big data, this panel will explore how a more thorough in-depth analysis of big data could prove useful in providing insights and trends that could be applied in the formulation and implementation of foreign policy.

Also on the schedule today, two keynote speakers: Nassim Taleb, author of a few books I haven’t read but everyone else has, and Kenneth Neil Cukier, author of a “big data” article I really didn’t like which was published in Foreign Affairs and which I blogged about here under the title of “The rise of big brother, big data”.

The full schedule of the day is here.

Speaking of big brother, this conference will be particularly interesting to me considering the remarkable amount of news we’ve been learning about this week centered on the U.S. as a surveillance state. Actually nothing I’ve read has surprised me, considering what I learned when I read this opinion piece on the subject, and when I watched this video with a former NSA mathematician turned whistleblower, which I blogged about here back in August 2012.

Categories: data science, modeling

Book out for early review

I’m happy to say that the book I’m writing with Rachel Schutt called Doing Data Science is officially out for early review. That means a few chapters which we’ve deemed “ready” have been sent to some prominent people in the field to see what they think. Thanks, prominent and busy people!

It also means that things are (knock on wood) wrapping up on the editing side. I’m cautiously optimistic that this book will be a valuable resource for people interested in what data scientists do, especially people interested in switching fields. The range of topics is broad, which I guess means that the most obvious complaint about the book will be that we didn’t cover things deeply enough, and perhaps that the level of pre-requisite assumptions is uneven. It’s hard to avoid.

Thanks to my awesome editor Courtney Nash over at O’Reilly for all her help!

And by the way, we have an armadillo on our cover, which is just plain cool:


How much would you pay to be my friend?

I am on my way to D.C. for a health analytics conference, where I hope to learn the state of the art for health data and modeling. So stay tuned for updates on that.

In the meantime, ponder this concept (hat tip Matt Stoller, who describes it as ‘neoliberal prostitution’). It’s a dating website called “What’s Your Price?” where suitors bid for dates.


What’s creepier, the sex-for-pay aspect of this, or the it’s-possibly-not-about-sex-it’s-about-dating aspect? I’m gonna go with the latter, personally, since it’s a new idea for me. What else can I monetize that I’ve been giving away too long for free?

Hey, kid, you want a bedtime story? It’s gonna cost you.

Categories: data science, modeling, news

Let’s enjoy the backlash against hackathons

As much as I have loved my DataKind hackathons, where I get to meet a bunch of friendly nerds who spend their weekend trying to solve problems using technology, I also have my reservations about the whole weekend hackathon culture, especially when:

  1. It’s a competition, so really you’re not solving problems as much as boasting, and/or
  2. you’re trying to solve a problem that nobody really cares about but which might make someone money, so you’re essentially working for free for a future VC asshole, and/or
  3. you kind of solve a problem that matters, but only for people like you (example below).

As Jake Porway mentions in this fine piece, having data and good intentions do not mean you can get serious results over a weekend. From his essay:

Without subject matter experts available to articulate problems in advance, you get results like those from the Reinvent Green Hackathon. Reinvent Green was a city initiative in NYC aimed at having technologists improve sustainability in New York. Winners of this hackathon included an app to help cyclists “bikepool” together and a farmer’s market inventory app. These apps are great on their own, but they don’t solve the city’s sustainability problems. They solve the participants’ problems because as a young affluent hacker, my problem isn’t improving the city’s recycling programs, it’s finding kale on Saturdays.

Don’t get me wrong, I’ve made some good friends and created some great collaborations via hackathons (and especially via Jake). But it only gets good when there’s major planning beforehand, a real goal, and serious follow-up. Actually a weekend hackathon is, at best, a platform from which to launch something more serious and sustained.

People who don’t get that are there for something other than that. What is it? Maybe this parody hackathon announcement can tell us.

It’s called National Day of Hacking Your Own Assumptions and Entitlement, and it has a bunch of hilarious and spot-on satirical commentary, including this definition of a hackathon:

Basically, a bunch of pallid millennials cram in a room and do computer junk. Harmless, but very exciting to the people who make money off the results.

This question from a putative participant of an “entrepreneur”-style hackathon:

“Why do we insist on applying a moral or altruistic gloss to our moneymaking ventures?”

And the internal thought process of a participant in a White House-sponsored hackathon:

I realized, especially in the wake of the White House murdering Aaron Swartz, persecuting/torturing Bradley Manning and threatening Jeremy Hammond with decades behind bars for pursuit of open information and government/corporate accountability that really, no-one who calls her or himself a “hacker” has any business partnering with an entity as authoritarian, secretive and tyrannical as the White House– unless of course you’re just a piece-of-shit money-grubbing disingenuous bootlicker who uses the mantle of “hackerdom” to add a thrilling and unjustified outlaw sheen to your dull life of careerist keyboard-poking for the status quo.

Technocrats and big data

Today I’m finally getting around to reporting on the congressional subcommittee I went to a few weeks ago on big data and analytics. Needless to say it wasn’t what I’d hoped.

My observations are somewhat disjointed, since there was no coherent discussion, so I guess I’ll just make a list:

  1. The Congressmen and women seem to know nothing more about the “Big Data Revolution” than what they’d read in the now-famous McKinsey report which talks about how we’ll need 180,000 data scientists in the next decade and how much money we’ll save and how competitive it will make our country.
  2. In other words, with one small exception I’ll discuss below, the Congresspeople were impressed, even awed, at the intelligence and power of the panelists. They were basically asking for advice on how to let big data happen on a bigger and better scale. Regulation never came up, it was all about, “how do we nurture this movement that is vital to our country’s health and future?”
  3. There were three useless panelists, all completely high on big data and making their money being like that. First there was a schmuck from the NSF who just said absolutely nothing, had been to a million panels before, and was simply angling to be invited to yet more.
  4. Next there was a guy who had started training data-ready graduates in some masters degree program. All he ever talked about is how programs like his should be funded, especially his, and how he was talking directly with employers in his area to figure out what to train his students to know.
  5. It was especially interesting to see how this second guy reacted when the single somewhat thoughtful and informed Congressman, whose name I didn’t catch because he came in and left quickly and his name tag was minuscule, asked him about whether or not he taught his students to be skeptical. The guy was like, I teach my students to be ready to deal with big data just like their employers want. The congressman was like, no, that’s not what I asked; I asked whether they can be skeptical of perceived signals versus noise, whether they can avoid making huge costly mistakes with big data. The guy was like, I teach my students to deal with big data.
  6. Finally there was the head of IBM Research who kept coming up with juicy and misleading pro-data tidbits which made him sound like some kind of saint for doing his job. For example, he brought up the “premature infants are being saved” example I talked about in this post.
  7. The IBM guy was also the only person who ever mentioned privacy issues at all, and he summarized his, and presumably everyone else’s position on this subject, by saying “people are happy to give away their private information for the services they get in return.” Thanks, IBM guy!
  8. One more priceless moment was when one of the Congressmen asked the panel if industry has enough interaction with policy makers. The head of IBM Research said, “Why yes, we do!” Thanks, IBM guy!

I was reminded of this weird vibe and power dynamic, where an unchallenged mysterious power of big data rules over reason, when I read this New York Times column entitled Some Cracks in the Cult of Technocrats (hat tip Suresh Naidu). Here’s the leading paragraph:

We are living in the age of the technocrats. In business, Big Data, and the Big Brains who can parse it, rule. In government, the technocrats are on top, too. From Washington to Frankfurt to Rome, technocrats have stepped in where politicians feared to tread, rescuing economies, or at least propping them up, in the process.

The column was written by Chrystia Freeland and it discusses a recent paper entitled Economics versus Politics: Pitfalls of Policy Advice by Daron Acemoglu from M.I.T. and James Robinson from Harvard. A description of the paper from Freeland’s column:

Their critique is not the standard technocrat’s lament that wise policy is, alas, politically impossible to implement. Instead, their concern is that policy which is eminently sensible in theory can fail in practice because of its unintended political consequences.

In particular, they believe we need to be cautious about “good” economic policies that have the side effect of either reinforcing already dominant groups or weakening already frail ones.

“You should apply double caution when it comes to policies which will strengthen already powerful groups,” Dr. Acemoglu told me. “The central starting point is a certain suspicion of elites. You really cannot trust the elites when they are totally in charge of policy.”

Three examples they discuss in the paper: trade unions, financial deregulation in the U.S., privatization in Russia. Examples where something economists suggested would make the system better also acted to reinforce power of already powerful people.

If there’s one thing I might infer from my trip to Washington, it’s that the technocrats in charge nowadays, whose advice is being followed, may have subtly shifted away from deregulation economists and towards big data folks. Not that I’m holding my breath for Bob Rubin to be losing his grip any time soon.

Categories: data science, finance, news

Fight back against surveillance using TrackMeNot, TrackMeNot mobile?

After two days of travelling to the west coast and back, I’m glad to be back to my blog (and, of course, my coffee machine, which is the real source of my ability to blog every morning without distraction: it makes coffee at the push of a button, and that coffee has a delicious amount of caffeine).

Yesterday at the hotel I grabbed a free print edition of the Wall Street Journal to read on the plane, and I was super interested in this article called Phone Firm Sells Data on Customers. They talk about how phone companies (Verizon, specifically) are selling location data and browsing data about customers, how some people might be creeped out by this, and then they say:

The new offerings are also evidence of a shift in the relationship between carriers and their subscribers. Instead of merely offering customers a trusted conduit for communication, carriers are coming to see subscribers as sources of data that can be mined for profit, a practice more common among providers of free online services like Google Inc. and Facebook Inc.

Here’s the thing. It’s one thing to make a deal with the devil when I use Facebook: you give me something free, in return I let you glean information about me. But in terms of Verizon, I pay them like $200 per month for my family’s phone usage. That’s not free! Fuck you guys for turning around and selling my data!

And how are marketers going to use such location data? They will know how desperate you are for their goods and charge you accordingly. Like this for example, but on a much wider scale.

There are two things I can do to object to this practice. First, I can write this post and others, railing against such needless privacy-invasion practices. Second, I can go to Verizon, my phone company, and get myself off the list. The instructions for doing so seem to be here, but I haven’t actually followed them yet.

Here’s what I wish a third option were: a mobile version of TrackMeNot, which I learned about last week from Annelies Kamran.

TrackMeNot, created by Daniel C. Howe and Helen Nissenbaum at what looks like the CS department of NYU, confuses the data gatherers by giving them an overload of bullshit information.

Specifically, it’s a Firefox add-on which sends your browser to all sorts of websites while you’re not actually using it. The data gatherers get endlessly confused about what kind of person you actually are, thereby fucking up the whole personal-data information industry.
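TrackMeNot itself is a browser add-on written in JavaScript; purely to illustrate the obfuscation-by-noise strategy, here is a toy sketch in Python (the search URL and seed terms are invented, not TrackMeNot’s actual behavior):

```python
# Toy sketch of obfuscation by noise: periodically issue decoy queries
# drawn from a seed list, so your real interests drown in plausible
# junk. The URL and seed terms below are illustrative only.
import random

SEED_TERMS = ["gardening", "mortgage rates", "jazz history", "carburetors",
              "knitting", "tax law", "surfing", "ancient rome"]

def decoy_query(rng=random):
    """Build one plausible-looking junk search query."""
    terms = rng.sample(SEED_TERMS, k=2)  # two distinct random topics
    return "https://search.example.com/?q=" + "+".join(terms)

# The real add-on fires requests on a timer while the browser is idle;
# here we just show a burst of the noise a profiler would see.
for _ in range(3):
    print(decoy_query())
```

The point of the design is that a profiler can’t cheaply distinguish the decoys from your real queries, so the value of the whole profile degrades.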

I have had this idea in the past, and I’m super happy it already exists. Now can someone do it for mobile please? Or even better, tell me it already exists?

Categories: data science, modeling, news
