Archive for the ‘data science’ Category

A/B testing in politics

As research for my book I’m studying the way people use big data techniques, mostly from the marketing world, in politics. So naturally I was intrigued by Kyle Rush’s blogpost about A/B testing on the Obama campaign. Kyle was the Deputy Director of Frontend Web Development at Obama for America.

In case you don’t know the lingo, A/B testing is a test done by marketers to decide which of two ad designs is more effective – the ad with the dark blue background or the ad with the dark red background, for example. But in this case it was more like, the ad with Obama’s family or the ad with Obama’s family and the American flag in the background.

The idea is, as a marketer, you offer your target audience both ads – actually, any individual in the target audience either sees ad A or ad B, randomly – and then, after enough people have seen the ads, you see which population responds more, and you go with that version. Then you move on to the next test, where you keep the characteristic that just won and you test some other aspect of the ad, like the font.

As a mathematical testing framework, A/B testing is interesting and has structural complications – how do you know you’re getting a global maximum instead of a local maximum? In other words, if you’d first tested the font, and then the background color, would you have ended up with a “better ad”? What if there are 50 things you’d like to test, how do you decide which order to test them in?

But that’s not what interests me about Kyle’s Obama A/B testing blogpost. Rather, I’m fascinated by the definition of success that was chosen.

After all, an A/B test is all about which ad “works better,” so there has to be some way to measure success, and it has to be measured in real time if you want to go through many iterations of your ad.

In the case of the Obama campaign, there were two definitions of success, or maybe three: how often people signed up to be on Obama’s newsletter, how often they gave money, and how much money they gave. I infer this from Kyle’s braggy second sentence, “Overall we executed about 500 a/b tests on our web pages in a 20 month period which increased donation conversions by 49% and sign up conversions by 161%.” Those were the measures Kyle and his team was optimizing on.

Most of the blog post focused on getting people to donate more, and specifically on getting them to fill out the credit card donation page form. Here’s what they A/B tested:

Our plan was to separate the field groups into four smaller steps so that users did not feel overwhelmed by the length of the form. Essentially the idea was to get users to the top of the mountain by showing them a small incline rather than a steep slope.

What I find super interesting about this stuff (and of course this not the only “data science” that was used in Obama’s campaign, there was a separate team focused on getting Facebook users to share their friends’ lists and such) is that nowhere is there even a slight nod to the question of whether this stuff will improve or even maintain democracy. They don’t even discuss how maintainable this is.

I mean, we gave the Obama analytics team lots of credit for stuff, but in the end what they did was optimize a bunch of people’s donation money. Is that something we should cheer? It seems more like an arms race with the Republican party, in which the Democrats pulled ahead temporarily. And all it means is that the fight for donations will be even more manipulative, by both sides, by the next presidential election cycle.

As Felix Salmon pointed out to me over beer and sausages last week, the problem with big data in politics is that the easiest thing you can measure in politics is money, which means everything is optimized to that metric of success, leaving all other considerations ignored and probably stifled. And yes, “sign ups” are also measurable, but they more or less correspond to people who will receive weekly or daily requests for money from the candidate.

Readers, please tell me I’m wrong. Or suggest a way we can measure something and optimize to something that is less cynical than the size of a war chest.

Categories: arms race, data science

A critique of a review of a book by Bruce Schneier

I haven’t yet read Bruce Schneier’s new book, Data and Goliath: The Hidden Battles To Collect Your Data and Control Your World. I plan to in the coming days, while I’m traveling with my kids for spring break.

Even so, I already feel capable of critiquing this review of his book (hat tip Jordan Ellenberg), written by Columbia Business School Professor and Investment Banker Jonathan Knee. You see, I’m writing a book myself on big data, so I feel like I understand many of the issues intimately.

The review starts out flattering, but then it hits this turn:

When it comes to his specific policy recommendations, however, Mr. Schneier becomes significantly less compelling. And the underlying philosophy that emerges — once he has dispensed with all pretense of an evenhanded presentation of the issues — seems actually subversive of the very democratic principles that he claims animates his mission.

That’s a pretty hefty charge. Let’s take a look into Knee’s evidence that Schneier wants to subvert democratic principles.


First, he complains that Schneier wants the government to stop collecting and mining massive amounts of data in its search for terrorists. Knee thinks this is dumb because it would be great to have lots of data on the “bad guys” once we catch them.

Any time someone uses the phrase “bad guys,” it makes me wince.

But putting that aside, Knee is either ignorant of or is completely ignoring what mass surveillance and data dredging actually creates: the false positives, the time and money and attention, not to mention the potential for misuse and hacking. Knee’s opinion on that is simply that we normal citizens just don’t know enough to have an opinion on whether it works, including Schneier, and in spite of Schneier knowing Snowden pretty well.

It’s just like waterboarding – Knee says – we can’t be sure it isn’t a great fucking idea.

Wait, before we move on, who is more pro-democracy, the guy who wants to stop totalitarian social control methods, or the guy who wants to leave it to the opaque authorities?

Corporate Data Collection

Here’s where Knee really gets lost in Schneier’s logic, because – get this – Schneier wants corporate collection and sale of consumer data to stop. The nerve. As Knee says:

Mr. Schneier promotes no less than a fundamental reshaping of the media and technology landscape. Companies with access to large amounts of personal data would be “automatically classified as fiduciaries” and subject to “special legal restrictions and protections.”

That these limits would render illegal most current business models — under which consumers exchange enhanced access by advertisers for free services – does not seem to bother Mr. Schneier”

I can’t help but think that Knee cannot understand any argument that would threaten the business world as he knows it. After all, he is a business professor and an investment banker. Things seem pretty well worked out when you live in such an environment.

By Knee’s logic, even if the current business model is subverting democracy – which I also argue in my book – we shouldn’t tamper with it because it’s a business model.

The way Knee paints Schneier as anti-democratic is by using the classic fallacy in big data which I wrote about here:

Although professing to be primarily preoccupied with respect of individual autonomy, the fact that Americans as a group apparently don’t feel the same way as he does about privacy appears to have little impact on the author’s radical regulatory agenda. He actually blames “the media” for the failure of his positions to attract more popular support.

Quick summary: Americans as a group do not feel this way because they do not understand what they are trading when they trade their privacy. Commercial and governmental interests, meanwhile, are all united in convincing Americans not to think too hard about it. There are very few people devoting themselves to alerting people to the dark side of big data, and Schneier is one of them. It is a patriotic act.

Also, yes Professor Knee, “the media” generally speaking writes down whatever a marketer in the big data world says is true. There are wonderful exceptions, of course.

So, here’s a question for Knee. What if you found out about a threat on the citizenry, and wanted to put a stop to it? You might write a book and explain the threat; the fact that not everyone already agrees with you wouldn’t make your book anti-democratic, would it?


The rest of the review basically boils down to, “you don’t understand the teachings of the Reverend Dr. Martin Luther King Junior like I do.”

Do you know about Godwin’s law, which says that as soon as someone invokes the Nazis in an argument about anything, they’ve lost the argument?

I feel like we need another, similar rule, which says, if you’re invoking MLK and claiming the other person is misinterpreting him while you have him nailed, then you’ve lost the argument.

Data Justice Launches!

I’m super excited to announce that I’m teaming up with Nathan Newman and Frank Pasquale on a newly launched project called Data Justice and subtitled Challenging Rising Exploitation and Economic Inequality from Big Data.

Nathan Newman is the director of Data Justice and is a lawyer and policy advocate. You might remember his work with racial and economic profiling of Google ads. Frank Pasquale is a law professor at the University of Maryland and the author of a book I recently reviewed called The Black Box Society.

The mission for Data Justice can be read here and explains how we hope to build a movement on the data justice front by working across various disciplines like law, computer science, and technology. We also have a blog and a press release which I hope you have time to read.

Categories: data science, modeling

Reforming the data-driven justice system

This article from the New York Times really interests me. It’s entitled Unlikely Cause Unites the Left and the Right: Justice Reformand although it doesn’t specifically mention “data driven” approaches in justice reform, it describes “emerging proposals to reduce prison populations, overhaul sentencing, reduce recidivism and take on similar initiatives.”

I think this sentence, especially the reference to reducing recidivism, is code for the evidence-based sentencing that my friend Luis Daniel recently posted about. I recently finished a draft chapter in my book about such “big data” models, and after much research I can assure you that this stuff runs the gamut between putting poor people away for longer because they’re poor and actually focusing resources where they’re needed.

The idea that there’s a coalition that’s taking this on that includes both Koch Industries and the ACLU is fascinating and bizarre and – if I may exhibit a rare moment of optimism – hopeful. In particular I’m desperately hoping they have involved people who understand enough about modeling not to assume that the results of models are “objective”.

There are, in fact, lots of ways to set up data-gathering and usage in the justice system to actively fight against unfairness and unreasonably long incarcerations, rather than to simply codify such practices. I hope some of that conversation happens soon.

Categories: data science, modeling

Big data and class

About a month ago there was an interesting article in the New York Times entitled Blowing Off Class? We Know. It discusses the “big data” movement in colleges around the country. For example, at Ball State, they track which students go to parties at the student center. Presumably to help them study for tests, or maybe to figure out which ones to hit up for alumni gifts later on.

There’s a lot to discuss in this article, but I want to focus today on one piece:

Big data has a lot of influential and moneyed advocates behind it, and I’ve asked some of them whether their enthusiasm might also be tinged with a little paternalism. After all, you don’t see elite institutions regularly tracking their students’ comings and goings this way. Big data advocates don’t dispute that, but they also note that elite institutions can ensure that their students succeed simply by being very selective in the first place.

The rest “get the students they get,” said William F. L. Moses, the managing director of education programs at the Kresge Foundation, which has given grants to the innovation alliance and to bolster data-analytics efforts at other colleges. “They have a moral obligation to help them succeed.”

This is a sentiment I’ve noticed a lot, although it’s not usually this obvious. Namely, the elite don’t need to be monitored, but the rabble does. The rich and powerful get to be quirky philosophers but the rest of the population need to be ranked and filed. And, by the way, we are spying on them for their own good.

In other words, never mind how big data creates and expands classism; classism already helps decide who is put into the realm of big data in the first place.

It feeds into the larger question of who is entitled to privacy. If you want to be strict about your definition of pricacy, you might say “nobody.” But if you recognize that privacy is a spectrum, where we have a variable amount of information being collected on people, and also a variable amount of control over people whose information we have collected, then upon study, you will conclude that privacy, or at least relative privacy, is for the rich and powerful. And it starts early.

Wage Gaps Don’t Magically Get Smaller Because Big Data

Today, just a rant. Sorry. I mean, I’m not a perfect person either, and of course that’s glaringly obvious, but this fluff piece from Wired, written by Pam Wikham of Raytheon, is just aggravating.

The title is Big Data, Smaller Wage Gap? and, you know, it almost gives us the impression that she has a plan to close the wage gap using big data, or alternatively an argument that the wage gap will automatically close with the advent of big data techniques. It turns out to be the former, but not really.

After complaining about the wage gap for women in general, and after we get to know how much she loves her young niece, here’s the heart of the plan (emphasis mine, on the actual plan parts of the plan):

Analytics and microtargeting aren’t just for retailers and politicians — they can help us grow the ranks of executive women and close the gender wage gap. Employers analyze who clicked on internal job postings, and we can pursue qualified women who looked but never applied. We can go beyond analyzing the salary and rank histories of women who have left our companies. We can use big data analytics to tell us what exit interviews don’t.

Facebook posts, Twitter feeds and LinkedIn groups provide a trove of valuable intel from ex-employees. What they write is blunt, candid and useful. All the data is there for the taking — we just have to collect it and figure out what it means. We can delve deep into whether we’re promoting the best people, whether we’re doing enough to keep our ranks diverse, whether potential female leaders are being left behind and, importantly, why.

That’s about it, after that she goes back to her niece.

Here’s the thing, I’m not saying it’s not an important topic, but that plan doesn’t seem worthy of the title of the piece. It’s super vague and fluffy and meaningless. I guess, if I had to give it meaning, it would be that she’s proposing to understand internal corporate sexism using data, rather than assuming “data is objective” and that all models will make things better. And that’s one tiny step, but it’s not much. It’s really not enough.

Here’s an idea, and it kind of uses big data, or at least small data, so we might be able to sell it. Ask people in your corporate structure what the actual characteristics are of people they promote, and how they are measured, or if they are measured, and look at the data to see if what they say is consistent with what they do, and whether those characteristics are inherently sexist. It’s a very specific plan and no fancy mathematical techniques are necessary, but we don’t have to tell anyone that.

What combats sexism is a clarification and transparent description of job requirements and a willingness to follow through. Look at blind orchestra auditions for a success story there. By contrast, my experience with the corporate world is that, when hiring or promoting, they often list a long series of unmeasurable but critical properties like “good cultural fit” and “leadership qualities” that, for whatever reason, more men are rated high on than women.

Categories: data science, rant

What would a data-driven Congress look like?

Recently I’ve seen two very different versions of what a more data-driven Congress would look like, both emerging from the recent cruddy Cromnibus bill mess.

First, there’s this Bloomberg article, written by the editors, about using data to produce evidence on whether a given policy is working or not. Given what I know about how data is produced, and how definitions of success are politically manipulated, I don’t have much hope for this idea.

Second, there was a reader’s comments on this New York Times article, also about the Cromnibus bill. Namely, the reader was calling on the New York Times to not only explore a few facts about what was contained in the bill, but lay it out with more numbers and more consistency. I think this is a great idea. What if, when Congress gave us a shitty bill, we could see stuff like:

  1. how much money is allocated to each thing, both raw dollars and as a percentage of the whole bill,
  2. who put it in the omnibus bill,
  3. the history of that proposed spending, and the history of voting,
  4. which lobbyists were pushing it, and who gets paid by them, and ideally
  5. all of this would be in an easy-to-use interactive.

That’s the kind of data that I’d love to see. Data journalism is an emerging field, and we might not be there yet, but it’s something to strive for.

Categories: data science, statistics

Get every new post delivered to your Inbox.

Join 3,236 other followers