Archive for the ‘data science’ Category

Interactive scoring models: why hasn’t this happened yet?

September 12, 2013 10 comments

My friend Suresh just reminded me about this article written a couple of years ago by Malcolm Gladwell and published in the New Yorker.

It concerns various scoring models that claim to be both comprehensive (meaning they cover the whole thing, not just one aspect of the thing) and heterogeneous (meaning they are broad enough to cover all things in a category), say for cars or for colleges.

Weird things happen when you try to do this, like not caring much about price or exterior detailing for sports cars.

Two things. First, this stuff is actually really hard to do well. I like how Gladwell addresses this issue:

At no point, however, do the college guides acknowledge the extraordinary difficulty of the task they have set themselves.

Second of all, I think the issue of combining heterogeneity and comprehensiveness is addressable, but it has to be addressed interactively.

Specifically, what if instead of a single fixed score, there was a place where a given car-buyer or college-seeker could go to fill out a form of preferences? For each defined and rated aspect, the user would answer a question about how much they cared about that aspect. They’d assign a weight to each aspect. A given question would look something like this:

For colleges, some people care a lot about whether their college has a ton of alumni giving, other people care more about whether the surrounding town is urban or rural. Let’s let people create their own scoring system. It’s technically easy.
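
Here’s a minimal sketch of what I mean, in Python. Everything in it is made up for illustration: the `rank_items` function, the two colleges, the aspect names, and the 0-to-10 ratings are all hypothetical, and a weighted average is just one simple way to combine a user’s weights, not the only one.

```python
from typing import Dict, List, Tuple


def rank_items(
    ratings: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Rank items by a weighted average of their per-aspect ratings.

    ratings: item name -> {aspect name -> rating}
    weights: aspect name -> how much this particular user cares (non-negative)
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")

    scored = []
    for name, aspects in ratings.items():
        # Aspects the user gave no weight to simply contribute nothing.
        weighted_sum = sum(weights.get(a, 0.0) * r for a, r in aspects.items())
        scored.append((name, weighted_sum / total_weight))

    # Highest personalized score first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical ratings on a 0-10 scale.
ratings = {
    "College A": {"alumni_giving": 9, "urban_setting": 2},
    "College B": {"alumni_giving": 4, "urban_setting": 8},
}

# Two users with different priorities get different rankings.
cares_about_giving = {"alumni_giving": 5, "urban_setting": 1}
cares_about_city = {"alumni_giving": 1, "urban_setting": 5}

print(rank_items(ratings, cares_about_giving))
print(rank_items(ratings, cares_about_city))
```

The point of the sketch is that the ratings are shared and fixed, while the ranking depends entirely on the weights each person brings.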

I’ve suggested this before when I talked about rating math articles on various dimensions (hard, interesting, technical, well-written) and then letting people come and search based on weighting those dimensions and ranking. But honestly we can start even dumber, with car ratings and college ratings.

Categories: data science, modeling

Working in the NYC Mayor’s Office

September 10, 2013 7 comments

I recently took a job in the NYC Mayor’s Office as an unpaid consultant. It’s an interesting time to be working for the Mayor, to be sure – everyone’s waiting to see what happens this week with the election, and all sorts of things are up in the air. Planning essentially stops at December 31st.

Note the expiration date.

I’m working in a data group which deals with social service agency data. That means Child Services, Homeless Services, and the like. Any agency where there is direct contact with lots of people and their data. The idea is for me to help them out with a project that, if successful, I might be able to take to another city as a product. I’m still working full-time at the same job.

Specifically, my goal is to figure out a way to use data to help the people involved – the homeless, for example – get connected to better services. As a side effect I think this should make the agency more efficient. Far too many data studies only care about efficiency – how to make do with fewer police or fewer ambulances – with no thought or care about whether the people experiencing the services are being affected. I want to start with the people, and hope for efficiency gains, which I believe will come.

One thing that has already amazed me about this job, which I’ve just started, is the conversations people have about the ethics of data privacy.

It is a well-known fact that, as you link more and more data about people together, you can predict their behavior better. So for example, you could theoretically link all the different agency data for a given person into a profile, including crime data, health data, education and the like.

This might help you profile that person, and that might help you offer them better services. But it also might not be what that person wants you to do, especially if you start adding social media information. There’s a tension between the best model and reasonable limits of privacy and decency, even when the model is intended to be used in a primarily helpful manner. It’s more obvious when you’re attempting something insidious like predictive policing, of course.

Now, it shouldn’t shock me to have such conversations, because after all we are talking about some of the most vulnerable populations here. But even so, it does.

In all my time as a predictive modeler, I’ve never been in that kind of conversation, about the malicious things people could do with such-and-such profile information, or with this or that model, unless I started it myself.

When you work as a quant in finance, the data you work with is utterly sanitized to the point where, although it eventually trickles down to humans, you are asked to think of it as generated by some kind of machine, which we call “the market.”

Similarly, when you work in ad tech or other internet modeling, you think of users as the targets of your predatory goals: click on this, user, or buy that, user! They are prey, and the more we know about them the better our aim will be. If we can buy their profiles from Acxiom, all the better for our purposes.

This is the opposite of all of that. Super interesting, and glad I am being given this opportunity.

Categories: data science, modeling

The art of definition

Definitions are basic objects in mathematics. Even so, I’ve never seen the art of definition explicitly taught, and I have rarely seen the need for a definition explicitly discussed.

Have you ever noticed how damn hard it is to make a good definition and yet how utterly useful a good definition can be?

The basic definitions inform the research of any field, and a good definition will lead to better theorems than a bad one. If you get them right, if you really nail down the definition, then everything works out much more cleanly than otherwise.

So for example, it doesn’t make sense to work in algebraic geometry without the concepts of affine and projective space, and varieties, and schemes. They are to algebraic geometry like circles and triangles are to elementary geometry. You define your objects, then you see how they act and how they interact.

I saw first hand how a good definition improves clarity of thought back in grad school. I was lucky enough to talk to John Tate (my mathematical hero) about my thesis, and after listening to me go on for some time with a simple object but complicated proofs, he suggested that I add an extra sentence to my basic object, an assumption with a fixed structure.

This gave me a bit more explaining to do up front – but even there added intuition – and greatly simplified the statement and proofs of my theorems. It also improved my talks about my thesis. I could now go in and spend some time motivating the definition, and then state the resulting theorem very cleanly once people were convinced.

Another example from my husband’s grad seminar this semester: he’s starting out with the concept of triangulated categories coming from Verdier’s thesis. One mysterious part of the definition involves the so-called “octahedral axiom,” which mathematicians have been grappling with ever since it was invented. As far as Johan tells it, people struggle with why it’s necessary but not that it’s necessary, or at least something very much like it. What’s amazing is that Verdier managed to get it right when he was so young.

Why? Because definition building is naturally iterative, and it can take years to get it right. It’s not an obvious process. I have no doubt that many arguments were once fought over the most basic definitions, although I’m no historian. There’s a whole evolutionary struggle that I can imagine could take place as well – people could make the wrong definition, and the community would not be able to prove good stuff about that, so it would eventually give way to stronger, more robust definitions. Better to start out carefully.

Going back to that. I think it’s strange that the building up of definitions is not explicitly taught. I think it’s a result of the way math is taught as if it’s already known, so the mystery of how people came up with the theorems is almost hidden, never mind the original objects and questions about them. For that matter, it’s not often discussed why we care whether a given theorem is important, just whether it’s true. Somehow the “importance” conversations happen in quiet voices over wine at the seminar dinners.

Personally, I got just as much out of Tate’s help with my thesis as anything else about my thesis. The crystalline focus that he helped me achieve with the correct choice of the “basic object of study” has made me want to do that every single time I embark on a project, in data science or elsewhere.

Simons Center for Data Analysis

Has anyone heard of the new Simons Center for Data Analysis?

Neither had I until just now. But some guy named Leslie Greengard, who is a distinguished mathematician and computer scientist, just got named its director (hat tip Peter Woit).

Please inform me if you know more about this center. I got nothing except this tiny description:

As SCDA’s director, Greengard will build and lead a team of scientists committed to analyzing large-scale, rich data sets and to developing innovative mathematical methods to examine such data.

Categories: data science, news

Experimentation in education – still a long way to go

Yesterday’s New York Times ran a piece by Gina Kolata on randomized experiments in education. Namely, they’ve started to use randomized experiments like they do in medical trials. Here’s what’s going on:

… a little-known office in the Education Department is starting to get some real data, using a method that has transformed medicine: the randomized clinical trial, in which groups of subjects are randomly assigned to get either an experimental therapy, the standard therapy, a placebo or nothing.

They have preliminary results:

The findings could be transformative, researchers say. For example, one conclusion from the new research is that the choice of instructional materials — textbooks, curriculum guides, homework, quizzes — can affect achievement as profoundly as teachers themselves; a poor choice of materials is at least as bad as a terrible teacher, and a good choice can help offset a bad teacher’s deficiencies.

So far, the office — the Institute of Education Sciences — has supported 175 randomized studies. Some have already concluded; among the findings are that one popular math textbook was demonstrably superior to three competitors, and that a highly touted computer-aided math-instruction program had no effect on how much students learned.

Other studies are under way. Cognitive psychology researchers, for instance, are assessing an experimental math curriculum in Tampa, Fla.

If you go to any of the above links, you’ll see that the metric of success is consistently defined as a standardized test score. That’s the only gauge of improvement. So any “progress” that’s made is by definition measured by such a test.

In other words, if we optimize to this system, we will optimize for textbooks which raise standardized test scores. If it doesn’t improve kids’ test scores, it might as well not be in the book. In fact it will probably “waste time” with respect to raising scores, so there will effectively be a penalty for, say, fun puzzles, or understanding why things are true, or learning to write.

Now, if scores are all we cared about, this could and should be considered progress. Certainly Gina Kolata, the NYTimes journalist, didn’t mention that we might not care only about this – she recorded it as unfettered good, as she was expected to by the Education Department, no doubt. But, as a data scientist who gets paid to think about the feedback loops and side effects of choices like “metrics of success,” I have a problem with it.

I don’t have a thing against randomized tests – using them is a good idea, and will maybe even quiet some noise around all the different curriculums, online and in person. I do think, though, that we need to have more ways of evaluating an educational experience than a test score.

After all, if I take a pill once a day to prevent a disease, then what I care about is whether I get the disease, not which pill I took or what color it was. Medicine is a very outcome-focused discipline in a way that education is not. Of course, there are exceptions, say when the treatment has strong and negative side-effects, and the overall effect is net negative. Kind of like when the teacher raises his or her kids’ scores but also causes them to lose interest in learning.

If we go the way of the randomized trial, why not give the students some self-assessments and review capabilities of their text and their teacher (which is not to say teacher evaluations give clean data, because we know from experience they don’t)? Why not ask the students how they liked the book and how much they care about learning? Why not track the students’ attitudes, self-assessment, and goals for a subject for a few years, since we know longer-term effects are sometimes more important than immediate test score changes?

In other words, I’m calling for collecting more and better data beyond one-dimensional test scores. If you think about it, teenagers get treated better by their cell phone companies or Netflix than by their schools.

I know what you’re thinking – that students are all lazy and would all complain about anyone or anything that gave them extra work. My experience is that kids actually aren’t like this, know the difference between rote work and real learning, and love the learning part.

Another complaint I hear coming – long-term studies take too long and are too expensive. But ultimately these things do matter in the long term, and as we’ve seen in medicine, skimping on experiments often leads to bigger and more expensive problems. Plus, we’re not going to improve education overnight.

And by the way, if and/or when we do this, we need to implement strict privacy policies for the students’ answers – you don’t want a 7-year-old’s attitude about math held against him or her when he or she applies to college.


Short your kids, go long your neighbor: betting on people is coming soon

Yet another aspect of Gary Shteyngart’s dystopian fiction novel Super Sad True Love Story is coming true for reals this week.

Besides anticipating Occupy Wall Street, as well as Bloomberg’s sweep of Zuccotti Park (although getting it wrong on how utterly successful such sweeping would be), Shteyngart proposed the idea of instant, real-time and broadcast credit ratings.

Anyone walking around the streets of New York, as they’d pass a certain type of telephone pole – the kind that identifies you via your cell phone and communicates with data warehousing services and databases – would have their credit rating flashed onto a screen. If you went to a party, depending on how you impressed the other party go-ers, your score could plummet or rise in real time, and everyone would be able to keep track and treat you accordingly.

I mean, there were other things about the novel too, but as a data person these details certainly stuck with me since they are both extremely gross and utterly plausible.

And why do I say they are coming true now? I base my claim on two news stories I’ve been sent by my various blog readers recently.

[Aside: if you read my blog and find an awesome article that you want to send me, by all means do! My email address is available on my "About" page.]

First, coming via Suresh and Marcos, we learn that data broker Acxiom is letting people see their warehoused data. A few caveats, bien sûr:

  1. You get to see your own profile, here, starting in 2 days, but only your own.
  2. And actually, you only get to see some of your data. So they won’t tell you if you’re a suspected gambling addict, for example. It’s a curated view, and they want your help curating it more. You know, for your own good.
  3. And they’re doing it so that people have clarity on their business.
  4. Haha! Just kidding. They’re doing it because they’re trying to avoid regulations and they feel like this gesture of transparency might make people less suspicious of them.
  5. And they’re counting on people’s laziness. They’re allowing people to opt out, but of course the people who should opt out would likely never even know about that possibility.
  6. Just keep in mind that, as an individual, you won’t know what they really think they know about you, but as a corporation you can buy complete information about anyone who hasn’t opted out.

In any case those credit scores that Shteyngart talks about are already happening. The only issue is who gets flashed those numbers and when. Instead of the answers being “anyone walking down the street” and “when you walk by a pole” it’s “any corporation on the interweb” and “whenever you browse”.

After all, why would they give something away for free? Where’s the profit in showing the credit scores of anyone to everyone? Hmmmm….

That brings me to my second news story of the morning coming to me via Constantine, namely this TechCrunch story which explains how a startup called Fantex is planning to allow individuals to invest in celebrity athletes’ stocks. Yes, you too can own a tiny little piece of someone famous, for a price. From the article:

People can then buy shares of that player’s brand, like a stock, in the Fantex-consumer market. Presumably, if San Francisco 49ers tight end Vernon Davis has a monster year and looks like he’s going to get a bigger endorsement deal or a larger contract in a few years, his stock would rise and a fan could sell their Davis stock and cash out with a real, monetary profit. People would own tracking or targeted stocks in Fantex that would depend on the specific brand that they choose; these stocks would then rise and fall based on their own performance, not on the overall performance of Fantex.

Let’s put these two things together. I think it’s not too much of a stretch to acknowledge a reason for everyone to know everyone else’s credit score! Namely, we can bet on each other’s futures!

I can’t think of any set-up more exhilarating to the community of hedge fund assholes than a huge, new open market – containing profit potentials for every single citizen of earth – where you get to make money when someone goes to the wrong college, or when someone enters into an unfortunate marriage and needs a divorce, or when someone gets predictably sick. An orgy in the exact center of tech and finance.

Are you with me peoples?!

I don’t know what your Labor Day plans are, but I’m getting ready my list of people to short in this spanking new market.

Summers’ Lending Club makes money by bypassing the Equal Credit Opportunity Act

Don’t know about you, but for some reason I have a sinking feeling when it comes to the idea of Larry Summers. Word on the CNBC street is that he’s about to be named new Fed Chair, and I am living in a state of cognitive dissonance.

To distract myself, I’m going to try better to explain what I started to explain here, when I talked about the online peer-to-peer lending company Lending Club. Summers sits on the board of Lending Club, and from my perspective it’s a logical continuation of his career of deregulation and/or bypassing of vital regulation to enrich himself.

In this case, it’s a vehicle for bypassing the Equal Credit Opportunity Act, which the FTC helps enforce. The law is not perfect, but it “prohibits credit discrimination on the basis of race, color, religion, national origin, sex, marital status, age, or because you get public assistance.” It forces credit scores to be relatively behavior-based, like you see here. Let me contrast that with Lending Club.

Lending Club also uses mathematical models to score people who want to borrow money. These act as credit scores. But in this case, they use data like browsing history or anything they can grab about you on the web or from data warehousing companies like Acxiom (which I’ve written about here). From this Bloomberg article on Lending Club:

“What we’ve done is radically transform the way consumer lending operates,” Laplanche says in his speech. He says that LendingClub keeps staffing low by using algorithms to screen prospective borrowers for risk – rejecting 90 percent of them – and has no physical branches like banks. “The savings can be passed on to more borrowers in terms of lower interest rates and investors in terms of attractive returns.”

I’d focus on the benefit for investors. Big money is now involved in this stuff. Turns out that bypassing credit score regulation is great for business, so of course.

For example, such models might look at your circle of friends on Facebook to see if you “run with the right crowd” before loaning you money. You can now blame your friends if you don’t get that loan! From this CNN article on the subject (hat tip David):

“It turns out humans are really good at knowing who is trustworthy and reliable in their community,” said Jeff Stewart, a co-founder and CEO of Lenddo. “What’s new is that we’re now able to measure through massive computing power.”

Moving along from taking out loans to getting jobs, there’s this description of how recruiters work online to perform digital background checks for potential employees. It’s a different set of laws this time that is subject to arbitrage but it’s exactly the same idea:

Non-discrimination laws prohibit employers from asking job applicants certain questions. They’re not supposed to ask about things like age, race, gender, disability, marital, and veteran status. (As you can imagine, sometimes a picture alone can reveal this privileged information. These safeguards against discrimination urge employers to simply not use this knowledge to make hiring decisions.) In addition to protecting people from systemic prejudice, these employment laws intend to shield us from capricious bias and whimsy. While casually snooping, however, a recruiter can’t unsee your Facebook rant on immigration amnesty, the same for your baby bump on Instagram. From profile pics and bios, blog posts and tweets, simple HR reconnaissance can glean tons of off-limits information.

Along with forcing recruiters to gaze with eyes wide shut, straddling legal liability and ignorance, invisible employment screens deny American workers the robust protections afforded by the FTC and the Fair Credit Reporting Act. The FCRA ensures that prospective employees are notified before their backgrounds and credit scores are verified. Employees are free to decline the checks, but employers are also free to deny further consideration unless a screening is allowed to take place. What’s important here is that employees must first give consent.

When a report reveals unsavory information about a candidate, and the employer chooses to take what’s called “adverse action,”—like deny a job offer—the employer is required to share the content of the background reports with the candidate. The applicant then has the right to explain or dispute inaccurate and incomplete aspects of the background check. Consent, disclosure, and recourse constitute a straightforward approach to employment screening.

Contrast this citizen-empowering logic with the casual Google search or to the informal, invisible social-media exam. As applicants, we don’t know if employers are looking, we’re not privy to what they see, and we have no way to appeal.

As legal scholars Daniel Solove and Chris Hoofnagle discuss, the amateur Google screens that are now a regular feature of work-life go largely unnoticed. Applicants are simply not called back. And they’ll never know the real reason.

I think the silent failure is the scariest part for me – people who don’t get jobs won’t know why.

Similarly, people denied loans from Lending Club by a secret algorithm don’t know why either. Maybe it’s because I made friends with the wrong person on Facebook? Maybe I should just go ahead and stop being friends with anyone who might put my electronic credit score at risk?

Of course this rant is predicated on the assumption that we think anti-discrimination laws are a good thing. In an ideal world, of course, we wouldn’t need them. But that’s not where we live.

Categories: data science, finance, modeling

Staples rips off poor people; let’s take control of our online personas

You’ve probably heard rumors about this here and there, but the Wall Street Journal convincingly reported yesterday that websites charge certain people more for the exact same thing.

Specifically, poor people were more likely to pay more for, say, a stapler from Staples than richer people. Home Depot and Lowe’s do the same for their online customers, and Discover and Capital One make different credit card offers to people depending on where they live (“hey, do you live in a PayDay lender neighborhood? We got the card for you!”).

They got pretty quantitative for Staples, and ran tests to figure out what drives the price. From the article:

It is possible that Staples’ online-pricing formula uses other factors that the Journal didn’t identify. The Journal tested to see whether price was tied to different characteristics including population, local income, proximity to a Staples store, race and other demographic factors. Statistically speaking, by far the strongest correlation involved the distance to a rival’s store from the center of a ZIP Code. That single factor appeared to explain upward of 90% of the pricing pattern.

If anyone’s ever seen a census map, race is highly segregated by ZIP code, and my guess is we’d see pretty high correlations along racial lines as well, although they didn’t mention it in the article except to say that explicit race-related pricing is illegal. The article does mention that things get more expensive in rural areas, which are also poorer, so there’s that acknowledged correlation.

But wait, how much of a price difference are we talking about? From the article:

Prices varied for about a third of the more than 1,000 randomly selected products tested. The discounted and higher prices differed by about 8% on average.

In other words, a really non-trivial amount.

The messed up thing about this, or at least one of them, is that we could actually have way more control over our online personas than we think. It’s invisible to us, typically, so we don’t think about our cookies and our displayed IP addresses. But we could totally manipulate these signatures to our advantage if we set our minds to it.

Hackers, get thyselves to work making this technology easily available.

For that matter, given the 8% difference, there’s money on the line so some straight-up capitalist somewhere should be meeting that need. I for one would be willing to give someone a sliver of the amount saved every time they manipulated my online persona to save me money. You save me $1.00, I’ll give you a dime.

Here’s my favorite part of this plan: it would be easy for Staples to keep track of how much people are manipulating their ZIP codes. So if Staples infers a certain ZIP code for me in order to display a certain price, but then in check-out I ask them to send the package to a different ZIP code, Staples will know after-the-fact that I fooled them. But whatever, last time I looked it didn’t cost more or less to send mail to California or wherever than to Manhattan [Update: they do charge differently for packages, though. That's the only differential in cost I think is reasonable to pay].

I’d love to see them make a case for how this isn’t fair to them.

Categories: data science, modeling, rant

When big data goes bad in a totally predictable way

Three quick examples this morning in the I-told-you-so category. I’d love to hear Kenneth Neil Cukier explain how “objective” data science is when confronted with this stuff.

1. When an unemployed black woman pretends to be white her job offers skyrocket (Urban Intellectuals, h/t Mike Loukides). Excerpt from the article: “Two years ago, I noticed that [the job site] had added a “diversity questionnaire” to the site. This gives an applicant the opportunity to identify their sex and race to potential employers. [The site] guarantees that this “option” will not jeopardize your chances of gaining employment. You must answer this questionnaire in order to apply to a posted position – it cannot be skipped. At times, I would mark off that I was a Black female, but then I thought, this might be hurting my chances of getting employed, so I started selecting the “decline to identify” option instead. That still had no effect on my getting a job. So I decided to try an experiment: I created a fake job applicant and called her Bianca White.”

2. How big data could identify the next felon – or blame the wrong guy (Bloomberg). From the article: “The use of physical characteristics such as hair, eye and skin color to predict future crimes would raise ‘giant red privacy flags’ since they are a proxy for race and could reinforce discriminatory practices in hiring, lending or law enforcement, said Chi Chi Wu, staff attorney at the National Consumer Law Center.”

3. How algorithms magnify misbehavior (the Guardian, h/t Suresh Naidu). From the article: “For one British university, what began as a time-saving exercise ended in disgrace when a computer model set up to streamline its admissions process exposed – and then exacerbated - gender and racial discrimination.”

This is just the beginning, unfortunately.

Categories: data science, modeling

What’s the difference between big data and business analytics?

I offend people daily. People tell me they do “big data” and that they’ve been doing big data for years. Their argument is that they’re doing business analytics on a larger and larger scale, so surely by now it must be “big data”.


There’s an essential difference between true big data techniques, as actually performed at surprisingly few firms but exemplified by Google, and the human-intervention data-driven techniques referred to as business analytics.

No matter how big the data you use is, at the end of the day, if you’re doing business analytics, you have a person looking at spreadsheets or charts or numbers, making a decision after possibly a discussion with 150 other people, and then tweaking something about the way the business is run.

If you’re really doing big data, then those 150 people probably get laid off, or even more likely are never hired in the first place, and the computer is programmed to update itself via an optimization method.

That’s not to say it doesn’t also spit out monitoring charts and numbers, and it’s not to say no person takes a look every now and then to make sure the machine is humming along, but there’s no point at which the algorithm waits for human intervention.

In other words, in a true big data setup, the human has stepped outside the machine and lets the machine do its thing. That means, of course, that it takes way more to set up that machine in the first place, and probably people make huge mistakes all the time in doing this, but sometimes they don’t. Google search got pretty good at this early on.
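
To make the “no human in the loop” idea concrete, here’s a toy sketch in Python. It is an assumption-laden illustration, not a description of what Google or anyone else actually runs: the `OnlineLinearModel` class, the `run_unattended` function, and the synthetic stream are all hypothetical, and plain stochastic gradient descent on squared error stands in for “an optimization method.”

```python
from typing import Iterable, List, Tuple


class OnlineLinearModel:
    """A linear predictor that corrects itself after every observed outcome."""

    def __init__(self, n_features: int, learning_rate: float = 0.05) -> None:
        self.weights: List[float] = [0.0] * n_features
        self.learning_rate = learning_rate

    def predict(self, features: List[float]) -> float:
        return sum(w * x for w, x in zip(self.weights, features))

    def update(self, features: List[float], observed: float) -> float:
        """One gradient step on squared error; returns the prediction error."""
        error = self.predict(features) - observed
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, features)
        ]
        return error


def run_unattended(
    model: OnlineLinearModel,
    stream: Iterable[Tuple[List[float], float]],
) -> OnlineLinearModel:
    """Feed outcomes to the model as they arrive. Nothing here waits for a person."""
    for features, observed in stream:
        model.update(features, observed)
    return model


# Synthetic, noise-free stream: outcome = 3 + 2 * x, with a constant feature of 1.
stream = [([1.0, x], 3.0 + 2.0 * x) for x in (0.0, 1.0, 2.0)] * 500
model = run_unattended(OnlineLinearModel(n_features=2), stream)
print(model.weights)  # approaches [3.0, 2.0] on this toy stream
```

A business analytics version of the same thing would stop after each batch, print a chart, and wait for a meeting.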

So with a business analytics set up we might keep track of the number of site visitors and a few sales metrics so we can later try to (and fail to) figure out whether a specific email marketing campaign had the intended effect.

But in a big data set-up it’s typically much more microscopic and detail oriented, collecting everything it can, maybe 1,000 attributes of a single customer, and figuring out what that customer is likely to do next time, how much they’ll spend, and the magic question, whether there will even be a next time.

So the first thing I offend people about is that they’re not really part of the “big data revolution”. And the second thing is that, usually, their job is potentially up for grabs by an algorithm.

Categories: data science, modeling

Larry Summers and the Lending Club

So here’s something potential Fed Chair Larry Summers is involved with, a company called Lending Club, which creates a money lending system that cuts out the middle man banks.

Specifically, people looking for money come to the site and tell their stories, and try to get loans. The investors invest in whichever loans look good to them, for however much money they want. For a perspective on the risks and rewards of this kind of peer-to-peer lending operation, look at this Wall Street Journal article which explains things strictly from the investor’s point of view.

A few red flags go up for me as I learn more about Lending Club.

First, from this NYTimes article, “The company [Lending Club] itself is not regulated as a bank. But it has teamed up with a bank in Utah, one of the states that allows banks to charge high interest rates, and that bank is overseen by state regulators and the Federal Deposit Insurance Corporation.”

I’m not sure how the FDIC is involved exactly, but the Utah connection is good for something, namely allowing high interest rates. According to the same article, 37% of loans carry APRs of between 19% and 29%.

Next, Summers is referred to in that article as being super concerned about the ability for the consumers to pay back the loans. But I wonder how someone is supposed to be both desperate enough to go for a 25% APR loan and also able to pay back the money. This sounds like loan sharking to me.

Probably what bothers me most though is that Lending Club, in addition to offering credit scores and income when they have that information, also scores people asking for loans with a proprietary model which is, you guessed it, unregulated. Specifically, if it’s anything like ZestFinance, it could use signals more correlated to being uneducated and/or poor than to the willingness or ability to pay back loans.

By the way, I’m not saying this concept is bad for everyone: there are probably winners on the side of the loanees, who might get a loan they otherwise couldn’t get, or better terms, or a more bespoke contract. I’m more worried about the idea of this becoming the new normal of how money changes hands and how that would affect people already squeezed out of the system.

I’d love your thoughts.

Categories: data science, finance, modeling

Should lawmakers use algorithms?

Here is an idea I’ve been hearing floating around the big data/tech community: the idea of having algorithms embedded into law.

The argument for is pretty convincing on its face: Google has gotten its algorithms to work better and better over time by optimizing correctly and using tons of data. To some extent we can think of their business strategies and rules as a kind of “internal regulation”. So why don’t we take a page out of that book and improve our laws and specifically our regulations with constant feedback loops and big data?

No algos in law

There are some concerns I have right off the bat about this concept, putting aside the hugely self-serving dimension of it.

First of all, we would be adding opacity – of the mathematical modeling kind – to an already opaque system of law. It’s hard enough to read the legalese in a credit card contract without there also being a black box algorithm to make it impossible.

Second of all, whereas the incentives in Google are often aligned with the algorithm “working better”, whatever that means in any given case, the incentives of the people who write laws often aren’t.

So, for example, financial regulation is largely written by lobbyists. If you gave them a new tool, that of adding black box algorithms, then you could be sure they would use it to further obfuscate what is already a hopelessly complicated set of rules, and on top of it they’d be sure to measure the wrong thing and optimize to something random that would not interfere with their main goal of making big bets.

Right now lobbyists are used so heavily in part because they understand the complexity of their industries more than the lawmakers themselves. In other words, they actually add value in a certain way (besides in the monetary way). Adding black boxes would emphasize this asymmetric information problem, which is a terrible idea.

Third, I’m worried about the “black box” part of algorithms. There’s a strange assumption among modelers that you have to make algorithms secret or else people will game them. But as I’ve said before, if people can game your model, that just means your model sucks, and specifically that your proxies are not truly behavior-based.

So if it pertains to a law against shoplifting, say, you can’t have an embedded model which uses the proxy of “looking furtive and having bulges in your clothes.” You actually need to have proof that someone stole something.

If you think about that example for a moment, it’s absolutely not appropriate to use poor proxies in law, nor is it appropriate to have black boxes at all – we should all know what our laws are. This is true for regulation as well, since it’s after all still law which affects how people are expected to behave.

And by the way, what counts as a black box is to some extent in the eye of the beholder. It wouldn’t be enough to have the source code available, since that’s only accessible to a very small subset of the population.

Instead, anyone who is under the expectation of following a law should also be able to read and understand the law. That’s why the CFPB is trying to get credit card contracts written in plain English. Similarly, regulation law should be written in a way so that the employees of the regulator in question can understand it, and that means you shouldn’t have to have a Ph.D. in a quantitative field and know Python.

Algos as tools

Here’s where algorithms may help, although it is still tricky: not in the law itself but in the implementation of the law. So it makes sense that the SEC has algorithms trying to catch insider trading – in fact it’s probably the only way for them to attempt to catch the bad guys. For that matter they should have many more algorithms to catch other kinds of bad guys, for example to catch people with suspicious accounting or consistently optimistic ratings.

In this case proxies are reasonable, but on the other hand it doesn’t translate into law but rather into a ranking of workflow for the people at the regulatory agency. In other words the SEC should use algorithms to decide which cases to pursue and on what timeframe.

Even so, there are plenty of reasons to worry. One could view the “Stop & Frisk” strategy in New York as following an algorithm as well, namely to stop young men in high-crime areas who make “furtive motions”. This algorithm happens to single out many innocent black and Latino men.

Similarly, some of the highly touted New York City open data projects amount to figuring out that if you focus on looking for building code violations in high-crime areas, then you get a better hit rate. Again, the consequence of using the algorithm is that poor people are targeted at a higher rate for all sorts of crimes (key quote from the article: “causation is for other people”).

Think about this asymptotically: if you live in a nice neighborhood, the limited police force and inspection agencies never check you out since their algorithms have decided the probability of bad stuff happening is too low to bother. If, on the other hand, you are poor and live in a high-crime area, you get checked out daily by various inspectors, who bust you for whatever.

Said this way, it kind of makes sense that white kids smoke pot at the same rate as black kids but are almost never busted for it.

There are ways to partly combat this problem, as I’ve described before, by using randomization.
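To make the asymptotic argument concrete, here is a toy sketch of that feedback loop. Everything in it is an assumption made for illustration: the two neighborhoods, the identical underlying violation rate, the "send every inspector wherever the records look worst" rule, and the `explore` parameter standing in for randomization. It uses expected values rather than random draws, and it is not a model of any real agency.

```python
def simulate(rounds, inspections_per_round, true_rate, recorded, explore=0.0):
    """Toy model: inspections chase whichever area has the most recorded violations.

    true_rate: hypothetical violations found per inspection, by neighborhood
    recorded:  violations already on the books, by neighborhood
    explore:   fraction of inspections spread evenly over all neighborhoods
    """
    recorded = dict(recorded)
    visits = {name: 0.0 for name in recorded}
    for _ in range(rounds):
        # The "algorithm": target the area whose record looks worst so far.
        target = max(recorded, key=recorded.get)
        random_share = inspections_per_round * explore / len(recorded)
        for name in recorded:
            n = random_share
            if name == target:
                n += inspections_per_round * (1 - explore)
            visits[name] += n
            # Expected number of new violations found on these visits.
            recorded[name] += n * true_rate[name]
    return recorded, visits


# Both areas have the same true rate; only the starting records differ.
rates = {"high_crime": 0.1, "nice": 0.1}
start = {"high_crime": 5, "nice": 4}

greedy_recorded, greedy_visits = simulate(10, 100, rates, start)
mixed_recorded, mixed_visits = simulate(10, 100, rates, start, explore=0.2)
```

In the greedy run the nice neighborhood is never visited, so its record never changes and the data can never reveal that the two areas behave identically. With a slice of inspections assigned regardless of the records, both areas get visited, and the violations found per visit come out the same, which is the kind of check randomization makes possible.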


It seems to me that we can’t have algorithms directly embedded in laws, because of the highly opaque nature of them together with commonly misaligned incentives. They might be useful as tools for regulators, but the regulators who choose to use internal algorithms need to carefully check that their algorithms don’t have unreasonable and biased consequences, which is really hard.

Categories: data science, finance, modeling

PyData talk today

Not much time because I’m giving a keynote talk at the PyData 2013 conference in Cambridge today, which is being held at the Microsoft NERD conference center.

It’s gonna be videotaped so I’ll link to that when it’s ready.

My title is “Storytelling With Data” but for whatever reason on the schedule handed out yesterday the name had been changed to “Scalable Storytelling With Data”. I’m thinking of addressing this name change in my talk – one of the points of the talk, in fact, is that with great tools, we don’t need to worry too much about the scale.

Plus since it’s Sunday morning I’m going to make an effort to tie my talk into an Old Testament story, which is totally bizarre since I’m not at all religious but for some reason it feels right. Please wish me luck.

The Stop and Frisk sleight of hand

I’m finishing up an essay called “On Being a Data Skeptic” in which I catalog different standard mistakes people make with data – sometimes unintentionally, sometimes intentionally.

It occurred to me, as I wrote it, and as I read the various press conferences with departing mayor Bloomberg and Police Commissioner Raymond Kelly when they addressed the Stop and Frisk policy, that they are guilty of making one of these standard mistakes. Namely, they use a sleight of hand with respect to the evaluation metric of the policy.

Recall that an evaluation metric for a model is the way you decide whether the model works. So if you’re predicting whether someone would like a movie, you should go back and check whether your recommendations were good, and revise your model if not. It’s a crucial part of the model, and a poor choice for it can have dire consequences – you could end up optimizing to the wrong thing.

[Aside: as I've complained about before, the Value Added Model for teachers doesn't have an evaluation method of record, which is a very bad sign indeed about the model. And that's a Bloomberg brainchild as well.]

So what am I talking about?

Here’s the model: stopping and frisking suspicious-looking people in high-crime areas will improve the safety and well-being of the city as a whole.

Here’s Bloomberg/Kelly’s evaluation method: the death rate by murder has gone down in New York during the policy. However, that rate is highly variable and depends just as much on whether there’s a crack epidemic going on as anything else. Or maybe it’s improved medical care. Truth is people don’t really know. In any case ascribing credit for the plunging death rate to Stop and Frisk is a tenuous causal argument. Plus, even though Stop and Frisk events have decreased drastically recently, we haven’t seen the murder rate shoot up.

Here’s another possible evaluation method: trust in the police. And considering that 400,000 innocent black and Latino New Yorkers were stopped last year under this policy (here are more stats), versus fewer than 50,000 whites, and most of them were young men, it stands to reason that the average young minority male feels less trust towards police than the average young white male. In fact, this is an amazing statistic put together by the NYCLU from 2011:

The number of stops of young black men exceeded the entire city population of young black men (168,126 as compared to 158,406).

If I’m a black guy I have an expectation of getting stopped and frisked at least once per year. How does that make me trust cops?
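The arithmetic behind that expectation is just the ratio of the two NYCLU figures quoted above. A quick check, with the caveat that this is an average: it says nothing about how the stops are spread across individuals, and some people are surely stopped many times while others are never stopped.

```python
# Figures from the NYCLU statistic quoted above (2011).
stops_of_young_black_men = 168_126
population_of_young_black_men = 158_406

# Average number of stops per person in that group over the year.
stops_per_person = stops_of_young_black_men / population_of_young_black_men
print(round(stops_per_person, 2))  # → 1.06
```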

Let’s choose an evaluation method closer to what we can actually control, and let’s optimize to it.

Update: a guest columnist fills in for David Brooks, hopefully not for the last time, and gives us his take on Kelly, Obama, and racial profiling.

Categories: data science, modeling, rant

The creepy mindset of online credit scoring

Usually I like to think through abstract ideas – thought experiments, if you will – and not get too personal. I take exceptions for certain macroeconomists who are already public figures but most of the time that’s it.

Here’s a new category of people I’ll call out by name: CEOs who defend creepy models using the phrase “People will trade their private information for economic value.”

That’s a quote from Douglas Merrill, CEO of ZestFinance, from this video recorded at a recent data conference in Berkeley (hat tip Rachel Schutt). It was a panel discussion, the putative topic of which was something like “Attacking the structure of everything”, whatever that’s supposed to mean (I’m guessing it has something to do with being proud of “disrupting shit”).

Do you know the feeling you get when you’re with someone who’s smart, articulate, who probably buys organic eggs from a nice farmer’s market, but who doesn’t show an ounce of sympathy for people who aren’t successful entrepreneurs? When you’re with someone who has benefitted so entirely and so consistently from the system that they have an almost religious belief that the system is perfect and they’ve succeeded through merit alone?

It’s something in between the feeling that, maybe you’re just naive because you’ve led such a blessed life, or maybe you’re actually incapable of human empathy, I don’t know which because it’s never been tested.

That’s the creepy feeling I get when I hear Douglas Merrill speak, but it actually started earlier, when I got the following email almost exactly one year ago via LinkedIn:

Hi Catherine,

Your profile looked interesting to me.

I’m seeking stellar, creative thinkers like you, for our team in Hollywood, CA. If you would consider relocating for the right opportunity, please read on.

You will use your math wizardry to develop radically new methods for data access, manipulation, and modeling. The outcome of your work will result in game-changing software and tools that will disrupt the credit industry and better serve millions of Americans.

You would be working alongside people like Douglas Merrill – the former CIO of Google – along with a handful of other ex-Googlers and Capital One folks. More info can be found on our LinkedIn company profile or at

At ZestFinance we’re bringing social responsibility to the consumer loan industry.

Do you have a few moments to talk about this? If you are not interested, but know someone else who might be a fit, please send them my way!

I hope to hear from you soon. Thank you for your time.


Wow, let’s “better serve millions of Americans” through manipulation of their private data, and then let’s call it being socially responsible! And let’s work with Capital One which is known to be practically a charity.


Message to ZestFinance: “getting rich with predatory lending” doesn’t mean “being socially responsible” unless you have a really weird definition of that term.

Going back to the video, I have a few more tasty quotes from Merrill:

  1. First when he’s describing how he uses personal individual information scraped from the web: “All data is credit data.”
  2. Second, when he’s comparing ZestFinance to FICO credit scoring: “Context is developed by knowing thousands of things about you. I know you as a person, not just you via five or six variables.”

I’d like to remind people that, in spite of the creepiness here, and the fact that his business plan is a death spiral of modeling, everything this guy is talking about is totally legal. And as I said in this post, I’d like to see some pushback to guys like Merrill as well as to the NSA.

Categories: data science, rant

On being a data science skeptic: due out soon

A few months ago, at the end of January, I wrote a post about Bill Gates’s naive views on the objectivity of data. One of the commenters, “CitizensArrest,” asked me to take a look at a related essay written by Susan Webber entitled “Management’s Great Addiction: It’s time we recognized that we just can’t measure everything.”

Webber’s essay is really excellent, not to mention impressively prescient considering it was published in 2006, before the credit crisis. The format of the essay is simple: it brings up and explains various dangers in the context of measurement and modeling of business data, and calls for finding a space in business for skepticism. What an idea! Imagine if that had actually happened in finance when it should have back in 2006.

Please go read her essay, it’s short.

Recently, when O’Reilly asked me to write an essay, I thought back to this short piece and decided to use it as a template for explaining why I think there’s a just-as-desperate need for skepticism in 2013 here in the big data world as there was back then in finance.

Whereas most of Webber’s essay talks about people blindly accepting numbers as true, objective, precise, and important, and the related tragic consequences, I’ve added a small wrinkle to this discussion. Namely, I also worry about the people who underestimate the power of data.

Most of this disregard for unintended consequences is blithe and unintentional (and some of it isn’t), but even so it can be hugely damaging, especially to the individuals being modeled: think foreclosed homes due to crappy housing-related models in the past, and think creepy models and the death spiral of modeling for the present and future.

Anyhoo, I’m actively writing it now, and it’ll be coming out soon. Stay tuned!

Categories: data science, finance, modeling

How to be wrong

My friend Josh Vekhter sent me this blog post written by someone who calls herself celandine13 and tutors students with learning disabilities.

In the post, she reframes the concept of mistake or “being bad at something” as often stemming from some fundamental misunderstanding or poor procedure:

Once you move it to “you’re performing badly because you have the wrong fingerings,” or “you’re performing badly because you don’t understand what a limit is,” it’s no longer a vague personal failing but a causal necessity.  Anyone who never understood limits will flunk calculus.  It’s not you, it’s the bug.

This also applies to “lazy.”  Lazy just means “you’re not meeting your obligations and I don’t know why.”  If it turns out that you’ve been missing appointments because you don’t keep a calendar, then you’re not intrinsically “lazy,” you were just executing the wrong procedure.  And suddenly you stop wanting to call the person “lazy” when it makes more sense to say they need organizational tools.

And she wants us to stop with the labeling and get on with the understanding of why the mistake was made and addressing that, like she does when she tutors students. She even singles out certain approaches she considers to be flawed from the start:

This is part of why I think tools like Knewton, while they can be more effective than typical classroom instruction, aren’t the whole story.  The data they gather (at least so far) is statistical: how many questions did you get right, in which subjects, with what learning curve over time?  That’s important.  It allows them to do things that classroom teachers can’t always do, like estimate when it’s optimal to review old material to minimize forgetting.  But it’s still designed on the error model. It’s not approaching the most important job of teachers, which is to figure out why you’re getting things wrong — what conceptual misunderstanding, or what bad study habit, is behind your problems.  (Sometimes that can be a very hard and interesting problem.  For example: one teacher over many years figured out that the grammar of Black English was causing her students to make conceptual errors in math.)

On the one hand I like the reframing: it’s always good to see knee-jerk reactions become more contemplative, and it’s always good to see people trying to help rather than trying to blame. In fact, one of my tenets of real life is that mistakes will be made, and it’s not the mistake that we should be anxious about but how we act to fix the mistake that exposes who we are as people.

I would, however, like to take issue with her anti-example in the case of Knewton, which is an online adaptive learning company. Full disclosure: I interviewed with Knewton before I took my current job, and I like the guys who work there. But, I’d add, I like them partly because of the healthy degree of skepticism they take with them to their jobs.

What the blogwriter celandine13 is pointing out, correctly, is that understanding causality is pretty awesome when you can do it. If you can figure out why someone is having trouble learning something, and if you can address that underlying issue, then fixing the consequences of that issue gets a ton easier. Agreed, but I have three points to make:

  1. First, a non-causal data mining engine such as Knewton will also stumble upon a way to fix the underlying problem by dint of having a ton of data and noting that people who failed a calculus test, say, did much better after having limits explained to them in a certain way. This is much like how Google’s spellcheck engine works: by keeping track of previous spelling errors, and not by mind reading how people think about spelling wrong.
  2. Second, it’s not always easy to find the underlying cause of bad testing performance, even if you’re looking for it directly. I’m not saying it’s fruitless – tutors I know are incredibly good at that – but there’s room for both “causality detectives” and tons of smart data mining in this field.
  3. Third, it’s definitely not always easy to address the underlying cause of bad test performance. If you find out that the grammar of Black English affects students’ math test scores, what do you do about it?
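To make the first point concrete, here is a minimal sketch of a purely data-driven corrector: it knows nothing about why people misspell a word, it only remembers which fix most often followed which error. The `CorrectionLog` class and its method names are made up for illustration, and this is not a description of how Google’s spellcheck is actually built.

```python
from collections import Counter, defaultdict


class CorrectionLog:
    """Remembers which correction most often followed each typed string."""

    def __init__(self):
        # typed string -> counts of the corrections people ended up using
        self._counts = defaultdict(Counter)

    def record(self, typed, corrected):
        """Log one observed (error, fix) pair."""
        self._counts[typed][corrected] += 1

    def suggest(self, typed):
        """Return the most frequently observed fix, or None if never seen."""
        seen = self._counts.get(typed)
        if not seen:
            return None
        return seen.most_common(1)[0][0]


log = CorrectionLog()
for typed, corrected in [
    ("calculas", "calculus"),
    ("calculas", "calculus"),
    ("calculas", "calculate"),
]:
    log.record(typed, corrected)

print(log.suggest("calculas"))  # → calculus
print(log.suggest("limmit"))    # → None
```

The same shape applies to the tutoring case: swap “typed string” for “kind of wrong answer” and “correction” for “explanation that preceded improvement”, and the engine can surface a useful fix without ever modeling the cause.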

Having said all that, I’d like to once more agree with the underlying message that a mistake is first and foremost a signal rather than a reflection of someone’s internal thought processes. The more we think of mistakes as learning opportunities the faster we learn.

Who stays off the data radar?

Last night’s Data Skeptics Meetup talk by Suresh Naidu was great, as I suspected it would be. I’m not going to be able to cover everything he talked about (a discussion is forming here as well) but I’ll touch on a few things related to my chosen topic for the day, namely who stays off the data radar.

In his talk Suresh discussed the history of governments tracking people with data, which more or less until recently was the history of the census. The issue of trust or lack thereof that people have in being classified and tracked has been central since the get-go, and with it the understanding by the data collectors that people respond differently to data collection when they anticipate it being used against them.

Among other examples he mentioned the efforts of the U.S. Census Bureau to stay independent (specifically, away from any kind of tax decisions) in order to be trusted but then turning around during war time and using census tracts to put Japanese Americans into internment camps.

It made me wonder, who distrusts data collection so much that they manage to stay off the data radar?

Suresh gave quite a few examples of people who did this out of fear of persecution or what have you, and because, at least in the example of the Domesday Book, once land ownership was written down it was somehow “more official and objective” than anything else, which of course resulted in some people getting screwed out of their land.

It’s not just a historical problem, of course: it’s still true that certain populations, especially illegal immigrant populations, are afraid of how the census will be used and go undercounted. Who can say when the census might start being used to deport illegal immigrants?

As a kind of anti-example, he mentioned that the census was essentially canceled in 1920 because the South knew that so many ex-slaves were moving north that their representation in government was growing weak. I say anti-example because in this case it wasn’t out of distrust, to avoid detection, but it was a savvy and political move, to remain looking large.

What about the modern version of government tracking? In this case, of course, it’s not just census data, but anything else the NSA happens to collect about us. I’m no expert (tell me if you know data on this) but I will hazard a guess on who avoids being tracked:

  1. Old people who don’t have computers and never have,
  2. Members of hacking group Anonymous who know how it works and how to bypass the system, and
  3. People who have worked or are now working at the NSA.

Of course there are a few other rare people that just happen to care enough about privacy to educate themselves on how to avoid being tracked. But it’s hard to do, obviously.

Let me soften the requirements a bit – instead of staying off the radar completely, who makes it really hard to find them?

If you’re talking about individuals, I’d start with this answer: politicians. In my work with Peter Darche and Lee Drutman from the Sunlight Foundation (blog post coming soon!) trying to follow money in politics, it’s amazed me time and time again how difficult it’s been to put together the political events for a given politician – events that are individually publicly recorded but are seemingly intentionally siloed so it will be extremely difficult to put together a narrative. Thanks to Peter’s recent efforts, and the Sunlight Foundation’s long-term efforts, we are getting to the point where we can do this, but it’s been a data munging problem from hell.

If you’re generalizing to entities and corporations, then the “making data collection hard” award should probably go to the corporations with hundreds of subsidiaries all over the world which now don’t even need to be reported on tax forms.

Funny how the very people who know the most about how data can be used are paranoid about being tracked.

Categories: data science

Tonight: first Data Skeptics Meetup, Suresh Naidu

I’m psyched to see Suresh Naidu tonight in the first Data Skeptics Meetup. He’s talking about Political Uses and Abuses of Data and his abstract is this:

While a lot has been made of the use of technology for election campaigns, little discussion has focused on other political uses of data. From targeting dissidents and tax-evaders to organizing protests, the same datasets and analytics that let data scientists do prediction of consumer and voter behavior can also be used to forecast political opponents, mobilize likely leaders, solve collective problems and generally push people around. In this discussion, Suresh will put this in a 1000 year government data-collection perspective, and talk about how data science might be getting used in authoritarian countries, both by regimes and their opponents.

Given the recent articles highlighting this kind of stuff, I’m sure the topic will provoke a lively discussion – my favorite kind!

Unfortunately the Meetup is full but I’d love you guys to give suggestions for more speakers and/or more topics.

The politics of data mining

At first glance, data miners inside governments, start-ups, corporations, and political campaigns are all doing basically the same thing. They’ll all need great engineering infrastructure, good clean data, a working knowledge of statistical techniques and enough domain knowledge to get things done.

We’ve seen recent articles that are evidence for this statement: Facebook data people move to the NSA or other government agencies easily, and Obama’s political campaign data miners have launched a new data mining start-up. I am a data miner myself, and I could honestly work at any of those places – my skills would translate, if not my personality.

I do think there are differences, though, and here I’m not talking about ethics or trust issues, I’m talking about pure politics[1].

Namely, the world of data mining is divided into two broad categories: people who want to cause things to happen and people who want to prevent things from happening.

I know that sounds incredibly vague, so let me give some examples.

In start-ups, irrespective of what you’re actually doing (what you’re actually doing is probably incredibly banal, like getting people to click on ads), you feel like you’re the first person ever to do it, at least on this scale, or at least with this dataset, and that makes it technically challenging and exciting.

Or, even if you’re not the first, at least what you’re creating or building is state-of-the-art and is going to be used to “disrupt” or destroy lagging competition. You feel like a motherfucker, and it feels great[2]!

The same thing can be said for Obama’s political data miners: if you read this article, you’ll know they felt like they’d invented a new field of data mining, and a cult along with it, and it felt great! And although it’s probably not true that they did something all that impressive technically, in any case they did a great job of applying known techniques to a different data set, and they got lots of people to allow access to their private information based on their trust of Obama, and they mined the fuck out of it to persuade people to go out and vote and to go out and vote for Obama.

Now let’s talk about corporations. I’ve worked in enough companies to know that “covering your ass” is a real thing, and can overwhelm a given company’s other goals. And the larger the company, the more the fear sets in and the more time is spent covering one’s ass and less time is spent inventing and staying state-of-the-art. If you’ve ever worked in a place where it takes months just to integrate two different versions of SalesForce you know what I mean.

Those corporate people have data miners too, and in the best case they are somewhat protected from the conservative, risk averse, cover-your-ass atmosphere, but mostly they’re not. So if you work for a pharmaceutical company, you might spend your time figuring out how to draw up the numbers to make them look good for the CEO so he doesn’t get axed.

In other words, you spend your time preventing something from happening rather than causing something to happen.

Finally, let’s talk about government data miners. If there’s one thing I learned when I went to the State Department Tech@State “Moneyball Diplomacy” conference a few weeks back, it’s that they are the most conservative of all. They spend their time worrying about a terrorist attack and how to prevent it. It’s all about preventing bad things from happening, and that makes for an atmosphere where causing good things to happen takes a back seat.

I’m not saying anything really new here; I think this stuff is pretty uncontroversial. Maybe people would quibble over when a start-up becomes a corporation (my answer: mostly they never do, but certainly by the time of an IPO they’ve already done it). Also, of course, there are ass-coverers in start-ups and there are risk-takers in corporations and maybe even in government, but they don’t dominate.

If you think through things in this light, it makes sense that Obama’s data miners didn’t want to stay in government and decided to go work on advertising stuff. And although they might have enough clout and buzz to get hired by a big corporation, I think they’ll find it pretty frustrating to be dealing with the cover-my-ass types that will hire them. It also makes sense that Facebook, which spends its time making sure no other social network grows enough to compete with it, works so well with the NSA.

1. If you want to talk ethics, though, join me on Monday at Suresh Naidu’s Data Skeptics Meetup where he’ll be talking about Political Uses and Abuses of Data.

2. This is probably why start-up guys are so arrogant.

