Archive for the ‘data science’ Category

Harvard Applied Statistics workshop today

I’m on an Amtrak train to Boston today to give a talk in the Applied Statistics workshop at Harvard, which is run out of the Harvard Institute for Quantiative Social Science. I was kindly invited by Tess Wise, a Ph.D. student in the Department of Government at Harvard who is organizing this workshop.

My title is “Data Skepticism in Industry” but as I wrote the talk (link to my prezi here) it transformed a bit and now it’s more about the problems not only for data professionals inside industry but for the public as well. So I talk about creepy models and how there are multiple longterm feedback loops having a degrading effect on culture and democracy in the name of short-term profits. 

Since we’re on the subject of creepy, my train reading this morning is this book entitled “Murdoch’s Politics,” which talks about how Rupert Murdoch lives by design in the center of all things creepy. 

Categories: data science, modeling

Disorderly Conduct with Alexis and Jesse #OWS


So there’s a new podcast called Disorderly Conduct which “explores finance without a permit” and is hosted by Alexis Goldstein, whom I met through her work on Occupy the SEC, and Jesse Myerson, and activist and a writer.

I was recently a very brief guest on their “In the Weeds” feature, where I was asked to answer the question, “What is the single best way to rein in the power of Wall Street?” in three minutes. The answers given by:

  1. me,
  2. The Other 98% organizer Nicole Carty (@nacarty),
  3. contributing writer David Dayen (@ddayen),
  4. Americans for Financial Reform Policy Director Marcus Stanley (@MarcusMStanley), and
  5. Marxist militant José Martín (@sabokitty)

can be found here or you can download the episode here.

Occupy Finance video series

We’ve been having our Occupy Finance book club meetings every Sunday, and although our group has decided not to record them, a friend of our group and a videographer in her own right, Donatella Barbarella, has started to interview the authors and post them on YouTube. The first few interviews have made their way to the interwebs:

  1. Linda talking about Chapter 1: Financialization and the 99%.
  2. Me talking about Chapter 2: the Bailout
  3. Tamir talking about Chapter 3: How banks work

Doing Data Science now out!

O’Reilly is releasing the book today. I can’t wait to see a hard copy!! And when I say “hard copy,” keep in mind that all of O’Reilly’s books are soft cover.

Categories: #OWS, data science, finance

*Doing Data Science* now available on Kindle!

My book with Rachel Schutt is now available on Kindle. I’ve tested this by buying it myself from and looking at it on my computer’s so-called cloud reader.

Here’s the good news. It is actually possible to do this, and it’s satisfying to see!

Here’s the bad news. The kindle reader doesn’t render latex well, or for that matter many of the various fonts we use for various reasons. The result is a pretty comical display of formatting inconsistency. In particular, whenever a formula comes up it might seem like we’re

screaming about it

and often the quoted passages come in

very very tiny indeed.

I hope it’s readable. If you prefer less comical formatting, the hard copy edition is coming out on October 22nd, next Tuesday.

Next, a word about the book’s ranking. Amazon has this very generous way of funneling down into categories sufficiently so that the ranking of a given book looks really high. So right now I can see this on the book’s page:

but for a while, before yesterday, it took a few more iterations of digging to get to single digits, so it was more like:

But you, know, I’ll take what I can get to be #1! It’s all about metrics!!!

One last thing, which is that the full title is now “Doing Data Science: Straight Talk from the Frontline” and for the record, I wanted the full title to be something more like “Doing Data Science: the no bullshit approach” but for some reason I was overruled. Whatevs.

Categories: data science

Cumulative covariance plots

One thing I do a lot when I work with data is figure out how to visualize my signals, especially with respect to time.

Lots of things change over time – relationships between variables, for example – and it’s often crucial to get deeply acquainted with how exactly that works with your in-sample data.

Say I am trying to predict “y”: so for a data point at time t, we’ll say we try to predict y(t). I’ll take an “x”, a variable that is expected to predict “y”, and I’ll demean both series x and y, hopefully in a causal way, and I will rename them x’ and y’, and then, making sure I’ve ordered everything with respect to time, I’ll plot the cumulative sum of the product x’(t) * y’(t).

In the case that both x’(t) and y’(t) have the both sign – so they’re both bigger than average or they’re both smaller than average, this product is positive, and otherwise it’s negative. So if you plot the cumulative sum, you get an upwards trend if things are positively correlated and downwards trend if things are negatively correlated. If you think about it, you are computing the numerator of the correlation function, so it is indeed just an unscaled version of total correlation.

Plus, since you ordered everything by time first, you can see how the relationship between these variables evolved over time.

Also, in the case that you are working with financial models, you can make a simplifying assumption that both x and y are pretty well demeaned already (especially at short time scales) and this gives you the cumulative PnL plot of your model. In other words, it tells you how much money your model is making.

So I was doing this exercise of plotting the cumulative covariance with some data the other day, and I got a weird picture. It kind of looked like a “U” plot: it went down dramatically at the beginning, then was pretty flat but trending up, then it went straight up at the end. It ended up not quite as high as it started, which is to say that in terms of straight-up overall correlation, I was calculating something negative but not very large.

But what could account for that U-shape? After some time I realized that the data had been extracted from the database in such a way that, after ordering my data by date, it was hugely biased in the beginning and at the end, in different directions, and that this was unavoidable, and the picture helped me determine exactly which data to exclude from my set.

After getting rid of the biased data at the beginning and the end, I concluded that I had a positive correlation here, even though if I’d trusted the overall “dirty” correlation I would have thought it was negative.

This is good information, and confirmed my belief that it’s always better to visualize data over time than it is to believe one summary statistic like correlation.

Categories: data science, modeling

Data Skeptic post

I wrote a blog post for O’Reilly’s website to accompany my essay, On Being a Data Skeptic. Here’s an excerpt:

I left finance pretty disgusted with the whole thing, and because I needed to make money and because I’m a nerd, I pretty quickly realized I could rebrand myself a “data scientist” and get a pretty cool job, and that’s what I did. Once I started working in the field, though, I was kind of shocked by how positive everyone was about the “big data revolution” and the “power of data science.”

Not to underestimate the power of data––it’s clearly powerful! And big data has the potential to really revolutionize the way we live our lives for the better––or sometimes not. It really depends.

From my perspective, this was, in tenor if not in the details, the same stuff we’d been doing in finance for a couple of decades and that fields like advertising were slow to pick up on. And, also from my perspective, people needed to be way more careful and skeptical of their powers than they currently seem to be. Because whereas in finance we need to worry about models manipulating the market, in data science we need to worry about models manipulating people, which is in fact scarier. Modelers, if anything, have a bigger responsibility now than ever before.

Categories: data science, finance, modeling

Guest post: Rage against the algorithms

This is a guest post by , a Tow Fellow at the Columbia University Graduate School of Journalism where he is researching the use of data and algorithms in the news. You can find out more about his research and other projects on his website or by following him on Twitter. Crossposted from engenhonetwork with permission from the author.


How can we know the biases of a piece of software? By reverse engineering it, of course.

When was the last time you read an online review about a local business or service on a platform like Yelp? Of course you want to make sure the local plumber you hire is honest, or that even if the date is dud, at least the restaurant isn’t lousy. A recent survey found that 76 percent of consumers check online reviews before buying, so a lot can hinge on a good or bad review. Such sites have become so important to local businesses that it’s not uncommon for scheming owners to hire shills to boost themselves or put down their rivals.

To protect users from getting duped by fake reviews Yelp employs an algorithmic review reviewer which constantly scans reviews and relegates suspicious ones to a “filtered reviews” page, effectively de-emphasizing them without deleting them entirely. But of course that algorithm is not perfect, and it sometimes de-emphasizes legitimate reviews and leaves actual fakes intact—oops. Some businesses have complained, alleging that the filter can incorrectly remove all of their most positive reviews, leaving them with a lowly one- or two-stars average.

This is just one example of how algorithms are becoming ever more important in society, for everything from search engine personalizationdiscriminationdefamation, and censorship online, to how teachers are evaluated, how markets work, how political campaigns are run, and even how something like immigration is policed. Algorithms, driven by vast troves of data, are the new power brokers in society, both in the corporate world as well as in government.

They have biases like the rest of us. And they make mistakes. But they’re opaque, hiding their secrets behind layers of complexity. How can we deal with the power that algorithms may exert on us? How can we better understand where they might be wronging us?

Transparency is the vogue response to this problem right now. The big “open data” transparency-in-government push that started in 2009 was largely the result of an executive memo from President Obama. And of course corporations are on board too; Google publishes a biannual transparency report showing how often they remove or disclose information to governments. Transparency is an effective tool for inculcating public trust and is even the way journalists are now trained to deal with the hole where mighty Objectivity once stood.

But transparency knows some bounds. For example, though the Freedom of Information Act facilitates the public’s right to relevant government data, it has no legal teeth for compelling the government to disclose how that data was algorithmically generated or used in publicly relevant decisions (extensions worth considering).

Moreover, corporations have self-imposed limits on how transparent they want to be, since exposing too many details of their proprietary systems may undermine a competitive advantage (trade secrets), or leave the system open to gaming and manipulation. Furthermore, whereas transparency of data can be achieved simply by publishing a spreadsheet or database, transparency of an algorithm can be much more complex, resulting in additional labor costs both in creation as well as consumption of that information—a cognitive overload that keeps all but the most determined at bay. Methods for usable transparency need to be developed so that the relevant aspects of an algorithm can be presented in an understandable way.

Given the challenges to employing transparency as a check on algorithmic power, a new and complementary alternative is emerging. I call it algorithmic accountability reporting. At its core it’s really about reverse engineering—articulating the specifications of a system through a rigorous examination drawing on domain knowledge, observation, and deduction to unearth a model of how that system works.

As interest grows in understanding the broader impacts of algorithms, this kind of accountability reporting is already happening in some newsrooms, as well as in academic circles. At the Wall Street Journal a team of reporters probed e-commerce platforms to identify instances of potential price discrimination in dynamic and personalized online pricing. By polling different websites they were able to spot several, such as, that were adjusting prices dynamically based on the location of the person visiting the site. At the Daily Beast, reporter Michael Keller dove into the iPhone spelling correction feature to help surface patterns of censorship and see which words, like “abortion,” the phone wouldn’t correct if they were misspelled. In my own investigation for Slate, I traced the contours of the editorial criteria embedded in search engine autocomplete algorithms. By collecting hundreds of autocompletions for queries relating to sex and violence I was able to ascertain which terms Google and Bing were blocking or censoring, uncovering mistakes in how these algorithms apply their editorial criteria.

All of these stories share a more or less common method. Algorithms are essentially black boxes, exposing an input and output without betraying any of their inner organs. You can’t see what’s going on inside directly, but if you vary the inputs in enough different ways and pay close attention to the outputs, you can start piecing together some likeness for how the algorithm transforms each input into an output. The black box starts to divulge some secrets.

Algorithmic accountability is also gaining traction in academia. At Harvard, Latanya Sweeney has looked at how online advertisements can be biased by the racial association of names used as queries. When you search for “black names” as opposed to “white names” ads using the word “arrest” appeared more often for online background check service Instant Checkmate. She thinks the disparity in the use of “arrest” suggests a discriminatory connection between race and crime. Her method, as with all of the other examples above, does point to a weakness though: Is the discrimination caused by Google, by Instant Checkmate, or simply by pre-existing societal biases? We don’t know, and correlation does not equal intention. As much as algorithmic accountability can help us diagnose the existence of a problem, we have to go deeper and do more journalistic-style reporting to understand the motivations or intentions behind an algorithm. We still need to answer the question of why.

And this is why it’s absolutely essential to have computational journalists not just engaging in the reverse engineering of algorithms, but also reporting and digging deeper into the motives and design intentions behind algorithms. Sure, it can be hard to convince companies running such algorithms to open up in detail about how their algorithms work, but interviews can still uncover details about larger goals and objectives built into an algorithm, better contextualizing a reverse-engineering analysis. Transparency is still important here too, as it adds to the information that can be used to characterize the technical system.

Despite the fact that forward thinkers like Larry Lessig have been writing for some time about how code is a lever on behavior, we’re still in the early days of developing methods for holding that code and its influence accountable. “There’s no conventional or obvious approach to it. It’s a lot of testing or trial and error, and it’s hard to teach in any uniform way,” noted Jeremy Singer-Vine, a reporter and programmer who worked on the WSJ price discrimination story. It will always be a messy business with lots of room for creativity, but given the growing power that algorithms wield in society it’s vital to continue to develop, codify, and teach more formalized methods of algorithmic accountability. In the absence of new legal measures, it may just provide a novel way to shed light on such systems, particularly in cases where transparency doesn’t or can’t offer much clarity.

New Essay, On Being a Data Skeptic, now out

It is available here and is based on a related essay written by Susan Webber entitled “Management’s Great Addiction: It’s time we recognized that we just can’t measure everything.” It is being published by O’Reilly as an e-book.

No, I don’t know who that woman is looking skeptical on the cover. I wish they’d asked me for a picture of a skeptical person, I think my 11-year-old son would’ve done a better job.

Categories: data science, modeling, musing

Sometimes, The World Is Telling You To Polish Up Your LinkedIn Profile

September 27, 2013 9 comments

The above title was stolen verbatim from an excellent essay by Dan Milstein on the Hut 8 Labs blog (hat tip Deane Yang). The actual title of the essay is “No Deadlines For You! Software Dev Without Estimates, Specs or Other Lies”

He wrote the essay about how, as an engineer, you can both make yourself invaluable to your company and avoid meaningless and arbitrary deadlines on your projects. So, he’s an engineer, but the advice he gives is surprisingly close to the advice I was trying to give on Monday night when I spoke at the Columbia Data Science Society (my slides are here, by the way). More on that below.

Milstein is an engaging writer. He wrote a book called Coding, Fast and Slow, which I now feel like reading just because I enjoy his insights and style. Here’s a small excerpt:

Let’s say you’ve started at a new job, leading a small team of engineers. On your first day, an Important Person comes by your desk. After some welcome-to-the-business chit chat, he/she hands you a spec. You look it over—it describes a new report to add to the company’s product. Of course, like all specs, it’s pretty vague, and, worse, it uses some jargon you’ve heard around the office, but haven’t quite figured out yet.

You look up from the spec to discover that the Important Person is staring at you expectantly: “So, <Your Name>, do you think you and your team can get that done in 3 months?”

What do you do?

Here are some possible approaches (all of which I’ve tried… and none of which has ever worked out well):

  • Immediately try to flesh out the spec in more detail

“How are we summing up this number? Is this piece of data required? What does <jargon word> mean, here, exactly?”

  • Stall, and take the spec to your new team

“Hmm. Hmm. Hmmmmmmmm. Do you think, um, Bob (that’s his name, right?) has the best handle on these kinds of things?”

  • Give the spec a quick skim, and then listen to the seductive voice of System I

“Sure, yeah, 3 months sounds reasonable” (OMG, I wish this wasn’t something I’ve doneSO MANY TIMES).

  • Push back aggressively

“I read this incredibly convincing blog post 1 about how it’s impossible to commit to deadlines for software projects, sorry, I just can’t do that.”

He then goes on to write that very blog post. In it, he explains what you should do, which is to learn why the project has been planned in the first place, and what the actual business question is, so you have full context for your job and you know what it means to the company for this to succeed or fail.

The way I say this, regularly, to aspiring data scientists I run into, is that you are often given a data science question that’s been filtered from a business question, through a secondary person who has some idea that they’ve molded that business question into a “mathematical question,” and they want you to do the work of answering that question, under some time constraint and resource constraints that they’ve also picked out of the air.

But often that process has perverted the original aims – often because proxies have magically appeared in the place of the original objects of interest – and it behooves a data scientist who doesn’t want to be working on the wrong problem to go to the original source and verify that their work is answering a vital business question, that they’re optimizing for the right thing, and that they understand the actual constraints (like deadlines but also resources) rather than the artificial constraints made up by whoever is in charge of telling the nerds what to do.

In other words, I suggest that each data scientist “becomes part business person,” and talks to the business owner of the given problem directly until they’re sure they know what needs to get done with data.

Milstein has a bunch of great tips on how to go through with this process, including:

  1. Counting on people’s enjoyment of hearing their own ideas repeated and fears understood,
  2. Using a specific template when talking to Important People, namely a) “I’m going to echo that back, make sure I understand”, b) echo it back, c) “Do I have that right?”.
  3. To always think and discuss your work in terms of risks and information for the business. Things like, you need this information to answer this risk. The point here is it always stays relevant to the business people while you do your technical thing. This means always keeping a finger on the pulse of the business problem.
  4. Framing choices for the Important Person in terms of clear trade-offs of risk, investments, and completion. This engages the business in what your process is in a completely understandable way.
  5. Finally, if your manager doesn’t let you talk directly to the Important People in the business, and you can’t convince your manager to change his or her mind, then you might wanna polish up your LinkedIn profile, because otherwise you are fated to work on failed projects. Great advice.
Categories: data science

A Code of Conduct for data scientists from the Bellagio Fellows

September 25, 2013 3 comments

The 2013 PopTech & Rockefeller Foundation Bellagio Fellows - Kate CrawfordPatrick MeierClaudia PerlichAmy LuersGustavo Faleiros and Jer Thorp - yesterday published “Seven Principles for Big Data and Resilience Projects” on Patrick Meier’s blog iRevolution.

Although they claim that these principles are meant for “best practices for resilience building projects that leverage Big Data and Advanced Computing,” I think they’re more general than that (although I’m not sure exactly what a resilience building project is) I and I really like them. They are looking for public comments too. Go to the post for the full description of each, but here is a summary:

1. Open Source Data Tools

Wherever possible, data analytics and manipulation tools should be open source, architecture independent and broadly prevalent (R, python, etc.).

2. Transparent Data Infrastructure

Infrastructure for data collection and storage should operate based on transparent standards to maximize the number of users that can interact with the infrastructure.

3. Develop and Maintain Local Skills

Make “Data Literacy” more widespread. Leverage local data labor and build on existing skills.

4. Local Data Ownership

Use Creative Commons and licenses that state that data is not to be used for commercial purposes.

5. Ethical Data Sharing

Adopt existing data sharing protocols like the ICRC’s (2013). Permission for sharing is essential. How the data will be used should be clearly articulated. An opt in approach should be the preference wherever possible, and the ability for individuals to remove themselves from a data set after it has been collected must always be an option.

6. Right Not To Be Sensed

Local communities have a right not to be sensed. Large scale city sensing projects must have a clear framework for how people are able to be involved or choose not to participate.

7. Learning from Mistakes

Big Data and Resilience projects need to be open to face, report, and discuss failures.

Upcoming talks

September 23, 2013 4 comments

Tomorrow evening I’m meeting with the Columbia Data Science Society and talking to them – who as I understand it are mostly engineers - about “how to think like a data scientist”.

On October 11th I’ll be in D.C. sitting on a panel discussion organized by the Americans for Financial Reform. It’s part of a day-long event on the topic of transparency in financial regulation. The official announcement isn’t out yet but I’ll post it here as soon as I can. I’ll be giving my two cents on what mathematical tools can do and cannot do with respect to this stuff.

On October 16th I’ll again be in D.C. giving a talk in the MAA Distinguished Lecture Series to a mostly high school math teacher audience. My talk is entitled, “Start Your Own Netflix”.

Finally, I’m going to Harvard on October 30th to give a talk in their Applied Statistics Workshop series. I haven’t figured out exactly what I’m talking about but it will be something nerdy and skeptical.

Categories: data science, finance

I’d like you to eventually die

September 22, 2013 24 comments

Google has formally thrown their hat into the “rich people should never die” arena, with an official announcement of their new project called Calico, “a new company that will focus on health and well-being, in particular the challenge of aging and associated diseases”. Their plan is to use big data and genetic research to avoid aging.

I saw this coming when they hired Ray Kurzweil. Here’s an excerpt from my post:

A few days ago I read a New York Times interview of Ray Kurzweil, who thinks he’s going to live forever and also claims he will cure cancer if and when he gets it (his excuse for not doing it in his spare time now: “Well, I mean, I do have to pick my priorities. Nobody can do everything.”). He also just got hired at Google.

Here’s the thing. We need people to die. Our planet cannot sustain all the people currently alive as well as all the people who are going to someday be born. Just not gonna happen. Plus, it would be a ridiculously boring place to live. Think about how boring it is already for young people to be around old people. I bore myself around my kids, and I’m only 30 years older than they are.

And yes, it’s tragic when someone we love actually becomes one of those people whose time has come, especially if they’re young and especially if it seemed preventable. For that matter, I’m all for figuring out how to improve the quality of life for people.

But the idea that we’re going to figure out how to keep alive a bunch of super rich advertising executives just doesn’t seem right – because, let’s face it, there will have to be a way to choose who lives and who dies, and I know who is at the top of that list – and I for one am not on board with the plan. Larry Page, Tim Cook, and Ray Kurzweil: I’d really like it if you eventually died.

On the other hand, I’m not super worried about this plan coming through either. Big data can do a lot but it’s not going to make people live forever. Or let’s say it another way: if they can use big data to make people live forever, they can also use big data to convince me that super special rich white men living in Silicon Valley should take up resources and airtime for the rest of eternity.

Categories: data science, rant

The bursting of the big data bubble

September 20, 2013 42 comments

It’s been a good ride. I’m not gonna lie, it’s been a good time to be a data whiz, a quant-turned-data scientist. I get lots of attention and LinkedIn emails just for my title and my math Ph.D., and it’s flattering. But all of that is going to change, starting now.

You see, there are some serious headwinds. They started a while ago but they’re picking up speed, and the magical wave of hype propelling us forward is giving way. I can tell, I’ve got a nose for sinking ships and sailing metaphors.

First, the hype and why it’s been so strong.

It seems like data and the ability to use data is the secret sauce in so many of the big success stories. Look at Google. They managed to think of the entire web as their data source, and have earned quite a bit of respect and advertising money for their chore of organizing it like a huge-ass free library for our benefit. That took some serious data handling and modeling know-how.

We humans are pretty good at detecting patterns, so after a few companies made it big with the secret data sauce, we inferred that, when you take a normal tech company and sprinkle on data, you get the next Google.

Next, a few reasons it’s unsustainable

Most companies don’t have the data that Google has, and can never hope to cash in on stuff at the scale of the ad traffic that Google sees. Even so, there are lots of smaller but real gains that lots of companies – but not all – could potentially realize if they collected the right kind of data and had good data people helping them.

Unfortunately, this process rarely actually happens the right way, often because the business people ask their data people the wrong questions to being with, and since they think of their data people as little more than pieces of software – data in, magic out – they don’t get their data people sufficiently involved with working on something that data can address.

Also, since there are absolutely no standards for what constitutes a data scientist, and anyone who’s taken a machine learning class at college can claim to be one, the data scientists walking around often have no clue how to actually form the right questions to ask anyway. They are lopsided data people, and only know how to answer already well-defined questions like the ones that Kaggle comes up with. That’s less than half of what a good data scientist does, but people have no idea what a good data scientist does.

Plus, it’s super hard to accumulate hard evidence that you have a crappy data science team. If you’ve hired one or more unqualified data scientists, how can you tell? They still might be able to implement crappy models which don’t answer the right question, but in order to see that you’d need to also have a good data scientist who implements a better solution to the right question. But you only have one. It’s a counterfactual problem.

Here’s what I see happening. People have invested some real money in data, and they’ve gotten burned with a lack of medium-term results. Now they’re getting impatient for proof that data is an appropriate place to invest what little money their VC’s have offered them. That means they want really short-term results, which means they’re lowballing data science expertise, which means they only attract people who’ve taken one machine learning class and fancy themselves experts.

In other words, data science expertise has been commodified, and it’s a race to the bottom. Who will solve my business-critical data problem on a short-term consulting basis for less than $5000? Less than $4000?

What’s next?

There really is a difference between A) crude models that someone constructs not really knowing what they’re doing and B) thoughtful models which gain an edge along the margin. It requires someone who actually knows what they’re doing to get the latter kind of model. But most people are unaware of even the theoretical difference between type A and type B models, nor would they recognize which type they’ve got once they get one.

Even so, over time, type B models outperform type A models, and if you care enough about the marginal edge between the two types, say because you’re in a competitive environment, then you will absolutely need type B to make money. And by the way, if you don’t care about that marginal edge, then by all means you should use a type A solution. But you should at least know the difference and make that choice deliberately.

My forecast is that, once the hype wave of big data is dead and gone, there will emerge reasonable standards of what a data scientist should actually be able to do, and moreover a standard of when and how to hire a good one. It’ll be a rubrik, and possibly some tests, of both problem solving and communication.

Personally, I’m looking forward to a more reasonable and realistic vision of how data and data expertise can help with things. I might have to change my job title, but I’m used to it.

Categories: data science

Interactive scoring models: why hasn’t this happened yet?

September 12, 2013 10 comments

My friend Suresh just reminded me about this article written a couple of years ago by Malcolm Gladwell and published in the New Yorker.

It concerns various scoring models that claim to be both comprehensive (which means it covers the whole thing, not just one aspect of the thing) and heterogeneous (which means it is broad enough to cover all things in a category), say for cars or for colleges.

Weird things happen when you try to do this, like not caring much about price or exterior detailing for sports cars.

Two things. First, this stuff is actually really hard to do well. I like how Gladwell addresses this issue:

At no point, however, do the college guides acknowledge the extraordinary difficulty of the task they have set themselves.

Second of all, I think the issue of combining heterogeneity and comprehensiveness is addressable, but it has to be addressed interactively.

Specifically, what if instead of a single fixed score, there was a place where a given car-buyer or college-seeker could go to fill out a form of preferences? For each defined and rated aspect, the user would fill answer a question about how much they cared about that aspect. They’d assign a weight to each aspect. A given question would look something like this:

For colleges, some people care a lot about whether their college has a ton of alumni giving, other people care more about whether the surrounding town is urban or rural. Let’s let people create their own scoring system. It’s technically easy.

I’ve suggested this before when I talked about rating math articles on various dimensions (hard, interesting, technical, well-written) and then letting people come and search based on weighting those dimensions and ranking. But honestly we can start even dumber, with car ratings and college ratings.

Categories: data science, modeling

Working in the NYC Mayor’s Office

September 10, 2013 7 comments

I recently took a job in the NYC Mayor’s Office as an unpaid consultant. It’s an interesting time to be working for the Mayor, to be sure – everyone’s waiting to see what happens this week with the election, and all sorts of things are up in the air. Planning essentially stops at December 31st.

Note the expiration date.

Note the expiration date.

I’m working in a data group which deals with social service agency data. That means Child Services, Homeless Services, and the like. Any agency where there there is direct contact with lots of people and their data. The idea is for me to help them out with a project that, if successful, I might be able to take to another city as a product. I’m still working full-time at the same job.

Specifically, my goal is to figure out a way to use data to help the people involved – the homeless, for example – get connected to better services. As a side effect I think this should make the agency more efficient. Far too many data studies only care about efficiency – how to make do with fewer police or fewer ambulances – with no thought or care about whether the people experiencing the services are being affected. I want to start with the people, and hope for efficiency gains, which I believe will come.

One thing that has already amazed me about this job, which I’ve just started, is the conversations people have about the ethics of data privacy.

It is a well-known fact that, as you link more and more data about people together, you can predict their behavior better. So for example, you could theoretically link all the different agency data for a given person into a profile, including crime data, health data, education and the like.

This might help you profile that person, and that might help you offer them better services. But it also might not be what that person wants you to do, especially if you start adding social media information. There’s a tension between the best model and reasonable limits of privacy and decency, even when the model is intended to be used in a primarily helpful manner. It’s more obvious when you’re attempting something insidious like predictive policing, of course.

Now, it shouldn’t shock me to have such conversations, because after all we are talking about some of the most vulnerable populations here. But even so, it does.

In all my time as a predictive modeler, I’ve never been in that kind of conversation, about the malicious things people could do with such-and-such profile information, or with this or that model, unless I started it myself.

When you work as a quant in finance, the data you work with is utterly sanitized to the point where, although it eventually trickles down to humans, you are asked to think of it as generated by some kind of machine, which we call “the market.”

Similarly, when you work in ad tech or other internet modeling, you think of users as the targets of your predatory goals: click on this, user, or buy that, user! They are prey, and the more we know about them the better our aim will be. If we can buy their profiles from Acxiom, all the better for our purposes.

This is the opposite of all of that. Super interesting, and glad I am being given this opportunity.

Categories: data science, modeling

The art of definition

Definitions are basic objects in mathematics. Even so, I’ve never seen the art of definition explicitly taught, and I have rarely seen the need for a definition explicitly discussed.

Have you ever noticed how damn hard it is to make a good definition and yet how utterly useful a good definition can be?

The basic definitions inform the research of any field, and a good definition will lead to better theorems than a bad one. If you get them right, if you really nail down the definition, then everything works out much more cleanly than otherwise.

So for example, it doesn’t make sense to work in algebraic geometry without the concepts of affine and projective space, and varieties, and schemes. They are to algebraic geometry like circles and triangles are to elementary geometry. You define your objects, then you see how they act and how they interact.

I saw first hand how a good definition improves clarity of thought back in grad school. I was lucky enough to talk to John Tate (my mathematical hero) about my thesis, and after listening to me go on for some time with a simple object but complicated proofs, he suggested that I add an extra sentence to my basic object, an assumption with a fixed structure.

This gave me a bit more explaining to do up front – but even there added intuition – and greatly simplified the statement and proofs of my theorems. It also improved my talks about my thesis. I could now go in and spend some time motivating the definition, and then state the resulting theorem very cleanly once people were convinced.

Another example from my husband’s grad seminar this semester: he’s starting out with the concept of triangulated categories coming from Verdier’s thesis. One mysterious part of the definition involves the so-called “octahedral axiom,” which mathematicians have been grappling with ever since it was invented. As far as Johan tells it, people struggle with why it’s necessary but not that it’s necessary, or at least something very much like it. What’s amazing is that Verdier managed to get it right when he was so young.

Why? Because definition building is naturally iterative, and it can take years to get it right. It’s not an obvious process. I have no doubt that many arguments were once fought over whether the most basic definitions, although I’m no historian. There’s a whole evolutionary struggle that I can imagine could take place as well – people could make the wrong definition, and the community would not be able to prove good stuff about that, so it would eventually give way to stronger, more robust definitions. Better to start out carefully.

Going back to that. I think it’s strange that the building up of definitions is not explicitly taught. I think it’s a result of the way math is taught as if it’s already known, so the mystery of how people came up with the theorems is almost hidden, never mind the original objects and questions about them. For that matter, it’s not often discussed why we care whether a given theorem is important, just whether it’s true. Somehow the “importance” conversations happen in quiet voices over wine at the seminar dinners.

Personally, I got just as much out of Tate’s help with my thesis as anything else about my thesis. The crystalline focus that he helped me achieve with the correct choice of the “basic object of study” has made me want to do that every single time I embark on a project, in data science or elsewhere.

Simons Center for Data Analysis

Has anyone heard of the new Simons Center for Data Analysis?

Neither had I until just now. But some guy named Leslie Greengard, who is a distinguished mathematician and computer scientist, just got named its director (hat tip Peter Woit).

Please inform me if you know more about this center. I got nothing except this tiny description:

As SCDA’s director, Greengard will build and lead a team of scientists committed to analyzing large-scale, rich data sets and to developing innovative mathematical methods to examine such data.

Categories: data science, news

Experimentation in education – still a long way to go

Yesterday’s New York Times ran a piece by Gina Kolata on randomized experiments in education. Namely, they’ve started to use randomized experiments like they do in medical trials. Here’s what’s going on:

… a little-known office in the Education Department is starting to get some real data, using a method that has transformed medicine: the randomized clinical trial, in which groups of subjects are randomly assigned to get either an experimental therapy, the standard therapy, a placebo or nothing.

They have preliminary results:

The findings could be transformative, researchers say. For example, one conclusion from the new research is that the choice of instructional materials — textbooks, curriculum guides, homework, quizzes — can affect achievement as profoundly as teachers themselves; a poor choice of materials is at least as bad as a terrible teacher, and a good choice can help offset a bad teacher’s deficiencies.

So far, the office — the Institute of Education Sciences — has supported 175 randomized studies. Some have already concluded; among the findings are that one popular math textbook was demonstrably superior to three competitors, and that a highly touted computer-aided math-instruction program had no effect on how much students learned.

Other studies are under way. Cognitive psychology researchers, for instance, are assessing an experimental math curriculum in Tampa, Fla.

If you go to any of the above links, you’ll see that the metric of success is consistently defined as a standardized test score. That’s the only gauge of improvement. So any “progress” that’s made is by definition measured by such a test.

In other words, if we optimize to this system, we will optimize for textbooks which raise standardized test scores. If it doesn’t improve kids’ test scores, it might as well not be in the book. In fact it will probably “waste time” with respect to raising scores, so there will effectively be a penalty for, say, fun puzzles, or understanding why things are true, or learning to write.

Now, if scores are all we cared about, this could and should be considered progress. Certainly Gina Kolata, the NYTimes journalist, didn’t mention that we might not care only about this – she recorded it as unfettered good, as she was expected to by the Education Department, no doubt. But, as a data scientist who gets paid to think about the feedback loops and side effects of choices like “metrics of success,” I have a problem with it.

I don’t have a thing against randomized tests – using them is a good idea, and will maybe even quiet some noise around all the different curriculums, online and in person. I do think, though, that we need to have more ways of evaluating an educational experience than a test score.

After all, if I take a pill once a day to prevent a disease, then what I care about is whether I get the disease, not which pill I took or what color it was. Medicine is a very outcome- focused discipline in a way that education is not. Of course, there are exceptions, say when the treatment has strong and negative side-effects, and the overall effect is net negative. Kind of like when the teacher raises his or her kids’ scores but also causes them to lose interest in learning.

If we go the way of the randomized trial, why not give the students some self-assessments and review capabilities of their text and their teacher (which is not to say teacher evaluations give clean data, because we know from experience they don’t)? Why not ask the students how they liked the book and how much they care about learning? Why not track the students’ attitudes, self-assessment, and goals for a subject for a few years, since we know longer-term effects are sometimes more important that immediate test score changes?

In other words, I’m calling for collecting more and better data beyond one-dimensional test scores. If you think about it, teenagers get treated better by their cell phone companies or Netflix than by their schools.

I know what you’re thinking – that students are all lazy and would all complain about anyone or anything that gave them extra work. My experience is that kids actually aren’t like this, know the difference between rote work and real learning, and love the learning part.

Another complaint I hear coming – long-term studies take too long and are too expensive. But ultimately these things do matter in the long term, and as we’ve seen in medicine, skimping on experiments often leads to bigger and more expensive problems. Plus, we’re not going to improve education overnight.

And by the way, if and/or when we do this, we need to implement strict privacy policies for the students’ answers – you don’t want a 7-year-old’s attitude about math held against him when he of she applies to college.


Short your kids, go long your neighbor: betting on people is coming soon

Yet another aspect of Gary Shteyngart’s dystopian fiction novel Super Sad True Love Story is coming true for reals this week.

Besides anticipating Occupy Wall Street, as well as Bloomberg’s sweep of Zuccotti Park (although getting it wrong on how utterly successful such sweeping would be), Shteyngart proposed the idea of instant, real-time and broadcast credit ratings.

Anyone walking around the streets of New York, as they’d pass a certain type of telephone pole – the kind that identifies you via your cell phone and communicates with data warehousing services and databases – would have their credit rating flashed onto a screen. If you went to a party, depending on how you impressed the other party go-ers, your score could plummet or rise in real time, and everyone would be able to keep track and treat you accordingly.

I mean, there were other things about the novel too, but as a data person these details certainly stuck with me since they are both extremely gross and utterly plausible.

And why do I say they are coming true now? I base my claim on two news stories I’ve been sent by my various blog readers recently.

[Aside: if you read my blog and find an awesome article that you want to send me, by all means do! My email address is available on my "About" page.]

First, coming via Suresh and Marcos, we learn that data broker Acxiom is letting people see their warehoused data. A few caveats, bien sûr:

  1. You get to see your own profile, here, starting in 2 days, but only your own.
  2. And actually, you only get to see some of your data. So they won’t tell you if you’re a suspected gambling addict, for example. It’s a curated view, and they want your help curating it more. You know, for your own good.
  3. And they’re doing it so that people have clarity on their business.
  4. Haha! Just kidding. They’re doing it because they’re trying to avoid regulations and they feel like this gesture of transparency might make people less suspicious of them.
  5. And they’re counting on people’s laziness. They’re allowing people to opt out, but of course the people who should opt out would likely never even know about that possibility.
  6. Just keep in mind that, as an individual, you won’t know what they really think they know about you, but as a corporation you can buy complete information about anyone who hasn’t opted out.

In any case those credit scores that Shteyngart talks about are already happening. The only issue is who gets flashed those numbers and when. Instead of the answers being “anyone walking down the street” and “when you walk by a pole” it’s “any corporation on the interweb” and “whenever you browse”.

After all, why would they give something away for free? Where’s the profit in showing the credit scores of anyone to everyone? Hmmmm….

That brings me to my second news story of the morning coming to me via Constantine, namely this TechCrunch story which explains how a startup called Fantex is planning to allow individuals to invest in celebrity athletes’ stocks. Yes, you too can own a tiny little piece of someone famous, for a price. From the article:

People can then buy shares of that player’s brand, like a stock, in the Fantex-consumer market. Presumably, if San Francisco 49ers tight end Vernon Davis has a monster year and looks like he’s going to get a bigger endorsement deal or a larger contract in a few years, his stock would rise and a fan could sell their Davis stock and cash out with a real, monetary profit. People would own tracking or targeted stocks in Fantex that would depend on the specific brand that they choose; these stocks would then rise and fall based on their own performance, not on the overall performance of Fantex.

Let’s put these two things together. I think it’s not too much of a stretch to acknowledge a reason for everyone to know everyone else’s credit score! Namely, we can can bet on each other’s futures!

I can’t think of any set-up more exhilarating to the community of hedge fund assholes than a huge, new open market – containing profit potentials for every single citizen of earth – where you get to make money when someone goes to the wrong college, or when someone enters into an unfortunate marriage and needs a divorce, or when someone gets predictably sick. An orgy in the exact center of tech and finance.

Are you with me peoples?!

I don’t know what your Labor Day plans are, but I’m getting ready my list of people to short in this spanking new market.

Summers’ Lending Club makes money by bypassing the Equal Credit Opportunity Act

Don’t know about you, but for some reason I have a sinking feeling when it comes to the idea of Larry Summers. Word on the CNBC street is that he’s about to be named new Fed Chair, and I am living in a state of cognitive dissonance.

To distract myself, I’m going to try better to explain what I started to explain here, when I talked about the online peer-to-peer lending company Lending Club. Summers sits on the board of Lending Club, and from my perspective it’s a logical continuation of his career of deregulation and/or bypassing of vital regulation to enrich himself.

In this case, it’s a vehicle for bypassing the FTC’s Equal Credit Opportunities Rights. It’s not perfect, but it “prohibits credit discrimination on the basis of race, color, religion, national origin, sex, marital status, age, or because you get public assistance.” It forces credit scores to be relatively behavior based, like you see here. Let me contrast that to Lending Club.

Lending Club also uses mathematical models to score people who want to borrow money. These act as credit scores. But in this case, they use data like browsing history or anything they can grab about you on the web or from data warehousing companies like Acxiom (which I’ve written about here). From this Bloomberg article on Lending Club:

“What we’ve done is radically transform the way consumer lending operates,” Laplanche says in his speech. He says that LendingClub keeps staffing low by using algorithms to screen prospective borrowers for risk — rejecting 90 percent of them – - and has no physical branches like banks. “The savings can be passed on to more borrowers in terms of lower interest rates and investors in terms of attractive returns.”

I’d focus on the benefit for investors. Big money is now involved in this stuff. Turns out that bypassing credit score regulation is great for business, so of course.

For example, such models might look at your circle of friends on Facebook to see if you “run with the right crowd” before loaning you money. You can now blame your friends if you don’t get that loan! From this CNN article on the subject (hat tip David):

“It turns out humans are really good at knowing who is trustworthy and reliable in their community,” said Jeff Stewart, a co-founder and CEO of Lenddo. “What’s new is that we’re now able to measure through massive computing power.”

Moving along from taking out loans to getting jobs, there’s this description of how recruiters work online to perform digital background checks for potential employees. It’s a different set of laws this time that is subject to arbitrage but it’s exactly the same idea:

Non-discrimination laws prohibit employers from asking job applicants certain questions. They’re not supposed to ask about things like age, race, gender, disability, marital, and veteran status. (As you can imagine, sometimes a picture alone can reveal this privileged information. These safeguards against discrimination urge employers to simply not use this knowledge to make hiring decisions.) In addition to protecting people from systemic prejudice, these employment laws intend to shield us from capricious bias and whimsy. While casually snooping, however, a recruiter can’t unsee your Facebook rant on immigration amnesty, the same for your baby bump on Instagram. From profile pics and bios, blog posts and tweets, simple HR reconnaissance can glean tons of off-limits information.

Along with forcing recruiters to gaze with eyes wide shut, straddling legal liability and ignorance, invisible employment screens deny American workers the robust protections afforded by the FTC and the Fair Credit Reporting Act. The FCRA ensures that prospective employees are notified before their backgrounds and credit scores are verified. Employees are free to decline the checks, but employers are also free to deny further consideration unless a screening is allowed to take place. What’s important here is that employees must first give consent.

When a report reveals unsavory information about a candidate, and the employer chooses to take what’s called “adverse action,”—like deny a job offer—the employer is required to share the content of the background reports with the candidate. The applicant then has the right to explain or dispute inaccurate and incomplete aspects of the background check. Consent, disclosure, and recourse constitute a straightforward approach to employment screening.

Contrast this citizen-empowering logic with the casual Google search or to the informal, invisible social-media exam. As applicants, we don’t know if employers are looking, we’re not privy to what they see, and we have no way to appeal.

As legal scholars Daniel Solove and Chris Hoofnagle discuss, the amateur Google screens that are now a regular feature of work-life go largely unnoticed. Applicants are simply not called back. And they’ll never know the real reason.

I think the silent failure is the scariest part for me – people who don’t get jobs won’t know why.

Similarly, people denied loans from Lending Club by a secret algorithm don’t know why either. Maybe it’s because I made friends with the wrong person on Facebook? Maybe I should just go ahead and stop being friends with anyone who might put my electronic credit score at risk?

Of course this rant is predicated on the assumption that we think anti-discrimination laws are a good thing. In an ideal world, of course, we wouldn’t need them. But that’s not where we live.

Categories: data science, finance, modeling rips off poor people; let’s take control of our online personas

You’ve probably heard rumors about this here and there, but the Wall Street Journal convincingly reported yesterday that websites charge certain people more for the exact thing.

Specifically, poor people were more likely to pay more for, say, a stapler from than richer people. Home Depot and Lowes does the same for their online customers, and Discover and Capitol One make different credit card offers to people depending on where they live (“hey, do you live in a PayDay lender neighborhood? We got the card for you!”).

They got pretty quantitative for, and did tests to determine the cost. From the article:

It is possible that Staples’ online-pricing formula uses other factors that the Journal didn’t identify. The Journal tested to see whether price was tied to different characteristics including population, local income, proximity to a Staples store, race and other demographic factors. Statistically speaking, by far the strongest correlation involved the distance to a rival’s store from the center of a ZIP Code. That single factor appeared to explain upward of 90% of the pricing pattern.

If anyone’s ever seen a census map, race is highly segregated by ZIP code, and my guess is we’d see pretty high correlations along racial lines as well, although they didn’t mention it in the article except to say that explicit race-related pricing is illegal. The article does mentions that things get more expensive in rural areas, which are also poorer, so there’s that acknowledged correlation.

But wait, how much of a price difference are we talking about? From the article:

Prices varied for about a third of the more than 1,000 randomly selected products tested. The discounted and higher prices differed by about 8% on average.

In other words, a really non-trivial amount.

The messed up thing about this, or at least one of them, is that we could actually have way more control over our online personas than we think. It’s invisible to us, typically, so we don’t think about our cookies and our displayed IP addresses. But we could totally manipulate these signatures to our advantage if we set our minds to it.

Hackers, get thyselves to work making this technology easily available.

For that matter, given the 8% difference, there’s money on the line so some straight-up capitalist somewhere should be meeting that need. I for one would be willing to give someone a sliver of the amount saved every time they manipulated my online persona to save me money. You save me $1.00, I’ll give you a dime.

Here’s my favorite part of this plan: it would be easy for Staples to keep track of how much people are manipulating their ZIP codes. So if infers a certain ZIP code for me to display a certain price, but then in check-out I ask them to send the package to a different ZIP code, Staples will know after-the-fact that I fooled them. But whatever, last time I looked it didn’t cost more or less to send mail to California or wherever than to Manhattan [Update: they do charge differently for packages, though. That's the only differential in cost I think is reasonable to pay].

I’d love to see them make a case for how this isn’t fair to them.

Categories: data science, modeling, rant

Get every new post delivered to your Inbox.

Join 887 other followers