Archive

Author Archive

Birdwatching

Today I’m posting my friend Becky’s poem about wasting time on a hobby you love. I spent the day at a yarn festival admiring hand-spun, dyed, and knit sweaters that cost about 5 times as much money and take infinitely more time than the machine-made ones you can buy in any clothing store. I believe there’s no economic theory that could possibly explain why thousands of other people were just as excited as I was to be there.

——

What pastime could be less economically productive?

Owl swivels her tufted attention,
fixing her severity
on a silent stirring
in the fraying field
a mute meditation
just beyond
my upturned incomprehension.

What activity could be of less social value?

Hawk tears into hare
with his Swiss Army face,
unblinkingly slices
the limp sinew of snow,
a leap of fur
a moment before.

What hobby could be of less measurable benefit?

Egret unfolds her fistful of light,
lifts her improbable wings,
no metaphor for an angel
but the real deal –
You can see for yourself
how Spirit fancies feathers.

What avocation could be a more fervent waste of time?

Only Prayer –
Hummingbird’s eggs are a pair of pearl earrings
nestled in a pocket of lichen and silk –
and Love,
Loon’s lone lament.

Categories: Uncategorized

What’s a fair price?

My readers may be interested to know that I am currently composing an acceptance letter to be on the board of Goldman Sachs.

Not that they’ve offered it, but Felix Salmon was kind enough to suggest me for the job yesterday and I feel like I should get a head start. Please give me suggestions for key phrases: how I’d do things differently or not, why I would be a breath of fresh air, how it’s been long enough having the hens guard the fox house, etc., that kind of thing.

But for now, I’d like to bring up the quasi-modeling, quasi-ethical topic (my favorite!) of setting a price. My friend Eugene sent me this nice piece he read yesterday on recommendation engines, describing the algorithms used by Netflix and Amazon among others, which is strangely similar to my post yesterday coming out of Matt Gattis’s experience working at hunch. It was written by computer science professors Joseph A. Konstan and John Riedl from the University of Minnesota, and it does a nice job of describing the field, although there isn’t much in the way of explicit math or formulas.

One thing they brought up in their article is the idea of a business charging certain people more money for items they expect them to buy based on their purchase history. So, if Fresh Direct did this to me, I’d have to pay more every week for Amish Country Farms 1% milk, since we go through about 8 cartons a week around here. They could basically charge me anything they want for that stuff; my 4-year-old is made of 95% milk and 5% nutella.

Except, no, they couldn’t do that. I’d just shop somewhere else for it, somewhere nobody knew my history. It would be a pain to go back to the grocery store but I’d do it anyway, because I’d feel cheated by that system. I’d feel unfairly singled out. For me it would be an ethical decision, and I’d vocally and publicly try to shame the company that did that to me.

It reminds me of arguments I used to have at D.E. Shaw with some of my friends and co-workers who were self-described libertarians. I don’t even remember how they’d start, but they’d end with my libertarian friend positing that rich people should be charged more for the same item. I have some sympathy with some libertarian viewpoints but this isn’t one of them.

First of all, I’d argue, people don’t walk around with a sign on their face saying how much money they have in the bank (of course this is becoming less and less true as information is collected online). Second of all, even if Warren Buffett himself walked into a hamburger joint, there’s no way they’re going to charge him $1000 for a burger. Not because he can’t afford it, and not even because he could go somewhere else for a cheaper burger (although he could), but because it’s not considered fair.

In some sense rich people do pay more for things, of course. They spend more money on clothes and food than poor people. But on the other hand, they’re also getting different clothes and different food. And even if they spend more money on the exact same item, a pound of butter, say, they’re paying rent for the nicer environment where they shop in their pricey neighborhood.

Now that I write this, I realize I don’t completely believe it. There are exceptions when it is considered totally fair to charge rich people more. My example is that I visited Accra, Ghana, and the taxi drivers consistently quoted me prices that were 2 or 3 times what they charged native Ghanaians, and neither of us thought it was unfair for them to do so. When my friend Jake was with me he’d argue them down to a number which was probably more like 1.5 times the usual price, out of principle, but when I was alone I didn’t do this, possibly because I was only there for 2 weeks. In this case, being a white person in Accra, I basically did have a sign on my face saying I had more money and could afford to spend more.

One last thought on price gouging: it happens all the time, I’m not saying it doesn’t, I am just trying to say it’s an ethical issue. If we are feeling price gouged, we are upset about it. If we see someone else get price gouged, we typically want to expose it as unfair, even if it’s happening to someone who can afford it.

Categories: data science, musing

Columbia Data Science course, week 7: Hunch.com, recommendation engines, SVD, alternating least squares, convexity, filter bubbles

Last night in Rachel Schutt’s Columbia Data Science course we had Matt Gattis come and talk to us about recommendation engines. Matt graduated from MIT in CS, worked at SiteAdvisor, and co-founded hunch as its CTO, which recently got acquired by eBay. Here’s what Matt had to say about his company:

Hunch

Hunch is a website that gives you recommendations of any kind. When we started out it worked like this: we’d ask you a bunch of questions (people seem to love answering questions), and then you could ask the engine questions like, what cell phone should I buy? or, where should I go on a trip? and it would give you advice. We use machine learning to learn and to give you better and better advice.

Later we expanded into more of an API where we crawled the web for data rather than asking people direct questions. We can also be used by third parties to personalize content for a given site, a nice business proposition which led eBay to acquire us. My role there was doing the R&D for the underlying recommendation engine.

Matt has been building code since he was a kid, so he considers software engineering to be his strong suit. Hunch is a cross-domain experience so he doesn’t consider himself a domain expert in any focused way, except for recommendation systems themselves.

The best quote Matt gave us yesterday was this: “Forming a data team is kind of like planning a heist.” He meant that you need people with all sorts of skills, and that one person probably can’t do everything by herself. Think Ocean’s Eleven but sexier.

A real-world recommendation engine

You have users, and you have items to recommend. Each user and each item has a node to represent it. Generally users like certain items. We represent this as a bipartite graph. The edges are “preferences”. They could have weights: they could be positive, negative, or on a continuous scale (or discontinuous but many-valued like a star system). The implications of this choice can be heavy but we won’t get too into them today.

So you have all this training data in the form of preferences. Now you wanna predict other preferences. You can also have metadata on users (i.e. know they are male or female, etc.) or on items (a product for women).

For example, imagine users came to your website. You may know each user’s gender, age, whether they’re liberal or conservative, and their preferences for up to 3 items.

We represent a given user as a vector of features, sometimes including only their metadata, sometimes including only their preferences (which would lead to a sparse vector, since you don’t know all their opinions), and sometimes including both, depending on what you’re doing with the vector.

Nearest Neighbor Algorithm?

Let’s review the nearest neighbor algorithm (discussed here): if we want to predict whether user A likes something, we just look at the user B closest to user A who has an opinion, and we assume A’s opinion is the same as B’s.

To implement this you need a definition of a metric so you can measure distance. One example: Jaccard distance, i.e. the number of preferences they have in common divided by the total number of things they’ve expressed preferences about. Other examples: cosine similarity or Euclidean distance. Note: you might get a different answer depending on which metric you choose.
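To make this concrete, here’s a minimal Python sketch of nearest-neighbor prediction with Jaccard distance; the user dictionaries and helper names here are made up for illustration, not anything Hunch-specific:

```python
def jaccard_distance(prefs_a, prefs_b):
    """1 - |intersection| / |union| of the sets of items each user likes."""
    liked_a = {item for item, liked in prefs_a.items() if liked}
    liked_b = {item for item, liked in prefs_b.items() if liked}
    union = liked_a | liked_b
    return 1.0 if not union else 1 - len(liked_a & liked_b) / len(union)

def predict(target, others, item):
    """Copy the opinion of the nearest user who has an opinion on `item`."""
    rated = [user for user in others if item in user]
    nearest = min(rated, key=lambda user: jaccard_distance(target, user))
    return nearest[item]
```

Swapping in cosine similarity or Euclidean distance for `jaccard_distance` could, as noted above, change the answer.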

What are some problems using nearest neighbors?

  • There are too many dimensions, so the closest neighbors are too far away from each other. There are tons of features, moreover, that are highly correlated with each other. For example, you might imagine that as you get older you become more conservative. But then counting both age and politics would mean you’re double counting a single feature in some sense. This would lead to bad performance, because you’re using redundant information. So we need to build in an understanding of the correlations and project onto a smaller-dimensional space.
  • Some features are more informative than others. Weighting features may therefore be helpful: maybe your age has nothing to do with your preference for item 1. Again you’d probably use something like covariances to choose your weights.
  • If your vector (or matrix, if you put together the vectors) is too sparse, or you have lots of missing data, then most things are unknown and the Jaccard distance means nothing because there’s no overlap.
  • There’s measurement (reporting) error: people may lie.
  • There’s a calculation cost – computational complexity.
  • Euclidean distance also has a scaling problem: age differences will swamp preference differences if preferences are reported as 0 (for don’t like) or 1 (for like). Essentially this means that raw Euclidean distance doesn’t make sense until you rescale the features.
  • Also, old and young people might think one thing but middle-aged people another. We seem to be assuming a linear relationship, but it may not exist.
  • User preferences may also change over time, which falls outside the model. For example, at eBay, a user might be buying a printer, which makes them want ink, but only for a short time.
  • Overfitting is also a problem. The one guy is closest, but it could be noise. How do you adjust for that? One idea is to use k-nearest neighbor, with say k=5.
  • It’s also expensive to update the model as you add more data.

Matt says the biggest issues are overfitting and the “too many dimensions” problem. He’ll explain how he deals with them.

Going beyond nearest neighbor: machine learning/classification

In its most basic form, we can model each item separately using a linear regression. Denote by f_{i, j} user i‘s preference for item j (or attribute, if item j is a metadata item). Say we want to model a given user’s preference for a given item using only the 3 metadata properties of that user, which we assume are numeric. Then we can look for the best choice of \beta_k as follows:

p_i = \beta_1 f_{i, 1} + \beta_2 f_{i, 2} + \beta_3 f_{i, 3} + \epsilon

Remember, this model only works for one item. We need to build as many models as we have items. We know how to solve the above per item by linear algebra. Indeed one of the drawbacks is that we’re not using other items’ information at all to create the model for a given item.

This solves the “weighting of the features” problem we discussed above, but overfitting is still a problem, and it comes in the form of having huge coefficients when we don’t have enough data (i.e. not enough opinions on given items). We have a Bayesian prior that these weights shouldn’t be too far out of whack, and we can implement this by adding a penalty term for really large coefficients.

This ends up being equivalent to adding a prior matrix to the covariance matrix. How do you choose the penalty size \lambda? Experimentally: use some data as your training set, evaluate how well you did using particular values of \lambda, and adjust.

Important technical note: You can’t use this penalty term for large coefficients and assume the “weighting of the features” problem is still solved, because in fact you’re implicitly penalizing some coefficients more than others. The easiest way to get around this is to normalize your variables before entering them into the model, similar to how we did it in this earlier class.
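Here’s what the per-item fit looks like as a numpy sketch, with the normalization step first; this is a plain closed-form ridge solution under function names I’ve made up, not Hunch’s actual code:

```python
import numpy as np

def normalize(X):
    """Give every feature mean 0 and standard deviation 1, so the
    penalty below treats all coefficients comparably."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def ridge_fit(X, y, lam=1.0):
    """Least squares with a penalty lam * ||beta||^2 on large
    coefficients; equivalent to adding lam * I to X^T X."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

Each item gets its own call to `ridge_fit` — one model per item, exactly as described above.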

The dimensionality problem

We still need to deal with this very large problem. We typically use both Principal Component Analysis (PCA) and Singular Value Decomposition (SVD).

To understand how this works, let’s talk about how we reduce dimensions and create “latent features” internally every day. For example, we invent concepts like “coolness” – but I can’t directly measure how cool someone is, like I could weigh them or something. Different people exhibit patterns of behavior which we internally map to our one dimension of “coolness”.

We let the machines do the work of figuring out what the important “latent features” are. We expect them to explain the variance in the answers to the various questions. The goal is to build a model which has a representation in a lower dimensional subspace which gathers “taste information” to generate recommendations.

SVD

Given a matrix X, decompose it into three matrices:

X = U S V^{\tau}.

Here X is m \times n, U is m \times k, S is k\times k, and V is k\times n, where m is the number of users, n is the number of items, and k is the rank of X.

The rows of U correspond to users, whereas V has a row for each item. The square matrix S is diagonal, and each entry is a singular value, which measures the importance of each dimension. If we put them in decreasing order, which we do, then the dimensions are ordered by importance from highest to lowest. Every matrix has such a decomposition.

Important properties:

  • The columns of U are orthogonal to each other, and so are the columns of V.
  • So we can order the columns by singular values.
  • We can take lower rank approximation of X by throwing away part of S. In this way we might have k much smaller than either n or m, and this is what we mean by compression.
  • There is an important interpretation to the values in the matrices U and V. For example, we can see, by using SVD, that “the most important latent feature” is often something like seeing if you’re a man or a woman.

[Question: did you use domain expertise to choose questions at Hunch? Answer: we tried to make them as fun as possible. Then, of course, we saw things needing to be asked which would be extremely informative, so we added those. In fact we found that we could ask merely 20 questions and then predict the rest of them with 80% accuracy. They were questions that you might imagine and some that surprised us, like competitive people v. uncompetitive people, introverted v. extroverted, thinking v. perceiving, etc., not unlike MBTI.]

More details on our encoding:

  • Most of the time the questions are binary (yes/no).
  • We create a separate variable for every variable.
  • Comparison questions may be better at granular understanding, and get to revealed preferences, but we don’t use them.

Note if we have a rank k matrix X and we use the SVD above, we can take the approximation with only k-3 rows of the middle matrix S, so in other words we take the top k-3 most important latent features, and the corresponding rows of U and V, and we get back something very close to X.

Note that the problem of sparsity or missing data is not fixed by the above SVD approach, nor is the computational complexity problem; SVD is expensive.
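To see the decomposition and the truncation in action, here’s a short numpy sketch on a made-up 4-users-by-3-items preference matrix (the numbers are invented for illustration):

```python
import numpy as np

# Toy preference matrix: 4 users x 3 items.
X = np.array([[5., 4., 1.],
              [4., 5., 1.],
              [1., 1., 5.],
              [1., 2., 4.]])

# Singular values come back already sorted in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Low-rank approximation: keep only the top 2 singular values.
k = 2
X_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

The Frobenius norm of the error X - X_approx is exactly the discarded singular value, which is why throwing away the smallest entries of S loses so little.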

PCA

Now we’re still looking for U and V as above, but we don’t have S anymore, so X = U \cdot V^{\tau}, and we have a more general optimization problem. Specifically, we want to minimize:

\underset{U, V}{\operatorname{argmin}} \sum_{(i, j) \in P} (p_{i, j} - u_i \cdot v_j)^2.

Let me explain. We denote by u_i the row of U corresponding to user i, and similarly we denote by v_j the row of V corresponding to item j. Items can include meta-data information (so the age vectors of all the users will be a row in V).

Then the dot product u_i \cdot v_j is taken to mean the predicted value of user i‘s preference for item j, and we compare that to the actual preference p_{i, j}. The set P is just the set of all actual known preferences or meta-data attribution values.

So, we want to find the best choices of U and V which overall minimize the squared differences between prediction and observation on everything we actually know, and the idea is that if it’s really good on stuff we know, it will also be good on stuff we’re guessing.

Now we have a parameter, namely the number D, which is how many latent features we want to use. The matrix U will have a row for each user and a column for each latent feature, and the matrix V will have a row for each item and a column for each latent feature.

How do we choose D? It’s typically about 100, since that’s more than 20 (we already know we have a pretty good grasp on someone once we’ve asked them 20 questions) and it’s as much as we care to add before it becomes computationally too much work. Note the resulting latent features will be uncorrelated – an empirical observation about the solutions this optimization finds, not something we prove.
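The objective itself is easy to write down in code. Here’s a minimal sketch, where `mask` (my notation, not from the lecture) marks which entries of the preference matrix are actually observed, i.e. the set P:

```python
import numpy as np

def factorization_loss(P, mask, U, V):
    """Sum over observed (i, j) of (p_{i,j} - u_i . v_j)^2."""
    pred = U @ V.T  # predicted preference for every (user, item) pair
    return float(np.sum((P - pred)[mask] ** 2))
```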

But how do we actually find U and V?

Alternating Least Squares

This optimization doesn’t have a nice closed-form formula like ordinary least squares with one set of coefficients. Instead, we use an iterative algorithm like gradient descent. As long as your problem is convex you’ll converge fine (i.e. you won’t find yourself at a local but not global minimum), and we will force our problem to be convex using regularization.

Algorithm:

  • Pick a random V
  • Optimize U while V is fixed
  • Optimize V while U is fixed
  • Keep doing the above two steps until you’re not changing very much at all.

Example: Fix V and update U.

The way we do this optimization is user by user. So for user i, we want to find

\operatorname{argmin}_{u_i} \sum_{j \in P_i} (p_{i, j} - u_i \cdot v_j)^2,

where v_j is fixed. In other words, we just care about this user for now.

But wait a minute, this is the same as linear least squares, and has a closed form solution! In other words, set:

u_i = (V_{*, i}^{\tau} V_{*, i})^{-1} V_{*, i}^{\tau} P_{i, *},

where V_{*, i} is the subset of V for which we have preferences coming from user i. Taking the inverse is easy since it’s D \times D, which is small. And there aren’t that many preferences per user, so solving this many times is really not that hard. Overall we’ve got a do-able update for U.

When you fix U and optimize V, it’s analogous; you only ever have to consider the users that rated that item, which may be a pretty large set, but you’re still only ever inverting a D \times D matrix.

Another cool thing: since each user’s update depends only on that user’s own preferences, we can parallelize this update of U or V. We can run it on as many different machines as we want to make it fast.

There are lots of different versions of this. Sometimes you need to extend it to make it work in your particular case.

Note: as stated this is not actually convex, but similar to the regularization we did for least squares, we can add a penalty for large entries in U and V, depending on some parameter \lambda, which again translates to the same thing, i.e. adding a diagonal matrix to the covariance matrix, when you solve least squares. This makes the problem convex if \lambda is big enough.
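Putting the steps together, here’s a minimal numpy sketch of the whole loop, regularization included – my own toy version of the algorithm described above, not Hunch’s production code (`mask` marks which preferences are observed):

```python
import numpy as np

def als(P, mask, D=2, lam=0.1, iters=50, seed=0):
    """Alternating least squares on a users x items matrix P, where
    mask[i, j] = True means preference p_{i,j} is observed. Adding
    lam * I to each normal equation is the regularization that keeps
    every per-step least squares problem well-posed."""
    rng = np.random.default_rng(seed)
    m, n = P.shape
    U = rng.standard_normal((m, D))
    V = rng.standard_normal((n, D))
    for _ in range(iters):
        # Fix V; solve one small D x D system per user.
        for i in range(m):
            Vj = V[mask[i]]
            U[i] = np.linalg.solve(Vj.T @ Vj + lam * np.eye(D),
                                   Vj.T @ P[i, mask[i]])
        # Fix U; same thing per item.
        for j in range(n):
            Ui = U[mask[:, j]]
            V[j] = np.linalg.solve(Ui.T @ Ui + lam * np.eye(D),
                                   Ui.T @ P[mask[:, j], j])
    return U, V
```

Notice the inner loop over users (and the one over items) has fully independent iterations, which is exactly what makes the update parallelizable.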

You can add new users, new data, keep optimizing U and V. You can choose which users you think need more updating. Or if they have enough ratings, you can decide not to update the rest of them.

As with any machine learning model, you should perform cross-validation for this model – leave out a bit and see how you did. This is a way of testing overfitting problems.

Thought experiment – filter bubbles

What are the implications of using error minimization to predict preferences? How does presentation of recommendations affect the feedback collected?

For example, can we end up in local maxima with rich-get-richer effects? In other words, does showing certain items at the beginning “give them an unfair advantage” over other things? And so do certain things just get popular or not based on luck?

How do we correct for this?

Growing old: better than the alternatives

I enjoyed this article in the Wall Street Journal recently entitled “The ‘New’ Old Age is No Way to Live”. In it the author rejects the idea of following his Baby Boomer brethren in continuing to exercise daily, being hugely productive, and just generally being in denial of their age. From the article:

We are advised that an extended life span has given us an unprecedented opportunity. And if we surrender to old age, we are fools or, worse, cowards. Around me I see many of my contemporaries remaining in their prime-of-life vocations, often working harder than ever before, even if they have already achieved a great deal. Some are writing the novels stewing in their heads but never attempted, or enrolling in classes in conversational French, or taking up jogging, or even signing up for cosmetic surgery and youth-enhancing hormone treatments.

The rest of the article is devoted to describing his trip to the Greek island of Hydra to research how to grow old. There are lots of philosophical references as well as counter-intuitive defenses of being set in your ways and how striving is empty-headed. Whatever, it’s his column. Personally, I like changing my mind about things and striving.

The point I want to make is this: there are far too few people coming out and saying that getting old can be a good thing. It can be a fun thing. Our culture is so afraid of getting old, it’s almost as bad as being fat on the list of no-nos.

I don’t get it. Why? Why can’t we be proud of growing old? It allows us, at the very least, to hold forth more, which is my favorite thing to do.

Since I turned 40 I’ve stopped dying my hair, which is going white, and I’ve taken to calling the people around me “honey”, “sugar”, or “baby”. I feel like I can get away with that now, which is fun. Honestly I’m looking forward to the stuff I can say and do when I’m 70, because I’m planning to be one of those outrageous old women full of spice and opinions. I’m going to make big turkey dinners with all the fixings even when it’s just October and invite my neighbors and friends to come over if my kids are too busy with their lives and family. But if they decide to visit, and if they have kids themselves, I’m going to spoil my grandkids rotten, because I’m totally allowed to do that when I’m the grandma.

Instead of lying about my age down, I’ve taken to lying about my age up. I feel like I am getting away with something if I can pass for 50. After all, why would I still want to be 30? I was close to miserable back then, and I’ve learned a ton in the past 10 years.

Update: my friend Cosma just sent me this poem by Jenny Joseph. For the record I’m wearing purple today:

Warning

When I am an old woman I shall wear purple
With a red hat which doesn’t go, and doesn’t suit me.
And I shall spend my pension on brandy and summer gloves
And satin sandals, and say we’ve no money for butter.
I shall sit down on the pavement when I’m tired
And gobble up samples in shops and press alarm bells
And run my stick along the public railings
And make up for the sobriety of my youth.
I shall go out in my slippers in the rain
And pick flowers in other people’s gardens
And learn to spit.

You can wear terrible shirts and grow more fat
And eat three pounds of sausages at a go
Or only bread and pickle for a week
And hoard pens and pencils and beermats and things in boxes.

But now we must have clothes that keep us dry
And pay our rent and not swear in the street
And set a good example for the children.
We must have friends to dinner and read the papers.

But maybe I ought to practice a little now?
So people who know me are not too shocked and surprised
When suddenly I am old, and start to wear purple.

Categories: musing

Causal versus causal

Today I want to talk about the different ways the word “causal” is thrown around by statisticians versus finance quants, because it’s both confusing and really interesting.

But before I do, can I just take a moment to be amazed at how pervasive Gangnam Style has become? When I first posted the video on August 1st, I had no idea how much of a sensation it was destined to become. Here’s the Google trend graph for “Gangnam” versus “Obama”:

It really hit home last night as I was reading a serious Bloomberg article on the economic implications of Gangnam Style whilst the song was playing in the background at the playoff game between the Cardinals and the Giants.

Back to our regularly scheduled program. I’m first going to talk about how finance quants think about “causal models” and second how statisticians do. This has come out of conversations with Suresh Naidu and Rachel Schutt.

Causal modeling in finance

When I learned how to model causally, it basically meant something very simple: I never used “future information” to make a prediction about the future. I strictly used information from the past that was available to me at the time. In other words, as I trained a model, I always had in mind a timestamp marking the “present time,” and all the data I used had availability timestamps before that present time, so that I could use the information to make a statement about what I thought would happen after that present time. If I did this carefully, then my model was termed “causal.” It respected time, and in particular it didn’t have great-looking predictive power just because it was peeking ahead.
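In code, the discipline amounts to a single filter on timestamps. Here’s a tiny sketch – the record format and function name are my inventions for illustration:

```python
from datetime import datetime

def causal_training_data(records, present_time):
    """Keep only records whose availability timestamp is strictly
    before the present time, so training can never peek ahead.
    Each record is assumed to be (available_at, features, outcome)."""
    return [(features, outcome)
            for available_at, features, outcome in records
            if available_at < present_time]
```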

Causal modeling in statistics

By contrast, when statisticians talk about a causal model, they mean something very different. Namely, they mean whether the model shows that something caused something else to happen. For example, if we saw that plants in a certain soil all died but those in a different soil lived, then they’d want to know if the soil caused the death of the plants. Usually, to answer this kind of question, in an ideal situation, statisticians set up randomized experiments where the only difference between the treatments is the one condition under study (i.e. the type of soil, but not how often you water it or the type of sunlight it gets). When they can’t set it up perfectly (say because it involves people dying instead of plants) they do the best they can.

The differences and commonalities

On the one hand, both concepts refer to and depend on time. There’s no way X caused Y to happen if X happened after Y. But whereas in finance we only care about time, in statistics there’s more to it.

So for example, if there’s a third underlying thing that causes both X and Y, but X happens before Y, then the finance people are psyched because they have a way of betting on the direction of Y: just keep an eye on X! But the statisticians are not amused, since there’s no way to prove causality in this case unless you get your hands on that third thing.

Although I understand wanting to know the underlying reasons things happen, I have a personal preference for the finance definition, which is just plain easier to understand and test, and usually the best we can do with real world data. In my experience the most interesting questions relate to things that you can’t set up experiments for. So, for example, it’s hard to know whether blue-collar presidents would impose less elitist policies than millionaires, because we only have millionaires.

Moreover, it usually is interesting to know what you can predict for the future knowing what you know now, even if there’s no proof of causation, and not only because you can maybe make money betting on something (but that’s part of it).

Categories: data science, statistics

Gaming the Google mail filter and the modeling feedback loop

The gmail filter

If you’re like me, a large part of your life takes place in your gmail account. My gmail address is the only one I use, and I am extremely vigilant about reading emails – probably too much so.

On the flip side, I spend quite a bit of energy removing crap from my gmail. When I have the time and opportunity, and if I receive an unwanted email, I will set a gmail filter instead of just deleting it. This is usually in response to mailing lists I get on by buying something online, so it’s not quite spam. For obvious spam I just click on the spam icon and it disappears.

You see, when I check out online to pay for my stuff, I am not incredibly careful about making sure I’m not signing up to be on a mailing list. I just figure I’ll filter anything I don’t want later.

Which brings me to the point. I’ve noticed lately that, more and more often, the filter doesn’t work, at least on the automatic setting. If you open an email you don’t want, you can click on “filter messages like these” and it will automatically fill out a filter form with the “from” email address that is listed.

More and more often, these quasi-spammers are getting around this somehow. I don’t know how they do it, because it’s not as simple as changing their “from” address every time, which would work pretty well. Somehow not even the email I’ve chosen to filter is actually deleted through this process.

I end up having to copy and paste the name of the product into a filter, but this isn’t a perfect solution either, since then if my friend emails me about this product I will automatically delete that genuine email.

The modeling feedback loop

This is a perfect example of the feedback loop of modeling; first there was a model which automatically filled out a filter form, then people in charge of sending out mailing lists for products realized they were being successfully filtered and figured out how to game the model. Now the model doesn’t work anymore.

The worst part of the gaming strategy is how well it works. If everybody uses the filter model, and you are the only person who games it, then you have a tremendous advantage over other marketers. So the incentive for gaming is very high.

Note this feedback loop doesn’t always exist: the stars and planets didn’t move differently just because Newton figured out his laws, and people don’t start writing with poorer penmanship just because we have machine learning algorithms that read envelopes at the post office.

But this feedback loop does seem to be associated with especially destructive models (think rating agency models for MBS’s and CDO’s). In particular, any model which is “gamed” to someone’s advantage probably exhibits something like this. It will work until the modelers strike back with a better model, in an escalation not unlike an arms race (note to ratings agency modelers: unless you choose to not make the model better even when people are clearly gaming it).

As far as I know, there’s nothing we can do about this feedback loop except to be keenly aware of it and be ready for war.

Categories: data science, finance

The investigative mathematical journalist

I’ve been out of academic math a few years now, but I still really enjoy talking to mathematicians. They are generally nice and nerdy and utterly earnest about their field and the questions in their field and why they’re interesting.

In fact, I enjoy these conversations more now than when I was an academic mathematician myself. Partly this is because, as a professional, I was embarrassed to ask people stupid questions, because I thought I should already know the answers. I wouldn’t have asked someone to explain motives and the Hodge Conjecture in simple language because honestly, I’m pretty sure I’d gone to about 4 lectures as a graduate student explaining all of this and if I could just remember the answer I would feel smarter.

But nowadays, having left and nearly forgotten that kind of exquisite anxiety that comes out of trying to appear superhuman, I have no problem at all asking someone to clarify something. And if they give me an answer that refers to yet more words I don’t know, I’ll ask them to either rephrase or explain those words.

In other words, I’m becoming something of an investigative mathematical journalist. And I really enjoy it. I think I could do this for a living, or at least as a large project.

What I have in mind is the following: I go around the country (I’ll start here in New York) and interview people about their field. I ask them to explain the “big questions” and what awesomeness would come from actually having answers. Why is their field interesting? How does it connect to other fields? What is the end goal? How would achieving it inform other fields?

Then I’d write them up like columns. So one column might be “Hodge Theory” and it would explain the main problem, the partial results, and the connections to other theories and fields, or another column might be “motives” and it would explain the underlying reason for inventing yet another technology and how it makes things easier to think about.

Obviously I could write a whole book on a given subject, but I wouldn’t. My audience would be, primarily, other mathematicians, but I’d write it to be readable by people who have degrees in other quantitative fields like physics or statistics.

Even more obviously, every time I chose a field and a representative to interview, and every time I chose to stop there, I’d be making in some sense a political choice, which would inevitably piss someone off, because I realize people are very sensitive to this. This is presuming anybody ever reads my surveys in the first place, which is a big if.

Even so, I think it would be a contribution to mathematics. I actually think a pretty serious problem with academic math is that people from disparate fields really have no idea what each other is doing. I’m generalizing, of course, and colloquiums do tend to address this, when they are well done and available. But for the most part, let’s face it, people are essentially only rewarded for writing stuff that is incredibly “insider” for their field, stuff that only a few other experts can understand. Surveys of topics, when they’re written, are generally not considered “research” but more like a public service.

And by the way, this is really different from the history of mathematics, in that I have never really cared about who did what, and I still don’t (although I’m not against naming a few people in my columns). The real goal here is to end up with a more or less accurate map of the active research areas in mathematics and how they are related. So an enormous network, with various directed edges of different types. In fact, writing this down makes me want to build my map as I go, an annotated visualization to pair with the columns.

Also, it obviously doesn’t have to be me doing all this: I’m happy to make it an open-source project with a few guidelines and version control. But I do want to kick it off because I think it’s a neat idea.

A few questions about my mathematical journalism plan:

  1. Who’s going to pay me to do this?
  2. Where should I publish it?

If the answers are “nobody” and “on mathbabe.org” then I’m afraid it won’t happen, at least by me. Any ideas?

One more thing. This idea could just as well be done for another field altogether, like physics or biology. Are there models of people doing something like that in those fields that you know about? Or is there someone actually already doing this in math?

Categories: math, math education, musing

Philanthropy can do better than Rajat Gupta

Last night I was watching a YouTube video in between playoff games (both of which disappointed). Conan O’Brien was accepting an honorary patronage at the philosophical society of the University of Dublin. His speech was hilarious, and there was an extended, intimate Q&A session afterwards.

One thing he mentioned was an amended version of the (to me, very moving) words he had closed his last NBC Tonight Show with, “If you work really hard and you’re kind then amazing things will happen.” Namely, he wanted to add this sentence: “If you work really hard and you’re a huge asshole, then you can make tons of money on Wall Street.”

These wise words came back to me this morning when I read about Bill Gates’s and Kofi Annan’s letters to Judge Jed Rakoff regarding Goldman Sachs insider trader Rajat Gupta. The letters were intended to reduce his sentence, considering how unbelievably philanthropic Gupta had been as he was stealing all this money.

I’m not doubting that the dude did some good things with his ill-gotten gains. After all, I don’t have a letter from Bill Gates about how I helped remove malaria from the world.

But wait a minute, maybe that’s because I didn’t steal money from taxpayers like he did to put myself into the position of spending millions of dollars doing good things! Because I’m thinking that if I had the money that Gupta had, I might well have spent good money doing good things.

And therein lies the problem with this whole picture. He did some good (I’ll assume), but then again he had the advantage of being someone in our society who could do good, i.e. he was loaded. Wouldn’t it make more sense for us to set up a system wherein people could do good who are good, who have good ideas and great plans?

Unfortunately, those people exist, but they’re generally poor, or stuck in normal jobs making ends meet for their family, and they don’t get their plans heard. In particular they aren’t huge assholes stealing money and then trying to get out of trouble by hiring hugely expensive lawyers and leaning on their philanthropy buds.

The current system of grant-writing doesn’t at all support the people with good ideas: it doesn’t teach these “social inventors” how to build a charitable idea into a business plan. So what happens is that the good ideas drift away without the important detailed knowledge of how to surround them with resources. And generally the people with really innovative ideas aren’t by nature detail-oriented people who can figure out how to start a business; they’re kind of nerdy.

I’m serious: I think the government should sponsor something like a “philanthropy institute” for entrepreneurial, non-revenue-generating ideas that are good for society. People could come to open meetings and discuss their ideas for improving stuff, and there’d be full-time staff and fellows, with the goal of seizing upon good ideas and developing them into business plans.

Categories: finance, rant

Personal privacy and institutional transparency

Ever noticed that it’s vulnerable individuals who are transparent about their data (i.e. public and open on Facebook and the like) whereas it’s for-profit institutions like pharmaceutical companies, charged with being stewards of public health, that get to be as down-low as they want?

Do you agree with me that that’s ass-backwards?

Well, there were two potentially good things mentioned in yesterday’s New York Times to ameliorate this mismatch. I say “potentially” because they are both very clearly susceptible to political spin-doctoring.

The first is that Big Pharma company GlaxoSmithKline has claimed they will be more transparent about their internal medical trials, even the ones that fail. This would be a huge step in the right direction if it really happens.

The second is that Senator John D. Rockefeller IV of West Virginia is spearheading an investigation into data brokers and the industry of information warehousing. A good step towards better legislation, but this could just be a call for lobbyists’ money, so I’ll believe it when I see it.

What with the whole-genome DNA sequencing methods getting relatively cheap, modern privacy legislation is desperately needed so people won’t be afraid to use life-saving techniques for fear of losing their health insurance. Obama’s Presidential Commission for the Study of Bioethical Issues agrees with me.

Columbia Data Science course, week 6: Kaggle, crowd-sourcing, decision trees, random forests, social networks, and experimental design

Yesterday we had two guest lecturers, who took up approximately half the time each. First we welcomed William Cukierski from Kaggle, a data science competition platform.

Will went to Cornell for a B.A. in physics and to Rutgers to get his Ph.D. in biomedical engineering. He focused on cancer research, studying pathology images. While working on writing his dissertation, he got more and more involved in Kaggle competitions, finishing very near the top in multiple competitions, and now works for Kaggle. Here’s what Will had to say.

Crowd-sourcing in Kaggle

What is a data scientist? Some say it’s someone who is better at stats than an engineer and better at engineering than a statistician. But one could argue it’s equally someone who is worse at stats than a statistician and worse at engineering than an engineer. Being a data scientist is when you learn less and less about more and more until you know nothing about everything.

Kaggle uses prizes to induce the public to do stuff. This is not a new idea.

There are two kinds of crowdsourcing models. First, we have the distributive crowdsourcing model, like Wikipedia, which asks for relatively easy but large amounts of contributions. Then there’s the singular model: focused, difficult problems of the sort that Kaggle, DARPA, InnoCentive, and other companies specialize in.

Some of the problems with crowdsourcing projects include:

  • They don’t always evaluate your submission objectively. Instead they have a subjective measure, so they might just decide your design is bad or something. This leads to a high barrier to entry, since people don’t trust the evaluation criterion.
  • One doesn’t get recognition until after one has won or ranked highly. This leads to high sunk costs for the participants.
  • Bad competitions often conflate participants with mechanical turks: in other words, they assume you’re stupid. This doesn’t lead anywhere good.
  • The competitions sometimes don’t chunk the work into bite-sized pieces, which means it’s too big to do or too small to be interesting.

A good competition has a do-able, interesting question, with an evaluation metric which is transparent and entirely objective. The problem is given, the data set is given, and the metric of success is given. Moreover, prizes are established up front.

The participants are encouraged to submit their models up to twice a day during the competitions, which last on the order of a few days. This encourages a “leapfrogging” between competitors, where one ekes out a 5% advantage, giving others incentive to work harder. It also establishes a band of accuracy around a problem, which you generally don’t have: in other words, given no other information, you don’t know whether your 75% accurate model is the best possible.

The test set y’s are hidden, but the x’s are given, so you just use your model to get your predicted y’s for the test set and upload them into the Kaggle machine to see your evaluation score. This way you don’t share your actual code with Kaggle unless you win the prize (and Kaggle doesn’t have to worry about which version of python you’re running).
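To make the mechanics concrete, here’s a toy sketch (my own invention, not Kaggle’s actual code) of how a leaderboard scorer might work: competitors only ever upload predicted y’s, and the hidden test labels stay on the server.

```python
# Hypothetical sketch of a Kaggle-style leaderboard scorer: participants
# upload predicted y's, and the server scores them against hidden labels.
def score_submission(predicted, hidden_truth):
    """Return accuracy of uploaded predictions against the hidden test labels."""
    if len(predicted) != len(hidden_truth):
        raise ValueError("submission has the wrong number of rows")
    correct = sum(p == t for p, t in zip(predicted, hidden_truth))
    return correct / len(hidden_truth)

# The hidden y's live only on the server; competitors never see them.
hidden_y = [1, 0, 1, 1, 0]
submission = [1, 0, 1, 0, 0]   # a competitor's predictions for the test x's
print(score_submission(submission, hidden_y))  # 4 of 5 correct -> 0.8
```

Since only predictions cross the wire, the competitor’s code (and python version) never has to touch Kaggle’s servers.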

Note this leapfrogging effect is good and bad. It encourages people to squeeze out better performing models but it also tends to make models much more complicated as they get better. One reason you don’t want competitions lasting too long is that, after a while, the only way to inch up performance is to make things ridiculously complicated. For example, the original Netflix Prize lasted two years and the final winning model was too complicated for them to actually put into production.

The hole that Kaggle is filling is the following: there’s a mismatch between those who need analysis and those with skills. Even though companies desperately need analysis, they tend to hoard data; this is the biggest obstacle for success.

They have had good results so far. Allstate, which has a good actuarial team of its own, challenged Kaggle’s competitors to improve its actuarial model, which, given attributes of drivers, approximates the probability of a car crash. The 202 competitors improved Allstate’s internal model by 271%.

There were other examples, including one where the prize was $1,000 and it benefited the company $100,000.

A student then asked, is that fair? There are actually two questions embedded in that one. First, is it fair to the data scientists working at the companies that engage with Kaggle? Some of them might lose their job, for example. Second, is it fair to get people to basically work for free and ultimately benefit a for-profit company? Does it result in data scientists losing their fair market price?

Of course Kaggle charges a fee for hosting competitions, but is it enough?

[Mathbabe interjects her view: personally, I suspect this is a model which seems like an arbitrage opportunity for companies but only while the data scientists of the world haven’t realized their value and have extra time on their hands. As soon as they price their skills better they’ll stop working for free, unless it’s for a cause they actually believe in.]

Facebook is hiring data scientists, and they hosted a Kaggle competition where the prize was an interview. There were 422 competitors.

[Mathbabe can’t help but insert her view: it’s a bit too convenient for Facebook to have interviewees for data science positions in such a posture of gratitude for the mere interview. This distracts them from asking hard questions about what the data policies are and the underlying ethics of the company.]

There’s a final project for the class, namely an essay grading contest. The students will need to build it, train it, and test it, just like any other Kaggle competition. Group work is encouraged.

Thought Experiment: What are the ethical implications of a robo-grader?

Some of the students’ thoughts:

  • It depends on how much you care about your grade.
  • Actual human graders aren’t fair anyway.
  • Is this the wrong question? The goal of a test is not to write a good essay but rather to do well in a standardized test. The real profit center for standardized testing is, after all, to sell books to tell you how to take the tests. It’s a screening, you follow the instructions, and you get a grade depending on how well you follow instructions.
  • There are really two questions: 1) Is it wise to move from the human to the machine version of the same thing, for any given thing? and 2) Are machines making things more structured, and is this inhibiting creativity? One thing is for sure: robo-grading prevents me from being compared to someone more creative.
  • People want things to be standardized. It gives us a consistency that we like. People don’t want artistic cars, for example.
  • Will: We used machine learning to research cancer, where the stakes are much higher. In fact this whole field of data science has to be thinking about these ethical considerations sooner or later, and I think it’s sooner. In the case of doctors, you could give the same doctor the same slide two months apart and get different diagnoses. We aren’t consistent ourselves, but we think we are. Let’s keep that in mind when we talk about the “fairness” of using machine learning algorithms in tricky situations.

Introduction to Feature Selection 

“Feature extraction and selection are the most important but underrated step of machine learning. Better features are better than better algorithms.” – Will

“We don’t have better algorithms, we just have more data” –Peter Norvig

Will claims that Norvig really wanted to say we have better features.

We are getting bigger and bigger data sets, but that’s not always helpful. The danger is if the number of features is larger than the number of samples or if we have a sparsity problem.

We improve our feature selection process to try to improve performance of predictions. A criticism of feature selection is that it’s no better than data dredging. If we just take whatever answer we get that correlates with our target, that’s not good.

There’s a well-known bias-variance tradeoff: a model is “high bias” if it is too simple (the features aren’t encoding enough information). In this case lots more data doesn’t improve your model. On the other hand, if your model is too complicated, then “high variance” leads to overfitting. In this case you want to reduce the number of features you are using.
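Here’s a quick numerical illustration of the tradeoff (my own toy example, with invented data): a degree-15 polynomial always fits the training sample at least as well as a line, which is exactly why low training error by itself can’t rule out overfitting.

```python
import numpy as np

# Toy illustration of the bias-variance tradeoff (invented setup, not from
# the lecture): fit the same noisy sample with a too-simple model and a
# very flexible one, then compare training error.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)

def train_error(degree):
    """Mean squared training error of a least-squares polynomial fit."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    return float(np.mean(residuals ** 2))

simple, flexible = train_error(1), train_error(15)
# The flexible model fits the training data at least as well by
# construction -- so training error alone can't diagnose overfitting.
print(simple > flexible)
```

The line has high bias (it can’t encode the wiggle at all), while the degree-15 fit chases the noise; you’d need held-out data to see the latter’s high variance.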

We will take some material from a famous paper by Isabelle Guyon and André Elisseeff, published in 2003, entitled “An Introduction to Variable and Feature Selection”.

There are three categories of feature selection methods: filters, wrappers, and embedded methods. Filters order variables (i.e. possible features) with respect to some ranking (e.g. correlation with target). This is sometimes good on a first pass over the space of features. Filters take account of the predictive power of individual features, and estimate mutual information or what have you. However, the problem with filters is that you get correlated features. In other words, the filter doesn’t care about redundancy.

This isn’t always bad and it isn’t always good. On the one hand, two redundant features can be more powerful when they are both used, and on the other hand something that appears useless alone could actually help when combined with another possibly useless-looking feature.
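Here’s a toy sketch of a filter pass (invented data and feature names), ranking features by absolute correlation with the target; note how it happily ranks two redundant features at the top:

```python
import numpy as np

# Sketch of a filter method: rank candidate features by |correlation|
# with the target. It scores features one at a time, so two
# near-duplicate features both rank highly -- the redundancy problem.
rng = np.random.default_rng(1)
n = 500
target = rng.normal(size=n)
features = {
    "signal":      target + rng.normal(0, 0.3, n),   # informative
    "signal_copy": target + rng.normal(0, 0.3, n),   # redundant with above
    "noise":       rng.normal(size=n),               # uninformative
}

ranking = sorted(
    features,
    key=lambda name: abs(np.corrcoef(features[name], target)[0, 1]),
    reverse=True,
)
print(ranking)  # both correlated features outrank the noise column
```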

Wrapper feature selection tries to find subsets of features that will do the trick. However, as anyone who has studied the binomial coefficients knows, the number of possible subsets of n things grows exponentially (there are {n \choose k} of size k alone). So there’s a nasty opportunity for overfitting by doing this. Most subset methods are capturing some flavor of minimum-redundancy-maximum-relevance. So, for example, we could have a greedy algorithm which starts with the best feature, takes a few more highly ranked, removes the worst, and so on. This is a hybrid approach with a filter method.
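A minimal sketch of the greedy wrapper flavor, on invented data (the column indices and noise level are my own choices): at each step, add whichever remaining feature most improves a least-squares fit.

```python
import numpy as np

# Greedy (forward) wrapper selection sketch: repeatedly add the feature
# that most reduces the residual sum of squares of a linear fit.
# Data and informative columns here are invented for illustration.
rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(0, 0.1, n)  # cols 0, 3 matter

def sse(cols):
    """Residual sum of squares of a least-squares fit on the given columns."""
    A = np.column_stack([X[:, list(cols)], np.ones(n)])
    residuals = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return float(residuals @ residuals)

selected, remaining = [], set(range(5))
for _ in range(2):  # greedily pick the two best features
    best = min(remaining, key=lambda j: sse(selected + [j]))
    selected.append(best)
    remaining.remove(best)

print(sorted(selected))  # the two truly informative columns: [0, 3]
```

With 5 features this is cheap, but the subset space blows up fast, which is why greedy shortcuts (and the finite-difference tricks mentioned below) matter.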

We don’t have to retrain models at each step of such an approach, because there are fancy ways to see how objective function changes as we change the subset of features we are trying out. These are called “finite differences” and rely essentially on Taylor Series expansions of the objective function.

One last word: if you have a domain expert on hand, don’t go down the machine-learning rabbit hole of feature selection until you’ve tapped into your expert completely!

Decision Trees

We’ve all used decision trees. They’re easy to understand and easy to use. How do we construct one? Choosing a feature to pick at each step is like playing 20 questions: we ask whatever the most informative question is first. For the sake of this discussion, assume we break compound questions into multiple binary questions, so the answer is “+” or “-”.

To quantify “what is the most informative feature”, we first define entropy for a random variable X to mean:

H(X) = - p(x_+) log_2(p(x_+)) - p(x_-) log_2(p(x_-)).

Note when p(x_*) = 0, we define the term to vanish. This is consistent with the fact that

\lim_{t\to 0} t log(t) = 0.

In particular, if either option has probability zero, the entropy is 0. For binary variables the entropy is maximized when p(x_+) = 0.5, which we can easily compute using the fact that p(x_+) = 1 - p(x_-) and a bit of calculus.

Using this definition, we define the information gain of a given feature as the entropy we lose if we know the value of that feature.

To make a decision tree, then, we want to maximize information gain, and make a split on that. We keep going until all the points at the end are in the same class or we end up with no features left. In this case we take the majority vote. Optionally we prune the tree to avoid overfitting.
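Here’s a small sketch of the split criterion in code, using a made-up four-point data set:

```python
from math import log2

# Sketch of the split criterion described above: entropy of the labels,
# and information gain for a binary feature. The data is a toy set.
def entropy(labels):
    n = len(labels)
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        result -= p * log2(p)        # the p = 0 terms simply never appear
    return result

def information_gain(feature, labels):
    """Entropy(labels) minus the weighted entropy after splitting on feature."""
    n = len(labels)
    after = 0.0
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        after += len(subset) / n * entropy(subset)
    return entropy(labels) - after

labels  = ["+", "+", "-", "-"]
perfect = ["a", "a", "b", "b"]   # splits the classes exactly
useless = ["a", "b", "a", "b"]   # tells us nothing

print(information_gain(perfect, labels))  # 1.0: all the entropy is removed
print(information_gain(useless, labels))  # 0.0
```

The tree-builder just calls `information_gain` for every remaining feature, splits on the argmax, and recurses.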

This is an example of an embedded feature selection algorithm. We don’t need to use a filter here because the “information gain” method is doing our feature selection for us.

How do you handle continuous variables?

In the case of continuous variables, you need to choose a threshold so that the variable can be thought of as binary. So you could partition a user’s spend into “less than $5” and “at least $5” and you’d be back in the binary-variable case. It takes some extra work to decide on the information gain because it depends on the threshold as well as the feature.
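A toy sketch of that threshold search, with invented spend numbers: try each candidate cutpoint and keep the one with the highest information gain.

```python
from math import log2

# Thresholding a continuous feature: turn "spend" into the binary
# question "at least t?" for each candidate t, and keep the threshold
# with the best information gain. Toy data, invented numbers.
def entropy(labels):
    n = len(labels)
    return -sum(
        (labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels)
    )

def gain_at(threshold, spend, labels):
    hi = [lab for s, lab in zip(spend, labels) if s >= threshold]
    lo = [lab for s, lab in zip(spend, labels) if s < threshold]
    after = (len(hi) * entropy(hi) + len(lo) * entropy(lo)) / len(labels)
    return entropy(labels) - after

spend  = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
labels = ["-", "-", "-", "+", "+", "+"]
candidates = sorted(set(spend))[1:]            # cut between observed values
best = max(candidates, key=lambda t: gain_at(t, spend, labels))
print(best)  # 8.0 separates the classes perfectly
```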

Random Forests

Random forests are cool. They incorporate “bagging” (bootstrap aggregating) and trees to make stuff better. Plus they’re easy to use: you just need to specify the number of trees you want in your forest, as well as the number of features to randomly select at each node.

A bootstrap sample is a sample taken with replacement, which we usually take to be 80% of the actual data, but of course this can be adjusted depending on how much data we have.

To construct a random forest, we construct a bunch of decision trees (we decide how many). For each tree, we take a bootstrap sample of our data, and for each node we randomly select (a second source of randomness) a few features, say 5 out of the 100 total features. Then we use our entropy-information-gain engine to decide which among those features we will split our tree on, and we keep doing this, choosing a different set of five features for each node of our tree.

Note we could decide beforehand how deep the tree should get, but we typically don’t prune the trees, since a great feature of random forests is that averaging over many trees washes out each tree’s idiosyncratic noise.
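Here’s a minimal sketch of the two randomization steps (my own toy version: to keep it short, each “tree” is just a one-split stump, and the data is invented):

```python
import random
from collections import Counter

# Minimal random-forest sketch: each "tree" (a one-split stump here, for
# brevity) trains on a bootstrap sample of the rows and may only look at
# a random subset of the features. Real forests grow full trees.
random.seed(0)

def train_stump(rows, feature_ids):
    """Map each value of the best allowed feature to its majority label."""
    def rule_for(j):
        by_value = {}
        for x, y in rows:
            by_value.setdefault(x[j], Counter())[y] += 1
        return {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    def score(j):
        rule = rule_for(j)
        return sum(rule[x[j]] == y for x, y in rows)
    best = max(feature_ids, key=score)
    return best, rule_for(best)

def train_forest(data, n_trees=25, n_features=2):
    forest = []
    n_total = len(data[0][0])
    for _ in range(n_trees):
        bootstrap = [random.choice(data) for _ in data]        # rows, with replacement
        features = random.sample(range(n_total), n_features)   # random feature subset
        forest.append(train_stump(bootstrap, features))
    return forest

def predict(forest, x):
    votes = Counter(rule.get(x[j]) for j, rule in forest)      # majority vote
    return votes.most_common(1)[0][0]

# Toy data: the first two features equal the label, the third is useless.
data = [((y, y, 0), y) for y in [0, 1, 0, 1, 0, 1, 0, 1]]
forest = train_forest(data)
accuracy = sum(predict(forest, x) == y for x, y in data) / len(data)
print(accuracy)
```

Even though each stump sees only a bootstrap sample and a random 2-of-3 feature subset, the majority vote across the forest is very reliable.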

Here’s what a decision tree looks like for surviving on the Titanic.

David Huffaker, Google: Hybrid Approach to Social Research

David is one of Rachel’s collaborators at Google. They had a successful collaboration: starting with complementary skill sets, an explosion of goodness ensued when they were put together to work on Google+ with a bunch of other people, especially engineers. David brings a social-scientist perspective to the analysis of social networks. He’s strong in quantitative methods for understanding and analyzing online social behavior. He got a Ph.D. in Media, Technology, and Society from Northwestern.

Google does a good job of putting people together. They blur the lines between research and development. The researchers are embedded on product teams. The work is iterative, and the engineers on the team strive to have near-production code from day 1 of a project. They leverage cloud infrastructure to deploy experiments to their mass user base and to rapidly deploy a prototype at scale.

Note that, considering the scale of Google’s user base, redesigning as they scale up is not a viable option. They instead do experiments with smaller groups of users.

David suggested that we, as data scientists, consider how to move to an experimental design, so as to make a causal claim between variables rather than merely describe a relationship. In other words, to move from the descriptive to the causal.

As an example, he talked about the genesis of the “circle of friends” feature of Google+. They know people want to selectively share; they’ll send pictures to their family, whereas they’d probably be more likely to send inside jokes to their friends. They came up with the idea of circles, but it wasn’t clear if people would use them. How do they answer the question: will they use circles to organize their social network? It’s important to know what motivates them when they decide to share.

They took a mixed-method approach, using multiple methods to triangulate on findings and insights. Given a random sample of 100,000 users, they set out to determine the popular names and categories of names given to circles. They identified 168 active users who filled out surveys, and they had longer interviews with 12 of them.

They found that the majority were engaging in selective sharing, that most people used circles, that the circle names were most often work-related or school-related, and that they had elements of a strong link (“epic bros”) or a weak link (“acquaintances from PTA”).

They asked the survey participants why they share content. The answers primarily came in three categories: first, the desire to share about oneself – personal experiences, opinions, etc. Second, discourse: people wanna participate in a conversation. Third, evangelism: people wanna spread information.

Next they asked participants why they choose their audiences. Again, three categories: first, privacy – many people were public or private by default. Second, relevance – they wanted to share only with those who may be interested, and they don’t wanna pollute other people’s data stream. Third, distribution – some people just want to maximize their potential audience.

The takeaway from this study was this: people do enjoy selectively sharing content, depending on context, and the audience. So we have to think about designing features for the product around content, context, and audience.

Network Analysis

We can use large data and look at connections between actors like a graph. For Google+, the users are the nodes and the edges (directed) are “in the same circle”.
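For concreteness, here’s a tiny sketch of that representation with made-up users, storing the directed edges in a plain adjacency dict:

```python
# Sketch of the network representation described above, using a plain
# adjacency dict: a directed edge u -> v means u put v in one of u's
# circles. Users and circle names here are made up.
circles = {
    "alice": {"family": ["bob", "carol"], "epic bros": ["dave"]},
    "bob":   {"acquaintances from PTA": ["alice"]},
}

# Flatten circle membership into a directed adjacency map.
edges = {
    user: sorted({v for members in user_circles.values() for v in members})
    for user, user_circles in circles.items()
}

# In-degree: how many users placed you in one of their circles.
in_degree = {}
for user, targets in edges.items():
    for v in targets:
        in_degree[v] = in_degree.get(v, 0) + 1

print(edges["alice"])      # ['bob', 'carol', 'dave']
print(in_degree["alice"])  # only bob points back at alice
```

Note the edges are directed: alice circling bob doesn’t mean bob circles alice, which is exactly what makes degree statistics on such a graph interesting.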

There are plenty of other examples of networks as well.

After you define and draw a network, you can hopefully learn stuff by looking at it or analyzing it.

Social at Google

As you may have noticed, “social” is a layer across all of Google. Search now incorporates this layer: if you search for something you might see that your friend “+1”-ed it. This is called a social annotation. It turns out that people care more about an annotation when it comes from someone with domain expertise rather than someone they’re very close to. So you might care more about the opinion of a wine expert at work than the opinion of your mom when it comes to purchasing wine.

Note that sounds obvious but if you started the other way around, asking who you’d trust, you might start with your mom. In other words, “close ties,” even if you can determine those, are not the best feature to rank annotations. But that begs the question, what is? Typically in a situation like this we use click-through rate, or how long it takes to click.

In general we need to always keep in mind a quantitative metric of success. This defines success for us, so we have to be careful.

Privacy

Human-facing technology has thorny issues of privacy, which makes stuff hard. We took a survey of how uneasy people felt about content. We asked: how does it affect your engagement? What is the nature of your privacy concerns?

Turns out there’s a strong correlation between privacy concern and low engagement, which isn’t surprising. It’s also related to how well you understand what information is being shared, and the question of, when you post something, where does it go and how much control do you have over it. When you are confronted with a huge pile of complicated settings, you tend to start feeling passive.

Again, we took a survey and found broad categories of concern as follows:

identity theft

  • financial loss

digital world

  • access to personal data
  • really private stuff I searched on
  • unwanted spam
  • provocative photo (oh shit my boss saw that)
  • unwanted solicitation
  • unwanted ad targeting

physical world

  • offline threats
  • harm to my family
  • stalkers
  • employment risks
  • hassle

What is the best way to decrease concern and increase understanding and control?

Possibilities:

  • Write and post a manifesto of your data policy (tried that, nobody likes to read manifestos)
  • Educate users on our policies a la the Netflix feature “because you liked this, we think you might like this”
  • Get rid of all stored data after a year

Rephrase: how do we design settings to make things easier for people? How do we make it transparent?

  • Make a picture or graph of where data is going.
  • Give people a privacy switchboard.
  • Give people access to quick settings.
  • Categorize the settings you show them by things you don’t have a choice about vs. things you do.
  • Make reasonable default settings so people don’t have to worry about it.

David left us with these words of wisdom: as you move forward and have access to big data, you really should complement them with qualitative approaches. Use mixed methods to come to a better understanding of what’s going on. Qualitative surveys can really help.

Live and let live, motherfuckers

It’s high time I tell you guys about my favorite blog, Effing Dykes.

Why now? Well, I’ve wanted to write a post about body image like Effing Dykes’ The Body Electric ever since I started this blog (ever since I turned 10, actually). But I couldn’t get it right. Not in a million years could I have written something so beautiful or so right. So I’m really grateful she has written it. Please read and enjoy.

That url again: http://effingdykes.blogspot.com/2012/09/the-body-electric.html

Note: I’ve stolen the catchy phrase “live and let live, motherfuckers” (can you say “phrase of the week”?) from that post, as well as a picture, which reminds me of my wordpress profile pic as well as all of my friends from high school.

p.s. I had a wardrobe crisis last week when I realized I only owned one ugly plaid flannel shirt, but luckily Old Navy has an ugly plaid flannel shirt sale going on.

Categories: Uncategorized

Neil Barofsky on the Fed Stress Test

I recently started using Twitter, and I only follow 8 people, one of them being Neil Barofsky, author of Bailout, which I blogged about here (Twitter is a useful way to stalk your crushes, as Twitter users already know).

I’m glad I do follow him, because yesterday he tweeted (twatted?) about an article he wrote on LinkedIn which I never would have found otherwise. It’s called “Banks Rule While the Rest of us Drool,” and he gave credit to his daughter for that title, which is crushworthy in itself. It’s essentially a bloggy rant against a Wall Street Journal article which I had just read and was thinking of writing a ranty blog post against myself.

But now I don’t have to write it! I’ll just tell you about the WSJ article, quote from it a bit (and complain about it a bit since I can’t help myself), and then quote Barofsky’s awesome disgust with it. Here goes.

The Fed conducts stress tests on the banks; it is keeping the models secret so the banks can’t game them, and requiring more frequent and better-quality data. All good. From the WSJ article:

The Fed asks the big banks to submit reams of data and then publishes each bank’s potential loan losses and how much capital each institution would need to absorb them. Banks also submit plans of how they would deploy capital, including any plans to raise dividends or buy back stock.

After several institutions failed last year’s tests and had their capital plans denied, executives at many of the big banks began challenging the Fed to explain why there were such large gaps between their numbers and the Fed’s, according to people close to the banks.

Fed officials say they have worked hard to help bankers better understand the math, convening the Boston symposium and multiple conference calls. But they don’t want to hand over their models to the banks, in part because they don’t want the banks to game the numbers, officials say.

Just to be clear, when they say “large gaps”, I’m pretty sure the banks mean they are perfectly safe when the Fed thinks they’re undercapitalized. I am pretty sure the banks are arguing they should be giving huger bonuses to their C*O’s whereas the Fed thinks not. I’m just guessing on the direction, but I could be wrong, it’s not spelled out in the article.

Here’s another thing that drives me up the wall, from the WSJ article:

Banks say the Fed has asked them for too much, too fast. Some bankers, for instance, have complained the Fed now is demanding they include the physical address of properties backing loans on their books, not just the billing address for the borrower. Not all banks, it turns out, have that information readily available.

Daryl Bible, the chief risk officer at BB&T Corp., a Winston-Salem, N.C.-based bank with $179 billion in assets, challenged the Fed’s need for all of the data it is collecting, saying in a Sept. 4 comment letter to the regulator that “the reporting requirements appear to have advanced beyond the linkage of risk to capital and an organization’s viability,” burdening banks without adding any value to the stress test exercise. BB&T declined further comment.

Oh really? Can you, Daryl Bible, think of no reason at all we might want to know the addresses of the houses you gave bad mortgages to? Really? Do you really think you deserve to be a Chief Risk Officer of a firm with $179 billion in assets if your imagination of how to calculate risk is so puny?

But the most infuriating part of the article is at the end, and I’m going to let Neil take it away:

… at the end of the article the reporters reveal that the Fed recently “backed off” a requirement that the CFOs of the banks actually confirm that the numbers they are providing are accurate. The reason?  The banks argued, and the Fed apparently agreed, that providing data about what’s going on in the banks is simply too “confusing for any CFO to be able to be sure his bank had gotten it right.” In other words, rather than demand personal accountability, the Fed seems to be content with relying on unverified and potentially inaccurate data.   If this does not prove both the inherent unreliability of these tests and that the banks are still so hopelessly complex that their executives do not know what’s going on inside of them (See Whale, London), I’m not sure what would.

Categories: finance

Suresh Naidu: analyzing the language of political partisanship

I was lucky enough to attend Suresh Naidu‘s lecture last night on his recent work analyzing congressional speeches with co-authors Jacob Jensen, Ethan Kaplan, and Laurence Wilse-Samson.

Namely, along with his co-authors, he found popular three-word phrases, measured and ranked their partisanship (by how often a democrat uttered the phrase versus a republican), and measured the extent to which those phrases were being used in the public discussion before or after congress started using them.

Note this means that phrases uttered often by both parties were ignored; only phrases uttered more by one party than the other, like "free market system", were counted. Also, the words were reduced to their stems and small common words were ignored, so the phrase "united states of america" was reduced to "unite.state.america". So if both parties talked about the same issue but insisted on using different phrases ("death tax" for example), the difference would show up. This certainly jibes with my sense of how partisanship is established by politicians, and for the purposes of the paper it can be taken as the definition.

The first data set he used was a digitized version of all of the speeches from the House since the end of the Civil War, which was also the beginning of the “two-party” system as we know it. Third party politicians were ignored. The proxy for “the public discussion” was taken from Google Book N-grams. It consists of books that were published in English in a given year.

Some of the conclusions that I can remember are as follows:

  1. The three-word phrases themselves are a super interesting data set: their prevalence, how they move from one side of the aisle to the other over time, and what they discuss (for example, they don't discuss international issues much, which doesn't mean politicians don't discuss international issues, but that they aren't particularly partisan, or at least the two parties' language around them is similar).
  2. When the issue is economic and highly partisan, it tends to show up “in the public” via Google Books before it shows up in Congress. Which is to say, there’s been a new book written by some economist, presumably, who introduces language into the public discussion that later gets picked up by Congress.
  3. When the issue is non-economic or only somewhat partisan, it tends to show up in Congress before or at the same time as in the public domain. Members of Congress seem to feel comfortable making up their own phrases and repeating them in such circumstances.

So the cult of the economic expert has been around for a while now.

Suresh and his crew also made an overall measurement of the partisanship of a given 2-year session of congress. It was interesting to discuss how this changed over time, and how high partisanship, in terms of language, did not necessarily correlate with stalemated congresses. Indeed, if I remember correctly, a moment of particularly high partisanship, as defined above via language, was during the time the New Deal was passed.

Also, as we discussed (it was a lively audience), language may be a marker of partisan identity without necessarily pointing to underlying ideological differences. For example, the phrase "Martin Luther King" has ranked high as a partisan democratic phrase since the civil rights movement, but then again it's customary (I've been told) for democrats to commemorate MLK's birthday and not for republicans to do so.

Given a politician's speech, this analysis did a good job identifying which party he or she belonged to, but the analysis was not causal in the sense of time: we needed to know the top partisan phrases of that session of Congress to be able to predict the party of a given politician. Indeed, the "top phrases" changed so quickly that most of the predictive power may be lost between sessions.

Not that this is a big deal, since of course we know what party a politician is from, but it would be interesting to use this as a measure of how radical or centered a given politician is or will be.

Even if you aren’t interested in the above results and discussion, the methodology is very cool. Suresh and his co-authors view text as its own data set and analyze it as such.

And after all, the words historical politicians spoke are what we have on record; we can't look into their brains and see what they were thinking. It's of course interesting and important to have historians (domain experts) inform the process as well, e.g. for the "Martin Luther King" phrase above, but barring expert knowledge this is lots better than nothing. One thing it tells us, just in case we didn't study political history, is that we've seen way worse partisanship in the past than we see now, although things have consistently been getting worse since the 1980's.

Here’s a wordcloud from the 2007 session; blue and red are what you think, and bigger means more partisan:

The Neighbors

When I was a senior in high school, my parents moved house to the outskirts of Lexington, Massachusetts, from the center of town where I’d grown up. The neighborhood had a totally different feel, even though it was the same town. In particular it had a kind of prissiness that we didn’t understand or care for.

My best friend Becky ran away from home to live with my family during this year, so most of my memories of that house involve her. Our good friend Karen often visited as well; she drove her beat-up old VW van up the hill and parked it right across from our house on the street. This was totally legal, by the way, and there were plenty of people who parked on the street nearby.

Just to describe the van a bit more: it had about 5 different colors of paint on it, but not in any kind of artistic way. It was just old. And it had a million, possibly more than a million, memories of teenage sex hanging on to it; at some point there had even been a mattress installed in the back. I remember this from earlier in high school, when the van had been owned by Karen's older half-sister and had been parked out behind the high school.

Just in case this is getting too seedy for you, keep in mind we were the freaks and geeks of high school (J-House), we talked about D&D and always used condoms. I don’t even know why I’m saying “we” because I personally never got any action in the legendary van, but I was certainly aware of it.

So anyway, Karen would drive up the hill and park her ugly-but-legendary van there, and every time she’d do it, she’d get a nasty note on her windshield by the time she left, something along these lines:

“Please don’t park your van in front of our window. It is an eyesore. – the Neighbors”

I remember laughing hysterically with Karen and Becky the first time Karen got such a note and bringing it to my mom, who, in her characteristically nerdy way, said something about how it’s perfectly legal to park on the street and to ignore it.

What was awesome about this was how, from then on, Karen would very carefully park her van right in front of the window of the Neighbors (their last name was actually “Neighbors”). Sometimes she’d pull up a bit, then pull back, then get it settled just so. And she always got the note, even though we never actually saw them leave the house. They were like magical prissy elves.

One more story about the Neighbors which is too good to resist. There was a swimming pool in the back of the house, which my mom hated with all her heart because she was in charge of the upkeep and it kept mysteriously turning green. And Becky and I were going through a back-to-nature phase, which meant we were planning to go hiking up in the White Mountains. So one day we were testing our tent out in the front yard, learning how to open and close it, and we happened to be wearing swimming suits, since we’d been swimming.

The Neighbors called my house (this is back when there were things called “telephone books” and you could find someone’s phone number without asking them) and complained to my grandma, who happened to answer the phone, and who also happened to be wearing nothing but a swimming suit, that “there are skimpily clad young ladies cavorting on the front lawn in an obscene manner.”

Now, my grandma had arthritis and couldn’t comfortably walk or stand for very long, but this phone call seemed to give her extra strength. She walked to the front door and stood there, arms crossed, looking defiantly out at the neighborhood for five minutes. After about four minutes I asked her if everything was all right and she said, “perfectly fine.”

Categories: musing

Dissolve the SEC

A few days ago I wrote about the $5 million fine the SEC gave NYSE for giving certain customers prices before other customers. I was baffled that the fine was so low: access like that allows those customers to make outrageous profits, and the resulting fine should be more along the lines of those profits, since kickbacks are probably a percentage of the take. The lawyer fees from this case on both sides are much higher than $5 million, for christ's sakes.

But now I’m even more outraged by the newest, even smaller fine: this time $800,000 for the dark pool trading firm eBX. From the Boston.com article:

Federal securities regulators on Wednesday charged Boston-based eBX LLC, a “dark pool” securities exchange, with failing to protect confidential trading information of customers and for failing to disclose that it let an outside firm use their trading data.

The Securities and Exchange Commission said eBX, which runs the alternative trading system LeveL ATS, agreed to settle the charges and to pay an $800,000 penalty.

You know that if I can actually consider paying the fine myself, then the fine is too small. It’s along the lines of the cost of college for my kids.

Look, I don’t care what it’s for: if the SEC finds you guilty of fraud, it should threaten to put you out of business. Otherwise why should they waste their time doing it?

On the one hand, I'm outraged that these fraudulent practices are being punished so lightly. Indeed, such a light punishment is worse than no punishment at all, because it establishes a precedent. Now exchanges know how much it costs to let certain traders get better access to data than others, and as long as they charge more than that, they'll be sure to turn a profit on it. Similarly, dark pools know how much to charge third-party data vendors for their clients' "confidential trading information." Awesome.

On the other hand, I’m outraged at the SEC for not picking their fights better and for general incompetence. Here they are nabbing firms for real fraud, and they can’t get more than $800,000? At the same time, they’ve decided to go into high frequency trading but what that seems to mean to them is that they’ll finally collect some tick data. I’ve got some news for them: it’s gonna take more than a little bit of data to understand that world.

The SEC needs to stop trying to keep up with the HFT'ers of the world, since it's a lost cause, and spend more time thinking through what policy changes they'd need to actually do their job well. For example, what would they need to get Citigroup and Bank of America to admit wrongdoing when they defraud their customers? Instead of wasting their time trying to keep up with HFT quants, what would they need to institute a transaction tax, or some other policy to slow down trading? What would they need to be able to shut down firms that sell confidential client trading information?

The SEC needs to write a list of policy demands, pronto.

And if the political pressure the SEC receives to not actually get anyone in trouble is too strong for them to do their job well, they should either quit in protest or make a huge stink about being kept from completing their mission.

I get it, I've talked to people inside the SEC who want to do a better job but feel like they aren't being given the power to. But I say, enough with the resigned shrugs already, this stuff is out of control! Continuing in this way gives the public the false impression that there's someone on the case. Well, there's someone on the case, all right, but they either aren't being allowed to do their work or don't see the point of doing it. It's bullshit.

I say dissolve the SEC so that people will no longer have any false hopes of meaningful financial reform.

I’ve been reading Sheila Bair’s book Bull by the Horns, and it’s really good. Maybe by the end of it I’ll have changed my mind and I’ll see a place for the SEC. Maybe I’ll have hope that these things have natural cycles and the SEC will have another day in the power position, like it had in the 1980’s. But right now I’m in the part of the book where the regulators, apart from the FDIC, are taking orders directly from financial lobbyists, and it makes me completely crazy.

Categories: finance, rant

Columbia Data Science course, week 5: GetGlue, time series, financial modeling, advanced regression, and ethics

October 5, 2012 Comments off

I was happy to be teaching Rachel Schutt's Columbia Data Science course this week, where I discussed time series, financial modeling, and ethics. I blogged previous classes here.

The first few minutes of class were devoted to a case study with GetGlue, a New York-based start-up that won Mashable's breakthrough start-up of the year in 2011 and is backed by some of the VCs that also fund big names like Tumblr, etsy, foursquare, etc. GetGlue is part of the social TV space. Its lead scientist, Kyle Teague, came to tell the class a little bit about GetGlue and some of what he's worked on there. He also came to announce that GetGlue was giving the class access to a fairly large data set of user check-ins to tv shows and movies. Kyle's background is in electrical engineering; he placed in the 2011 KDD cup (which we learned about last week from Brian), and he started programming when he was a kid.

GetGlue’s goal is to address the problem of content discovery, primarily within the movie and tv space. The usual model for finding out what’s on TV is the 1950’s TV Guide schedule, and that’s still how we’re supposed to find things to watch. There are now thousands of channels, and it’s increasingly difficult to find out what’s good on. GetGlue wants to change this model by giving people personalized TV recommendations and personalized guides.

There are other ways GetGlue uses data science, but for the most part we focused on how the recommendation system works. Users “check in” to tv shows, which means they tell people they’re watching a show. This creates a time-stamped data point. They can also take other actions, such as liking or commenting on the show. So this is a triple {user, action, object}, where the object is a tv show or movie.

This induces a bi-partite graph, a network containing two types of nodes: users and tv shows. Edges exist between users and tv shows, but not between users and users or between tv shows and tv shows. So Bob and Mad Men are connected because Bob likes Mad Men, and Sarah is connected to Mad Men and Lost because she liked both. But Bob and Sarah aren’t connected, nor are Mad Men and Lost. A lot can be learned from this graph alone.

But GetGlue finds ways to create edges between users and between objects (tv shows, or movies.) Users can follow each other or be friends on GetGlue, and also GetGlue can learn that two people are similar[do they do this?]. GetGlue also hires human evaluators to make connections or directional edges between objects. So True Blood and Buffy the Vampire Slayer might be similar for some reason and so the humans create an edge in the graph between them. There were nuances around the edge being directional. They may draw an arrow pointing from Buffy to True Blood but not vice versa, for example, so their notion of “similar” or “close” captures both content and popularity. (That’s a made-up example.) Pandora does something like this too.

Another important aspect is time. The user checked in or liked a show at a specific time, so the triple extends to a 4-tuple with a timestamp: {user, action, object, timestamp}. This is essentially the data set the class has access to, although it’s slightly more complicated and messy than that. The students' first assignment with this data will be to explore it, try to characterize and understand it, gain intuition around it, and visualize what they find.

Students in the class asked him questions about the value of formal education in becoming a data scientist (do you need one? Kyle’s time spent doing signal processing in research labs was valuable, but so was his time spent coding for fun as a kid), what would be messy about a data set and why (often bugs in the code), how they would know (their QA, and values that don’t make sense), what language he uses to prototype algorithms (python), and how he knows his algorithm is good.

Then it was my turn. I started out with my data scientist profile:

As you can see, I feel my greatest weakness is in CS. Although I can use python pretty proficiently, and in particular I can scrape and parse data, prototype models, and use matplotlib to draw pretty pictures, I am no java map-reducer and I bow down to those people who are. I am also completely untrained in data visualization, but I know enough to get by and give presentations that people understand.

Thought Experiment

I asked the students the following question:

What do you lose when you think of your training set as a big pile of data and ignore the timestamps?

They had some pretty insightful comments. One thing they mentioned off the bat is that you won’t know cause and effect if you don’t have any sense of time. Of course that’s true but it’s not quite what I meant, so I amended the question to allow you to collect relative time differentials, so “time since user last logged in” or “time since last click” or “time since last insulin injection”, but not absolute timestamps.

What I was getting at, and what they came up with, was that when you ignore the passage of time through your data, you ignore trends altogether, as well as seasonality. So for the insulin example, you might note that 15 minutes after your insulin injection your blood sugar goes down consistently, but you might not notice an overall trend of your rising blood sugar over the past few months if your dataset for the past few months has no absolute timestamp on it.

This idea, of keeping track of trends and seasonalities, is very important in financial data, and essential to keep track of if you want to make money, considering how small the signals are.

How to avoid overfitting when you model with time series

After discussing seasonality and trends in the various financial markets, we started talking about how to avoid overfitting your model.

Specifically, I started out with having a strict concept of in-sample (IS) and out-of-sample (OOS) data. Note the OOS data is not meant as testing data; that all happens inside the in-sample data. OOS data is meant to be the data you use after finalizing your model, so that you have some idea how the model will perform in production.

Next, I discussed the concept of causal modeling. Namely, we should never use information in the future to predict something now. Similarly, when we have a set of training data, we don’t know the “best fit coefficients” for that training data until after the last timestamp on all the data. As we move forward in time from the first timestamp to the last, we expect to get different sets of coefficients as more events happen.

One consequence of this is that, instead of getting one set of coefficients, we actually get an evolution of each coefficient. This is helpful because it gives us a sense of how stable those coefficients are. In particular, if a coefficient has changed sign 10 times over the training set, then a good estimate for it is probably zero, not the so-called “best fit” at the end of the data.
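Here's a minimal sketch of watching a coefficient evolve causally; the synthetic data, window sizes, and variable names are all mine. At each time t we refit using only data observed up to t, never looking ahead:

```python
# Causal, expanding-window regression: at each step, refit on past data only
# and record the slope estimate, giving an "evolution" of the coefficient.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=2.0, size=n)  # weak signal buried in noise

coef_path = []
for t in range(20, n):                       # need a minimal window to start
    X = np.column_stack([np.ones(t), x[:t]]) # intercept + predictor, data up to t
    beta, *_ = np.linalg.lstsq(X, y[:t], rcond=None)
    coef_path.append(beta[1])                # slope estimate using the past only

# If the path flips sign repeatedly over the sample, zero is a better
# estimate than the end-of-sample "best fit".
sign_flips = sum(a * b < 0 for a, b in zip(coef_path, coef_path[1:]))
```

Plotting `coef_path` against time is exactly the stability check described above: a coefficient that wanders around zero is telling you not to trust its final fitted value.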

One last word on causal modeling and IS/OOS. It is consistent with production code. Namely, you are always acting, in the training and in the OOS simulation, as if you’re running your model in production and you’re seeing how it performs. Of course you fit your model in sample, so you expect it to perform better there than in production.

Another way to say this is that, once you have a model in production, you will have to make decisions about the future based only on what you know now (so it’s causal) and you will want to update your model whenever you gather new data. So your coefficients of your model are living organisms that continuously evolve.

Submodels of Models

We often “prepare” the data before putting it into a model. Typically the way we prepare it has to do with the mean or the variance of the data, or sometimes the log (and then the mean or the variance of that transformed data).

But to be consistent with the causal nature of our modeling, we need to make sure our running estimates of mean and variance are also causal. Once we have causal estimates of our mean \overline{y} and variance \sigma_y^2, we can normalize the next data point with these estimates, just as we do to get from a gaussian distribution to the standard gaussian distribution:

y \mapsto \frac{y - \overline{y}}{\sigma_y}
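A minimal sketch of this causal normalization in python. The exponential-decay updates and the initial values for the mean and variance are my choices (any causal running estimate would do); the key property is that each point is standardized using only estimates built from earlier points:

```python
# Causal z-scoring: standardize each point with running mean/variance
# estimates computed from strictly earlier data, then update the estimates.
import math

def causal_zscores(ys, s=0.97):
    """Exponentially-weighted running mean and variance; s is the decay.
    Initial mean 0 and variance 1 are arbitrary starting guesses."""
    mean, var, out = 0.0, 1.0, []
    for y in ys:
        out.append((y - mean) / math.sqrt(var))  # uses past-only estimates
        mean = s * mean + (1 - s) * y            # then fold in the new point
        var = s * var + (1 - s) * (y - mean) ** 2
    return out
```

Note the order of operations: normalize first, update second. Swapping them would leak the current point into its own normalization, breaking causality.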

Of course we may have other things to keep track of as well to prepare our data, and we might run other submodels of our model. For example we may choose to consider only the “new” part of something, which is equivalent to trying to predict something like y_t - y_{t-1} instead of y_t. Or we may train a submodel to figure out what part of y_{t-1} predicts y_t, so a submodel which is a univariate regression or something.

There are lots of choices here, but the point is it’s all causal, so you have to be careful when you train your overall model how to introduce your next data point and make sure the steps are all in order of time, and that you’re never ever cheating and looking ahead in time at data that hasn’t happened yet.

Financial time series

In finance we consider returns, say daily. And they're not percent returns; they're actually log returns: if F_t denotes the close on day t, then the return that day is defined as \log(F_t/F_{t-1}). See more about this here.
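In code, the definition is one line; the closing prices below are made up for illustration:

```python
# Daily log returns from a series of closes, per the definition above:
# return on day t is log(F_t / F_{t-1}).
import math

closes = [1400.0, 1386.0, 1412.0, 1405.0]  # made-up closing levels
log_returns = [math.log(b / a) for a, b in zip(closes, closes[1:])]
```

For small moves, log returns are nearly identical to percent returns (log(1+x) ≈ x), but unlike percent returns they add up cleanly across days, which is one reason they're preferred.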

So if you start with S&P closing levels:

Then you get the following log returns:

What’s that mess? It’s crazy volatility caused by the financial crisis. We sometimes (not always) want to account for that volatility by normalizing with respect to it (described above). Once we do that we get something like this:

Which is clearly better behaved. Note this process is discussed in this post.

We could also normalize with respect to the mean, but we typically assume the mean of daily returns is 0, so as to not bias our models on short term trends.

Financial Modeling

One thing we need to understand about financial modeling is that there's a feedback loop. If you find a way to make money, it eventually goes away; people sometimes refer to this by saying the "market learns over time".

One way to see this is that, in the end, your model comes down to knowing some price is going to go up in the future, so you buy it before it goes up, you wait, and then you sell it at a profit. But if you think about it, your buying it has actually changed the process, and decreased the signal you were anticipating. That’s how the market learns – it’s a combination of a bunch of algorithms anticipating things and making them go away.

The consequence of this learning over time is that the existing signals are very weak. We are happy with a 3% correlation for models that have a horizon of 1 day (a “horizon” for your model is how long you expect your prediction to be good). This means not much signal, and lots of noise! In particular, lots of the machine learning “metrics of success” for models, such as measurements of precision or accuracy, are not very relevant in this context.

So instead of measuring accuracy, we generally draw a picture to assess models, namely of the (cumulative) PnL of the model. This generalizes to any model: you plot the cumulative sum of the product of the demeaned forecast and the demeaned realized value. In other words, you see whether your model consistently does better than the "stupidest" model of assuming everything is average.

If you plot this and you drift up and to the right, you’re good. If it’s too jaggedy, that means your model is taking big bets and isn’t stable.
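The generalized PnL picture is easy to compute; here's a sketch on synthetic data, where the signal strength and all names are my choices. (For simplicity this demeans over the full sample, which is fine for an after-the-fact evaluation picture, though a production version would demean causally as discussed above.)

```python
# Cumulative "PnL" curve for any model: cumulative sum of
# (demeaned forecast) * (demeaned realized).
import numpy as np

rng = np.random.default_rng(1)
n = 500
realized = rng.normal(size=n)
forecast = 0.03 * realized + rng.normal(size=n)  # deliberately weak signal

f = forecast - forecast.mean()
r = realized - realized.mean()
pnl_curve = np.cumsum(f * r)
# With a real edge this drifts up and to the right; with a signal this weak,
# the drift can be small relative to the noise, which is exactly the point.
```

Plot `pnl_curve` against time: a smooth upward drift is good, while a jagged curve means the model is taking big, unstable bets.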

Why regression?

From above we know the signal is weak. If you imagine there’s some complicated underlying relationship between your information and the thing you’re trying to predict, get over knowing what that is – there’s too much noise to find it. Instead, think of the function as possibly complicated, but continuous, and imagine you’ve written it out as a Taylor Series. Then you can’t possibly expect to get your hands on anything but the linear terms.

Don’t think about using logistic regression, either, because it would force you to ignore size, which matters in finance: it matters whether a stock went up 2% or 0.01%. Logistic regression gives you an on/off switch, which would be possible to work with but would lose a lot of information. Considering that we are always in a low-information environment, this is a bad idea.

Note that although I’m claiming you probably want to use linear regression in a noisy environment, the terms themselves don’t have to be linear in the information you have. You can always take products of various terms as x’s in your regression, but you’re still fitting a linear model in non-linear terms.

Advanced regression

The first thing I need to explain is the exponential downweighting of old data, which I already used in a graph above, where I normalized returns by volatility with a decay of 0.97. How do I do this?

Working from this post again, the formula is given by essentially a weighted version of the normal one, where I weight recent data more than older data, and where the weight of older data is a power of some parameter s which is called the decay. The exponent is the number of time intervals since that data was new. Putting that together, the formula we get is:

V_{old} = (1-s) \cdot \sum_i r_i^2 s^i.

We are actually dividing by the sum of the weights, but the weights are powers of some number s, so it’s a geometric sum and the sum is given by 1/(1-s).

One cool consequence of this formula is that it’s easy to update: if we have a new return r_0 to add to the series, then it’s not hard to show we just want

V_{new} = s \cdot V_{old} + (1-s) \cdot r_0^2.

In fact this is the general rule for updating exponential downweighted estimates, and it’s one reason we like them so much- you only need to keep in memory your last estimate and the number s.
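The update rule above translates directly into code; the numbers below are made-up returns, used only to check the recursion against the closed-form weighted sum:

```python
# The exponential-downweighting update: keep only the last estimate and s.
#   V_new = s * V_old + (1 - s) * r_0^2
def update_ewma_var(v_old, new_return, s=0.97):
    return s * v_old + (1 - s) * new_return ** 2

# Applying it repeatedly reproduces the closed form
#   V = (1 - s) * sum_i r_i^2 * s^i   (i = 0 is the newest return),
# up to the contribution of the initial value, which is zero here.
v = 0.0
for r in [0.01, -0.02, 0.015]:   # oldest to newest
    v = update_ewma_var(v, r)
```

This is why exponentially downweighted estimates are so memory-friendly: no matter how long the series gets, the state you carry is a single number plus the decay s.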

How do you choose your decay length? This is an art rather than a science, and depends on the domain you're in. Think about how many days (or time periods) it takes for a data point's weight to fall to half that of a new data point, and compare that to how fast the market forgets stuff.
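That half-life has a simple closed form, since a point k periods old carries weight s^k. For the decay of 0.97 used above:

```python
# A point k periods old has weight s^k, so its weight halves when
# s^k = 1/2, i.e. k = ln(1/2) / ln(s).
import math

s = 0.97
half_life = math.log(0.5) / math.log(s)
print(round(half_life, 1))  # about 22.8 periods
```

So with s = 0.97 on daily data, information from about a month ago still carries half the weight of today's, which is the kind of sanity check to run against how fast you believe your market forgets.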

This downweighting of old data is an example of inserting a prior into your model, where here the prior is “new data is more important than old data”. What are other kinds of priors you can have?

Priors

Priors can be thought of as opinions like the above. Besides “new data is more important than old data,” we may decide our prior is “coefficients vary smoothly.” This is relevant when we decide, say, to use a bunch of old values of some time series to help predict the next one, giving us a model like:

y = F_t = \alpha_0 + \alpha_1 F_{t-1} + \alpha_2 F_{t-2} + \epsilon,

which is just the example where we take the last two values of the time series F to predict the next one. But we could use more than two values, of course.

[Aside: in order to decide how many values to use, you might want to draw an autocorrelation plot for your data.]

The way you’d place the prior about the relationship between coefficients (in this case consecutive lagged data points) is by adding a matrix to your covariance matrix when you perform linear regression. See more about this here.
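Here's a hedged sketch of placing a "coefficients vary smoothly" prior this way. The function name, the penalty strength `lam`, and the use of a first-difference penalty matrix are my choices for illustration; the idea is just that the prior enters as a matrix added to X^T X in the normal equations:

```python
# Regression with a smoothness prior on the coefficients: penalize
# differences between consecutive coefficients by adding lam * D^T D
# to X^T X, where D is the first-difference operator.
import numpy as np

def smooth_regression(X, y, lam=1.0):
    k = X.shape[1]
    D = np.eye(k) - np.eye(k, k=1)   # rows like [1, -1, 0, ...]
    D = D[:-1]                        # (k-1) x k first-difference matrix
    A = X.T @ X + lam * (D.T @ D)     # prior added to the covariance term
    return np.linalg.solve(A, X.T @ y)
```

With lam = 0 this is ordinary least squares; as lam grows, consecutive coefficients (here, consecutive lags) are pulled toward each other, encoding the belief that neighboring lagged values should matter about equally.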

Ethics

I then talked about modeling and ethics. My goal is to get this next-gen group of data scientists sensitized to the fact that they are not just nerds sitting in the corner but have increasingly important ethical questions to consider while they work.

People tend to overfit their models. It’s human nature to want your baby to be awesome. They also underestimate the bad news and blame other people for bad news, because nothing their baby has done or is capable of is bad, unless someone else made them do it. Keep these things in mind.

I then described what I call the deathspiral of modeling, a term I coined in this post on creepy model watching.

I counseled the students to

  • try to maintain skepticism about their models and how their models might get used,
  • shoot holes in their own ideas,
  • accept challenges and devise tests as scientists rather than defending their models using words – if someone thinks they can do better, then let them try, and agree on an evaluation method beforehand, and
  • in general, try to consider the consequences of their models.

I then showed them Emanuel Derman’s Hippocratic Oath of Modeling, which was made for financial modeling but fits perfectly into this framework. I discussed the politics of working in industry, namely that even if they are skeptical of their model there’s always the chance that it will be used the wrong way in spite of the modeler’s warnings. So the Hippocratic Oath is, unfortunately, insufficient in reality (but it’s a good start!).

Finally, there are ways to do good: I mentioned stuff like DataKind. There are also ways to be transparent: I mentioned Open Models, which is so far just an idea, but Victoria Stodden is working on RunMyCode, which is similar and very awesome.

Next-Gen Data Scientists

This is written by Rachel Schutt and crossposted from her Columbiadatascience blog

Data is information and is extremely powerful. Models and algorithms that use data can literally change the world. Quantitatively-minded people have always been able to solve important problems, so this is nothing new, and there’s always been data, so this is nothing new.

What is new is the massive amount of data we have on all aspects of our lives, from the micro to the macro. The data we have from government, finance, education, the environment, social welfare, health, entertainment, and the internet will be used to make policy decisions and to build products back into the fabric of our culture.

I want you, my students, to be the ones doing it. I look around the classroom and see a group of thoughtful, intelligent people who want to do good, and are absolutely capable of doing it.

I don’t call myself a “data scientist”. I call myself a statistician. I refuse to be called a data scientist because, as it’s currently used, it’s a meaningless, arbitrary marketing term. However, the existence of the term, and the apparent “sexiness” of the profession, draws attention to data and opens up opportunities. So we need Next-Gen Data Scientists. That’s you! Here’s what I mean when I say Next-Gen Data Scientist:

  • Next-Gen Data Scientists have humility. They don’t lie about their credentials and they don’t spend most of their efforts on self-promotion.
  • Next-Gen Data Scientists have integrity. Their work is not about trying to be “cool” or solving some “cool” problem. It’s about being a problem solver and finding simple, elegant solutions (or complicated ones, if necessary).
  • Next-Gen Data Scientists don’t try to impress with complicated algorithms and models that don’t work.
  • Next-Gen Data Scientists spend a lot more time trying to get data into shape than anyone cares to admit.
  • Next-Gen Data Scientists have the experience or education to actually know what they’re talking about. They’ve put their time in.
  • Next-Gen Data Scientists are skeptical – skeptical about models themselves and how they can fail and the way they’re used or can be misused.
  • Next-Gen Data Scientists make sure they know what they’re talking about before running around trying to show everyone else they exist.
  • Next-Gen Data Scientists have a variety of skills, including coding, statistics, machine learning, visualization, communication, and math.
  • Next-Gen Data Scientists do enough science to merit the word “Scientist”: they test hypotheses and welcome challenges and alternative theories.
  • Next-Gen Data Scientists are solving a new breed of problem around the structure and exploration of data and the computational issues that come with it.
  • Next-Gen Data Scientists don’t find religion in tools, methods or academic departments. They are versatile and interdisciplinary.
  • Next-Gen Data Scientists are highly skilled and ought to get paid well enough that they don’t have to worry too much about money.
  • Next-Gen Data Scientists don’t let money blind them to the point that their models are used for unethical purposes.
  • Next-Gen Data Scientists seek out opportunities to solve problems of social value.
  • Next-Gen Data Scientists understand the implications and consequences of the models they’re building.
  • Next-Gen Data Scientists collaborate and cooperate.
  • Next-Gen Data Scientists bring their humanity with them to problem solving, and algorithm/model-building.
Categories: data science, guest post

Knitting porn

I owe you guys a post on my talk last night at Rachel Schutt’s Data Science course at Columbia (which I’ve been blogging about for the past four weeks here). Yesterday I spoke about time series, financial modeling, and ethics.

But unfortunately, right now I’m tending to my 3-year-old, who was up all night sick. While you wait I thought I’d show you some knitting porn I can’t get enough of:

Categories: musing

Bad news wish list

You know that feeling you get when, a few years after you went to the wedding of two friends, you find out they’re getting a divorce?

It’s not a nice feeling. It’s work for you, and nasty work at that: you have to go back over your memories of those two in the past years, where you’d been projecting happiness and contentment all this time, and replace it with argument and bitterness. Not to mention the sorrow and sympathy you naturally bestow on your friends.

If it happens enough times, which it has to me, then going to weddings at all is kind of a funereal affair. I no longer project happy thoughts towards the newly married couple. If anything I worry for them and cross my fingers, hoping for the best. You may even say I’ve lost my faith in the institution.

Considering this, I can kind of understand why some religions don’t allow divorce. If you don’t allow it, then the bad news will never come out, and you won’t have to retroactively fit your internal model of other people’s lives to reality. You can go on blithely assuming everyone’s doing great. While we’re at it, no kids are getting neglected or abused because we don’t talk about that kind of thing.

By way of unreasonable analogy, I’d like to discuss the lack of conversation we’ve seen from the presidential campaigns on both sides about the state of the financial system. I’m starting to think it’s part of the religion of politicians that they never talk about this stuff, because they treat it as an embarrassing failure along the lines of a Catholic divorce.

Or maybe I don’t have to be so philosophical about it – is it religion, or is it just money?

I had trouble following much of the two national conventions, because it made me so incensed that nothing was really being discussed, and that it was all so full of shit. But one thing I managed to glean from the coverage of the “events” sponsored by the various lobbyist groups is this: whereas most lobbyists sponsor events at one of the conventions – the NRA sponsors something at the Republican convention, the unions sponsor stuff at the Democratic one – the financial lobbyists sponsor huge swanky events at both.

I interpret this to mean that they are paying to not be discussed as a platform issue. They seem to have paid enough, because I don’t hear anything from the Romney camp about shit Obama has or hasn’t done, or shit Geithner has or hasn’t done.

In fact, there’s a “Stories I’d Like to See” column in Reuters entitled “Tales of a TARP built to benefit bankers, and waiting for CEOs to pay the price,” written by Steven Brill, which discusses this exact issue in the context of Neil Barofsky’s book Bailout, which I blogged about here. From the column:

A presidential campaign that wanted to call out the Obama administration for being too friendly to Wall Street and the banks at the expense of Main Street would be using Bailout as the cheat sheet that keeps on giving. But with the Romney campaign’s attack coming from the opposite direction – that the president and his team have killed the economy by shackling Wall Street – and with Romney on record in favor of allowing the mortgage crisis to “bottom out” with no government intervention, the former Massachusetts governor and his team have no use for Bailout.

The second half of the article is really good, asking very commonsensical questions about the recent $2.4 billion settlement BofA reached with shareholders for blatantly lying to them around the time it acquired Merrill Lynch. Specifically, the author notes that the (current) shareholders are left paying the (2008) shareholders, which is dumb, while the asshole Ken Lewis, who actually did the lying, doesn’t seem to be getting into any trouble at all. From the column:

And, as long as we’re talking about harm done to shareholders, why wouldn’t we now see a new, post-settlement shareholders’ suit not against the company but targeted only at Lewis and some of his former colleagues who got Bank of America into this jam in the first place and just caused it to pay out $2.4 billion? (The plaintiffs here could be any current shareholders, because they are the ones who are writing the $2.4 billion check.) Again, did the company indemnify Lewis and other executives against shareholder suits, meaning that if a shareholder now sues Lewis over this $2.4 billion settlement, the shareholder is once again only suing himself?

Can someone please sort this out?

I really like this idea, that we have a list of topics for people to sort out, even though it’s going to be bad news. What other topics should we ask for on our bad news wish list?

Categories: #OWS, finance, news

Student loans are a regressive tax

I don’t think this way of looking at student loans is new, but it’s new to me. A friend of mine mentioned it to me over the weekend.

For simplicity, assume everyone goes to college. Next, assume they all go to similar colleges – similar in cost and in quality. We will revisit these assumptions later. Finally, assume that college costs keep rising the way they have been and that student loan interest rates stay high.

What this means when you put it all together is that sufficiently rich people, or more likely their parents, will pay a one-time very large fee to attend college, but then they’ll be done with it. The rest of the people will be stuck paying monthly fees that will never go away. Moreover, because the interest rates are pretty high, the total amount non-rich people pay over their lifetime is substantially more than what rich people pay.

This is essentially a regressive tax, whereby poor people pay more than rich people.
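To put rough numbers on this – my own illustrative figures, not from any official source – compare paying a hypothetical $100,000 sticker price up front with borrowing that same amount at a private-loan-like 8% over 20 years, using the standard fixed-rate amortization formula:

```python
def total_repaid(principal, annual_rate, years):
    """Total paid over the life of a standard fixed-rate amortized loan."""
    r = annual_rate / 12          # monthly interest rate
    n = years * 12                # number of monthly payments
    monthly = principal * r / (1 - (1 + r) ** -n)
    return monthly * n

cost = 100_000  # hypothetical sticker price of a degree

# Rich family: pays the sticker price once and is done with it.
print(f"paid up front:          ${cost:,.0f}")

# Everyone else: borrows at 8% for 20 years and pays every month.
print(f"borrowed at 8% / 20 yr: ${total_repaid(cost, 0.08, 20):,.0f}")
```

At these (made-up but not crazy) rates, the borrower ends up paying roughly twice what the up-front payer does for the exact same degree – which is the regressive tax in a nutshell.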

Other points:

  1. Government student loans don’t have extremely high interest rates, but there’s a limit on how much you can borrow through that program, which leads many people even now to borrow privately at much higher rates.
  2. In the case of government-backed student loans this “tax” is essentially going to the government. In the case of private student loans, the private creditors are receiving the tax.
  3. Since you can’t discharge student debt via bankruptcy – even private student debt – it really is a lifelong tax. It’s even true that if you haven’t paid off your student debt by the time you retire, your Social Security payments get cut.
  4. What about our assumption that all schools have the same quality? Not true. Rich people tend to go to better schools. This means the poor are paying a tax for an inferior service. Of course, it’s also true that truly elite schools like Harvard have excellent financial support for their poorer students. This means there’s a two-tier system if you’re poor: you can go to a normal school and pay the tax, or you can excel, get into an elite school, and attend for free.
  5. What about our assumption that all schools have the same cost? Also not true, of course; you can shop around for better-quality education at a reasonable price.
  6. What about our assumption that everyone goes to college? Not true, but it’s still the case that going to college and finishing sets you up for far better wages than a high school diploma alone. And although going to college without finishing may not, nobody thinks they’re the one who won’t finish.

Conclusion: either we have to keep costs down, or make college government-subsidized, or make student loan interest rates really low, or offset this regressive tax with a highly progressive income tax.

Categories: #OWS, finance