Archive

Archive for October, 2012

Columbia Data Science course, week 6: Kaggle, crowd-sourcing, decision trees, random forests, social networks, and experimental design

Yesterday we had two guest lecturers, who took up approximately half the time each. First we welcomed William Cukierski from Kaggle, a data science competition platform.

Will went to Cornell for a B.A. in physics and to Rutgers to get his Ph.D. in biomedical engineering. He focused on cancer research, studying pathology images. While working on writing his dissertation, he got more and more involved in Kaggle competitions, finishing very near the top in multiple competitions, and now works for Kaggle. Here’s what Will had to say.

Crowd-sourcing in Kaggle

What is a data scientist? Some say it’s someone who is better at stats than an engineer and better at engineering than a statistician. But one could argue it’s actually someone who is worse at stats than a statistician and worse at engineering than an engineer. Being a data scientist is when you learn more and more about more and more until you know nothing about everything.

Kaggle uses prizes to induce the public to do stuff. This is not a new idea.

There are two kinds of crowdsourcing models. First, we have the distributive crowdsourcing model, like Wikipedia, which asks for relatively easy contributions, but in large amounts. Then there are the singular, focused, difficult problems that Kaggle, DARPA, InnoCentive, and other companies specialize in.

Some of the problems with crowdsourcing projects include:

  • They don’t always evaluate your submission objectively. Instead they use a subjective measure, so they might just decide your design is bad. This leads to a high barrier to entry, since people don’t trust the evaluation criterion.
  • Participants don’t get recognition until after they’ve won or ranked highly, which leads to high sunk costs for them.
  • Bad competitions often conflate participants with mechanical turks: in other words, they assume you’re stupid. This doesn’t lead anywhere good.
  • Competitions sometimes don’t chunk the work into bite-sized pieces, which means the work is either too big to do or too small to be interesting.

A good competition has a do-able, interesting question, with an evaluation metric which is transparent and entirely objective. The problem is given, the data set is given, and the metric of success is given. Moreover, prizes are established up front.

The participants are encouraged to submit their models up to twice a day during the competitions, which last on the order of a few days. This encourages a “leapfrogging” between competitors, where one ekes out a 5% advantage, giving others incentive to work harder. It also establishes a band of accuracy around a problem which you generally don’t have- in other words, given no other information, you don’t know if your 75% accurate model is the best possible.

The test set y’s are hidden, but the x’s are given, so you just use your model to get your predicted y’s for the test set and upload them into the Kaggle machine to see your evaluation score. This way you don’t share your actual code with Kaggle unless you win the prize (and Kaggle doesn’t have to worry about which version of python you’re running).

Note this leapfrogging effect is good and bad. It encourages people to squeeze out better performing models but it also tends to make models much more complicated as they get better. One reason you don’t want competitions lasting too long is that, after a while, the only way to inch up performance is to make things ridiculously complicated. For example, the original Netflix Prize lasted two years and the final winning model was too complicated for them to actually put into production.

The hole that Kaggle is filling is the following: there’s a mismatch between those who need analysis and those with skills. Even though companies desperately need analysis, they tend to hoard data; this is the biggest obstacle for success.

They have had good results so far. Allstate, which has a good actuarial team, challenged Kaggle competitors to improve its actuarial model, which, given attributes of drivers, approximates the probability of a car crash. The 202 competitors improved Allstate’s internal model by 271%.

There were other examples, including one where the prize was $1,000 and it benefited the company $100,000.

A student then asked, is that fair? There are actually two questions embedded in that one. First, is it fair to the data scientists working at the companies that engage with Kaggle? Some of them might lose their job, for example. Second, is it fair to get people to basically work for free and ultimately benefit a for-profit company? Does it result in data scientists losing their fair market price?

Of course Kaggle charges a fee for hosting competitions, but is it enough?

[Mathbabe interjects her view: personally, I suspect this is a model which seems like an arbitrage opportunity for companies but only while the data scientists of the world haven’t realized their value and have extra time on their hands. As soon as they price their skills better they’ll stop working for free, unless it’s for a cause they actually believe in.]

Facebook is hiring data scientists, and they hosted a Kaggle competition where the prize was an interview. There were 422 competitors.

[Mathbabe can’t help but insert her view: it’s a bit too convenient for Facebook to have interviewees for data science positions in such a posture of gratitude for the mere interview. This distracts them from asking hard questions about what the data policies are and the underlying ethics of the company.]

There’s a final project for the class, namely an essay grading contest. The students will need to build it, train it, and test it, just like any other Kaggle competition. Group work is encouraged.

Thought Experiment: What are the ethical implications of a robo-grader?

Some of the students’ thoughts:

  • It depends on how much you care about your grade.
  • Actual human graders aren’t fair anyway.
  • Is this the wrong question? The goal is not to write a good essay but rather to do well on a standardized test. The real profit center for standardized testing is, after all, selling books that tell you how to take the tests. It’s a screening: you follow the instructions, and you get a grade depending on how well you follow instructions.
  • There are really two questions: 1) Is it wise to move from the human to the machine version of the same thing, for any given thing? and 2) Are machines making things more structured, and is this inhibiting creativity? One thing is for sure, robo-grading prevents me from being compared to someone more creative.
  • People want things to be standardized. It gives us a consistency that we like. People don’t want artistic cars, for example.
  • Will: We used machine learning to research cancer, where the stakes are much higher. In fact this whole field of data science has to be thinking about these ethical considerations sooner or later, and I think it’s sooner. In the case of doctors, you could give the same doctor the same slide two months apart and get different diagnoses. We aren’t consistent ourselves, but we think we are. Let’s keep that in mind when we talk about the “fairness” of using machine learning algorithms in tricky situations.

Introduction to Feature Selection 

“Feature extraction and selection are the most important but underrated step of machine learning. Better features are better than better algorithms.” – Will

“We don’t have better algorithms, we just have more data” –Peter Norvig

Will claims that Norvig really wanted to say we have better features.

We are getting bigger and bigger data sets, but that’s not always helpful. The danger is if the number of features is larger than the number of samples or if we have a sparsity problem.

We improve our feature selection process to try to improve performance of predictions. A criticism of feature selection is that it’s no better than data dredging. If we just take whatever answer we get that correlates with our target, that’s not good.

There’s a well known bias-variance tradeoff: a model is “high bias” if it’s too simple (the features aren’t encoding enough information). In this case lots more data doesn’t improve your model. On the other hand, if your model is too complicated, then “high variance” leads to overfitting. In this case you want to reduce the number of features you are using.

We will take some material from a famous paper by Isabelle Guyon and André Elisseeff published in 2003, entitled “An Introduction to Variable and Feature Selection”.

There are three categories of feature selection methods: filters, wrappers, and embedded methods. Filters order variables (i.e. possible features) with respect to some ranking (e.g. correlation with target). This is sometimes good on a first pass over the space of features. Filters take account of the predictive power of individual features, and estimate mutual information or what have you. However, the problem with filters is that you get correlated features. In other words, the filter doesn’t care about redundancy.

This isn’t always bad and it isn’t always good. On the one hand, two redundant features can be more powerful when they are both used, and on the other hand something that appears useless alone could actually help when combined with another possibly useless-looking feature.

Wrapper feature selection tries to find subsets of features that will do the trick. However, as anyone who has studied the binomial coefficients knows, the number of possible subsets of n things grows exponentially (the number of size-k subsets alone is {n \choose k}). So there’s a nasty opportunity for overfitting by doing this. Most subset methods are capturing some flavor of minimum-redundancy-maximum-relevance. So, for example, we could have a greedy algorithm which starts with the best feature, takes a few more highly ranked, removes the worst, and so on. This is a hybrid approach with a filter method.
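To make the greedy idea concrete, here’s a minimal python sketch of a filter-seeded wrapper: rank features by a simple correlation filter, start from the best one, and greedily add whichever remaining feature most improves a cross-validated score. The estimator, the correlation filter, and the cross-validation settings are illustrative choices on my part, not anything Will prescribed, and it naively retrains the model at each step.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def greedy_forward_selection(X, y, max_features=10):
    """Greedy wrapper seeded by a filter: rank by |correlation with target|,
    then keep adding the feature that most improves cross-validated score."""
    n_features = X.shape[1]
    # Filter step: order features by absolute correlation with the target.
    ranking = np.argsort([-abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)])
    selected = [int(ranking[0])]
    candidates = set(int(j) for j in ranking[1:])
    model = LogisticRegression(max_iter=1000)  # assumes a classification target
    best_score = cross_val_score(model, X[:, selected], y, cv=5).mean()
    while candidates and len(selected) < max_features:
        scores = {j: cross_val_score(model, X[:, selected + [j]], y, cv=5).mean()
                  for j in candidates}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:
            break  # no remaining feature helps; stop here
        selected.append(j_best)
        candidates.remove(j_best)
        best_score = scores[j_best]
    return selected, best_score
```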

We don’t have to retrain models at each step of such an approach, because there are fancy ways to see how objective function changes as we change the subset of features we are trying out. These are called “finite differences” and rely essentially on Taylor Series expansions of the objective function.

One last word: if you have a domain expert on hand, don’t go down the machine learning rabbit hole of feature selection unless you’ve tapped into their expertise completely!

Decision Trees

We’ve all used decision trees. They’re easy to understand and easy to use. How do we construct one? Choosing a feature to split on at each step is like playing 20 questions: we ask about the most informative thing first. For the sake of this discussion, assume we break compound questions into multiple binary questions, so the answer is “+” or “-”.

To quantify “what is the most informative feature”, we first define entropy for a random variable X to mean:

H(X) = - p(x_+) log_2(p(x_+)) - p(x_-) log_2(p(x_-)).

Note when p(x_*) = 0, we define the term to vanish. This is consistent with the fact that

\lim_{t\to 0} t log(t) = 0.

In particular, if either option has probability zero, the entropy is 0. For a binary variable, entropy is maximized when p(x_+) = 0.5, which we can easily compute using the fact that in the binary case p(x_+) = 1 - p(x_-) and a bit of calculus.

Using this definition, we define the information gain of a given feature as the entropy we lose if we know the value of that feature.
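Here’s a small python sketch of those two definitions, assuming a discrete feature and a discrete label (the toy data at the bottom is made up):

```python
import numpy as np

def entropy(y):
    """Entropy of a label vector, in bits."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))  # zero-probability terms never appear here

def information_gain(y, feature):
    """Entropy of y minus the entropy of y once we condition on a discrete
    feature: the entropy we lose by knowing the feature's value."""
    gain = entropy(y)
    for value in np.unique(feature):
        mask = feature == value
        gain -= mask.mean() * entropy(y[mask])
    return gain

# toy example: a feature that perfectly splits the labels gains the full entropy
y = np.array([0, 0, 1, 1])
x = np.array(["a", "a", "b", "b"])
print(entropy(y), information_gain(y, x))  # 1.0 1.0
```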

To make a decision tree, then, we want to maximize information gain, and make a split on that. We keep going until all the points at the end are in the same class or we end up with no features left. In this case we take the majority vote. Optionally we prune the tree to avoid overfitting.

This is an example of an embedded feature selection algorithm. We don’t need to use a filter here because the “information gain” method is doing our feature selection for us.

How do you handle continuous variables?

In the case of continuous variables, you need to pick a threshold so that the variable can be thought of as binary. So you could partition a user’s spend into “less than $5” and “at least $5” and you’d be back in the binary variable case. It takes some extra work to decide on the information gain here, because it depends on the threshold as well as the feature.
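For example, here’s a little sketch that scans candidate thresholds (midpoints between consecutive sorted values) and keeps the one with the largest information gain; it reuses the information_gain function from the sketch above, and the spend numbers are made up:

```python
import numpy as np

def best_threshold(y, values):
    """Try the midpoints between consecutive sorted values as thresholds and
    return the threshold and information gain of the best binary split."""
    candidates = np.unique(values)
    thresholds = (candidates[:-1] + candidates[1:]) / 2.0
    gains = [information_gain(y, values < t) for t in thresholds]
    best = int(np.argmax(gains))
    return float(thresholds[best]), float(gains[best])

# e.g. spend in dollars; the scan lands near the $5 boundary described above
spend = np.array([1.0, 2.0, 3.0, 6.0, 8.0, 9.0])
labels = np.array([0, 0, 0, 1, 1, 1])
print(best_threshold(labels, spend))  # (4.5, 1.0)
```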

Random Forests

Random forests are cool. They incorporate “bagging” (bootstrap aggregating) and trees to make stuff better. Plus they’re easy to use: you just need to specify the number of trees you want in your forest, as well as the number of features to randomly select at each node.

A bootstrap sample is a sample with replacement, which we usually take to be 80% of the actual data, but of course can be adjusted depending on how much data we have.

To construct a random forest, we construct a bunch of decision trees (we decide how many). For each tree, we take a bootstrap sample of our data, and for each node we randomly select (a second point of bootstrapping actually) a few features, say 5 out of the 100 total features. Then we use our entropy-information-gain engine to decide which among those features we will split our tree on, and we keep doing this, choosing a different set of five features for each node of our tree.

Note we could decide beforehand how deep the tree should get, but we typically don’t prune the trees, since a great feature of random forests is that they incorporate idiosyncratic noise.
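For the curious, here’s roughly what this looks like with scikit-learn’s random forest, where the two knobs mentioned above correspond to n_estimators and max_features; the synthetic data and the particular parameter values are just illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# made-up data: 1000 samples, 100 features, only 10 of which carry signal
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,  # how many trees in the forest
    max_features=5,    # e.g. 5 of the 100 features considered at each split
    bootstrap=True,    # each tree is fit on a bootstrap sample of the data
    random_state=0,
)
forest.fit(X_train, y_train)
print("held-out accuracy:", forest.score(X_test, y_test))
```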

Here’s what a decision tree looks like for surviving on the Titanic.

David Huffaker, Google: Hybrid Approach to Social Research

David is one of Rachel’s collaborators at Google. They had a successful collaboration: starting with complementary skill sets, an explosion of goodness ensued when they were put together to work on Google+ with a bunch of other people, especially engineers. David brings a social scientist’s perspective to the analysis of social networks. He’s strong in quantitative methods for understanding and analyzing online social behavior. He got a Ph.D. in Media, Technology, and Society from Northwestern.

Google does a good job of putting people together. They blur the lines between research and development. The researchers are embedded on product teams. The work is iterative, and the engineers on the team strive to have near-production code from day 1 of a project. They leverage cloud infrastructure to deploy experiments to their mass user base and to rapidly deploy a prototype at scale.

Note that, considering the scale of Google’s user base, redesigning as they scale up is not a viable option. They instead do experiments with smaller groups of users.

David suggested that we, as data scientists, consider how to move into an experimental design so as to move to a causal claim between variables rather than a descriptive relationship. In other words, to move from the descriptive to the predictive.

As an example, he talked about the genesis of the “circle of friends” feature of Google+. They know people want to selectively share; they’ll send pictures to their family, whereas they’d probably be more likely to send inside jokes to their friends. They came up with the idea of circles, but it wasn’t clear if people would use them. How do they answer the question: will they use circles to organize their social network? It’s important to know what motivates them when they decide to share.

They took a mixed-method approach, so they used multiple methods to triangulate on findings and insights. Given a random sample of 100,000 users, they set out to determine the popular names and categories of names given to circles. They identified 168 active users who filled out surveys and they had longer interviews with 12.

They found that the majority were engaging in selective sharing, that most people used circles, that the circle names were most often work-related or school-related, and that they had elements of a strong link (“epic bros”) or a weak link (“acquaintances from PTA”).

They asked the survey participants why they share content. The answers primarily came in three categories: first, the desire to share about oneself – personal experiences, opinions, etc. Second, discourse: people wanna participate in a conversation. Third, evangelism: people wanna spread information.

Next they asked participants why they choose their audiences. Again, three categories: first, privacy – many people were public or private by default. Second, relevance – they wanted to share only with those who may be interested, and they don’t wanna pollute other people’s data stream. Third, distribution – some people just want to maximize their potential audience.

The takeaway from this study was this: people do enjoy selectively sharing content, depending on context, and the audience. So we have to think about designing features for the product around content, context, and audience.

Network Analysis

We can use large data sets to look at connections between actors as a graph. For Google+, the users are the nodes, and a directed edge means “in the same circle”.

Other examples of networks:

After you define and draw a network, you can hopefully learn stuff by looking at it or analyzing it.

Social at Google

As you may have noticed, “social” is a layer across all of Google. Search now incorporates this layer: if you search for something you might see that your friend “+1”-ed it. This is called a social annotation. It turns out that people care more about an annotation when it comes from someone with domain expertise than from someone they’re very close to. So you might care more about the opinion of a wine expert at work than the opinion of your mom when it comes to purchasing wine.

Note that this sounds obvious, but if you started the other way around, asking who you’d trust, you might start with your mom. In other words, “close ties,” even if you can determine those, are not the best feature for ranking annotations. But that begs the question: what is? Typically in a situation like this we use click-through rate, or how long it takes to click.

In general we need to always keep in mind a quantitative metric of success. This defines success for us, so we have to be careful.

Privacy

Human-facing technology has thorny privacy issues, which makes stuff hard. We surveyed people about what makes them uneasy about sharing content. We asked: how does it affect your engagement? What is the nature of your privacy concerns?

Turns out there’s a strong correlation between privacy concern and low engagement, which isn’t surprising. It’s also related to how well you understand what information is being shared, and to the questions: when you post something, where does it go, and how much control do you have over it? When you are confronted with a huge pile of complicated settings, you tend to start feeling passive.

Again, we took a survey and found broad categories of concern as follows:

identity theft

  • financial loss

digital world

  • access to personal data
  • really private stuff I searched on
  • unwanted spam
  • provocative photo (oh shit my boss saw that)
  • unwanted solicitation
  • unwanted ad targeting

physical world

  • offline threats
  • harm to my family
  • stalkers
  • employment risks
  • hassle

What is the best way to decrease concern and increase understanding and control?

Possibilities:

  • Write and post a manifesto of your data policy (tried that, nobody likes to read manifestos)
  • Educate users on our policies a la the Netflix feature “because you liked this, we think you might like this”
  • Get rid of all stored data after a year

Rephrase: how do we design settings to make it easier for people? How do you make it transparent?

  • make a picture or graph of where data is going.
  • give people a privacy switchboard
  • give people access to quick settings
  • make the settings you show them categorized by things you don’t have a choice about vs. things you do
  • make reasonable default settings so people don’t have to worry about it.

David left us with these words of wisdom: as you move forward and have access to big data, you really should complement them with qualitative approaches. Use mixed methods to come to a better understanding of what’s going on. Qualitative surveys can really help.

Live and let live, motherfuckers

It’s high time I tell you guys about my favorite blog, Effing Dykes.

Why now? Well, I’ve wanted to write a post about body image like Effing Dykes’ The Body Electric ever since I started this blog (ever since I turned 10, actually). But I couldn’t get it right. Not in a million years could I have written something so beautiful or so right. So I’m really grateful she has written it. Please read and enjoy.

That url again: http://effingdykes.blogspot.com/2012/09/the-body-electric.html

Note: I’ve stolen the catchy phrase “live and let live, motherfuckers” (can you say “phrase of the week”?) from that post, as well as this picture, which reminds me of my wordpress profile pic as well as all of my friends from high school:

p.s. I had a wardrobe crisis last week when I realized I only owned one ugly plaid flannel shirt, but luckily Old Navy has an ugly plaid flannel shirt sale going on.

Categories: Uncategorized

Neil Barofsky on the Fed Stress Test

I recently started using Twitter, and I only follow 8 people, one of them being Neil Barofsky, author of Bailout, which I blogged about here (Twitter is a useful way to stalk your crushes, as Twitter users already know).

I’m glad I do follow him, because yesterday he tweeted (twatted?) about an article he wrote on LinkedIn which I never would have found otherwise. It’s called “Banks Rule While the Rest of us Drool,” and he gave credit to his daughter for that title, which is crushworthy in itself. It’s essentially a bloggy rant against a Wall Street Journal article which I had just read and was thinking of writing a ranty blog post against myself.

But now I don’t have to write it! I’ll just tell you about the WSJ article, quote from it a bit (and complain about it a bit since I can’t help myself), and then quote Barofsky’s awesome disgust with it. Here goes.

The Fed conducts stress tests on the banks, and it is keeping them secret so the banks can’t game them, as well as requiring more frequent and better-quality data. All good. From the WSJ article:

The Fed asks the big banks to submit reams of data and then publishes each bank’s potential loan losses and how much capital each institution would need to absorb them. Banks also submit plans of how they would deploy capital, including any plans to raise dividends or buy back stock.

After several institutions failed last year’s tests and had their capital plans denied, executives at many of the big banks began challenging the Fed to explain why there were such large gaps between their numbers and the Fed’s, according to people close to the banks.

Fed officials say they have worked hard to help bankers better understand the math, convening the Boston symposium and multiple conference calls. But they don’t want to hand over their models to the banks, in part because they don’t want the banks to game the numbers, officials say.

Just to be clear, when they say “large gaps”, I’m pretty sure the banks mean they are perfectly safe when the Fed thinks they’re undercapitalized. I am pretty sure the banks are arguing they should be giving huger bonuses to their C*O’s whereas the Fed thinks not. I’m just guessing on the direction, but I could be wrong, it’s not spelled out in the article.

Here’s another thing that drives me up the wall, from the WSJ article:

Banks say the Fed has asked them for too much, too fast. Some bankers, for instance, have complained the Fed now is demanding they include the physical address of properties backing loans on their books, not just the billing address for the borrower. Not all banks, it turns out, have that information readily available.

Daryl Bible, the chief risk officer at BB&T Corp., a Winston-Salem, N.C.-based bank with $179 billion in assets, challenged the Fed’s need for all of the data it is collecting, saying in a Sept. 4 comment letter to the regulator that “the reporting requirements appear to have advanced beyond the linkage of risk to capital and an organization’s viability,” burdening banks without adding any value to the stress test exercise. BB&T declined further comment.

Oh really? Can you, Daryl Bible, think of no reason at all we might want to know the addresses of the houses you gave bad mortgages to? Really? Do you really think you deserve to be a Chief Risk Officer of a firm with $179 billion in assets if your imagination of how to calculate risk is so puny?

But the most infuriating part of the article is at the end, and I’m going to let Neil take it away:

… at the end of the article the reporters reveal that the Fed recently “backed off” a requirement that the CFOs of the banks actually confirm that the numbers they are providing are accurate. The reason?  The banks argued, and the Fed apparently agreed, that providing data about what’s going on in the banks is simply too “confusing for any CFO to be able to be sure his bank had gotten it right.” In other words, rather than demand personal accountability, the Fed seems to be content with relying on unverified and potentially inaccurate data.   If this does not prove both the inherent unreliability of these tests and that the banks are still so hopelessly complex that their executives do not know what’s going on inside of them (See Whale, London), I’m not sure what would.

Categories: finance

Suresh Naidu: analyzing the language of political partisanship

I was lucky enough to attend Suresh Naidu’s lecture last night on his recent work analyzing congressional speeches with co-authors Jacob Jensen, Ethan Kaplan, and Laurence Wilse-Samson.

Namely, along with his co-authors, he found popular three-word phrases, measured and ranked their partisanship (by how often a democrat uttered the phrase versus a republican), and measured the extent to which those phrases were being used in the public discussion before congress started using them or after congress started using them.

Note this means that phrases that were uttered often by both parties were ignored. Only phrases that were uttered more by one party than the other, like “free market system,” were counted. Also, the words were reduced to their stems and small common words were ignored, so the phrase “united states of america” was reduced to “unite.state.america”. So if parties were talking about the same issue but insisted on using certain phrases (“death tax” for example), then it would show up. This certainly jibes with my sense of how partisanship is established by politicians, and for the sake of the paper it can be taken to be the definition.

The first data set he used was a digitized version of all of the speeches from the House since the end of the Civil War, which was also the beginning of the “two-party” system as we know it. Third-party politicians were ignored. The proxy for “the public discussion” was taken from Google Books N-grams, which consists of books published in English in a given year.

Some of the conclusions that I can remember are as follows:

  1. The three-word phrases themselves are a super interesting data set: their prevalence, how they move from one side of the aisle to the other over time, and what they discuss (so for example, they don’t discuss international issues that much – which doesn’t mean the politicians don’t discuss international issues, but that it’s not a particularly partisan issue, or at least their language around this issue is similar).
  2. When the issue is economic and highly partisan, it tends to show up “in the public” via Google Books before it shows up in Congress. Which is to say, there’s been a new book written by some economist, presumably, who introduces language into the public discussion that later gets picked up by Congress.
  3. When the issue is non-economic or only somewhat partisan, it tends to show up in Congress before or at the same time as in the public domain. Members of Congress seem to feel comfortable making up their own phrases and repeating them in such circumstances.

So the cult of the economic expert has been around for a while now.

Suresh and his crew also made an overall measurement of the partisanship of a given 2-year session of congress. It was interesting to discuss how this changed over time, and how having large partisanship, in terms of language, did not necessarily correlate with having stalemate congresses. Indeed if I remember correctly, a moment of particularly high partisanship, as defined above via language, was during the time the New Deal was passed.

Also, as we also discussed (it was a lively audience), language may be a marker of partisan identity without necessarily pointing to underlying ideological differences. For example, the phrase “Martin Luther King” has been ranked high as a partisan democratic phrase since the civil rights movement but then again it’s customary (I’ve been told) for democrats to commemorate MLK’s birthday, but not for republicans to do so.

Given their speech, this analysis did a good job identifying which party a politician belonged to, but the analysis was not causal in the sense of time: we needed to know the top partisan phrases of that session of Congress to be able to predict the party of a given politician. Indeed the “top phrases” changed so quickly that the predictive power may be mostly lost between sessions.

Not that this is a big deal, since of course we know what party a politician is from, but it would be interesting to use this as a measure of how radical or centered a given politician is or will be.

Even if you aren’t interested in the above results and discussion, the methodology is very cool. Suresh and his co-authors view text as its own data set and analyze it as such.

And after all, the words historical politicians spoke are what we have on record – we can’t look into their brains and see what they were thinking. It’s of course interesting and important to have historians (domain experts) inform the process as well, e.g. for the “Martin Luther King” phrase above, but barring expert knowledge this is lots better than nothing. One thing it tells us, just in case we didn’t study political history, is that we’ve seen way worse partisanship in the past than we see now, although things have consistently been getting worse since the 1980’s.

Here’s a wordcloud from the 2007 session; blue and red are what you think, and bigger means more partisan:

The Neighbors

When I was a senior in high school, my parents moved house to the outskirts of Lexington, Massachusetts, from the center of town where I’d grown up. The neighborhood had a totally different feel, even though it was the same town. In particular it had a kind of prissiness that we didn’t understand or care for.

My best friend Becky ran away from home to live with my family during this year, so most of my memories of that house involve her. Our good friend Karen often visited as well; she drove her beat-up old VW van up the hill and parked it right across from our house on the street. This was totally legal, by the way, and there were plenty of people who parked on the street nearby.

Just to describe the van a bit more: it had about 5 different color paints on it, but not in any kind of artistic way. It was just old. And it had a million, possibly more than a million, memories of teenage sex hanging on to it- at some point there had even been a mattress installed in the back of the van. I remember this from earlier in high school, when the van had been owned by Karen’s older half-sister and had been parked out behind the high school.

Just in case this is getting too seedy for you, keep in mind we were the freaks and geeks of high school (J-House), we talked about D&D and always used condoms. I don’t even know why I’m saying “we” because I personally never got any action in the legendary van, but I was certainly aware of it.

So anyway, Karen would drive up the hill and park her ugly-but-legendary van there, and every time she’d do it, she’d get a nasty note on her windshield by the time she left, something along these lines:

“Please don’t park your van in front of our window. It is an eyesore. – the Neighbors”

I remember laughing hysterically with Karen and Becky the first time Karen got such a note and bringing it to my mom, who, in her characteristically nerdy way, said something about how it’s perfectly legal to park on the street and to ignore it.

What was awesome about this was how, from then on, Karen would very carefully park her van right in front of the window of the Neighbors (their last name was actually “Neighbors”). Sometimes she’d pull up a bit, then pull back, then get it settled just so. And she always got the note, even though we never actually saw them leave the house. They were like magical prissy elves.

One more story about the Neighbors which is too good to resist. There was a swimming pool in the back of the house, which my mom hated with all her heart because she was in charge of the upkeep and it kept mysteriously turning green. And Becky and I were going through a back-to-nature phase, which meant we were planning to go hiking up in the White Mountains. So one day we were testing our tent out in the front yard, learning how to open and close it, and we happened to be wearing swimming suits, since we’d been swimming.

The Neighbors called my house (this is back when there were things called “telephone books” and you could find someone’s phone number without asking them) and complained to my grandma, who happened to answer the phone, and who also happened to be wearing nothing but a swimming suit, that “there are skimpily clad young ladies cavorting on the front lawn in an obscene manner.”

Now, my grandma had arthritis and couldn’t comfortably walk or stand for very long, but this phone call seemed to give her extra strength. She walked to the front door and stood there, arms crossed, looking defiantly out at the neighborhood for five minutes. After about four minutes I asked her if everything was all right and she said, “perfectly fine.”

Categories: musing

Dissolve the SEC

A few days ago I wrote about the $5 million fine the SEC gave to NYSE for allowing certain customers to see prices before other customers. I was baffled that the fine was so low- access like that allows the customers to make outrageous profits, and it seems like the resulting fine should be more along the lines of those profits, since kickbacks are probably in terms of percentages of take. The lawyer fees from this case on both sides are much higher than $5 million, for christ’s sakes.

But now I’m even more outraged by the newest smallest fine, this time an $800,000 fine for a dark pool trading firm eBX. From the Boston.com article:

Federal securities regulators on Wednesday charged Boston-based eBX LLC, a “dark pool” securities exchange, with failing to protect confidential trading information of customers and for failing to disclose that it let an outside firm use their trading data.

The Securities and Exchange Commission said eBX, which runs the alternative trading system LeveL ATS, agreed to settle the charges and to pay an $800,000 penalty.

You know that if I can actually consider paying the fine myself, then the fine is too small. It’s along the lines of the cost of college for my kids.

Look, I don’t care what it’s for: if the SEC finds you guilty of fraud, it should threaten to put you out of business. Otherwise why should they waste their time doing it?

On the one hand, I’m outraged that these fraudulent practices are being so lightly punished. Indeed it’s worse than no punishment at all to get such a light punishment, because it establishes precedent. Now exchanges know how much it costs to let certain traders get better access to data than others, and as long as they charge sufficiently, they’ll be sure to make profit on it. Similarly dark trading pools know how much to charge third-party data vendors for their clients’ “confidential trading information.” Awesome.

On the other hand, I’m outraged at the SEC for not picking their fights better and for general incompetence. Here they are nabbing firms for real fraud, and they can’t get more than $800,000? At the same time, they’ve decided to go into high frequency trading but what that seems to mean to them is that they’ll finally collect some tick data. I’ve got some news for them: it’s gonna take more than a little bit of data to understand that world.

The SEC needs to concentrate more on not trying to keep up with the HFT’ers of the world, since it’s a lost cause, and spend more time thinking through what policy changes they’d need to actually do their job well – for example, what would they need to get Citigroup and Bank of America to admit wrongdoing when they defraud their customers? Instead of wasting their time trying to keep up with HFT quants, what would they need to institute a transaction tax, or some other policy to slow down trading? What would they need to be able to shut down firms who sell confidential client trading information?

The SEC needs to write a list of policy demands, pronto.

And if the political pressure the SEC receives to not actually get anyone in trouble is too strong for them to do their job well, they should either quit in protest or make a huge stink about being kept from completing their mission.

I get it, I’ve talked to people inside the SEC who want to do a better job but feel like they aren’t being given the power to. But I say, enough with the resigned shrugs already, this stuff is out of control! Continuing in this way is giving the public the false impression that there’s someone on the case. Well, there’s someone on the case, all right, but they aren’t being allowed to or don’t see the point of doing their work. It’s bullshit.

I say dissolve the SEC so that people will no longer have any false hopes of meaningful financial reform.

I’ve been reading Sheila Bair’s book Bull by the Horns, and it’s really good. Maybe by the end of it I’ll have changed my mind and I’ll see a place for the SEC. Maybe I’ll have hope that these things have natural cycles and the SEC will have another day in the power position, like it had in the 1980’s. But right now I’m in the part of the book where the regulators, apart from the FDIC, are taking orders directly from financial lobbyists, and it makes me completely crazy.

Categories: finance, rant

Columbia Data Science course, week 5: GetGlue, time series, financial modeling, advanced regression, and ethics

October 5, 2012 Comments off

I was happy to be the guest lecturer in Rachel Schutt’s Columbia Data Science course this week, where I discussed time series, financial modeling, and ethics. I blogged previous classes here.

The first few minutes of class were for a case study with GetGlue, a New York-based start-up that won the Mashable breakthrough start-up of the year in 2011 and is backed by some of the VCs that also fund big names like Tumblr, etsy, foursquare, etc. GetGlue is part of the social TV space. Their lead scientist, Kyle Teague, came to tell the class a little bit about GetGlue and some of what he worked on there. He also came to announce that GetGlue was giving the class access to a fairly large data set of user check-ins to tv shows and movies. Kyle’s background is in electrical engineering; he placed in the 2011 KDD cup (which we learned about last week from Brian), and he started programming when he was a kid.

GetGlue’s goal is to address the problem of content discovery, primarily within the movie and tv space. The usual model for finding out what’s on TV is the 1950’s TV Guide schedule, and that’s still how we’re supposed to find things to watch. There are thousands of channels and it’s getting increasingly difficult to find out what’s good on. GetGlue wants to change this model by giving people personalized TV recommendations and personalized guides. There are other ways GetGlue uses data science, but for the most part we focused on how the recommendation system works. Users “check in” to tv shows, which means they can tell people they’re watching a show. This creates a time-stamped data point. They can also do other actions such as liking or commenting on the show. So each data point is a triple {user, action, object}, where the object is a tv show or movie. This induces a bi-partite graph: a bi-partite graph or network contains two types of nodes, here users and tv shows, and edges exist between users and tv shows, but never between two users or between two tv shows. So Bob and Mad Men are connected because Bob likes Mad Men, and Sarah is connected to both Mad Men and Lost because Sarah liked Mad Men and Lost. But Bob and Sarah aren’t connected, nor are Mad Men and Lost. A lot can be learned from this graph alone.
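To make the data structure concrete, here’s a tiny python sketch of the check-in triples and the bipartite graph they induce, using the made-up Bob/Sarah example from above (networkx is my choice for illustration, not necessarily what GetGlue uses):

```python
import networkx as nx

# each check-in is a {user, action, object} triple
checkins = [
    ("Bob", "like", "Mad Men"),
    ("Sarah", "like", "Mad Men"),
    ("Sarah", "like", "Lost"),
]

G = nx.Graph()
for user, action, show in checkins:
    G.add_node(user, kind="user")
    G.add_node(show, kind="show")
    G.add_edge(user, show, action=action)

# users only connect to shows, never to each other (and vice versa)
print(nx.is_bipartite(G))             # True
print(sorted(G.neighbors("Sarah")))   # ['Lost', 'Mad Men']
```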

But GetGlue finds ways to create edges between users and between objects (tv shows, or movies.) Users can follow each other or be friends on GetGlue, and also GetGlue can learn that two people are similar[do they do this?]. GetGlue also hires human evaluators to make connections or directional edges between objects. So True Blood and Buffy the Vampire Slayer might be similar for some reason and so the humans create an edge in the graph between them. There were nuances around the edge being directional. They may draw an arrow pointing from Buffy to True Blood but not vice versa, for example, so their notion of “similar” or “close” captures both content and popularity. (That’s a made-up example.) Pandora does something like this too.

Another important aspect is time. The user checked in or liked a show at a specific time, so the triple extends to a 4-tuple with a time-stamp: {user, action, object, timestamp}. This is essentially the data set the class has access to, although it’s slightly more complicated and messy than that. Their first assignment with this data will be to explore it, try to characterize it and understand it, gain intuition around it, and visualize what they find.

Students in the class asked him questions about the value of formal education in becoming a data scientist (do you need one? Kyle’s time spent doing signal processing in research labs was valuable, but so was his time spent coding for fun as a kid), what would be messy about a data set and why (often bugs in the code), how they would know (QA, and values that don’t make sense), what language he uses to prototype algorithms (python), and how he knows his algorithm is good.

Then it was my turn. I started out with my data scientist profile:

As you can see, I feel like I have the most weakness in CS. Although I can use python pretty proficiently, and in particular I can scrape and parse data, prototype models, and use matplotlib to draw pretty pictures, I am no java map-reducer and I bow down to those people who are. I am also completely untrained in data visualization but I know enough to get by and give presentations that people understand.

Thought Experiment

I asked the students the following question:

What do you lose when you think of your training set as a big pile of data and ignore the timestamps?

They had some pretty insightful comments. One thing they mentioned off the bat is that you won’t know cause and effect if you don’t have any sense of time. Of course that’s true but it’s not quite what I meant, so I amended the question to allow you to collect relative time differentials, so “time since user last logged in” or “time since last click” or “time since last insulin injection”, but not absolute timestamps.

What I was getting at, and what they came up with, was that when you ignore the passage of time through your data, you ignore trends altogether, as well as seasonality. So for the insulin example, you might note that 15 minutes after your insulin injection your blood sugar goes down consistently, but you might not notice an overall trend of your rising blood sugar over the past few months if your dataset for the past few months has no absolute timestamp on it.

This idea, of keeping track of trends and seasonalities, is very important in financial data, and essential to keep track of if you want to make money, considering how small the signals are.

How to avoid overfitting when you model with time series

After discussing seasonality and trends in the various financial markets, we started talking about how to avoid overfitting your model.

Specifically, I started out with having a strict concept of in-sample (IS) and out-of-sample (OOS) data. Note the OOS data is not meant as testing data- that all happens inside the IS data. It’s meant to be the data you use after finalizing your model so that you have some idea how the model will perform in production.

Next, I discussed the concept of causal modeling. Namely, we should never use information in the future to predict something now. Similarly, when we have a set of training data, we don’t know the “best fit coefficients” for that training data until after the last timestamp on all the data. As we move forward in time from the first timestamp to the last, we expect to get different sets of coefficients as more events happen.

One consequence of this is that, instead of getting one set of coefficients, we actually get an evolution of each coefficient. This is helpful because it gives us a sense of how stable those coefficients are. In particular, if one coefficient has changed sign 10 times over the training set, then we expect a good estimate for it is zero, not the so-called “best fit” at the end of the data.

One last word on causal modeling and IS/OOS. It is consistent with production code. Namely, you are always acting, in the training and in the OOS simulation, as if you’re running your model in production and you’re seeing how it performs. Of course you fit your model in sample, so you expect it to perform better there than in production.

Another way to say this is that, once you have a model in production, you will have to make decisions about the future based only on what you know now (so it’s causal) and you will want to update your model whenever you gather new data. So your coefficients of your model are living organisms that continuously evolve.

Submodels of Models

We often “prepare” the data before putting it into a model. Typically the way we prepare it has to do with the mean or the variance of the data, or sometimes the log (and then the mean or the variance of that transformed data).

But to be consistent with the causal nature of our modeling, we need to make sure our running estimates of mean and variance are also causal. Once we have causal estimates of our mean \overline{y} and variance \sigma_y^2, we can normalize the next data point with these estimates just like we do to get from a gaussian distribution to the standard gaussian distribution:

y \mapsto \frac{y - \overline{y}}{\sigma_y}
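Here’s a bare-bones python sketch of that causal normalization, using simple expanding-window estimates of the mean and standard deviation (exponential downweighting, which I get to below, is another choice); the minimum history length is an arbitrary illustrative number:

```python
import numpy as np

def causal_normalize(y, min_history=20):
    """Normalize each point using the mean and std of the data strictly
    before it, so no information from the future ever leaks in."""
    z = np.full(len(y), np.nan)
    for t in range(min_history, len(y)):
        past = y[:t]                      # only data before time t
        mu, sigma = past.mean(), past.std()
        if sigma > 0:
            z[t] = (y[t] - mu) / sigma
    return z
```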

Of course we may have other things to keep track of as well to prepare our data, and we might run other submodels of our model. For example we may choose to consider only the “new” part of something, which is equivalent to trying to predict something like y_t - y_{t-1} instead of y_t. Or we may train a submodel to figure out what part of y_{t-1} predicts y_t, so a submodel which is a univariate regression or something.

There are lots of choices here, but the point is it’s all causal, so you have to be careful when you train your overall model how to introduce your next data point and make sure the steps are all in order of time, and that you’re never ever cheating and looking ahead in time at data that hasn’t happened yet.

Financial time series

In finance we consider returns, say daily. And it’s not percent returns, actually it’s log returns: if F_t denotes a close on day t, then the return that day is defined as log(F_t/F_{t-1}). See more about this here.
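In python this is a one-liner; the closing levels below are made up, just to show the shape of the computation:

```python
import numpy as np

closes = np.array([1400.0, 1417.3, 1403.3, 1433.2])  # made-up closing levels
log_returns = np.diff(np.log(closes))                # log(F_t / F_{t-1})
print(log_returns)  # one fewer entry than there are closes
```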

So if you start with S&P closing levels:

Then you get the following log returns:

What’s that mess? It’s crazy volatility caused by the financial crisis. We sometimes (not always) want to account for that volatility by normalizing with respect to it (described above). Once we do that we get something like this:

Which is clearly better behaved. Note this process is discussed in this post.

We could also normalize with respect to the mean, but we typically assume the mean of daily returns is 0, so as to not bias our models on short term trends.

Financial Modeling

One thing we need to understand about financial modeling is that there’s a feedback loop. If you find a way to make money, it eventually goes away- sometimes people refer to this as the fact that the “market learns over time”.

One way to see this is that, in the end, your model comes down to knowing some price is going to go up in the future, so you buy it before it goes up, you wait, and then you sell it at a profit. But if you think about it, your buying it has actually changed the process, and decreased the signal you were anticipating. That’s how the market learns – it’s a combination of a bunch of algorithms anticipating things and making them go away.

The consequence of this learning over time is that the existing signals are very weak. We are happy with a 3% correlation for models that have a horizon of 1 day (a “horizon” for your model is how long you expect your prediction to be good). This means not much signal, and lots of noise! In particular, lots of the machine learning “metrics of success” for models, such as measurements of precision or accuracy, are not very relevant in this context.

So instead of measuring accuracy, we generally draw a picture to assess models, namely of the (cumulative) PnL of the model. This generalizes to any model as well- you plot the cumulative sum of the product of demeaned forecast and demeaned realized. In other words, you see if your model consistently does better than the “stupidest” model of assuming everything is average.

If you plot this and you drift up and to the right, you’re good. If it’s too jaggedy, that means your model is taking big bets and isn’t stable.
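Here’s a sketch of that picture in python, with made-up forecasts and realized values; the deliberately weak signal is meant to mimic the low-correlation setting described above:

```python
import numpy as np
import matplotlib.pyplot as plt

def cumulative_pnl(forecast, realized):
    """Running sum of demeaned forecast times demeaned realized."""
    f = forecast - forecast.mean()
    r = realized - realized.mean()
    return np.cumsum(f * r)

rng = np.random.default_rng(0)
signal = rng.normal(0, 1, 500)
realized = signal + rng.normal(0, 3, 500)   # mostly noise, a little signal
forecast = signal                           # a forecast with a weak edge

plt.plot(cumulative_pnl(forecast, realized))
plt.title("cumulative demeaned forecast * demeaned realized")
plt.show()
```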

Why regression?

From above we know the signal is weak. If you imagine there’s some complicated underlying relationship between your information and the thing you’re trying to predict, get over knowing what that is – there’s too much noise to find it. Instead, think of the function as possibly complicated, but continuous, and imagine you’ve written it out as a Taylor Series. Then you can’t possibly expect to get your hands on anything but the linear terms.

Don’t think about using logistic regression, either, because you’d need to be ignoring size, which matters in finance- it matters if a stock went up 2% instead of 0.01%. But logistic regression forces you to have an on/off switch, which would be possible but would lose a lot of information. Considering the fact that we are always in a low-information environment, this is a bad idea.

Note that although I’m claiming you probably want to use linear regression in a noisy environment, the actual terms themselves don’t have to be linear in the information you have. You can always take products of various terms as x’s in your regression, but you’re still fitting a linear model in non-linear terms.

Advanced regression

The first thing I need to explain is the exponential downweighting of old data, which I already used in a graph above, where I normalized returns by volatility with a decay of 0.97. How do I do this?

Working from this post again, the formula is given by essentially a weighted version of the normal one, where I weight recent data more than older data, and where the weight of older data is a power of some parameter s which is called the decay. The exponent is the number of time intervals since that data was new. Putting that together, the formula we get is:

V_{old} = (1-s) \cdot \sum_i r_i^2 s^i.

We are actually dividing by the sum of the weights, but the weights are powers of some number s, so it’s a geometric sum and the sum is given by 1/(1-s).

One cool consequence of this formula is that it’s easy to update: if we have a new return r_0 to add to the series, then it’s not hard to show we just want

V_{new} = s \cdot V_{old} + (1-s) \cdot r_0^2.

In fact this is the general rule for updating exponential downweighted estimates, and it’s one reason we like them so much- you only need to keep in memory your last estimate and the number s.

How do you choose your decay length? This is an art instead of a science, and depends on the domain you’re in. Think about how many days (or time periods) it takes to weight a data point at half of a new data point, and compare that to how fast the market forgets stuff.
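Here are the update rule and the half-life heuristic in python; the decay of 0.97 is the one I used for the volatility-normalized plot above:

```python
import numpy as np

def update_ew_variance(v_old, r_new, s=0.97):
    """One step of the exponentially downweighted variance estimate:
    V_new = s * V_old + (1 - s) * r_new**2."""
    return s * v_old + (1.0 - s) * r_new ** 2

def half_life(s):
    """Number of time steps until an old point's weight falls to half that
    of a brand-new point: solve s**n = 1/2 for n."""
    return np.log(0.5) / np.log(s)

print(half_life(0.97))  # about 22.8 time periods for a decay of 0.97
```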

This downweighting of old data is an example of inserting a prior into your model, where here the prior is “new data is more important than old data”. What are other kinds of priors you can have?

Priors

Priors can be thought of as opinions like the above. Besides “new data is more important than old data,” we may decide our prior is “coefficients vary smoothly.” This is relevant when we decide, say, to use a bunch of old values of some time series to help predict the next one, giving us a model like:

y = F_t = \alpha_0 + \alpha_1 F_{t-1} + \alpha_2 F_{t-2} + \epsilon,

which is just the example where we take the last two values of the time series F to predict the next one. But we could use more than two values, of course.

[Aside: in order to decide how many values to use, you might want to draw an autocorrelation plot for your data.]

The way you’d place the prior about the relationship between coefficients (in this case consecutive lagged data points) is by adding a matrix to your covariance matrix when you perform linear regression. See more about this here.
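Here’s a sketch of what “adding a matrix to your covariance matrix” can look like in python. The smoothness penalty below, which penalizes differences between consecutive lagged coefficients, is one illustrative choice of prior; a plain ridge prior would add lambda times the identity instead, and the lambda value is up to you:

```python
import numpy as np

def regression_with_prior(X, y, penalty):
    """Solve (X^T X + penalty) beta = X^T y instead of the usual normal equations."""
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

def smoothness_penalty(n_coef, lam):
    """lam * D^T D, where D takes differences of consecutive coefficients,
    so large jumps between neighboring coefficients get penalized."""
    D = np.diff(np.eye(n_coef), axis=0)
    return lam * (D.T @ D)

# e.g. for the two-lag model above, the columns of X would be [1, F_{t-1}, F_{t-2}]
```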

Ethics

I then talked about modeling and ethics. My goal is to get this next-gen group of data scientists sensitized to the fact that they are not just nerds sitting in the corner but have increasingly important ethical questions to consider while they work.

People tend to overfit their models. It’s human nature to want your baby to be awesome. They also underestimate the bad news and blame other people for bad news, because nothing their baby has done or is capable of is bad, unless someone else made them do it. Keep these things in mind.

I then described what I call the deathspiral of modeling, a term I coined in this post on creepy model watching.

I counseled the students to

  • try to maintain skepticism about their models and how their models might get used,
  • shoot holes in their own ideas,
  • accept challenges and devise tests as scientists rather than defending their models using words – if someone thinks they can do better, then let them try, and agree on an evaluation method beforehand,
  • In general, try to consider the consequences of their models.

I then showed them Emanuel Derman’s Hippocratic Oath of Modeling, which was made for financial modeling but fits perfectly into this framework. I discussed the politics of working in industry, namely that even if they are skeptical of their model there’s always the chance that it will be used the wrong way in spite of the modeler’s warnings. So the Hippocratic Oath is, unfortunately, insufficient in reality (but it’s a good start!).

Finally, there are ways to do good: I mentioned stuff like DataKind. There are also ways to be transparent: I mentioned Open Models, which is so far just an idea, but Victoria Stodden is working on RunMyCode, which is similar and very awesome.

Next-Gen Data Scientists

This is written by Rachel Schutt and crossposted from her Columbiadatascience blog

Data is information and is extremely powerful. Models and algorithms that use data can literally change the world. Quantitatively-minded people have always been able to solve important problems, so this is nothing new, and there’s always been data, so this is nothing new.

What is new is the massive amounts of data we have on all aspects of our lives, from the micro to the macro. The data we have from government, finance, education, the environment, social welfare, health, entertainment, and the internet will be used to make policy decisions and to build products back into the fabric of our culture.

I want you, my students, to be the ones doing it. I look around the classroom and see a group of thoughtful, intelligent people who want to do good, and are absolutely capable of doing it.

I don’t call myself a “data scientist”. I call myself a statistician. I refuse to be called a data scientist because as it’s currently used, it’s a meaningless, arbitrary marketing term. However, the existence of the term, and apparent “sexiness” of the profession draws attention to data and opens up opportunities. So we need Next-Gen Data Scientists. That’s you! Here’s what I mean when I say Next-Gen Data Scientist:

  • Next-Gen Data Scientists have humility. They don’t lie about their credentials and they don’t spend most of their efforts on self-promotion.
  • Next-Gen Data Scientists have integrity. Their work is not about trying to be “cool” or solving some “cool” problem. It's about being a problem solver and finding simple, elegant solutions (or complicated ones, if necessary).
  • Next-Gen Data Scientists don’t try to impress with complicated algorithms and models that don’t work.
  • Next-Gen Data Scientists spend a lot more time trying to get data into shape than anyone cares to admit.
  • Next-Gen Data Scientists have the experience or education to actually know what they’re talking about. They’ve put their time in.
  • Next-Gen Data Scientists are skeptical – skeptical about models themselves and how they can fail and the way they’re used or can be misused.
  • Next-Gen Data Scientists make sure they know what they’re talking about before running around trying to show everyone else they exist.
  • Next-Gen Data Scientists have a variety of skills, including coding, statistics, machine learning, visualization, communication, and math.
  • Next-Gen Data Scientists do enough Science to merit the word “Scientist”, someone who tests hypotheses and welcomes challenges and alternative theories.
  • Next-Gen Data Scientists are solving a new breed of problem that surrounds the structure and exploration of data and the computational issues surrounding it.
  • Next-Gen Data Scientists don’t find religion in tools, methods or academic departments. They are versatile and interdisciplinary.
  • Next-Gen Data Scientists are highly skilled and ought to get paid well enough that they don't have to worry too much about money.
  • Next-Gen Data Scientists don’t let money blind them to the point that their models are used for unethical purposes.
  • Next-Gen Data Scientists seek out opportunities to solve problems of social value.
  • Next-Gen Data Scientists understand the implications and consequences of the models they’re building.
  • Next-Gen Data Scientists collaborate and cooperate.
  • Next-Gen Data Scientists bring their humanity with them to problem solving, and algorithm/model-building.
Categories: data science, guest post

Knitting porn

I owe you guys a post on my talk last night at Rachel Schutt’s Data Science course at Columbia (which I’ve been blogging about for the past four weeks here). Yesterday I spoke about time series, financial modeling, and ethics.

But unfortunately, right now I’m tending to my 3-year-old, who was up all night sick. While you wait I thought I’d show you some knitting porn I can’t get enough of:

Categories: musing

Bad news wish list

You know that feeling you get when, a few years after you went to a wedding of your friends, you find out they’re getting a divorce?

It’s not a nice feeling. It’s work for you, and nasty work at that: you have to go back over your memories of those two in the past years, where you’d been projecting happiness and contentment all this time, and replace it with argument and bitterness. Not to mention the sorrow and sympathy you naturally bestow on your friends.

If it happens enough times, which it has to me, then going to weddings at all is kind of a funereal affair. I no longer project happy thoughts towards the newly married couple. If anything I worry for them and cross my fingers, hoping for the best. You may even say I’ve lost my faith in the institution.

Considering this, I can kind of understand why some religions don’t allow divorce. If you don’t allow it, then the bad news will never come out, and you won’t have to retroactively fit your internal model of other people’s lives to reality. You can go on blithely assuming everyone’s doing great. While we’re at it, no kids are getting neglected or abused because we don’t talk about that kind of thing.

By way of unreasonable analogy, I'd like to discuss the lack of conversation we've seen from the presidential campaigns on both sides about the state of the financial system. I'm starting to think it's part of the religion of politicians that they never talk about this stuff, because they treat it as an embarrassing failure along the lines of a Catholic divorce.

Or maybe I don't have to be so philosophical about it – is it religion, or is it just money?

I had trouble following much about the two national conventions, because it made me so incensed that nothing was really being discussed, and that it was all so full of shit. But one thing I managed to glean from the coverage of the “events” sponsored by the various lobbyist groups at the two conventions is that, whereas most lobbyists sponsor events at only one of the conventions – the NRA sponsors something at the Republican convention, the unions sponsor stuff at the Democratic convention – the financial lobbyists sponsor huge swanky events at both.

I interpret this to mean that they are paying to not be discussed as a platform issue. They seem to have paid enough, because I don’t hear anything from the Romney camp about shit Obama has or hasn’t done, or shit Geithner has or hasn’t done.

In fact, there's a "Stories I'd Like to See" column in Reuters entitled "Tales of a TARP built to benefit bankers, and waiting for CEOs to pay the price", written by Stephen Brill, which discusses this exact issue in the context of Neil Barofsky's book Bailout, which I blogged about here. From the column:

A presidential campaign that wanted to call out the Obama administration for being too friendly to Wall Street and the banks at the expense of Main Street would be using Bailout as the cheat sheet that keeps on giving. But with the Romney campaign’s attack coming from the opposite direction – that the president and his team have killed the economy by shackling Wall Street – and with Romney on record in favor of allowing the mortgage crisis to “bottom out” with no government intervention, the former Massachusetts governor and his team have no use for Bailout.

The second half of the article is really good, asking very commonsensical questions about the recent settlement BofA got from the SEC for blatantly lying to shareholders around the time they acquired Merrill Lynch. Specifically, the author notes that the (current) shareholders are left paying the (2008) shareholders, which is dumb, but the asshole Ken Lewis, who actually lied, doesn't seem to be getting into any trouble at all. From the column:

And, as long as we’re talking about harm done to shareholders, why wouldn’t we now see a new, post-settlement shareholders’ suit not against the company but targeted only at Lewis and some of his former colleagues who got Bank of America into this jam in the first place and just caused it to pay out $2.4 billion? (The plaintiffs here could be any current shareholders, because they are the ones who are writing the $2.4 billion check.) Again, did the company indemnify Lewis and other executives against shareholder suits, meaning that if a shareholder now sues Lewis over this $2.4 billion settlement, the shareholder is once again only suing himself?

Can someone please sort this out?

I really like this idea, that we have a list of topics for people to sort out, even though it’s going to be bad news. What other topics should we ask for on our bad news wish list?

Categories: #OWS, finance, news

Student loans are a regressive tax

I don’t think this approach of looking at student loans is new, but it’s new to me. A friend of mine mentioned this to me over the weekend.

For simplicity, assume everyone goes to college. Next, assume they all go to similar colleges – similar in cost and in quality. We will revisit these assumptions later. Finally, assume that costs of college keep going up the way they’re going and that student loan interest rates stay high.

What this means when you put it all together is that sufficiently rich people, or more likely their parents, will pay a one-time very large fee to attend college, but then they’ll be done with it. The rest of the people will be stuck paying monthly fees that will never go away. Moreover, because the interest rates are pretty high, the total amount non-rich people pay over their lifetime is substantially more than what rich people pay.

This is essentially a regressive tax, whereby poor people pay more than rich people.
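To make the arithmetic concrete, here's a back-of-the-envelope sketch; the loan size, rate, and term are made-up illustrative numbers, not data:

```python
def total_paid(principal, annual_rate, years):
    """Total paid over the life of a standard fixed-payment loan."""
    r = annual_rate / 12.0                        # monthly interest rate
    n = years * 12                                # number of monthly payments
    monthly = principal * r / (1 - (1 + r) ** -n)
    return monthly * n

# A family that writes one check pays $50,000 and is done. A family that
# borrows the same $50,000 at 7% over 20 years pays roughly $93,000.
print(total_paid(50_000, 0.07, 20))
```

In this toy example the borrower ends up paying nearly twice as much as the up-front payer for the same education.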

Other points:

  1. Government student loans don't have extremely high interest rates, but there's a limit on how much you can borrow through that program, which leads many people even now to borrow privately at much higher rates.
  2. In the case of government-backed student loans this “tax” is essentially going to the government. In the case of private student loans, the private creditors are receiving the tax.
  3. Since you can't discharge student debt via bankruptcy, even private student debt, it really is a life-long tax. It's even true that if you haven't paid off your student debt by the time you retire, your Social Security payments can get cut.
  4. What about our assumption that all schools have the same quality? Not true. Rich people tend to go to better schools. This means the poor are paying a tax for an inferior service. Of course, it's also true that truly elite schools like Harvard have excellent financial support for their poorer students. This means there's a two-tier school system if you're poor: you can go to a normal school and pay the tax, or you can excel and get into an elite school and it will be free.
  5. What about our assumption that all schools have the same cost? Of course not true; we can look for better quality education for a reasonable price.
  6. What about our assumption that everyone goes to college? Not true, but it's still true that going to college and finishing sets you up for far better wage earning than if you only have a high school diploma. And although going to college and not finishing may not, nobody thinks they're the one who won't finish.

Conclusion: Either we have to keep costs down, or make college government-subsidized, or make student loan interest rates really low, or offset this regressive tax with a highly progressive income tax.

Categories: #OWS, finance

High frequency trading: how it happened, what’s wrong with it, and what we should do

High frequency trading (HFT) is in the news. Politicians and regulators are thinking of doing something to slow stuff down. The problem is that it's really complicated to understand in depth and to add rules to in a nuanced way. So if we want to do anything at all, we'll have to do something pretty simple and stupid.

How it happened

In some ways HFT is the inevitable consequence of market forces – one has an advantage when one makes a good decision more quickly, so there was always going to be some pressure to speed up trading, to get that technological edge on the competition.

But there was something more at work here too. The NYSE used to be a non-profit mutual, co-owned by every broker who worked there. When it transformed into a profit-seeking enterprise, and when other exchanges popped up in competition with it, the age of HFT began.

All of a sudden, to make an extra buck, it made sense to allow someone to be closer and have better access, for a hefty fee. And there was competition among the various exchanges for that excellent access. Eventually this market for exchange access culminated in the concept of co-location, whereby trading firms were allowed to put their trading algorithms on servers in the same room as the servers that executed the trades. This avoids those pesky speed-of-light issues when sitting across the street from the executing servers.

Not surprisingly, this has allowed the execution of trades to get into the mind-splittingly small timeframe of double-digit microseconds. That's microseconds – as Wikipedia puts it, "One microsecond is to one second as one second is to 11.54 days."

What’s wrong with it

Turns out, when things get this fast, sometimes mistakes happen. Sometimes errors occur. I’m writing in the third-person passive voice because we are no longer talking directly about human involvement, or even, typically, a single algorithm, but rather the combination of a sea of algorithms which together can do unexpected things.

People know about the so-called "flash crash" and, more recently, Knight Capital's trading debacle, where an algorithm went crazy with orders at the opening bell. But people on the inside, if you point out these events, might counter that "normal people didn't lose money" in them. The weirdness was mostly fixed after the fact, and anyway pension funds, which is where most normal people's money lives, don't ever trade in the thin opening-bell market.

But there's another, less well known example from September 30th, 2008, when the House rejected the bailout, shorting stocks was illegal, and the Dow dropped 778 points. The prices of such common big-ticket stocks as Google plummeted and, in this case, pension funds lost big money. It's true that some transactions were later nulled, but not all of them.

This happened because the market makers of the time had largely pulled their models out of the market after shorting became illegal – there was no "do this algorithm, except make sure you're never short" button, so once the ban kicked in, the traders could only turn the whole thing off. As a result, the liquidity wasn't there, and the pension funds, thinking they were being smart to do their big trades at the close, instead got completely walloped.

Keep this in mind before you blame the politicians for this one just because the immediate cause was the short-sighted short-selling ban: HFT firms regularly pull out of the market in times of stress, or when they're updating their algorithms, or just whenever they want. In other words, it's liquidity when you need it least.

Moreover, just because two of these three episodes were relatively benign for the 99%, we should not conclude that there's nothing potentially disastrous going on. The flash crash and Knight Capital have had an impact: they erode our trust in the system as a whole. The 2008 episode, on top of that, proved that yes, we can be the victims of out-of-control machines fighting against each other.

Quite aside from the instability of the system, and how regular people get screwed by insiders (because after all, that’s not a new story at all, it’s just a new technology for an old story), let’s talk about resources. How much money and resources are being put into the HFT arena and how could those resources otherwise be used?

Putting aside the actual energy consumed by the industry, which is certainly non-trivial, let's focus for a moment on money. It has been estimated that, overall, HFT firms post about $80 billion in profits yearly, and that they make on the order of a 10% return on their technology investments. That would mean there's on the order of $800 billion being invested in HFT each year. Even if we highball the return at 25%, we still have more than $300 billion invested in this stuff.
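Spelling out that back-of-the-envelope arithmetic, using only the figures quoted above: $\$80B / 0.10 = \$800B$ of capital per year, and even at a 25% return, $\$80B / 0.25 = \$320B$.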

And to what end?

Is that how much it's really worth to the small investor to have decreased bid-ask spreads when they go long Apple because they think the new iPhone will sell? What else could we be doing with $800 billion? A couple of years of this could pay off all of the student debt in this country.

What should be done

Germany has recently announced a half-second minimum for posting a share order. This is eons in current time frames, and it would drastically change how trading is done. They also want HFT algorithms to be registered with them. You know, so people can keep tabs on the algorithms and understand what they're doing and how they might interact with each other.

Um, what? As a former quant, let me just say: this will not work. Not a chance in hell. If I want to obfuscate the actual goals of a model I’ve written, that’s easier than actually explaining it. Moreover, the half-second rule may sound good but it just means it’s a harder system to game, not that it won’t be gameable.

Other ideas have been brought forth as to how to slow down trading, but in the end it’s really hard to do: if you put in delays, there’s always going to be an algorithm employed which decides whose trade actually happens first, and so there will always be some advantage to speed, or to gaming the algorithm. It would be interesting but academically challenging to come up with a simple enough rule that would actually discourage people from engaging in technological warfare.

The only sure-fire way to make people think harder about trading so quickly and so often is a simple tax on transactions, often referred to as a Tobin Tax. This would force people to have enough faith in a trade to pay the tax on top of its expected value.

And we can't just implement such a tax on one market, like they do for equities in London. It has to be on all exchange-traded markets, and moreover all reasonable markets should be exchange-traded.

Oh, and while I'm smoking crack, let me also say that when exchanges are found to have given certain of their customers better access to prices, the punishment for such illegal insider information should be more than $5 million.

Categories: #OWS, finance, hedge funds, rant