mathbabe

What’s wrong with Wall Street and what should be done about it?

October 9, 2011 Cathy O'Neil, mathbabe 33 comments

I am trying to figure out the top five (or so) most important corrupt and actionable issues related to the financial system. I’m going to compile this list in order to conduct a “teach-in” at the Occupy Wall Street protest next week. The tentative date is Wednesday, October 12, at 5:30pm.

I’d love to hear your thoughts: please tell me if I’m missing something or got something wrong or left something out.

The list I have so far:

Investment bankers trading their books and taking outrageous risks which lead to government-backed bailouts because they are “too big to fail”. The related action in the U.S. might be the “Volcker rule” (i.e. reinstating something like Glass-Steagall); unfortunately it’s being watered down as you read this.
Ratings agencies in collusion with their clients. The actions here would be changing the pay structure of the ratings agencies and opening up the methods, as well as having better regulatory oversight. We also need to change the structure of ratings agencies, and either make it easier to form an agency or make the agencies that already exist and have government protection actually accountable for their “opinions”.
SEC and other regulators in collusion with the industry. The action here would be to nurture and maintain an adversarial relationship between regulators and bankers. We’ve seen too many people skip from the SEC to the banks they were regulating and then back. There should be rules against this (how about a minimum time requirement of 5 years between jobs on the opposite sides?). There should also be much better funding for the SEC and the other regulators, so they can actually meet their expanded mandate.
Conflict of interest issues from economists and business school professors. If you’ve seen “Inside Job” then you’ll know all about how professors at various universities use their credentials to back up questionable practices. Moreover, they are often not even required to expose their industry connections when they do expert witnessing or write “academic” papers. The action here would be, at the very least, to force full disclosure for all such appearances and all publications. I’ve heard some good news in this direction but there obviously should be a standard.
Rampant buying of politicians and influence of lobbyists from the financial industry. This is maybe more of a political problem than a financial one so I’m willing to chuck this off the list. Please tell me if you have something else in mind. Someone has suggested the opaque and elevated pension fund management system. Although I consider that pretty corrupt, I’m not sure it’s as important as other issues to the average person. I’m on the fence.

Categories: #OWS, finance, news, rant

Saturday afternoon quickie

October 8, 2011 Cathy O'Neil, mathbabe 13 comments

Two things.

If I see another fucking article about how the world is going to miss Steve Jobs I’m going to puke. He made and sold overpriced gadgets for fucks sake! It’s hero worship plain and simple, maybe even a sick cult.
I am happy that I’ve been invited to give a “teach-in” at Occupy Wall Street next Wednesday at 5:30 (tentative date and time). I’ve promised an overview of the 5 top corrupt things in the financial system. I’d really appreciate your thoughts: what is your top 5 list? I want them to be both important and relatively actionable. So far I’ve got:

Volcker rule (i.e. reinstating something like Glass-Steagall); it’s being watered down as you read this.
Ratings agencies in collusion with their clients
SEC and other regulators in collusion with the industry
Rampant buying of politicians and influence of lobbyists from the financial industry
Incredibly poor incentives for the individuals in the industry, both in terms of salary and whistleblowing

Categories: #OWS, finance, news, rant

Habits

October 8, 2011 Cathy O'Neil, mathbabe 2 comments

This is a guest post by my friend Tara Mathur:

I don’t need to read Tiger Mother to know that I don’t have one. I don’t remember either of my parents putting a lot of pressure on me to do things – even to study, although I developed that habit on my own.

As kids we develop some habits on our own, but we pick up a lot of habits from our parents.

We learn habits from our parents in a few ways. One is by mirroring them. For example, my parents have always read in bed before going to sleep and so have I; it’s so natural to me that until I got married I thought this was something everyone did.

Another is by having our parents make us do something repeatedly. For example, when we first brushed our teeth it probably seemed like a pain to do, but our parents kept making us do it, and it became automatic.

How can we cultivate new habits as adults?

(And am I the only one who associates the word “will-power” with pain and failure? People use that word when they’re talking about doing something really hard, against their natural tendencies. I hear that word and think, how is this gonna last?)

In the last few years I’ve become a big fan of a blog called Zen Habits written by Leo Babauta. He’s made big positive changes in his life – getting out of debt, quitting smoking, running marathons, starting a successful writing career – by focusing on habits rather than goals. Even though big goals are sexy and easy to get excited about, it’s the daily habits, built up baby step by baby step, which last and which comprise most of our life. By definition, when something is a habit we don’t have to rely on will-power to stick with it. It’s effortless, automatic behavior. Leo emphasizes starting small and focusing on one habit at a time.

This could apply to any positive change we’d like to make in our life. BJ Fogg, a human behavior expert who runs the Persuasive Technology Lab at Stanford, sums up the three steps to cultivate a new habit as follows:

Make it tiny. To create a new habit, you must first simplify the behavior. Make it tiny, even ridiculous. (examples: floss one tooth, walk for three minutes, do two push-ups)
Find a spot. Find a spot in your existing routine where this tiny new behavior could fit. Put it after some act that is a solid habit for you, like brushing teeth or eating lunch. One key to a new habit is this simple: you need to find what it comes after.
Train the cycle. Now focus on doing the tiny behavior as part of your routine – every day, on cycle. At first you’ll need reminders. But soon the tiny behavior will get more automatic. Keep the behavior simple until it becomes a solid habit. That’s the secret to success.

That’s it, he says. Just keep your tiny habit going. Believe in baby steps. Eventually it will naturally expand to the bigger behavior, without much effort.

(There are other tricks too. I’ve also read that you’ll pick up a habit more quickly if you surround yourself with people who already have the habit you want — though I’m not sure if it will last when you’re no longer around those people. Try it and see what works.)

Categories: guest post, Uncategorized

Financial Terms Dictionary

October 7, 2011 Cathy O'Neil, mathbabe 5 comments

I’ve got a bunch of things to mention today. First, I’ll be at M.I.T. in less than two weeks to give a talk to women in math about working in business. Feel free to come if you are around and interested!

Next, last night I signed up for this free online machine learning course being offered out of Stanford. I love this idea and I really think it’s going to catch on. There are groups here in New York that are getting together to talk about the class and do homework. Very cool!

Next, I’m going back to the protests after work. The media coverage has gotten better and Matt Stoller really wrote a great piece and called on people to stop criticizing and start helping, which is always my motto. For my part, I’m planning to set up some kind of Finance Q&A booth at the demonstration with some other friends of mine in finance. It’s going to be hard since I don’t have lots of time but we’ll try it and see. One of my artistic friends came up with this:

Finally, one last idea. I wanted to find a funny way to help people understand financial and economic stuff, so I thought of starting a “Financial Terms Dictionary”, which would start with an obscure phrase that economists and bankers use and translate it into plain English. For example, under “injection of liquidity” you might see “the act of printing money and giving it to the banks”.

I’d love comments and suggestions for the Financial Terms Dictionary! I’ll start a separate page for it if it catches on.

Categories: #OWS, data science, news, rant, women in math

Bayesian regressions (part 1)

October 6, 2011 Cathy O'Neil, mathbabe 9 comments

I’ve decided to talk about how to set up a linear regression with Bayesian priors because it’s super effective and not as hard as it sounds. Since I’m not a trained statistician, and certainly not a trained Bayesian, I’ll be coming at it from a completely unorthodox point of view. For a more typical “correct” way to look at it see for example this book (which has its own webpage).

The goal of today’s post is to abstractly discuss “bayesian priors” and illustrate their use with an example. In later posts, though, I promise to actually write and share python code illustrating bayesian regression.

The way I plan to be unorthodox is that I’m completely ignoring distributional discussions. My perspective is, I have some time series (the $x_i$’s) and I want to predict some other time series (the $y$) with them, and let’s see if using a regression will help me- if it doesn’t then I’ll look for some other tool. But what I don’t want to do is spend all day deciding whether things are in fact student-t distributed or normal or something else. I’d like to just think of this as a machine that will be judged on its outputs. Feel free to comment if this is palpably the wrong approach or dangerous in any way.

A “bayesian prior” can be thought of as equivalent to data you’ve already seen before starting on your dataset. Since we think of the signals (the $x_i$’s) and response ($y$) as already known, we are looking for the most likely coefficients $\beta_i$ that would explain it all. So the form a bayesian prior takes is: some information on what those $\beta_i$’s look like.

The information you need to know about the $\beta_i$’s is two-fold. First you need to know their values and second you need to have a covariance matrix to describe their statistical relationship to each other. When I was working as a quant, we almost always had strong convictions about the latter but not the former, although in the literature I’ve been reading lately I see more examples where the values (really the mean values) for the $\beta_i$’s are chosen but with an “uninformative covariance assumption”.

Let me illustrate with an example. Suppose you are working on the simplest possible model: you are taking a single time series and seeing how earlier values of $x$ predict the next value of $x$. So in a given update of your regression, $y = x_t$ and each $x_i$ is of the form $x_{t-a}$ for some $a > 0$.

What is your prior for this? Turns out you already have one (two actually) if you work in finance. Namely, you expect the signal of the most recent data to be stronger than whatever signal is coming from older data (after you decide how many past signals to use by first looking at a lagged correlation plot). This is just a way of saying that the sizes of the coefficients should go down as you go further back in time. You can make a prior for that by working on the diagonal of the covariance matrix.

Moreover, you expect the signals to vary continuously- you (probably) don’t expect the third-most-recent variable $x_{t-3}$ to have a positive signal but the second-most-recent variable $x_{t-2}$ to have a negative signal (especially if your lagged autocorrelation plot looks like this). This prior is expressed as a dampening of the (symmetrical) covariance matrix along the subdiagonal and superdiagonal.
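To make the two priors above concrete, here is a minimal numpy sketch of one possible way to encode them. It is an illustration under stated assumptions, not the code promised for later posts. The function names (`lagged_design`, `make_prior_covariance`, `posterior_mean`) and the parameters `decay`, `neighbor_corr` and `noise_var` are all made up for this example. The "smoothness" prior is read here as a positive correlation between neighboring coefficients, which is only one interpretation of working on the sub- and superdiagonal. The posterior mean uses the standard formula for a Gaussian prior with Gaussian noise, so it does quietly make a distributional assumption.

```python
import numpy as np


def lagged_design(x, n_lags):
    """Build the regression for predicting x[t] from its own past.

    Row t of X is (x[t-1], x[t-2], ..., x[t-n_lags]), so column 0 is
    the most recent lag. y holds the matching values x[t].
    """
    x = np.asarray(x, dtype=float)
    rows = [x[t - n_lags:t][::-1] for t in range(n_lags, len(x))]
    return np.array(rows), x[n_lags:]


def make_prior_covariance(n_lags, scale=1.0, decay=0.7, neighbor_corr=0.4):
    """Hypothetical prior covariance for the lag coefficients.

    Diagonal: variances shrink geometrically with the lag, which says
    that older data should carry smaller coefficients.
    Sub/superdiagonal: neighboring coefficients are positively
    correlated, which says the coefficients should vary smoothly.
    Keeping |neighbor_corr| <= 0.5 keeps the matrix positive definite.
    """
    variances = scale * decay ** np.arange(n_lags)
    std = np.sqrt(variances)
    corr = np.eye(n_lags)
    idx = np.arange(n_lags - 1)
    corr[idx, idx + 1] = neighbor_corr
    corr[idx + 1, idx] = neighbor_corr
    return corr * np.outer(std, std)


def posterior_mean(X, y, prior_cov, prior_mean=None, noise_var=1.0):
    """Most likely coefficients given the data and a Gaussian prior.

    Solves (X'X / noise_var + P) beta = X'y / noise_var + P mu,
    where P is the inverse of the prior covariance and mu the prior mean.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if prior_mean is None:
        prior_mean = np.zeros(X.shape[1])
    prior_precision = np.linalg.inv(prior_cov)
    lhs = X.T @ X / noise_var + prior_precision
    rhs = X.T @ y / noise_var + prior_precision @ prior_mean
    return np.linalg.solve(lhs, rhs)
```

A typical use would be `X, y = lagged_design(series, 5)` followed by `posterior_mean(X, y, make_prior_covariance(5))`. Making `scale` very large weakens the prior until the answer is essentially ordinary least squares, while a small `scale` pulls the coefficients toward zero, hardest for the oldest lags.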

In my next post I’ll talk about how to combine exponential down-weighting of old data, which is sacrosanct in finance, with bayesian priors. Turns out it’s pretty interesting and you do it differently depending on circumstances. By the way, I haven’t found any references for this particular topic so please comment if you know of any.

Categories: data science, finance, hedge funds, statistics

My friend the coffee douche

October 5, 2011 Cathy O'Neil, mathbabe 8 comments

About a year ago or so, I went with my friend to a new coffee store in lower Manhattan that he was super excited about. He knew the name of their espresso machine (the Slayer) and kept going on about how amazing the espresso made from this machine must be, if done right. I was happy to go, first because I needed coffee and second because I just like my friend and like it when people get really into things. On the way there I told him that the way he was waxing poetic about the Slayer really defined him as an all-out “coffee douche”. He took it well- in fact I think he actually loved the title. Coffee douches rarely get rewarded with titles, I realized.

I used to be a coffee douche myself. Or at least a potential coffee douche. I worked at Coffee Connection in my youth, which was eventually bought out by Starbucks but in its time gave lots of people in the Boston area pretty good coffee. I hung with the owner, especially once I decided to go to Berkeley, because that’s where he went for undergrad and where he learned to love good coffee (he told me he fell in love at Istanbul Express, I wonder if that place still exists). At some point I knew how many seconds of roasting produced each style (I never liked Italian Roast myself- too burnt) and the characteristics of the different coffees from all over the world (mmm… Sumatra).

Over time, though, I lost it. Something about having kids. I’m now at the level of carrying around Nodoz in my purse just in case I’m traveling and there’s no coffee machine in the hotel room (or in case those tiny little packages of grounds are insufficient). I still enjoy a good cup of Sumatra but I’m almost equally happy going to 7 Eleven. So you can see that coffee douchery is at best a fond memory for me.

When we got to the store, we were immediately asked at the door if we were “press”. Umm, no, what’s going on? It turned out that Sylvia was the guest barista! She was 3 time Brazilian pull champion!! I inferred that this meant there are actually competitions for making espresso. My friend was getting more and more excited and agitated. We got our pictures taken before and after the coffee drinks arrived. Or rather, our cups and saucers were- I think we may have only accidentally entered a frame or two. Sylvia was very gracious and hard-working at the same time. I think I managed to shake her hand, just for the celebrity moment of it all.

As an aside, I noticed something about the whole coffee movement thing when I was checking out Sylvia and her methods. Everything there has a fetishized whiff to it. The coffee machine was the Slayer, the various implements were made of some kind of hardwood that they were happy to explain in detail, and although I can’t remember all the names of the implements, I got the distinct impression that there may be a sex shop in the back room with leather and wooden tools very similar to the coffee tools. Maybe just me.

Here’s a close-up sexy shot of the Slayer (if you look carefully at the reflections you will note at least 3 people there admiring its shiny round parts), taken from the website of RBC coffee:

I don’t think I’ve ever been under such pressure to enjoy my espresso, but it was pretty good (I think). Near the end of drinking it, we seemed to be peppered with technical questions from the people there, including the owner of the store, the owner of the coffee plantation that supplied the store, and the guy who roasted the coffee beans. It was a triumvirate of coffee! I was glad I had my coffee douche with me!! He impressed them with his idiosyncratic knowledge (I remember his sympathy combined with pride when he mentioned that he was aware that there were laws against roasting in Manhattan but not in Brooklyn, so did they roast in Brooklyn? They did).

When I left, I was invigorated. Here are these people, completely obsessed and fascinated with coffee and everything pertaining to coffee. In some sense it struck me as a waste of time, but in a larger sense it was very very cool. That’s what’s interesting and fun about humans, after all, that they get totally nerdy and into things that other people can’t relate to, and they really improve our knowledge as a community about the best way to do that thing. There are probably people somewhere who are as into park benches as these guys are into coffee, and thanks to them the park benches are getting more and more comfy and beautifully designed and long-lasting, at least if you know where to go for really excellent park benches.

Categories: rant

Data science: tools vs. craft

October 4, 2011 Cathy O'Neil, mathbabe 11 comments

I’ve enjoyed how many people are reading the post I wrote about hiring a data scientist for a business. It’s been interesting to see how people react to it. One consistent reaction is that I’m just saying that a data scientist needs to know undergraduate level statistics.

On some level this is true: undergrad statistics majors can learn everything they need to know to become data scientists, especially if they also take some computer science classes. But I would add that it’s really not familiarity with a specific set of tools that defines a data scientist. Rather, it’s about being a craftsperson (and a salesman) with those tools.

To set up an analogy: I’m not a chef because I know about casserole dishes.

By the way, I’m not trying to make it sound super hard and impenetrable. First of all I hate it when people do that and second of all it’s not at all impenetrable as a field. In fact I’d say it the other way: I’d prefer smart nerdy people to think they could become data scientists even without a degree in statistics, because after all basic statistics is pretty easy to pick up. In fact I’ve never studied statistics in school.

To get to the heart of the matter, it’s more about what a data scientist does with their sometimes basic tools than what the tools are. In my experience the real challenges are things like

Defining the question in the first place: are we asking the question right? Is an answer to this question going to help our business? Or should we be asking another question?
Once we have defined the question, we are dealing with issues like dirty data, too little data, too much data, data that’s not at all normally distributed, or that is only a proxy to our actual problem.
Once we manhandle the data into a workable form, we encounter questions like, is that signal or noise? Are the errorbars bigger than the signal? How many more weeks or months of data collection will we need to go through before we trust this signal enough to bet the business on it?
Then of course we go back to: should we have asked a different question, one whose answer would not have been as perfect but which would definitely have given us an answer?
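The "are the errorbars bigger than the signal" question in the list above can be sketched in a few lines of Python. This is a back-of-the-envelope illustration, not a recommended methodology: the function names and the threshold `z` are invented for the example, and it assumes the observations are independent with a stable mean and spread, which real business data often violates.

```python
import math
import statistics


def signal_and_errorbar(samples):
    """Return the estimated signal (the mean) and its standard error."""
    mean = statistics.fmean(samples)
    stderr = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, stderr


def samples_needed(samples, z=2.0):
    """Rough number of observations needed before the mean stands
    z standard errors away from zero, assuming the spread seen so
    far is representative (a strong assumption)."""
    mean = statistics.fmean(samples)
    if mean == 0:
        return math.inf
    spread = statistics.stdev(samples)
    return math.ceil((z * spread / abs(mean)) ** 2)
```

If `samples_needed` comes back far larger than the data you have, that is one rough way to put a number on how many more weeks or months of collection stand between you and a signal you would bet on.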

In other words, once we boil something down to a question in statistics it’s kind of a breeze. Even so, nothing is ever as standard as you would actually find in a stats class – the chance of being asked a question like the ones in a stats class is zero. You always need to dig deeply enough into your data and the relevant statistics to understand what the basic goal of that t-test or statistic was and modify the standard methodology so that it’s appropriate to your problem.

My advice to the business people is to get someone who is really freaking smart and who has also demonstrated the ability to work independently and creatively, and who is very good at communicating. And now that I’ve written the above issues down, I realize that another crucial aspect to the job of the data scientist is the ability to create methodology on the spot and argue persuasively that it is kosher.

A useful thing for this last part is to have broad knowledge of the standard methods and to be able to hack together a bit of the relevant part of each; this requires lots of reading of textbooks and research papers. Next, the data scientist has to actually understand it sufficiently to implement it in code. In fact the data scientist should try a bunch of things, to see what is more convincing and what is easier to explain. Finally, the data scientist has to sell it to everyone else.

Come to think of it the same can be said about being a quant at a hedge fund. Since there’s money on the line, you can be sure that management wants you to be able to defend your methodology down to the tiniest detail (yes, I do think that being a quant at a hedge fund is a form of a data science job, and this ~~guy~~ woman agrees with me).

I would argue that an undergrad education probably doesn’t give enough perspective to do all of this, even though the basic mathematical tools are there. You need to be comfortable building things from scratch and dealing with people in intense situations. I’m not sure how to train someone for the latter, but for the former a Ph.D. can be a good sign, or any person that’s taken on a creative project and really made something is good too. They should also be super quantitative, but not necessarily a statistician.

Categories: data science, hedge funds, statistics

“Our organization does not reward failure” – Koch

October 3, 2011 Cathy O'Neil, mathbabe 9 comments

You have to check out this Bloomberg article about Koch Industries. Although it rambles a bit at times, it’s absolutely mesmerizing and horrible. Here’s the main premise, which bizarrely comes near the end of the article:

For six decades around the world, Koch Industries has blazed a path to riches — in part, by making illicit payments to win contracts, trading with a terrorist state, fixing prices, neglecting safety and ignoring environmental regulations. At the same time, Charles and David Koch have promoted a form of government that interferes less with company actions.

The phrase “our organization does not reward failure” comes from a book in 2007 written by one of the Koch brothers where he somehow fails to discuss a pipeline explosion that had recently killed two teenagers in Oklahoma:

The 570-mile-long pipeline carrying liquid butane from Medford, Oklahoma, to Mont Belvieu, Texas had corroded so badly that one expert, Edward Ziegler, likened it to Swiss cheese. The company didn’t give 40 of the 45 families near the explosion site — including the Smalley and Stone families — any information about what to do in case of an emergency, the NTSB wrote.

The article is complete, in that it even has a spiteful twin brother of one of the Koch brothers appearing to give away his brothers for stealing.

The Senate held hearings in May 1989 after Bill Koch, David Koch’s twin brother, told a U.S. Senate special committee on investigations that Koch Industries was stealing oil on American Indian reservations, cheating the federal government of royalties.

The investigators caught Koch Oil’s employees falsifying records so that the company would get more crude than it paid for, shortchanging Indian families, Elroy said. Koch’s records showed that the company took 1.95 million barrels of oil it didn’t pay for from 1986 to 1988, according to data compiled by the Senate.

One thing that’s fascinating to me is that there are two whistle-blowers in the story, both women who were essentially fired for having ethics (one reported on bribes and the other on toxic gas dumping, both sued the company after leaving). Doesn’t it seem like women are more often whistle-blowers? Especially if you consider the fact that high ranking people in these kinds of companies with access to the kind of information that whistle-blowers need to uncover fraud are typically men.

These Koch brothers are seriously despicable, and really all they seem to care about is the ability to make money without having to worry about rules, even basic rules of morality. They currently largely bankroll the Tea Party. It’s a scary thought that I could someday live in a country whose president owes a favor to these guys.

Categories: news, rant

First day of calculus class

October 2, 2011 Cathy O'Neil, mathbabe 12 comments

Last night I had dinner with a friend who is a post-doc in math, and she was mentioning that her students, especially in the lower-level calculus classes, generally don’t refer to her as “professor.” This would be fine since she’s not yet a professor, but she also mentioned they do refer to graduate student men in the same department as professor. She’s a young looking woman, and my guess is they simply don’t know better. Here’s what my advice to her was (and as usual, I’d give this advice to both men and women).

On the first day of class, introduce yourself and put your name on the board, explain when and where you got a Ph.D., what your field of research is, what your current job is, as well as office hours and homework policies. In addition, wear a button-down shirt that first day of class. It’s kind of ridiculous but it works, in the sense that the students will be more impressed with you, which translates into them behaving more respectfully.

Moreover, it’s totally appropriate and not manipulative to explain your credentials. It’s probably most important for calculus, because generally those students don’t really want to be there, at least not all of them. Upper level classes contain students who are more psyched about math and eager to like their professors. I say this partly from experience, partly from talking to other people about their experiences, and partly via information I glean from the student evaluations I’ve read.

Speaking of evaluations, at some point I want to write about the noise that comes from calculus evaluations, because that may as well be an entire subfield of statistics in itself. For example, I think there may be more variation depending on semester than depending on professor, due to the way kids take calculus in high school. In general it’s really hard to infer how good a job you did teaching based on calculus evaluations.

However, there is some signal. I remember reading about a study that said when some guy who was teaching two sections was introduced the first day in one of the sections by a distinguished-looking professor who went on about the instructor’s credentials, that class had much better end-of-semester evaluations, even though the content of the two sections was identical. Even more evidence that you should formally introduce yourself, if not bring in a friend for the job.

Categories: math education, women in math

Is the Onion actually America’s finest news source?

October 1, 2011 Cathy O'Neil, mathbabe 2 comments

Have you noticed that some of the best reporting nowadays is satire? I feel like I learn most of the news I know from reading newspapers online, but I’m unusual: most people, especially young people, seem to get their news from the Daily Show and Colbert, as well as the Onion.

And it’s not just the writing, which is generally excellent and intelligent, as well as hilariously entertaining. It’s the topics themselves that are incisive and that get to the heart of what’s ridiculous or dysfunctional about our financial, cultural, and political systems.

What if we started a newspaper that took its cues directly from the Onion, and rewrote every article in a straight, anti-satire way? Would that newspaper be better or worse than the New York Times? I claim it would be more bizarre but also more relevant to our lives. It may miss entire swaths of typical news coverage but then again it would cover certain things in a more holistic light.

For example, what would an anti-satirist do with this article? Or this one? Just having someone seriously articulate why these things are so funny would be a good start, and an article I’d love to read.

Categories: news, rant

Mortar Hawk: hadoop made easy

September 30, 2011 Cathy O'Neil, mathbabe 6 comments

Yesterday a couple of guys from Mortar came to explain their hadoop platform. You can see a short demo here. I wanted to explain it at a really high level because it’s cool and a big deal for someone like me. I’m not a computer scientist by training, and Mortar allows me to work with huge amounts of data relatively easily. In other words, I’m not sure what ultimately will be the interface for analytics people like me to get access to massive data, but it will be something like this, if not this.

To back up one second, for people who are nodding off, here’s the thing. If you have terabytes of data to crunch, you can’t put it on your computer to take a look at it, and then crunch, because your computer is too small. So you need to pre-crunch. That’s pretty much the problem we need to solve, and people have solved it in one of two ways.

The first is to put your data onto a big relational database, on the cloud or something, and use SQL or some such language to do the crunching (and aggregating and what have you) until it’s small enough to deal with, and then download it and finish it off on your computer. The second solution, called MapReduce (the idea started at Google), or hadoop (the open-source implementation started at Yahoo) allows you to work on the raw data directly where it lies (e.g. on the Amazon cloud (where it’s actually Elastic MapReduce, which I believe is a fork of hadoop)), in iterative steps called mappings and reduction steps.
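For anyone who hasn’t seen the map and reduce steps before, here is a toy, in-memory Python sketch of the idea. It is not hadoop or its API: everything runs on one machine, and the record layout (user, amount) and all function names are made up for the illustration. On a real cluster the map and reduce calls would run in parallel on different machines, and the grouping in the middle is what the framework does for you.

```python
from collections import defaultdict


def map_step(record):
    """Turn one raw record into (key, value) pairs.

    Here a record is a hypothetical (user, amount) tuple and we
    simply emit it keyed by user.
    """
    user, amount = record
    yield user, amount


def shuffle(pairs):
    """Group all values by key; a real framework does this between
    the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_step(key, values):
    """Collapse all the values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce over an iterable of records."""
    mapped = (pair for record in records for pair in map_step(record))
    grouped = shuffle(mapped)
    return dict(reduce_step(key, values) for key, values in grouped.items())
```

Calling `run_job` on a list of (user, amount) records returns one total per user, which is the kind of pre-crunched, much smaller result you could then download and finish off on your own computer.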

Actually there’s an argument to be made, apparently, because I heard it at the Strata conference, that data scientists should never use hadoop at all, that we should always just use relational databases. However, that doesn’t seem economical, the way it’s set up at my work anyway. Please comment if you have an opinion about this because it’s interesting to me how split the data science community seems to be about this issue.

On the other hand, if you can make using hadoop as easy as using SQL, then who cares? That’s kind of what’s happened with Mortar. Let me explain.

Mortar has a web-based interface with two windows. On top we have the pig window and on the bottom a python editor. The pig window is in charge and you can call python functions in the pig script if you have defined them below. Pig is something like SQL but is procedural, so you tell it when to join and when to aggregate and what functions to use in what order. Then pig figures out how to turn your code into map-reduce steps, including how many iterations. They say pig is good at this but my guess is that if you really don’t know anything about how map-reduce works then it’s possible to write pig code that’s super inefficient.

One cool feature, which I think comes from pig itself but in any case is nicely viewable through the Mortar interface, is that you can ask it to “illustrate” the resulting map-reduce code and it takes a small sample of your data and shows example data (of “every type” in a certain sense) at every step of the process. This is super useful as a bug-watching feature to see that it’s looking good with small data sets.

The interface is well designed and easy to use. Overall it reduces a pretty scary and giant data job to something that would probably take me about a week to feel comfortable. And new hires who know python can get up to speed really quickly.

There are some issues right now, but the Mortar guys seem eager to improve the product quickly. To name a few:

it’s not yet connected to git (although you can save pig and python code you’ve already run),
you can’t import most python modules except super basic ones like math (including ones you’ve written; right now you have to copy and paste into their editor),
they won’t be able to ever let you import numpy because they are actually using jython and numpy is c-based,
it doesn’t automatically shut down the cluster after your job is finished, and
it doesn’t yet allow people to share a cluster

These last two mean that you have to be pretty on top of your stuff, which is too bad if you want to leave for the night and start a job and then bike home and feed your kids and put them to bed. Which is kind of my style.

Please tell me if any of you know other approaches that allow python-savvy (but not java savvy) analytics nerds access to hadoop in an easy way!

Categories: data science, open source tools

Occupy Wall Street: Day 13

September 29, 2011 Cathy O'Neil, mathbabe 14 comments

So I went to see the Occupy Wall Street protests this morning before work and this evening after work again. Here are some of my comments and observations.

First, if you are interested in checking it out, know that there are small marches at opening and closing bell for the market.

However, the police have made it basically impossible to walk on Wall Street, due to some incredibly annoying barricades.

So for our march this morning we seemed to just circle the city block where the protest is based, although I didn’t stay til the end so it’s possible they decided to very very slowly march on Wall Street proper.

Second, they have “assemblies” twice a day, with guest speakers sometimes (Michael Moore, Susan Sarandon and Cornel West have visited), and this is where general announcements are made. The crowd was quite large tonight and it was difficult to hear what the speaker and the repeaters were saying, which is frustrating. But maybe it’s easier at the 1pm assembly. Also, it seems to be easier to actually discuss issues in the morning; at night it gets loud and kind of crazy and hard to focus, in my opinion.

Next, I’d like to address the issue of the message of the protesters being dismissed as incoherent. For the record, I went to a conference at the end of 2009 at Columbia Business School on the financial crisis and what we should do about it, where the speakers were fancy economists from central banks and CEOs of international banks, and they were about as incoherent as these protesters. There was absolutely no getting them to say anything that was an actual plan or even an attempt at a plan for changing the system so this mess wouldn’t happen again. I should know, because there was a question and answer period and I asked.

Having said that, there have been some pretty unconvincing statements reported from some of the protesters in terms of what they would like to see. For example, some of them seem to think that short selling should be banned. As some of you know, I disagree. In fact there are lots of seriously corrupt and ridiculous things going on in the financial system which they should know about and they should protest, and I’d like to invite them to educate themselves.

In particular, if you are someone interested in knowing stuff about how the financial system works, then please ask! A major part of why I blog is to try to inform people about these things who are interested. Please comment below and ask whatever you want, and if I don’t know the answer I will find someone who does, or I will blog about the question.

Having said that, I’d like to add that it’s on the one hand perfectly reasonable that people don’t understand the financial system, because it has essentially been set up to be too complicated to understand, and on the other hand it’s also reasonable to think of the entire financial system as a black box which can be judged by its outputs.

Finally, if we are going to judge the system by looking at its outputs, then these protesters, who are in general young, with educations, huge student debts, and hopeless outlooks, have a pretty dismal view. In other words they have every right to complain that the system is fucking them, even though they don’t know how the system works. I for one am super proud that they’re out there doing something, even if it’s not obviously organized and polished, rather than passively sitting by.

Categories: #OWS, finance, rant

Go Rays!

September 29, 2011 Cathy O'Neil, mathbabe 1 comment

As a long-time (yes since they sucked) Red Sox fan, let me just say, the Tampa Bay Rays totally deserve to be in the play-offs. They made me a fan last night with an absolutely amazing game.

Categories: news, rant

Never apologize

September 28, 2011 Cathy O'Neil, mathbabe 11 comments

Last night I was talking to a friend of mine about my teaching experiences, and what it’s like to be a woman in math and to be taken seriously. We were going over the standard stuff, that women are too self-effacing compared to men and tend not to strut their stuff enough. But then I remembered this story from my early teaching experiences that kind of put a different spin on that.

I was in grad school, and over the summer I went to Berkeley to teach at a women in math program, which was still called the “Mill’s program” even though it was being held at Berkeley. It was a really fun experience, something like 30 days of lecture and problem session, and I led the problem sessions.

It was some time in the second week when, one day because of something or other, I hadn’t prepared completely and I apologized to the class for being slightly unprepared. I said something like, “sorry I’m not completely prepared today”. I remember thinking that, in spite of that, the class went very well and there was no “damage” from my being unprepared. Every other day I was completely, perhaps overly prepared, and that was the only day I ever mentioned something about my preparedness.

At the end of the summer we got back teaching evaluations, and I remember that a full half of the evaluations described me as unprepared.

I made a promise to myself never ever to apologize for anything again. And I never have, and I’ve never been accused like that since. Which isn’t to say I pretend to be a perfect teacher, but there are subtle ways of dealing with imperfections (my favorite: turn a self-criticism into a flattery. Instead of saying, oh how stupid I am for not thinking of that, say oh how smart you are for thinking of that. Generosity is not a negative in my experience!).

Going back to last night, though, it’s a two-way street. Women may be too self-effacing, but other people (including women!) are absolutely too dismissive. It’s a very important thing to keep in mind when you are teaching or presenting.

One other thing, in a one-on-one, professional setting, I believe you can apologize and not be executed for it (sometimes and depending on the person), but in a teacher-students setting, or when you’re presenting to clients in business, or even when you’re presenting to colleagues, you’re giving a performance and need to be flawlessly confident.

In an ideal world, we would use this information to learn to become better audiences, to not be dismissive and overly harsh of self-effacing people, and I do try to keep this in mind when I’m in the audience. But it’s going to take lots of effort for this to happen on a large scale, especially among strangers. It’s a cultural axiom in a certain sense.

My advice to young people, especially women: never apologize.

Categories: math education, women in math

Occupy Wall Street—Report

September 27, 2011 Cathy O'Neil, mathbabe 10 comments

This is a guest post by FogOfWar.

I was originally going to lead with a tongue-in-cheek comment (later in the post now), but then the NYPD did something colossally stupid. If you haven’t seen it, here’s the video from this last weekend. It pretty much speaks for itself.

There’s a lot to be said about freedom of expression and police overreaction. I’ve been to see the protests a number of times, and they’ve never been violent and in fact seem pretty well trained in the confines of freedom of assembly in the US legal system. Using mace against an imminent threat of violence is OK for the police, but the video seems to show no threatening moves made at all (and it runs for a good period before the police attack so it wasn’t edited out).

I’d suggest the NYPD be shown the following video (taken from the protests in Greece) to demonstrate when things reach a level where force might be an appropriate response. Note that the crowd is attacking with sticks, Molotov cocktails and a fucking bowling ball. In contrast, the NYPD appears to be pepper spraying people for just holding signs and walking down the street. What the fuck?

There are maybe a few hundred people consistently protesting at “Occupy Wall Street” for about 10 days now. It’s got a definite crunchy vibe to the center. Drumming and Mohawks are mandatory:

But also a (growing?) contingent of more mainstream participants like this one:

Here’s a crowd shot for scale:

And some people painting signs:

And then of course, there’s the dreaded “consensus circle”:

It’s hard to tell what they really want to happen—this was up at one of the information booths (but then down the next time I went):

Misspelled “derivatives”, and there are some things on that list that are spot on and then others that are just weird and irrelevant (DTC? Really?). I don’t think you can hold that against them though. I work in the industry, and I’ve been spending the last three years thinking about this stuff and I still find it confusing and hard to come up with a cohesive plan of what I think should be done. At least these people are doing something, even if it’s a bit incoherent at times.

I have to end with my all time favorite sign from the protest. Someone was looking for good cardboard and inadvertently came up with the following:

“Delicious pizza to pay off the taxpayers”. Now that’s a slogan I think we can all rally behind!

-FoW

Categories: #OWS, finance, FogOfWar, news, rant

The flat screen TV phenomenon

September 26, 2011 Cathy O'Neil, mathbabe 12 comments

Do you remember, back in 2005 or 2006 or even up to early 2008, how absolutely everyone seemed to be buying flat screen TVs? And not only one, they’d actually buy new ones when new models came out, or ones with different high definition properties. And not just people who could afford it, either. The marketers did an excellent job in somehow convincing people that they needed these flat screen TVs so bad that they should just put it on their credit cards, all 3 thousand dollars of it, or whatever those things cost.

I don’t know exactly how much they cost because I never bought one. The last TV we bought was in 1997 and it still works, for the most part, although it’s really hard to turn it on and off. When it finally kicks the bucket I’m thinking we go without a TV, since TV pretty much sucks anyway. When we do watch it, it’s for live sports (local, or nationally televised, since we don’t pay for cable). Baseball we watch or listen to on the computer.

I was reminded of the “flat screen TV era” by my friend Ian Langmore the other day when we were discussing household debt amnesty. His argument against debt amnesty for consumers was that they might spend it on crappy things. His example was luxury dog poo, but I’ve been obsessed with the flat screen TV phenomenon ever since a friend of mine, who was $120,000 in debt and didn’t have a salary, somehow managed to buy a flat screen TV in 2007. It blew me away in terms of wasteful consumerism. Ian found this unbelievable blog which kind of sums up my concerns.

In Ian’s opinion, the danger of amnesty, or any system where money is put willy-nilly into the hands of consumers, is twofold:

1) We waste time on unproductive activities. E.g. people spent time buying/building cars that are unneeded.
2) If a miscalculation is made, then the over-leveraged money-go-round stops with a huge mis-balance. E.g. home mortgage crisis.

These are very good points, and put together form a lesson we somehow can’t learn, although perhaps that can be partially explained by this article.

I have two thoughts. First, I’m also uncomfortable putting money in the hands of irresponsible consumers. But the truth is, the way I see it is currently working, we are already putting money in the hands of irresponsible bankers (that’s what the term “injection of liquidity” really means), and they are not doing anything with it, so let’s try something else. In other words, an alternative unpleasant idea.

Second, I don’t think we are going to see a new wave of flat screen TV buying any time soon. If we put money into the hands of consumers right now, I think we’d see them pay down their debts, go to the doctor, and buy jeans for their kids. Of course, there is always someone whose pockets burn with cash, and they would waste money in any situation. Let’s face it, though, credit is tight right now compared to the mid-2000’s. In fact, since economists seem to have a tough time spotting bubbles until afterwards, maybe we can take “a huge part of the population starts buying useless gadgets on credit” as almost a definition, or at least a leading indicator. Then at least there would be some point to all of that wasteful spending.

Categories: finance, news, rant

Why and how to hire a data scientist for your business

September 25, 2011 Cathy O'Neil, mathbabe 22 comments

Here are the annotated slides from my Strata talk. The audience consisted of business people interested in big data. Many of them were coming from startups that are newly formed or are currently being formed, and are wondering who to hire.

When do you need a data scientist?

When you have too much data for Excel to handle: data scientists know how to deal with large data sets.

When your data visualization skills are being stretched: as we will see, data scientists are skilled (or should be) at data visualization and should be able to figure out a way to visualize most quantitative things that you can describe with words.

When you aren’t sure if something is noise or information: this is a big one, and we will come back to it.

When you don’t know what a confidence interval is: this is related to the above; it refers to the fact that almost every number you see coming out of your business is actually an estimate of something, and the question you constantly face is, how trustworthy is that estimate?
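As a rough illustration of the confidence interval point, here is a minimal sketch that puts an approximate 95% interval around an average. The daily signup numbers are invented, and the normal approximation (the 1.96 multiplier) is an assumption; with only ten data points a t-based multiplier would give a somewhat wider interval.

```python
import math
import statistics

def mean_confidence_interval(values, z=1.96):
    """Approximate 95% confidence interval for the mean (normal approximation)."""
    n = len(values)
    mean = statistics.mean(values)
    # Standard error of the mean: sample standard deviation over sqrt(n)
    stderr = statistics.stdev(values) / math.sqrt(n)
    return mean - z * stderr, mean + z * stderr

# Hypothetical daily signup counts, made up for this example
daily_signups = [102, 98, 110, 95, 105, 99, 101, 108, 97, 104]

low, high = mean_confidence_interval(daily_signups)
print(f"mean = {statistics.mean(daily_signups):.1f}, 95% CI = ({low:.1f}, {high:.1f})")
```

The width of that interval is the answer to "how trustworthy is that estimate?": the narrower it is, the more you can lean on the number.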

Let’s take a step back: Should you need a data scientist?

Are you asking the right questions? Is there a business that you’re not in that you could be in if you were thinking more quantitatively? Big data is making things possible that weren’t just a few years ago.

Are you getting the most out of your data? In other words, are you sitting on a bunch of delicious data and not even trying to mine it for your business?

Are you anticipating shocks to your business? As we will see, data scientists can help you do this in ways you may be surprised at.

Are you running your business sufficiently quantitatively? Are you not collecting the data (or not collecting it in a centralized way) that would lead to opportunities for data mining?

So, you’ve decided to hire a Data Scientist (nice move!)

What do you need to get started?

Data storage. You gotta keep all your data in one place and in some unified format.

Data access — usually through a database (payoffs for different types). Specifically, you can pay for someone else to run a convenient SQL database that people know how to use walking in the door without much training, or you could set something up that’s open source and “free” but then it will probably take more time to set up and may take the data scientists longer to figure out how to use. The investment here is to create tools to make it convenient to use.

Larger-scale or less uniform data may require Hadoop access (and someone with real tech expertise to set it up). The larger your data is the more complicated and developed your skills need to be to access it. But it’s getting easier (and other people here at the conference can tell you all you need to know about services like this).

Who and how should you hire? It’s not obvious how to hire a data scientist, especially if your business so far consists of less mathematical people.

A math major? Perhaps a Masters in statistics? Or a Ph.D. in machine learning? If you’re looking for someone to implement a specific thing, then you just need proof that they’re smart and know some relevant stuff. But typically you’re asking more than that: you’re asking for them to design models to answer hard questions and even to figure out what the right questions are. For that reason you need to see that the candidate has the ability to think independently and creatively. A Ph.D. is evidence of this but not the only evidence: some people could get into grad school or even go for a while but decide they are not academically-minded, and that’s okay (but you should be looking for someone who could have gotten a Ph.D. if they’d wanted to). As long as they went somewhere and challenged themselves and did new stuff and created something, that’s what you want to see. I’ll talk about specific skills you’d like in a later section, but keep in mind that these are people who are freaking smart and can learn new skills, so you shouldn’t obsess over something small like whether they already know SQL.

What should the job description include? Things like, super quantitative, can work independently, know machine learning or time series analysis, data visualization, statistics, knows how to program, loves data.

Who even interviews someone like this? Consider getting a data scientist as a consultant just to interview a candidate to see if they are as smart as they claim to be. But at the same time you want to make sure they are good communicators, so ask them to explain their stuff to you (and ask them to explain stuff that has been on your mind lately too) and make sure they can.

Also: don’t confuse a data scientist with a software engineer! Just as software engineers focus on their craft and aren’t expected to be experts at the craft of modeling, data scientists know how to program in the sense that they typically know how to use a scripting language like python to manipulate the data into a form where they can do analytics on it. They sometimes even know a bit of java or C, but they aren’t software engineers, and asking them to be is missing the point of their value to your business.

What do you want from them?

Here are some basic skills you should be looking for when you’re hiring a data scientist. They are general enough that they should have some form of all of them (but again don’t be too choosy about exactly how they can address the below needs, because if they’re super smart they can learn more):

Data grappling skills: they should know how to move data around and manipulate data with some programming language or languages.
Data viz experience: they should know how to draw informative pictures of data. That should in fact be the very first thing they do when they encounter new data.
Knowledge of stats, errorbars, confidence intervals: ask them to explain this stuff to you. They should be able to.
Experience with forecasting and prediction, both general and specific (ex): lots of variety here, and if you have more than one data scientist position open, I’d try to get people from different backgrounds (finance and machine learning for example) because you’ll get great cross-pollination that way
Great communication skills: data scientists will be a big part of your business and will contribute to communications with big clients.

What does a Data Scientist want from you? This is an important question because data scientists are in high demand and are highly educated and can get poached easily.

Interesting, challenging work. We’re talking about nerds here, and they love puzzles, and they get bored easily. Make sure they have opportunities to work on good stuff or they’ll get other jobs. Make sure they are encouraged to think of their own projects when it’s possible.

Lots of great data (data is sexy!): data scientists love data, they play with it and become intimate with it. Make sure you have lots of data, or at least really high-quality data, or soon will, before asking a data scientist to work for you. Data science is an experimental science and cannot be done without data!

To be needed, and to have central importance to the business. Hopefully it’s obvious that you will want your data scientists to play a central role in your business.

To be part of a team that is building something: this should be true of anyone working in business, especially startups. If your candidate wants to write academic papers and sit around while they get published, then hire someone else.

A good and ethically sound work atmosphere.

Cash money. Most data scientists aren’t totally focused on money though or they would go into finance.

Further business reasons for hiring a Data Scientist

Reporting help: automatically generated daily reports can be a pain to set up and can require lots of tech work and may even require a dedicated person to generate charts. Data scientists can pull together certain kinds of reports in a matter of days or weeks and generate them every day with cronjobs. Here’s a sample picture of something I did at my job:

Having a data scientist enables you to see into data without taxing your tech team (beyond setup) via visualizations and reports like the above.

A/B testing: data scientists help you set up A/B testing rigorously.

Beyond A/B testing: adaptability and customization. What you really want to do is get beyond A/B testing. Instead of having the paradigm where customers come to the ad and respond in a certain way, we want to have the (right) ad come to the customer.

Knowing whether numbers are random (seasonality) or require action. If revenue goes down in a certain week, is that because of noise? Or is it because it always goes down the week after Labor Day? Data scientists can answer questions like this.

What-if analysis: you can ask data scientists to estimate what would happen to revenue (or some other stat) if a client drops you, or if you gain a new client, or if someone doubles their bid at an auction (more on this later).

Help with business planning: Will there be enough data to answer a given question? Will there be enough data to optimize on the answer? These are some of the most difficult and most important questions, and the fact that a data scientist can help you answer them means they will be central to the business.

Education for senior management: senior people who talk to and recruit new clients will need to be able to explain how to think about the data, the signals, the stats, and the errorbars in a rigorous and credible way. Data scientists can and should take on the role of an educator for situations like this.

Mathematically sound communication to clients: you may have situations where you need the data scientists to talk directly to clients or to their data scientists. This is yet another reason to make sure you hire someone with excellent communication skills, because they will be representing your business to really smart people who can see through bullshit.
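To make the A/B testing item above a bit more concrete, here is a minimal sketch of one standard way to check whether two conversion rates really differ: a two-proportion z-test. The visitor and conversion counts are invented, and this is only one of several reasonable tests; it assumes samples large enough for the normal approximation.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally
    pooled = (conv_a + conv_b) / (n_a + n_b)
    stderr = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / stderr
    # Two-sided tail probability of the standard normal, via erfc
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 10,000 visitors per variant
z, p_value = two_proportion_z_test(conv_a=200, n_a=10000, conv_b=260, n_b=10000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

The same standard error is what tells you, before the test starts, roughly how many visitors you need to detect a lift of a given size.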

Case Study: Stress Tests

We can learn from finance: the idea of a stress test is stolen directly from finance, where we look at how replays of things like the credit crisis would affect portfolios. I wanted to do something like that but for general environmental effects that a business like mine, which hosts an advertising platform, encounters.

You know how big changes will affect your business directionally and specifically. But do you know how combinations will play out? Stress tests allow you to combine changes and estimate their overall effect quantitatively. For example, say we want to know how lowering or raising their bids (by some scalar amount) will affect advertisers’ impression share (the number of times their ads get displayed to users). Then we can run that as a scenario (for each advertiser separately) using the last two weeks (say) of auction data with everything else kept the same, and compare it to what actually happened in the last two weeks. This gives an estimate of how such a change would affect impression share in the future. Here’s a heat map of possible results of such a “stress test”:

This shows a client-facing person that Advertiser 13 would benefit a lot from raising their bid 50% but that Advertiser 12 would suffer from lowering their bid.
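The replay idea behind such a heat map can be sketched in a few lines. Everything here is a simplifying assumption: the auction log is invented, the highest bid is assumed to win the impression outright, and impression share is computed as the fraction of auctions won. A real ad auction has more moving parts.

```python
# Hypothetical auction log: each auction maps advertiser -> bid.
auctions = [
    {"adv12": 1.00, "adv13": 0.90},
    {"adv12": 0.80, "adv13": 1.10},
    {"adv12": 1.20, "adv13": 1.00},
    {"adv12": 0.70, "adv13": 0.60},
]

def impression_share(auction_log, advertiser, bid_multiplier=1.0):
    """Replay the log with one advertiser's bids scaled; return fraction of auctions won."""
    wins = 0
    for bids in auction_log:
        adjusted = dict(bids)  # copy so the historical log stays untouched
        if advertiser in adjusted:
            adjusted[advertiser] *= bid_multiplier
        # Assumption: highest bid wins the impression
        winner = max(adjusted, key=adjusted.get)
        if winner == advertiser:
            wins += 1
    return wins / len(auction_log)

# One row of the "heat map": an advertiser's share under different bid changes
for multiplier in (0.5, 1.0, 1.5):
    print("adv13", multiplier, impression_share(auctions, "adv13", multiplier))
```

Running this for every advertiser and every multiplier, and subtracting the baseline (multiplier 1.0), gives the grid of values a heat map would display.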

We could also:

run scenarios which combine things like the above
run scenarios which ask different questions: how would advertisers be affected if a new advertiser entered the auction? If we change the minimum bid? If one of the servers fails? If we grow into new markets?
run scenarios from the perspective of the business: how would revenue change if the bids change?

In the end stress tests can benefit any client-facing person or anyone who wants to anticipate revenue, so across many of the verticals of the business.

Categories: data science

I never sit on the subway

September 24, 2011 Cathy O'Neil, mathbabe 6 comments

I remember when I moved to New York in 2005. I found it intimidating and shocking how aggressively people vied for seats on the subway. I live near Columbia so the 1 train is my line, and of course everyone thinks their subway line is the most overused and crazy line, but in this case I’m right. I came from Boston, where we have subways too, four little itty bitty ones, and we are extremely polite to each other and, in particular, we never touch. By contrast here were these New Yorkers not only touching but literally squeezing into these tiny seats and sweating all over each other in the summer.

After about 3 months of living here I got really into it. I was in love with this city, and every gritty thing about it, and I considered the shared experience of the subway a sign of a larger public communistic love. Here they were, people from all walks of life, sharing their sweat! Isn’t it beautiful?

That kind of admiration only grew in the two years I stayed a professor at Barnard, which meant I almost never left the cozy neighborhood of Morningside Heights, so subway rides were rather rare, amusing events. I loved the subway and I developed theories about when people start talking on the subway (in three situations: 1) someone who is incredibly smelly gets off the train and everyone needs to talk about how smelly they were, 2) someone who is incredibly sick and coughing up a lung gets off the train and everyone has to talk about how sick and nasty they were, and 3) the train stops in the tunnel and the announcer tells us we have no idea when we will be able to move, and everyone has to talk about their stuck-in-a-tunnel-during-9/11 experiences.)

As soon as I started working at D.E. Shaw in midtown, and commuted during rush hour, I got real. I figured out exactly where to stand, and I mean exactly where on each platform, to maximize my chances of getting a seat once the train came. I figured out, depending on how many people were on which platform in Times Square, and the subsequent stations as we passed them, what the recent train traffic pattern had been in terms of the express 2/3 train and my local 1 train, and sometimes I’d do crazy things like get off the express train early to get on the 1 train because I’d anticipate that if I waited til 96th street like everyone else, there would be no chance I could get on the 1 train. Actually looking back, I almost never sat down at all during these commutes, even when I was pregnant.

Which comes to the turn in my story. When I was heavily pregnant, commuting on the subway was actually hellish. I had no balance, and felt vulnerable, and being squished up against people with no place to hold on was really scary. For the most part commuters are a selfish bunch, and people sitting would pretend not to notice me, so they wouldn’t have to give up their seat. I promised myself I’d never be that jerk.

For the last two weeks of my pregnancy I took a cab to work every day, but even so coming home was another story, since it’s hard to get a cab in Times Square at 5pm. I remember one time some asshole in a suit actually ran to grab a cab that had stopped for me, and he beat me because… I was 9 months pregnant and couldn’t keep up with him. I started crying, on the street, until this nice pedicab guy pulled over and asked me if he could help. I told him I lived all the way uptown and he biked me around until he found me a cab; he refused to let me pay. I still love that guy.

Once I started down the road of getting up for pregnant people, though, it was a short logical step to never sitting down again. After all, there are all kinds of hidden reasons people may need to sit down more than I do. What if their feet are killing them after standing all day at work? What if they have balance problems?

For a while I decided it’s okay to sit if everyone else had an available seat. That seemed safe. But then I’d be sitting there, spaced out or reading, with a sea of empty seats around me, and all of a sudden a huge group of people would converge and somehow I’d be face to face with someone with a murderous look which said, you motherfucker you’re sitting in my seat. In the end, it’s become my policy to just never sit down.

I do of course still think about the question of where’s the best place to stand in the subway. This is a whole different optimization play, which for intellectual property reasons I won’t share with you all, since I don’t want more competition than I already have. Just one hint: don’t get on in the middle of the car. Always get on at one of the ends.

Categories: rant

In German beard circles, tensions are high.

September 23, 2011 Cathy O'Neil, mathbabe 2 comments

Best article ever about beards.

Categories: news

Are SAT scores going down?

September 23, 2011 Cathy O'Neil, mathbabe 1 comment

I wrote here about standardizing tests like the SAT. Today I wanted to spend a bit more time on them since they’ve been in the news and it’s pretty confusing what to think.

First, it needs to be said that, as I have learned in this book I’m reading, it’s probably a bad idea to make statements about learning when you make “cohort-to-cohort comparisons” instead of following actual students along in time. In other words, if you compare how well the 3rd grade did in a test one year to the next, then for the most part the difference could be explained by the fact that they are different populations or demographics. Indeed the College Board, which administers the SAT, explains that the scores went down this year because more and more diverse kids are taking the test. So that’s encouraging, and it makes you think that the statement “SAT scores went down” is in this case pretty meaningless.
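Here is a small worked example of how a change in who takes the test can move the overall average even when no group's performance changes at all. The groups, scores and counts are entirely made up.

```python
def overall_mean(group_means, group_counts):
    """Average score across all test takers, weighting each group by its size."""
    total = sum(group_means[g] * group_counts[g] for g in group_means)
    return total / sum(group_counts.values())

# Hypothetical: each group's average score is identical in both years
group_means = {"group_a": 520, "group_b": 480}

# Only the mix of test takers changes from year 1 to year 2
year1_counts = {"group_a": 800, "group_b": 200}
year2_counts = {"group_a": 700, "group_b": 400}

year1 = overall_mean(group_means, year1_counts)
year2 = overall_mean(group_means, year2_counts)
print(f"year 1: {year1:.1f}, year 2: {year2:.1f}")
```

In this toy setup the headline number drops by several points purely because of composition, which is the kind of effect a cohort-to-cohort comparison cannot separate from actual learning.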

But is it meaningless for that reason?

Keep in mind that these are small differences we’re talking about, but with a pretty huge sample size overall. Even so, it would be nice to see some errorbars and see the methodology for computing errorbars.

What I’m really worried about though is the “equating” part of the process. That’s the process by which they decide how to compare tests from year to year, mostly by having questions in common that are ungraded. At least that’s what I’m guessing, it’s actually not clear from their website.

My first question is, are they keeping in mind the errors for the equating process? (I find it annoying how often people, when they calculate errors, only calculate based on the very last step they take in a very sketchy overall process with many steps.) For example, is their equating process so good that they can really tell us with statistical significance that American Indians as a group did 2 points worse on the writing test (see this article for numbers like this)? I am pretty sure that’s a best guess with significant error bars.

Additional note: found this quote in a survey paper on equating methodologies (top of page 519):

Almost all test-equating studies ignore the issue of the standard error of the equating function.
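To see why ignoring the equating error matters, here is a minimal sketch of how errors from different steps combine. All numbers are hypothetical (the score standard deviation, the sample sizes and especially the equating standard error, which is not published as far as I can tell), and the steps are assumed to be independent so that their variances add.

```python
import math

def standard_error_of_difference(sd, n_year1, n_year2, equating_se=0.0):
    """Standard error of a year-over-year change in mean score.

    Sampling error from each year's mean and the equating error are
    assumed independent, so their variances add.
    """
    sampling_var = sd**2 / n_year1 + sd**2 / n_year2
    return math.sqrt(sampling_var + equating_se**2)

observed_drop = 2.0  # points, as in the kind of difference being reported

# Hypothetical inputs: sd of 100 points, 40,000 test takers each year
se_naive = standard_error_of_difference(100, 40000, 40000)
se_full = standard_error_of_difference(100, 40000, 40000, equating_se=2.0)

print(f"last step only: z = {observed_drop / se_naive:.2f}")
print(f"with equating:  z = {observed_drop / se_full:.2f}")
```

With these made-up inputs, the 2-point drop looks clearly significant if you only count sampling error and not significant at all once a modest equating error is included.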

Second, I’m really worried about the equating process and its errorbars for the following reason: the number of repeat testers varies widely depending on the demographic, and also from year to year. How then can we assess performance on the “linking questions” (the questions that are repeated on different tests) if some kids (in fact the kids more likely to be practicing for the test) are seeing them repeatedly? Is that controlled for, and how? Are they removing repeat testers?

This brings me to my main complaint about all of this. Why is the SAT equating methodology not open source? Isn’t the proprietary “intellectual property” in the test itself? Am I missing a link? I’d really like to take a look. Even better of course if the methodology is open source (as in there’s an available script which actually computes the scores starting with raw data) and the data is also available with anonymization of course.

Categories: data science, math education, news
