mathbabe

What is an earnings surprise?

July 17, 2011 Cathy O'Neil, mathbabe 1 comment

One of my goals for this blog is to provide a minimally watered-down resource for technical but common financial terms. It annoys me when I see technical jargon thrown around in articles without any references.

My audience for a post like this is someone who is somewhat mathematically trained, but not necessarily mathematically sophisticated, and certainly not knowledgeable about finance. I already wrote a similar post about what it means for a statistic to be seasonally adjusted here.

By way of very basic background, publicly traded companies (i.e. companies you can buy stock in) announce their earnings once a quarter. They each have a different schedule for this, and their stock price often has drastic movements after the announcement, depending on whether it’s good news or bad news. They usually make their announcement before or after trading hours so that it’s more difficult for news to leak and affect the price in weird ways minutes before and after the announcement, but even so most insider trading is centered around knowing and trading on earnings announcements before the official announcement. (Don’t do this. It’s really easy to trace. There are plenty of other ways to illegally make money on Wall Street that are harder to trace.)

In fact, there’s so much money at stake that there’s a whole squad of “analysts” whose job it is to anticipate earnings announcements. They are supposed to learn lots of qualitative information about the industry and the company and how it’s managed etc. Even so most analysts are pretty bad at forecasting earnings. For that reason, instead of listening to a specific analyst, people sometimes take an average of a bunch of analysts’ opinions in an effort to harness the wisdom of crowds. Unfortunately the opinions of analysts are probably not independent, so it’s not clear how much averaging is really going on.

The bottom line of the above discussion is that the concept of an earnings surprise is really only borderline technical, because it’s possible to define it in a super naive, model-free way, namely as the difference between the “consensus among experts” and the actual earnings announcement. However, there’s also a way to quantitatively model it, and the model will probably be as good as or better than most analysts’ predictions. I will discuss this model now.

[As an aside, if this model works as well as or better than most analysts’ opinions, why don’t analysts just use this model? One possible answer is that, as an analyst, you only get big payoffs if you make a big, unexpected prediction which turns out to be true; you don’t get much credit for being pretty close to right most of the time. In other words you have an incentive to make brash forecasts. One example of this is Meredith Whitney, who got famous for saying in October 2007 that Citigroup would get hosed. Of course it could also be that she’s really pretty good at learning about companies.]

An earnings surprise is the difference between the actual earnings, known on day t, and a forecast of the earnings, known on day t-1. So how do we forecast earnings? A simple and reasonable way to start is to use an autoregressive model, which is a fancy way of saying do a regression to tell you how past earnings announcements can be used as signals to predict future earnings announcements. For example, at first blush we may use the last earnings announcement as a best guess of the coming one. But then we may realize that companies tend to drift in the same direction for some number of quarters (we would find this kind of thing out by pooling data over lots of companies over lots of time), so we would actually care not just about what the last earnings announcement was but also the previous one or two or three. [By the way, this is essentially the same first step I want to use in the diabetes glucose level model, when I use past log levels to predict future log levels.]
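To make the definition concrete, here is a minimal sketch of an autoregressive forecast and the resulting surprise for a single company. The function names, the choice of two lags, and the earnings numbers are all made up for illustration; in practice the coefficients would come from pooled data rather than one short series.

```python
import numpy as np

def fit_ar(series, p):
    """Least-squares fit of series[t] ~ c + a_1*series[t-1] + ... + a_p*series[t-p]."""
    y = np.asarray(series, dtype=float)
    # Column k holds the value k quarters before each target quarter.
    lags = np.column_stack([y[p - k : len(y) - k] for k in range(1, p + 1)])
    X = np.column_stack([np.ones(len(lags)), lags])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef  # [intercept, a_1, ..., a_p]

def forecast_next(series, coef):
    """Forecast the next value from the most recent p values."""
    p = len(coef) - 1
    recent = np.asarray(series, dtype=float)[-1 : -p - 1 : -1]  # most recent first
    return coef[0] + recent @ coef[1:]

# Made-up quarterly earnings per share, oldest first (illustrative only).
past_eps = [1.00, 1.05, 1.12, 1.10, 1.18, 1.25, 1.24, 1.31]
coef = fit_ar(past_eps, p=2)
forecast = forecast_next(past_eps, coef)   # known on day t-1
actual_eps = 1.40                          # hypothetical announcement on day t
surprise = actual_eps - forecast
```

A positive surprise means the company beat the model's forecast; a negative one means it fell short.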

The difference between two quarters ago and last quarter gives you a sense of the derivative of the earnings curve, and if you take the second difference over the past three (an alternating sum in which the middle quarter is counted twice) you get a sense of the curvature or acceleration of the earnings curve.
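Written out, with e_{t-1}, e_{t-2}, e_{t-3} standing for the last three announced earnings (this notation is mine, not something fixed earlier in the post), the two quantities are roughly:

```latex
\begin{align*}
\text{slope} &\approx e_{t-1} - e_{t-2}, \\
\text{curvature} &\approx (e_{t-1} - e_{t-2}) - (e_{t-2} - e_{t-3})
                 = e_{t-1} - 2e_{t-2} + e_{t-3}.
\end{align*}
```

Both are linear combinations of past earnings, which is why a regression on the last three announcements can pick up on them without being told to.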

It’s even possible you’d want to use more than three past data points, but in that case, since the number of coefficients you are regressing is getting big, you’d probably want to place a strong prior on those coefficients in order to reduce the degrees of freedom; otherwise we would be fitting the coefficients to the data too much and we’d expect it to lose predictive power. I will devote another post to describing how to put a prior on this kind of thing.
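As a placeholder until that post, here is one common way to express a prior on regression coefficients: a Gaussian prior centered at zero, which amounts to ridge regression. This is an assumption on my part about what a reasonable prior might look like, not necessarily the one the later post will use, and the `strength` parameter is a hypothetical knob you would have to tune.

```python
import numpy as np

def fit_ridge(X, y, strength):
    """Minimise ||y - X b||^2 + strength * ||b||^2.

    A larger `strength` pulls the coefficients harder toward zero, which is
    the same as a tighter zero-mean Gaussian prior on b.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + strength * np.eye(n_features), X.T @ y)
```

With `strength=0` this is ordinary least squares; as `strength` grows, the fitted coefficients shrink and the model has effectively fewer degrees of freedom.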

Once we have as good a forecast of the earnings knowing past earnings as we can get, we can try adding macroeconomic or industry-specific signals to the model and see if we get better forecasts – such signals would bring up or bring down the earnings for the whole industry. For example, there may be some manufacturing index we could use as a proxy to the economic environment, or we could use the NASDAQ index for the tech environment.

Since there is never enough data for this kind of model, we would pool all the data we had, for all the quarters and all the companies, and run a causal regression to estimate our coefficients. Then we would calculate an earnings forecast for a specific company by plugging in the past few quarterly results of earnings for that company.
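Pooling can be sketched as stacking the lagged rows from every company into one regression, so that all companies share a single set of coefficients. The dictionary layout and function names below are hypothetical, and real earnings would first need to be put on a comparable scale across companies, which this sketch ignores.

```python
import numpy as np

def lagged_rows(series, p):
    """Return (past values, target) pairs for one company, using only earlier quarters."""
    y = np.asarray(series, dtype=float)
    X = np.column_stack([y[p - k : len(y) - k] for k in range(1, p + 1)])
    return X, y[p:]

def fit_pooled(earnings_by_company, p):
    """Fit one shared set of AR coefficients across all companies."""
    blocks = [lagged_rows(s, p) for s in earnings_by_company.values() if len(s) > p]
    X = np.vstack([X_i for X_i, _ in blocks])
    y = np.concatenate([y_i for _, y_i in blocks])
    X = np.column_stack([np.ones(len(X)), X])  # shared intercept
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # [intercept, a_1, ..., a_p]
```

The regression is causal in the sense that each row predicts a quarter using only quarters that came before it.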

Categories: data science, finance, hedge funds, news

I love math nerd kids

July 15, 2011 Cathy O'Neil, mathbabe 9 comments

So I’m almost at the end of my second week here at HCSSiM, and the pathetic truth is I already miss these kids. They are so freaking adorable, and of course I miss my own kids so much, that the emotional turmoil of the situation combines to create the reality that I am actually nostalgic for each moment with them before that moment happens. Pathetic!! It’s something about identifying with their nerdy selves finding each other and figuring out that they have a community of nerds that accepts them… whatever, now I’m tearing up. Pitiful.

As for what I’m teaching them, the first week it was number theory, number theory, and more number theory. Can you tell I like number theory? At the end of the first week I looked around and I saw a bunch of earnest faces wondering if I was going to prove yet another thing about relatively prime numbers and solving polynomials modulo n and I thought to myself, these kids are going to think there are no other examples of proof by induction! How shameless! So this week I talked about graph theory. Next week: I’m going back to number theory. Yes I know, but it’s AWESOME. I’m going to talk about Farey numbers and continued fractions and maybe the Pell equation. They will know all about the golden ratio and maybe we’ll even measure each other’s faces. I can’t wait.

Last night we went to the director’s house and ate corn on the cob (we made the kids husk the corn- did you know teenagers today have mostly never husked corn before in their lives?) and pizza and we played “Mafia,” which was hilarious and sweetly innocent.

This weekend is “Yellow Pig day” at the ~~camp~~ program, which is a day where we celebrate yellow pigs and the number 17. We take this incredibly seriously, including making t-shirts with yellow pigs, having a 4-hour (feels like 17) talk about interesting properties of the number 17, and finally, singing yellow pig carols and eating a yellow pig cake at the end. It’s a wild time for math nerd kids. They will remember this and each other for the rest of their lives. Woohoo!!

Did I mention that I was a minor celebrity last night because I solved a 7x7x7 Rubik’s cube in front of them? This is status at its best. I even showed them my trick, and one of the kids came back to me at breakfast this morning proudly displaying his cube with a 3-cycle. Update: he has solved his entire cube using 3-cycles. Now he’s moving on to a dodecahedron puzzle.

LOVE these kids.

Categories: math education, rant, women in math

Motivating transparency: what we could do about too big to fail

July 13, 2011 Cathy O'Neil, mathbabe 5 comments

In this previous post, I promised a follow-up post about how we can devise a system in which large banks are actually motivated to be transparent about what is inside their portfolios. We have also discussed why the current system doesn’t work this way and that the banks have every reason to obfuscate their holdings, and in fact make loads of money by doing so. This makes appropriate external risk management difficult or impossible.

I have actually thought about this problem quite a bit since that post, and I (and a friend in finance) have come up with two quasi ideas, which hopefully together add up to be as good as one complete idea. The first comes under the category, “add stuff to what we have now”, whereas the second comes under the category, “initiate a new system which will over time replace the one we have”. Both of these systems rely on a good understanding of the underlying problem of the current system, namely the concept of “too big to fail.”

If you’re reading this and you have comments about either idea, please do comment. We are hoping for lots of feedback so we can improve the details.

Too Big to Fail

Recall the way it works when hedge funds want to trade stuff: they have prime brokers, i.e. banks like Deutsche and Goldman Sachs and Bank of America (see list of the biggies here). When the brokers don’t like the trade, or think it’s not sufficiently liquid, or think that the hedge fund may fail for any reason, they demand that the hedge funds post margin. That way if the bet goes sour there is a limited amount of risk that the brokerage could lose. As soon as a position starts to look riskier, which could happen because of recent volatility or lack of price transparency, the amount of margin that needs to be posted normally increases, putting pressure on the hedge fund to liquidate suspicious assets.

In other words, there is a real cost to hedge funds for trading in illiquid or complex securities, namely their cash is tied up in bank accounts with their brokers. This is not to say that they don’t take large risks, but there is a limit of how much risk they can take because of the “posting margin” system.

By contrast, big banks don’t post margins. They trade with hedge funds, of course, since hedge funds trade with them, but it’s the banks who demand margin, not the hedge funds (actually there’s a historical exception to this rule, namely Paulson’s hedge fund demanded margin from its brokers during the 2008 financial crisis).

This asymmetrical situation begs the question, why do hedge funds have to post margin but the big banks don’t? Two reasons: first, banks have access to Federal funds, and second, they are deemed too big to fail. [I admit I don’t know exactly why the access to Federal funds is granted to banks, nor do I understand exactly what the effect is. But I do think it’s a pertinent fact which is why I’ve included it here. Please do comment if you know more! Also note it may be a red herring since Goldman Sachs didn’t have access to Fed funds until the crisis.]

This “too big to fail” guarantee is a huge problem, which has only gotten more precise (since we’ve seen the bailout and now everyone knows the guarantee is there) and larger (because, in the end, the net result of all the 2008 crisis is fewer, larger banks) and about which absolutely nothing seems to be getting done. The disingenuous whining of greedy bankers like Jamie Dimon serves as a smokescreen for the fact that, if anything, banks are presumably waltzing into the next phase of their life with more power and fewer checks than they could have dreamed about in August 2008.

Idea #1: make banks post margins

“Too big to fail” means that it is assumed that the bank will be rescued by the government if it makes huge bad bets that threaten to bring them down. Two of the reasons the government can be counted on to bail out banks are first, that the deposits of normal Americans are at risk, which is discussed below in Idea #2, and second, that a bankruptcy would be catastrophically complicated, which we discuss here. One result of the guarantee is that hedge funds don’t bother demanding margins, which makes the banks riskier, which makes the “too big to fail” guarantee even worse.

What if the lawmakers enforced a symmetry of posted margins? We have to be precise, because actually there are different kinds of margins that traders are forced to post.

First, there’s the margin you post in the sense of “keep $x as a deposit for the position”, the thinking being that even if things go south, the broker could liquidate at something better than $x below current marked price in a hurry. This is the initial margin.
Next there’s the “your position lost $10 today, so you need to give me $10” (this is called variation margin). This is the most likely way to get margin called.

The idea here is to require brokers to post initial margin just as hedge funds do now. More precisely, the idea would be to let the two parties negotiate on the initial margin, which could be more for hedge funds since they may well be riskier, but then once it’s set to have complete symmetry of variation margin.
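Here is a toy sketch of what symmetric variation margin would mean day to day: whichever side the position moved against pays the other side the full amount of the move. The party labels and prices are made up, and real margin agreements have thresholds, haircuts and timing rules that this ignores.

```python
def variation_margin_calls(marks, holder="fund", counterparty="bank"):
    """For each day, return (who must pay, how much), given the position's
    daily marked value from the holder's point of view."""
    calls = []
    for yesterday, today in zip(marks, marks[1:]):
        change = today - yesterday
        if change < 0:
            calls.append((holder, -change))        # holder lost money, holder pays
        elif change > 0:
            calls.append((counterparty, change))   # symmetric: counterparty pays
        else:
            calls.append((None, 0))
    return calls

# Position marked at 100, then 90, then 95, then 95 again.
calls = variation_margin_calls([100, 90, 95, 95])
```

Under the current asymmetric arrangement only the first kind of call (against the hedge fund) would be made; the proposal is that the second kind is made just as routinely.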

Occasionally, in risky environments, the initial margin of $x is increased, which causes a lot of unraveling, and possibly cascading waves of problems which set off a panic. We’d need to have rules about how often this can happen, to prevent the “symmetry of variation margin” rule from being bypassed with lots of initial margin modifications. The symmetry aspect should keep the margin contracts from allowing this to happen too often.

The overall goal would be to devise a system that would:

1. Encourage the posting and calling of (variation) margins,
2. Encourage sufficient sizing of initial margin,
3. Encourage early calls and liquidating if there is doubt that a variation margin call could be met, and
4. Simplify the bankruptcy rules on ownership of assets, especially for illiquid or complex assets.

The initial margin can be thought of as the dollar amount a price could move by between a margin call and it being paid. It should not be thought of as an asset for either party (and therefore the accounting of the various margins should be carefully considered, but I’m no accounting expert), and certainly should not be able to be recycled to buy more stuff, i.e. add to one’s leverage, or offered towards capital requirements. Moreover, if it is indeed symmetric, that would mean if a bank claims to only need to post n dollars in initial margin, then the hedge fund can turn around and use that same number for that same trade, at least up to an understood discount.

As for bankruptcy, we should start with the following. When a margin call is made by one side and it isn’t met, the person making the call:

1. keeps ALL the margin,
2. gets the security, and
3. is a (super-senior level of seniority) claimant to the variation margin they posted with the counterparty.

Moreover, rules 1 and 2 above do not go into a bankruptcy filing if one occurs (in particular, if the security is a swap, it’s just torn up). This is a key point since that means the bankruptcy is simplified and at the same time the security is back in liquid hands. Overall, this setup, or one like it, encourages hedge funds to margin call frequently (banks already do that), which is a good thing, and as described above is a further incentive to invest in liquid, non-complex securities, which in the end creates transparency.

The above idea doesn’t deal directly with desired property 2, and may well cause margins to be lower. One possibility to encourage margins to be of sufficient size would be to allow either party to “put” the security in question on to the other party at a cost of giving up the initial margin posted.

Idea #2: grow a separate system of utility deposit banks

Besides incredibly complicated bankruptcy filings with infinitely many counterparties, one of the major reasons those banks really are too big to fail is that they hold deposits, and the government doesn’t want people to worry that their life savings are at risk, causing a run on the banks and chaos. Another way to get around this, at least eventually, is to create new “utility banks” at the state level which do not trade securities (beyond very basic ones like interest rate swaps and treasuries), don’t take large risks, and have FDIC guarantees on savings.

In order to get consumers to switch to banks like this, the government should intentionally create incentives for people to transfer their deposits from “too big to fail” banks to these utility banks. A list of incentives could start with reasonable, transparent fees, and the eventual loss of FDIC insurance guarantee at non-utility banks. Then people who want to stay with risk-taking banks can do so knowing that, as long as bankruptcy laws eventually get simplified, the “too big to fail” guarantee will in fact be gone.

Moreover, another layer of separation between depositors and utility banks should be the requirement that, even with the restricted kinds of trades allowed for utility banks, they should be done in separate corporate entities (since banks are always a mishmash of many companies anyway).

This idea is not new, and can be seen for example in this article. In fact it is incredibly obvious: admit that what we have now is a guarantee for a get-out-of-jail card for greedy bankers, and transfer that guarantee to a banking system that we’ve created to be boring, along the lines of the post office.

Categories: finance, hedge funds

Bank accounting link

July 12, 2011 Cathy O'Neil, mathbabe Comments off

I wanted to share this link with you; it is both interesting and relevant to another post I’m working on (a follow up to this one) that will describe two ideas I’m contemplating regarding how to systematically change the way big banks are motivated to behave in the presence of the “too big to fail” guarantee.

Its goal is to describe how banks will behave in a given situation with a mortgage, but the thought process generalizes quite well to how banks behave in general, and in particular how accounting considerations trump utility to the depositors and even the long-term shareholders. It also explains, to those of us who were wondering, why Obama’s mortgage modification plan was never going to work.

Categories: finance, news, rant

Short Post!

July 11, 2011 Cathy O'Neil, mathbabe 8 comments

I’ve been told my posts are intimidatingly long, what with the twitter generation’s sound bite attention span. Normally I’d say, screw that! It’s because my ideas are so freaking nuanced they can’t be condensed to under a paragraph without losing their essence!

But today I acquiesce; here’s a short post containing at most one idea.

Namely, I’ve been getting pretty strong reactions online and offline regarding my post about whether an academic math job is a crappy job. I just want to set the record straight: I’m not even saying it’s a crappy job, I’m simply talking about someone else’s essay which describes it that way. But moreover, even if I were saying that, I would only be saying it’s crappy (which I’m not) compared to other jobs that very very smart mathy people could get. Obviously in the grand scheme of things it’s a very good job- safe working conditions, regular hours, well-respected, etc., and many people in this world have far crappier jobs and would love a job with those conditions. But relative to other jobs that math people could be getting, it may not be the best.

Many professors of math (you know who you are) have this weird narrow world view, that they feed their students, which goes something like, “if you want to be a success, you should be exactly like me (which is to say, an academic)”. So anyone who gets educated in a math department is apt to run into all these people who define success as getting tenure in an academic math department, and they just don’t know about or consider other kinds of gigs. It would be nice if there was a way to get a more balanced view of the pros and cons of all of the options.

Categories: finance, internet startup, math education, rant, women in math

Weekend Reading

July 8, 2011 Cathy O'Neil, mathbabe 1 comment

FogOfWar and I have compiled a short list of weekend reading for you that you may enjoy:

What’s the right way to think about China’s economy?
Is Japan’s “lost decades” a media myth?
Can I hear a FUCK YEAH for Elizabeth Warren? I feel a follow-up post coming on how much she rocks.
Get ready to be depressed by how few natural resources there really are.
This essay really pins Robert Rubin to the wall in a totally awesome way. I will add more in another post.
The Republicans are holding the entire nation for ransom over the possibility of default. Is it all political posturing? Or is it for the sake of the insanely shitty idea of a tax repatriation holiday? Here’s another article about this crappy idea; when Bloomberg makes you out as a selfish bastard then you know you’re a truly selfish bastard. I’m convinced that the politicians (and union leaders) arguing for this are just counting on the average person not understanding the actual issues well enough to know how evil it is (and how much kickback they must be getting). Another example of asymmetric information that really gets my goat.
I think it’s fair to say we all need a little more of this in our lives.

Categories: finance, FogOfWar, hedge funds, news, rant

Adding-up rules and Hockey Sticks

July 7, 2011 Cathy O'Neil, mathbabe 2 comments

So I’m at the math program HCSSiM, teaching for three weeks in a “workshop,” which means I am responsible for teaching 12 teenagers the basic language and techniques of math- things like induction, proof by contradiction, the pigeon-hole principle, and how to correctly use phrases like “without loss of generality we can assume…” and “the following is a well-defined function…”, as well as familiarity with basic group theory, graph theory, number theory, cardinality, and fun things like Pascal’s triangle.

It’s really beautiful, classical math, and the students are eager and fantastically bright. They are my temporary brood, and I adore them and feed them chocolate at evening problem sets.

It’s also a fine opportunity to do some silly math doodling just for fun, the only rules being you can’t use a computer to look anything up until you’re done, and you can only use the stuff your kids at the program already learned. I’m going to describe what my mom and I, and then a junior (Amber Verser) and senior (Benji Fisher) staff member at the math program, figured out in the last couple of days. It’s super cool and, it turns out, at least 400 years old.

One of the most common examples of proof by induction is the formula for the sum of the counting numbers up to n:

1 + 2 + 3 + … + n = n(n+1)/2

And then, once you figure that out, you move on to the next case:

1^2 + 2^2 + 3^2 + … + n^2 = n(n+1)(2n+1)/6.

If you’re really into it, you can put the next case on the problem set:

1^3 + 2^3 + 3^3 + … + n^3 = (n(n+1)/2)^2.
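Now that we're done and allowed to use a computer, here is a quick brute-force check of the three formulas above. It proves nothing, of course, but it's a cheap way to catch a mis-remembered formula before putting it on a problem set. The function name is just a label I picked.

```python
def power_sum(n, d):
    """1^d + 2^d + ... + n^d, added up the slow way."""
    return sum(i**d for i in range(1, n + 1))

for n in range(1, 50):
    assert power_sum(n, 1) == n * (n + 1) // 2
    assert power_sum(n, 2) == n * (n + 1) * (2 * n + 1) // 6
    assert power_sum(n, 3) == (n * (n + 1) // 2) ** 2
```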

Two obvious patterns are emerging when you add up successive dth powers up to n.

1. It’s a polynomial of degree d+1, and
2. The roots of the polynomial are symmetric about -1/2 (mom noticed this!).

How do you prove those two facts?

If you think it’s totally easy, stop reading now and give it a shot. There are about a million things you could try and none of them seem to work. I’ll wait.

…okay, let’s say you gave up, or already know, or don’t care. (Why are you reading still if you don’t care?!)

First let’s generalize the question to, if we add up values of some degree d polynomial for values i=0, 1, 2, …, n, then we want to prove the result is a degree d+1 polynomial in n. That this is equivalent to the first statement above is pretty easy to see by just re-arranging the terms of the double sum over i and over the terms of the polynomial in question. But it still seems like you need to know at least the answer to the question of what is a formula for 0^d + 1^d + 2^d + … + n^d, which is of course where we started.

But that’s where Pascal’s triangle comes in! We can generate Pascal’s triangle by the familiar “add up two consecutive numbers and put the answer below,” but we also can think of the element on the nth row and kth (tilted) column of Pascal’s triangle as the number of ways to choose k things from n things, which is referred to as “n choose k”, and where we start both the row and column counts at 0, not at 1. That definition satisfies the addition law because, if we have n things, we can label one as “special,” and then the choice of size k subsets of the n things divide into two categories: the size k subsets that contain the special guy and the ones that don’t. If they do, then we need only find k-1 other things in the remaining n-1 size set, and the number of ways to do that is given by the element on row n-1 and column k-1. If they don’t contain the special guy, we need to find k things in the remaining n-1 size set, and the number of ways to do that is given by the element on row n-1 and column k.
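The two descriptions of Pascal's triangle can be checked against each other in a few lines: build the rows with the addition rule, then compare every entry with the counting definition, for which Python's standard library provides `math.comb`. The function name `pascal_rows` is mine.

```python
from math import comb

def pascal_rows(n_rows):
    """Build the first n_rows rows using only the 'add the two above' rule."""
    rows = [[1]]
    for _ in range(n_rows - 1):
        prev = rows[-1]
        inner = [prev[k - 1] + prev[k] for k in range(1, len(prev))]
        rows.append([1] + inner + [1])
    return rows

rows = pascal_rows(10)
# Row n, column k (both counted from 0) should be "n choose k".
assert all(rows[n][k] == comb(n, k) for n in range(10) for k in range(n + 1))
```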

On the other hand, we also know a formula for the numbers in Pascal’s triangle: the guy on the nth row and kth column is given by a degree k polynomial in n, namely n!/(k!(n-k)!) = n(n-1)…(n-k+1)/k!. (This is because we can label all of the guys 1 through n, and just take the first k guys, and there are n! ways to label n things, but we don’t actually care about the order among the first k or among the last n-k.)

For example, in the second column, where we are looking at “n choose 2” for various n, we have the equation n(n-1)/2. This is a LOT like n^2 but has extra terms sticking on the end of lower order. When you’re looking at the third column, you’re working with the formula n(n-1)(n-2)/6, which is like the basic polynomial n^3 with extra stuff. In other words, the formula for “n choose k” is a degree k polynomial in n which we can think of as being a stand-in for n^k. Awesome.

The last ingredient is something called the “Hockey Stick Theorem,” which you gotta love just because of the name. It states that if we add up the values along a column, from the top of the rows down to the nth row, then the sum will be the number just below and to the right, and the entire picture will resemble a hockey stick.

The proof of the Hockey Stick Theorem is trivial- the answer is of course the sum of the two above it, and we have one in the sum already, but the other isn’t… but that other is the sum of the two above it, one of which is again already in the sum but the other isn’t… and you keep going until you get to the top edge of Pascal’s triangle, where the missing number is just 0.
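In symbols, the theorem says that adding “i choose d” for i = 0, 1, …, n gives “n+1 choose d+1”. Here is a small numerical check, again using `math.comb` (which returns 0 above the diagonal, matching the “missing number is just 0” step in the proof). The helper name is made up.

```python
from math import comb

def column_sum(d, n):
    """Add up column d of Pascal's triangle from row 0 down to row n."""
    return sum(comb(i, d) for i in range(n + 1))

for d in range(6):
    for n in range(20):
        # The blade of the hockey stick: one row down, one column to the right.
        assert column_sum(d, n) == comb(n + 1, d + 1)
```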

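Why does the Hockey Stick Theorem give us what we want? Going back to our generalized statement, we want to show the sum of values of a (any) degree d polynomial for i = 0, 1, 2, …, n is a degree d+1 polynomial. Well, use the dth column and make a hockey stick from the top to row n. Then the sum is on the (n+1)st row, in the (d+1)st column, which we know is a degree d+1 polynomial in n. That handles the polynomial “i choose d”; any other degree d polynomial is a linear combination of the “i choose k” polynomials with k at most d, so its sum is the same combination of hockey sticks, and the top one has degree d+1. Woohoo!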

One way of looking at this is that we were actually asking the wrong question: instead of asking what the sum of the dth powers is we should have perhaps been asking what the sum of the dth column of Pascal’s triangle is; in other words, there is a better basis for the vector space of polynomials than x^d, namely “x choose d”. In fact, if there were an agreement in the world that actually the “x choose d” polynomials should be the standard basis, (by the way, these basis polynomials would be called “Pascalinomials”!) then the hockey stick theorem would be the last word on how do those things add up. As it stands, to figure out the actual formula for the sum of the dth powers for i=0, 1, 2, …, n, we need to write the first row of the change-of-basis matrix from one basis to the other.
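The change of basis can be carried out mechanically. One standard way (Newton's forward difference formula, which I'm bringing in here; it isn't how we found things at the program) is to take repeated differences of the values 0^d, 1^d, …, d^d: the first entry at each stage is the coefficient of “i choose k”. Once i^d is written in that basis, the hockey stick sums each piece. Function names are hypothetical, and the sketch assumes d is at least 1.

```python
from math import comb

def binomial_coefficients(d):
    """Return c_0..c_d with i**d == sum(c_k * comb(i, k)) for all integers i >= 0."""
    values = [i**d for i in range(d + 1)]
    coeffs = []
    while values:
        coeffs.append(values[0])
        # Replace the list by its successive differences.
        values = [b - a for a, b in zip(values, values[1:])]
    return coeffs

def power_sum_via_pascal(n, d):
    """0^d + 1^d + ... + n^d, using one hockey stick per basis polynomial."""
    return sum(c * comb(n + 1, k + 1)
               for k, c in enumerate(binomial_coefficients(d)))
```

For d = 2 the coefficients come out as 0, 1, 2, which says i^2 = (i choose 1) + 2(i choose 2), and so the sum of squares up to n is (n+1 choose 2) + 2(n+1 choose 3).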

As for the second question, we simply need to extend the definition of the sum F(n) of dth powers from 0 to n to the case where n is negative, by iteratively using the relation:

F(n) = F(n-1) + n^d, or

F(n-1) = F(n) – n^d.

Then we have F(0) = 0, F(-1) = 0, F(-2) = (-1)^(d+1), F(-3) = (-1)^(d+1) – (-2)^d = (-1)^(d+1)(1^d + 2^d) …, and it’s easy to prove that, for any n,

F(n) = (-1)^(d+1)F(-n-1).

This means that if we have a root at -1/2 + a, we also have a root at -1/2 – a = -(-1/2 +a) -1.
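The reflection identity is easy to test numerically by extending F to negative n exactly as described, running the relation backwards from F(0) = 0. This is a sanity check under the assumption that d is at least 1, not a proof, and the function name is made up.

```python
def extended_power_sum(n, d):
    """F(n) for any integer n, from F(0) = 0 and F(m) = F(m-1) + m**d."""
    if n >= 0:
        return sum(i**d for i in range(1, n + 1))
    # Backwards: F(m-1) = F(m) - m**d, applied for m = 0, -1, ..., n+1.
    return -sum(m**d for m in range(n + 1, 1))

for d in range(1, 7):
    for n in range(-10, 11):
        assert extended_power_sum(n, d) == (-1) ** (d + 1) * extended_power_sum(-n - 1, d)
```

Since both sides are polynomials in n that agree at infinitely many integers, they agree everywhere, which is what lets us talk about non-integer roots like -1/2 + a.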

Categories: math education

Does an academic job in math really suck?

July 6, 2011 Cathy O'Neil, mathbabe 10 comments

My cousin recently sent me a link to this article about women in science. Actually it’s really about jobs in science, and how much they suck, and how women are too practical to want them. It’s definitely interesting- and pretty widely read, as well, although I’d never seen it. It makes a few excellent points, especially about the crappy amount of money and feedback one gets as an academic, two issues which were definitely part of my personal decision to leave my academic career.

I think his overall argument, though, is simultaneously too practical-minded and not practical-minded enough. And although his essay is about science, I’ll concentrate on how it relates to math.

It’s too practical in that it doesn’t really understand the attraction- the nearly carnal desire- people have to math. It essentially assumes that after some amount of time, maybe 20 years, people will lose interest in their subject, perhaps because they are getting poorly paid.

Is this really true? Maybe for some people this is true, but the nerds I know are nerds for life – they don’t wake up one day thinking math isn’t cool after all. And from what I know about people, they acclimate pretty thoroughly to their standard of living by the time they are 40.

It’s not practical enough, though, because it doesn’t get at one of the most important reasons women leave math, namely because they are married and maybe have kids and they simply can’t be that person who moves across the country for a visiting semester in Berkeley because their husband has a job already and it’s not in Berkeley.

[As a side note, if someone wants to actually encourage women in math, and they are loaded, I would encourage them to set up a fund that would pay costs for quality childcare and airplane tickets for kids when women go to math conferences. You don’t even need to help organize the babysitting, just pay for it. It would help out a lot of young women and free them up to go to way more conferences, evening the playing field with young men.]

In fact there are plenty of women who are super nerdy and would love to go do math across the country, but when it comes to choosing between that lifestyle and having a family life, they will choose the family life more times than not. Really it’s the “nomadic monk” system itself that is crappy for women at that moment, even if they are theoretically happy to be a poor nerd for the rest of their lives.

I have another complaint (which will make it sound like I don’t like the essay but actually I do). It says that people in science don’t have the ability to switch careers, essentially because they don’t have the money. But that’s really not true, at least in math, and I’m a testament to the possibility of switching careers. One thing a nerd is really good at is learning new things quickly.

I also thought that there was something missing about the alternative jobs he mentions, in industry or otherwise, which is that, yes you do get paid better outside of academics, but on the other hand pretty much any nonacademic job requires you to have a boss, which can be really fine or really horrible, and restricts your vacation time to 3 or 4 weeks. By contrast the quality of life as an academic is, if not luxurious, at least much more under one’s control.

Categories: math education, women in math

Glucose Prediction Model: absorption curves and dirty data

July 5, 2011 Cathy O'Neil, mathbabe 7 comments

In this post I started visualizing some blood glucose data using python, and in this post my friend Daniel Krasner kindly rewrote my initial plots in R.

I am attempting to show how to follow the modeling techniques I discussed here in order to try to predict blood glucose levels. Although I listed a bunch of steps, I’m not going to be following them in exactly the order I wrote there, even though I tried to make them in more or less the order we should at least consider them.

For example, it says first to clean the data. However, until you decide a bit about what your model will be attempting to do, you don’t even know what dirty data really means or how to clean it. On the other hand, you don’t want to wait too long to figure something out about cleaning data. It’s kind of a craft rather than a science. I’m hoping that by explaining the steps the craft will become apparent. I’ll talk more about cleaning the data below.

Next, I suggested you choose in-sample and out-of-sample data sets. In this case I will use all of my data for my in-sample data since I happen to know it’s from last year (actually last spring) so I can always ask my friend to send me more recent data when my model is ready for testing. In general it’s a good idea to use at most two thirds of your data as in-sample; otherwise your out-of-sample test is not sufficiently meaningful (assuming you don’t have that much data, which always seems to be the case).
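As a rough illustration of the two-thirds rule of thumb, here is a minimal Python sketch of a chronological split. The function name `split_in_out` and the choice to cut by position (earlier readings in-sample, later readings out-of-sample) are my assumptions for the example, not something fixed by the modeling steps.

```python
def split_in_out(samples):
    """Split time-ordered samples into in-sample and out-of-sample parts.

    Assumption: samples are already sorted by time, so the out-of-sample
    part is strictly later than the in-sample part.
    """
    # Integer arithmetic keeps the in-sample part at no more than two thirds.
    cut = (2 * len(samples)) // 3
    return samples[:cut], samples[cut:]


in_sample, out_of_sample = split_in_out(list(range(9)))
```

Cutting by time rather than at random keeps the out-of-sample test honest, since the model never gets to see readings from the period it is being tested on.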

Next, I want to choose my predictive variables. First, we should try to see how much mileage we can get out of predicting future blood glucose levels with past glucose levels. Keeping in mind that the previous post had us using log levels instead of actual glucose levels, since then the distribution of levels is more normal, we will actually be trying to predict log glucose levels (log levels) knowing past log glucose levels.

One good stare at the data will tell us there’s probably more than one past data point that will be needed, since we see that there are pretty consistent moves upwards and downwards. In other words, there is autocorrelation in the log levels, which is to be expected, but we will want to look at the derivative of the log levels in the near past to predict the future log levels. The derivative can be computed by taking the difference of the most recent log level and the one before it.
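To make the log level and derivative idea concrete, here is a small Python sketch. The names `log_levels` and `lagged_features` and the toy readings are made up for illustration; the sketch just pairs each reading with the previous log level and the previous difference, which are the two inputs described above.

```python
import math


def log_levels(glucose):
    """Convert raw glucose readings to log levels."""
    return [math.log(g) for g in glucose]


def lagged_features(logs):
    """Build (last level, last difference, target) rows, working causally.

    For each time t >= 2 the predictors only use readings before t:
    the most recent log level and the difference between it and the
    one before it (a discrete stand-in for the derivative).
    """
    rows = []
    for t in range(2, len(logs)):
        level = logs[t - 1]
        diff = logs[t - 1] - logs[t - 2]
        rows.append((level, diff, logs[t]))
    return rows


# Hypothetical readings, each 10% higher than the last.
readings = [100.0, 110.0, 121.0, 133.1]
rows = lagged_features(log_levels(readings))
```

Rows like these are what you would feed into a regression, with the last element of each row as the value to predict.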

Once we have the best model we can with just knowing past log levels, we will want to add reasonable other signals. The most obvious candidates are the insulin intakes and the carb intakes. These are presented as integer values with certain timestamps. Focusing on the insulin for now, if we know when the insulin is taken and how much, we should be able to model how much insulin has been absorbed into the blood stream at any given time, if we know what the insulin absorption curve looks like.

This leads to the question of, what does the insulin (rate of) absorption curve look like? I’ve heard that it’s pretty much bell-shaped, with a maximum at 1.5 hours from the time of intake; so it looks more or less like a normal distribution’s probability density function. It remains to guess what the maximum height should be, but it very likely depends linearly on the amount of insulin that was taken. We also need to guess at the standard deviation, although we have a pretty good head start knowing the 1.5 hours clue.
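Here is a minimal Python sketch of that guess, assuming the rate of absorption follows a normal density centered 1.5 hours after intake and scaled linearly by the dose. The standard deviation of 0.5 hours is purely a placeholder I picked for the example, and the function names are hypothetical; the real value would have to be fit or guessed from the data.

```python
import math

PEAK_HOURS = 1.5   # from the "maximum at 1.5 hours" clue
SIGMA_HOURS = 0.5  # assumption: placeholder width, not a measured value


def absorption_rate(hours_since_dose, units, peak=PEAK_HOURS, sigma=SIGMA_HOURS):
    """Rate of insulin absorption at a given time after one dose.

    Bell-shaped in time and linear in the dose. Nothing is absorbed
    before the dose is taken.
    """
    if hours_since_dose < 0:
        return 0.0
    z = (hours_since_dose - peak) / sigma
    return units * math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def total_absorption_rate(now_hours, doses):
    """Sum the contributions of several (time_taken_hours, units) doses."""
    return sum(absorption_rate(now_hours - t, u) for t, u in doses)
```

With these placeholder numbers a tiny bit of the bell curve falls before time zero and gets cut off, which is one reason to treat the width as something to tune rather than trust.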

Next, the carb intakes will be similar to the insulin intake but trickier, since there is more than one type of carb and different types get absorbed at different rates, but are all absorbed by the bloodstream in a vaguely similar way, which is to say like a bell curve. We will have to be pretty careful to add the carb intake model, since probably the overall model will depend dramatically on our choices.

I’m getting ahead of myself, which is actually kind of good, because we want to make sure our hopeful path is somewhat clear and not too congested with unknowns. But let’s get back to the first step of modeling, which is just using past log glucose levels to predict the next glucose level (we will later try to expand the horizon of the model to predict glucose levels an hour from now).

Looking back at the data, we see gaps and we see crazy values sometimes. Moreover, we see crazy values more often near the gaps. This is probably due to the monitor crapping out near the end of its life and also near the beginning. Actually the weird values at the beginning are easy to take care of- since we are going to work causally, we will know there had been a gap and the data just restarted, so we will know to ignore the values for a while (we will determine how long shortly) until we can trust the numbers. But it’s much trickier to deal with crazy values near the end of the monitor’s life, since, working causally, we won’t be able to look into the future and see that the monitor will die soon. This is a pretty serious dirty data problem, and the regression we plan to run may be overly affected by the crazy crapping-out monitor problems if we don’t figure out how to weed them out.
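The easy half of this, ignoring readings for a while after a restart, can be sketched in a few lines of Python. The thresholds here (a gap is anything over 10 minutes, and the warm-up lasts 30 minutes) are assumptions I made up for the example, as is the name `trusted_mask`; the actual warm-up length still has to be determined from the data.

```python
def trusted_mask(timestamps, max_gap=10.0, warmup=30.0):
    """Flag which readings to trust, using only past information.

    timestamps: reading times in minutes, sorted in increasing order.
    A jump larger than max_gap counts as a monitor restart, and readings
    are ignored until warmup minutes after the most recent restart.
    The very first reading is treated as a restart too.
    """
    mask = []
    restart = None
    prev = None
    for t in timestamps:
        if prev is None or t - prev > max_gap:
            restart = t  # data just (re)started
        mask.append(t - restart >= warmup)
        prev = t
    return mask


# Hypothetical timestamps: a reading every 5 minutes, with one long gap.
times = list(range(0, 45, 5)) + list(range(100, 140, 5))
mask = trusted_mask(times)
```

Note that this does nothing about the harder problem of bad values just before a gap, since at that point we don’t yet know the gap is coming.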

There are two things that may help. First, the monitor also has a data feed which is trying to measure the health of the monitor itself. If this monitor monitor is good, it may be exactly what we need to decide, “uh-oh the monitor is dying, stop trusting the data.” The second possible saving grace is that my friend also measured his blood glucose levels manually and inputted those numbers into the machine, which means we have a way to check the two sets of numbers against each other. Unfortunately he didn’t do this every five minutes (well actually that’s a good thing for him), and in particular during the night there were long gaps of time when we don’t have any manual measurements.

A final thought on modeling. We’ve mentioned three sources of signals, namely past blood glucose levels, insulin absorption forecasts, and carbohydrate absorption forecasts. There are a couple of other variables that are known to affect the blood glucose levels, namely the time of day and the amount of exercise that the person is doing. We won’t have access to exercise, but we do have access to timestamps. So it’s possible we can incorporate that data into the model as well, once we have some idea of how the glucose is affected by the time of day.

Categories: data science, open source tools

Cookies

July 4, 2011 Cathy O'Neil, mathbabe 9 comments

About three months ago I started working at an internet company which hosts advertising platforms. It’s a great place to work, with a bunch of fantastically optimistic, smart people who care about their quality of life. I’m on the tech team along with the team of developers which is led by this super smart, cool guy who looks like Keanu Reeves from the Matrix.

I’ve learned a few things about how the internet works and how information is collected about people who are surfing the web, and the bottom line is I clear my cookies now after every session of browsing. Now that I know the ways information travels the risks of retaining cookies seem to outweigh the benefits. First I’ll explain how the system works and then I’ll try to make a case for why it’s creepy, and finally, why you may not care at all.

Basically you should think of yourself, when you surf the web, as analogous to someone on the subway coming home from Macy’s with those enormous red and white shopping bags. You are a walking advertisement for your past, your consumer tastes, and your style, not to mention your willingness to purchase. Moreover, beyond that, you are also carrying around information about your political beliefs, religious beliefs, and temperament. The longer you browse between cookie cleanings, the more precise a picture you’ve painted of yourself for the sites you visit and for third parties (explained below) who get their hands on your information.

Just to give you a flavor of what I’m talking about, you probably are already aware that when you go to a site like, say, Amazon, the site assigns you a cookie to recognize you as a guest; when you return a week later it knows you and says, “Hi, Catherine!”. That’s on the low end of creepy since you have an account with Amazon and it’s convenient for the site to not ask you who you are every time you visit.

However, you may not be aware that Amazon can also see and parse the cookies that other sites, like Google (correction: a reader has pointed out to me that Google doesn’t let this happen, sorry. I was getting confused between the cookie and the “referring url”, which tells a site where the user has come from when they first get to the site. That does contain Google search terms), places on your web signature. In other words Amazon, or any other site that knows how to look, can figure out what other sites’ label of you says. Some cookies are encrypted but not all of them, and I think the general rule is to not encrypt- after all, the people who have the tools to read the cookies all benefit from that information being easy to read. From the perspective of Google, moreover, this information is helping improve your user experience. It should be added that Google and many other companies give you the option of opting out of receiving cookies, but to do so you have to figure out it’s happening and then how to opt out (which isn’t hard).

One last layer of cookie collection is this: there are other companies which lurk on websites (like Amazon, although I’m not an expert on exactly when and where this happens) which can also see your cookies and tag you with additional cookies, or even change your existing cookies (this is considered rude but not prevented). This is where, for me, the creep factor gets going. Those third parties certainly have less riding on their brand, since of course you don’t even see them, so they have less motivation to act honorably with the information they collect about you. For the most part, though, they are just looking to see what kind of advertisement you may be weak for and, once they figure it out, they show you exactly that model of showerhead that you searched for three weeks ago but decided was too expensive to buy. If you want to stop seeing that freaking showerhead popping up everywhere, clear thy cookies.

Here’s why I don’t like this; it’s not about the ubiquitous showerhead, which is just annoying. Think about rich people and how they experience their lives. I touched on this in a previous post about working at D.E. Shaw, but to summarize, rich people think they are always right, and that’s a pretty universal rule, which is to say anyone who becomes rich will probably succumb to that pretty quickly. Why, though? My guess is that everyone around them is aware of their money and is always trying to make them happy in the hope that they at some point could have some of that money. So they effectively live in a cocoon of rightness, which after a while seems perfectly logical and normal.

How that concept manifests itself in this conversation about cookies is that, in a small but meaningful way, that’s exactly what happens to the user when he or she is browsing the web with lots of cookies. Every time Joe encounters a site, the site and all third-party advertisers have the ability to see that Joe is a Republican gun-owner, and the ads shown to Joe will be absolutely in line with that part of the world. Similarly the cookies could expose Dan as a liberal vegetarian and he sees ads that never shake his foundations. It’s like we are funneled into a smaller and smaller world and we see less and less that could challenge our assumptions. This is an isolating thought, and it’s really happening.

At the same time, people sometimes want to be coddled, and I’m one of those people. Sometimes I enjoy it when my favorite yarn store advertises absolutely gorgeous silk-cashmere blends at me, or shows me a rant against greedy bankers, and no I’d rather not replace them with Viagra ads. So it’s also a question of how much this matters. For me it matters, but I also like New York City because it is dirty and gritty and all these people from all over the world live there and sweat on each other on the subway and it makes me feel like part of a larger community- I like to mix it up and have it mixed up.

I’d also like to mention another kind of reason you may want to clear your cookies: you get better deals. A general rule of internet advertising is that you don’t need to show good deals to loyalists. So if you don’t have cookies proving you have an account on Netflix, you may get an advertisement offering you three free months of membership. Or if you want to get more free articles on the New York Times website, clear your cookies and the site will have no idea who you are. There are many examples like this.

Lastly, I’d like to point out that you probably don’t need to worry about this. After all, many browsers will clear your cookies but also clear your usernames and passwords, and you may never be able to get some of those back. And maybe you don’t mind being coddled while online. Maybe it’s the one place where you get to feel understood. Why question that?

Categories: internet startup, news

Fair Foods

July 3, 2011 Cathy O'Neil, mathbabe 3 comments

This post will only be indirectly quantitative, and not a rant, so I guess that means I will have to either apologize or change my mission statement. Sorry. Oh and by the way I do have lots of ideas for quantitative blogs coming up, topics to include:

clear your cookies! how internet companies track your every click
update on the diabetes model
is being a mathematician just a crappy job?
shout-outs to other nerd bloggers who are sending me readers

So yesterday I loaded up the (rental) car to the brim, with my mom, my two older sons, a guitar (for me) and an air conditioning unit (for my mom), and drove out to Amherst for the math program I’m teaching in for three weeks.

Before I left I visited my friend Nancy at Fair Foods in Dorchester.

I drove to her house early, getting there at maybe 8:30am. She wasn’t home- she had me meet her at a church near Codman Square, where she was making a drop. When I got there I helped her unload a van full of maybe 40 or so boxes of vegetables and fruit, with a few 50-pound bags of carrots and potatoes. She got on the van and handed me the boxes and I carried them over to a sidewalk, while the woman, Marie, who was accepting the drop, carried some smaller boxes into the basement. Nancy introduced me to Marie as her daughter, and introduced Marie to me as the beautiful, wise Haitian woman who was a professional cook and would turn all of these vegetables into a delicious feast for her congregation. Nancy and Marie talked about the church, and the fact that it was shared between two different congregations, one Haitian immigrant and one African-American, and how the church was run.

After a while it didn’t seem like Marie was going to get the help she was expecting to carry the larger boxes into the basement, so Nancy and I moved all of the boxes down there, temporarily rigging a window to be a de facto dumb waiter to avoid three corners and some stairs. There were tomatoes, white potatoes, red potatoes, carrots, ugli fruit, limes, lettuce, string beans, wax beans, and others I can’t remember. Almost all of these were in great condition, but some needed sorting before going into the feast. Marie asked for corn for the 4th of July- since the food that is collected is surplus, a given request may be hard to fill, especially around a holiday, which Nancy explained. But then she said that if we got corn we would call Marie right away.

After we finished unloading the van I was soaked in sweat; it reminded me of how incredibly strong I’d gotten working one summer for Nancy, unloading trucks all day (as well as loading them at the Chelsea Produce Market every morning at 7) and driving around the city in the big yellow truck making drops to churches, senior centers, and youth centers, and holding dollar-a-bag sites in vacant parking lots and sidestreets. That was in 1992; and Nancy, who was born in 1950, has been doing the program ever since, with various people’s help.

Nancy mentioned that before I’d gotten there she had gone into the church and listened to the singing and the praying of the Haitian congregation, and that it had been seriously beautiful. Marie insisted on us coming inside. We sat in the pews as the woman leading the small prayer group of about 8 people, mostly women, was talking to one woman who was clearly in distress. Perhaps she was in mourning. They were speaking in Creole, which I don’t understand (although I know some French so every now and then I can pick up a word or two), but it was viscerally moving how kindly the leader was speaking to the sad woman seated in front of her. After she allowed that woman to finish, she looked up and welcomed us in English and asked us our names. Marie explained in Creole something about us, probably that we had just brought in the food for the July 4th meal, and we were instantly welcomed by the entire group. After that they told us they were wrapping up their prayer session and would stand and have a group prayer.

Everyone stood, except for the mourning woman who was holding her head in her hands. And at once everyone started praying, but the interesting thing was they were all saying different prayers, and it was fascinating to watch and listen to how they could be both praying together and praying individually. I could make out a few words from Marie’s prayer, which near the beginning was quiet and included lots of words like “please” and “hope”, but which, like everyone else’s, became louder and more fervent and contained more words like “thank you” and “hallelujah”. It ended by everyone holding their hands up to the front of them and giving thanks. Everyone ended at exactly the same time.

After the prayer group ended, there were lots of hugs and hand shaking. Many of the women wanted to talk to Nancy and she probably ended up hugging and being hugged by everyone there. There was a deep human connection inside that little church, which is pretty different from my normal assumptions about piousness and rules-based religions. Connection and empathy.

After we left the church we went to a playground and sat and had coffee together, and Nancy laid something down that was pretty thick. She talked about her disillusionment with her generation- the hippy generation- how they made all these promises but then didn’t follow through- the words she uses are “didn’t apply themselves”. She talked about having faith in her generation up to the “We Are the World” moment, and then waiting, and seeing nothing come out of it, and how bitter that had made her feel, how disappointed. She said it took her years to get over that, and now she feels like those years of her life, until recently in fact, are in some sense unaccounted for, both because she’s been sick and because she was somewhat paralyzed with anger.

She went on to say that she’s in a new phase now, she’s accepted the lazy fact of life that the people she was counting on, if anything, have made the world a worse place, not a better one, but that she’s decided to love them and love the world anyway, and to continue to make human connections with individuals, because it makes her have faith in a different way, a more diffuse but a stronger faith that won’t be disappointed.

It’s interesting to me that Nancy would ever describe her life as unaccounted for or her feelings as bitter. When I met her in 1989, she had been diagnosed with MS and lived in a huge old house with very little working anything (and what was working she’d installed herself- wired the electricity and installed plumbing). She had a great Dane and a broken-down donated truck, and when I came to her we spent the whole night cleaning out and reorganizing the truck. Whenever the truck’s insurance was due, or the phone was about to be cut off, we’d get a check for $50 and it would be a miracle, and I always felt like if I was ever going to believe in something it would be because of her.

I fell in love with her and with her approach to problem solving- namely, do the right thing, and go figure how to with bare knuckles and sweat. Over the years she’s been better or worse off with her health, but she’s never given up and, to be honest, I never sensed bitterness from her. Maybe these are relative notions, that bitterness from her is like frustration from someone else. Unaccountability from the woman who moves tons of food a week, that will otherwise be thrown away, into the homes of impoverished, mostly immigrant households, who know her and appreciate her act of kindness and take part in that act, would mean… what? to other people. Hard to say.

Categories: news, rant

Did you have a happy childhood?

July 2, 2011 Cathy O'Neil, mathbabe 6 comments

For whatever reason, I’ve been thinking about my childhood recently. Partly it’s the post I wrote about why I chose to call myself “mathbabe”, partly it’s an old essay of Jonathan Franzen’s that got me all riled up (in a good way). Plus I’m traveling to the math camp of my youth to teach, and stopping on the way in Harvard Square at my parents’ house; that’s enough to make you reconsider your memories in short order.

I have never understood what people mean when they talk about carefree, happy childhoods. I think I’ve always assumed this to be some kind of ironic joke, or maybe a plastered-over memory, a convenient approach to pain management. While it’s true that children have fewer responsibilities than grown-ups, it’s really not the responsibilities of adulthood that weigh me down (says the woman with three kids), or ever did. For me it was the constant awareness of my helplessness and impotence, my inability to decide my own fate, my feeling of having to wait forever for freedom, that got to me. I was also teased, but not relentlessly, and I did have friends, and moreover I wasn’t thought of (I don’t think) as a worrying child. From the outside people may have imagined me as a normal albeit nerdy kid. However, I always identified with the oppressed, and I had a keen sense of fairness which was constantly being challenged by reality. When we studied the “Manifest Destiny” in third grade, it killed me to think of the white man’s assumptions. When I saw a kid getting bullied at school, it tore me up that I didn’t know how to put an end to it and no teachers bothered. The list goes on, you get the idea. Also, I had an internal standard that was painfully high- I wanted to become a musician, a pianist, but never thought I’d be good enough, and I questioned my creativity, since what I really wanted to do was compose. When I decided to become a mathematician I started worrying about my thesis (I was 16). By the way, lest people get the wrong impression, my parents never put pressure on me to play music (in fact they openly discouraged me since it was expensive) and thought my worrying about my thesis was downright amusing. This was all internally generated. In short, I was a struggler, at best of times a striver, but never ever carefree and happy.

I have always been attracted to other people who struggle and strive; for the most part my closest friends are, like me, in constant flux with respect to their identities and their goals and even the interpretation of the most basic cultural assumptions like toenail polish and the role of the FDIC.

This brings me to the Franzen essay, where he talks about being isolated in childhood as a reader, and spending the rest of your life trying to find and form a community with other isolated readers. As an aside, Franzen makes a distinction in this essay between isolated readers and isolated math or technology nerds. He basically said that math nerds are isolated because they are autistic, incapable of social interaction, whereas readers are isolated because they feel more deeply and can’t relate to artificiality. I’m not sure whether to argue that math nerds aren’t all autistic or just count myself as both a reader and a math nerd and be proud of out-isolating Franzen, no easy task. Basically, I agree with Franzen. From my perspective upon meeting someone I am always looking for that inner torture, the hallmark of an examined life. It doesn’t make you happy, perhaps, but it makes you real, and moreover interesting.

But here’s the thing, I was blindsided this week by the discovery that my husband, of all people, had a happy childhood. He insists on this, even when I ask him if perhaps he’s misremembering his inner turmoil– he claims no. He moreover avers that, at the age of 12, he decided to become a mathematician and has never looked back, never once questioned that decision. Is this possible? That I’m married to a man who had a happy childhood? For all I know, it is true and moreover it may be exactly why I have a happy marriage. Maybe strugglers need to be married to non-strugglers to maintain some kind of balance. I don’t know, I’m still thinking about it. It does explain something that I’ve always been confused by, though- when my husband comes across an ethical or moral decision, he does so painlessly and makes a decision instantaneously. I now think this is because he just doesn’t think about things like that in between those moments, and so he’s got a clarity of consciousness which allows him to make snap decisions. When I come across such dilemmas, I am much more confused and ambivalent. I usually decide it’s a matter of opinion. I’m wondering if it’s this element of our differences that makes our marriage work.

Categories: rant

Asymmetrical Information

July 2, 2011 Cathy O'Neil, mathbabe 2 comments

From my experience, there are only a few basic kinds of trading models encountered on Wall Street. These are:

chasing dumb money, which I’ve described already,
asymmetrical information, which I want to talk about today,
market-making,
providing “insurance”,
seasonality, which I’ve touched on, and
taking advantage of macroeconomic misalignment (think Soros’s pound trade)

In other posts I intend to go into more detail in the above categories, as well as devote a post to the question of how trading models fail (there also seem to be only a few basic categories for that).

Finance nerd readers: please tell me if I’m missing something!

The concept of asymmetrical information is incredibly simple: I know more than you so I can make a more informed assessment of the value of some underlying contract. This could mean I know inside information about a company and trade before the announcement (illegal but common), or that I know the likelihood of bankruptcy for a company is higher than the market seems to think, or that the underlying mortgages of a packaged security are likely to default.

I could go on, and probably will in another post, but I’d like to make a very basic point, which is this: a lot of money is made every day via asymmetrical information, and in particular there’s a major motivation to obfuscate data in order to create asymmetry. One of the missions of this blog is to uncover and expose major, unreasonable examples of obfuscated information that I know about.

At this point it’s critical to differentiate between two things which typically get confused by non-nerds. Namely, the difference between a technical but thorough explanation and true information obfuscation. A technical explanation, if thorough, can be worked through eventually by someone with enough expertise, or someone who is motivated enough to get that expertise, whereas true information obfuscation just doesn’t provide enough details to really know anything.

The worst is when you are given pretty specific technical information, but which only explains half of the story. This leads to an imprecise false sense of security, which I suspect underlies most of the very large mistakes we’ve seen in finance in the last few years.

For example, let’s talk about the bank stress tests in the United States in 2009. They were conducted in two distinct phases. In the first, a bunch of economists were asked to write down two scenarios. The first was kind of a prediction of how 2009 and 2010 would play out, and the second was a more negative scenario. Okay so far, even though economists aren’t all that pessimistic as people (more on this in another post). The scenarios were averaged in some way and then publicly posted. The good news is, if you thought the scenarios were unrealistic, you’d at least know how to complain about them. The bad news is that they are pretty vague, only really specifying the GDP growth and the unemployment rate.

In the second phase, the banks were allowed to predict the impact of those two scenarios on their portfolios using their own internal models, which were not made public. Here’s the white paper if you don’t believe me. So, in the name of asymmetrical information, why is this a problem? Here are a few reasons:

Banks had bad internal risk models
Banks had clear motivation to mark their portfolios to their advantage
The fact that their methods weren’t made public gives them ample cover to do whatever they wanted

There are two reasons I say that banks had bad internal risk models. The first reason is the one you know about already- they evidently bought a whole bunch of toxic securities leading up to 2008 and seemed to have no idea about the risks. But moreover, my personal experience working in the risk field is that banks used external risk modeling companies as a rubber stamp, essentially to placate those worrywarts who insisted on obsessing about risks. To be more precise without getting anyone into trouble, it was commonplace for banks to not notice when a model at a risk software company had very basic problems and would spit out nonsensical numbers. It was almost as if you couldn’t trust the banks to look at their risk numbers at all. This isn’t true of every bank at all times, but as a general rule when models had major problems it was hedge funds, not banks, who would bring attention to those problems. Moreover, the banks did not seem to have internal risk modeling across their desks. In other words, a trading desk which trades a certain kind of instrument may have some risk monitoring in place (mostly to bound the amount of trading of that type), but when it comes to understanding systemic risk across instrument types, the external risk companies were the source.

It is obvious that banks were motivated to mark their portfolios to their advantage. The ultimate result of bank stress tests was possible additional capital requirements, which they clearly wanted to avoid. This temptation meant it would benefit them to make every assumption of their risk model liberal to their cause.

Finally, they didn’t expose their methods- not even to explain in general terms how they dealt with, say, interest rate risks across instrument types. This meant that only the Fed people involved got to decide how honest the banks were. This is the opposite of what is needed in this situation. There is no reasonable need to keep these methodologies secret from the general public, since it is we who are on the hook if their methods are flawed, as we have seen.

Here’s where I admit that it’s actually really hard to come up with good methodologies to measure impact of vague GDP growth and unemployment estimates. But that admission is only going to add to my rant, because my overall point is that the instruments themselves have been created to make that hard. They are examples, especially tranched mortgage-backed securities but others as well, of intentional obfuscation for the sake of creating asymmetrical information. Instead of living in a world where banks who own things like this are allowed to measure them at their whim, and benefit from that obfuscation, we need to create a system where they are penalized for having illiquid or complex instruments.

And here’s where I admit that I’m not an expert on all of these instruments – some would say I don’t have the right to talk about how they should be assessed. Yet again, I choose to use that fact to add to my rant: if, after working for four years in finance as a quant at a hedge fund and then a researcher and account manager at a risk company, I can’t have an opinion about how to assess risk, then the system is too freaking complicated.

Categories: finance, hedge funds

Women on a board of directors: let’s use Bayesian inference

June 30, 2011 Cathy O'Neil, mathbabe 6 comments

I wanted to show how to perform a “women on the board of directors” analysis using Bayesian inference. What this means is that we need to form a “prior” on what we think the distribution of the answer could be, and then we update our prior with the data available. In this case we simplify the question we are trying to answer: given that we see a board with 3 women and 7 men (so 10 total), what is the fraction of women available for the board of directors in the general population? The reason we may want to answer this question is that then we can compare the answer to other available answers, derived other ways (say by looking at the makeup of upper level management) and see if there’s a bias.

In order to illustrate Bayesian techniques, I’ve simplified it further to be a discrete question. So I’ve pretended that there are only 11 answers you could possibly have, namely that the fraction of available women (in the population of people qualified to be put on the board of directors) is 0%, 10%, 20%, …, 90%, or 100%.

Moreover, I’ve put the least judgmental prior on the situation, namely that there is an equal chance for any of these 11 possibilities. Thus the prior distribution is uniform:

We have absolutely no idea what the fraction of qualified women is.

The next step is to update our prior with the available data. In this case we have the data point that there is a board with 3 women and 7 men. So we are sure that there are some women and some men available, which means the updated probabilities of there being 0% women or 100% women should both be zero (and we will see that this is true). Moreover, we would expect to see that the most likely fraction will be 30%, and we will see that too. What Bayesian inference gives to us, though, is the relative probabilities of the other possibilities, based on how likely the data is under each of them. So for example if we are assuming for the moment that 70% of the qualified people are women, what is the likelihood that the board ends up being 3 women and 7 men? We can compute that as (0.70)^3*(0.30)^7. We multiply that by 1/11, the probability that 70% is the right answer (according to our prior) to get the “unscaled posterior distribution”, or the likelihoods of each possibility. Here’s a graph of these numbers when I do it for all 11 possibilities:

We learn the relative likelihoods of the outcome "3 out of 10" given the various ratios of women

In order to make this a probability distribution we need to make sure the total adds up to 1, so we scale to get the actual posterior distribution:

We scale these to add up to 1

What we observe is, for example, that it’s about twice as likely for 50% of women to be qualified as it is for 10% of women to be qualified, even though those answers are equally distant from the best guess of 30%. This kind of “confidence of error” is what Bayesian inference is good for. Also, keep in mind that if we had had a more informed prior the above graph would look different; for example we could use the above graph as a prior for the next time we come across a board of directors. In fact that’s exactly how this kind of inference is used: iteratively, as we travel forward through time collecting data. We typically want to start out with a prior that is pretty mild (like the uniform distribution above) so that we aren’t skewing the end results too much, and let the data speak for itself. In fact priors are typically of the form, “things should vary smoothly”; more on what that could possibly mean in a later post.
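To make the iterative idea concrete, here is a minimal sketch in plain Python (no plotting) that computes the posterior above and then reuses it as the prior for a second board. The second board (2 women and 8 men) is made-up data used purely for illustration, and the function name `update` is a hypothetical helper, not something from the scripts below.

```python
# Discrete Bayesian updating on the 11 candidate fractions 0.0, 0.1, ..., 1.0.
# The second board below is hypothetical data, used only to show how a
# posterior can be reused as the next prior.

fractions = [i / 10 for i in range(11)]
uniform_prior = [1 / 11] * 11


def update(prior, women, men):
    """Return the scaled posterior after seeing one board."""
    unscaled = [p * x**women * (1 - x)**men for p, x in zip(prior, fractions)]
    total = sum(unscaled)
    return [u / total for u in unscaled]


# First board, from the post: 3 women and 7 men.
posterior = update(uniform_prior, 3, 7)

# Hypothetical second board: 2 women and 8 men, with the old posterior as prior.
posterior2 = update(posterior, 2, 8)

for x, p1, p2 in zip(fractions, posterior, posterior2):
    print(f"{x:.1f}  {p1:.4f}  {p2:.4f}")
```

The first column of probabilities should match the scaled graph above; the second shows how the distribution tightens as more boards are observed.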

Here’s the python code I wrote to make these graphs:

#!/usr/bin/env python

from matplotlib.pylab import *
from numpy import *

# plot prior distribution:
figure()
bar(arange(0,1.1,0.1), array([1.0/11]*11), width = 0.1, label = "prior probability distribution")
xticks(arange(0,1.1,0.1) + 0.05, [str(x) for x in arange(0,1.1,0.1)] )
xlim(0, 1.1)
legend()
show()

# compute likelihoods for each of the 11 possible ratios of women:
likelihoods = []
for x in arange(0, 1.1, 0.1):
    likelihoods.append(x**3*(1-x)**7)

# plot unscaled posterior distribution (prior times likelihood):
figure()
bar(arange(0,1.1,0.1), array([1.0/11]*11)*array(likelihoods), width = 0.1, label = "unscaled posterior probability distribution")
xticks(arange(0,1.1,0.1) + 0.05, [str(x) for x in arange(0,1.1,0.1)] )
xlim(0, 1.1)
legend()
show()

# plot scaled posterior distribution (divide by the total so it sums to 1):
figure()
bar(arange(0,1.1,0.1), array([1.0/11]*11)*array(likelihoods)/sum(array([1.0/11]*11)*array(likelihoods)), width = 0.1, label = "scaled posterior probability distribution")
xticks(arange(0,1.1,0.1) + 0.05, [str(x) for x in arange(0,1.1,0.1)] )
xlim(0, 1.1)
legend()
show()

Here’s the R code that Daniel Krasner wrote for these graphs:

barplot( rep((1/11), 11), width = .1, col="blue", main = "prior probability distribution")

likelihoods = c()
for (x in seq(0, 1.0, by = .1)) {
  likelihoods = c(likelihoods, (x^3)*((1-x)^7))
}

# the uniform prior: 1/11 for each of the 11 possibilities
prior = rep((1/11), 11)

barplot(prior*likelihoods, width = .1, col="blue", main = "unscaled posterior probability distribution")

barplot(prior*likelihoods/sum(prior*likelihoods), width = .1, col="blue", main = "scaled posterior probability distribution")

Categories: data science, open source tools

Cora Sadosky

June 29, 2011 Cathy O'Neil, mathbabe 5 comments

I was looking through an old photo album (the kind where there are sticky pages and actual physical photos- it looks like an ancient technology now) and I came across one of my favorites of all time- a picture of me being embraced and supported by Cora Sadosky on one side and Barry Mazur on the other. This picture was taken in 1993 in Vancouver, where I received the Alice T. Schafer prize. It was a critical moment for me, and both of those people have influenced me profoundly. Barry became my thesis advisor; part of the reason I went into number theory was to become his student (the other part was this book).

Cora became my mathematical role model and spiritual mother. I already wrote earlier about how going to math camp when I was 14 changed my life and made me realize there is a whole community of math nerds out there and that I belonged to that nerd community. Well, Cora, whom I met when I was 21, was the person that made me realize there is a community of women mathematicians, and that I was also welcome to that world.

Actually it was something I didn’t even really want to know at the time. After all, I was happy to be a successful math undergraduate at UC Berkeley, frolicking in the graduate student lounge and partaking in tea every day at 3:00. Who cares that I was a woman? It seemed antiquated to me, almost crude, to mention my gender. When I got word that I’d won the prize, my reaction was essentially, “is there money?” (there was a bit).

And when I meet young women in math nowadays with that attitude, I am happy for them, really very happy for them. To live in that state of not caring what your gender is in mathematics is a kind of bliss, that lasts until the very moment it stops. My greatest wish for future generations of women in math is for that bliss to never stop.

And yet. I went to Vancouver and met Cora and learned about Alice Schafer and her struggles and successes as a trailblazer for women in math, and I felt really honored to be collecting an award in her name. And I felt honored to have met Cora, whose obvious passion for mathematics was absolutely awe-inspiring. She was the person who first explained to me that, as women mathematicians, we will keep growing, keep writing, and keep getting better at math as we grow older (unlike men who typically do their best work when they’re 29), and we absolutely have to maintain a purpose and a drive and fortitude for that highest call, the struggle of creation.

I kept up with Cora over the years. Every now and then she’d write to me and send me pushy little maternal notes reminding me to work hard and stay strong and productive. And I’d write to her with news of my life and my growing family and sometimes when I visited D.C. I’d meet her and we’d have lunch or dinner and talk about ideas and great books we’d read and how much we loved each other.

When I googled her this morning, I found out she’d died about 6 months ago. You can read about her difficult and inspiring mathematical career in this biography. It made me cry and made me think about how much the world needs role models like Cora.

Categories: women in math

Woohoo!

June 27, 2011 Cathy O'Neil, mathbabe 3 comments

First of all, I changed the theme of the blog, because I am getting really excellent comments from people but I thought it was too difficult to read the comments and to leave comments with the old theme. This way you can just click on the words “Go to comments” or “Leave a comment”, which is a bit more self-evident to design-ignorant people like me. Hope you like it.

Next, I had a bad day today, but I’m very happy to report that something has raised my spirits. Namely, Jake Porway from Data Without Borders and I have been corresponding, and I’ve offered to talk to prospective NGOs about data, what they should be collecting depending on what kind of studies they want to be able to perform, and how to store and revise data. It looks like it’s really going to happen!

In fact his exact words were: I will definitely reach out to you when we’re talking to NPOs / NGOs.

Oh, and by the way, he also says I can blog about our conversations together as well as my future conversations with those NGOs (as long as they’re cool with it), which will be super interesting.

Oh, yeah. Can I get a WOOHOO?!?

Categories: data science, open source tools

Better risk modeling: motivating transparency

June 27, 2011 Cathy O'Neil, mathbabe Comments off

In a previous post, I wrote about what I see as the cowardice and small-mindedness of the U.S. government and in particular the regulators for not demanding daily portfolios of all large investors. Of course this goes for the governments in Europe as well, and especially right now. The Economist had a good article this past Friday which attempted to quantify the results of a Greek default, but there were major holes, especially in the realm of “who owns the CDS contracts on Greek bonds, and how many are there?”. This fear of the unknown is a root cause of the current political wrangling which will probably end in a postponement of resolving the Greek situation; the question is whether the borrowed time will be used properly or squandered.

It’s ridiculous that nobody knows where the risk lies, but as a friend of mine pointed out to me last week at lunch, it probably won’t be enough to demand the portfolios daily, even if you had the perfect quantitative risk model available to you to plug them into. Why? Because if “transparency” is what the regulators demand, then “transparency” is what they would get – in the form of obfuscated lawyered-up holding lists.

In other words, let’s say a bank has a huge pile of mortgage-backed securities of dubious value on their books, but doesn’t want to accept losses on them. If they knew they’d have to start giving their portfolio to the SEC daily instead of quarterly, it would change the rules of the game. They’d have to hide these holdings by pure obfuscation rather than short-term month- or quarter-end legal finagling. So for example, they could invest in company A, which invests in company B, which happens to have a bunch of mortgage-backed securities of dubious value, but which is too small to fall under the “daily reporting” rules. This is just an example but is probably an accurate portrayal of the kind of thing that would happen with enough lead time and enough lawyers.

What we actually want is to set up a system whereby banks and hedge funds are motivated to be transparent. Read this as: will lose money if they aren’t transparent, because that’s the only motivation that they respond to.

In some sense, as my friend reminded me, we don’t need to worry about hedge funds as much as about banks. This is because hedge funds do their trades through brokerages, which force margin calls on trades that they deem risky. In other words, they pay for their risk through margins on a trade-by-trade, daily basis. If you are thinking, “wait, what about LTCM? Isn’t that a hedge fund that got away with murder and almost blew up the system and didn’t seem to have large margins in place?” then the answer is, “yeah but brokers don’t get fooled (as much) by hedge funds anymore”. In other words, brokers, who are major players in the financial game, are the policemen of hedge funds.

There are two major limits to the above argument. Firstly, hedge funds purposefully use multiple brokers simultaneously so that nobody knows their entire book, so to the extent that the risk of a portfolio isn’t additive (it isn’t), this policing method isn’t complete. Secondly, it only addresses a local kind of risk: it doesn’t clarify risk given a catastrophic event (like a Greek default), but rather a more work-a-day “normal circumstances” market risk.

Even so, what about the banks? Are there any brokers measuring the risk of their activities and investments? Since the banks are the brokers, we have to look elsewhere… I guess that would have to be at the government, and the regulators themselves, maybe the FDIC… in any case, people decidedly not players in the financial game, not motivated by pay-off, and therefore not prone to delving into the asperger-inspiring details of complicated structured products to search out lies or liberal estimates.

The goal then is to create a new kind of market which allows insiders to bet on the validity of banks’ portfolios. You may be saying, “hey isn’t that just the stock price of the bank itself?”, and to answer that I’d refer you to this article which does a good job explaining how little information and power is actually being exercised by stockholders.

I will follow up this post with another more technical one where I will attempt to describe the new market and how it could (possibly, hopefully) function to motivate transparency of banks. But in the meantime, feel free to make suggestions!

Categories: finance, hedge funds, news

Women on S&P500 boards of directors

June 26, 2011 Cathy O'Neil, mathbabe 6 comments

This is a co-post with FogOfWar.

Here’s an interesting article about how many boards of directors for S&P500 companies consist entirely of men. Turns out it’s 47. Well, but we’d expect there to be some number of boards (out of 500) which consist entirely of men even if half of the overall set of board members are women. So the natural question arises, what is the most likely actual proportion of women given this number 47 out of 500?

In fact we know that many people are on multiple boards but for the sake of this discussion let’s assume that there’s a line of board seekers standing outside waiting to get in, and that we will randomly distribute them to boards as they walk inside, and we are wondering how many of them are women given that we end up with 47 all-men boards out of 500. Also, let’s assume there are 8 slots per board, which is of course a guess but we can see how robust that guess is by changing it at the end.

By the way, I can think of two arguments as to why the simplifying assumption that nobody is on multiple boards might skew the results. On the one hand, we all know it’s an old boys network so there are a bunch of connections that a few select men enjoy which puts them on a bunch of boards, which probably means the average number of boards that a man is on, who is on at least one board, is pretty large. On the other hand, it’s also well-known that, in order to seem like you’re diverse and modern, companies are trying to get at least a token woman on their board, and for some reason consider the task of finding a qualified woman really difficult. Thus I imagine it’s quite likely that once a woman has been invited to be on a board, and she’s magically dubbed “qualified,” then approximately 200 other boards will immediately invite that same woman to be on their board (“Oh my god, they’ve actually found a qualified woman!”). In other words I imagine that the average number of boards a given woman is on, assuming she’s on one, is probably even higher than for men, so our simplifying assumptions will in the end be overestimating the number of women on boards. But this is just a guess.

Now that I’ve written that argument down, I realize another reason our calculation below will be overestimating women is this concept of tokenism- once a board has one woman they may think their job is done, so to speak, in the diversity department. I’m wishing I could really get my hands on the sizes and composition of each board and see how many of them have exactly one woman (and compare that to what you’d expect with random placement). This could potentially prove (in the sense of providing statistically significant evidence for) a culture of tokenism. If anyone reading this knows how to get their hands on that data, please write!
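The comparison described above can be sketched as follows. Under purely random placement, the number of women on a board is binomial, so the expected count of boards with exactly one woman is easy to compute and could then be set against real data. The 25% fraction of women and the 8 seats per board are placeholder assumptions, not data, and the function names are hypothetical.

```python
from math import comb


def prob_k_women(k, seats, p):
    """Binomial probability of exactly k women on a board with `seats` seats."""
    return comb(seats, k) * p**k * (1 - p)**(seats - k)


def expected_boards(k, seats, p, n_boards=500):
    """Expected number of boards (out of n_boards) with exactly k women."""
    return n_boards * prob_k_women(k, seats, p)


# Placeholder assumptions: 8 seats per board, 25% women in the pool.
seats, p = 8, 0.25
for k in range(3):
    print(k, round(expected_boards(k, seats, p), 1))
```

If the observed number of one-woman boards were far above the expected count for any plausible fraction of women, that would be the kind of evidence for tokenism the paragraph has in mind.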

Now to the calculation. Assuming, once more, that each board member is on exactly one board and that there are 8 people (randomly distributed) per board, what is the most likely percentage of overall women given that we are seeing 47 all-male boards out of 500? This boils down to a biased coin problem (with the two sides labeled “F” and “M” for female and male) where we are looking for the bias. For each board we flip the coin 8 times and see how many “F”s we get and how many “M”s we get and that gives us our board.

First, what would the expected number of all-male boards be if the coin is unbiased? Since expectation is additive and we are modeling the boards as independent, we just need to figure out the probability that one board is all-male and multiply by 500. But for an unbiased coin that boils down to (1/2)^8 = 0.39%, so after multiplying by 500 we get 1.95, in other words we’d expect 2 all-male boards. So the numbers are definitely telling us that we should not be expecting 50% women. What is the most likely number of women then? In this case we work backwards: we know the answer is 47, so divide that by 500 to get 0.094, and now find the probability p of the biased coin landing on F so that all-maleness has probability 0.094. This is another way of saying that (1-p)^8 = 0.094, or that 1-p is 0.744, the eighth root of 0.094. So our best guess is p = 25.6%. Here’s a table with other numbers depending on the assumed size of the boards:
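Since the estimate depends on the guessed board size, here is a small sketch that redoes the calculation above for several board sizes. The range of sizes (6 through 14) is an arbitrary choice for illustration and the function name is hypothetical; only the 47-out-of-500 figure and the size-8 case come from the text.

```python
def estimated_fraction_women(all_male_boards, total_boards, seats):
    """Solve (1 - p)**seats = all_male_boards / total_boards for p."""
    prob_all_male = all_male_boards / total_boards
    return 1 - prob_all_male ** (1 / seats)


# Sanity check for an unbiased coin: expected all-male boards out of 500.
expected_if_half = 500 * 0.5 ** 8

print(f"expected all-male boards at p = 0.5: {expected_if_half:.2f}")
print("seats  estimated fraction of women")
for seats in range(6, 15, 2):
    p = estimated_fraction_women(47, 500, seats)
    print(f"{seats:5d}  {p:.1%}")
```

The larger the assumed board, the smaller the fraction of women needed to explain 47 all-male boards, so the 8-seat guess matters quite a bit.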

If anyone reading this has a good sense of the distribution of the size of boards for the S&P500, please write or comment, so I can improve our estimates.

Categories: data science, finance, FogOfWar

Step 0 Revisited: Doing it in R

June 25, 2011 Cathy O'Neil, mathbabe Comments off

A nerd friend of mine kindly rewrote my python scripts in R and produced similar looking graphs. I downloaded R from here and one thing that’s cool is that once it’s installed, if you open an R source code (ending with “.R”), an R console pops up automatically and you can just start working. Here’s the code:

gdata <- read.csv('large_data_glucose.csv', header=TRUE)

#We can open a spreadsheet type editor to check out and edit the data:
edit(gdata)
#Since we are interested in the glucose sensor data, column 31, but the name is a bit awkward to deal with, a good thing to do is to change it:
colnames(gdata)[31] <- "GSensor"

#Let's plot the glucose sensor data:
plot(gdata$GSensor, col="darkblue")

#Here's a histogram plot:
hist(gdata$GSensor, breaks=100, col="darkblue")

#and now let's plot the logarithm of the data:
hist(log(gdata$GSensor), breaks=100, col="darkblue")

And here are the plots:

Sensor_Glucose_plot

Sensor_Glucose_histogram

Log_Sensor_Glucose_histogram

One thing my friend mentions is that R automatically skips missing values (whereas we had to deal with them directly in python). He also mentions that other things can be done in this situation, and to learn more we should check out this site.

R seems to be really good at this kind of thing, that is to say doing the first thing you can think about with data. I am wondering how it compares to python when you have to really start cleaning and processing the data before plotting. We shall see!

Categories: data science, open source tools

Working with Larry Summers (part 2)

June 24, 2011 Cathy O'Neil, mathbabe 19 comments

This is the second part of a description of my experiences working at D.E. Shaw, which was started here and continues here.

I want to describe the culture of working at D.E. Shaw during the credit crisis, so from June 2007 to June 2009, because I think it’s emblematic of something that most news articles and books written about hedge funds really miss out on when they fixate on the average I.Q. of the people working there, which is in the end a distraction and nothing more, or the bizarre or quirky personalities that exist there, which is only idiosyncratic and doesn’t explain anything deeply.

I promised myself I’d put focus on the following phrase, which struck me down when I first heard it used and still makes me shake my head, namely the concept of “dumb money.” The phrase was tossed around constantly and cleverly, and really, to understand what it means inside the context of the hedge fund culture, is to understand the culture. So I’ll try to explain it. First a bit of context.

Most of the quants at D.E. Shaw were immigrant men. In fact I was the only woman quant when I joined, and there were quite a few quants, maybe 50, and I was also one of the only Americans. What nearly all these men had in common was a kind of constant, nervous hunger, almost like a daily fear that they wouldn’t have enough to eat. At first I thought of them as having a serious chip on their shoulder, like they were the kind of guys who didn’t make the football team in high school and were still trying to get over that. And I still think there’s an element of something as simple as that, but it goes deeper. One of my colleagues from Eastern Europe said to me once, “Cathy, my grandparents were coal miners. I don’t want my kids to be coal miners. I don’t want my grandchildren to be coal miners. I don’t want anybody in my family to ever be a coal miner again.” So, what, you’re going to amass enough money so that no descendant of yours ever needs to get a job? Something like that.

But here’s the thing, that fear was real to him. It was that earnest, heartfelt anxiety that convinced me that I was really different from these guys. The difference was that, firstly, they were acting as if a famine was imminent, and they’d need to scrounge up food or starve to death, and secondly, that only their nuclear family was worth saving. This is where I really lost them. I mean, I get the idea of acts of desperation to survive, but I don’t get how you choose who to save and who to let die. However, it was this kind of us-against-them mentality that prevailed and informed the approach to making money.

Once you understand the mentality, it’s easier to understand the “dumb money” phrase. It simply means, we are smarter than those idiots, let’s use our intelligence to anticipate dumb people’s trades and take their money. It is our right as intelligent, imminently starving people to do this. Chasing dumb money can take various forms, but is generally aimed at anticipating lazy fund managers: if you know that they always wait until Friday afternoon to balance their books, or that they wait until the end of the month, or that they are required to buy certain kinds of things, you can anticipate their trades, make them yourself a bit before they do, thereby forcing them to pay more, and getting a nice little profit for yourself. In short this works in general, since statistically speaking the anticipated trade wasn’t driving up the intrinsic value of the underlying, but rather was being affected by trade impact for a short amount of time. If we can anticipate big trades by lots of dumb money, then the short-term market impact will be large enough and last long enough to buy in beforehand and sell at the top, while it still lasts, assuming there’s sufficient liquidity. The subtext of taking dumb money, going back to the football team issue, is: if we don’t, somebody else will, and then we will feel like fools for not doing it ourselves.

To tell you the truth, I was completely naive when I went to work there. I had kind of accepted the job because I wanted to be a business woman, wanted a brisk pace after the agonizing slowness of academics, and I had really no moral judgment on the concept of a hedge fund; I thought it was morally neutral, at worst a scavenger on the financial system, like a market maker or someone who provides insurance for something. Well I’ve decided it’s more like a leech.

Getting to the part about actually working with Larry Summers. I did work on a couple of his ideas, although in order not to get sued I can’t be detailed about what his ideas were. And I had various meetings with him and a bunch of managing directors. One thing I remember about these meetings was the eerie way the managing directors seemed intimidated by him, even though behind his back they kind of scoffed at the possibility that he could actually offer good modeling ideas. It was basically a publicity stunt, or at least rumored to be, to have him work there. It was after he had gotten pushed out of the Presidency at Harvard for talking out of his ass about women in math, and yes it was a bit surreal to be the only woman quant in the place, and to be working on his project considering that. Since I am pretty much never intimidated for some reason, I had no problem. He kept on grilling me about various things to try and I kept explaining what I’d done and how I’d already thought of that. It was fine, pretty combative and pushy, but actually kind of fun. I really have nothing to say about him treating me differently because I was a woman.

But when I think about that last project I was working on, I still get kind of sick to my stomach. It was essentially, and I need to be vague here, a way of collecting dumb money from pension funds. There’s no real way to make that moral, or even morally neutral. There’s no way to see that as scavenging on the marketplace. Nope, that’s just plain chasing after dumb money, and I needed to quit. I still don’t know if that model went into production.

Categories: finance, hedge funds
