Cathy O'Neil, mathbabe

Update on organic food

August 19, 2012 Cathy O'Neil, mathbabe 5 comments

So I’m back from some town in North Ontario (please watch this video to get an idea). I spent four days on a tiny little island on Lake Huron with my family and some wonderful friends, swimming, boating, picnicking, and reading the Omnivore’s Dilemma by Michael Pollan whenever I could.

It was a really beautiful place but really far away, especially since my husband jumped gleefully into the water from a high rock with his glasses on so I had to drive all the way back without help. But what I wanted to mention to you is that, happily, I managed to finish the whole book – a victory considering the distractions.

I was told to read the book by a bunch of people who read my previous post on organic food and why I don’t totally get it: see the post here and be sure to read the comments.

One thing I have to give Pollan, he has written a book that lots of people read. I took notes on his approach and style because I want to write a book myself. And it’s not that I read statistics on the book sales – I know people read the book because, even though I hadn’t, lots of facts and passages were eerily familiar to me, which means people I know have quoted the book to me. That’s serious!

In other words, there’s been feedback from this book to the culture and how we think about organic food vs. industrial farming. I can’t very well argue that I already knew most of the stuff in the book, even though I did, because I probably only know it because he wrote the book on it and it’s become part of our cultural understanding.

In terms of the content, first I’ll complain, then I’ll compliment.

Complaint #1: the guy is a major food snob (one might even say douche). He spends like four months putting together a single “hunting and gathering” meal with the help of his friends the Chez Panisse chefs. It’s kind of like a “lives of the rich and famous” episode in that section of the book, which is to say voyeuristic, painfully smug, and self-absorbed. It’s hard to find this guy wise when he’s being so precious.

Complaint #2: a related issue, which is that he never does the math on whether a given lifestyle is actually accessible for the average person. He mentions that the locally grown food is more expensive, but he also suggests that poor people now spend less of their income on food than they used to, implying that maybe they have extra cash on hand to buy local free-range chickens, not to mention that they’d need the time and a car and gas to drive to the local farms to buy this stuff (which somehow doesn’t seem to figure into his carbon footprint calculation of that lifestyle). I don’t think there’s all that much extra time and money on people’s hands these days, considering how many people are now living on food stamps (I will grant that he wrote this book before the credit crisis so he didn’t anticipate that).

Complaint #3: he doesn’t actually give a suggestion for what to do about this to the average person. In the end this book creates a way for well-to-do people to feel smug about their food choices but doesn’t forge a path otherwise, besides a vague idea that not eating processed food would be good. I know I’m asking a lot, but specific and achievable suggestions would have been nice. Here’s where my readers can say I missed something – please comment!

Compliment #1: he really educates the reader on how much the government farm subsidies distort the market, especially for corn, and how the real winners are the huge businesses like ConAgra and Monsanto, not the farmers themselves.

Compliment #2: he also explains the nastiness of processed food and large-scale cow, pig, and chicken farms. Yuck.

Compliment #3: My favorite part is that he describes the underlying model of the food industry as overly simplistic. He points out that, by just focusing on the chemicals like nitrogen and carbon in the soil, we have ignored all sorts of other things that are also important to a thriving ecosystem. So, he explains, simply adding nitrogen to the soil in the form of fertilizer doesn’t actually solve the problem of growing things quickly. Well, it does do that, but it introduces other problems like pollution.

This is a general problem with models: they almost by definition simplify the world, but if they are successful, they get hugely scaled, and then the things they ignore, and the problems that arise from that ignorance, are amplified. There’s a feedback loop filled with increasingly devastating externalities. In the case of farming, the externalities take the form of pollution, unsustainable use of petrochemicals, sick cows and chickens, and nasty food-like items made from corn by-products.

Another example is teacher value-added models: the model is bad, it is becoming massively scaled, and the externalities are potentially disastrous (teaching to the test, the best teachers leaving the system, enormous amount of time and money spent on the test industry, etc.).

But that begs the question, what should we do about it? Should we well-to-do people object to the existence of the model and send our kids to the private schools where the teachers aren’t subject to that model? Or should we acknowledge it exists, it isn’t going away, and it needs to be improved?

It’s a similar question for the food system and the farming model: do we save ourselves and our family, because we can, or do we confront the industry and force them to improve their models?

I say we do both! Let’s not ignore our obligation to agitate for better farming practices for the enormous industry that already exists and isn’t going away. I don’t think the appropriate way to behave is to hole up with your immediate family and make sure your kids are eating wholesome food. That’s too small and insular! It’s important to think of ways to fight back against the system itself if we believe it’s corrupt and is ruining our environment.

For me that means being part of Occupy, joining movements and organizations fighting against lobbyist power (here’s one that fights against BigFood lobbyists), and broadly educating people about statistics and mathematical modeling so that modeling flaws and externalities are understood, discussed, and minimized.

Categories: data science, math education, musing

Away for a week – will miss you

August 12, 2012 Cathy O'Neil, mathbabe 2 comments

Like all good New Yorkers, I’m going away for a week’s vacation in August. I’ll be on a tiny island on Lake Ontario with no internet connection. I’ll miss you guys! See you in a week!

Categories: musing

Subway etiquette: applying makeup on the 1 train

August 11, 2012 Cathy O'Neil, mathbabe 29 comments

I’m a huge fan of public transportation, mostly subways. I use the New York City subway system on average three times a day, especially now that I’m not working. And I like to observe people on the subway, and the sometimes strange etiquette that you see there.

Specifically, I am interested in how people break what I call the two cardinal rules of public transportation:

No eye contact or conversations with people you didn’t get on the subway with. Exceptions when, as described here, somebody incredibly smelly or incredibly sick leaves, or the subway gets irretrievably stuck in the tunnel.
No doing anything weird, even by yourself, to attract undue attention. Things like reading, playing games on your phone, and pretending to sleep are OK, things like eating smelly food or humming or whistling: not okay.

Most people who break these rules I get – they are trying to get you to give them money, or they’re slightly to totally insane, or both. Fair enough, that’s part of the fabric of life in a big city.

But there’s one category of people I just don’t get, namely the women who put outlandish amounts of makeup on while sitting on the subway.

I’m not talking about a dab of lipstick, which seems fine and comparable to chapstick or something. I’m talking about the women who come with a complete set of foundation, eyeliners, mascara, the works. They sit there peering intensely into their tiny mirrors, creating a new persona, utterly absorbed in their transformation, and completely oblivious to the mesmerizing effect that it has on everyone.

Or maybe not, maybe it’s performance art – sometimes I think so. Or perhaps they are actually insane in a small way.

Because otherwise it seems like a contradiction in terms to me. From my perspective, wearing that much makeup usually indicates a willingness to conform at the highest level (these are usually young women, so the idea that they are actually in need of makeup to cover sun spots or wrinkles does not apply), but then the willingness to break the second cardinal rule of subway riding seems to be in direct conflict with that religion of conformism.

For example, whenever I see one of these 25-year-old foundation appliers, I’m wondering, who are you becoming? From whom are you hiding your real face? If it’s your coworkers, what if one of them is on this train right now? Then they’d see the real you in the before shot, at the beginning of your ride. Wouldn’t that defeat the purpose of the makeup? Isn’t that too large a risk to take for you?

Since I don’t wear makeup myself, I’m also wondering if I’m just not understanding the goal of that much makeup. Maybe if I understood more deeply why women wear these masks, I’d also understand why they’re willing to apply them in front of a crowd of strangers.

Categories: musing

Datadive weekend with DataKind September 7-9

August 10, 2012 Cathy O'Neil, mathbabe 1 comment

I’ll be a data ambassador at an upcoming DataKind weekend, working with a team on New York City open government data.

DataKind, formerly known as Data Without Borders, is a very cool, not at all creepy organization that brings together data nerds with typically underfunded NGOs in various ways, including datadive weekends, which are like hack-a-thons for data nerds.

I have blogged a few times about working with them, because I’ve done this before working with the NYCLU on stop-and-frisk data (check out my update here as well). By the way, stop-and-frisk events have gone down 34% in recent months. I feel pretty good about being even tangentially involved in that fact.

This time we’re working with mostly New York City parks data, so stuff like trees and storm safety and 311 calls.

The event starts on Friday, September 7th, with an introduction to the data and the questions and some drinks, then it’s pretty much all day Saturday, til midnight, and then there are presentations Sunday morning (September 9th). It’s always great to meet fellow nerds, exchange technical information and gossip, and build something cool together.

Registration is here, sign up quick!

Categories: data science, open source tools

Looterism

August 9, 2012 Cathy O'Neil, mathbabe 16 comments

My friend Nik recently sent me a PandoDaily article written by Francisco Dao entitled Looterism: The Cancerous Ethos That Is Gutting America.

He defines looterism as the “deification of pure greed” and says:

The danger of looterism, of focusing only on maximizing self interest above the importance of creating value, is that it incentivizes the extraction of wealth without regard to the creation or replenishment of the value building mechanism.

I like the term, I think I’ll use it. And it made me think of this recent Bloomberg article about private equity and hedge funds getting into the public schools space. From the article:

Indeed, investors of all stripes are beginning to sense big profit potential in public education.

The K-12 market is tantalizingly huge: The U.S. spends more than $500 billion a year to educate kids from ages five through 18. The entire education sector, including college and mid-career training, represents nearly 9 percent of U.S. gross domestic product, more than the energy or technology sectors.

Traditionally, public education has been a tough market for private firms to break into — fraught with politics, tangled in bureaucracy and fragmented into tens of thousands of individual schools and school districts from coast to coast.

Now investors are signaling optimism that a golden moment has arrived. They’re pouring private equity and venture capital into scores of companies that aim to profit by taking over broad swaths of public education.

The conference last week at the University Club, billed as a how-to on “private equity investing in for-profit education companies,” drew a full house of about 100.

[I think I know why that golden moment arrived, by the way. The obsession with test scores, a direct result of No Child Left Behind, is both pseudo-quantitative (by which I mean it is quantitative but is only measuring certain critical things and entirely misses other critical things) and has broken the backs of unions. Hedge funds and PE firms love quantitative things, and they don’t really care if the numbers are meaningful if they can meaningfully profit.]

Their immediate goal is out-sourcing: they want to create the Blackwater (now Academi) of education, but with cute names like Schoology and DreamBox.

Lest you worry that their focus will be on the wrong things, they point out that if you make kids drill math through DreamBox “heavily” for 16 weeks, they score 2.3 points higher in a standardized test, although they didn’t say if that was out of 800 or 20. Never mind that “heavily” also isn’t defined, but it seems safe to say from context that it’s at least 2 hours a day. So if you do that for 16 weeks, those 2.3 points better be pretty meaningful.

So either the private equity guys and hedge funders have the whole child in mind here, or it’s maybe looterism. I’m thinking looterism.

Categories: finance, hedge funds, math education, rant

High frequency trading: does it hurt the little guy?

August 8, 2012 Cathy O'Neil, mathbabe 11 comments

I’ve already written about high frequency trading here, and I came out in favor of a transaction tax to slow that shit down a little bit. After all, the argument that liquidity is good so more liquidity is better only holds to a point – we don’t need infinite liquidity. It makes sense to actually have a small barrier to trade – you actually have to think it’s a good idea one way or another, otherwise you have no incentive not to do something dumb.

And as we’ve seen recently with Knight Capital, dumb things definitely are likely to happen.

It’s been interesting to see the media reaction. On the one hand, the Room for Debate over at the New York Times has a bunch of people discussing high frequency trading (HFT), and the most pro-HFT guy essentially says that the SEC should keep up technology-wise with these guys, and everything will be ok. That’s called living in a fantasy world.

More interesting to me was Felix Salmon’s post yesterday, where he rightly complained that, all too often, journalists dumb down and simplify reporting on these things, and then he proceeds to dumb down and simplify reporting on this thing.

Specifically, he complains that no “little guys” were hurt in Knight’s crash, even though the press is always looking for the little guy that gets hurt. [Side note: he also complains about the LIBOR manipulation not hurting municipalities, which is false, it did hurt them. He needs to understand that better before he dismisses it.]

But, if I’m not dreaming, Fidelity was one of the large customers of Knight that’s pulled out, and if I’m not unconscious, Fidelity manages quite a few of my many 401K accounts, as well as a huge proportion of the 401K accounts in this country. So it’s quite possible that my retirement money was part of that massive screw-up which is now owned by Goldman Sachs, not that I’ve been notified by Fidelity of any harm (but that’s another post).

As for small investors vs. little guys, there’s a difference. If you have enough money that you’re investing it through brokers, I personally don’t count you as small, even if you appear small to Goldman Sachs. So I’m not interested in whether the small investor was all that harmed by Knight’s meltdown, but I’m pretty sure the small investor was scared away by it.

But looking at the larger picture, I’d definitely say this is an indication of the outrageous complexity of the financial system, which most definitely is hurting the little guy, i.e. the taxpayer. This complexity is why we have the government guarantee in place, the Too-Big-and-Too-Complex-To-Fail banks and markets, and the little guy on the hook when things melt down. Moreover, there’s a direct line from that whole mess to the destruction of unions and pension programs, even if people don’t want to draw it.

So if you want to be myopic you can say that this was one firm, making one major blunder, and it’s self-contained and that firm is failing just like it should. But if you take a step back you see they were doing this as part of a larger culture of competition for speed and technology that they are so focused on, they threw risk to the wind in order to achieve a tiny edge over NYSE.

That laser focus on having a tiny edge really is the underlying story, and will continue to be, at the expense of risk, at the expense of our retirement funds trading for us, without regard to unnecessary complexity or, yes, the little guy, until our politicians and regulators grow some balls and put an end to it.

Categories: finance, news, rant

I love whistleblowers

August 7, 2012 Cathy O'Neil, mathbabe 11 comments

There’s something people don’t like about whistleblowers. I really don’t get it, but I know it’s true (I’m looking at you, Obama).

In particular, I hear all the time that you’re giving up on your career if you’re a whistleblower, that nobody would ever want to hire you again. But if I’m running a company, which I presumably want to have run well, without corruption, and be successful, then I’m totally fine with whistleblowers! They will tell me the truth and expose fraud. To say out loud that I don’t want to hire someone like that is basically admitting I’m okay with fraud, no?

I’m really missing something, and if you have an explanation I’d love to hear it.

In the meantime, though, I’ll say this: the web is great for anonymous whistleblowing (if anyone pays attention and follows up). Science Fraud is a great one that tells about scientific publishing fraud in the life sciences – see the “About” page of Science Fraud for more color. See also Retraction Watch for a broader look.

But then there’s another issue, which is that some people won’t seriously consider whistleblowers unless they identify themselves! What up? Facts are facts – if someone has given good evidence that can be checked independently, why should they also submit themselves to being blacklisted for their efforts?

Here’s a good response to this crappy line of reasoning against anonymous whistleblowing by the Retraction Watch guys Ivan Oransky and Adam Marcus.

Categories: news, open source tools, rant

What is a proof?

August 6, 2012 Cathy O'Neil, mathbabe 35 comments

I recently described (here) a proof to be a convincing argument of why you think something is true. I’ll stick to that definition in spite of a few commenters who want there to be axioms or postulates, because I really don’t think that’s what happens in real life (which is a good thing! It would be an incredibly boring life!). Since I’m a utilitarian, I only care about and only want to discuss what actually happens.

The above definition immediately begs the question, convincing to whom? Can a proof to someone be a non-proof to someone else? Absolutely, proofs are entirely context-driven. If I’m trying to prove something to you and you remain unconvinced, then it is no proof, even if I’ve used the same argument before successfully.

This brings me to my first main point, which is that it is the responsibility of the person proving something to convince his or her audience that it’s true. Likewise, it is the responsibility of the audience to remain skeptical (but attentive) and be open to being convinced or to finding a flaw in the argument.

Things get trickier when it’s not a live interaction, but when things are written down, like in published articles. On the one hand, written proofs give the audience more time to understand the reasoning and to come up with problems, but on the other hand there’s no opportunity to say “I just don’t get what you’re talking about,” which is the feeling one typically has at least 85% of the time.

In an ideal world, those who write proofs understand the goal to be that the reader should be able to understand the argument, and thus make the arguments coherent and understandable to their “typical reader.” Who is this typical reader? Someone who is probably relatively fluent in the basic objects of the field, say, but hasn’t recently thought about this problem.

Now that I’ve described the ideal situation, I’ll rant for a bit about how people game this system. There are two things that creep into the system that give rise to its gaming, and those two things are status and credit. People like to be high status (and like to signal high status even more), and of course people like to take credit.

First, status. It turns out that people often really want to explain their reasoning not to the typical audience, but to the expert audience. So they don’t give sufficient context, and they are lazy reasoners, because the experts can be expected to understand how to fill in the details.

It’s not only insecure young mathematicians that are guilty of this – there are plenty of experts who themselves fall prey to this habit (thus the signaling). I think it’s driven by a combination of feeling kind of smug and smart when people who are trying to follow your conversation leave because they’re exhausted and confused (and possibly ashamed), and the echo chamber that remains after people who don’t get it (or who admit to not getting it) leave. Whatever the reason, there are plenty of experts who get less and less understandable over time, in person and in print.

The other side of this status play is those experts get away with it. The papers written by these people are often accepted in spite of the fact that they are nearly unreadable to all but the 5 people in their field for whom they have been written, since after all, these guys are experts.

But does this approach constitute a proof? I claim it doesn’t, not if I have to be one of 5 people to read and understand it. The writer has choked, bigtime, on his or her responsibility to convince the reader.

Second, the credit thing. People want to get credit for proving things, because that’s how they get high status. But they don’t always want to prove everything they claim, because it’s hard work. So sometimes you see people proving something and then claiming an even more general thing is true, and giving a “sketch of a proof” for that more general thing (this is one example where “sketches” come up, but actually there are plenty of them).

Let’s examine that concept for a moment, the “sketch of a proof.” Usually this implies that the basic outline is there, but many details of how to rely on so-and-so’s theorem or what’s-his-name’s method are left out. It’s a proof lying in the shadows, and we’ve only seen it highlighted every few feet or so to wend our way through it.

Is a sketch a proof? No, it’s not. Best case scenario, it would take a typical reader a few minutes, maybe up to two hours, say, to turn that sketch into a proof.

But what if the typical reader can’t do it in two hours?

The problem with the concept of a sketch of a proof is that it’s too difficult to refute. If I am a reader and I say, “this is a false sketch” then I could just be opening myself up to people who tell me I didn’t spend my two hours wisely, or that I’m not good enough to complain about it. They may even expect me to prove that that method cannot be used to prove that result.

But that’s bullshit! As far as I’m concerned, if you claim to have sketched a proof, and if I’ve tried to prove it using your notes and I’ve failed, then that’s your fault, not mine. It’s your responsibility to prove it to me, and you haven’t.

Conclusion: let’s all remember when you claim a result, you are claiming credit, and it’s your responsibility to convince the audience it’s true – not just 5 experts. And second, if you aren’t willing to actually prove something, don’t claim it as a result. Instead, say something like, “this may generalize using so-and-so’s theorem or what’s-his-name’s method….”. Consider it a gift to the next person who reads your paper and wants to prove something new.

Categories: math, rant

Bailout, the book

August 5, 2012 Cathy O'Neil, mathbabe 10 comments

You know that feeling, where you feel like a conspiracy theorist because, even though you don’t have cold hard evidence for it, you have a distinct feeling that someone is trying to thwart you even though they claim to be your friend, or thwart an idea they claim to believe in, or even worse, thwart a principle they claim to stand by?

That’s how I was feeling about Tim Geithner, and frankly the entire Obama administration, until I read “Bailout,” the recently published tell-all book by Neil Barofsky, who was put in charge of detecting and preventing fraud related to TARP.

I recently blogged about how I consider this book a call to Occupy, but I had only read the excerpt from Bloomberg at that point. Now that I’ve read the book, it’s most definitely a call to Occupy, as well as to any group or individual who still has principles and enough energy to summon up outrage.

Going back to the feeling of being a conspiracy theorist.

Nothing in this book was really new to me or really surprised me, except the fact that Barofsky was willing to write it down in black and white. Thank goodness there are still a few people who still have principles, even inside Washington.

Everything there was something I’d pieced together either working in finance, where I lost faith in the Obama administration right away when it introduced HAMP, which was clearly set up to fail homeowners, or by meeting people in the Alternative Banking group of #OWS, specifically Yves Smith, who explained the technical details of the more recent mortgage settlement, and how it is a backdoor bailout to the banks. Yet another one!

Where was the corresponding bailout for the people? Why this doublespeak, where we’d talk about moral hazard for people who have been screwed by the predatory loan industry, but the moral hazard for AIG executives getting multi-million dollar bonuses after an $85 billion bailout is just something we have to swallow, out of deference to the sanctity of contract?

And if we care so much about contracts, why do we allow companies to enter bankruptcy just to jettison pension promises but we don’t allow individuals (who are not too-big-to-fail) to renegotiate crippling student debt loads?

I’m confused no longer. It was never Geithner’s intention, or Obama’s intention, to help out the people. It has always been their intention solely to prop up a failed banking system. What they’ve been doing, rather than saying, is much more consistent with this theory anyway. There have been lots of roundabout efforts to explain why the mortgage modification system they set up to help homeowners was completely ineffective; it’s because it was actually set up to slow down foreclosures in order to “foam the runway” for banks to get back into the black. That makes much more sense!

It actually restores my faith in the Obama administration a bit. Before this I was sometimes torn between thinking they were bought by the banks or they were utterly incompetent. But now I know they aren’t entirely incompetent in the follow-through with their goals: they actually did succeed in slavishly working for the banks in the name of helping out homeowners.

Thank you, Neil Barofsky, for a great book. Thank you for maintaining your justified anger and for being courageous enough, and enough of a dick, to write it.

Categories: #OWS, finance

Le Monde article (#OWS)

August 4, 2012 Cathy O'Neil, mathbabe 1 comment

Categories: #OWS, finance, news

Why the internet is creepy

August 3, 2012 Cathy O'Neil, mathbabe 11 comments

Recently I’ve been seeing various articles and opinion pieces that say that Facebook should pay its users to use it, or give a cut of the proceeds when they sell personal data, or something along those lines.

This strikes me as naive to a surprising degree; it means people really don’t understand how web businesses work. How can people simultaneously complain that Facebook isn’t a viable business and that they don’t pay their users for their data?

People have gotten used to getting free services, and they assume that infrastructure somehow just exists, and they want to have that infrastructure, and use it, and never see ads and never have their data used, or get paid whenever someone uses their data.

But you can’t have all of that at the same time!

These companies need to monetize somehow, and instead of asking users for money directly, which isn’t the current culture, they get creepy with data. The fact that there are basically no rules about personal information (aside from some medical information) means that the creepiness limit is extreme, and possibly hasn’t been reached yet.

What are the alternatives? I can think of a few, none of them particularly wonderful:

Legislate privacy laws to make personal data sharing or storing illegal without explicit consent for each use (right now you just sign away all your rights at once when you sign up for the service, but that could and probably should change). This would kill the internet as we know it. In the short term the consequences would be extreme. Besides the fact that some people would save and use data illegally, which would be very hard to track and to stop, places like Twitter, Facebook, and Google would have no revenue model. An interesting thought experiment on what would happen after this.
Make people pay for services, either through micro-payments or subscription services like Netflix. This would maybe work, but only for people with credit cards and money to spare. So it would also change access to the internet, and not in a good way.
Wikipedia-style donation-based services. This is clearly a tough model, and they always seem to be on the edge of solvency.
Get the government to provide these services as meaningful infrastructure for society, like highways. Imagine what Google Government would be like.
Some combination of the above.

Am I missing something?

Categories: data science, internet startup, rant

VAM shouldn’t be used for tenure

August 2, 2012 Cathy O'Neil, mathbabe 2 comments

I recently read a New York Times “Room for Debate” discussion on the teacher Value-added model (VAM) and whether it’s fair.

I’ve blogged a few times about this model and I think it’s crap (see this prior post which is entitled “The Value Added Model Sucks” for example).

One thing I noticed about the room for debate is that the two most pro-VAM talking heads (this guy and this guy) both quoted the same paper, written by Dan Goldhaber and Michael Hansen, called “Assessing the Potential of Using Value-Added Estimates of Teacher Job Performance for Making Tenure Decisions,” which you can download here.

Looking at the paper, I don’t really think it’s a very good resource if you want to argue for tenure decisions based on VAM, but I guess it’s one of those things where they don’t expect you to actually do the homework.

For example, they admit that year-to-year scores are only correlated between 20% and 50% for the same teacher (page 4). But then they go on to say that, if you average two or more years in a row, these correlations go up (page 4). I’m wondering if that’s just because they calculate the correlations that come from the same underlying data, in which case of course the correlations go up. They aren’t precise enough at that point to make me convinced they did this carefully.
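
To make that worry concrete, here is a toy simulation. It is not the paper's method, and nothing in it comes from their data. It assumes each teacher's yearly score is a stable "quality" plus independent noise, scaled so that two single years correlate at a chosen value `rho` (0.3 here, picked from inside the reported 20-50% range). All function names are made up for illustration. It compares averaging over disjoint years with averaging over windows that share a year.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_scores(n_teachers, n_years, rho, seed=0):
    """Toy assumption: score = stable quality + independent yearly noise.

    Variances are chosen so two single-year scores correlate at about rho.
    """
    rng = random.Random(seed)
    signal_sd = math.sqrt(rho)
    noise_sd = math.sqrt(1 - rho)
    rows = []
    for _ in range(n_teachers):
        quality = rng.gauss(0, signal_sd)
        rows.append([quality + rng.gauss(0, noise_sd) for _ in range(n_years)])
    return rows


def window_mean(row, years):
    """Average one teacher's scores over the given year indices."""
    return sum(row[y] for y in years) / len(years)


def window_correlation(rows, years_a, years_b):
    """Correlation across teachers between two multi-year averages."""
    return pearson([window_mean(r, years_a) for r in rows],
                   [window_mean(r, years_b) for r in rows])


rows = simulate_scores(n_teachers=20000, n_years=4, rho=0.3)
single = window_correlation(rows, (0,), (1,))            # year 1 vs year 2
disjoint = window_correlation(rows, (0, 1), (2, 3))      # years 1-2 vs years 3-4
overlapping = window_correlation(rows, (0, 1), (1, 2))   # years 1-2 vs years 2-3

print(f"single years:       {single:.2f}")
print(f"disjoint windows:   {disjoint:.2f}")
print(f"overlapping windows:{overlapping:.2f}")
```

Under this toy model, the disjoint two-year averages should correlate at about 2*rho/(1+rho), roughly 0.46 for rho = 0.3, so averaging does help honestly. Windows that share a year should come out near (1+3*rho)/(2*(1+rho)), roughly 0.73, and that extra bump comes purely from the shared data. Which of the two the paper computed is exactly what I can't tell from their description.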

But it doesn’t matter, because when teachers are up for tenure, they have one or two scores, that’s it. So the fact that 17 years of scores, on average, has actual information, even if true, is irrelevant. The point is that we are asking whether one or two scores, in a test that has 20-50% correlation year-to-year, is sufficiently accurate and precise to decide on someone’s job. And by the way, in my post the correlation of teachers’ scores for the same year in the same subject was 24%, so I’m guessing we should lean more towards the bottom of this scale for accuracy.

This is ludicrous. Can you imagine being told you can’t keep your job because of a number that imprecise? I’m grasping for an analogy, but it’s something like getting tenure as a professor based on what an acquaintance you’ve never met heard about your reputation while he was drunk at a party. Maddening. And I can’t imagine it’s attracting more good people to the trade. I’d walk the other way if I heard about this.

The reason the paper is quoted so much is that it looks at a longer-term test to see whether early-career VAM scores have predictive power for the students more than 11 years later. However, it’s for one data set in North Carolina, and the testing actually happened in 1995 (page 6), so before the testing culture really took over (an important factor), and they clearly exclude any teacher whose paperwork is unavailable or unclear, as well as small classes (page 7), which presumably means any special-ed kids. Moreover, they admit they don’t really know if the kids are actual students of the teacher who proctored the tests (page 6).

Altogether a different set-up than the idiosyncratic, real-world situation faced by actual teachers, whose tenure decision is actually being made based on one or two hugely noisy numbers.

I’m not a huge fan of tenure, and I want educators to be accountable to being good teachers just like everyone else who cares about this stuff, but this is pseudo-science.

I’m still obsessed with the idea that people would know how crappy this stuff is if we could get our hands on the VAM itself and set something up where people could test robustness directly, by putting in their information and seeing how their score would change based on how many kids they had in their class, etc.

Categories: data science, open source tools, rant

Gangnam Style

August 1, 2012 Cathy O'Neil, mathbabe 8 comments

Best, most absurd video ever, and impossible not to feel cheered up after you watch it. From Gawker, hat tip Johan.

Categories: musing

Statisticians aren’t the problem for data science. The real problem is too many posers

July 31, 2012 Cathy O'Neil, mathbabe 29 comments

Crossposted on Naked Capitalism

Cosma Shalizi

I recently was hugely flattered by my friend Cosma Shalizi’s articulate argument against my position that data science distinguishes itself from statistics in various ways.

Cosma is a well-read, broadly educated guy, and a role model for what a statistician can be, not that every statistician lives up to his standard. I’ve enjoyed talking to him about data, big data, and working in industry, and I’ve blogged about his blogposts as well.

That’s not to say I agree with absolutely everything Cosma says in his post: in particular, there’s a difference between being a master at visualizations for the statistics audience and being able to put together a power point presentation for a board meeting, which some data scientists in the internet start-up scene definitely need to do (mostly this is a study in how to dumb stuff down without letting it become vapid, and in reading other people’s minds in advance to see what they find sexy).

And communications skills are a funny thing; my experience is communicating with an academic or a quant is a different kettle of fish than communicating with the Head of Product. Each audience has its own dialect.

But I totally believe that any statistician who willingly gets a job entitled “Data Scientist” would be able to do these things, it’s a self-selection process after all.

Statistics and Data Science are on the same team

I think that casting statistics as the enemy of data science is a straw man play. The truth is, an earnest, well-trained and careful statistician in a data scientist role would adapt very quickly to it and flourish as well, if he or she could learn to stomach the business-speak and hype (which changes depending on the role, and for certain data science jobs is really not a big part of it, but for others may be).

It would be a petty argument indeed to try to make this into a real fight. As long as academic statisticians are willing to admit they don’t typically spend just as much time (which isn’t to say they never spend as much time) worrying about how long it will take to train a model as they do wondering about the exact conditions under which a paper will be published, and as long as data scientists admit that they mostly just redo linear regression in weirder and weirder ways, then there’s no need for a heated debate at all.

Let’s once and for all shake hands and agree that we’re here together, and it’s cool, and we each have something to learn from the other.

Posers

What I really want to rant about today though is something else, namely posers. There are far too many posers out there in the land of data scientists, and it’s getting to the point where I’m starting to regret throwing my hat into that ring.

Without naming names, I’d like to characterize problematic pseudo-mathematical behavior that I witness often enough that I’m consistently riled up. I’ll put aside hyped-up, bullshit publicity stunts and generalized political maneuvering because I believe that stuff speaks for itself.

My basic mathematical complaint is that it’s not enough to just know how to run a black box algorithm. You actually need to know how and why it works, so that when it doesn’t work, you can adjust. Let me explain this a bit by analogy with respect to the Rubik’s cube, which I taught my beloved math nerd high school students to solve using group theory just last week.

Rubiks

First we solved the “position problem” for the 3-by-3-by-3 cube using 3-cycles, and proved it worked, by exhibiting the group acting on the cube, understanding it as a subgroup of $S_8 \times S_{12},$ and thinking hard about things like the sign of basic actions to prove we’d thought of and resolved everything that could happen. We solved the “orientation problem” similarly, with 3-cycles.

I did this three times, with the three classes, and each time a student would ask me if the algorithm is efficient. No, it’s not efficient, it takes about 4 minutes, and other people can solve it way faster, I’d explain. But the great thing about this algorithm is that it seamlessly generalizes to other problems. Using similar sign arguments and basic 3-cycle moves, you can solve the 7-by-7-by-7 (or any of them actually) and many other shaped Rubik’s-like puzzles as well, which none of the “efficient” algorithms can do.
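The sign bookkeeping behind this can be checked in a few lines. This is only a sketch of the parity argument, not a record of what we did in class, and the piece labels 0 through 3 are arbitrary. A 3-cycle is an even permutation, while a quarter turn of a face moves four corners in a 4-cycle and four edges in a 4-cycle, both odd.

```python
def sign(perm):
    """Sign of a permutation given as a list where perm[i] is the image of i."""
    seen = [False] * len(perm)
    result = 1
    for start in range(len(perm)):
        if seen[start]:
            continue
        # Walk the cycle containing `start` and measure its length.
        length = 0
        i = start
        while not seen[i]:
            seen[i] = True
            i = perm[i]
            length += 1
        # A cycle of length k is a product of k - 1 transpositions.
        if length % 2 == 0:
            result = -result
    return result

def cycle(n, *points):
    """Permutation of range(n) sending points[0] -> points[1] -> ... -> points[0]."""
    perm = list(range(n))
    for a, b in zip(points, points[1:] + points[:1]):
        perm[a] = b
    return perm

three_cycle_on_corners = cycle(8, 0, 1, 2)
quarter_turn_corners = cycle(8, 0, 1, 2, 3)  # a face turn 4-cycles four corners
quarter_turn_edges = cycle(12, 0, 1, 2, 3)   # ... and 4-cycles four edges

print(sign(three_cycle_on_corners))
print(sign(quarter_turn_corners), sign(quarter_turn_edges))
```

Since every face turn flips the corner sign and the edge sign together, the two signs always agree on a legal cube. If both are odd, one quarter turn makes both even, and from there 3-cycles (which never change the sign) are enough to finish the positions.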

Something I could have mentioned but didn’t is that the efficient algorithms are memorized by their users and are basically black-box algorithms. I don’t think people understand to any degree why they work. And when they are confronted with a new puzzle, some of those tricks generalize but not all of them, and they need new tricks to deal with centers that get scrambled with “invisible orientations”. And it’s not at all clear they can solve a tetrahedron puzzle, for example, with any success.

Democratizing algorithms: good and bad

Back to data science. It’s a good thing that data algorithms are getting democratized, and I’m all for there being packages in R or Octave that let people run clustering algorithms or steepest descent.

But, contrary to the message sent by much of Andrew Ng’s class on machine learning, you actually do need to understand how to invert a matrix at some point in your life if you want to be a data scientist. And, I’d add, if you’re not smart enough to understand the underlying math, then you’re not smart enough to be a data scientist.

I’m not being a snob. I’m not saying this because I want people to work hard. It’s not a laziness thing, it’s a matter of knowing your shit and being for real. If your model fails, you want to be able to figure out why it failed. The only way to do that is to know how it works to begin with. Even if it worked in a given situation, when you train on slightly different data you might run into something that throws it for a loop, and you’d better be able to figure out what that is. That’s your job.
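For a tiny example of the kind of thing that can throw a model for a loop, consider solving a 2-by-2 linear system whose matrix is nearly singular, as happens when two features are almost identical. This is a hand-rolled sketch with made-up numbers; `solve_2x2` is a hypothetical helper, not a function from any package.

```python
def solve_2x2(a, b, c, d, e, f):
    """Solve [[a, b], [c, d]] @ [x, y] = [e, f] by Cramer's rule."""
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return (e * d - b * f) / det, (a * f - e * c) / det

eps = 2.0 ** -20  # about 1e-6; a power of two keeps the float arithmetic exact

# Two nearly identical "features" give a nearly singular matrix.
first = solve_2x2(1.0, 1.0, 1.0, 1.0 + eps, 2.0, 2.0 + eps)
# Nudge one entry of the right-hand side by eps (about one part in a million).
second = solve_2x2(1.0, 1.0, 1.0, 1.0 + eps, 2.0, 2.0 + 2 * eps)

print(first)
print(second)
```

The solution jumps from (1, 1) to (0, 2) after a change of about one part in a million in the data. Someone who only presses the button sees wildly different coefficients; someone who knows what inverting a matrix involves looks at the determinant and knows why.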

As I see it, there are three problems with the democratization of algorithms:

As described already, it lets people who can load data and press a button describe themselves as data scientists.
It tempts companies to never hire anyone who actually knows how these things work, because they don’t see the point. This is a mistake, and could have dire consequences, both for the company and for the world, depending on how widely their crappy models get used.
Businesses might think they have awesome data scientists when they don’t. That’s not an easy problem to fix from the business side: posers can be fantastically successful exactly because non-data scientists who hire data scientists in business, i.e. business people, don’t know how to test for real understanding.

How do we purge the posers?

We need to come up with a plan to purge the posers: they are annoying and are giving data science a bad name.

One thing that will be helpful in this direction is Rachel Schutt’s Data Science class at Columbia next semester, which is going to be a much-needed bullshit free zone. Note there’s been a time change that hasn’t been reflected on the announcement yet, namely it’s going to be once a week, Wednesdays for three hours starting at 6:15pm. I’m looking forward to blogging on the contents of these lectures.

Categories: data science, rant

Columbia Data Science Institute: it’s gonna happen

July 30, 2012 Cathy O'Neil, mathbabe 1 comment

So Bloomberg finally got around to announcing the Columbia Data Science Institute is really going to happen. The details as we know them now:

It’ll be at the main campus, not Manhattanville.
It’ll hire 75 faculty over the next decade (specifically, 30 new faculty by launch in August 2016 and 75 by 2030, so actually more than a decade but who’s counting?).
It will contain a New Media Center, a Smart Cities Center, a Health Analytics Center, a Cybersecurity Center, and a Financial Analytics Center.
The city is pitching in $15 million whereas Columbia is ponying up $80 million.
Columbia Computer Science professor Kathy McKeown will be the Director and Civil Engineering professor Patricia Culligan will be the Institute’s Deputy Director.

Categories: data science, math education, news

The douche burger, and putting a ruler to the dick.

July 30, 2012 Cathy O'Neil, mathbabe 16 comments

I have been pretty hardcore and serious for a few weeks, and today I want to lighten it up for a change.

Douchery

First, I want everyone to read this article about a New York City food truck that sells douche burgers. From the article:

For just $666 you can purchase a foie gras-stuffed Kobe patty covered in Gruyere cheese that’s been melted with champagne steam and topped with lobster, truffles, caviar, and a BBQ sauce made with Kopi Luwak coffee beans that have been pooped out by some sort of animal called the Asian palm civet. The whole thing is then served in a gold-leaf wrapper.

Two things I like about this article, first that it’s hilarious and over the top satire, which is always excellent, and second that the world is picking up on my idea of calling people douches when they get really into esoteric stuff.

If you don’t believe me, read my previous post My friend the coffee douche. It’s one of my favorites.

Putting a ruler to the dick

Next, speaking of using language in a funny but pointed way, are you with me that “opening the kimono” is an offensive and sexist phrase? Well, how about we replace it with a better, more offensive, and more sexist phrase that’s even more fun to say, namely “putting a ruler to the dick”??

This was my friend Laura Strausfeld’s idea, and I love it. It’s gonna be the buzzword (buzzphrase) of the year, we just know it.

Here’s how it works in context:

guy A: “So do you think you’ll invest in those guys? They seemed really excited about that new technique they’ve developed!”

guy B: “I don’t know. They talked a big game, but until I can put a ruler to the dick I’m not putting my money there.”

Categories: musing

Does mathematics have a place in higher education?

July 29, 2012 Cathy O'Neil, mathbabe 66 comments

A recent New York Times Opinion piece (hat tip Wei Ho), Is Algebra Necessary?, argues for the abolishment of algebra as a requirement for college. It was written by Andrew Hacker, an emeritus professor of political science at Queens College, City University of New York. His concluding argument:

I’ve observed a host of high school and college classes, from Michigan to Mississippi, and have been impressed by conscientious teaching and dutiful students. I’ll grant that with an outpouring of resources, we could reclaim many dropouts and help them get through quadratic equations. But that would misuse teaching talent and student effort. It would be far better to reduce, not expand, the mathematics we ask young people to imbibe. (That said, I do not advocate vocational tracks for students considered, almost always unfairly, as less studious.)

Yes, young people should learn to read and write and do long division, whether they want to or not. But there is no reason to force them to grasp vectorial angles and discontinuous functions. Think of math as a huge boulder we make everyone pull, without assessing what all this pain achieves. So why require it, without alternatives or exceptions? Thus far I haven’t found a compelling answer.

For an interesting contrast, there’s a recent Bloomberg View Piece, How Recession Will Change University Financing, by Gary Shilling (not to be confused with Robert Shiller). From Shilling’s piece:

Most thought that a bachelor’s degree was the ticket to a well-paid job, and that the heavy student loans were worth it and manageable. And many thought that majors such as social science, education, criminal justice or humanities would still get them jobs. They didn’t realize that the jobs that could be obtained with such credentials were the nice-to-have but nonessential positions of the boom years that would disappear when times got tough and businesses slashed costs.

Some of those recent graduates probably didn’t want to do, or were intellectually incapable of doing, the hard work required to major in science and engineering. After all, afternoon labs cut into athletic pursuits and social time. Yet that’s where the jobs are now. Many U.S.-based companies are moving their research-and-development operations offshore because of the lack of scientists and engineers in this country, either native or foreign-born.

For 34- to 49-year-olds, student debt has leaped 40 percent in the past three years, more than for any other age group. Many of those debtors were unemployed and succumbed to for-profit school ads that promised high-paying jobs for graduates. But those jobs seldom materialized, while the student debt remained.

Moreover, many college graduates are ill-prepared for almost any job. A study by the Pew Charitable Trusts examined the abilities of U.S. college graduates in three areas: analyzing news stories, understanding documents and possessing the math proficiency to handle tasks such as balancing a checkbook or tipping in a restaurant.

The first article is written by a professor, so it might not be surprising that, as he sees more and more students coming through, he feels their pain and wants their experience to not be excruciating. The easiest way to do that is to remove the stumbling block requirement of math. He also seems to think of higher education as something everyone is entitled to, which I infer based on how he dismisses vocational training.

The second article is written by a financial analyst, an economist, so we might not be surprised that he strictly sees college as a purely commoditized investment in future income, and wants it to be a good one. The easiest way to do that is to have way fewer students go through college to begin with, since having dumb or bad students get into debt but not learn anything and then not get a job afterwards doesn’t actually make sense.

And where the first author acts like math is only needed for a tiny minority of college students, the second author basically dismisses non-math oriented subjects as frivolous and leading to a life of joblessness and debt. These are vastly different viewpoints. I’m thinking of inviting them both to dinner to discuss.

By the way, I think that last line, where Hacker wonders what the pain of math-as-huge-boulder achieves, is more or less answered by Shilling. The goal of having math requirements is to have students be mathematically literate, which is to say know how to do everyday things like balancing checkbooks and reading credit card interest rate agreements. The fact that we aren’t achieving this goal is important, but the goal is pretty clear. In other words, I think my dinner party would be fruitful as well as entertaining.

If there’s one thing these two agree on, it’s that students are having an awful lot of trouble doing basic math. This makes me wonder a few things.

First, why is algebra such a stumbling block? Is it that the students are really that bad, or is the approach to teaching it bad? I suspect what’s really going on is that the students taking it have mostly not been adequately taught the pre-requisites. That means we need more remedial college math.

I honestly feel like this is the perfect place for online learning. Instead of charging students enormous fees while they get taught high-school courses they should already know, and instead of removing basic literacy requirements altogether, ask them to complete some free online math courses at home or in their public library, to get them ready for college. The great thing about computers is that they can figure out the level of the user, and they never get impatient.

Next, should algebra be replaced by a Reckoning 101 course? Where, instead of manipulating formulas, we teach students to figure out tips and analyze news stories and understand basic statistical statements? I’m sure this has been tried, and I’m sure it’s easy to do badly or to water down entirely. Please tell me what you know. Specifically, are students better at snarky polling questions if they’ve taken these classes than if they’ve taken algebra?

Finally, I’d say this (and I’m stealing this from my friend Kiri, a principal of a high school for girls in math and science): nobody ever brags about not knowing how to read, but people brag all the time about not knowing how to do math. There’s nothing to be proud of in that, and it’s happening to a large degree because of our culture, not intelligence.

So no, let’s not remove mathematical literacy as a requirement for college graduates, but let’s think about what we can do to make the path reasonable and relevant while staying rigorous. And yes, there are probably too many students going to college because it’s now a cultural assumption rather than a thought-out decision, and this lands young people in debt up to their eyeballs and jobless, which sucks (here’s something that may help: forcing for-profit institutions to be honest in advertising future jobs promises and high interest debt).

Something just occurred to me. Namely, it’s especially ironic that the most mathematically illiterate and vulnerable students are being asked to sign loan contracts that they, almost by construction, don’t understand. How do we address this? Food for thought and for another post.

Categories: math, math education, news, statistics

Income distributions and misleading poll questions (#OWS)

July 28, 2012 Cathy O'Neil, mathbabe 9 comments

Disingenuous, pseudo-quantitative arguments piss me off.

In this recent Bloomberg View article entitled “Making the rich poorer doesn’t enrich the middle class,” Caroline Baum argues that middle class people would rather get more money than take away money from rich people. From the article:

Polling by the Pew Research Center shows that people aren’t interested in taking money from the wealthy. They just want a chance to get rich themselves.

But that’s a misleading question. It seems like a zero sum game when you put it that way, equivalent to something like, “Would you rather gain $100 or have a rich person somewhere lose $100?”.

But if you pose the question differently, and more in line with actual numbers, not to mention contextualized to reality in other ways, then you’d probably get the opposite.

Let’s take a look at wealth distribution from 2007, which I got here:

Let’s just say we’re being extreme and we take away all the wealth of the top 1% and give it to everybody equally (say we even give back some of it to those top 1%). That would mean that 34.6% get flattened out to 100 pots instead of one, which means that each of those percentiles gets about 0.35% more than they used to have. The middle 20% would grow from 4% of the overall wealth to (4 + 20*0.35)% = 11%. That’s still a lot less than 20%, but the wealth of the middle 20% is still nearly tripled by just this one percent re-distributing.
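The arithmetic above can be spelled out as follows. The two input figures come from the paragraph above; the variable names are just for illustration.

```python
top_1_share = 34.6     # percent of total wealth held by the top 1% (from the text)
middle_20_share = 4.0  # percent of total wealth held by the middle 20% (from the text)

# Spread the top 1%'s wealth evenly over all 100 percentiles.
gain_per_percentile = top_1_share / 100
# The middle 20% spans 20 percentiles, so it collects 20 of those equal pots.
new_middle_share = middle_20_share + 20 * gain_per_percentile
growth_factor = new_middle_share / middle_20_share

print(round(gain_per_percentile, 3))
print(round(new_middle_share, 2))
print(round(growth_factor, 2))
```

Without rounding 0.346 up to 0.35, the middle 20% ends up with about 10.9% of the wealth, roughly 2.7 times what it started with.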

Said another way, it’s not tit-for-tat at all.

If we asked someone in the middle class which they want more, a 1% increase in their wealth or a top 1%’er to lose 1% of their wealth, then that might be very different. Consider the political influence that 1% represents, at the very least. Consider the fact that 1% of the wealth of a person in the middle 20% is 173 times smaller than 1% of the wealth of someone in the top 1% (the top percentile holds 34.6% of the wealth, while each percentile in the middle 20% holds about 4%/20 = 0.2%).
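The factor of 173 is a per-percentile comparison, which a quick sketch makes explicit. It uses the same two figures as before and treats wealth as evenly spread within each group, which is a simplifying assumption.

```python
top_1_share = 34.6     # percent of total wealth, held by 1 percentile
middle_20_share = 4.0  # percent of total wealth, held by 20 percentiles

wealth_per_top_percentile = top_1_share / 1
wealth_per_middle_percentile = middle_20_share / 20

# A 1% change scales both sides equally, so the ratio of "1% of my wealth"
# between the two groups is just the ratio of wealth per percentile.
ratio = wealth_per_top_percentile / wealth_per_middle_percentile

print(round(ratio))
```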

It’s still not fair, though, because the middle class is so squeezed on necessities like food, housing, education, medical expenses, and child care, that they can’t afford even a 1% loss. What if you took those out?

If you go even further and ask someone in the middle class which they want more, a 1% increase in their discretionary income or a top 1%’er to lose 1% of their discretionary income, then that might be very different still. I haven’t been able to find a similar graphic to work with to see the discretionary income distribution, but rest assured it’s even more unbalanced.

Caroline Baum, would you care to cover those questions on your next poll to the middle class?

Categories: #OWS, news, rant

Why is LIBOR such a big deal? (#OWS)

July 27, 2012 Cathy O'Neil, mathbabe 7 comments

The manipulation of LIBOR interest rates by the big, mostly-European banks (but not entirely, see a full list here) was an open secret inside finance in 2008. As in so open that I didn’t think of it as a secret at all.

The fact that that manipulation is now consistently creating huge headlines is interesting to me – it brings up a few issues.

People seem surprised this out-and-out manipulation was happening. That says to me that they clearly still don’t understand what the culture of finance is really like. The fact that Bob Diamond of Barclays claims to have felt “physically ill” when he saw the emails of the traders manipulating LIBOR is either an out-and-out lie or the guy is simple-minded, as in stupid. And word on the street is he’s not stupid.
People still buy the line that most of the problems from the credit crisis arose from legal but wrong-headed efforts to make money, plus corrupt ratings on mortgage-backed securities. This is incredible to me. Let’s get it clear: the culture of finance is to take advantage of every opportunity to juice your bottom line, even if it’s wrong, even if it’s fraudulent, even if it affects the terms of loans on millions of houses and towns in other countries, and even if only your trading desk is benefiting.
The LIBOR manipulation in 2008 was about more than that, namely trying not to look as bad as other banks, to avoid being the next Lehman. It was done in the name of not looking weak and requiring a government bailout. Bob Diamond still doesn’t think they did anything wrong by lying there. It was almost like they were doing something noble.
Speaking of towns in other countries, read this article about how LIBOR manipulation has screwed U.S. cities to the ground. I’ve got a lot more to say about municipal debt and how that sleazy system works but it’s waiting for another post.
Finally, why did it take so long for the media to pick up on LIBOR manipulation? It tempts me to make a list of the illegal stuff that we all knew about back then and send it around just to make sure.

Categories: #OWS, finance, news

Is open data a good thing?

July 26, 2012 Cathy O'Neil, mathbabe 4 comments

As much as I like the idea of data being open and free, it’s not an open and shut case. As it were.

I’m first going to argue against open data with three examples.

The first is a pretty commonly discussed concern of privacy. Simply put, there is no such thing as anonymized data, and people who say there is are either lying or being naive. The amount of information you’d need to remove to really anonymize data is not known to be different from the amount of data you have in the first place. So if you did a good job to anonymize a data set, you’d probably remove all interesting information anyway. Of course, you could think this is only important with respect to individual data.
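To make the privacy worry concrete, here is a toy check of how many records in a “de-identified” table are still unique on a handful of harmless-looking fields. The records below are invented for illustration, and the choice of fields (zip code, birth year, sex) is an assumption about what an analyst might want to keep.

```python
from collections import Counter

# Hypothetical "anonymized" records: names removed, but zip code,
# birth year, and sex kept because they are useful for analysis.
records = [
    ("10027", 1972, "F"),
    ("10027", 1972, "F"),
    ("10027", 1985, "M"),
    ("10025", 1990, "F"),
    ("10025", 1964, "M"),
    ("10031", 1978, "M"),
]

# Count how often each combination of fields occurs.
counts = Counter(records)
# A record that occurs exactly once is pinned down by these fields alone.
unique = [r for r in records if counts[r] == 1]

print(f"{len(unique)} of {len(records)} records are unique on these three fields")
```

Any record that is unique on those fields can in principle be matched against an outside list that carries the same fields along with names. Coarsening the fields until nothing is unique is exactly the step that tends to strip out the interesting information.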

But my next example comes from land data, specifically Tamil Nadu in Southern India. There’s an interesting Crooked Timber blogpost here (hat tip Suresh Naidu) explaining how “open data” has screwed a local population, the Dalits. Although you could (and I would) argue that the way the data is collected and disseminated, and the fact that the courts go along with this process, is itself politically motivated and disenfranchising, there are some important points made in this post:

Open data undermines the power of those who benefit from “the idiosyncracies and complexities of communities… Local residents [who] understand the complexity of their community due to prolonged exposure.” The Bhoomi land records program is an example of this: it explicitly devalues informal knowledge of particular places and histories, making it legally irrelevant; in the brave new world of open data such knowledge is trumped by the ability to make effective queries of the “open” land records. The valuing of technological facility over idiosyncratic and informal knowledge is baked right in to open data efforts.

The Crooked Timber blog post specifically called out Tim O’Reilly and his “Government as Platform” project as troublesome:

The faith in markets sometimes goes further among open data advocates. It’s not just that open data can create new markets, there is a substantial portion of the push for open data that is explicitly seeking to create new markets as an alternative to providing government services.

It’s interesting to see O’Reilly’s Mike Loukides’s reaction (hat tip Chris Wiggins), entitled the Dark Side of Data, here. From Loukides:

The issue is how data is used. If the wealthy can manipulate legislators to wipe out generations of records and folk knowledge as “inaccurate,” then there’s a problem. A group like DataKind could go in and figure out a way to codify that older generation of knowledge. Then at least, if that isn’t acceptable to the government, it would be clear that the problem lies in political manipulation, not in the data itself. And note that a government could wipe out generations of “inaccurate records” without any requirement that the new records be open. In years past the monied classes would have just taken what they wanted, with the government’s support. The availability of open data gives a plausible pretext, but it’s certainly not a prerequisite (nor should it be blamed) for manipulation by the 0.1%.

[Speaking of DataKind (formerly Data Without Borders), it’s also a problem, as I discovered as a data ambassador working with the NYCLU on Stop, Question and Frisk data, when the government claims to be open but withholds essential data such as crime reports.]

My final example comes from finance. On the one hand I want total transparency of the markets, because it sickens me to think about how nobody knows the actual price of bonds, or the correct interest rate, or the current default assumption of the market, how all of that stuff is being kept secret by Wall Street insiders so they can each skim off their little cut and the dumb money players get constantly screwed.

But on the other hand, if I imagine a world where everything really is transparent, then even in the best of all database situations, that’s just asstons of data which only the very very richest and most technologically savvy high finance types could ever munge through.

So who would benefit? I’d say, for some time, the average dumb money customer would benefit very slightly, by not paying extra fees, but that the edgy techno finance firms would benefit fantastically. Then, I imagine, new ways would be invented for the dumb money customers to lose that small amount of benefit altogether, probably by just inundating them with so much data they can’t absorb it.

In other words, open data is great for the people who have the tools to use it for their benefit, usually to exploit other people and opportunities. It’s not clearly great for people who don’t have those tools.

But before I conclude that data shouldn’t be open, let me strike an optimistic (for me) tone.

The tools for the rest of us are being built right now. I’m not saying that the non-exploiters will ever catch up with the Goldman Sachs and credit card companies, because probably not.

But there will be real tools (there already are things like Python and R, and they’re getting better every day), built out of the open software movement, that will help specific people analyze and understand specific things, and there are platforms like WordPress and Twitter that will allow those things to be broadcast, which will have real impact when the truth gets out. An example is the Crooked Timber blog post above.

So yes, open data is not an unalloyed good. It needs to be a war waged by people with common sense and decency against those who would only use it for profit and exploitation. I can’t think of a better thing to do with my free time.

Categories: finance, open source tools, rant
