### Archive

Archive for December, 2012

## I totally trust experts, actually

I lied yesterday, as a friend at my Occupy meeting pointed out to me last night.

I made it seem like I look into every model before trusting it, and of course that’s not true. I eat food grown and prepared by other people daily. I go on airplanes and buses all the time, trusting that they will work and that they will be driven safely. I still have my money in a bank, and I also hire an accountant and sign my tax forms without reading them. So I’m a hypocrite, big-time.

There’s another thing I should clear up: I’m not claiming I understand everything about climate research just because I talked to an expert for 2 or 3 hours. I am certainly not an expert, nor am I planning to become one. Even so, I did learn a lot, and the research I undertook was incredibly useful to me.

So, for example, my father is a climate change denier, and I have heard him give a list of scientific facts to argue against climate change. I asked my expert to counter-argue these points, and he did so. I also asked him to explain the underlying model at a high level, which he did.

My conclusion wasn’t that I’ve looked carefully into the model and it’s right, because that’s not possible in such a short time. My conclusion was that this guy is trustworthy and uses logical argument, which he’s happy to share with interested people, and moreover he manages to defend against deniers without being intellectually defensive. In the end, I’m trusting him, an expert.

On the other hand, if I met another person with a totally different conclusion, who also impressed me as intellectually honest and curious, then I’d definitely listen to that guy too, and I’d be willing to change my mind.

So I do imbue models and theories with a limited amount of trust, depending on how much sense they make to me. I think that’s reasonable, and it’s in line with my advocacy of scientific interpreters. Obviously not all scientific interpreters would be telling the same story, but that’s not a problem – in fact it’s vital that they don’t, because it is a privilege to be allowed to listen to the different sides and be engaged in the debate.

If I sat down with an expert for a whole day, like my friend Jordan suggests, to determine if they were “right” on an issue where there’s argument among experts, then I’d fail, but even understanding what they were arguing about would be worthwhile and educational.

Let me say this another way: experts argue about what they don’t agree on, of course, since it would be silly for them to talk about what they do agree on. But it’s their commonality that we, the laypeople, are missing. And that commonality is often so well understood that we could understand it rather quickly if it was willingly explained to us. That would be a huge step.

So I wasn’t lying after all, if I am allowed to define the “it” that I did get at in the two hours with an expert. When I say I understood it, I didn’t mean everything, I meant a much larger chunk of the approach and method than I’d had before, and enough to evoke (limited) trust.

Something I haven’t addressed, and which I need to think about more (please help!), is the question of which subjects require active skepticism. One of my commenters, Paul Stevens, brought this up:

… For me, lay people means John Q Public – public opinion because public opinion can shape policy. In practice, this only matters for a select few issues, such as climate change or science education. There is no impact to a lay person not understanding / believing in the Higgs particle for example.

## On trusting experts, climate change research, and scientific translators

Stephanie Tai has written a thoughtful response on Jordan Ellenberg’s blog to my discussion with Jordan regarding trusting experts (see my Nate Silver post and the follow-up post for more context).

Trusting experts

Stephanie asks three important questions about trusting experts, which I paraphrase here:

1. What does it take to look into a model yourself? How deeply must you probe?
2. How do you avoid being manipulated when you do so?
3. Why should we bother since stuff is so hard and we each have a limited amount of time?

I must confess I find the first two questions really interesting and want to think about them, but I have very little patience with the last one.

Here’s why:

• I’ve seen too many people (individual modelers) intentionally deflect investigations into models by setting them up as so hard that it’s not worth it (or at least it seems not worth it). They use buzz words and make it seem like there’s a magical layer of their model which makes it too difficult for mere mortals. But my experience (as an arrogant, provocative, and relentless questioner) is that I can always understand a given model if I’m talking to someone who really understands it and actually wants to communicate it.
• It smacks of an excuse rather than a reason. If it’s our responsibility to understand something, then by golly we should do it, even if it’s hard.
• Using this “too hard” argument, too many things get left up to people whose intentions are unreasonable, and it gives those people cover to make entire systems seem too difficult to penetrate. For a great example, see the financial system, which is consistently too complicated for regulators to properly regulate.

I’m sure I seem unbelievably cynical here, but that’s where I got by working in finance, where I saw first-hand how manipulative and manipulated mathematical modeling can become. And there’s no reason at all such machinations wouldn’t translate to the world of big data or climate modeling.

Climate research

Speaking of climate modeling: first, it annoys me that people are using my “distrust the experts” line to cast doubt on climate modelers.

People: I’m not asking you to simply be skeptical, I’m saying you should look into the models yourself! It’s the difference between sitting on a couch, pointing at a football game on TV, and complaining about a missed play, versus getting on the football field yourself and trying to figure out how to throw the ball. The first is entertainment, but not valuable to anyone but yourself. You are only adding to the discussion if you invest actual thoughtful work in the matter.

To that end, I invited an expert climate researcher to my house and asked him to explain the climate models to me and my husband, and although I’m not particularly skeptical of climate change research (more on that below when I compare incentives of the two sides), I asked obnoxious, relentless questions about the model until I was satisfied. And now I am satisfied. I am considering writing it up as a post.

As an aside, if climate researchers are annoyed by the skepticism, I can understand that, since football fans are an obnoxious group, but they should not get annoyed by people who want to actually do the work to understand the underlying models.

Another thing about climate research: people keep talking about incentives, and yes, I agree wholeheartedly that we should follow the incentives to understand where manipulation might be taking place. But when I followed the incentives with respect to climate modeling, they brought me straight to climate change deniers, not to researchers.

Do we really think these scientists working with their research grants have more at stake than multi-billion dollar international companies who are trying to ignore the effect of their polluting factories on the environment? People, please. The bulk of the incentives are definitely with the business owners. Which is not to say there are no incentives on the other side, since everyone always wants to feel like their research is meaningful, but let’s get real.

Scientific translators

I like this idea Stephanie comes up with:

Some sociologists of science suggest that translational “experts”–that is, “experts” who aren’t necessarily producing new information and research, but instead are “expert” enough to communicate stuff to those not trained in the area–can help bridge this divide without requiring everyone to become “experts” themselves. But that can also raise the question of whether these translational experts have hidden agendas in some way. Moreover, one can also raise questions of whether a partial understanding of the model might in some instances be more misleading than not looking into the model at all–examples of that could be the various challenges to evolution based on fairly minor examples that when fully contextualized seem minor but may pop out to someone who is doing a less systematic inquiry.

First, I attempt to make my blog something like a platform for this, and I also do my best to make my agenda not at all hidden so people don’t have to worry about that.

This raises a few issues for me:

• Right now we depend mostly on press to do our translations, but they aren’t typically trained as scientists. Does that make them more prone to being manipulated? I think it does.
• How do we encourage more translational expertise to emerge from actual experts? Currently, in academia, the translation to the general public of one’s research is not at all encouraged or rewarded, and outside academia even less so.
• Like Stephanie, I worry about hidden agendas and partial understandings, but I honestly think they are secondary to getting a robust system of translation started to begin with, which would hopefully in turn engage the general public with the scientific method and current scientific knowledge. In other words, the good outweighs the bad here.

## Open data is not a panacea

I’ve talked a lot recently about how there’s an information war currently being waged on consumers by companies that troll the internet and collect personal data, search histories, and other “attributes” in data warehouses, which then get sold to the highest bidders.

It’s natural to want to balance out this information asymmetry somehow. One such approach is open data, defined in Wikipedia as the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents, or other mechanisms of control.

I’m going to need more than one blog post to think this through, but I wanted to make two points this morning.

The first is my issue with the phrase “freely available to everyone to use”. What does that mean? Having worked in futures trading, where we put trading machines and algorithms in close proximity with exchanges for large fees so we can get to the market data a few nanoseconds before anyone else, it’s clear to me that availability and access to data is an incredibly complicated issue.

And it’s not just about speed. You can have hugely important, rich, and large data sets sitting in a lump on a publicly available website like Wikipedia, and if you don’t have fancy parsing tools and algorithms you’re not going to be able to make use of them.

When important data goes public, the edge goes to the most sophisticated data engineer, not to the general public. The Goldman Sachses of the world will always know how to make use of “freely available to everyone” data before the average guy does.

Which brings me to my second point about open data. It’s general wisdom that we should hope for the best but prepare for the worst. My feeling is that as we move towards open data we are doing plenty of the hoping part but not enough of the preparing part.

If there’s one thing I learned working in finance, it’s not to be naive about how information will be used. You’ve got to learn to think like an asshole to really see what to worry about. It’s a skill which I don’t regret having.

So, if you’re giving me information on where public schools need help, I’m going to imagine using that information to cut off credit for people who live nearby. If you tell me where environmental complaints are being served, I’m going to draw a map and see where they aren’t being served so I can take my questionable business practices there.

I’m not saying proponents of open data aren’t well-meaning, they often seem to be. And I’m not saying that the bad outweighs the good, because I’m not sure. But it’s something we should figure out how to measure, and in this information war it’s something we should keep a careful eye on.

## Suggested New Year’s resolution: start a blog

I was thinking the other day how much I’ve gotten out of writing this blog. I’m incredibly grateful for it, and I want you to consider starting a blog too. Let’s go through the pros and cons:

Pros

1. A blog forces you to articulate your thoughts rather than having vague feelings about issues.
2. This means you get past things that are bothering you.
3. You also get much more comfortable with writing, because you’re doing it rather than thinking about doing it.
4. If other people read your blog you get to hear what they think too. You learn a lot that way.
5. Your previously vague feelings and half-baked ideas are not only formulated, but much better thought out than before, what with all the feedback. You’ll find yourself changing your mind or at least updating and modifying lots of opinions.
6. You also get to make new friends through people who read your blog (this is my favorite part).
7. Over time, instead of having random vague thoughts about things that bug you, you almost feel like you have a theory about the things that bug you (this could be a “con” if you start feeling all bent out of shape because the world is going to hell).

Cons

1. People often think what you’re saying is dumb and they don’t resist telling you (you could think of this as a “pro” if you enjoy growing a thicker skin, which I do).
2. Once you say something dumb, it’s there for all time, in your handwriting, and you’ve gone on record saying dumb things (that’s okay too if you don’t mind being dumb).
3. It takes a pretty serious commitment to write a blog, since you have to think of things to say that might interest people (a thing you should never say on a blog: “Sorry it’s been so long since I wrote a post!”).
4. Even when you’re right, and you’ve articulated something well, people can always dismiss what you’ve said by claiming it can’t be important since it’s just a blog.

Tips

1. Set aside time for your blog every day. My time is usually 6-7am, before the kids wake up.
2. Keep notes for yourself on bloggy subjects. I write a one-line gmail to myself with the subject “blog ideas” and in the morning I search for that phrase and I’m presented with a bunch of cool ideas.
3. For example I might write something like, “Can I pay people to not wear moustaches?” and I leave a link if appropriate.
4. I try to switch up the subject of the blog so I don’t get bored. This may also keep my readers from getting bored, but don’t worry too much about them, because that’s distracting.
5. My imagined audience is almost always a friend who would forgive me if I messed something up. It’s a friendly conversation.
6. Often I write about something I’ve found myself explaining or complaining about a bunch of times in the past few days.
7. Anonymous negative comments happen, and are often written by jerks. Try to not take them personally.
8. Try to accept criticism if it’s helpful and ignore it if it’s hurtful. And don’t hesitate to delete hurtful comments. If that jerk wants a platform, he or she can start his or her own goddamn blog.
9. Never feel guilty towards your blog. It’s inanimate. If you start feeling guilty then think about how to make it more playful. Take a few days off and wait until you start missing your blog, which will happen, if you’re anything like me.
Categories: musing

## Corporations don’t act like people

Corporations may be legally protected like people, but they don’t act selfishly like people do.

I’ve written about this before here, when I was excitedly reading Liquidated by Karen Ho, but recent overheard conversations have made me realize that there’s still a feeling out there that “the banks” must not have understood how flawed the models were because otherwise they would have avoided them out of a sense of self-preservation.

Important: “the banks” don’t think or do things; people inside the banks think and do things. In fact, the people inside the banks think about themselves and their own chances of getting big bonuses or getting fired, and they don’t think about the bank’s future at all. The exception may be the very top brass of management, who may or may not care about the future of their institutions as a legacy-reputation issue. But in any case their nascent reputation fears, if they existed at all, did not seem to overwhelm their near-term desire for lots of money.

Example: I saw Robert Rubin on stage, well before the major problems at Citi, in a discussion about how badly the mortgage-backed securities market was apt to perform in the very near future. He did not seem too stupid to understand what the conversation was about, but that didn’t stop him from ignoring the problem at Citigroup while taking in $126 million. The U.S. government, in the meantime, bailed out Citigroup to the tune of $45 billion, with a further guarantee of $300 billion. Here’s a Bloomberg BusinessWeek article excerpt about how he saw his role:

Rubin has said that Citigroup’s losses were the result of a financial force majeure. “I don’t feel responsible, in light of the facts as I knew them in my role,” he told the New York Times in April 2008. “Clearly, there were things wrong. But I don’t know of anyone who foresaw a perfect storm, and that’s what we’ve had here.” In March 2010, Rubin elaborated in testimony before the Financial Crisis Inquiry Commission. “In the world of trading, the world I have lived in my whole adult life, there is always a very important distinction between what you could have reasonably known in light of the facts at the time and what you know with the benefit of hindsight,” he said. Pressed by FCIC Executive Director Thomas Greene about warnings he had received regarding the risk in Citigroup’s mortgage portfolio, Rubin was opaque: “There is always a tendency to overstate—or over-extrapolate—what you should have extrapolated from or inferred from various events that have yielded warnings.”

Bottom line: there’s no such thing as a bank’s desire for self-preservation. Let’s stop thinking about things that way.

Categories: finance, rant

## Consumer segmentation taken to the extreme

I’m up in Western Massachusetts with the family, hidden off in a hotel with a pool and a nearby yarn superstore. My blogging may be spotty for the next few days, but rest assured I haven’t forgotten about mathbabe (or Aunt Pythia).
I have just enough time this morning to pose a thought experiment. It’s in three steps.

First, read this Reuters article, which ends with:

Imagine if Starbucks knew my order as I was pulling into the parking lot, and it was ready the second I walked in. Or better yet, if a barista could automatically run it out to my car the exact second I pulled up. I may not pay more for that everyday, but I sure as hell would if I were late to a meeting with a screaming baby in the car. A lot more. Imagine if my neighborhood restaurants knew my local, big-tipping self was the one who wanted a reservation at 8 pm, not just an anonymous user on OpenTable. They might find some room. And odds are, I’d tip much bigger to make sure I got the preferential treatment the next time. This is why Uber’s surge pricing is genius when it’s not gouging victims of a natural disaster. There are select times when I’ll pay double for a cab. Simply allowing me to do so makes everyone happy. In a world where the computer knows where we are and who we are and can seamlessly charge us, the world might get more expensive. But it could also get a whole lot less annoying. “This is what big data means to me,” Rosensweig says.

Second, think about just how not “everyone” is happy. It’s a pet peeve of mine that people who like their personal business plan consistently insist that everybody wins, when clearly there are often people (usually invisible) who are definitely losing. In this case the losers are people whose online personas don’t correlate (in a given model) with big tips. Should those people not be able to reserve a table at a restaurant now? How is that model going to work?

And now I’ve gotten into the third step. It used to be true that if you went to a restaurant enough, the chef and the waitstaff would get to know you and might even keep a table open for you. It was old-school personalization.
What if that really did start to happen at every restaurant and store automatically, based on your online persona? On the one hand, how weird would that be, and on the other hand, how quickly would we all get used to it? And what would that mean for understanding each other’s perspectives?

Categories: data science, modeling, musing

## Whom can you trust?

My friend Jordan has written a response to yesterday’s post about Nate Silver. He is a major fan of Silver and contends that I’m not fair to him:

I think Cathy’s distrust is warranted, but I think Silver shares it. The central concern of his chapter on weather prediction is the vast difference in accuracy between federal hurricane forecasters, whose only job is to get the hurricane track right, and TV meteorologists, whose very different incentive structure leads them to get the weather wrong on purpose. He’s just as hard on political pundits and their terrible, terrible predictions, which are designed to be interesting, not correct.

To this I’d say: Silver mocks TV meteorologists and political pundits in a dismissive way, as not being scientific enough. That’s not the same as taking them seriously and understanding their incentives, and it doesn’t translate to the much more complicated world of finance. In any case, he could have understood incentives in every field except finance and I’d still be mad, because my direct experience with finance made me understand it, and the outsized effect it has on our economy makes it hugely important.

But Jordan brings up an important question about trust:

But what do you do with cases like finance, where the only people with deep domain knowledge are the ones whose incentive structure is socially suboptimal? (Cathy would use saltier language here.) I guess you have to count on mavericks like Cathy, who’ve developed the domain knowledge by working in the financial industry, but who are now separated from the incentives that bind the insiders.
But why do I trust what Cathy says about finance? Because she’s an expert. Is Cathy OK with this?

No, Cathy isn’t okay with this. The trust problem is huge, and I address it directly in my post:

This raises a larger question: how can the public possibly sort through all the noise that celebrity-minded data people like Nate Silver hand to them on a silver platter? Whose job is it to push back against rubbish disguised as authoritative scientific theory? It’s not a new question, since PR men disguising themselves as scientists have been around for decades. But I’d argue it’s a question that is increasingly urgent considering how much of our lives are becoming modeled. It would be great if substantive data scientists had a way of getting together to defend the subject against sensationalist celebrity-fueled noise. One hope I nurture is that, with the opening of the various data science institutes, such as the one at Columbia that was announced a few months ago, there will be a way to form exactly such a committee. Can we get a little peer review here, people?

I do think domain-expertise-based peer review will help, but not when the entire field is captured, as in some subfields of medical research and in some subfields of economics and finance (for a great example, see Glenn Hubbard get destroyed in Matt Taibbi’s recent blog post for selling his economic research). The truth is, some fields are so yucky that people who want to do serious research just leave because they are disgusted. Then the people who remain are the “experts”, and you can’t trust them. The toughest part is that you don’t know which fields are like this until you try to work inside them.

Bottom line: I’m telling you not to trust Nate Silver, and I would also urge you not to trust any one person, including me. For that matter, don’t necessarily trust crowds of people either. Instead, carry a healthy dose of skepticism and ask hard questions.
This is asking a lot, and it will get harder as time goes on and the world becomes more complicated. On the one hand, we need increased transparency for scientific claims, of the kind projects like runmycode provide. On the other, we need to understand the incentive structure inside a field like finance to make sure it is aligned with its stated mission.

Categories: data science, finance, modeling

## Nate Silver confuses cause and effect, ends up defending corruption

Crossposted on Naked Capitalism

I just finished reading Nate Silver’s newish book, The Signal and the Noise: Why So Many Predictions Fail – but Some Don’t.

The good news

First off, let me say this: I’m very happy that people are reading a book on modeling in such huge numbers – it’s currently eighth on the New York Times best-seller list, where it’s been for nine weeks. This means people are starting to really care about modeling, both how it can help us remove biases to clarify reality and how it can institutionalize those same biases and go bad. As a modeler myself, I am extremely concerned about how models affect the public, so the book’s success is wonderful news. The first step to getting people to think critically about something is to get them to think about it at all.

Moreover, the book serves as a soft introduction to some of the issues surrounding modeling. Silver has a knack for explaining things in plain English. While he only goes so far, this is reasonable considering his audience. And he doesn’t dumb the math down. In particular, Silver does a nice job of explaining Bayes’ Theorem. (If you don’t know what Bayes’ Theorem is, just focus on how Silver uses it in his version of Bayesian modeling: namely, as a way of adjusting your estimate of the probability of an event as you collect more information.
You might think infidelity is rare, for example, but after a quick poll of your friends and a quick Google search you might have collected enough information to reexamine and revise your estimate.)

The bad news

Having said all that, I have major problems with this book and what it claims to explain. In fact, I’m angry.

It would be reasonable for Silver to tell us about his baseball models, which he does. It would be reasonable for him to tell us about political polling and how he uses weights on different polls to combine them into a better overall poll. He does this as well. He also interviews a bunch of people who model in other fields, like meteorology and earthquake prediction, which is fine, albeit superficial.

What is not reasonable, however, is for Silver to claim to understand how the financial crisis was the result of a few inaccurate models, or how medical research need only switch from being frequentist to being Bayesian to become more accurate. Let me give you some concrete examples from his book.

Easy first example: credit rating agencies

The rating agencies, which famously put AAA ratings on terrible loans, and spoke among themselves of being willing to rate things that were “structured by cows”, did not accidentally have bad underlying models. The bankers packaging and selling these deals, which among themselves they called “sacks of shit”, did not blithely believe in their safety because of those ratings. Rather, the entire industry crucially depended on the false models. Indeed, they changed the data to conform to the models, which is to say it was an intentional combination of using flawed models and using irrelevant historical data (see points 64-69 here for more (update: that link is now behind the paywall)).

In baseball, a team can’t create bad or misleading data to game the models of other teams in order to get an edge. But in the financial markets, parties to a model can and do.
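As an aside, the kind of Bayesian updating described in the Bayes’ Theorem passage above can be sketched in a few lines of Python. This is only an illustration: the `bayes_update` helper and every probability in it are invented for the sketch, not taken from Silver’s book.

```python
# A toy sketch of Bayesian updating: revise the probability of a
# hypothesis H as each new piece of evidence arrives.
# All names and numbers here are made up for illustration.

def bayes_update(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Return P(H | evidence) by Bayes' theorem."""
    numerator = p_evidence_given_h * prior
    denominator = numerator + p_evidence_given_not_h * (1 - prior)
    return numerator / denominator

# Start with a low prior that H is true (e.g. "infidelity is rare")...
p = 0.04

# ...then fold in two pieces of evidence, each more likely if H is true.
for p_e_given_h, p_e_given_not_h in [(0.6, 0.1), (0.8, 0.2)]:
    p = bayes_update(p, p_e_given_h, p_e_given_not_h)

print(p)  # the posterior after both updates
```

Each pass through the loop treats the previous posterior as the new prior, which is exactly the “adjust your estimate as you collect more information” step described above.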
In fact, every failed model is actually a success

At the end of his first chapter, Silver gives four examples of what he considers to be failed models, all related to economics and finance. But each example is actually a success (for the insiders) if you look at a slightly larger picture and understand the incentives inside the system.

Here are the models:

1. The housing bubble.
2. The credit rating agencies selling AAA ratings on mortgage securities.
3. The financial meltdown caused by high leverage in the banking sector.
4. The economists’ predictions, after the financial crisis, of a fast recovery.

Here’s how each of these models worked out rather well for those inside the system:

1. Everyone involved in the mortgage industry made a killing. Who’s going to stop the music and tell people to worry about home values? Homeowners and taxpayers made money (on paper at least) in the short term but lost in the long term, while the bankers took home bonuses that they still have.
2. As we discussed, this was a system-wide tool for building a money machine.
3. The financial meltdown was incidental, but the leverage was intentional. It bumped up the risk and thus, in good times, the bonuses. This is a great example of the modeling feedback loop: nobody cares about the wider consequences if they’re getting bonuses in the meantime.
4. Economists are only putatively trying to predict the recovery. Actually, they’re trying to affect the recovery. They get paid the big bucks, and they are granted authority and power, in part to give consumers confidence, which they presumably hope will lead to a robust economy.

Cause and effect get confused

Silver confuses cause and effect. We didn’t have a financial crisis because of a bad model or a few bad models. We had bad models because of a corrupt and criminally fraudulent financial system. That’s an important distinction, because we could fix a few bad models with a few good mathematicians, but we can’t fix the entire system so easily.
There’s no math band-aid that will cure these boo-boos. I can’t emphasize this too strongly: this is not just wrong, it’s maliciously wrong. If people believe in the math band-aid, then we won’t fix the problems in the system that so desperately need fixing.

Why does he make this mistake?

Silver has an unswerving assumption, which he repeats several times, that the only goal of a modeler is to produce an accurate model. (Actually, he makes an exception for stock analysts.) This assumption generally holds in his experience: poker, baseball, and polling are all arenas in which one’s incentive is to be as accurate as possible. But he falls prey to some of the very mistakes he warns about in his book, namely over-confidence and over-generalization. He assumes that, since he’s an expert in those arenas, he can generalize to the field of finance, where he is not an expert.

The logical result of this assumption is his definition of failure as something where the underlying mathematical model is inaccurate. But that’s not how most people would define failure, and it is dangerously naive.

Medical research

Silver refers, both in the Introduction and in Chapter 8, to John Ioannidis’s work revealing that most medical research is wrong, and he explains his own point of view on it.

I’m glad he mentions incentives here, but again he confuses cause and effect. As I learned when I attended David Madigan’s lecture on Merck’s representation of Vioxx research to the FDA, as well as from his recent research on the methods used in epidemiology research, the flaws in these medical models will be hard to combat, because they advance the interests of the insiders: competition among academic researchers to publish and get tenure is fierce, and there are enormous financial incentives for pharmaceutical companies.
Everyone in this system benefits from methods that allow one to claim statistically significant results, whether or not that’s valid science, and even though there are lives on the line. In other words, it’s not that there are bad statistical approaches which lead to vastly over-reported statistically significant results and published papers (which could just as easily happen if the researchers were employing Bayesian techniques, by the way). It’s that there’s a massive incentive to claim statistically significant findings, and not much push-back when that’s done erroneously, so the field never self-examines and improves its methodology. The bad models are a consequence of misaligned incentives. I’m not accusing people in these fields of intentionally putting people’s lives on the line for the sake of their publication records. Most of the people in the field are honestly trying their best. But their intentions are kind of irrelevant.

Silver ignores politics and loves experts

Silver chooses to focus on individuals working in a tight competition and their motives and individual biases, which he understands and explains well. For him, modeling is a man-versus-wild type of thing: working with your wits in a finite universe to win the chess game. He spends very little time on the question of how people act inside larger systems, where a given modeler might be more interested in keeping their job or getting a big bonus than in making their model as accurate as possible. In other words, Silver crafts an argument which ignores politics. This is Silver’s blind spot: in the real world politics often trump accuracy, and accurate mathematical models don’t matter as much as he hopes they would. As an example of politics getting in the way, let’s go back to the culture of the credit rating agency Moody’s. William Harrington, an ex-Moody’s analyst, describes the politics of his work as follows:

In 2004 you could still talk back and stop a deal. That was gone by 2006.
It became: work your tail off, and at some point management would say, ‘Time’s up, let’s convene in a committee and we’ll all vote “yes”’.

To be fair, there have been moments in his past when Silver delves into politics directly, like this post from the beginning of Obama’s first administration, where he starts with this (emphasis mine):

To suggest that Obama or Geithner are tools of Wall Street and are looking out for something other than the country’s best interest is freaking asinine.

and he ends with:

This is neither the time nor the place for mass movements — this is the time for expert opinion. Once the experts (and I’m not one of them) have reached some kind of a consensus about what the best course of action is (and they haven’t yet), then figure out who is impeding that action for political or other disingenuous reasons and tackle them — do whatever you can to remove them from the playing field. But we’re not at that stage yet.

My conclusion: Nate Silver is a man who deeply believes in experts, even when the evidence is not good that their incentives are aligned with the public’s.

Distrust the experts

Call me “asinine,” but I have less faith in the experts than Nate Silver: I don’t want to trust the very people who got us into this mess, while benefitting from it, to also be in charge of cleaning it up. And, being part of the Occupy movement, I obviously think that this is the time for mass movements. From my experience working first in finance at the hedge fund D.E. Shaw during the credit crisis and afterwards at the risk firm Riskmetrics, and my subsequent experience working in the internet advertising space (a wild west of unregulated personal information warehousing and sales), my conclusion is simple: Distrust the experts. Why? Because you don’t know their incentives, and they can make the models (including Bayesian models) say whatever is politically useful to them.
This is a manipulation of the public’s trust in mathematics, but it is the norm rather than the exception. And modelers rarely if ever consider the feedback loop and the ramifications of their predatory models on our culture.

Why do people like Nate Silver so much?

To be crystal clear: my big complaint about Silver is naivete, and to a lesser extent, authority-worship. I’m not criticizing Silver for not understanding the financial system. Indeed, one of the most crucial problems with the current system is its complexity, and as I’ve said before, most people inside finance don’t really understand it. But at the very least he should know that he is not an authority and should not act like one. I’m also not accusing him of knowingly covering for the financial industry. But covering for the financial industry is an unfortunate side-effect of his naivete and presumed authority, and a very unwelcome source of noise at a moment when so much needs to be done.

I’m writing a book myself on modeling. When I began reading Silver’s book I was a bit worried that he’d already said everything I’d wanted to say. Instead, I feel like he’s written a book which has the potential to dangerously mislead people – if it hasn’t already – because it ignores the surrounding political landscape. Silver has gone to great lengths to make his message simple and positive, and to make people feel smart and smug, especially Obama’s supporters. He gets well paid for his political consulting work and speaking appearances at hedge funds like D.E. Shaw and Jane Street, and, in order to maintain this income, it’s critical that he perfect a patina of modeling genius combined with an easily digested message for his financial and political clients. Silver is selling a story we all want to hear, and a story we all want to be true. Unfortunately for us and for the world, it’s not.
How to push back against the celebrity-ization of data science

The truth is somewhat harder to understand, a lot less palatable, and much more important than Silver’s gloss. But when independent people like myself step up to denounce a given statement or theory, it’s not clear to the public who is the expert and who isn’t. From this vantage point, the happier, shorter message will win every time. This raises a larger question: how can the public possibly sort through all the noise that celebrity-minded data people like Nate Silver hand to them on a silver platter? Whose job is it to push back against rubbish disguised as authoritative scientific theory? It’s not a new question, since PR men disguising themselves as scientists have been around for decades. But I’d argue it’s a question that is increasingly urgent considering how much of our lives is becoming modeled. It would be great if substantive data scientists had a way of getting together to defend the subject against sensationalist celebrity-fueled noise. One hope I nurture is that, with the opening of the various data science institutes, such as the one at Columbia which was announced a few months ago, there will be a way to form exactly such a committee. Can we get a little peer review here, people?

Conclusion

There’s an easy test here to determine whether to be worried. If you see someone using a model to make predictions that directly benefit them or lose them money – like a day trader, or a chess player, or someone who literally places a bet on an outcome (unless they place another hidden bet on the opposite outcome) – then you can be sure they are optimizing their model for accuracy as best they can. And in this case Silver’s advice on how to avoid one’s own biases is excellent and useful. But if you are witnessing someone creating a model which predicts outcomes that are irrelevant to their immediate bottom line, then you might want to look into the model yourself.
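That easy test can be phrased, half tongue-in-cheek, as a tiny decision rule. This is my own sketch of the logic, not anything from Silver’s book:

```python
def should_i_worry(bet_rides_on_accuracy, has_hidden_opposite_bet=False):
    """Decision rule: a model is probably optimized for accuracy only
    when the modeler's own money rides on being right, and they haven't
    hedged that bet away with a hidden bet on the opposite outcome."""
    if bet_rides_on_accuracy and not has_hidden_opposite_bet:
        return "probably optimized for accuracy"
    return "look into the model yourself"

print(should_i_worry(True))        # day trader, chess player
print(should_i_worry(True, True))  # hedged bet: the incentive vanishes
print(should_i_worry(False))       # predictions irrelevant to their bottom line
```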
Categories: finance, modeling, rant, statistics

## Empathy, murder, and the NRA

I’ve been having lots of dinnertime discussions with my kids about the following three news stories:

1. the guy who was pushed onto the subway tracks and nobody helped him
2. the Sandy Hook murders
3. the Syrian uprising

When my son asked why people care so much about the kids murdered in Connecticut but not nearly as much about a random day when just as many rebels are murdered by their government in Syria, I talked about how, for whatever reason, people have more empathy for individuals closer to them, and Connecticut is closer than Syria. It doesn’t feel good but it kind of makes sense. But of course this doesn’t apply to the guy who was pushed off the subway platform. And, speaking of the subway incident, let me be the person who stands up and says that yes, if I’d been there I would have tried to help that man get out of the subway tracks. There were 22 seconds to help him after the crazy guy fled. For me the ethical obligations are obvious, and the empathy I feel for strangers in danger is visceral. I’ve been in situations not entirely unlike this in the subway, and I saw firsthand how other people ran away and started talking about themselves rather than trying to help someone suffering, and it amazes and disgusts me. It makes me wonder how we develop what I’ll term “working empathy”, to distinguish someone who actually tries to help in real time and in a meaningful way when someone else is in pain from someone who is gawking at arm’s length. This New York Times article touches on it but doesn’t go very deep; it basically suggests we model it for children and talk about how other people feel. It also talks about how monetary rewards stifle empathy (which I knew already from working in finance). I’m not wondering this abstractly or philosophically.
I’m wondering it because, if I had a good theory about creating and spreading working empathy, I’d try to join the NRA and apply the technique to see if it works on tough cases. As in, getting them to actually try to prevent unreasonable guns in unreasonable places, not just issue press releases.

Categories: news, rant

## If Barofsky heads the SEC I’ll work for it

Neil Barofsky visited my Occupy group, Alternative Banking, this past Sunday. He was awesome. We discussed the credit crisis, the recent outrageous HSBC ruling, which effectively priced money laundering for terrorists and drug lords at below cost for the banks, and the hopelessness, or on a good day the hope, of having a financial and regulatory system that will eventually work. We discussed the incentives in the HAMP set-up, which explain why very few homeowners have actually received lasting relief from unaffordable mortgages. We discussed the incentives for fraud and other criminal behavior in the absence of real punishment, that too much money is being spent pursuing insider trading because that’s what people understand how to do, and we discussed the reluctance of the regulators to litigate tough cases. We talked about how change has to come from the top, because all of these organizations are super hierarchical and require political will to get things done.

In the past year I was offered a job at the SEC, working as a quant in the enforcement division. Although I want to help sort out this mess, I haven’t felt that this job, which is relatively junior, would allow me to do that meaningfully. But I came away from the meeting with Barofsky with this feeling: if we had someone in charge at the SEC like him, who could speak truth to power and who is smart enough to see through economic jargon and bullshit well enough to understand the incentives for fraud and lying, then I’d work there in a heartbeat. Let’s just hope it doesn’t take another world-wide financial crisis before we get someone like that.
Categories: #OWS, finance

## Making math beautiful with XyJax

December 17, 2012 1 comment

My husband A. Johan de Jong has an open source algebraic geometry project called the stacks project. It’s hosted at Columbia, just like his blog, which is aptly named the stacks project blog. The stacks project is awesome: it explains the theory of stacks thoroughly, assuming only that you have a basic knowledge of algebra and a shitload of time to read. It’s about three thousand pages (update: it’s exactly 3,452 pages, give or take), and it has a bunch of contributors besides Johan. I’m on the list most likely because I helped him develop the tag system, which allows permanent references to theorems and lemmas even within an evolving latex manuscript. He even has pictures of tags, and hands out t-shirts with pictures of tags when people find mistakes in the stacks project.

Speaking of latex, that’s what I wanted to mention today. Recently a guy named Pieter Belmans has been helping Johan out with development for the site: spiffing it up and making it look more professional. The most recent thing he did was to render the latex into human-readable form using the XyJax package, which is an “almost xy-pic compatible package for MathJax“. I think they are understating the case; it looks great to me:

Categories: math, open source tools

## Silicon Valley: VC versus startup culture

This is a guest post by David Carlton, who first met Cathy at the Hampshire College Summer Studies in Mathematics when they were high school students. He was trained as a mathematician, but left academia in 2003 and has been working as a programmer and manager in the San Francisco Bay Area since then. This is crossposted from his blog malvasia bianca.

One thing I’ve been wondering recently: to what extent do I like the influence of Silicon Valley venture capital firms on the local startup culture?
There are certain ways in which their influence is good, no question: it’s great that there’s money available for people to try new things, it’s great that it means that there are exciting small companies around, and I’m fairly sure that VCs have valuable specialized knowledge that I don’t have and could benefit from. So that’s all to the good. However, it is not the case that VCs’ interests and my interests are aligned. Don’t get me wrong: if I’m working at a VC-funded company, then those VCs and I both want the company to succeed, and that’s great. But beyond that, our interests diverge significantly. Their goal is to make money on a five-yearish horizon through a portfolio approach, starting from a significant pool of cash. The portfolio is a particularly important factor here: no matter what, most startups are going to fail; so, rather than trying to get as many as possible to be a moderate success, it’s perfectly reasonable to do what you can to get a few companies in your portfolio to be a major success. And, while I’d be perfectly happy to be working at a company that’s a major success, it’s much less clear to me that I want to do that at the cost of reducing the chances that the company is a moderate success. Because while the company crashing and burning is potentially a problem at a financial level, it’s also potentially much more of a problem at a personal level. For example, I believe in the concept of a “sustainable pace”: that on average, for most people, working too hard eventually produces less output. But if the unsustainable pace on average masks nine disasters and one remarkable success, then that may be just fine for a portfolio approach, despite what it does to the people who go through the nine disasters. (This is probably where some of the VC-funded startup youth fetishism comes in, too.)
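The portfolio argument can be made concrete with toy numbers (mine, purely illustrative, not the author’s): a strategy that trades many moderate successes for a few huge exits can have the higher expected payoff for the fund, even though far more of its companies, and their employees, crash and burn along the way.

```python
def expected_payoff(p_big, big_payoff, p_moderate, moderate_payoff):
    """Expected value per startup; the remaining probability mass
    is assumed to be a total loss (payoff zero)."""
    return p_big * big_payoff + p_moderate * moderate_payoff

# Hypothetical payoffs in $M. "Sustainable pace" strategy: big wins
# are rare but moderate successes are common.
sustainable = expected_payoff(0.01, 1000, 0.30, 20)

# "Swing for the fences": five times the chance of a huge exit, but
# almost no moderate outcomes -- most companies fail outright.
aggressive = expected_payoff(0.05, 1000, 0.05, 20)

# The fund prefers the aggressive strategy (higher expected value),
# even though 90% of its companies fail versus 69%.
print(sustainable, aggressive)
```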
Time horizons play into that issue as well: if you can keep up an unsustainable pace long enough to look good at a payoff threshold, then that could be good enough. (Possibly burning out many people along the way while hiring enough new faces to replace them and keep the company looking healthy from the outside.) I think I saw a version of this at Playdom: the company spent the year before it got bought going on a hiring spree, buying companies that, even at the time, seemed like they made no sense. (Don’t get me wrong, some of the purchases made a lot of sense, but there were certainly many specific purchases that I raised my eyebrows at.) As far as I can tell, this was a ploy to make Playdom look good to potential purchasers by increasing our headcount, our number of games and players, and our geographic reach; but we shut down a bunch of those games soon after Disney bought us, and I didn’t see anything concrete come out of many of those studios.

The amount of money VCs are investing also plays into this. Even when funding small companies, they don’t want those companies to stay small: they want those companies to grow and grow, to justify larger and larger investments and still larger payouts. So, if you want to work at a company that is small and focused, VC-funded companies probably aren’t the best place to go (though there are exceptions: if your small and focused company is producing something that appeals to tens or hundreds of millions of people, then you can be the next Instagram). That’s how VCs are looking for aspects of companies that I’m not; but I’m also looking for aspects of companies that VCs don’t have as strong a reason to be attracted to. I’m always trying to learn something, and typically have specific goals along those lines that I’m looking for at companies; VCs have no reason to care about my personal development.
More broadly, I’ve been participating in industry discussions about how to develop software, and trying to figure out which of those ideas seem to work well for me; I’m sure noise about some of that filters up to the VC level, but I’m also sure that most VCs don’t have any real idea what the word ‘agile’ means. Not that they should; this is a difference, not a judgment. I also want to work at a company that I feel is doing the right thing: e.g. on a basic level it should treat people of different genders, ethnicities, ages, class backgrounds, sexualities, relationship statuses, etc. fairly. Silicon Valley actually strikes me as astonishingly open to different nationalities (most of the founders of most of the companies that I’ve worked at haven’t been American, along with a noticeable fraction of the employees); on many of the other dimensions, though, Silicon Valley isn’t nearly as open. Here’s a nice takedown of some of the bullshit around the idea of a “meritocracy”, and VC firms themselves apparently don’t do so well in this regard. I hear rumors about VC “pattern matching”; if this means that VCs are happy to insert ignorant sexist assholes into the management ranks of their portfolio companies because those execs fit some sort of pattern that the VCs have seen, that is not good. What’s scary, too, is how hard it can be to tell this sort of thing in advance: when joining a company, you never know how it is going to change over the next months or years. For example, when I did my last job search, I talked to a few Facebook game companies; some of them were steeped in testosterone, but one, Casual Collective, seemed like a pleasant enough place. They’d produced one game I respected, the woman I interviewed with seemed sharp, and their name seemed to signal that they weren’t going to go too far down the “core gamer” path.
I didn’t interview further with them because of the technologies they were using and because of their location, but if that job search had gone slightly differently I can easily imagine myself having been interested in them. A year later, they’d changed their name to Kixeye, turned themselves into a maker of “hardcore” games, and released this recruiting video that positioned them squarely within the brogrammer manchild tradition: And I saw more news coverage (and for that matter people in person) speaking favorably of that video than not: it’s not just one company, that’s a lamentably strong aspect of the culture around here. There’s way too much adolescent male status jockeying going on, way too little quiet listening; and I will be perfectly happy never to see another foam bat or nerf gun in my life. Though Kixeye does seem to be particularly bad: rather than quietly shunning people who don’t fit into that culture, they seem to have had an actively discriminatory culture. I’d like to think that I would have picked up on that culture if I’d interviewed in person, but I’m not at all confident that that’s the case; and, for that matter, for all I know the culture of the company really may have changed significantly since I interviewed. Which could be fine for somebody on the outside who is trying to get the company to pivot in search of greater profits; not necessarily so great for people in the middle of it. I dunno; I’ve been in a pretty negative mood recently. Because the truth is, I could find just as many bad things to say about lots of other corporate subcultures around here. I certainly wouldn’t actively want to work in large companies, either, though I’m getting a more nuanced view of their strengths and weaknesses. And I’ve worked with great people at a lot of startups around here: great technically, but also great human beings, people that it’s been an honor to work with. 
I also certainly have nothing against making money, and I think that it’s great that money is available for people with ideas. I just wish I had better leads on companies that were concentrating a bit more on their culture and their effects, companies that want to build the right things in the right ways. I’m sure there are a fair number if I knew where to look; I’m just not plugged into networks that enable me to see them. And, seriously: the sexism in the valley has to stop.

Categories: guest post

## Aunt Pythia’s advice

Aunt Pythia has two wee bits of bad news. First, nobody helped out NYC and Wondering from last week, who was looking to get educated via internships past college age. Maybe if the question were worded differently it would have gotten more responses. In the meantime, NYC and Wondering, I’d suggest you look into MOOCs on Coursera, Udacity, and the like. There are Meetup groups you can join once you’re in a course like this one. Second, Aunt Pythia has been informed that some people are getting error messages when they try to submit questions. That’s no good! If that’s happening to you, please comment below using the phrase “question for Aunt Pythia” and it will automatically go into my mailbox instead of getting posted. On to this week’s questions:

——

Dear Aunt Pythia,

I am suffering from severe Facebook phobia. Is the entire phenomenon as repulsive as I think it is?

Curmudgeon Lee Luddite

Dear Curmudgeon LL,

Here’s the thing. I am totally grossed out by Facebook on so many levels. As a concerned data scientist, the shit they pull with respect to personal information, letting other people post private information about you, and selling your data makes me really uneasy. Read this recent article about the Facebook Doctrine (“What’s good for Facebook is good for you”) if you want to hear more.
Mind you, that Doctrine seems to be pretty clear-cut if you modify it just a bit: “What’s good for Facebook is good for the stock price of Facebook”: the market loves the trend of information selling because it’s magnificently profitable. Another thing that pisses me off, which I learned about in the student presentations last week at the Columbia Data Science class I was blogging: people are posting various legalese-sounding letters to Facebook on their timelines which tell Facebook to keep its hands off their personal data. Guess what, kids, it’s too late, you signed away your rights when you entered, and such crap only serves as yet another illusion of control (along with the Facebook privacy settings). Having said all that, I use Facebook myself – but of course I never post anything remotely private on it. And for that matter I also use Google+, and I’m ready and willing to use another platform when one comes along that’s less creepy. Curmudgeon, to answer your question: yes, it’s just as repulsive as you think. I fully defend your disgust, and if anyone questions it just send that person to me, I’ll set them straight.

Best,
Aunt Pythia

——

Dear Aunt Pythia,

Should I go to the Joint Math Meetings if I’m thinking of leaving academia realllly soon?

Katydid

Dear Katydid,

It depends. Is it tough for you to go? Would you miss job opportunities by doing so? I’m assuming you are planning to leave academia but you haven’t actually gotten another job. If the answer is that it’s relatively easy to go and that you don’t have any other plans that weekend, then by all means you should go. And you should make a plan beforehand for what information you can gather about jobs that math people do. For example, make sure you have a good idea of what kind of jobs in academia there really are, by interviewing a bunch of people about what they do on a daily basis. But keep in mind that most of them will be drunk, because it’s the Joint Math Meetings and that’s kind of the point.
And also keep in mind that you’re hearing much more about academia than about industry, since you’re at this meeting. But that doesn’t mean you’ll be stuck only hearing about academia! Because I’ll be there, talking about the world outside academic math, and so will a few other people. I am particularly psyched that I’ll be speaking on the first day, so I can meet people after my talk and hang out with them for the next few days, getting drunk and playing bridge. It’s very serious business, of course. See you soon I hope!

Aunt Pythia

——

Dear Aunt Pythia,

Is health insurance a sound financial investment?

Uninsured

Dear Uninsured,

Great question. It’s not really an investment, and it’s only a good idea once in a while; the problem is knowing in advance when it’s a good idea. I say it’s not an investment because usually with an investment you can expect to make money, whereas insurance is never like that. Once you pay insurance premiums, that money is gone. Insurance can, however, be seen as a financial bet: you’re betting that losing a predictable and smallish amount of money is less painful than the overall risk of losing an unpredictable and large amount of money if you get horribly sick and need major treatment. There are plenty of problems with this explanation though, including:

• it’s not so smallish if you don’t have work or if you have crappy work through a place like Walmart,
• you personally might be very healthy and your risk of getting super sick might not be high, say if you’re 24 and fit; this means that your money may be better spent buying high-quality food than paying for health insurance, and
• the large amount of money you may get billed if you do end up horribly sick can be discharged through bankruptcy, and in fact most bankruptcy proceedings happen because of the medical debt of uninsured people. Keep in mind you will lose your house (if you have one) if you go this route, so only consider it if you’re willing to take that risk.
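The “financial bet” framing can be put into toy numbers (hypothetical ones, not Aunt Pythia’s): compare a certain premium against the expected annual cost of going uninsured, and notice that the premium can lose money on average while still buying protection against the catastrophic tail.

```python
def expected_cost_uninsured(p_catastrophe, catastrophe_cost, routine_cost):
    """Expected annual cost with no insurance: routine care plus a
    small chance of a catastrophic bill."""
    return routine_cost + p_catastrophe * catastrophe_cost

# Hypothetical numbers for a healthy 24-year-old:
premium = 3000.0                                         # certain annual cost, insured
uninsured = expected_cost_uninsured(0.01, 100_000, 500)  # expected annual cost, uninsured

print(premium, uninsured)
# The premium loses money on average here. The point of insurance is
# the tail: the 1% chance of a $100,000 bill is what you're paying to avoid.
```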
In the end it depends on your situation whether it’s worth it to buy health insurance. I hope that helps!

Aunt Pythia

——

Aunt Pythia,

Why do some foods burn when you stir them? It doesn’t make sense that my rice or pasta should burn when there is still a lot of water in the pot just because I stirred it.

Physics-Inclined Wannabe Chef

This is a great question! Is it even true? Does it also happen with orzo? People, get out your pots and do some mythbuster-type experiments! And then comment below with your ideas and results. I’m counting on you nerdy folks to get to the bottom of this, so to speak.

——

In the meantime, if you have a moral, personal, or emotional dilemma or somesuch, please share avec moi below on my gorgeous new form:

Categories: Aunt Pythia

## MOOCs and calculus

I’ve really enjoyed the discussion on my post from yesterday about MOOCs and how I predict they are going to affect the education world. I could be wrong, of course, but I think this stuff is super interesting to think about. One thing I’ve thought about since writing the post yesterday, in terms of math departments, is that I used to urge people involved in math departments to be attentive to their calculus teaching. The threat, as I saw it then, was this: if math departments are passive and boring and non-reactive about how they teach calculus, then other departments which need calculus for their majors will pick up the slack and we’ll see calculus taught in economics, physics, and engineering departments. The reason math departments should care about this is that calculus is the bread and butter of math departments – math departments in other countries that have lost calculus to other departments are very small. If you only need to teach math majors, it doesn’t require that many people to do that. But now I don’t even bother saying this, because the threat from MOOCs is much bigger and is going to have a more profound effect, and moreover there’s nothing math departments can do to stop it.
Well, they can bury their heads in the sand, but I don’t recommend it. Once there’s a really good calculus sequence out there, why would departments continue to teach the old-fashioned way? Once there’s a fantastic calculus-for-physics MOOC, or calculus-for-economics MOOC, available, one would hope that math departments would admit they can’t do better. Instead of the old-fashioned calculus approach, they’d figure out a way to incorporate the MOOC and supplement it by forming study groups and leading sections on the material. This would require a totally different set-up, and probably fewer mathematicians.

Another thing. I think I’ve identified a few separate issues in the discussion that it makes sense to highlight. There are four things (at least) that are all rolled together in our current college and university experience:

• learning itself,
• credentialing,
• research, and
• socializing

So, MOOCs directly address learning, but clearly want to control something about credentialing too, which I think won’t necessarily work. They also affect research, because the role of the professor as learning instructor will change. They give us nothing in terms of socializing. But as commenters have pointed out, socializing students is a huge part of the college experience, and may be even more important than credentialing. Another way of saying that: people look at your resume not so much to know what you know but to know how you’ve been socialized. It makes me wonder how we will address the “socializing” part of education in the future. And it also makes me wonder where research will be in 100 years.

Categories: math education, musing

## MOOC is here to stay, professors will have to find another job

I find myself every other day in a conversation with people about the massive online open course (MOOC) movement. People often want to complain about the quality of this education substitute.
They say that students won’t get the one-on-one interaction between professor and student that is required to really learn. They complain that we won’t know if someone really knows something if they only took a MOOC or two. First of all, this isn’t going away, nor should it: it’s many people’s only opportunity to learn this stuff. It’s not like MIT has plans to open 4,000 campuses across the world. It’s really awesome that rural villagers (with internet access) all over the world can now take MIT classes anyway through edX. Second, if we’re going to put this new kind of education under the microscope, let’s put the current system under the microscope too. Many of the people fretting about the quality of MOOC education are themselves products of super elite universities, and probably don’t know what the average student’s experience actually is. It turns out not everyone gets a whole lot of attention from their professors. Even at elite institutions, there are plenty of masters programs which are treated as money machines for the university and where the quality and attention of the teaching is a secondary concern. If certain students decide to forgo the thousands of dollars and learn the stuff just as well online, then that would be a good thing (for them at least).

Some things I think are inevitable:

1. Educational institutions will increasingly need to show they add value beyond free MOOC experiences. This will be an enormous market force for all but the most elite universities.
2. Instead of seeing where you went to school, potential employers will directly test candidates’ knowledge. This will mean weird things, like never actually having to learn a foreign language or study Shakespeare to get a job, but it will be good for the democratization of education in general.
3. Professors will become increasingly scarce as the role of the professor is diminished.
4. One-on-one time with masters of a subject will become increasingly rare and expensive.
Only truly elite students will have the mythological education experience.

Categories: musing, open source tools

## When accurate modeling is not good

I liked Andrew Gelman's recent post (hat tip Suresh Naidu) about predatory modeling going on in casinos, specifically Caesars in Iowa. The title of the post is already good, and is a riff on something Caesars Entertainment CEO Gary Loveman said:

There are four ways to get fired from Caesars: (1) theft, (2) sexual harassment, (3) running an experiment without a control group, and (4) keeping a gambling addict away from the casino.

He tells a story about a woman who loses lots of money at the casino, but who moreover gets manipulated into coming back and losing more and more, based on the data collected at Caesars and the models built by the quants there. You should read the whole thing, which as usual with Gelman is quirky and fun. His main point comes here (emphasis mine):

The Caesars case (I keep wanting to write Caesar's but apparently no, it's Caesars, just like Starbucks) interested me because of the role of statistics. I'm used to thinking of probability and statistics as a positive social force (helping medical research or, in earlier days, helping the allies in World War 2), or mildly positive (for example, helping design measures to better evaluate employees), or maybe neutral (exotic financial instruments which serve no redeeming social value but presumably don't do much harm) or moderately negative ("Moneyball"-style strategies such as going for slow sluggers who foul off endless pitches and walk a lot; it may win games but it makes for boring baseball). And then there are statisticians who do fishy analyses, for example trying to hide that some drug causes damage so it can stay on the market. But that's a bit different, because such a statistical analysis, no matter how crafty, is inherently a bad analysis, trying to obscure rather than learn.
The Caesars case seems different, in that there is a very direct tradeoff: the better the statistics and the better the science, the worse the human outcomes. These guys are directly optimizing their ability to ruin some people's lives.

It's not the only such case, but cases are not usually this clear-cut. It's time we started scoring models on various dimensions. Accuracy is one; predatoriness is another. They're distinct.

Categories: data science

## Fighting the information war (but only on behalf of rich people)

There's an information war out there which we have to be prepared for. Actually, there are a few of them. And according to this New York Times piece, there's now a way to fight against the machine, for a fee.

Companies like Reputation.com will try to scour the web and remove data you don't want floating around about you, and when that's impossible they'll flood the web with other, good data to balance out the bad stuff. At least that's what I'm assuming they do, because of course they don't really explain their techniques.

And that's the other information war, where they scare rich people with technical-sounding jargon and tell them unlikely stories to get their money.

I'm not claiming predatory information-gatherers aren't out there. But this is the wrong way to deal with it. First of all, most of the data out there systematically being used for nefarious purposes, at least in this country, is used against the poor, denying them reasonable terms on their loans and other services. So the idea that people will need to pay for a service to protect their information is weird. It's like saying the air quality is bad for poor people, so let's charge rich people for better air.

So what kind of help is Reputation.com actually providing? Here's my best guess.
First, it gets people overly scared, in the spirit of this recent BusinessWeek article, which explains that cosmetic companies have gone to China and started a campaign to convince Chinese women they are too hairy so they'll start buying products to remove hair. From that article, which is guaranteed to make you understand something about American beauty culture too:

Despite such plays on women's fears of embarrassment, Reckitt Benckiser's Sehgal says that Chinese women are too "independent-minded" to be coaxed into using a product they don't really need. Others aren't so sure. Veet's Chinese marketing "plays a role that is very similar to that of the apple in the Bible," says Benjamin Voyer, a social psychologist and assistant professor of marketing at ESCP Europe business school. "It creates an awareness, which subsequently creates a feeling of shame and need."

Second, Reputation.com gets their clients off nuisance lists, like a modern version of the do-not-call program (which, importantly, is run by the government). This is probably equivalent to setting up a bunch of email filters and clearing their cookies every now and then, but they can't tell their clients that.

Finally, for those rich people who are also super vain, they will try to do things like replace the unflattering photos that come up in a Google image search with better-looking ones the client chooses. Things like that, image issues.

I just want to point out one more salient fact about Reputation.com. It's not just in their interest to scare-monger; it's actually in their interest to make the data warehouses more complete (they have themselves amassed an enormous database on people), and to have people who don't pay for their services actually need their services more. They could well create a problem to produce a market for their product.

What drives me nuts about this is how elitist it is.
There are very real problems in the information-gathering space, and we need to address them, but one of the most important issues is that the very people who can't afford to pay for their reputation to be kept clean are the real victims of the system. There is literally nobody who will make good money off of actually solving this problem: I challenge any libertarian to explain how the free market will address it. It has to be addressed through policy, and specifically through legislating what can and cannot be done with personal data.

Probably the worst part is that, through using the services of companies like Reputation.com, and because of the personalized nature of internet usage, the very legislators who need to act on behalf of their most vulnerable citizens won't even see the problem, since they don't share it.

Categories: data science, rant

## Columbia Data Science course, week 14: Presentations

In the final week of Rachel Schutt's Columbia Data Science course we heard from two groups of students as well as from Rachel herself.

Data Science; class consciousness

The first team of presenters consisted of Yegor, Eurry, and Adam. Many others, whose names I didn't write down, contributed to the research, visualization, and writing.

First they showed us a very cool graphic explaining how self-reported skills vary by discipline. The data they used came from the class itself, which did this exercise on the first day: the star in the middle is the average for the whole class, and each star along the side corresponds to the average (self-reported) skills of people within a specific discipline. The dotted lines on the outside stars show the "average" star, so it's easier to see how things vary per discipline compared to the average.

Surprises: business people seem to think they're really great at everything except communication. Journalists are better at data wrangling than engineers. We will get back to the accuracy of self-reported skills later.
We were asked: do you see your reflection in your star? Also, take a look at the different stars. How would you use them to build a data science team? Would you want people who are good at different skills? Is it enough to have all the skills covered? Are there complementary skills? Are the skills additive, or do you need overlapping skills among team members?

Thought Experiment

If all data which had ever been collected were freely available to everyone, would we be better off? Some ideas were offered:

• all nude photos are included. [Mathbabe interjects: it's possible to not let people take nude pics of you. Just sayin'.]
• so are passwords, credit scores, etc.
• how do we make secure transactions between a person and her bank, considering this?
• what does it mean to be "freely available" anyway?

The data of power; the power of data

You see a lot of people posting crap like this on Facebook. But here's the thing: the Berner Convention doesn't exist. People are posting this to their walls because they care about their privacy. People think they can exercise control over their data, but they can't. Stuff like this gives one a false sense of security. In Europe the privacy laws are stricter, and you can request your data from Irish Facebook and they're supposed to provide it, but it's still not easy to successfully do.

And it's not just data that's being collected about you – it's data you're collecting. As scientists we have to be careful about what we create, and take responsibility for our creations. As François Rabelais said,

Wisdom entereth not into a malicious mind, and science without conscience is but the ruin of the soul.

Or as Emily Bell from Columbia said,

Every algorithm is editorial.

We can't be evil during the day and take it back at hackathons at night. Just as journalists need to be aware that the way they report stories has consequences, so do data scientists. As a data scientist, one has impact on people's lives and how they think.
Here are some takeaways from the course:

• We've gained significant powers in this course.
• In the future we may have the opportunity to do more.
• With data power comes data responsibility.

Who does data science empower?

The second presentation was given by Jed and Mike. Again, they had a bunch of people on their team helping out.

Thought experiment

Let's start with a quote:

"Anything which uses science as part of its name isn't a science: political science, creation science, computer science." – Hal Abelson, MIT CS prof

Keeping this in mind, if you could re-label data science, would you? What would you call it? Some comments from the audience:

• Let's call it "modellurgy," the craft of beating mathematical models into shape instead of metal.
• Let's call it "statistics."

Does it really matter what data science is? What should it end up being? Chris Wiggins from Columbia contends there are two main views of what data science should end up being.

The first stems from John Tukey, inventor of the fast Fourier transform and the box plot, and father of exploratory data analysis. Tukey advocated for a style of research he called "data analysis," emphasizing the primacy of data and therefore computation, which he saw as part of statistics. His descriptions of data analysis are very similar to what people call data science today.

The other perspective comes from Jim Gray, a computer scientist from Microsoft. He saw the scientific ideals of the Enlightenment as expanding and evolving: we've gone from the theories of Darwin and Newton to the experimental and computational approaches of Turing. Now we have a new, data-driven paradigm. It's actually the fourth paradigm of all the sciences, the first three being experimental, theoretical, and computational. See more about this here.

Wait, can data science be both? Note it's difficult to place Computer Science and Data Science on this line. Statistics is a tool that everyone uses.
Data science also could be seen that way, as a tool rather than a science.

Who does data science?

Here's a graphic showing the make-up of Kaggle competitors. Teams of students collaborated to collect, wrangle, analyze, and visualize this data. The size of the blocks corresponds to how many people in active competitions have an education background in a given field; we see that almost a quarter of competitors are computer scientists. The shading corresponds to how often they compete; we see that the business and finance people do more competitions on average than the computer science people. Consider this: the only people doing math competitions are math people. If you think about it, it's kind of amazing how many different backgrounds are represented above. We got some cool graphics created by the students who collaborated to get the data, process it, and visualize it.

Which universities offer courses on Data Science?

There will be 26 universities in total by 2013 that offer data science courses. The balls are centered at the center of gravity of a given state, and the balls are bigger if there are more courses in that state.

Where are data science jobs available?

Observations:

• We see more professional schools offering data science courses on the west coast.
• It would also be interesting to see this corrected for population size.
• Only two states had no jobs.
• Massachusetts is #1 per capita, then Maryland.

Crossroads

McKinsey says there will be hundreds of thousands of data science jobs in the next few years. There's a massive demand in any case. Some of us will be part of that. It's up to us to make sure what we're doing is really data science, rather than validating previously held beliefs. We need to advance human knowledge if we want to take the word "scientist" seriously.

How did this class empower you? You are among the first people to take a data science class. There's something powerful there.

Thank you Rachel!
Last Day of Columbia Data Science Class: What just happened? From Rachel's perspective

Recall the stated goals of this class were:

• learn about what it's like to be a data scientist
• be able to do some of what a data scientist does

Hey, we did this! Think of all the guest lectures; they taught you a lot about what it's like to be a data scientist, which was goal 1. Here's what I wanted you guys to learn before the class started, based on what a data scientist does, and you've learned a lot of that, which was goal 2. Mission accomplished!

Mission accomplished?

Thought experiment that I gave to myself last Spring: how would you design a data science class? Comments I made to myself:

• It's not a well-defined body of knowledge or subject; there's no textbook!
• It's popularized and celebrated in the press and media, but there's no "authority" to push back.
• I'm intellectually disturbed by the idea of teaching a course when the body of knowledge is ill-defined.
• I didn't know who would show up, or what their backgrounds and motivations would be.
• Could it become redundant with a machine learning class?

My process

I asked questions of myself and of other people. I gathered information, and endured existential angst about data science not being a "real thing." I needed to give it structure. Then I started to think about it this way: while I recognize that data science has the potential to be a deep research area, it's not there yet, so in order to actually design a class, let's take a pragmatic approach: recognize that data science exists. After all, there are jobs out there. I want to help students be qualified for them, so let me teach them what it takes to get those jobs. That's how I decided to approach it.

In other words, from this perspective, data science is what data scientists do. So it's back to the list of what data scientists do. I needed to find structure on top of that, and the structure I used as a starting point was the data scientist profiles.
Data scientist profiles

This was a way to think about your strengths and weaknesses, as well as a link between speakers. Note it's easy to focus on "technical skills," but the profile can be problematic in being too skills-based, as well as in having no scale and no notion of expertise. On the other hand, it's good in that it allows for and captures variability among data scientists. I assigned weekly guest speakers topics related to their strengths. We held lectures, labs, and (optional) problem sessions.

From this you got mad skillz:

• programming in R
• some Python
• some best practices about coding

From the perspective of machine learning:

• you know a bunch of algorithms like linear regression, logistic regression, k-nearest neighbors, k-means, naive Bayes, and random forests
• you know what they are, what they're used for, and how to implement them
• you learned machine learning concepts like training sets, test sets, over-fitting, the bias-variance tradeoff, evaluation metrics, feature selection, and supervised vs. unsupervised learning
• you learned about recommendation systems
• you've entered a Kaggle competition

Importantly, you now know that if there is an algorithm or model that you don't know, you can (and will) look it up and figure it out. I'm pretty sure you've all improved relative to how you started.

You've learned some data viz by doing the Flowing Data tutorials. You've learned statistical inference, because we discussed:

• observational studies,
• causal inference, and
• experimental design.

We also covered some maximum likelihood topics, but I'd urge you to take more stats classes.
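To make one item on that list concrete (my addition, not something from the course): here's a minimal from-scratch sketch of k-nearest neighbors, the kind of algorithm the class implemented. It uses only the standard library, and the toy dataset is made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train, labels, point, k=3):
    """Classify `point` by majority vote among its k nearest training points."""
    by_distance = sorted(zip(train, labels), key=lambda xy: math.dist(xy[0], point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# A made-up toy dataset: two little clusters, labeled "a" and "b".
train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train, labels, (0.5, 0.5)))  # lands in the "a" cluster
```

The point of the exercise in class wasn't the ten lines of code; it was knowing how to evaluate something like this with held-out test data rather than trusting it on the training set.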
In the realm of data engineering:

• we showed you MapReduce and Hadoop
• we worked with 30 separate shards
• we used an API to get data
• we spent time cleaning data
• we processed different kinds of data

As for communication:

• you wrote your thoughts in response to blog posts
• you observed how different data scientists communicate and present themselves, with different styles
• your final project required communicating with each other

As for domain knowledge:

• lots of examples were shown to you: social networks, advertising, finance, pharma, recommender systems, the Dallas art museum

I heard people have been asking the following: why didn't we see more data science coming from non-profits, governments, and universities? Note that the term "data science" was born in for-profits. But the truth is I'd also like to see more of that. It's up to you guys to go get that done!

How do I measure the impact of this class I've created? Is it possible to incubate awesome data science teams in the classroom? I might have taken you from point A to point B, but you might have gotten there anyway without me. There's no counterfactual! Can we set this up as a data science problem? Can we use a causal modeling approach? This would require finding students who were more or less like you but didn't take this class, and using propensity score matching. It's not a very well-defined experiment. But the goal is important: in industry they say you can't learn data science in a university, that it has to be learned on the job. But maybe that's wrong, and maybe this class has proved that.

What has been the impact on you, or on the outside world? I feel we have been contributing to the broader discourse. Does it matter if there was impact? And does it matter whether it can be measured?

Let me switch gears. What is data science again?
Data science could be defined as:

• a set of best practices used in tech companies, which is how I chose to design the course
• a space of problems that could be solved with data
• a science of data, where you can think of the data itself as the units

The bottom two have the potential to be the basis of a rich and deep research discipline, but in many cases the way the term is currently used is:

• pure hype

But it doesn't matter how we define it so much as what I want for you:

• to be problem solvers
• to be question askers
• to think about your process
• to use data responsibly and make the world better, not worse.

More on being problem solvers: cultivate certain habits of mind

Here's a possible list of things to strive for, taken from here. Here's the thing: tons of people can implement k-nearest neighbors, and many do it badly. What matters is that you cultivate the above habits and remain open to continuous learning. In traditional educational settings, we focus on answers. But what we probably should focus on is how a student behaves when they don't know the answer. We need to have qualities that help us find the answer.

Thought experiment

How would you design a data science class around habits of mind rather than technical skills? How would you quantify it? How would you evaluate it? What would students be able to write on their resumes?

Comments from the students:

• You'd need to keep making people do stuff they don't know how to do, while keeping them excited about it.
• Have people do stuff in their own domains, so we keep up wonderment and awe.
• You'd use case studies across industries to see how things work in different contexts.

More on being question-askers

Some suggestions on asking questions of others:

• start with the assumption that you're smart
• don't assume the person you're talking to knows more or less than you do; you're not trying to prove anything
• be curious like a child, not worried about appearing stupid
• ask for clarification around notation or terminology
• ask for clarification around process: where did this data come from? how will it be used? why is this the right data to use? who is going to do what? how will we work together?

Some questions to ask yourself:

• does it have to be this way?
• what is the problem?
• how can I measure this?
• what is the appropriate algorithm?
• how will I evaluate this?
• do I have the skills to do this?
• how can I learn to do this?
• who can I work with? who can I ask?
• how will it impact the real world?

Data Science Processes

In addition to being problem-solvers and question-askers, I mentioned that I want you to think about process. Here are a couple of processes we discussed in this course:

(1) Real World –> Generates Data –> Collect Data –> Clean, Munge (90% of your time) –> Exploratory Data Analysis –> Feature Selection –> Build Model, Build Algorithm, Visualize –> Evaluate –> Iterate –> Impact Real World

(2) Asking questions of yourselves and others –> Identifying problems that need to be solved –> Gathering information, Measuring –> Learning to find structure in unstructured situations –> Framing Problem –> Creating Solutions –> Evaluating

Thought experiment

Come up with a business that improves the world, makes money, and uses data. Comments from the students:

• autonomous self-driving cars you order with a smart phone
• find all the info on people and then show them how to make it private
• a social network with no logs and no data retention

10 Important Data Science Ideas

Of all the blog posts I wrote this semester, here's one I think is important: 10 Important Data Science Ideas.

Confidence and Uncertainty

Let's talk about confidence and uncertainty from a couple of perspectives. First, remember that statistical inference is extracting information from data, estimating, modeling, and explaining, but also quantifying uncertainty.
Data scientists could benefit from understanding this more. Learn more statistics, and read Ben's blog post on the subject.

Second, we have the Dunning-Kruger effect. Have you ever wondered why people don't say "I don't know" when they don't know something? This is partly explained by an unconscious bias called the Dunning-Kruger effect: people who are bad at something have no idea that they are bad at it, and overestimate their competence, while people who are super good at something underestimate their mastery of it. Actual competence may weaken self-confidence.

Thought experiment

Design an app to combat the Dunning-Kruger effect.

Optimizing your life; career advice

What are you optimizing for? What do you value?

• money: you need some minimum to live at the standard of living you want, and you might even want a lot
• time with loved ones and friends
• doing good in the world
• personal fulfillment, intellectual fulfillment
• goals you want to reach or achieve
• being famous, respected, acknowledged
• ?
• some weighted function of all of the above. what are the weights?

What constraints are you under?

• external factors (factors outside of your control)
• your resources: money, time, obligations
• who you are, your education, your strengths and weaknesses
• things you can or cannot change about yourself

There are many possible solutions that optimize what you value and take into account the constraints you're under. So what should you do with your life? Remember that whatever you decide to do is not permanent, so don't feel too anxious about it; you can always do something else later – people change jobs all the time. But on the other hand, life is short, so always try to be moving in the right direction (optimizing for what you care about). If you feel your way of thinking or perspective is somehow different from what those around you are thinking, then embrace and explore that; you might be onto something. I'm always happy to talk to you about your individual case.
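Tongue slightly in cheek, that "weighted function of all of the above" can be made literal. Here's a toy sketch (the weights, options, and scores are entirely invented for illustration; nothing here is real career data):

```python
# How much you care about each value (made-up weights summing to 1).
weights = {"money": 0.2, "time_with_people": 0.3, "doing_good": 0.3, "fulfillment": 0.2}

# How well each (hypothetical) career option serves each value, on a 0-1 scale.
options = {
    "hedge fund":   {"money": 0.9, "time_with_people": 0.3, "doing_good": 0.1, "fulfillment": 0.5},
    "teaching":     {"money": 0.3, "time_with_people": 0.7, "doing_good": 0.8, "fulfillment": 0.7},
    "data science": {"money": 0.6, "time_with_people": 0.5, "doing_good": 0.5, "fulfillment": 0.7},
}

def score(option):
    """Weighted sum of how well an option serves each value."""
    return sum(weights[v] * s for v, s in options[option].items())

best = max(options, key=score)
print(best)  # with these made-up weights: teaching
```

The real point, of course, is the question the list asks: what are *your* weights? Change them and the answer changes.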
Next-Gen Data Scientists

The second blog post I think is important is this "manifesto" that I wrote: Next-Gen Data Scientists. That's you! Go out and do awesome things, use data to solve problems, and have integrity and humility.

Here's our class photo!

## Costco visit

Yesterday I was in a bit of a funk. I decided to try to get out and do something new, and since my neighbors were going to Costco, which I'd never been to (or maybe I had once, but I couldn't remember, and it would have been at least 13 years ago), I invited myself to go with them. The change would do me good, I thought.

Here's the thing about Costco, which you probably already know if you shop there: it's a rush, like taking a narcotic. I am sure that their data scientists have spent many computing hours munging through many terabytes of shopping behavior to perfect this effect [Update: an article has just appeared in the New York Times explaining this very thing].

For example, I'm on a budget, and I was planning to go for the sociological experience of it, not to buy anything. After all, I'd already gotten my groceries for the week and bought reasonable presents for the kids for Christmas. Nothing outrageous. So I was feeling pretty safe. But then I got there, and I immediately came across things I didn't know I needed until I saw them.

The most ridiculous and startling version of this was my tupperware experience. Tupperware is an essential tool in a house with three kids, because if you do the math, you've got 15 school lunches to make per week. That's not something you can scrounge up carelessly. So what you do is save plastic Chinese takeout containers and fill them up after dinners with things like pasta and chicken bits. Ready to go in the morning.
But when I came across the gorgeous 30-, no 42-, no 75-piece rainbow-colored tupperware kits that actually stack together beautifully, I instantly realized my hitherto-reasonable scheme was laughably ridiculous and that I must own gorgeous tupperware right now. This happened to me with a rainbow-colored knife set too, but I managed to hold myself back because, I argued, I already owned knives. I am not sure why that reasoning failed with the tupperware. In fact I immediately filled up a cart with various (rainbow-colored) things I had no idea I needed until I got there ($50 for a cashmere sweater!?), and then I started wandering around the food area.

This is when my narcotic experience started to wear off. I think it was around the 6-pound block of mozzarella that I started to say to myself, wait, do I really want 6 pounds of mozzarella? I mean, it’s a great price, but won’t that take up half my fridge?

Then I started putting stuff back. It was amazingly easy to part with that stuff once the narcotic wore off. I was in deep withdrawal by the time we left the store. Paying in cash for the few items I did buy also helped.

My conclusions:

• I totally get why people own lots of stuff they don’t need.
• I’m sure people can get addicted to that narcotic shopping rush.
• We should consider treating Costco as a dealer in controlled substances.
• I have ugly tupperware and that’s okay.
Categories: musing

## Aunt Pythia

Last week I asked you guys to answer this question for me:

Dear Aunt Pythia,
How should I organize my bookshelf? I have 1000+ books.

Booknerd

As usual, you guys impressed me with your various uber-nerdy answers, some by email. Here are some excellent book-organizing suggestions:

• By color
• By publisher
• Via the tool librarything.com
• By your personal history reading the book – how old were you? This might be hard for me and Little House on the Prairie, which I read numerous times as a kid, then in Dutch to learn the language when I got married to a Dutch man, and now again because I just want to
• My personal favorite:

Sort books by copyright date of first editions. This would give an insight into the development of ideas. One problem with this is that it would make locating a book more difficult so an electronic index would be a valuable supplement. My initial thought was to do this only within categories, but as I think of it, it would be interesting to see fiction interspersed with history, philosophy or science.
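For what it's worth, the copyright-date scheme, plus the electronic index the reader suggests, is only a few lines of code. A toy sketch with a tiny made-up library (the titles and first-edition dates are real books, but the list is just an example):

```python
# A tiny example library: title plus first-edition copyright year.
books = [
    {"title": "Little House on the Prairie", "year": 1935},
    {"title": "On the Origin of Species", "year": 1859},
    {"title": "A Brief History of Time", "year": 1988},
]

# Shelve chronologically by first-edition date...
shelf = sorted(books, key=lambda b: b["year"])

# ...and keep an electronic index from title to shelf position,
# so you can still locate a book without scanning the whole shelf.
index = {b["title"]: pos for pos, b in enumerate(shelf)}

print([b["title"] for b in shelf])
```

With 1000+ books the principle is the same; only the data entry is painful.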

——

Dear Aunt Pythia,

My kid went to California last year to study number theory, computer science, and physics at a highly regarded university. After getting his dream summer job with a fast-growing startup known for hiring only the brightest and spoiling them with great food, good money, and lots of benefits, he's starting to adopt some of the libertarian founders' viewpoints on life. For next summer, he's interviewing with some hedge funds known to recruit and "develop" talented math kids. It is looking like he's going to the Dark Side. What can be done to turn this around and avoid the pain Mathbabe's father must have gone through?

Failed Parent

Dear Failed Parent,

First of all, my dad was disappointed when I went into math instead of economics, psyched when I left academics for a hedge fund, and disappointed when I quit finance. So much for role modeling.

Second of all, I find your story consistent with my experience of super smart people finding it shockingly fun to be super smart and developing a viewpoint which allows them (or even encourages them) to take advantage of less smart people because they can and because it’s legal, independent of whether it’s moral.

There’s this whole machine, of which hedge funds are a large part, and some math and science camps too, which feeds into this rhetoric and profits from it. I call it the fetishization of intelligence. People get so in love with their talent that they think it elevates them above mere morality.

So the story isn’t good, and your son is another cog in the intelligence-as-fetish machine. And moreover, you’re not going to be able to talk him out of it, not now when most of the world is telling him he’s kind of an outsider and weak and unattractive (if he’s like most young men this is how they feel they’re treated, even if they aren’t actually unattractive or weak or outsiderish) whereas there’s this tiny contingent telling him he has super powers. Why would he listen to you?

My advice: wait it out. He’ll probably at some point realize he could be getting more out of life if he had stuff he’d done that he could be proud of. Then it’s time for you to have a list of things he might want to do that don’t involve taking advantage of “dumb money” and do involve applying his brains to discovery and solving important problems. Or at least have a pen and paper ready to help him make such a list.

One more piece of advice: if he claims that what he does is morally neutral because if he didn’t do it someone else would, please don’t pass up that opportunity to point out all the evil that has been done by otherwise decent people with that very line. I can’t tell you how many times I hear that. I refuse to help people feel comfortable with their uncomfortable choices.

Good luck, and please get back to me in 10 years to tell me how things are.

Aunt Pythia

——

Dear Aunt Pythia,

I’m a PhD candidate in math, and I’m going to go into quantitative finance. I’m wondering: how hard is it these days for a quant to maintain his integrity without undermining his success? Any tips for doing so? Have you seen many happy yet righteous quants?

Fish in the Forest

Dear FitF,

After having a few jobs outside academia, I’m pretty sure that you get paid extra in finance partly because you’re smart and the skills you have are rare, but partly because you’re doing something pretty iffy on the righteous scale.

I’m not saying that there are no righteous quants, but I am saying that if everyone had a righteousness score and a salary score, there would be a negative correlation between the two.

So I’m on the job market, and I’m not totally against working in finance, because there are all kinds of jobs. But the ones I’m interested in involve things like making accounting reports machine-readable and consistently formatted, to promote transparency. As you might imagine, this doesn’t pay the big bucks.

Here are some snarky tips for you to continue to feel righteous if you work in finance:

• Figure out what information you know that other people don’t. See if that matters to you. Example: does insider trading make you feel dirty? Maybe not, in which case you’re good.
• Start reading Ayn Rand and develop a sense that you are entitled to be rich because you’re so smart, and that anyone who doesn’t agree must be super dumb (see previous post for how to go about this, although you may be too old to go to summer camps).
• Do you mind lying? How about omitting the truth? How about pretending to regulators that you have a firm grasp on the underlying risk of an instrument class because of a magical mathematical formula that some Ph.D. came up with but can’t explain to anyone? Good?

Good luck, and tell me how it goes!

Aunt Pythia

——

Readers, it’s time for you to answer a question! This one is particularly outside my realm, but I have faith in you good people to help:

Aunt Pythia,

I’ve been finding it really hard, as an adult who has stepped out of the higher-education track because of a lack of funding, to find resources for adults in the same vein as what I found as a young adult. Are there any resources out there that you would recommend for a minority woman aged 29+ looking for training opportunities that are more experiential than internship-based?

NYC and Wondering

——

Please ask Aunt Pythia a question! She loves her job and can’t wait to give you unreasonable, useless, and possibly damaging counsel!

Categories: Aunt Pythia