Let’s not replace the SAT with a big data approach

The big news about the SAT is that the College Board, which makes the SAT, has admitted there is a problem: widespread test prep and gaming. As I talked about in this post, the SAT mainly serves to sort people by income.

It shouldn’t be a surprise to anyone when a weak proxy gets gamed. Yesterday I discussed this very thing in the context of Google’s PageRank algorithm, and today it’s student learning aptitude. The question is, what do we do next?

Rick Bookstaber wrote an interesting post yesterday (hat tip Marcos Carreira) with an idea to address the SAT problem with the same approach that I’m guessing Google is using to address the PageRank problem, namely by abandoning the poor proxy and getting a deeper, more involved one. Here’s Bookstaber’s suggestion:

You would think that in the emerging world of big data, where Amazon has gone from recommending books to predicting what your next purchase will be, we should be able to find ways to predict how well a student will do in college, and more than that, predict the colleges where he will thrive and reach his potential.  Colleges have a rich database at their disposal: high school transcripts, socio-economic data such as household income and family educational background, recommendations and the extra-curricular activities of every applicant, and data on performance ex post for those who have attended. For many universities, this is a database that encompasses hundreds of thousands of students.

There are differences from one high school to the next, and the sample a college has from any one high school might be sparse, but high schools and school districts can augment the data with further detail, so that the database can extend beyond those who have applied. And the data available to the colleges can be expanded by orders of magnitude if students agree to share their admission data and their college performance on an anonymized basis. There already are common applications forms used by many schools, so as far as admission data goes, this requires little more than adding an agreement in the college applications to share data; the sort of agreement we already make with Facebook or Google.

The end result, achievable in a few years, is a vast database of high school performance, drilling down to the specific high school, coupled with the colleges where each student applied, was accepted and attended, along with subsequent college performance. Of course, the nature of big data is that it is data, so students are still converted into numerical representations.  But these will cover many dimensions, and those dimensions will better reflect what the students actually do. Each college can approach and analyze the data differently to focus on what they care about.  It is the end of the SAT version of standardization. Colleges can still follow up with interviews, campus tours, and reviews of musical performances, articles, videos of sports, and the like.  But they will have a much better filter in place as they do so.

Two things about this. First, I believe this is largely already happening. I’m not an expert on the usage of student data at colleges and universities, but the peek I’ve had into this industry tells me that the analytics are highly advanced (please add related comments and links if you have them!). And they have more to do with admissions and college aid – and possibly future alumni giving – than any definition of academic success. So I think Bookstaber is being a bit naive and idealistic if he thinks colleges will use this information for good. They already have it and they’re not.

Secondly, I want to think a little bit harder about when the “big, deeper data” approach makes sense. I think it does for teachers to some extent, as I talked about yesterday, because after all it’s part of a job to get evaluated. For that matter I expect this kind of thing to be part of most jobs soon (but it will be interesting to see when and where it stops – I’m pretty sure Bloomberg will never evaluate himself quantitatively).

I don’t think it makes sense to evaluate children in the same way, though. After all, we’re basically talking about pre-consensual surveillance, not to mention the collection and mining of information far beyond the control of the individual child. And we’re proposing to mine demographic and behavioral data to predict future success. This is potentially much more invasive than just one crappy SAT test. Childhood is a time which we should try to do our best to protect, not quantify.

Also, the suggestion that this is less threatening because “the data is anonymized” is misleading. Stripping out names in historical data doesn’t change or obscure the difference between coming from a rich high school or a poor one. In the end you will be judged by how “others like you” performed, and in this regime the system gets off the hook but individuals are held accountable. If you think about it, it’s exactly the opposite of the American dream.

I don’t want to be naive. I know colleges will do what they can to learn about their students and to choose students to make themselves look good, at least as long as US News & World Report exists. I’d like to make it a bit harder for them to do so.

The endgame for PageRank

First there was Google Search, and then pretty quickly SEOs came into existence.

SEOs are marketing people hired by businesses to bump up the organic rankings for that business in Google Search results. That means they pay people to make their website more attractive and central to Google Search so they don’t have to pay for ads but will get visitors anyway. And since lots of customers come from search results, this is a big deal for those businesses.

Since Google Search was based on a pretty well-known, pretty open algorithm called PageRank which relies on ranking the interestingness of pages by their links, SEOs’ main jobs were to add links and otherwise fiddle with links to and from the websites of their clients. This worked pretty well at the beginning and the businesses got higher rank and they didn’t have to pay for it, except they did have to pay for the SEOs.
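To make the link-gaming concrete, here is a minimal sketch of the textbook PageRank iteration in Python. It is not Google's production code, and everything in it is an assumption for illustration: the tiny made-up web, the page names, the damping factor of 0.85, and the five "farm" pages an SEO might add to push one shop up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by repeated averaging.

    `links` maps each page to the list of pages it links to.
    Returns a dict of page -> score, with scores summing to 1.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current score among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: two shops that start out with identical links.
honest_web = {
    "news": ["shop_a", "shop_b"],
    "blog": ["shop_a", "shop_b"],
    "shop_a": ["news"],
    "shop_b": ["news"],
}

# The same web after an SEO adds five made-up pages that link only to shop_b.
gamed_web = dict(honest_web)
gamed_web.update({f"farm{i}": ["shop_b"] for i in range(5)})

before = pagerank(honest_web)
after = pagerank(gamed_web)
print("before:", round(before["shop_a"], 3), round(before["shop_b"], 3))
print("after: ", round(after["shop_a"], 3), round(after["shop_b"], 3))
```

In the honest web the two shops tie. Once the farm pages exist, shop_b pulls ahead of shop_a without being any better of a shop, which is the whole SEO game in miniature.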

But after a while Google caught on to the gaming and adjusted its search algorithm, and SEOs responded by working harder at gaming the system (see more history here). It got more expensive but still kind of worked, and nowadays SEOs are a big business. And the algorithm war is at full throttle, with some claiming that Google Search results are nowadays all a bunch of crappy, low-quality ads.

This is to be expected, of course, when you use a proxy like “link” to indicate something much deeper and more complex like “quality of website”. Since it’s so high stakes, the gaming acts to decouple the proxy entirely from its original meaning. You end up with something that is in fact the complete opposite of what you’d intended. It’s hard to address except by giving up the proxy altogether and going for something much closer to what you care about.

Recently my friend Jordan Ellenberg sent me an article entitled The Future of PageRank: 13 Experts on the Dwindling Value of the Link. It’s an insider article, interviewing 13 SEO experts on how they expect Google to respond to the ongoing gaming of the Google Search algorithm.

The experts don’t all agree on the speed at which this will happen, but there seems to be some kind of consensus that Google will stop relying on links as such and will go to user behavior, online and offline, to rank websites.

If correct, this means that we can expect Google to pump all of our email, browsing, and even GPS data to understand our behaviors in a minute fashion in order to get at a deeper understanding of how we perceive “quality” and how to monetize that. Because, let’s face it, it’s all about money. Google wants good organic searches so that people won’t abandon its search engine altogether so it can sell ads.

So we’re talking GPS on your Android, or sensor data, and everything else it can get its hands on through linking up various data sources (which as I read somewhere is why Google+ still exists at all, but I can’t seem to find that article on Google).

It’s kind of creepy all told, and yet I do see something good coming out of it. Namely, it’s what I’ve been saying we should be doing to evaluate teachers, instead of using crappy and gameable standardized tests. We should go deeper and try to define what we actually think makes a good teacher, which will require sensors in the classroom to see if kids are paying attention and are participating and such.

Maybe Google and other creepy tech companies can show us the way on this one, although I don’t expect them to explain their techniques in detail, since they want to stay a step ahead of SEOs.

Categories: data science, modeling

Working at the Columbia Journalism School

I’m psyched to say that, as of today, I’m helping start a data journalism program at the Columbia J-School. It’s a one or two semester post-bacc program to get people into data, coding, and visualizations who are starting from non-technical fields. It starts this summer and runs through the end of the year.

And although it’s being held in the J-School, it’s not only meant for journalists. The idea is that people from other humanities who see value in working with data can enroll in the program and emerge competent with data.

There’s no time to waste, as the program starts soon (May 27th) and we don’t even quite have a name for it (suggestions welcome!). We’re also looking for students and teachers. What we do have is plenty of great plans of what to teach, lots of institutional support, and some scholarship money.


Categories: data journalism

Julia Angwin’s Dragnet Nation

I recently devoured Julia Angwin’s new book Dragnet Nation: A Quest for Privacy, Security, and Freedom in a World of Relentless Surveillance. I actually met Julia a few months ago and talked to her briefly about her upcoming book when I visited the ProPublica office downtown, so it was an extra treat to finally get my hands on the book.

First off, let me just say this is an important book, and it provides a crucial and well-described view into the private data behind the models that I get so worried about. After reading this book you have a good idea of the data landscape as well as many of the things that can currently go wrong for you personally with the associated loss of privacy. So for that reason alone I think this book should be widely read. It’s informational.

Julia takes us along her journey of trying to stay off the grid, and for me the most fascinating parts are her “data audit” (Chapter 6), where she tries to figure out what data about her is out there and who has it, and the attempts she makes to clean the web of her data and generally speaking “opt out”, which starts in Chapter 7 but extends beyond that when she makes the decision to get off of gmail and LinkedIn. Spoiler alert: her attempts do not succeed.

From the get go Julia is not a perfectionist, which is a relief. She’s a working mother with a web presence, and she doesn’t want to live in paranoid fear of being tracked. Rather, she wants to make the trackers work harder. She doesn’t want to hand herself over to them on a silver platter. That is already very very hard.

In fact, she goes pretty far, and pays for quite a few different esoteric privacy services; along the way she explores questions like how you decide to trust the weird people who offer those services. At some point she finds herself with two phones – including a “burner”, which made me think she was a character in House of Cards – and one of them was wrapped up in tin foil to avoid the GPS tracking. That was a bit far for me.

Early on in the book she compares the tracking of a U.S. citizen with what happened under Nazi Germany, and she makes the point that the Stasi would have been amazed by all this technology.

Very true, but here’s the thing. The culture of fear was very different then, and although there’s all this data out there, important distinctions need to be made: both what the data is used for and the extent to which people feel threatened by that usage are very different now.

Julia brought these up as well, and quoted sci-fi writer David Brin: The key question is, who has access? and what do they do with it?

Probably the most interesting moment in the book was when she described the so-called “Wiretapper’s Ball”, a private conference of private companies selling surveillance hardware and software to governments to track their citizens. Like maybe the Ukrainian government used such stuff when they texted warning messages to protesters.

She quoted the Wiretapper’s Ball organizer Jerry Lucas as saying “We don’t really get into asking, ‘Is in the public’s interest?’”.

That’s the closest the book got to what I consider the critical question: to what extent is the public’s interest being pursued, if at all, by all of these data trackers and data miners?

And if the answer is “to no extent, by anyone,” what does that mean in the longer term? Julia doesn’t go much into this from an aggregate viewpoint, since her perspective is both individual and current.

At the end of the book, she makes a few interesting remarks. First, it’s just too much work to stay off the grid, and moreover it’s become entirely commoditized. In other words, you have to either be incredibly sophisticated or incredibly rich to get this done, at least right now. My guess is that, in the future, it will be more about the latter category: privacy will be enjoyed only by those people who can afford it.

Julia also mentions near the end that, even though she didn’t want to get super paranoid, she found herself increasingly inside a world based on fear and well on her way to becoming a “data survivalist,” which didn’t sound pleasant. It is not a lot of fun to be the only person caring about the tracking in a world of blithe acceptance.

Julia had some ways of measuring a tracking system, which she refers to as a “dragnet”, which seems to me a good place to start:

It’s a good start.

The sun goes around the earth

Periodically you have people conducting surveys to prove how dumb people are. Questions are of the form: Is Germany in Africa? Is the earth less than 1000 years old?

I hate these surveys, and I’m usually able to ignore the obnoxious and unscientific nature of them, except when they also ask the following question: Does the sun go around the earth?

Here’s my reproduction of the imaginary conversation if I encounter such a pollster:

Pollster: Does the sun go around the earth?

Me: It depends on your frame of reference, but yes, if I’m standing on the earth, and I look up in the sky, I will observe the sun going around the earth in a wobbly path, although before I let you go I need to make the point that it would be quite a bit simpler to understand the model of the solar system whereby the earth and other planets revolve around the sun and spin while they do so.

Pollster: Yes or no question, ma’am, what’s it gonna be?

Me: Yes, I guess.

Pollster: You are so ignorant!

Categories: Uncategorized

SAT overhaul

There’s a good New York Times article by Todd Balf entitled The Story Behind the SAT Overhaul (hat tip Chris Wiggins).

It tells the story of the new College Board President, David Coleman, and how he decided to deal with the biggest problem with the SAT: namely, that it was pretty easy to prepare for the test, and the result was that richer kids did better, having more resources – both time and money – to prepare.

Here’s a visual from another NY Times blog on the issue:


Here’s my summary of the story.

At this point the SAT serves mainly to sort people by income. It’s no longer an appropriate way to gauge “IQ” as it was supposed to be when it was invented. Not to mention that colleges themselves have been playing a crazy game with respect to gaming the US News & World Report college ranking model via their SAT scores. So it’s one feedback loop feeding into another.

How can we deal with this? One way is to stop using it. The article describes some colleges that have made SAT scores optional. They have not suffered, and they have more diversity.

But since the College Board makes their livelihood by testing people, they were never going to just shut down. Instead they’ve decided to explicitly make the SAT about content knowledge that they think high school students should know to signal college readiness.

And that’s good, but of course one can still prepare for that test. And since they’re acknowledging that now, they’re trying to set up the prep to make it more accessible, possibly even “free”.

But here’s the thing, it’s still online, and it still involves lots of time and attention, which still saps resources. I predict we will still see incredible efforts towards gaming this new model, and it will still break down by income, although possibly not quite as much, and possibly we will be training our kids to get good at slightly more relevant stuff.

I would love to see more colleges step outside the standardized testing field altogether.

Categories: modeling, statistics

Could we use eminent domain to help suffering homeowners? (#OWS)

Here are two things you might have some trouble believing if you read the papers regularly and find yourself convinced we are in a housing recovery. First, there are still huge numbers of homeowners on the brink of, or just starting to enter, foreclosure. Second, many of the banks foreclosing on those properties do not have clear legal ownership over the mortgages in question.

Obama should have addressed the first problem through TARP way back in 2008. In fact mortgage modification was an intention of TARP that was promised to Congress when it passed the second half of the money, but it never happened. Instead Obama came up with the garbage called HAMP, which has been dreadfully implemented and is possibly a net harmful program.

Even without Obama, we should have seen a willingness to renegotiate debt. After all, we can negotiate credit card debt, and businesses routinely renegotiate their mortgages. Why are private home mortgages kept airtight? I guess the banks see it as in their interest not to allow negotiations, and whatever the banks want, the banks seem to get.

The second problem, which is essentially one of botched paperwork (explained here), is probably technically the job of some regulator to deal with, but nobody wants to “blow up the system” so nobody is dealing with it. This is especially ironic considering how often we hear about the so-called sanctity of the contract.

The result of these huge looming problems is that banks got bailed out and the system never got cleared of its actual debt and paperwork problems.

Enter the concept of using eminent domain to force these two issues. Strike Debt, an offshoot of Occupy Wall Street, is pushing this in a few nationwide court cases, for example in Richmond, California.

More recently, and what inspired this post this morning, is a plan cooked up by Strike Debt using eminent domain to force courts to clear up broken chains of title, written by Hannah Appel and JP Massar.

This idea is on its face unappealing, given the history of that crude tool eminent domain. Everyone I meet has their own stories, but start here for a short list of eminent domain abuses.

And it might not work, either. A district judge might not want to deal with the complexity of the issue and might just let the bad paperwork through.

For that matter, many concerns have been voiced about the practicality of this approach, and one that deeply resonates with me is the idea of using it against current mortgages – i.e. mortgages where the homeowner is up-to-date with payment. Using eminent domain in such a case could set a precedent whereby, even though someone has been taking care of their property, the city uses eminent domain to condemn it based on historical data which implies the owner is likely to neglect their property. That would not be good enough. As far as I know the current plan only uses mortgages where there have been missed payments, though.

The bottom line is this: we’re in a situation where all these homeowners are being crushed with unreasonable monthly payments and hugely inflated principals, where the legal ownership of the mortgage itself is under question, and nobody seems to want to do squat about it. Maybe it’s time a crude tool is used against a cruel enemy.

Categories: #OWS, finance, musing

Aunt Pythia’s advice

Aunt Pythia missed you very much last week and is ever so grateful to return today. And although she usually takes on four questions from readers, today she feels like switching it up and taking on three but making them extra delicious. She hopes you agree that this was the correct choice. Plus she’s running out of questions again, so she’s conserving.

In other words, after you enjoy Aunt Pythia’s wisdom, please don’t forget to:

think of something to ask Aunt Pythia at the bottom of the page!

By the way, if you don’t know what the hell Aunt Pythia is talking about, go here for past advice columns and here for an explanation of the name Pythia.


Dear Aunt Pythia,

So about that Valentine’s Day article which you asked us to ask about… so many questions!

1. In consecutive paragraphs, she says that educated men want “younger, less challenging women” and then that educated women will be frustrated with someone who “just can’t keep up with you or your friends.” Question: is this more insulting to women or to men?

2. She says that “College is the best place to look for your mate. It is an environment teeming with like-minded, age-appropriate single men with whom you already share many things.” Is she talking about STDs here?

3. Did she actually write the sentence “Men won’t buy the cow if the milk is free.”?

4. She writes, “And if you fail to identify ‘the one’ while you’re in college, don’t worry—there’s always graduate school.” So she’s encouraging the old MRS degree. Question: what year was this article written?

That’s all I’ve got for now… I can’t bear to read any more of it!

Woman Turning Forty

Dear WTF,

First, may I express deep satisfaction and pleasure at both your willingness to hate on this article with me and your gorgeous and appropriate acronym. Nicely done, we should hang out. Plus we are age-appropriate, so I’m sure Susan Patton would approve. In fact, here’s a picture of Susan Patton approving or not:

She actually looks like she’s reserving judgment in a baffled way.

On to the questions:

1. Great point, but I’d have to go with “equally insulting to all human persons” here. The basic assumption she makes is that people can be meaningfully measured by external attributes such as age and education level. Some of the stupidest people I’ve ever met were at Harvard and MIT, and some of the wisest – and in some sense, most threatening – people I’ve met are young children, who can really say it like it is. As to the assumption that men are only interested in young, less challenging women, I’m going to assume that’s the way she raises her sons to be, and I pity them.

2. I mean, look. I’m not saying you shouldn’t take lovers in college, and experiment with STDs for that matter, when it suits you and you have the time and interest. In fact you should fool around as much as you care to, and it’s a natural thing to do considering how many hormones are knocking about. But the idea that you should feel like you’re already late to the critical party if you graduate from college without a fiancee is just putrid advice. People make desperate and bad choices when they are insecure, boxed in, and panicking for time. The way I see it, getting people to marry young is a kind of social control that old people exert on the young, before they really know how to say “fuck this particular model of conformity”.

3. OMG yes she did, and guess what? That’s sexual objectification, pure and simple, and it’s not empowering. If she doesn’t see that, she should watch this video with Caroline Heldman, the chair of the Politics department at Occidental College. In fact everyone should, it blew me away.

4. I’m eyeballing the answer as before 1920, the year women were given the right to vote.

Thanks again for the opportunity to vent!

Aunt Pythia


Dear Aunt Pythia,

You asked for questions on the Susan Patton column. This is barely a question, but here you go.

I have a lot of “alpha” traits that may be stereotypically associated with males. Your posts on being an alpha female have definitely helped me understand some aspects of myself and why it can be confusing for me when I interact with other women, so thanks for that.

For example, my ego likes it when I’m the smartest one in a group, or earn the most money in a relationship or something. But that isn’t always actually what will make me happiest/best off. I am an amateur musician, and I have learned to enjoy being in a musical group where I am the weakest link. I don’t like being a burden to the other people in the group, but if I’m the worst, that means I’m making music with a bunch of people who are even better than I am, so I am making really great music. (And of course I work hard to improve and play as well as I possibly can.) I don’t like playing music with people who are so much better that they will hate the experience, but if I’m the worst by a little bit, it’s perfect for me. Sure, it would give me a little ego boost to be the best and look down on the other people, but that ego boost isn’t as good as the feeling of making better music.

Likewise, if my family’s earnings were limited to 2x, where x is my salary, I would be worse off than if I had a partner who made more money than I did (assuming that money can buy happiness, which it basically can). But in the Patton piece, she talks about the old trope that men don’t want to be out-earned by their partners. My question is, what’s the deal with that? Why are people (stereotypically males, I guess) so threatened by having a partner who earns more than they do, or who is smarter than they are?

Another Alpha Female

Dear AAF,

I just want to make a couple of remarks before getting to your question. First of all, everyone likes feeling like a smart person in a group, and second of all, not everyone is willing to be the worst player in a band. So good for you for being willing to put yourself out there, and alpha female or not, people need to challenge themselves. Plus keep in mind many people – maybe even all – will think they’re the worst person in a band, because they notice their own mistakes more than they notice other people’s.

As for the money thing, I think there are two effects going on here. First, there’s a very temporary “attributes seem important” effect when you first meet someone. This was illustrated recently by various reports (e.g. this) on how people create artificial filters in their online dating profiles – things like height, weight, and education requirements. As it turns out, people are much more restrictive online than in real life, partly because of the nature of the information that is available to online daters.

So just as you think you want a tall guy when you fill out a form, if you meet someone in real life who is two inches shorter than you but makes you laugh yourself silly, you will not even notice his height. And just as men might abstractly be seeking a woman who earns just a little bit less than he does – although I’m not sure men think about it explicitly like this – there’s a good chance he will fall in love based on how she smiles when she plays guitar rather than her paycheck.

There may be a longer term intimidation problem as well, where men and women are accustomed to the idea that the man should be in some way dominant. For example, I still think that men are less likely to leave bad jobs because they have more of a sense of duty towards their images as workers. I’m not sure how to address this in a relationship except to advise women to find a man who loves his job.

Finally, I don’t think anyone ever thinks they’re “not as smart” as their partner. It’s a combination of the multidimensionality of intelligence and human nature that we all find ways in which we’re plenty smart with respect to our long-term friends and partners. I guess the exception might be if both people work in the same exact field and so one dimension of smarts is overemphasized. In that case I’d suggest working in different jobs or at least focusing on other kinds of talents whenever possible.

Aunt Pythia


Dear Aunt Pythia,

Isn’t fairness at least as quantifiable as happiness? Why have no fairness rankings of nations been published? If psychologists can study happiness, then surely sociologists can study fairness.

Elvis Von Essende Nicholas Friedrich Lester Otto Widener IV


Well, depending on what you mean by fairness, there have been a few attempts. For just plain income inequality, we have what’s called the Gini coefficient with an associated map:

In 2009, USA had a terribly high Gini coefficient. Most recently it was measured at 0.486, the very top of that bin.
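If you want to see what that number is actually measuring, here is a small sketch of the standard Gini calculation in Python. The income lists are made up for illustration and are not taken from any country's data; the function name `gini` is my own.

```python
def gini(incomes):
    """Gini coefficient of a list of non-negative incomes.

    0 means everyone has the same income; values near 1 mean
    one person has nearly everything.
    """
    values = sorted(incomes)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive income")
    # Standard closed form over incomes sorted from poorest to richest:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with i = 1..n
    weighted = sum(i * x for i, x in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Hypothetical four-person "countries".
print(gini([1, 1, 1, 1]))    # everyone equal
print(gini([1, 2, 3, 4]))    # mild spread
print(gini([0, 0, 0, 10]))   # one person has it all
```

With only four people the most unequal case tops out at 0.75 rather than 1, since the cap is (n - 1) / n. For a whole country that cap is effectively 1.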

For other concepts of fairness like “given your situation at birth, what’s your situation later on?” you have the concept of mobility, and here’s a graph of that by city from the New York Times:

Did you have something else in mind?

Aunt Pythia


Please submit your well-specified, fun-loving, cleverly-abbreviated question to Aunt Pythia!

Categories: Aunt Pythia

An attempt to FOIL request the source code of the Value-added model

Last November I wrote to the Department of Education to make a FOIL request for the source code for the teacher value-added model (VAM).


To explain why I’d want something like this, I think the VAM model sucks and I’d like to explore the actual source code directly. The white paper I got my hands on is cryptically written (take a look!) and doesn’t explain what the actual sensitivities to inputs are, for example. The best way to get at that is the source code.

Plus, since the New York Times and other news outlets published teachers’ VAM scores after a long battle and a FOIA request (see details about this here), I figured it’s only fair to also publicly release the actual black box which determines those scores.

Indeed, without knowledge of what the model consists of, the VAM scoring regime is little more than a secret set of rules with tremendous power over teachers and the teacher union, and it also involves outrageous public shaming as described above.

I think teachers deserve better, and I want to illustrate the weaknesses of the model directly on an open models platform.

The FOIL request

Here’s the email I sent on 11/22/13:

Dear Records Access Officer for the NYC DOE,

I’m looking to get a copy of the source code for the most recent value-added teacher model through a FOIA request. There are various publicly available descriptions of such models, for example here, but I’d like the actual underlying code.

Please tell me if I’ve written to the correct person for this FOIA request, thank you very much.

Cathy O’Neil

Since my FOIL request

In response to my request, on 12/3/13, 1/6/14, and 2/4/14 I got letters saying stuff was taking a long time since my request was so complicated. Then yesterday I got the following response:

[Screenshot of the response]

If you follow the link you’ll get another white paper, this time from 2012-2013, which is exactly what I said I didn’t want in my original request.

I wrote back, not that it’s likely to work, and after reminding them of the text of my original request I added the following:

What you sent me is the newer version of the publicly available description of the model, very much like my link above. I specifically asked for the underlying code. That would be in a programming language like python or C++ or java.

Can you come back to me with the actual code? Or who should I ask?

Thanks very much,

It strikes me as strange that it took them more than 3 months to send me a link to a white paper instead of the source code as I requested. Plus I’m not sure what they mean by “SED”; I’m guessing it means these guys, but I’m not sure exactly who to send a new FOIL request to.

Am I getting the runaround? Any suggestions?

Categories: modeling, statistics

Speaking tonight at NYC Open Data

March 6, 2014

Tonight I’ll be giving a talk at the NYC Open Data Meetup, organized by Vivian Zhang. I’ll be discussing my essay from last year entitled On Being a Data Skeptic, as well as my Doing Data Science book. I believe there are still spots left if you’d like to attend. The details are as follows:

When: Thursday, March 6, 2014, 7:00 PM to 9:00 PM

Where: Enigma HQ, 520 Broadway, 11th Floor, New York, NY (map)


  • 6:15pm: Doors Open for pizza and casual networking
  • 7:00pm: Workshop begins
  • 8:30pm: Audience Q&A

Categories: data science

Gaming the (risk/legal) system

A while back I was talking to some math people about how credit default swaps (CDSs), by their very nature, contain risk that is generally speaking undetectable with standard risk models like Value-at-Risk (VaR).
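One way to see how this kind of risk can hide from VaR is a toy calculation. The sketch below uses a simple historical-simulation VaR (one convention among several) on a hypothetical seller of credit protection; the function name `historical_var`, the premium, the loss size, and the 250-day window are all assumptions chosen for illustration, not data from any real instrument.

```python
def historical_var(pnl, confidence=0.99):
    """Loss threshold read off past daily P&L (simple historical method)."""
    losses = sorted(-x for x in pnl)           # positive numbers are losses
    index = int(confidence * len(losses))      # position of the chosen quantile
    index = min(index, len(losses) - 1)        # stay inside the list
    return losses[index]


# Hypothetical protection seller: a small premium comes in every day.
daily_premium = 1_000.0
loss_if_default = 10_000_000.0                 # assumed payout on a default

# One year of history in which no default happened to occur.
window = [daily_premium] * 250

var_99 = historical_var(window)
print("99% historical VaR:", var_99)           # negative, i.e. the model sees only gains
print("Loss if the reference entity defaults:", loss_if_default)
```

Because the default never shows up in the observation window, the historical VaR reports no loss at all, while the position can still lose the full assumed payout in a single day.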

It occurred to me then that I could put it another way: that perhaps credit default swaps might have been deliberately created by someone who knew all about the standard risk models to game the system. VaR was commercialized in the mid-1990s and CDSs existed around the same time, but didn’t take off for a decade or so until after VaR became super widespread, which makes it hard to prove without knowing the actors.

For that matter it is reasonable to assume something less deliberate occurred: that a bunch of weird instruments were created and those which hid risk the most thrived, kind of an evolutionary approach to the same theory.

I was reminded recently of this conspiracy theory when Joe Burns talked to my Occupy group last Sunday about his recent book, Reviving the Strike. He talked about the history of strikes as a tool of leverage, and how much less frequently we’ve seen large-scale strikes and industry-wide strikes. He made the point that the legality of strikes has historically been uncorrelated to the existence of strikes – that strikers cannot necessarily wait for the legal system to catch up with the needs of the worker. Sometimes strikers need to exert pressure on legislation.

Anyhoo, one question that came up in Q&A was how, in this world of subsidiaries and franchises, can workers strike against the upper management with control over the actual big money? After all, McDonalds workers work for franchisees who are often not well-off. The real money lives in the mother company but is legally isolated from the franchises.

Similarly, with Walmart, there are massive numbers of workers that don’t work directly for Walmart but do work in the massive supply chain network set up and run by Walmart. They would like to hold Walmart responsible for their working conditions. How does that work?

It seems like the same VaR/CDS story as above. Namely, the legal structure of McDonalds and Walmart almost seems deliberately set up to avoid legal responsibility from disgruntled workers. So maybe first you had the legal system, then lawyers set up the legal construction of the supply chain and workers such that striking workers could only strike against powerless figures, especially in the McDonalds case (since Walmart has plenty of workers working for the mother company as well).

Last couple of points. First, only long-term, powerful enterprises can go to the trouble of gaming such large systems. It’s an artifact of the age of the corporation.

And finally, I feel like it’s hard to combat. We could try to improve our risk or legal system but that makes them – probably – even more complicated, which in turn gives massive corporations more ways to game them. Not to be a cynic, but I don’t see a solution besides somehow separately sidestepping our personal risk exposure to these problems.

Categories: finance

How much is your data worth?

I heard an NPR report yesterday with Emily Steel, reporter from the Financial Times, about what kind of attributes make you worth more to advertisers. She has developed an ingenious online calculator here, which you should go play with.

As you can see it cares about things like whether you’re about to have a kid or are a new parent, as well as if you’ve got some disease where the industry for that disease is well-developed in terms of predatory marketing.

For example, you can bump up your worth to $0.27 from the standard $0.0007 if you’re obese, and another $0.10 if you admit to being the type to buy weight-loss products. And of course data warehouses can only get that much money for your data if they know about your weight, which they may or may not, for instance if you don’t buy weight-loss products.
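The arithmetic behind those numbers is easy to check. The sketch below takes the three dollar figures quoted above and assumes the extra $0.10 is simply added on top of the $0.27, which is one reading of how the calculator works rather than a confirmed description of it; the variable names are made up.

```python
from decimal import Decimal

base = Decimal("0.0007")             # standard worth quoted above
obese = Decimal("0.27")              # worth once the obesity attribute is known
weight_loss_bump = Decimal("0.10")   # extra for buying weight-loss products

total = obese + weight_loss_bump     # assumes the bump is additive
print(total)                         # 0.37
print(round(total / base))           # roughly how many times the standard worth
```

Under that assumption the two attributes together make a person worth several hundred times the standard figure.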

The calculator doesn’t know everything, and you can experiment with how much it does know, but some of the default assumptions are that it knows my age, gender, education level, and ethnicity. Plenty of assumed information to, say, build an unregulated version of a credit score to bypass the Equal Credit Opportunity Act.

Here’s a price list with more information from the biggest data warehouser of all, Acxiom.

Categories: data science, modeling

Report from an MSRI MOOC conversation

I am back from Berkeley where I attended a couple of hours of conversations about MOOCs last Friday up at MSRI.

It was a panel discussion given mostly by math and stats people who themselves run MOOCs, and I was wondering if the people who are involved have a better sense of the side effects and feedback loops involved in the process. After all, I’m claiming that the MOOC Revolution will lead to the end of math research, and I wanted to be proven wrong.

Unfortunately, I left feeling like I have even more evidence that my fears will be realized.

I think the critical moment came when Ani Adhikari spoke. Professor Adhikari is in the second semester of giving her basic stats MOOC, and from how she described it, she is incredibly good at it, and there’s a social network aspect of the class which seems like it’s going really well – she says she spends 30 minutes to an hour a day on it herself, interacting with students. I think she said 28,000 students took it her first semester in addition to her in-class students at Berkeley. I know and respect Professor Adhikari personally, as I taught for her at the Berkeley Mills summer program for women many years ago. I know how devoted she is to good teaching.

Even so, she lost me late in the discussion when she explained that EdX, the platform which hosts her stats MOOC, wanted to offer her class three times a year without her participation. She said something to the effect that MOOC professors had to be “extra vigilant” about this outrageous idea and guard against it at all costs.

After all, she said, at the end of the day the MOOC videos are something like a fancy textbook, and we don’t hand out textbooks and claim they are courses, so we by the same token cannot hand out MOOC videos (and presumably the social networks associated with them) and claim they are courses.

When I pressed her in the Q&A session as to how exactly she was going to remain vigilant against this threat, she said she has a legal contract with EdX that prevented them from offering the course without her approval.

And I’m happy for her and her great contract, but here are two questions for her and for the community.

First, how long until someone in math or stats makes a kick-ass MOOC and doesn’t remember to have that air-tight legal contract? Or has an actual legal battle with EdX and realizes their lawyers are not as expensive? Or believes that “information should be free” and does it with the express intention of letting the MOOC be replayed forever?

Second, how much sense does it make to claim that you and your presence are super critical to the success of a MOOC if 28,000 people took this class and you interacted at most one hour a day? Can you possibly claim that the average student benefitted from your presence? It seems to me that the value proposition for the average MOOC student is very similar whether you are there or not.
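A back-of-the-envelope version of that question, using only the two figures mentioned above and the (unrealistic) assumption that the instructor’s time is spread evenly over every enrolled student:

```python
students = 28_000        # enrollment figure quoted above
minutes_per_day = 60     # upper end of the "30 minutes to an hour" range

# Assumes attention is split evenly, which real interaction never is.
seconds_per_student = minutes_per_day * 60 / students
print(round(seconds_per_student, 2))   # 0.13
```

Even at the upper end of the range, that works out to a small fraction of a second per student per day.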

Overall the impression I got from the speakers, who were mostly MOOC evangelists and involved with MOOCs themselves, was that they loved MOOCs because MOOCs were working for them. They weren’t looking much beyond that point to side effects.

There was one exception, namely Susan Holmes, who listed some side effects of MOOCs including a decreased need for math Ph.D.’s. Unfortunately the conversation didn’t dwell on this, though, and it happened at the very end of the day.

Here’s what I’d like to see: a conversation at MSRI about the future of math research funding in the context of MOOCs and a reduced NSF, where hopefully we come up with something besides “Jim Simons”. It’s extra ironic that the conversation, if it happens, would be held in the Simons Theater.

Categories: math education

Data journalism

I’m in Berkeley this week, where I gave two talks (here are my slides from Monday’s talk on recommendation engines, and here are my slides from Tuesday’s talk on modeling) and I’ve been hanging out with math nerds and college friends and enjoying the amazing food and cafe scene. This is the freaking life, people.

Here’s what’s been on my mind lately: the urgent need for good data journalism. If you read this Washington Post blog by Max Fisher you will get at one important angle of the problem. The article talks about the need for journalists to be competent in basic statistics and exploratory data analysis to do reasonable reporting on data, in this case the state of journalistic freedoms.

And you might think that, as long as journalists report on other stuff that’s not data heavy, they’re safe. But I’d argue that the proliferation of data is leaking into all corners of our culture, and basic data and computing literacy is becoming increasingly vital to the job of journalism.

Here’s what I’m not saying (a la Miss Disruption): learn to code, journalists, and everything will be cool. To be clear, having data skills is necessary but not sufficient.

So it’s more like, if you don’t learn to code, and even more importantly if you don’t learn to be skeptical of the models and the data, then you will have yet another obstacle between you and the truth.

Here’s one way to think about it. A few days ago I wrote a post about different ways to define and regulate discriminatory acts. On the one hand you have acts or processes that are “effectively discriminatory” and on the other you have acts or processes that are “intentionally discriminatory.”

In this day and age, we have complicated, opaque, and proprietary models: in other words, a perfect hiding place for bad intentions. It would be idiotic for someone with the intention of being discriminatory to do so outright. It’s much easier to embed such a thing in an opaque model where it will seem unintentional and will probably never be discovered at all.

But how is an investigative journalist going to even approach that? The first thing they need is to arm themselves with the right questions and the right attitude. And it wouldn’t hurt if they or their team could perform a test on the data and algorithm as well.

I’m not saying that we’re going to suddenly have do-everything super human journalists. Just as the list of job requirements for data scientists is outrageously long and nobody can be expert at everything, we will have to form teams of journalists which as a whole have lots of computing and investigative expertise.

The alternative is that the models go unchallenged, which is a really bad idea.

Here’s a perfect example of what I think needs to happen more: when ProPublica reverse-engineered Obama’s political messaging model.

Categories: data journalism

What privacy advocates get wrong

There’s a wicked irony when it comes to many privacy advocates.

They are often narrowly focused on their own individual privacy issues, but when it comes down to it they are typically super educated well-off nerds with few revolutionary thoughts. In other words, the very people obsessing over their privacy are people who are not particularly vulnerable to the predatory attacks of either the NSA or the private companies that make use of private data.

Let me put it this way. If I’m a data scientist working at a predatory credit card firm, seeking to build a segmentation model to target the most likely highly profitable customers – those that ring up balances and pay off minimums every month, sometimes paying late to accrue extra fees – then if I am profiling a user and notice an ad blocker or some other signal of privacy concerns, chances are that becomes a wealth indicator and I leave them alone. The mere presence of privacy concerns signals that this person isn’t worth pursuing with my manipulative scheme.

If you don’t believe me, take a look at a recent Slate article written by Cyrus Nemati and entitled Take My Data Please: How I learned to stop worrying and love a less private internet.

In it he describes how he used to be privacy obsessed, for no better reason than that he liked to stick up a middle finger to those who would collect his data. I think that article should have been called something like, Well-educated white guy was a privacy freak until he realized he didn’t have to be because he’s a well-educated white guy.

He concludes that he really likes how well customized things are to his particular personality, and that shucks, we should all just appreciate the web and stop fretting.

But here’s the thing, the problem isn’t that companies are using his information to screw Cyrus Nemati. The problem is that the most vulnerable people – the very people that should be concerned with privacy but aren’t – are the ones getting tracked, mined, and screwed.

In other words, it’s silly for certain people to be scrupulously careful about their private data if they are the types of people who get great credit card offers and have a stable well-paid job and are generally healthy. I include myself in this group. I do not prevent myself from being tracked, because I’m not at serious risk.

And I’m not saying nothing can go wrong for those people, including me. Things can, especially if they suddenly lose their jobs or they have kids with health problems or something else happens which puts them into a special category. But generally speaking those people with enough time on their hands and education to worry about these things are not the most vulnerable people.

I hereby challenge Cyrus Nemati to seriously consider who should be concerned about their data being collected, and how we as a society are going to address their concerns. Recent legislation in California is a good start for kids, and I’m glad to see the New York Times editors asking for more.

Categories: data science, rant

Ya’ make your own luck, n’est-ce pas?

This is a guest post by Leopold Dilg.

There’s little chance we can underestimate our American virtues, since our overlords so seldom miss an opportunity to point them out.  A case in point – in fact, le plus grand du genre, though my fingers tremble as I type that French expression, for reasons I’ll explain soon enough – is the Cadillac commercial that interrupted the broadcast of the Olympics every few minutes.

A masterpiece of casting and directing and location scouting, the ad follows a middle-aged man, muscular enough but not too proud to show a little paunch – manifestly a Master of the Universe – strutting around his chillingly modernist $10 million vacation house (or is it his first or fifth home? no matter), every pore oozing the manly, smirky bearing that sent Republican country-club women swooning over W.

It starts with Our Hero, viewed from the back, staring down his infinity pool.   He pivots and stares down the viewer.  He shows himself to be one of the more philosophical species of the MotU genus.  “Why do we work so hard?” he puzzles. “For this?  For stuff?….”  We’re thrown off balance:  Will this son of Goldman Sachs go all Walden Pond on us?  Fat chance.

Now, still barefooted in his shorts and polo shirt, he’s prowling his sleek living room (his two daughters and stay-at-home wife passively reading their magazines and ignoring the camera, props in his world no less than his unused pool and The Car yet to be seen) spitting bile at those foreign pansies who “stop by the café” after work and “take August off!….OFF!”  Those French will stop at nothing.

“Why aren’t YOU like that,” he says, again staring us down and we yield to the intimidation.  (Well gee, sir, of course I’m not.  Who wants a month off?  Not me, absolutely, no way.)  “Why aren’t WE like that” he continues – an irresistible demand for totalizing merger.   He’s got us now, we’re goose-stepping around the TV, chanting “USA! USA! No Augusts off! No Augusts off!”

No, he sneers, we’re “crazy, hardworking believers.”  But those Frogs – the weaklings who called for a double-check about the WMDs before we Americans blasted Iraqi children to smithereens (woops, someone forgot to tell McDonalds, the official restaurant of the U.S. Olympic team, about the Freedom Fries thing; the offensive French Fries are THERE, right in our faces in the very next commercial, when the athletes bite gold medals and the awe-struck audience bites chicken nuggets, the Lunch of Champions) – might well think we’re “nuts.”

“Whatever,” he shrugs, end of discussion, who cares what they think.  “Were the Wright Brothers insane?  Bill Gates?  Les Paul?…  ALI?”  He’s got us off-balance again – gee, after all, we DO kinda like Les Paul’s guitar, and we REALLY like Ali.

Of course!  Never in a million years would the hip jazz guitarist insist on taking an August holiday.  And the imprisoned-for-draft-dodging boxer couldn’t possibly side with the café-loafers on the WMD thing.  Gee, or maybe…. But our MotU leaves us no time for stray dissenting thoughts.  Throwing lunar dust in our eyes, he discloses that WE were the ones who landed on the moon.  “And you know what we got?” Oh my god, that X-ray stare again, I can’t look away.  “BORED.   So we left.” YEAH, we’re chanting and goose-stepping again, “USA! USA!  We got bored!  We got bored!”

Gosh, I think maybe I DID see Buzz Aldrin drumming his fingers on the lunar module and looking at his watch.  “But…” – he’s now heading into his bedroom, but first another stare, and pointing to the ceiling – “…we got a car up there, and left the keys in it.  You know why? Because WE’re the only ones goin’ back up there, THAT’s why.” YES! YES! Of COURSE! HE’S going back to the moon, I’M going back to the moon, YOU’RE going back to the moon, WE’RE ALL going back to the moon. EVERYONE WITH A U.S. PASSPORT is going back to the moon!!

Damn, if only the NASA budget wasn’t cut after all that looting by the Wall Street boys to pay for their $10 million vacation homes, WE’D all be going to get the keys and turn the ignition on the rover that’s been sitting 45 years in the lunar garage waiting for us.   But again – he must be reading our mind – he’s leaving us no time for dissent, he pops immediately out of his bedroom in his $12,000 suit, gives us the evil eye again, yanks us from the edge of complaint with a sharp, “But I digress!” and besides he’s got us distracted with the best tailoring we’ve ever seen.

Finally, he’s out in the driveway, making his way to the shiny car that’ll carry him to lower Manhattan.  (But where’s the chauffeur?  And don’t those MotUs drive Maseratis and Bentleys?  Is this guy trying to pull one over on the suburban rubes who buy Cadillacs stupidly thinking they’ve made it to the big time?)

Now the climax:  “You work hard, you create your own luck, and you gotta believe anything is possible,” he declaims.

Yes, we believe that!  The 17 million unemployed and underemployed, the 47 million who need food stamps to keep from starving, the 8 million families thrown out of their homes – WE ALL BELIEVE.  From all the windows in the neighborhood, from all the apartments across Harlem, from Sandy-shattered homes in Brooklyn and Staten Island, from the barren blast furnaces of Bethlehem and Youngstown, from the foreclosed neighborhoods in Detroit and Phoenix, from the 70-year olds doing Wal-mart inventory because their retirement went bust, from all the kitchens of all the families carrying $1 trillion in college debt, I hear the national chant, “YOU MAKE YOUR OWN LUCK!  YOU MAKE YOUR OWN LUCK!”

And finally – the denouement – from the front seat of his car, our Master of the Universe answers the question we’d all but forgotten.  “As for all the stuff? That’s the upside of taking only two weeks off in August.”  Then the final cold-blooded stare and – too true to be true – a manly wink, the kind of wink that makes us all collaborators and comrades-in-arms, and he inserts the final dagger: “N’est-ce pas?”

N’est-ce pas?

Categories: guest post

How can we regulate around discrimination?

I am looking into the history of anti-discrimination laws like the Equal Credit Opportunity Act (ECOA) and how it got passed, and I hope to find data to measure how well it’s worked since it got passed in 1974.

Putting aside the history of this legislation for now – although it is fascinating – I’d like to talk this morning about this paper from 1989 written by Gregory Elliehausen and Thomas Durkin from the Board of Governors of the Federal Reserve System, which discusses the abstract question of approaches to defining and regulating around discrimination.

This came up because when Congress passed ECOA, they left it to the regulators – in this case the Federal Reserve – to decide exactly how to write the rules, which pertain to credit decisions (think credit card offerings). From the article:

The term “discriminate against an applicant” was defined in Section 202.2(n) as meaning “to treat an applicant less favorably than other applicants.” By itself, this rule does not offer an unquestionably unambiguous operational definition of socially unacceptable discrimination in a screening context where limited selections are constantly being made from a longer list of applicants.

The paper then goes on to list 3 separate regulatory approaches to anti-discrimination regulation. I have found these three definitions really interesting and thought-provoking. I won’t even go into the rest of the paper on this post because I think just this list of three approaches is so interesting. Tell me if you agree.

1) The “effects-based” approach to regulation. This is the idea that we don’t need to know how you actually make credit decisions, but if the effect is that no women or minorities ever get credit from you, then you’re doing something wrong. If you want to be really extreme in this category you get to things like quotas. If you want to be less extreme you think about studying applications that are similar except for one thing like race or gender, kind of like the male vs. female science lab application test studied here. Needless to say, effects-based regulation is not in use; it’s considered too extreme.

2) The “intent-based” approach to regulation. This is where you have to prove intent to discriminate. It’s super rare that you can do that, because it’s super rare that people aiming to discriminate are dumb enough to make it obvious. Far easier to embed discrimination in a model where you can maintain plausible deniability. Although intent-based regulation is considered too extreme in the other direction, it seems to be what surfaces when there’s a legal case (although I’m not a legal expert).

3) The “practices-based” approach to regulation. This is where you make a list of acceptable or unacceptable practices in extending credit and hope you cover everything. So for example you aren’t allowed to explicitly use race or marital status or governmental assistance status in your credit models. This is what the Fed finally decided to use, and it makes sense in that it’s easy to implement, but of course the lists change over time, and that’s the key issue (for me anyway): we need to update those lists in the age of big data.
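The effects-based idea in item 1 is the easiest of the three to sketch in code, because it only looks at outcomes and never at the model that produced them. The Python below is a minimal illustration under stated assumptions: the function names `approval_rates` and `rate_ratio` are invented, the decisions are made-up numbers, and no regulator’s actual test or threshold is implied.

```python
from collections import defaultdict


def approval_rates(decisions):
    """decisions: iterable of (group, approved) pairs -> {group: approval rate}."""
    counts = defaultdict(lambda: [0, 0])       # group -> [approved, total]
    for group, approved in decisions:
        counts[group][0] += int(approved)
        counts[group][1] += 1
    return {g: a / n for g, (a, n) in counts.items()}


def rate_ratio(rates, group, reference):
    """Approval rate of `group` relative to the `reference` group."""
    return rates[group] / rates[reference]


# Made-up credit decisions, for illustration only.
decisions = (
    [("men", True)] * 60 + [("men", False)] * 40
    + [("women", True)] * 30 + [("women", False)] * 70
)

rates = approval_rates(decisions)
print(rates)
print(rate_ratio(rates, "women", "men"))
```

Nothing here requires seeing how the lender makes its decisions, which is exactly what separates this approach from the intent-based and practices-based ones.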

Tell me if you think there’s yet another approach not mentioned. And note these regulatory approaches correspond to different ways of thinking about or even defining discrimination, which is itself a great reason to list them comprehensively. I think my future discussions about what constitutes discrimination will be informed by which above approach will pick up on a given instance.

Categories: finance

Aunt Pythia’s advice

Aunt Pythia has some exciting news.

After spending about 5 days of the last 7 in bed with an awful flu, and finishing off both seasons of House of Cards (with the associated feeling of being simultaneously drowned in cynicism and phlegm), Aunt Pythia started in on Battlestar Galactica, which she honestly should have done years ago.

I just love me some Starbuck!

And do you know who stars in that series, at least in Season 2? None other than yours truly, Pythia the Oracle of Delphi! I am honored, and I hope you are honored by association. Go ahead, feel the honor.

After you enjoy my column (and the honor!) today please don’t forget to:

think of something to ask Aunt Pythia at the bottom of the page!

By the way, if you don’t know what the hell Aunt Pythia is talking about, go here for past advice columns and here for an explanation of the name Pythia.


Dear Aunt Pythia,

I gave a talk at this year’s JMM in Baltimore. It was one of those super rushed 10-minute talks. But giving any talk at all sufficed for my university to pay for my travel and lodging. That’s not to say that I didn’t take it seriously. I did. I even dressed nice for it, which I don’t normally do as a grad student and mother of a toddler. I bothered to care about a talk that has only enough time to explain its title because this year is an important year for me. It’s my last year of my PhD and I’m applying for postdocs and jobs. It’s why I attended the JMM.

My talk went well enough. I got a few questions at the end and I didn’t go over my time. And that should be the end of it. JMM is over. I can get back to stressing over my dissertation. But I got an email. An email from someone who was in the audience. He wrote to me that he enjoyed my talk and would like to meet me for dinner. He even added that this is “to be clear, a non-math invitation.”

My first thought was that I should send a reply correcting the many grammatical errors I found in his very short email. But that thought quickly changed into anger. I traveled a very long distance to work. I’m taking time away from my research, away from my 2-year-old so that I can present myself professionally to an audience of my peers and potential employers. I hope and expect to be treated like a real scientist. I remembered all the stories, all of the frustration of so many of my friends and colleagues, scientists who also happen to be women, who were treated with anything but respect just because they weren’t born with a penis. I was insulted, furious that some stupid little boy thought that this sort of behavior is appropriate.

But there was always the small chance that he is, in fact, stupid—in certain ways. After all, this is a math conference. There are mathematicians who, while brilliant, may not have (let’s just say) mature social skills. (Though this guy’s probably not too clever since a quick Google search would have revealed that I have a webpage containing a photo of me and my family and therefore not likely to be interested in dating.)

I replied with an invitation to meet for lunch. So that I can verify that he’s not developmentally challenged and confirm his implied intention. And then yell at him to his face. He didn’t end up showing, even though he sounded eager to meet in the multiple emails he sent following my response. He was probably scared away by the large crowd of my friends that had gathered around our meeting place to support me or, more likely, to witness the spectacle.

Most of the men I spoke to about this incident were sympathetic to the poor idiotic horny kid who clearly had no idea how to talk to girls. They recalled some embarrassing moments from their youth and said that I should have just mercifully sent him a gentle rejection.

I, on the other hand, find his action to be a stark example of how women are not taken seriously in science and feel he should be told that this sort of behavior is not excusable. Granted, a public shaming may not have been warranted. But I think that I am right to feel insulted in this situation.

I’m still thinking about emailing this guy and telling him off. My friend (who is usually a feminist) thinks that while the guy had absolutely no tact and needs some guidance on interacting with other humans, finding a speaker attractive and approaching her at a conference is not wrong. He thinks that had the guy joined me and my friends for drinks after my talk and then later admitted to his interest in me, I would not have been offended. I disagree.

What do you think? Am I overreacting?

Scientista (in training)

Dear Scientista,

Wow, that was a really long question, but I decided to publish it all anyway, because I can see you earnestly want my advice. Not so sure you’re going to like my advice though.

Because here’s the thing, you are absolutely overreacting. I mean, that’s ok, and no actual harm done, but what a huge amount of time wasted at JMM where you could have been doing math, drinking bourbon, or playing bridge.

That’s not to say I like what the guy did, it was definitely obtuse to the point of idiocy, but there you have it, he’s an idiot. Best thing to do in that situation is to delete the email and not give it another thought.

I mean, I guess there might have been a side benefit for the rest of the math community in this planned public shaming, if word had gotten out that this guy had written such an unsolicited and unwelcome email. It might have given pause to the 450 other such emails that happened that weekend. Or not.

Also, I think we should be careful to separate your efforts in preparing your talk and coming to the conference, which were real, from this guy’s sexual interest. I’m guessing that, had you gotten 5 emails talking about the math and how awesome it is, and this email to boot, you would have been able to shrug this one off. It’s the unfortunate nature of short talks that they take a lot to prepare for but there’s little chance of getting good feedback. But let’s not take out that frustration on him entirely.

In one way I’d like to defend this guy: at least he made his explicit desires known. It would have been worse, in my opinion, if he’d come up with some math pretext for meeting and then put his hand on your knee at lunch.

Plus, I’d like to take this opportunity to defend sex at math conferences in general. I mean, it’s one of the classic ways of blowing off some steam after a long day of whirlwind 10-minute talks, married or unmarried.

Finally, and I hope this doesn’t sound too harsh, I’d like to give you some general advice. You are a woman in math, which means you are a warrior, even if you didn’t want to sign up for that. And the best and easiest way to be a warrior is to have a thick skin, to remember the victories, and to ignore the defeats.

And I don’t mean stay quiet about awful, actionable sexism that threatens your job or your responsibilities at work, but I do mean deleting idiotic emails without a second thought, from now on.

Good luck!

Aunt Pythia


Dear Aunt Pythia,

Given that the entire financial industry seems to be loaded with unethical behavior, what do you think are ethical ways to invest your money? Certainly choosing credit unions over large banks seems to be a good way for your savings but I am curious about how you would invest for retirement. Do you think there are ethical ways to invest in stocks, bonds, etc?


Serious Pondering About Money

Dear SPAM,

I get asked this a lot, but I don’t have a good answer. And honestly I worry more about people who don’t have any money saved for retirement at all, and are stuck in student or medical debt.

If you really want my advice, I’d say there are three things you could or should worry about regarding savings: liquidity, risk, and ethics. You may have more things you worry about, but this is just a starting point. I’d suggest you divide your money up into those categories, depending on how you weight the associated concerns.

For the liquidity part, keep cash in a savings account (FDIC insured) or a money market account (not FDIC insured). For the risk part, invest in an ETF for the overall market, because we’ve seen that the government props up the market so you want to ride that buffered wave whilst minimizing fees. For the ethical part, track down a company – or even an individual – doing stuff you think is good for the world and invest in it. It’s highly illiquid and highly risky to do that, but you’ve already taken care of those concerns.

Aunt Pythia


Dear Aunt Pythia,

Is it OK to review NSA grant proposals?

You might have seen Beilinson’s letter to the AMS Notices exhorting mathematicians to break ties with the NSA. I kind of sympathize with it. The AMS helps the NSA administer its grants program and I recently got two proposals to referee. These were from young mathematicians that I hold in high regard and think deserve to be funded. As NSF funding is dwindling, if they don’t get the NSA grant they might be unfunded. Moreover, I am knowledgeable about their work and felt that if I turned down the request it would be bad for them, so I decided to review the proposals. Have I done the right thing?

Not Sure Actually

Dear NSA,

I feel your pain. The funding is drying up for these worthy researchers, but you’d rather not feel like a collaborator. Those are directly conflicting issues.

And it’s exactly what I fear when I think of the oncoming MOOC revolution and the end of math research. Who is going to fund math research when calculus is gone? The obvious answer is private companies, private individuals, and places like the NSA. Not a pretty picture.

My best advice for you is to review the proposals because you want those researchers funded – and feel slightly better that they’re doing research external to the NSA – and at the same time get involved with solving the larger funding problem for mathematics. This could mean going to talk to your congressperson about the need for mathematical funding or it could mean spreading the word more generally about the importance of math research.

Aunt Pythia


Dear Auntie Pythia,

The Facebook Data Science folks posted a series of blog posts about love (or at least relationships). As a data scientist and sex oracle, what do you make of the results and/or of the use of social network data for these kinds of studies?


Lots Of Valentine’s Extrapolations

Dear LOVE,

Wow, thanks for the link. I happen to know the author of those posts, Mike Develin, first because he was a (brilliant) student of mine at math camp way back in like 1993, and second because we worked at D.E. Shaw together – although he worked in the California office.

So anyhoo, I like the posts. They’re smart. The one thing I’d say, for example about the age difference of couples in different countries, is that I have to assume there’s a bias away from older middle-aged couples and towards couples where the husband is old and the wife is young. Here’s a picture:

Who is actually on Facebook divulging their marital status?

I say this because, even if both members of the couple are on Facebook (and that already skews somewhat young), I would guess older people are less likely to divulge their marital status. That kind of thing makes me think we should look at these charts with the caveat that they are true “in the context of Facebook data”.

In terms of the ethics of this kind of use of aggregated data, I’d say it’s great. The stuff I think is scary is the stuff that isn’t aggregated and is hidden from us.


Aunt Pythia


Please submit your well-specified, fun-loving, cleverly-abbreviated question to Aunt Pythia!

Categories: Aunt Pythia

The CARD Act works

Every now and then you see a published result that has exactly the right kind of data, in sufficient amounts, to make the required claim. It’s rare but it happens, and as a data lover, when it happens it is tremendously satisfying.

Today I want to share an example of that happening, namely with this paper entitled Regulating Consumer Financial Products: Evidence from Credit Cards (hat tip Suresh Naidu). Here’s the abstract:

We analyze the effectiveness of consumer financial regulation by considering the 2009 Credit Card Accountability Responsibility and Disclosure (CARD) Act in the United States. Using a difference-in-difference research design and a unique panel data set covering over 150 million credit card accounts, we find that regulatory limits on credit card fees reduced overall borrowing costs to consumers by an annualized 1.7% of average daily balances, with a decline of more than 5.5% for consumers with the lowest FICO scores. Consistent with a model of low fee salience and limited market competition, we find no evidence of an offsetting increase in interest charges or reduction in volume of credit. Taken together, we estimate that the CARD Act fee reductions have saved U.S. consumers $12.6 billion per year. We also analyze the CARD Act requirement to disclose the interest savings from paying off balances in 36 months rather than only making minimum payments. We find that this “nudge” increased the number of account holders making the 36-month payment value by 0.5 percentage points.

That’s a big savings for the poorest people. Read the whole paper, it’s great, but first let me show you some awesome data broken down by FICO score bins:

Rich people buy a lot, poor people pay lots of fees.

Interestingly, some people in the middle lose money for credit card companies. Poor people are great customers but there aren't so many of them.

The study compared consumer versus small business credit cards. After CARD Act implementation, fees took a nosedive.


This data, and the results in this paper, fly directly in the face of the myth that if you regulate away predatory fees in one way, they will pop up in another way. That myth is based on the assumption of a competitive market with informed participants. Unfortunately the consumer credit card industry, as well as the small business card industry, is not filled with informed participants. This is a great example of how asymmetric information causes predatory opportunities.

Categories: finance, modeling

JP Morgan suicides and the clustering illusion

Yesterday a couple of people sent me this article about mysterious deaths at JP Morgan. There’s no known connection between them, but maybe it speaks to some larger problem?

I don’t think so. A little back-of-the-envelope calculation tells me it’s not at all impressive, and this is nothing but media attention turned into conspiracy theory with the usual statistics errors.

Here are some numbers. We’re talking about 3 suicides over 3 weeks. According to Wikipedia, JP Morgan has 255,000 employees, and also according to Wikipedia, the U.S. suicide rate for men is 19.2 per 100,000 per year, and for women is 5.5. The suicide rates for Hong Kong and the UK, where two of the suicides took place, are much higher.

Let’s eyeball the overall rate at 19 per 100,000 per year, since the workforce is male-dominated and since many employees are overseas in countries with higher-than-average suicide rates.

Since 3 weeks is about 1/17th of a year, we’d expect to see about 19/17 suicides per 100,000 employees over three weeks, and since we have 255,000 employees, that means about 19/17*2.55 = 2.85 suicides in that time. We had three.
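
For anyone who wants to check that arithmetic, here is a minimal Python sketch. It uses only the figures quoted in this post (255,000 employees, an eyeballed rate of 19 per 100,000 per year, and 3 weeks taken as 1/17th of a year). Treating the count as a Poisson variable is an extra assumption added for illustration, and the function names are made up.

```python
import math

EMPLOYEES = 255_000          # JP Morgan headcount quoted in the post
RATE_PER_100K_PER_YEAR = 19  # eyeballed annual rate from the post
YEAR_FRACTION = 1 / 17       # 3 weeks is about 1/17th of a year


def expected_count(employees, rate_per_100k_per_year, year_fraction):
    """Expected number of events among `employees` during the window."""
    return employees / 100_000 * rate_per_100k_per_year * year_fraction


def prob_at_least(k, mean):
    """P(X >= k) when X is Poisson with the given mean (assumed model)."""
    below = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return 1 - below


expected = expected_count(EMPLOYEES, RATE_PER_100K_PER_YEAR, YEAR_FRACTION)
print(f"expected suicides in the window: {expected:.2f}")
print(f"P(at least 3 under a Poisson model): {prob_at_least(3, expected):.2f}")
```

Under those assumptions, seeing three or more in a three-week window is roughly a coin flip, which is another way of saying that three is about what you would expect.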

This isn’t to say we’ve heard about all the suicides, just that we expect to see about one suicide a week considering how huge JP Morgan is. So let’s get over this, it’s normal. People commit suicide pretty regularly.

It’s very much like how we heard all about suicides at Foxconn, but then heard that the suicide rate at Foxconn is lower than the general Chinese population.

There is a common statistical problem called the clustering illusion, whereby actually random events look clustered sometimes. Here’s a 2-dimensional version of the clustering illusion:

There are little areas that look overly filled with (or strangely devoid of) dots.
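
You can reproduce the effect yourself. The sketch below is only an illustration, not the code behind the picture: it scatters points uniformly at random on a square, splits the square into a grid, and counts the points per cell. The number of points, the grid size, and the seed are arbitrary choices of mine.

```python
import random


def cell_counts(n_points, grid, seed):
    """Drop n_points uniformly at random on the unit square and count
    how many land in each cell of a grid x grid partition."""
    rng = random.Random(seed)
    counts = [[0] * grid for _ in range(grid)]
    for _ in range(n_points):
        x, y = rng.random(), rng.random()
        # min() guards against a floating-point edge case at the border
        row = min(int(y * grid), grid - 1)
        col = min(int(x * grid), grid - 1)
        counts[row][col] += 1
    return counts


counts = cell_counts(n_points=200, grid=5, seed=0)
flat = [c for row in counts for c in row]
print("average per cell:", sum(flat) / len(flat))
print("emptiest cell:", min(flat), "fullest cell:", max(flat))
for row in counts:
    print(" ".join(f"{c:2d}" for c in row))
```

Every cell has the same expected count, yet the printed grid will typically show some cells well above the average and some well below it. Try a few different seeds to see how much the "clusters" move around.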

Actually my calculation above points to something even dumber, which is that we expected 2.85 suicides and we saw 3, so it’s not even a proven cluster. Although it could be, because again we probably didn’t hear about all of them. Maybe it’s a cluster of “really obvious jump-from-a-building” suicides.

And I’m not saying JP Morgan is a nice place to work. I feel suicidal just thinking about working there myself. But I don’t want us to jump to any statistically unsupported conclusions.

Categories: statistics
