mathbabe

When big data goes bad in a totally predictable way

August 19, 2013 Cathy O'Neil, mathbabe 10 comments

Three quick examples this morning in the I-told-you-so category. I’d love to hear Kenneth Neil Cukier explain how “objective” data science is when confronted with this stuff.

1. When an unemployed black woman pretends to be white her job offers skyrocket (Urban Intellectuals, h/t Mike Loukides). Excerpt from the article: “Two years ago, I noticed that Monster.com had added a “diversity questionnaire” to the site. This gives an applicant the opportunity to identify their sex and race to potential employers. Monster.com guarantees that this “option” will not jeopardize your chances of gaining employment. You must answer this questionnaire in order to apply to a posted position—it cannot be skipped. At times, I would mark off that I was a Black female, but then I thought, this might be hurting my chances of getting employed, so I started selecting the “decline to identify” option instead. That still had no effect on my getting a job. So I decided to try an experiment: I created a fake job applicant and called her Bianca White.”

2. How big data could identify the next felon – or blame the wrong guy (Bloomberg). From the article: “The use of physical characteristics such as hair, eye and skin color to predict future crimes would raise ‘giant red privacy flags’ since they are a proxy for race and could reinforce discriminatory practices in hiring, lending or law enforcement, said Chi Chi Wu, staff attorney at the National Consumer Law Center.”

3. How algorithms magnify misbehavior (the Guardian, h/t Suresh Naidu). From the article: “For one British university, what began as a time-saving exercise ended in disgrace when a computer model set up to streamline its admissions process exposed – and then exacerbated – gender and racial discrimination.”

This is just the beginning, unfortunately.

Categories: data science, modeling

Ask Aunt Pythia

August 17, 2013 Cathy O'Neil, mathbabe 3 comments

You know how you sometimes wake up and just feel like the luckiest person in the world? With the awesomest friends and family? And you just wanna go hug everything and everyone?

That is where Aunt Pythia is today, psychically speaking. Aunt Pythia is feeling so good that her usual quarrelsome self is in hiding, and every single piece of her advice is therefore probably useless, but so be it, it feels damn good.

Oh, and one more thing before the worthless drivel revs up: Aunt Pythia has noticed that people close to her don’t enjoy her columns very much at all, possibly because “they get to hear Aunt Pythia’s advice all the time and are frankly sick of it.”

So if you’re someone who does like Aunt Pythia’s advice column, please sing it loud and clear! The best way to express your AP love, of course, is by posing your very own ethical dilemma at the bottom of this column, so Auntie P has something to do next Saturday (she’s running low).

By the way, if you don’t know what the hell I’m talking about, go here for past advice columns and here for an explanation of the name Pythia.

And please, Submit your question for Aunt Pythia at the bottom of this page!

——

Dear Aunt Pythia,

I’m 20 years old, very much a virgin, dating my boy for 2 and a half years and when it comes to the question of having sex, we have had oral sex but not intercourse as we decided it would be best to wait since no one knows about the future. Am i missing out on too much if i wait till i get married in seven years from now (which is a long time of course)?

Strong Headed

Dear Strong,

A few things. First of all, it disturbs me that you are planning so far ahead that you’ve already chosen 7 years as the amount of time before getting married. Where did that come from? That’s a lifetime of adulthood from the point of view of a 20-year-old. Who knows what country you’ll live in in 7 years, or what kind of job you’ll have?

Next, if you pair that with your alleged reason for not getting laid which is “since no one knows about the future”, it makes even less sense that you’re willing to wait for some arbitrary and enormous amount of time before getting down to the business of doing what you supposedly want with your life.

About that – do you actually want to get married? Well I’m not saying you should or shouldn’t, but I am saying you should figure out what you want and do it, and don’t ask other people, and don’t make plans based on random external rules.

Finally, the sex thing. I’m never going to understand why people come to me for sex advice since the one and only thing I ever ever say is “go for it!”.

Unless… unless they are somehow using me as a way of making an excuse to themselves for doing something they actually want to do already – I’m a proxy moral authority, perhaps? It’s happened.

So, if I’m playing that role, then by all means go do what you already want to do, but my real advice is to be your own moral authority next time. Your life, go live it.

Good luck!

Aunt Pythia

——

Dear Aunt Pythia,

When will we see the space elevator in operation?

Carbon diox

Dear Carbon,

Seriously! I am super impatient for that myself. And I appreciate how your question somehow implies that it’s all set to go but nobody’s turned it on yet.

——

Dear Aunt Pythia,

This isn’t a question for AP, but instead a suggestion for a MB post: what are your thoughts on the Colin McGinn case?

Academic Philosophy

Dear Academic,

Tough shit, it came through Aunt Pythia’s feed so that’s what you get.

So actually I had to google Colin McGinn, since I hadn’t heard about it, and I supplied the link I reached above, so if that’s not representative then I apologize.

In any case I’ll comment based on that article.

First, it’s not a huge surprise to me to hear of an academic discipline and culture filled with bullies, which sometimes extends to sexual predation, and that women are excluded from that field for both the bullying reason and the sexual predation reason. This is super consistent with having a crappy and overly aggressive culture.

I’ve never entered the academic discipline of philosophy myself, but something that scares me about the field is the idea that you rely on your intelligence to make your point, rather than any outside evidence, like you might in science, or outside logical fact, like you might in mathematics.

In other words, I like math because it’s filled with people who know how to admit they’re wrong (some subfields of math are better than others at this). I like experimental science because, when they claim something will happen and it doesn’t happen, they have to revise their theory. I don’t like philosophy because arguments are slippery, like this one that Colin McGinn gave as an explanation for sending aggressive sexual requests to his first year graduate student:

Mr. McGinn said that “the ‘3 times’ e-mail,” as he referred to it, was not an actual proposal. “There was no propositioning,” he said in the interview. Properly understanding another e-mail to the student that included the crude term for masturbation, he added later via e-mail, depended on a distinction between “logical implication and conversational implicature.”

“Remember that I am a philosopher trying to teach a budding philosopher important logical distinctions,” he said.

Yuck!

I’m not saying the field can’t recover, but until they work on it, I won’t feel sorry for the fact that women are under-represented.

Auntie P

——

Dear Aunt Pythia,

This question stems from your response from one of last week’s questions (the last one):

“The truth is, once you’ve been politicized and sensitized to the evil that organizations do or are involved in, you start to see it everywhere. Or if not everywhere, at least most places where you get paid.”

I have certainly found this to be true, as a physics student with a long career in retail to help support the student-ing.

Does it get better? Or easier to accept and harder to maintain some abstract idealism? Must this perspective in some way be balanced? I have dreams of grad school and research, but I wonder if even then it will be true that organizations are weird things that involve people behaving in unfortunate ways.

Reading Chomsky doesn’t seem to help.

-rage against the machine

Dear rage,

Great question! I think it does get better, and although it’s hard to maintain a long list of personal heroes when you keep looking behind the curtain and learning too much, I’ve found it’s not impossible to maintain idealism itself. It’s something you need to nurture, though, for sure, and it takes patience – you have to play the long game.

In other words, some people are aware of the hypocrisies and evils of the world and decide it’s too big to deal with so they figure they’ll just ignore it. Other people see that stuff and try to do everything, and they burn out. Other people just don’t see it at all.

I think a middle ground is good: try to do what you can, and make that a long term goal, and have standards you actually live by that help you make decisions. If, for example, you feel complicit in something you consider evil, then get the fuck out, even if it means quitting your job. You’ll get another job, I’m guessing, especially with a physics background and the ability to read Chomsky.

One thing I want to stress: don’t depend on a single person or a couple of persons to embody the ideals that you care about, because they’ll probably end up disappointing you at some point, and that’s not a great reason to throw in the towel. Instead, write up an internal list of your ideals, they’ll never let you down.

Good luck!

Love,

Aunt Pythia

——

Please submit your well-specified, fun-loving, cleverly-abbreviated question to Aunt Pythia!

Categories: Aunt Pythia

What’s the difference between big data and business analytics?

August 16, 2013 Cathy O'Neil, mathbabe 24 comments

I offend people daily. People tell me they do “big data” and that they’ve been doing big data for years. Their argument is that they’re doing business analytics on a larger and larger scale, so surely by now it must be “big data”.

No.

There’s an essential difference between true big data techniques, as actually performed at surprisingly few firms but exemplified by Google, and the human-intervention data-driven techniques referred to as business analytics.

No matter how big the data you use is, at the end of the day, if you’re doing business analytics, you have a person looking at spreadsheets or charts or numbers, making a decision after possibly a discussion with 150 other people, and then tweaking something about the way the business is run.

If you’re really doing big data, then those 150 people probably get ~~fired~~ laid off, or even more likely are never hired in the first place, and the computer is programmed to update itself via an optimization method.

That’s not to say it doesn’t also spit out monitoring charts and numbers, and it’s not to say no person takes a look every now and then to make sure the machine is humming along, but there’s no point at which the algorithm waits for human intervention.

In other words, in a true big data setup, the human has stepped outside the machine and lets the machine do its thing. That means, of course, that it takes way more to set up that machine in the first place, and probably people make huge mistakes all the time in doing this, but sometimes they don’t. Google search got pretty good at this early on.

So with a business analytics setup we might keep track of the number of site visitors and a few sales metrics so we can later try to (and fail to) figure out whether a specific email marketing campaign had the intended effect.

But in a big data setup it’s typically much more microscopic and detail oriented, collecting everything it can, maybe 1,000 attributes of a single customer, and figuring out what that guy is likely to do next time, how much they’ll spend, and the magic question, whether there will even be a next time.
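
To make the contrast concrete, here is a minimal Python sketch of a loop that updates itself with no human sign-off. It uses an epsilon-greedy bandit, which is just one simple optimization method picked for illustration, not a claim about what Google or anyone else actually runs. The class name, the three email variants, and their conversion rates are all made up, and the customers are simulated.

```python
import random


class EpsilonGreedy:
    """Picks among options and updates its own estimates after every outcome."""

    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon            # fraction of rounds spent exploring
        self.counts = [0] * n_arms        # how often each option was tried
        self.values = [0.0] * n_arms      # running average reward per option
        self.rng = random.Random(seed)

    def choose(self):
        # Explore at random occasionally, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))
        return max(range(len(self.values)), key=self.values.__getitem__)

    def update(self, arm, reward):
        # Incremental mean: no spreadsheet, no meeting, no person in the loop.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def run(true_rates, rounds=5000, seed=0):
    """Simulate customers responding to email variants (hypothetical rates)."""
    policy = EpsilonGreedy(len(true_rates), seed=seed)
    customers = random.Random(seed + 1)
    for _ in range(rounds):
        arm = policy.choose()
        reward = 1.0 if customers.random() < true_rates[arm] else 0.0
        policy.update(arm, reward)
    return policy


if __name__ == "__main__":
    policy = run([0.1, 0.3, 0.6])  # made-up conversion rates for 3 variants
    print("times each variant was sent:", policy.counts)
    print("estimated conversion rates:", [round(v, 3) for v in policy.values])
```

A person could still glance at `policy.counts` and `policy.values` as monitoring output, but nothing in the loop ever waits for them.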

So the first thing I offend people about is that they’re not really part of the “big data revolution”. And the second thing is that, usually, their job is potentially up for grabs by an algorithm.

Categories: data science, modeling

Are small businesses less corrupt?

August 14, 2013 Cathy O'Neil, mathbabe 14 comments

I’ve had a bit of a bee in my bonnet for a while now about how we’re expected to assume that big is better when it comes to businesses. It started when I wrote this post about how women CEO’s are considered unambitious for wanting their businesses small enough to manage.

In other words, there might be some selection bias in my next few examples, so full disclosure. And yet I’ll give them to you anyhow.

First, an example of extra corruption in a large business. I recently met HSBC whistleblower Everett Stern, profiled by Matt Taibbi here. He told me about the stuff he’d seen going on in HSBC, whereby there was rampant money laundering for terrorists (his region of interest was the Middle East). When asked why nobody’s gotten into trouble, his answer was simple: too big to jail.

Or if you’re not convinced too-big-to-jail is a real problem, just look at the state of the London Whale case: two low-level indictments and basically nothing else for lying to regulators and changing their books to pretend they had smaller losses.

Next, in the category of it’s-actually-good-to-be-small, you might have seen this tiny New York Times article about two email provider companies which folded rather than giving up their customers’ data. Can anyone imagine Facebook or Google doing that? The big business version of this is “hiring really fancy lawyers” I guess, but it doesn’t seem to work as well.

I’m wondering if this generalizes: can we claim that small companies have less to lose and therefore have more ethics?

It’s certainly true that, at the very least, small companies live and die based on the relationship of trust that they have with their customers, so to the extent that their customers have ethics, then the companies need to consider them. Larger firms, on the other hand, can hire PR firms to fix their image after the fact if things go wrong.

What do you think? Is there research on this?

Update: First of all, sure there’s research on this, if you think accounting fraud is a good proxy for corruption. Second, now that I think about it, small companies having less to lose can also be a super bad thing, if you want to get away with bad shit. And for that matter, if you consider little subsidiaries of big companies as “small companies”, or for that matter McDonalds’ franchises, they already are.

Also, as a friend of mine pointed out over email, small companies are often inefficient (so: no unionization) and are used as a political baby seal to justify all sorts of crappy policies, as we’ve of course seen.

Categories: musing

How to be a pickup artist, Silicon Valley style

August 13, 2013 Cathy O'Neil, mathbabe 6 comments

You know that feeling you get when you’re reading a disembodied article on the web and it’s just so ridiculous, you get the creeping sensation that it’s either from The Onion or the Borowitz Report?

That is, I would suggest, how you’re going to feel when you read this article about a school for Silicon Valley style entrepreneurship (hat tip Peter Woit). Even just the name of the school – the Draper University of Heroes – feels like an Onion article, never mind the visuals:

Students in class at the Draper University of Heroes

So, what do these young people ~~learn~~ do to become douchebag heroes? Here’s what:

They pledge allegiance every morning to their personal brands,
They submit to a full two days of coding and Excel lessons,
Then they get down to the real work of sun tanning by the pool and go-kart racing,
They hang out with VC Tim Draper, an investor in Tesla (the new conspicuous consumption choice among pseudo-progressive capitalists, as I learned at FOO),
They read books, or at least they own books, including Donald Trump’s The Art of the Deal, The Wall Street MBA, and Ayn Rand’s The Fountainhead,
and all this for just $9,500 for an eight week program!

How does it end? From the article:

In lieu of diplomas, Draper U. students receive masks and capes printed with their superhero nicknames and are instructed to jump on each of a series of three small trampolines placed in a line in front of them. While bouncing from trampoline to trampoline, they’re told to shout, “Up, up, and away!” Then they assemble for a group photo.

“The world needs more heroes,” Draper says. “And it just got 40 more of them!”

Here’s the thing. It’s no accident that there are way more men than women here. This school is very similar in design and intent to the society built by Neil Strauss, who wrote The Game and taught a bunch of guys how to pick up “hot” women for sex – Aunt Pythia discussed it here.

Why do I say that? Because it’s fundamentally a confidence-boosting ritual, where a bunch of guys convince themselves that their prospects are good, their goals are attainable, their narcissistic world view is honorable, and it’s just a question of acquiring the right magic tricks to entrap their prey. It just happens to be about money instead of sex in this case.

There is a difference, of course. Whereas the pick up artists only needed to trick drunk women for a few hours in order to sleep with them, these “Silicon Valley Heroes” have to trick way more people for way longer into believing that they should get investment. That doesn’t make it impossible for something like this to work, though, just harder.

Categories: musing, news

Larry Summers and the Lending Club

August 12, 2013 Cathy O'Neil, mathbabe 19 comments

So here’s something potential Fed Chair Larry Summers is involved with, a company called Lending Club, which creates a money lending system that cuts out the middle man banks.

Specifically, people looking for money come to the site and tell their stories, and try to get loans. The investors invest in whichever loans look good to them, for however much money they want. For a perspective on the risks and rewards of this kind of peer-to-peer lending operation, look at this Wall Street Journal article which explains things strictly from the investor’s point of view.

A few red flags go up for me as I learn more about Lending Club.

First, from this NYTimes article, “The company [Lending Club] itself is not regulated as a bank. But it has teamed up with a bank in Utah, one of the states that allows banks to charge high interest rates, and that bank is overseen by state regulators and the Federal Deposit Insurance Corporation.”

I’m not sure how the FDIC is involved exactly, but the Utah connection is good for something, namely allowing high interest rates. According to the same article, 37% of loans are for APR’s of between 19% and 29%.

Next, Summers is referred to in that article as being super concerned about the ability for the consumers to pay back the loans. But I wonder how someone is supposed to be both desperate enough to go for a 25% APR loan and also able to pay back the money. This sounds like loan sharking to me.
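
For a sense of scale, here is a back-of-the-envelope Python sketch of what a 25% APR loan costs. The $10,000 principal and the 36-month term are hypothetical numbers chosen for illustration, not figures from the article, and the sketch assumes the standard fixed-payment amortization formula with monthly compounding and no fees.

```python
def monthly_payment(principal, apr, months):
    """Fixed monthly payment for a fully amortizing loan.

    Assumes the APR is a nominal annual rate compounded monthly, no fees.
    """
    r = apr / 12.0                      # periodic (monthly) interest rate
    if r == 0:
        return principal / months       # interest-free edge case
    return principal * r / (1.0 - (1.0 + r) ** -months)


def total_interest(principal, apr, months):
    """Total interest paid over the life of the loan."""
    return monthly_payment(principal, apr, months) * months - principal


if __name__ == "__main__":
    principal, apr, months = 10_000, 0.25, 36   # hypothetical loan
    pay = monthly_payment(principal, apr, months)
    interest = total_interest(principal, apr, months)
    print(f"monthly payment: ${pay:,.2f}")
    print(f"total interest:  ${interest:,.2f}")
```

Under those assumptions the payment comes to roughly $400 a month, with a bit over $4,000 of interest on top of the $10,000 borrowed.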

Probably what bothers me most though is that Lending Club, in addition to offering credit scores and income when they have that information, also scores people asking for loans with a proprietary model which is, you guessed it, unregulated. Specifically, if it’s anything like ZestFinance, it could use signals more correlated to being uneducated and/or poor than to the willingness or ability to pay back loans.

By the way, I’m not saying this concept is bad for everyone- there are probably winners on the side of the loanees, and it might be possible that they get a loan they otherwise couldn’t get or they get better terms than otherwise or a more bespoke contract than otherwise. I’m more worried about the idea of this becoming the new normal of how money changes hands and how that would affect people already squeezed out of the system.

I’d love your thoughts.

Categories: data science, finance, modeling

Finance and open source

August 11, 2013 Cathy O'Neil, mathbabe 10 comments

I want to bring up two quick topics this morning I’ve been mulling over lately which are both related to this recent post by Economist Rajiv Sethi from Barnard (h/t Suresh Naidu), who happened to be my assigned faculty mentor when I was an assistant prof there. I have mostly questions and few answers right now.

In his post, Sethi talks about former computer nerd for Goldman Sachs Sergey Aleynikov and his trial, which was chronicled by Michael Lewis recently. See also this related interview with Lewis, h/t Chris Wiggins.

I haven’t read Lewis’s piece yet, only his interview and Sethi’s reaction. But I can tell it’ll be juicy and fun, as Lewis usually is. He’s got a way with words and he’s bloodthirsty, always an entertaining combination.

So, the two topics.

First off, let’s talk a bit about high frequency trading, or HFT. My first two questions are, who does HFT benefit and what does HFT cost? For both of these, there’s the easy answer and then there’s the hard answer.

Easy answer for HFT benefitting someone: primarily the people who make loads of money off of it, including the hardware industry and the people who get paid to drill through mountains with cables to make connections between Chicago and New York faster.

Secondarily, market participants whose fees have been lowered because of the tight market-making brought about by HFT, although that savings may be partially undone by the way HFT’ers operate to pick off “dumb money” participants. After all, you say market making, I say arbing. Sorting out the winners, especially when you consider times of “extreme market conditions”, is where it gets hard.

Easy answer for the costs of HFT: they fall on the companies that invest in IT and infrastructure and people to do the work, although to be sure they wouldn’t be willing to make that investment if they didn’t expect it to pay off.

A harder and more complete answer would involve how much risk we take on as a society when we build black boxes that we don’t understand and let them collide with each other with our money, as well as possibly a guess at what those people and resources now doing HFT might be doing otherwise.

And that brings me to my second topic, namely the interaction between the open source community and the finance community, but mostly the HFTers.

Sethi said it ~~well~~ (Cathy: see bottom of this for an update) this way in his post:

Aleynikov relied routinely on open-source code, which he modified and improved to meet the needs of the company. It is customary, if ~~not mandatory~~(Cathy: see bottom of this for an update) for these improvements to be released back into the public domain for use by others. But his attempts to do so were blocked:

Serge quickly discovered, to his surprise, that Goldman had a one-way relationship with open source. They took huge amounts of free software off the Web, but they did not return it after he had modified it, even when his modifications were very slight and of general rather than financial use. “Once I took some open-source components, repackaged them to come up with a component that was not even used at Goldman Sachs,” he says. “It was basically a way to make two computers look like one, so if one went down the other could jump in and perform the task.” He described the pleasure of his innovation this way: “It created something out of chaos. When you create something out of chaos, essentially, you reduce the entropy in the world.” He went to his boss, a fellow named Adam Schlesinger, and asked if he could release it back into open source, as was his inclination. “He said it was now Goldman’s property,” recalls Serge. “He was quite tense. When I mentioned it, it was very close to bonus time. And he didn’t want any disturbances.”

This resonates with my experience at D.E. Shaw. We used lots of python stuff, and as a community were working at the edges of its capabilities (not me, I didn’t do fancy HFT stuff, my models worked at a much longer time frame of at least a few hours between trades).

The urge to give back to the OS community was largely thwarted, when it came up at all, because there was a fear, or at least an argument, that somehow our competition would use it against us, to eliminate our edge, even if it was an invention or tool completely sanitized from the actual financial algorithm at hand.

A few caveats: First, I do think that stuff, i.e. python technology and the like, eventually gets out to the open source domain even if people are consistently thwarting it. But it’s incredibly slow compared to what you might expect.

Second, it might be the case that python developers working outside of finance are actually much better at developing good tools for python, especially if they have some interaction with finance but don’t work inside. I’m guessing this because, as a modeler, you have a very selfish outlook and only want to develop tools for your particular situation. In other words, you might have some really weird looking tools if you did see a bunch coming from finance.

Finally, I think I should mention that quite a few people I knew at D.E. Shaw have now left and are actively contributing to the open source community now. So it’s a lagged contribution but a contribution nonetheless, which is nice to see.

Update: from my Facebook page, a discussion of the “mandatoriness” of giving back to the OS community from my brother Eugene O’Neil, super nerd, and friend William Stein, other super nerd:

Eugene O’Neil: the GPL says that if you give someone a binary executable compiled with GPL source code, you also have to provide them free access to all the source code used to generate that binary, under the terms of the GPL. This makes the commercial sale of GPL binaries without source code illegal. However, if you DON’T give anyone outside your organization a binary, you are not legally required to give them the modified source code for the binary you didn’t give them. That being said, any company policy that tries to explicitly PROHIBIT employees from redistributing modified GPL code is in a legal gray area: the loophole works best if you completely trust everyone who has the modified code to simply not want to distribute it.

William Stein: Eugene — You are absolutely right. The “mandatory” part of the quote: “It is customary, if not mandatory, for these improvements to be released back into the public domain for use by others.” from Cathy’s article is misleading. I frequently get asked about this sort of thing (because of people using Sage (http://sagemath.org) for web backends, trading, etc.). I’m not aware of any popular open source license that make it mandatory to give back changes if you use a project internally in an organization (let alone the GPL, which definitely doesn’t). The closest is AGPL, which involves external use for a website. Cathy — you might consider changing “Sethi said it well…”, since I think his quote is misleading at best. I’m personally aware of quite a few people that do use Sage right now who wouldn’t otherwise if Sethi’s statement were correct.

Categories: finance, open source tools

Ask Aunt Pythia

August 10, 2013 Cathy O'Neil, mathbabe 3 comments

Hey it’s Saturday and unlike last week, I know it! That means it’s time for Aunt Pythia to spew forth her ill-considered advice to thoroughly nice people such as yourself.

By the way, if you don’t know what the hell I’m talking about, go here for past advice columns and here for an explanation of the name Pythia.

And please, Submit your question for Aunt Pythia at the bottom of this page!

——

Dear Aunt Pythia,

I am a 48-year-old newly single mother of teenagers. I finally have time to date and am seeking the statistically most successful way to meet single, available men. I do not like to hang out in bars and my “sports” interests are ballet and yoga–not anywhere any heterosexual men usually hang out. I do love wine but joining a wine “club” would be prohibitively expensive, and a book club is also not where available men can be found. Do you suggest I take up new activities in my life to meet men? And if so, which ones would maximize my chances in my age group and my proclivity to be introverted? Please do not suggest match.com–it was a disaster.

Thanks,
Statistically Seeking Mr. Right

Dear SSMR,

I suggest you take up a nerd sport, like learning a programming language – python?. Join a python meetup group in your area and go to some meetings and wait for a super nice nerd to show up. Note: super nice nerds might not talk a lot, so you might need to be patient and/or draw them out.

Good luck!

Aunt Pythia

——

Dear Aunt Pythia,

I’m an incoming senior undergrad CS student at Columbia.

This summer, I’m very fortunate to be working on some very interesting problems in data science, learning a ton, and implementing and testing a lot of models of my own. It’s more research/science type stuff, rather than software engineering, and I really want to continue to do this (while being compensated) after graduation next year.

The problem is, I’ve never once considered grad school (I’m really not an academic type and I love working with real data in real companies). Is it possible for a new graduate to get a research-type data science job, or at least mostly research-type, without a further degree? More importantly, I’d like to work on interesting problems, that hopefully will benefit the greater good, at least in some way.

If so, where do these jobs lie, and how can I get there?

Fledgling Scientist

Dear Fledgling Scientist,

It’s interesting how, at least for you, there’s a disconnect between the desire to be doing abstract research and the desire to be at grad school. What does that say about the reputation of grad school? What does it symbolize to you if not doing abstract research? Would you reconsider that?

Here’s the thing. I’ma be honest with you, most research doesn’t pay for itself. Indeed it’s pretty rare for research to pay off. So companies, especially start-ups that don’t have extra money floating around, will not pay for people to be abstract researchers, even if they’re proven professionals (i.e. they have Ph.D.’s and lots of papers).

Even in my job, where I’m an experienced researcher in math, and to some extent it’s my job to be a researcher in data science, it’s not abstract at all – I’m trying to figure out how to start a business in data science that will create a revenue stream of real cash money.

I don’t want to be completely negative, so here’s an idea for you that doesn’t require grad school. Get a job that pays pretty well but isn’t full time and do research in your spare time. It might not pay off cash-wise but it could very well make you money. And after a while you might decide that getting a Ph.D. would suit you.

Good luck!

Aunt Pythia

——

Dear Aunt Pythia,

First of all: happy belated birthday!

In a couple of weeks I’m going to be taking part in a really awesome program at my university that brings low-income/first-generation college students to campus for a week to work on a research project with a lab of their choice before starting here in the fall. I get to be their Resident Assistant during this time and help them out with their lab projects/presentations. I’m feeling incredibly excited but also incredibly nervous about starting this! For example, I keep having imaginary conversations with hypothetical students in one or another life-situation with the aim of trying to figure out what’s the best possible advice/consolation I could offer them in that theoretical moment (this is just symptomatic of how math has drilled my brain to think about everything. I’m not actually crazy).

But whenever I overanalyze something to this extent I tend to become aloof and disconnected from the reality of it when it actually happens. It’s really important to me that I DO NOT DO THIS because I would love to be able to keep interacting regularly with these kids once the program finishes and I don’t want them to think of me as that weird guy who shakes his gravelly hands and mumbles whenever they bring up an academic/personal problem that I might *actually* be able to help them with (given my own crazy and nonlinear experience). So how can I avoid doing this? How can I keep it real? Any other nuggets of wisdom you’d like to offer me going into this would also be greatly appreciated!

Derp dErp deRp derP

Dear Dddd,

Thanks for the birthday wishes! I had a great one.

Let’s see. Your job is to help kids with research projects, and you want to do a good job and keep it real, and you want to keep in touch with them. They’re also low-income/first-generation college students.

My first piece of advice is to be nice and to articulate very loudly that you’re here to help and you want to make yourself available to them. That is always appreciated by people who don’t know what they’re doing. My next suggestion is to assume they are nerds, here to learn, and want to be challenged as well as to impress. So get ready to be impressed, and be sure to give positive feedback when it’s appropriate. People really love that stuff.

Third, you mention you have had a crazy and non-linear experience yourself. It might help them to know that, to relate to you, because chances are they might have moments of feeling out of place. But I’d wait on telling them until it’s one-on-one and you’ve already established a friendship and mutual respect. For example, it’d be a good time to mention this if they’ve come to you in a panic because they’ve been feeling in over their head but know they can rely on you for advice. And also for example, it’d be a bad time to mention this on the first day when they’re all just meeting you, because it’d come across as you not expecting much from them.

Finally, it’s always fun to work with young people, so have a great time! Feed off their energy and they’ll feed off of your wisdom.

Aunt Pythia

——

Dear Aunt Pythia,

I recently finished up a masters in applied mathematics. I also recently left the Air Force to stop being a part of an organization that does awful awful things. I am now trying to find a job that hopefully uses my recent degree and avoids working for an organization that does awful things. Currently this means I am teaching small children to ice skate and play hockey which is great but doesn’t quite fill up the day or have much of any direct connection to math.

I am wondering what I could do and where I could look to avoid being chewed up by the military-industrial complex or other such entity? (see: financial sector) I’ve been looking at teaching jobs and been avoiding the thought of going for a PhD (so far, that bug will bite soon I’m sure), but I wondered if there might be other options I haven’t thought of. Any advice would be greatly appreciated.

Will Math to Feed Book Habit

Dear Will,

Yikes.

The truth is, once you’ve been politicized and sensitized to the evil that organizations do or are involved in, you start to see it everywhere. Or if not everywhere, at least most places where you get paid.

So if you’re dead-set on not being part of that stuff at all, your options are limited. For example, working at Google might not be a good idea for you since we don’t really know what they do. Facebook is pretty much a no fly zone, depending on what it is you have objections to. Start-ups often participate in weird shit in ways they don’t want to acknowledge (and sometimes don’t – you should be on the look-out for a good job at a small start-up in any case).

Here’s my suggestion: do math tutoring. I know people who get paid pretty darn well for math tutoring, especially for wealthy kids. And yes, there are issues around that too, of course, but on the other hand you know exactly what you’re getting yourself into, and you’re pretty much independent. Plus you’ve already shown you can work with kids, so it might be an easy transition. Over time you can start a math tutoring company and run it with no ties to anyone you don’t like.

Auntie P

——

Dear Aunt Pythia,

I was in Penn Station today around 6pm and a guy came up to me and asked me if he could ask me a question. I said okay, and well he first asked me if I spoke English, and then he said he needed money to take the train to Patchogue (which I later looked up costs $12.75). I wasn’t sure what to do, and I just reflexively I guess asked him how much the fare was and he said 11.75 and well, then I took out my wallet and had 12 bucks so I gave it to him and he thanked me and walked away (I had to catch my own train elsewhere and so I don’t know whether he bought a ticket to Patchogue or not).

After he walked away, I felt a bit silly for giving him so much – I could have just said no, but I often have a hard time saying no – and felt like I hadn’t stood up for myself, and had given him the money so that he would go away (I felt threatened/intimidated by him because of reasons that aren’t PC to mention; but there were plenty of people around so I didn’t consider myself to be in imminent danger).

At the same time, I tried to make myself feel better by reminding myself that I can’t take any money with me when I die, and that I expect to die with more than 12 bucks in my name, so in the end it doesn’t matter, and maybe he was having a rough time so perhaps I did my good deed for the day.

On the other hand, giving money away like that just encourages people/panhandlers to ask, maybe it is a scam (btw this would be the second time within the last year I was asked in Penn Station such a question (I said no the previous time but it was a lady that was asking so I didn’t feel threatened), and I’m only in there about once a week) and so sometimes I say no to such requests.

So my question is, what would AP do? (Oh, if it matters, I make $70,000 a year, and have no dependents). And what does MB do when asked by panhandlers for spare change?

Penn $tation

Dear Penn,

First of all, I like that you gave the dude $12 – I’ve been scammed before – plus, I like your “death bed” reasoning as well, it makes sense to me. I don’t think you need to feel weird or ashamed of what you’ve done.

On the other hand, it’s not what I do. I never give scary men money because they’re threatening me, whether they’re black or white, on principle, and I’ve never had a problem with saying no. In fact I almost never give money to strangers at all, except when they’re older women who seem like they’ve been thrown out of mental institutions. Then I often give them $20 bills, and they’re often not even asking for them because they’re so confused.

Since I live and work in New York and commute to work via subway most days, giving money to everyone who asks me for it could actually be bad for my family over time. But that’s not why I don’t do it. Mostly I don’t do it because, having worked in soup kitchens and having read enough about childhood poverty and hunger, I know that the people who need petty cash the most aren’t the ones asking for it in Penn Station. It’s way bigger than that, unfortunately.

I do buy my broke friends stuff, mostly food, and I give money to causes like Fair Foods in Boston that I have a personal connection to and which I think address the immediate needs of poor immigrants and children.

Yours,

Aunt Pythia

——

Please submit your well-specified, fun-loving, cleverly-abbreviated question to Aunt Pythia!

Categories: Aunt Pythia

New schwag for the Stacks Project

August 9, 2013 Cathy O'Neil, mathbabe 3 comments

This just in from zazzle.com: Stacks Project cups and shirts, to celebrate the recent upgrade on Stacks Project viz.


An unflattering yet adorable picture of both me and Johan in Stacks Project t-shirts, new and old. Unfortunately the colors came out too light. You can see we’re both instructing our 11-year-old on how to take a picture and our 4-year-old is squeezing in too. He’s not shy.

Fuzzy yet awesome.

Categories: musing

Survivorship bias for women in men’s fields

August 8, 2013 Cathy O'Neil, mathbabe 7 comments

I like this essay written by Annie Gosfield, a self-described “composeress”, which is her word to mean a female composer. She finds it slightly absurd to be singled out for her femaleness. Her overall take on being a woman in a man’s world is refreshing, and resonates with me as a woman in math and technology.

From her essay:

I’ve never considered myself a “woman composer,” but I suspect that over the years being female has helped more than it’s hurt. Being a woman (and having high hair) has made me easier to recognize, easier to remember and has spared me from fitting into the generic description of a composer: “medium build, dark hair, glasses, beard.” I will admit to liking the invented honorific term “composeress.” (It sounds archaic, grand, and slightly ridiculous, just as a gender-specific title for a composer should.)

So, great for her, and wonderful that from her perspective she feels propelled rather than suffocated by her otherness status. To some extent I agree from my own experience.

But having said that, it doesn’t mean that other women, possibly many other women, haven’t been squeezed out, or have selected out, because of their female status. After all, we hear way more from the people who stay and “succeed”, which tends to give us massive survivorship bias.

Indeed, and to be nerdy and true to form, we can almost think about measuring the extent to which there is a weeding-out effect of women by asking the survivors the extent to which they identify as “women” versus the population at large. I think we’d find that the women who survive in nearly all-male environments have developed, or were born with, coping mechanisms which allow them to ignore their own otherness.

I know that was true of me – when I was in grad school at Harvard, I went through a distinct phase of wanting to wear men’s clothing, or at least gender neutral clothing – so jeans, t-shirts, leather shoes, never dresses – to be externally more consistent with how I felt inside. Not that I was sexually identified with men, but that I didn’t want to be seen as primarily feminine. Instead I wanted to be seen as primarily a mathematician.

Does it make me a freak, to wear men’s clothing and (sometimes) wish I could grow a beard? Possibly, although over time it’s changed, and nowadays I take pride in my femininity, and in fact I think much of my power emanates from it.

But it does give me pause when I hear successful women in men’s fields talking about how great it is to be a woman and how surprising all the attention is. We still seem to be contorting ourselves in an effort to not seem too womanly, and this makes me think it’s entirely un-coincidental, and possibly a crucial part of what allows us to succeed. Besides talent and hard work, of course. And I don’t think it’s undue attention at all – I think it’s just something we train ourselves not to consider because focusing on it too much could be paralyzing.

By the way, I’m not doing justice to Annie Gosfield’s essay, which you should read in its entirety and which has nuanced things to say about otherness in the field of composing.

Categories: women in math

Minorities possibly unfairly disqualified from opening bank accounts

August 7, 2013 Cathy O'Neil, mathbabe 9 comments

My friend Frank Pasquale sent me this article over twitter, about New York State attorney general Eric T. Schneiderman’s investigation into possibly unfair practices by big banks using opaque and sometimes erroneous databases to disqualify people from opening accounts.

Not much hard information is given in the article but we know that negative reports stemming from the databases have effectively banished more than a million lower-income Americans from the financial system, and we know that the number of “underbanked” people in this country has grown by 10% since 2009. Underbanked people are people who are shut out of the normal banking system and have to rely on the underbelly system including check cashing stores and payday lenders.

I can already hear the argument of my libertarian friends: if I’m a bank, and I have reason to suspect you have messed up with your finances in the past, I don’t offer you services. Done and done. Oh, and if I’m a smart bank that figures out some of these so-called “past mistakes” are actually erroneously reported, then I make extra money by serving those customers that are actually good when they look bad. And the free market works.

Two responses to this. First, at this point big banks are really not private companies, being on the taxpayer dole. In response they should reasonably be expected to provide banking services to all if not most people as part of a service. Of course this is a temporary argument, since nobody actually likes the fact that the banks aren’t truly private companies.

The second, more interesting point – at least to me – is this. We care about and defend ourselves from our constitutional rights being taken away but we have much less energy to defend ourselves against good things not happening to us.

In other words, it’s not written into the constitution that we all deserve a good checking account, nor a good college education, nor good terms on a mortgage, and so on. Even so, in a large society such as ours, such things are basic ingredients for a comfortable existence. Yet these services are rare if not nonexistent for a huge and swelling part of our society, resulting in a degradation of opportunity for the poor.

The overall effect is heinous, and at some point does seem to rise to the level of a constitutional right to opportunity, but I’m no lawyer.

In other words, instead of only worrying about the truly bad things that might happen to our vulnerable citizens, I personally spend just as much time worrying about the good things that might not happen to our vulnerable citizens, because from my perspective lots of good things not happening add up to bad things happening: they all narrow future options.

Categories: modeling, news, rant

Should lawmakers use algorithms?

August 5, 2013 Cathy O'Neil, mathbabe 18 comments

Here is an idea I’ve been hearing floating around the big data/tech community: the idea of having algorithms embedded into law.

The argument for is pretty convincing on its face: Google has gotten its algorithms to work better and better over time by optimizing correctly and using tons of data. To some extent we can think of their business strategies and rules as a kind of “internal regulation”. So why don’t we take a page out of that book and improve our laws and specifically our regulations with constant feedback loops and big data?

No algos in law

There are some concerns I have right off the bat about this concept, putting aside the hugely self-serving dimension of it.

First of all, we would be adding opacity – of the mathematical modeling kind – to an already opaque system of law. It’s hard enough to read the legalese in a credit card contract without there also being a black box algorithm to make it impossible.

Second of all, whereas the incentives in Google are often aligned with the algorithm “working better”, whatever that means in any given case, the incentives of the people who write laws often aren’t.

So, for example, financial regulation is largely written by lobbyists. If you gave them a new tool, that of adding black box algorithms, then you could be sure they would use it to further obfuscate what is already a hopelessly complicated set of rules, and on top of it they’d be sure to measure the wrong thing and optimize to something random that would not interfere with their main goal of making big bets.

Right now lobbyists are used so heavily in part because they understand the complexity of their industries more than the lawmakers themselves. In other words, they actually add value in a certain way (besides in the monetary way). Adding black boxes would emphasize this asymmetric information problem, which is a terrible idea.

Third, I’m worried about the “black box” part of algorithms. There’s a strange assumption among modelers that you have to make algorithms secret or else people will game them. But as I’ve said before, if people can game your model, that just means your model sucks, and specifically that your proxies are not truly behavior-based.

So if it pertains to a law against shoplifting, say, you can’t have an embedded model which uses the proxy of “looking furtive and having bulges in your clothes.” You actually need to have proof that someone stole something.

If you think about that example for a moment, it’s absolutely not appropriate to use poor proxies in law, nor is it appropriate to have black boxes at all – we should all know what our laws are. This is true for regulation as well, since it’s after all still law which affects how people are expected to behave.

And by the way, what counts as a black box is to some extent in the eye of the beholder. It wouldn’t be enough to have the source code available, since that’s only accessible to a very small subset of the population.

Instead, anyone who is under the expectation of following a law should also be able to read and understand the law. That’s why the CFPB is trying to make credit card contracts be written in Plain English. Similarly, regulation law should be written in a way so that the employees of the regulator in question can understand it, and that means you shouldn’t have to have a Ph.D. in a quantitative field and know Python.

Algos as tools

Here’s where algorithms may help, although it is still tricky: not in the law itself but in the implementation of the law. So it makes sense that the SEC has algorithms trying to catch insider trading – in fact it’s probably the only way for them to attempt to catch the bad guys. For that matter they should have many more algorithms to catch other kinds of bad guys, for example to catch people with suspicious accounting or consistently optimistic ratings.

In this case proxies are reasonable, but on the other hand it doesn’t translate into law but rather into a ranking of workflow for the people at the regulatory agency. In other words the SEC should use algorithms to decide which cases to pursue and on what timeframe.

Even so, there are plenty of reasons to worry. One could view the “Stop & Frisk” strategy in New York as following an algorithm as well, namely to stop young men in high-crime areas that have “furtive motions”. This algorithm happens to single out many innocent black and latino men.

Similarly, some of the highly touted New York City open data projects amount to figuring out that if you focus on looking for building code violations in high-crime areas, then you get a better hit rate. Again, the consequence of using the algorithm is that poor people are targeted at a higher rate for all sorts of crimes (key quote from the article: “causation is for other people”).

Think about this asymptotically: if you live in a nice neighborhood, the limited police force and inspection agencies never check you out since their algorithms have decided the probability of bad stuff happening is too low to bother. If, on the other hand, you are poor and live in a high-crime area, you get checked out daily by various inspectors, who bust you for whatever.

Said this way, it kind of makes sense that white kids smoke pot at the same rate as black kids but are almost never busted for it.

There are ways to partly combat this problem, as I’ve described before, by using randomization.
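To make the randomization idea concrete, here is a minimal sketch, not the scheme from the earlier post. It assumes a regulator has a model score for every site and a fixed inspection budget, spends most of that budget on the top-scored sites, and reserves a slice for sites drawn uniformly at random from everything else. The function name, the `explore_fraction` parameter, and the site names are all made up for illustration.

```python
import random

def allocate_inspections(risk_scores, budget, explore_fraction=0.2, seed=None):
    """Split an inspection budget between model-ranked and random sites.

    risk_scores: dict mapping a site name to a model score (hypothetical input).
    explore_fraction: share of the budget spent on sites drawn uniformly at
    random from the sites the model did NOT put at the top of its ranking.
    Returns (targeted_sites, randomly_chosen_sites).
    """
    rng = random.Random(seed)
    budget = min(budget, len(risk_scores))
    n_random = int(round(budget * explore_fraction))
    n_targeted = budget - n_random

    # Highest scores first; the model's favorites fill the targeted slots.
    ranked = sorted(risk_scores, key=risk_scores.get, reverse=True)
    targeted = ranked[:n_targeted]

    # Everyone else has an equal chance of a random visit.
    remainder = ranked[n_targeted:]
    random_picks = rng.sample(remainder, min(n_random, len(remainder)))
    return targeted, random_picks


# Ten hypothetical sites with scores 0.0 through 0.9.
scores = {"site_%d" % i: i / 10 for i in range(10)}
targeted, random_picks = allocate_inspections(scores, budget=5,
                                              explore_fraction=0.4, seed=0)
print("targeted:", targeted)
print("random:  ", random_picks)
```

The random slice does two jobs: nobody is guaranteed never to be checked, and the hit rate on the random visits gives a rough baseline to compare the model’s targeted hit rate against.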

Conclusion

It seems to me that we can’t have algorithms directly embedded in laws, because of the highly opaque nature of them together with commonly misaligned incentives. They might be useful as tools for regulators, but the regulators who choose to use internal algorithms need to carefully check that their algorithms don’t have unreasonable and biased consequences, which is really hard.

Categories: data science, finance, modeling

Ask Aunt Pythia – special Sunday edition

August 4, 2013 Cathy O'Neil, mathbabe 1 comment

Guys, I messed up. I have been traveling two weeks in a row and I plumb forgot what day it was yesterday and thus, sadly, ignored my inner Aunt Pythia and her advice. I’m making up for it now, and I’m sending out major league apologies to people who were disappointed by the bullshit complaint about Indiana school politics yesterday instead of the sass you’ve grown to love from Auntie P.

By the way, if you don’t know what the hell I’m talking about, go here for past advice columns and here for an explanation of the name Pythia.

And please, submit your question for Aunt Pythia at the bottom of this page!

——

Dear Aunt Pythia,

I completed a BA in economics a number of years ago (well before the economy went to heck-in-a-handbasket), but didn’t immediately pursue a graduate degree. Instead of focusing on my career, I dedicated myself to a charity project–building a community school in a very poor country–which took a lot of my time and financial resources. Now, the project is up and running on its own and I’m thinking again about career paths (in order to be able to fund bigger and better philanthropic works, if nothing else).

I’ve had the obvious thought of continuing my education with a MA or PhD program, but I’m not entirely convinced that doing so will actually improve my prospects for landing a plum job. It will, on the other hand, be sure to cost me plenty of moola. What do you think: is going into debt in order to obtain an advanced degree a wise financial decision in this economic climate? If not, what other steps do you think would be helpful for an underemployed intellectual looking to move out of manual labor and into something more “white collar,” ideally without having to sell his/her soul?

Or maybe it’s just that some of us are stuck down here on the lower rungs of the income distribution and had better just get used to it. That is what I tend to think, but I’ve been accused of pessimism before and thought maybe you might have something less depressing to suggest.

Feeling Out Obvious Limits

Dear FOOL,

I gotta say, I’m not sure. I’m not an expert on jobs in Econ. But I’ll tell you what, if it’s like math, it’s not kind to people who take time off. I think this is a huge mistake, and obviously one that affects women more than men. If math, as a community, were serious about attracting good women, they’d change this bias. But I don’t see that happening soon. Ditto with probability 90% for Econ.

Having said that, it sounds like what you’ve accomplished is real, and although it’s possibly invisible to certain academic communities, I’d bet it isn’t to others, like the business community. If you’re a quantitative person who’s built a working charity (amazing!), then you could probably convince someone to give you a good job.

How about you look into getting a masters degree in something you’re interested in that’s also quantitative, and then rebuild yourself as an experienced team-builder?

Good luck!

Aunt P

——

Dear Pythia,

A Platonic friend from undergrad analysis class and I were walking on the beach together one sunny day, several years ago. She suggested we take our shoes and socks off and wade in the water, which we did. When it came time to put our shoes back on, while deftly balancing on one foot like a flamingo, I dried off my free foot with a sock, put the sock on, then the shoe, then repeated the process for the other foot, all without a hitch. Whether real, or possibly feigned premeditatedly, my companion was exhibiting quite the struggle a few feet away. Perhaps because I am more attracted to skill and independence than incompetence and dependence, I just stood by and watched. Would you agree that this was the right thing to do, or am I in for a scolding instead?

Free Bird

Dear Free,

This is a great example of a question that says more about the questioner than anything about the question.

Putting that aside, and to answer your putative question: you have no obligation to help a grownup put on their socks. But you do have an obligation to forget about how a friend puts on their socks within at most 2 days, and you have a definite obligation to not judge them for their sock-putting-on-technique on a sandy beach. Plus, it wasn’t a way to get into your pants, if that’s what you mean.

Good luck,

Auntie P

——

Dear Aunt Pythia,

I have a question about a question you actually answered (see last question answered here) for your revival.

If ‘D’ stands for ‘Dry’ and ‘G’ stands for ‘Got laid’, don’t you actually think that there would be some sort of stickiness (or state-dependence) coming in? I mean, I have the impression – maybe fallacious – that there is some sort of cold feet effect with getting laid: once you’ve entered the ‘dry’ state, your probability to remain in that state is actually increasing.

In other words, don’t you think that $Pr(D_t | D_{t-1})$ is actually increasing with $t$ ? How would you test for that?

There are several mechanisms behind that I think (and I will speak for myself here): it’s becoming more and more obvious that you’re sex-starved, and this is a big put-off, because that may be interpreted as being a lousy lover. You may also have less and less patience for the required chitchat before the physical fun etc.

The above may hold for males but not for females.

I’m not so sure about the other conditional probability $Pr(G_t | G_{t-1})$ mainly because I’ve little experience in staying very long in the ‘G’ state; but would be curious to know more about it.

Cheers,

Canada Dry

Dear Canada,

Great points! And eminently modelable, which I appreciate, although the data collection would be a bitch, especially considering how much people lie about getting laid (see first answer here).
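Here is one way the test could go, as a sketch only. It assumes you somehow had an honest period-by-period record for each person, written as a string with one character per period, ‘D’ or ‘G’. That data format and the function name are invented for illustration. The idea is to estimate, for each dry-spell length k, how often a spell that has already lasted k periods lasts one more. In a plain two-state Markov chain that rate is the same for every k; stickiness would show up as a rate that climbs with k.

```python
from collections import defaultdict

def stay_dry_rates(history):
    """Count how often a dry spell of length k continues for one more period.

    history: a string such as "GDDDGDG", one character per period,
    'D' for dry and 'G' for got laid (hypothetical data format).
    Returns {k: (times_stayed_dry, times_observed)}.
    """
    counts = defaultdict(lambda: [0, 0])
    run = 0  # length of the dry spell ending at the previous period
    for state in history:
        if run > 0:
            counts[run][1] += 1          # we saw a spell of length `run`...
            if state == "D":
                counts[run][0] += 1      # ...and it continued
        run = run + 1 if state == "D" else 0
    return {k: tuple(v) for k, v in sorted(counts.items())}


for k, (stayed, seen) in stay_dry_rates("GDDDGDG").items():
    print(k, stayed, seen, stayed / seen)
```

With real data you would pool the counts over many people and check whether the ratio rises with k beyond what noise alone would produce, for example by comparing against histories simulated from a constant-probability chain.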

I don’t agree that the underlying effect doesn’t affect women though. The concept that “if I haven’t gotten laid in a long time my chances are actively going down” definitely seems true for many of my friends, male and female, and I don’t think it’s because they are perceived as lousy lovers.

After all, it’s not like there’s a ticker tape on their foreheads counting up the seconds since their last sexual encounter. Instead, I think it’s part pheromones and part self-regard. If you feel unattractive, you don’t act like a sexy thang and people are less likely to approach you.

Similarly, if you’ve gotten laid recently, you feel sexy, which makes you act like a sexy person, which is hot in itself, and also you have sex pheromones dripping off of you, which attracts the opposite sex like flies to a lightbulb.

By the way, if you’re a woman and you want a leg up on the process, may I suggest you buy synthetic female pheromones from the Athena Institute. Some of my friends swear by this, and claim it makes men desire them and/or be nice to them. Let’s say it this way: it either works or it works as a placebo.

One last thing: I think the community you live in makes a big difference for these dependent probabilities. If you have been dry for a long time but you have a good set of wingwomen or wingmen, then you’re way better off than if you’re isolated socially.

Good luck, Canada Dry! Go hang with your buddies and get them on board for your worthy cause!

Auntie P

——

Dear Aunt Pythia,

I spent my childhood as a lonely nerd with no friends. Over college and beyond I made friends and learned to have deep, meaningful relationships with people. Then I spent a few years working at a nonprofit, making the world a better place. I made a lot of money while helping to ease the pain associated with a number of types of cancer. And now I’m in my late 30s and rich.

I want to experience the shallow life that I see so many people around me enjoying but I have no idea how to do it. I’d try to buy my way in, but I don’t know where to begin. I’ve heard that girls go for guys with money, but don’t know where to find these girls.

Seriously, I need help being superficial for a while.

Want to be shallow

Dear WtbS,

Please let me be the first person to tell you that you’re already quite superficial. Congratulations!

Just the way you’re talking about “girls” makes me kind of gag, as if they’re lego parts that can be bought, traded, and sold. Plus you also sound crazy smug about your accomplishments, another strong signal for superficiality. So I honestly don’t think I need to give you any more advice on that front.

What I think you actually are wondering is how to be happy, or possibly happy in a hedonistic way. But the sneaky little thing about really enjoying a hedonistic lifestyle is, in my opinion, that you have real connections with the other people in your company. Otherwise you might just wake up feeling empty and crappy. It’s fun to do stupid sexy things with your friends if everyone’s into it, it’s not fun to do stupid sexy things with strangers whose motivations you don’t know, especially if you’re young and rich, because even if you don’t know, I will.

So my advice: go back to your college-aged talents and make deep connections with people who are also fun-loving and slightly crazy. It will take a few months but you might just be able to live like a fucking rock star.

Good luck!

Aunt Pythia

——

Please submit your well-specified, fun-loving, cleverly-abbreviated question to Aunt Pythia!

Categories: Aunt Pythia

Educational accountability scores get politically manipulated again

August 3, 2013 Cathy O'Neil, mathbabe 1 comment

My buddy Jordan Ellenberg just came out with a fantastic piece in Slate entitled “The Case of the Missing Zeroes: An astonishing act of statistical chutzpah in the Indiana schools’ grade-changing scandal.”

Here are the leading sentences of the piece:

Florida Education Commissioner Tony Bennett resigned Thursday amid claims that, in his former position as superintendent of public instruction in Indiana, he manipulated the state’s system for evaluating school performance. Bennett, a Republican who created an A-to-F grading protocol for Indiana schools as a way to promote educational accountability, is accused of raising the mark for a school operated by a major GOP donor.

Jordan goes on to explain exactly what happened and how that manipulation took place. Turns out it was a pretty outrageous and easy-to-understand lie about missing zeroes which didn’t make any sense. You should read the whole thing: Jordan is a great writer, and his fantasy about how he would deal with a student trying the same scam in his calculus class is perfect.

A few comments to make about this story overall.

First of all, it’s another case of a mathematical model being manipulated for political reasons. It just happens to be a really simple mathematical model in this case, namely a weighted average of scores.
In other words, the lesson learned for corrupt politicians in the future may well be to make sure the formulae are more complicated and thus easier to game.
Or in other words, let’s think about other examples of this kind of manipulation, where people in power manipulate scores after the fact for their buddies. Where might it be happening now? Look no further than the Value-Added Model for teachers and schools, which literally nobody understands or could prove is being manipulated in any given instance.
Taking a step further back, let’s remind ourselves that educational accountability models in general are extremely ripe for gaming and manipulation due to their high stakes nature. And the question of who gets the best opportunity to manipulate their scores is, as shown in this example of the GOP-donor-connected school, often a question of who has the best connections.
In other words, I wonder how much the system can be trusted to give us a good signal on how well schools actually teach (at least how well they teach to the test).
And if we want that signal to be clear, maybe we should take away the high stakes and literally measure it, with no consequences. Then, instead of punishing schools with bad scores, we could see how they need help.
The conversation doesn’t profit from our continued crazy high expectations and fundamental belief in the existence of a silver bullet, the latest one being the KIPP Charter Schools – read this reality check if you’re wondering what I’m talking about (hat tip Jordan Ellenberg).
As any statistician could tell you, any time you have an “educational experiment” involving highly motivated students, parents, and teachers, it will seem like a success. That’s called selection bias. The proof of the pudding lies in the scaling up of the method.
We need to think longer term and consider how we’re treating good teachers and school administrators who have to live under arbitrary and unfair systems. They might just leave.

Categories: math education, modeling, statistics

How much is the Stacks Project graph like a random graph?

August 1, 2013 Cathy O'Neil, mathbabe 1 comment

This is a guest post from Jordan Ellenberg, a professor of mathematics at the University of Wisconsin. Jordan’s book, How Not To Be Wrong, comes out in May 2014. It is crossposted from his blog, Quomodocumque, and tweeted about at @JSEllenberg.

Cathy posted some cool data yesterday coming from the new visualization features of the magnificent Stacks Project. Summary: you can make a directed graph whose vertices are the 10,445 tagged assertions in the Stacks Project, and whose edges are logical dependency. So this graph (hopefully!) doesn’t have any directed cycles. (Actually, Cathy tells me that the Stacks Project autovomits out any contribution that would create a logical cycle! I wish LaTeX could do that.)

Given any assertion v, you can construct the subgraph G_v of vertices which are the terminus of a directed path starting at v. And Cathy finds that if you plot the number of vertices and number of edges of each of these graphs, you get something that looks really, really close to a line.

Why is this so? Does it suggest some underlying structure? I tend to say no, or at least not much — my guess is that in some sense it is “expected” for graphs like this to have this sort of property.

Because I am trying to get strong at Sage I coded some of this up this morning. One way to make a random directed graph with no cycles is as follows: start with N vertices, and a function f on natural numbers k that decays with k, and then connect each vertex n to vertex n-k (if there is such a vertex) with probability f(k). The decaying function f is supposed to mimic the fact that an assertion is presumably more likely to refer to something just before it than something “far away” (though of course the Stacks Project is not a strictly linear thing like a book).
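Here is a minimal sketch of that construction in plain Python rather than Sage. It is not the code used for the plots in this post; the function names `random_dag` and `descendant_counts`, the seed, and the choice of 200 vertices are assumptions made for illustration. It builds the random graph and then, for every vertex v, counts the vertices and edges of the subgraph G_v reachable from v.

```python
import random

def random_dag(n, f, seed=None):
    """Return out[v] = list of lower-numbered vertices that v points to.

    Vertex v gets an edge to v - k with probability f(k) for k = 1..v-1,
    so every edge points to a smaller label and no directed cycle can form.
    """
    rng = random.Random(seed)
    out = {v: [] for v in range(1, n + 1)}
    for v in range(1, n + 1):
        for k in range(1, v):
            if rng.random() < f(k):
                out[v].append(v - k)
    return out

def descendant_counts(out):
    """For each v, return (#vertices, #edges) of the subgraph reachable from v."""
    reach = {}   # reach[v] = set of vertices reachable from v, including v itself
    counts = {}
    for v in sorted(out):  # targets have smaller labels, so they are processed first
        r = {v}
        for u in out[v]:
            r |= reach[u]
        reach[v] = r
        # Every out-edge of a reachable vertex stays inside the reachable set.
        counts[v] = (len(r), sum(len(out[u]) for u in r))
    return counts

# Illustrative parameters only (the post uses N = 1000).
dag = random_dag(200, lambda k: (2 / 3) ** k, seed=0)
pairs = descendant_counts(dag)
```

Plotting the values of `pairs` as a scatter plot should give the kind of nodes-vs-edges picture discussed here, under these assumptions.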

Here’s how Cathy’s plot looks for a graph generated by N= 1000 and f(k) = (2/3)^k, which makes the mean out-degree 2 as suggested in Cathy’s post.
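As a quick check of the mean out-degree claim: for a vertex with many vertices below it, the expected out-degree is the sum of f(k) over k, which is a geometric series. The finite version, for a vertex n with only n-1 vertices below it, is written out as well.

```latex
\[
  \mathbb{E}[\text{out-degree}]
  \;=\; \sum_{k \ge 1} f(k)
  \;=\; \sum_{k \ge 1} \left(\tfrac{2}{3}\right)^{k}
  \;=\; \frac{2/3}{1 - 2/3}
  \;=\; 2,
  \qquad
  \sum_{k=1}^{n-1} \left(\tfrac{2}{3}\right)^{k}
  \;=\; 2\left(1 - \left(\tfrac{2}{3}\right)^{n-1}\right).
\]
```

So only the first few vertices have noticeably smaller expected out-degree than 2.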

Pretty linear — though if you look closely you can see that there are really (at least) a couple of close-to-linear “strands” superimposed! At first I thought this was because I forgot to clear the plot before running the program, but no, this is the kind of thing that happens.

Is this because the distribution decays so fast, so that there are very few long-range edges? Here’s how the plot looks with f(k) = 1/k^2, a nice fat tail yielding many more long edges:

My guess: a random graph aficionado could prove that the plot stays very close to a line with high probability under a broad range of random graph models. But I don’t really know!

Update: Although you know what must be happening here? It’s not hard to check that in the models I’ve presented here, there’s a huge amount of overlap between the descendant graphs; in fact, a vertex is very likely to be connected to all but c of the vertices below it for a suitable constant c.

I would guess the Stacks Project graph doesn’t have this property (though it would be interesting to hear from Cathy to what extent this is the case) and that in her scatterplot we are not measuring the same graph again and again.

It might be fun to consider a model where vertices are pairs of natural numbers and (m,n) is connected to (m-k,n-l) with probability f(k,l) for some suitable decay. Under those circumstances, you’d have substantially less overlap between the descendant trees; do you still get the approximately linear relationship between edges and nodes?
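A rough sketch of what that two-dimensional model could look like, for anyone who wants to try the experiment. Everything specific here is an assumption: the offsets k and l are taken to be non-negative and not both zero, the decay is f(k,l) = 0.5^(k+l), the grid is 12 by 12, and the function names are made up for illustration.

```python
import random
from itertools import product

def random_grid_dag(size, f, seed=None):
    """Vertices are pairs (m, n) with 0 <= m, n < size.

    (m, n) points to (m - k, n - l) with probability f(k, l), for all
    k, l >= 0 that are not both zero and keep the target on the grid.
    """
    rng = random.Random(seed)
    out = {}
    for m, n in product(range(size), repeat=2):
        out[(m, n)] = [
            (m - k, n - l)
            for k in range(m + 1)
            for l in range(n + 1)
            if (k, l) != (0, 0) and rng.random() < f(k, l)
        ]
    return out

def descendant_counts(out):
    """For each vertex, return (#vertices, #edges) of its descendant subgraph."""
    reach, counts = {}, {}
    for v in sorted(out):  # lexicographic order: every target of v sorts before v
        r = {v}
        for u in out[v]:
            r |= reach[u]
        reach[v] = r
        counts[v] = (len(r), sum(len(out[u]) for u in r))
    return counts

# Hypothetical decay function and grid size, chosen only for illustration.
grid = random_grid_dag(12, lambda k, l: 0.5 ** (k + l), seed=0)
pairs = descendant_counts(grid)
```

Whether the scatter plot of `pairs` still looks linear is exactly the open question above; this sketch only sets up the experiment.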

Categories: guest post, math, statistics

Analyzing the complexity of the Stacks Project graphs

July 31, 2013 Cathy O'Neil, mathbabe 10 comments

So yesterday I told you about the cool new visualizations now available on Johan’s Stacks Project.

But how do we use these visualizations to infer something about either mathematics or, at the very least, the way we think about mathematics? Here’s one way we thought of with Pieter.

So, there’s a bunch of results, and each of them has its own subgraph of the entire graph which positions that result as the “base node” and shows all the other results which it logically depends on.

And each of those graphs has structure and attributes, the stupidest two of which are just the counts of the nodes and edges. So for each result, we have an ordered pair (#nodes, #edges). What can we infer about mathematics from these pairs?
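To make the pair concrete, here is a small sketch of how (#nodes, #edges) could be computed from a table of direct references. The tags and the `deps` dictionary below are made up for illustration (they are not real Stacks Project tags), and `subgraph_counts` is a hypothetical helper, not the project's code.

```python
def subgraph_counts(deps, root):
    """Return (#nodes, #edges) of the subgraph of results that `root` depends on.

    `deps` maps each tag to the list of tags referred to directly in its proof.
    """
    seen = {root}
    stack = [root]
    while stack:
        tag = stack.pop()
        for ref in deps.get(tag, ()):
            if ref not in seen:
                seen.add(ref)
                stack.append(ref)
    # Every direct reference made by a node in the subgraph is an edge of it.
    edges = sum(len(deps.get(tag, ())) for tag in seen)
    return len(seen), edges

# Hypothetical tags: one definition, two lemmas, one theorem.
deps = {
    "T003": ["L001", "L002"],
    "L002": ["L001", "D000"],
    "L001": ["D000"],
    "D000": [],
}
pairs = {tag: subgraph_counts(deps, tag) for tag in deps}
# pairs["D000"] is (1, 0), the kind of point that remarks and definitions produce.
```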

Here’s a scatter plot of the nodes-vs-edges for each of the 10,445 results (email me if you want to play with this data yourself):

I also put a best-fit line in, just to illustrate that the scatter plot is super linear but not perfectly linear.

So there are a bunch of comments I can make about this, but I’ll limit myself to the following:

There are a lot of points at (1,0), corresponding to remarks, axioms, beginning lemmas, definitions, and tags for sections.
As a data person, let me just say that data is never this clean. There’s something going on, some internal structure to these graphs that we should try to understand.
By “clean” I’m not exactly referring to the fact that things look pretty linear, although that’s weird and we should think about that. What I really mean is that things are so close to the curve that is being approximated. They’re all within a very tight border of this imaginary line. It’s super amazing.
Let’s pretend it’s just plain straight. Does that make sense, that as graphs get more complex the edges don’t get more dense than some multiple (1.86) of the number of nodes?
Kind of: remember, we don’t depict all logical dependency edges, just the ones that are directly referred to in the proof of a result. So right off the bat you are less surprised that the edges aren’t growing quadratically in the number of nodes, even though the number of possible edges is of course quadratic in the number of nodes.
Think about it this way: assume that every result that requires proof (so, that’s not a (1,0) result) refers to exactly 2 other results in its proof. Then those two child results each correspond to some subgraph of the entire graph, and say their subgraphs each have something like twice as many edges as nodes. Then, ignoring overlap, we’d see two graphs with a 2:1 ratio, then we’d see that parent node, plus two edges, one leading to each child result, which is also a 2:1 ratio, and the disjoint union of all those graphs gives us a large graph with a 2:1 ratio.
Then if you imagine now allowing the overlap, the ratio goes down a bit on average. In this toy model, the discrepancy between 2.0 and the slope we actually see, 1.86, is a measurement of the collapse of the two child graphs, which can be taken as a proxy for how much the two supporting results overlap as notions.
Of course, not every result has exactly two children.
Plus it doesn’t really explain how ridiculously consistent the plot above is. What would?
If you think about it, the only real explanation of the consistency above is my husband’s brain.
In other words, he’s humming along, thinking about stacks, and at some point, when he thinks things have gotten complicated enough, he says to himself “It’s time to wrap this stuff up and call it a result!” and then he does so. That moment, when he’s decided things are getting complicated enough, is very consistent internally to his brain.
In other words, if someone else created the stacks project, I’d expect to see another kind of plot, possibly also very consistent, but possibly with a different slope.
Also it’d be interesting to compare this plot to another kind of citation network graph, like the papers in the arXiv. Has anyone made that?
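The bookkeeping in the toy model above can be written out explicitly. The symbols are introduced here for illustration: the two child subgraphs have (n_1, e_1) and (n_2, e_2) nodes and edges with e_i = 2 n_i, and, once overlap is allowed, they share s nodes and t edges.

```latex
\[
  \text{No overlap:}\quad
  n = n_1 + n_2 + 1, \qquad
  e = e_1 + e_2 + 2 = 2n_1 + 2n_2 + 2 = 2n .
\]
\[
  \text{With overlap:}\quad
  n = n_1 + n_2 + 1 - s, \qquad
  e = e_1 + e_2 + 2 - t, \qquad
  \frac{e}{n} < 2 \iff t > 2s .
\]
```

So in this toy accounting the ratio drops below 2 exactly when the shared part carries more than two edges per shared node.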

Categories: math, modeling

The Stacks Project gets ever awesomer with new viz

July 30, 2013 Cathy O'Neil, mathbabe 17 comments

Crossposted on Not Even Wrong.

Here’s a completely biased interview I did with my husband A. Johan de Jong, who has been working with Pieter Belmans on a very cool online math project using d3js. I even made up some of his answers (with his approval).

Q: What is the Stacks Project?

A: It’s an open source textbook and reference for my field, which is algebraic geometry. It builds foundations starting from elementary college algebra and going up to algebraic stacks. It’s a self-contained exposition of all the material there, which makes it different from a research textbook or the experience you’d have reading a bunch of papers.

We were quite neurotic setting it up – everything has a proof, other results are referenced explicitly, and it’s strictly linear, which is to say there’s a strict ordering of the text so that all references are always to earlier results.

Of course the field itself has different directions, some of which are represented in the stacks project, but we had to choose a way of presenting it which allowed for this idea of linearity (of course, any mathematician thinks we can do that for all of mathematics).

Q: How has the Stacks Project website changed?

A: It started out as just a place you could download the pdf and tex files, but then Pieter Belmans came on board and he added features such as full text search, tag look-up, and a commenting system. In this latest version, we’ve added a whole bunch of features, but the most interesting one is the dynamic generation of dependency graphs.

We’ve had some crude visualizations for a while, and we made t-shirts from those pictures. I even had this deal where, if people found mathematical mistakes in the Stacks Project, they’d get a free t-shirt, and I’m happy to report that I just last week gave away my last t-shirt. Here’s an old picture of me with my adorable son (who’s now huge).

Q: Talk a little bit about the new viz.

A: First a word about the tags, which we need to understand the viz.

Every mathematical result in the Stacks Project has a “tag”, which is a four letter code, and which is a permanent reference for that result, even as other results are added before or after that one (by the way, Cathy O’Neil figured this system out).

The graphs show the logical dependencies between these tags, represented by arrows between nodes. You can see this structure in the above picture already.

So for example, if tag ABCD refers to Zariski’s Main Theorem, and tag ADFG refers to Nakayama’s Lemma, then since Zariski depends on Nakayama, there’s a logical dependency, which means the node labeled ABCD points to the node labeled ADFG in the entire graph.

Of course, we don’t really look at the entire graph, we look at the subgraph of results which a given result depends on. And we don’t draw all the arrows either, we only draw the arrows corresponding to direct references in the proofs. Which is to say, in the subgraph for Zariski, there will be a path from node ABCD to node ADFG, but not necessarily a direct link.
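A tiny sketch of the distinction between a dependency and a direct link. ABCD and ADFG are the illustrative tags from the example above; the intermediate tag ACEF and the helper `depends_on` are made up here and are not part of the Stacks Project code.

```python
def depends_on(direct_refs, start, target):
    """True if some chain of direct references leads from `start` to `target`."""
    seen = {start}
    stack = [start]
    while stack:
        tag = stack.pop()
        if tag == target:
            return True
        for ref in direct_refs.get(tag, ()):
            if ref not in seen:
                seen.add(ref)
                stack.append(ref)
    return False

# Hypothetical direct references: ABCD cites ACEF, and ACEF cites ADFG.
direct_refs = {"ABCD": ["ACEF"], "ACEF": ["ADFG"], "ADFG": []}

has_path = depends_on(direct_refs, "ABCD", "ADFG")   # a path exists
has_direct_link = "ADFG" in direct_refs["ABCD"]      # but no arrow is drawn
```

In this made-up case the drawn subgraph for ABCD would contain ADFG, reached through ACEF, with no arrow straight from ABCD to ADFG.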

Q: Can we see an example?

Let’s move to an example for result 01WC, which refers to the proof that “a locally projective morphism is proper”.

First, there are two kinds of heat maps. Here’s one that defines distance as the maximum (directed) distance from the root node. In other words, how far down in the proof is this result needed? In this case the main result 01WC is bright red with a black dotted border, and any result that 01WC depends on is represented as a node. The edges are directed, although the arrows aren’t drawn, but you can figure out the direction by how the color changes. The dark blue colors are the leaf nodes that are farthest away from the root.

Another way of saying this is that the redder results are the results that are closer to it in meaning and sophistication level.

Note if we had defined the distance as the minimum distance from the root node (to come soon hopefully), then we’d have a slightly different and also meaningful way of thinking about “redness” as “relevance” to the root node.
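For the curious, here is a sketch of the two notions of distance on a small made-up dependency graph. The node names and both helper functions are assumptions for illustration, not the code behind the visualizations. The minimum distance is a breadth-first search; the maximum distance is a longest-path computation, which is only well defined because the graph has no directed cycles.

```python
from collections import deque

def min_distances(deps, root):
    """Shortest directed distance from root to each result it depends on."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        tag = queue.popleft()
        for ref in deps.get(tag, ()):
            if ref not in dist:
                dist[ref] = dist[tag] + 1
                queue.append(ref)
    return dist

def max_distances(deps, root):
    """Longest directed distance from root to each result (acyclic graphs only)."""
    order, seen = [], set()

    def visit(tag):
        # Depth-first post-order: a node is appended after everything it cites.
        if tag in seen:
            return
        seen.add(tag)
        for ref in deps.get(tag, ()):
            visit(ref)
        order.append(tag)

    visit(root)
    dist = {tag: 0 for tag in order}
    for tag in reversed(order):  # root first, then nodes in topological order
        for ref in deps.get(tag, ()):
            dist[ref] = max(dist[ref], dist[tag] + 1)
    return dist

# Hypothetical graph: R cites A and B, A cites B, B cites C.
deps = {"R": ["A", "B"], "A": ["B"], "B": ["C"], "C": []}
shortest = min_distances(deps, "R")
longest = max_distances(deps, "R")
```

Here B is one step from the root by the minimum metric but two steps by the maximum metric, which is the kind of difference the two colorings would show. The recursive helper is fine for small graphs; a very deep graph would need an iterative version.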

This is a screenshot but feel free to play with it directly here. For all of the graphs, hovering over a result will cause the statement of the result to appear, which is awesome.

Next, let’s look at another kind of heat map where the color is defined as maximum distance from some leaf node in the overall graph. So dark blue nodes are basic results in algebra, sheaves, sites, cohomology, simplicial methods, and other chapters. The link is the same; you can just toggle between the different metrics.

Next we delved further into how results depend on those different topics. Here, again for the same result, we can see the extent to which that result depends on results from the various chapters. If you scroll over the nodes you can see more details. This is just a screenshot but you can play with it yourself here and you can collapse it in various ways corresponding to the internal hierarchy of the project.

Finally, we have a way of looking at the logical dependency graph directly, where each result node is labeled with a tag and colored by “type”: whether it’s a lemma, proposition, theorem, or something else. It also annotates the results which have separate names. Again a screenshot but play with it here, it rotates!

Check out the whole project here, and feel free to leave comments using the comment feature!

Categories: math, modeling, open source tools

Larry Summers being set up to fail?

July 29, 2013 Cathy O'Neil, mathbabe 15 comments

I’m back from PyData, which was a lot of fun and filled with super nice nerdy people. My prezi slides are now available here.

I have time for one thought: a bunch of people have chatted me up recently with the theory that Larry Summers is being put in the running for the Fed Chair alongside Janet Yellen just so that, when Yellen gets the call, we can all breathe a sigh of relief it didn’t go to Summers.

In other words, it’s a wholly political ploy so that Obama can look like a hero for women everywhere when he chooses Yellen, and so that we can all conclude that at least Obama’s learned this one lesson with regards to dealing with the ongoing financial crisis: Summers isn’t the solution.

Depending on my mood I sometimes buy into this theory, but obviously I’m still worried.

Categories: finance, news

PyData talk today

July 28, 2013 Cathy O'Neil, mathbabe 5 comments

Not much time because I’m giving a keynote talk at the PyData 2013 conference in Cambridge today, which is being held at the Microsoft NERD conference center.

It’s gonna be videotaped so I’ll link to that when it’s ready.

My title is “Storytelling With Data” but for whatever reason on the schedule handed out yesterday the name had been changed to “Scalable Storytelling With Data”. I’m thinking of addressing this name change in my talk – one of the points of the talk, in fact, is that with great tools, we don’t need to worry too much about the scale.

Plus since it’s Sunday morning I’m going to make an effort to tie my talk into an old testament story, which is totally bizarre since I’m not at all religious but for some reason it feels right. Please wish me luck.

Categories: data science, modeling, open source tools

Aunt Pythia’s advice

July 27, 2013 Cathy O'Neil, mathbabe 16 comments

It’s a speed advice column today, folks, because I’m blogging whilst sitting at the PyData 2013 conference [Aside: I believe in Travis Oliphant, the nerd Santa Claus, do you?]. I’ll try to keep it to the point yet amusing slash provocative.

By the way, if you don’t know what the hell I’m talking about, go here for past advice columns and here for an explanation of the name Pythia.

And please, Submit your question for Aunt Pythia at the bottom of this page!

——

Dear Aunt Pythia,

I’m having a baby soon, and I’m planning to be the primary caregiver for a few months (from 3 months onward). I’m hoping that I’ll be able to get some research done at the same time, but I’m not sure how practical that is. What should I expect? Do you have any tips for juggling baby care and math research? (assuming no teaching and minimal responsibilities around the department.)

Baffled About Birth Year

Dear BABY,

Other people are gonna tell you encouraging things like, “oh you can do it!” or “If anybody can do it, it’s you!” but not me.

Don’t get me wrong, I’m not telling you you can’t do it, but by acting like it’s just a matter of proper planning, I’d be underselling how much work you’re signing up for, and how fucking hard it really is going to be.

So here’s the real deal: it’s the hardest thing you’ll ever do (hopefully). You know how grad school was hard? This is like having to write a thesis once a year while living 24/7 with someone whose only goal is for you to not get that done.

Which is to say: be incredibly proud of yourself every day you survive this period, and don’t add an ounce of guilt to yourself that you can avoid. Guilt doesn’t help. And also, the system is set up badly for you, to be sure, but don’t dwell on it too much, that also doesn’t help while you’re in it.

In terms of very practical advice: pay through the nose for good babysitting and daycare, it’s worth the investment so that you don’t have to worry about whether your kid is getting love and attention. Go into debt, borrow money, or whatever, but get it set up so that you actually feel jealous of your kid, and specifically so you know your kid is better off with that situation for the next few hours than being with you.

Finally, when you feel crazy and insane and underproductive, know that it’ll get better, for sure, by the time the kids can wipe their own asses, and that you won’t regret having those beautiful children nor trying to get something else done too. Never apologize for needing to cry and vent about how hard this period is, and if you’re around people who don’t get it, find new people.

Good luck!

Cathy

——

Aunt Pythia,

How do I dress to make people think I am an adult? I’m a 25-year-old woman, and I’m getting a bit tired of people asking me if I’m a student.

I think they ask me this because I only wear jeans and nerdy t-shirts. I basically only own jeans and nerdy t-shirts, plus some cardigans. I am not at all interested in skirts or girly things, but I’m open to wearing slightly nicer clothes. Like more cardigans? Messenger bags that aren’t falling apart? Urk.

People on the internet claim that I need to pluck my eyebrows to be taken seriously, but fuck that shit.

Shopping Is Hard! Let’s Do Math

Dear Sihldm,

First, I gotta say I was expecting a bit more from that sign-off. I really don’t see what “Sihldm” is supposed to mean, but maybe I’m just out of the loop.

Second, I’m gonna say something kind of controversial. Namely, I think the single attribute that makes people take me seriously is the fact that I’m overweight (and that, nowadays, I have grey hair, which also helps).

I think people just stop thinking “girl” and start thinking “woman” when confronted with me, and that totally works to my advantage. Controversial because, according to the social contract, I’m supposed to feel consistently bad about my weight, but here’s an example where I’m like, wow I’ve never been underestimated as a “girl”.

So, my advice to you is: pack on like 100 pounds.

Just kidding, probably not a great plan, nor possible.

Here’s another try: whenever you’re giving a talk or starting a class, wear wool slacks and a sweater. For whatever reason people take you super seriously when you do, even if you’re not fat, and even if you’re short. If it’s summer, go for summer slacks and silk shirts, although not the kind of silk that shows sweat stains easily, those are embarrassing.

And if it’s not a special event like a talk or the first day of class, then fuck it, be yourself.

Good luck!

Cathy

——

Dear Aunt Pythia,

My husband stays home with the children, but in spite of a graduate degree in engineering and graduate work in mathematics, seems incapable of maintaining a clean house.

My question is, if 95% of the time he doesn’t sort the mail, 75% of the time he doesn’t vacuum, 50% of the time he doesn’t wash the dishes, and 80% of the time he doesn’t wipe the kitchen counters, what is the probability that he doesn’t actually see dirt? (He is color blind.)

Buried in junk mail

Dear Bijm,

Bijm? Really?

Are the kids healthy? Happy? Do they get fed non-dorito-like food? I’d say be grateful. If and when you can afford it get housekeeping, but don’t make the mistake I see so much of allowing resentment to build up over chores.

Also, keep in mind that the kids will be able to help with the chores soon. And by “soon” I mean “probably already”. Buy cute toy-like vacuum cleaners and make up a game about getting all the dirt. Make it part of the dessert ritual that the counters need to be clean first. Move your bills to online payments.

And enjoy your sexy househusband!! [Important aside: is he willing to wear an apron and nothing else when he cooks? Please answer privately, preferably with jpeg-formatted evidence.]

Aunt Pythia

——

Please submit your well-specified, fun-loving, cleverly-abbreviated ethical quandary to Aunt Pythia!

Categories: Aunt Pythia
