Archive for the ‘statistics’ Category

Guest post: Be more careful with the vagina stats in teaching

This is a guest post by Courtney Gibbons, an assistant professor of mathematics at Hamilton College. You can see her teaching evaluations on She would like you to note that she’s been tagged as “hilarious.” Twice.

Lately, my social media has been blowing up with stories about gender bias in higher ed, especially course evaluations.   As a 30-something, female math professor, I’m personally invested in this kind of issue.  So I’m gratified when I read about well-designed studies that highlight the “vagina tax” in teaching (I didn’t coin this phrase, but I wish I had).

These kinds of studies bring the conversation about bias to the table in a way that academics can understand. We can geek out on experimental design, the fact that the research is peer-reviewed and therefore passes some basic legitimacy tests.

Indeed, the conversation finally moves out of the realm of folklore, where we have “known” for some time that students expect women to be nurturing in addition to managing the class, while men just need to keep class on track.

Let me reiterate: as a young woman in academia, I want deans and chairs and presidents to take these observed phenomena seriously when evaluating their professors. I want to talk to my colleagues and my students about these issues. Eventually, I’d like to “fix” them, or at least game them to my advantage. (Just kidding.  I’d rather fix them.)

However, let me speak as a mathematician for a minute here: bad interpretations of data don’t advance the cause. There’s beautiful link-bait out there that justifies its conclusions on the flimsy “hey, look at this chart” understanding of big data. Benjamin M. Schmidt created a really beautiful tool to visualize data he scraped from the website through a process that he sketches on his blog. The best criticisms and caveats come from Schmidt himself.

What I want to examine is the response to the tool, both in the media and among my colleagues.  USAToday, HuffPo, and other sites have linked to it, citing it as yet more evidence to support the folklore: students see men as “geniuses” and women as “bossy.” It looks like they found some screenshots (or took a few) and decided to interpret them as provocatively as possible. After playing with the tool for a few minutes, which wasn’t even hard enough to qualify as sleuthing, I came to a very different conclusion.

If you look at the ratings for “genius”  and then break them down further to look at positive and negative reviews separately, it occurs predominantly in negative reviews. I found a few specific reviews, and they read, “you have to be a genius to pass” or along those lines.

[Don’t take my word for it — search google for:

rate my professors “you have to be a genius”‘

and you’ll see how students use the word “genius” in reviews of professors. The first page of hits is pretty much all men.]

Here’s the breakdown for “genius”:


So yes, the data shows that students are using the word “genius” in more evaluations of men than women. But there’s not a lot to conclude from this; we can’t tell from the data if the student is praising the professor or damning him. All we can see that it’s a word that occurs in negative reviews more often than positive ones. From the data, we don’t even know if it refers to the professor or not.  


Similar results occur with “brilliant”:


Now check out “bossy” and negative reviews:


Okay, wow, look at how far to the right those orange dots are… and now look at the x-axis.  We’re talking about fewer than 5 uses per million words of text.  Not exactly significant compared to some of the other searches you can do.


I thought that the phrase “terrible teacher” was more illuminating, because it’s more likely in reference to the subject of the review, and we’ve got some meaningful occurrences:

And yes, there is a gender imbalance, but it's not as great as I had feared. I'm more worried about the disciplinary break down, actually. Check out math -- we have the worst teachers, but we spread it out across genders, with men ranking 187 uses of "terrible teacher" per million words; women score 192. Compare to psychology, where profs receive a score of 110.  Ouch.

And yes, there is a gender imbalance, but it’s not as great as I had feared. I’m more worried about the disciplinary break down, actually. Check out math — we have the worst teachers, but we spread it out across genders, with men ranking 187 uses of “terrible teacher” per million words; women score 192. Compare to psychology, where profs receive a score of 110.  Ouch.


Who’s doing this reporting, and why aren’t we reading these reports more critically?  Journalists, get your shit together and report data responsibly.  Academics, be a little more skeptical of stories that simply post screenshots of a chart coupled with inciting prose from conclusions drawn, badly, from hastily scanned data.

Is this tool useless? No. Is it fun to futz around with? Yes.

Is it being reported and understood well? Resounding no!

I think even our students would agree with me: that’s just f*cked up.

What would a data-driven Congress look like?

Recently I’ve seen two very different versions of what a more data-driven Congress would look like, both emerging from the recent cruddy Cromnibus bill mess.

First, there’s this Bloomberg article, written by the editors, about using data to produce evidence on whether a given policy is working or not. Given what I know about how data is produced, and how definitions of success are politically manipulated, I don’t have much hope for this idea.

Second, there was a reader’s comments on this New York Times article, also about the Cromnibus bill. Namely, the reader was calling on the New York Times to not only explore a few facts about what was contained in the bill, but lay it out with more numbers and more consistency. I think this is a great idea. What if, when Congress gave us a shitty bill, we could see stuff like:

  1. how much money is allocated to each thing, both raw dollars and as a percentage of the whole bill,
  2. who put it in the omnibus bill,
  3. the history of that proposed spending, and the history of voting,
  4. which lobbyists were pushing it, and who gets paid by them, and ideally
  5. all of this would be in an easy-to-use interactive.

That’s the kind of data that I’d love to see. Data journalism is an emerging field, and we might not be there yet, but it’s something to strive for.

Categories: data science, statistics

Fairness, accountability, and transparency in big data models

As I wrote about already, last Friday I attended a one day workshop in Montreal called FATML: Fairness, Accountability, and Transparency in Machine Learning. It was part of the NIPS conference for computer science, and there were tons of nerds there, and I mean tons. I wanted to give a report on the day, as well as some observations.

First of all, I am super excited that this workshop happened at all. When I left my job at Intent Media in 2011 with the intention of studying these questions and eventually writing a book about them, they were, as far as I know, on nobody’s else’s radar. Now, thanks to the organizers Solon and Moritz, there are communities of people, coming from law, computer science, and policy circles, coming together to exchange ideas and strategies to tackle the problems. This is what progress feels like!

OK, so on to what the day contained and my copious comments.

Hannah Wallach

Sadly, I missed the first two talks, and an introduction to the day, because of two airplane cancellations (boo American Airlines!). I arrived in the middle of Hannah Wallach’s talk, the abstract of which is located here. Her talk was interesting, and I liked her idea of having social scientists partnered with data scientists and machine learning specialists, but I do want to mention that, although there’s a remarkable history of social scientists working within tech companies – say at Bell Labs and Microsoft and such – we don’t see that in finance at all, nor does it seem poised to happen. So in other words, we certainly can’t count on social scientists to be on hand when important mathematical models are getting ready for production.

Also, I liked Hannah’s three categories of models: predictive, explanatory, and exploratory. Even though I don’t necessarily think that a given model will fall neatly into one category or the other, they still give you a way to think about what we do when we make models. As an example, we think of recommendation models as ultimately predictive, but they are (often) predicated on the ability to understand people’s desires as made up of distinct and consistent dimensions of personality (like when we use PCA or something equivalent). In this sense we are also exploring how to model human desire and consistency. For that matter I guess you could say any model is at its heart an exploration into whether the underlying toy model makes any sense, but that question is dramatically less interesting when you’re using linear regression.

Anupam Datta and Michael Tschantz

Next up Michael Tschantz reported on work with Anupam Datta that they’ve done on Google profiles and Google ads. The started with google’s privacy policy, which I can’t find but which claims you won’t receive ads based on things like your health problems. Starting with a bunch of browsers with no cookies, and thinking of each of them as fake users, they did experiments to see what actually happened both to the ads for those fake users and to the google ad profiles for each of those fake users. They found that, at least sometimes, they did get the “wrong” kind of ad, although whether Google can be blamed or whether the advertiser had broken Google’s rules isn’t clear. Also, they found that fake “women” and “men” (who did not differ by any other variable, including their searches) were offered drastically different ads related to job searches, with men being offered way more ads to get $200K+ jobs, although these were basically coaching sessions for getting good jobs, so again the advertisers could have decided that men are more willing to pay for such coaching.

An issue I enjoyed talking about was brought up in this talk, namely the question of whether such a finding is entirely evanescent or whether we can call it “real.” Since google constantly updates its algorithm, and since ad budgets are coming and going, even the same experiment performed an hour later might have different results. In what sense can we then call any such experiment statistically significant or even persuasive? Also, IRL we don’t have clean browsers, so what happens when we have dirty browsers and we’re logged into gmail and Facebook? By then there are so many variables it’s hard to say what leads to what, but should that make us stop trying?

From my perspective, I’d like to see more research into questions like, of the top 100 advertisers on Google, who saw the majority of the ads? What was the economic, racial, and educational makeup of those users? A similar but different (because of the auction) question would be to reverse-engineer the advertisers’ Google ad targeting methodologies.

Finally, the speakers mentioned a failure on Google’s part of transparency. In your advertising profile, for example, you cannot see (and therefore cannot change) your marriage status, but advertisers can target you based on that variable.

Sorelle Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian

Next up we had Sorelle talk to us about her work with two guys with enormous names. They think about how to make stuff fair, the heart of the question of this workshop.

First, if we included race in, a resume sorting model, we’d probably see negative impact because of historical racism. Even if we removed race but included other attributes correlated with race (say zip code) this effect would remain. And it’s hard to know exactly when we’ve removed the relevant attributes, but one thing these guys did was define that precisely.

Second, say now you have some idea of the categories that are given unfair treatment, what can you do? One thing suggested by Sorelle et al is to first rank people in each category – to assign each person a percentile in their given category – and then to use the “forgetful function” and only consider that percentile. So, if we decided at a math department that we want 40% women graduate students, to achieve this goal with this method we’d independently rank the men and women, and we’d offer enough spots to top women to get our quota and separately we’d offer enough spots to top men to get our quota. Note that, although it comes from a pretty fancy setting, this is essentially affirmative action. That’s not, in my opinion, an argument against it. It’s in fact yet another argument for it: if we know women are systemically undervalued, we have to fight against it somehow, and this seems like the best and simplest approach.

Ed Felten and Josh Kroll

After lunch Ed Felton and Josh Kroll jointly described their work on making algorithms accountable. Basically they suggested a trustworthy and encrypted system of paper trails that would support a given algorithm (doesn’t really matter which) and create verifiable proofs that the algorithm was used faithfully and fairly in a given situation. Of course, we’d really only consider an algorithm to be used “fairly” if the algorithm itself is fair, but putting that aside, this addressed the question of whether the same algorithm was used for everyone, and things like that. In lawyer speak, this is called “procedural fairness.”

So for example, if we thought we could, we might want to turn the algorithm for punishment for drug use through this system, and we might find that the rules are applied differently to different people. This algorithm would catch that kind of problem, at least ideally.

David Robinson and Harlan Yu

Next up we talked to David Robinson and Harlan Yu about their work in Washington D.C. with policy makers and civil rights groups around machine learning and fairness. These two have been active with civil rights group and were an important part of both the Podesta Report, which I blogged about here, and also in drafting the Civil Rights Principles of Big Data.

The question of what policy makers understand and how to communicate with them came up several times in this discussion. We decided that, to combat cherry-picked examples we see in Congressional Subcommittee meetings, we need to have cherry-picked examples of our own to illustrate what can go wrong. That sounds bad, but put it another way: people respond to stories, especially to stories with innocent victims that have been wronged. So we are on the look-out.

Closing panel with Rayid Ghani and Foster Provost

I was on the closing panel with Rayid Ghani and Foster Provost, and we each had a few minutes to speak and then there were lots of questions and fun arguments. To be honest, since I was so in the moment during this panel, and also because I was jonesing for a beer, I can’t remember everything that happened.

As I remember, Foster talked about an algorithm he had created that does its best to “explain” the decisions of a complicated black box algorithm. So in real life our algorithms are really huge and messy and uninterpretable, but this algorithm does its part to add interpretability to the outcomes of that huge black box. The example he gave was to understand why a given person’s Facebook “likes” made a black box algorithm predict they were gay: by displaying, in order of importance, which likes added the most predictive power to the algorithm.

[Aside, can anyone explain to me what happens when such an algorithm comes across a person with very few likes? I’ve never understood this very well. I don’t know about you, but I have never “liked” anything on Facebook except my friends’ posts.]

Rayid talked about his work trying to develop a system for teachers to understand which students were at risk of dropping out, and for that system to be fair, and he discussed the extent to which that system could or should be transparent.

Oh yeah, and that reminds me that, after describing my book, we had a pretty great argument about whether credit scoring models should be open source, and what that would mean, and what feedback loops that would engender, and who would benefit.

Altogether a great day, and a fantastic discussion. Thanks again to Solon and Moritz for their work in organizing it.

Video cameras won’t solve the #EricGarner situation, but they will help

As many thoughtful people have pointed out already, Eric Garner’s case proves that video evidence is not a magic bullet to combat and punish undue police brutality. The Grand Jury deemed such evidence insufficient for an indictment, even if the average person watching the video cannot understand that point of view.

Even so, it would be a mistake to dismiss video cameras on police as entirely a bad idea. We shouldn’t assume no progress could be made simply because there’s an example which lets us down. I am no data evangelist, but neither am I someone who dismisses data. It can be powerful and we should use its power when we can.

And before I try to make the general case for video cameras on cops, let me make one other point. The Eric Garner video has already made progress in one arena, namely public opinion. Without the video, we wouldn’t be seeing nationwide marches protesting the outrageous police conduct.

A few of my data nerd thoughts:

  1. If cops were required to wear cameras, we’d have more data. We should think of that as building evidence, with the potential to use it to sway grand juries, criminal juries, judges, or public opinion.
  2. One thing I said time after time to my students this summer at the data journalism program I directed is the following: a number by itself is usually meaningless. What we need is to compare that number to a baseline. The baseline could be the average number for a population, or the median, or some range of 5th to 95th percentiles, or how it’s changed over time, or whatnot. But in order to gauge any baseline you need data.
  3. So in the case of police videotapes, we’d need to see how cops usually handle a situation, or how cops from other precincts handle similar situations, or the extremes of procedures in such situations, or how police have changed their procedures over time. And if we think the entire approach is heavy handed, we can also compare the data to the police manual, or to other countries, or what have you. More data is better for understanding aggregate approaches, and aggregate understanding makes it easier to fit a given situation into context.
  4. Finally, the cameras might also change their behavior when they are policing, knowing they are being taped. That’s believable but we shouldn’t depend on it.
  5. And also, we have to be super careful about how we use video evidence, and make sure it isn’t incredibly biased due to careful and unfair selectivity by the police. So, some cops are getting in trouble for turning off their cameras at critical moments, or not turning them on ever.

Let’s take a step back and think about how large-scale data collection and mining works, for example in online advertising. A marketer collects a bunch of data. And knowing a lot about one person doesn’t necessarily help them, but if they know a lot about most people, it statistically speaking does help them sell stuff. A given person might not be in the mood to buy, or might be broke, but if you dangle desirable good in front of a whole slew of people, you make sales. It’s a statistical play which, generally speaking, works.

In this case, we are the marketer, and the police are the customers. We want a lot of information about how they do their job so when the time comes we have some sense of “normal police behavior” and something to compare a given incident to or a given cop to. We want to see how they do or don’t try to negotiate peace, and with whom. We want to see the many examples of good and great policing as well as the few examples of terrible, escalating policing.

Taking another step back, if the above analogy seems weird, there’s a reason for that. In general data is being collected on the powerless, on the consumers, on the citizens, or the job applicants, and we should be pushing for more and better data to be collected instead on the powerful, on the police, on the corporations, and on the politicians. There’s a reason there is a burgeoning privacy industry for rich and powerful people.

For example, we want to know how many people have been killed by the police, but even a statistic that important is incredibly hard to come by (see this and this for more on that issue). However, it’s never been easier for the police to collect data on us and act on suspicions of troublemakers, however that is defined.

Another example – possibly the most extreme example of all – comes this very week from the reports on the CIA and torture. That is data and evidence we should have gotten much earlier, and as the New York Times demands, we should be able to watch videos of waterboarding and decide for ourselves whether it constitutes torture.

So yes, let’s have video cameras on every cop. It is not a panacea, and we should not expect it to solve our problems over night. In fact video evidence, by itself, will not solve any problem. We should think it as a mere evidence collecting device, and use it in the public discussion of how the most powerful among us treat the least powerful. But more evidence is better.

Finally, there’s the very real question of who will have access to the video footage, and whether the public will be allowed to see it at all. It’s a tough question, which will take a while to sort out (FOIL requests!), but until then, everyone should know that it is perfectly legal to videotape police in every place in this country. So go ahead and make a video with your camera when you suspect weird behavior.


De-anonymizing what used to be anonymous: NYC taxicabs

Thanks to Artem Kaznatcheev, I learned yesterday about the recent work of Anthony Tockar in exploring the field of anonymization and deanonymization of datasets.

Specifically, he looked at the 2013 cab rides in New York City, which was provided under a FOIL request, and he stalked celebrities Bradley Cooper and Jessica Alba (and discovered that neither of them tipped the cabby). He also stalked a man who went to a slew of NYC titty bars: found out where the guy lived and even got a picture of him.

Previously, some other civic hackers had identified the cabbies themselves, because the original dataset had scrambled the medallions, but not very well.

The point he was trying to make was that we should not assume that “anonymized” datasets actually protect privacy. Instead we should learn how to use more thoughtful approaches to anonymizing stuff, and he proposes a method called “differential privacy,” which he explains here. It involves adding noise to the data, in a certain way, so that at the end any given person doesn’t risk too much of their own privacy by being included in the dataset versus being not included in the dataset.

Bottomline, it’s actually pretty involved mathematically, and although I’m a nerd and it doesn’t intimidate me, it does give me pause. Here are a few concerns:

  1. It means that most people, for example the person in charge of fulfilling FOIL requests, will not actually understand the algorithm.
  2. That means that, if there’s a requirement that such a procedure is used, that person will have to use and trust a third party to implement it. This leads to all sorts of problems in itself.
  3. Just to name one, depending on what kind of data it is, you have to implement differential privacy differently. There’s no doubt that a complicated mapping of datatype to methodology will be screwed up when the person doing it doesn’t understand the nuances.
  4. Here’s another: the third party may not be trustworthy and may have created a backdoor.
  5. Or they just might get it wrong, or do something lazy that doesn’t actually work, and they can get away with it because, again, the user is not an expert and cannot accurately evaluate their work.

Altogether I’m imagining that this is at best an expensive solution for very important datasets, and won’t be used for your everyday FOIL requests like taxicab rides unless the culture around privacy changes dramatically.

Even so, super interesting and important work by Anthony Tockar. Also, if you think that’s cool, take a look at my friend Luis Daniel‘s work on de-anonymizing the Stop & Frisk data.

People hate me, I must be doing something right

September 30, 2014 32 comments

Not sure if you’ve seen this recent New York Times article entitled Learning to Love Criticism, but go ahead and read it if you haven’t. The key figures:

…76 percent of the negative feedback given to women included some kind of personality criticism, such as comments that the woman was “abrasive,” “judgmental” or “strident.” Only 2 percent of men’s critical reviews included negative personality comments.

This is so true! I re-re-learned this recently (again) when I started podcasting on Slate and the iTunes reviews of the show included attacks on me personally. For example: “Felix is great but Cathy is just annoying… and is not very interesting on anything” as well as “The only problem seems to be Cathy O’Neill who doesn’t have anything to contribute to the conversation…”

By contrast the men on the show, Jordan and Felix, are never personally attacked, although Felix is sometimes criticized for interrupting people, mostly me. In other words, I have some fans too. I am divisive.

So, what’s going on here?

Well, I have a thick skin already, partly from blogging and partly from being in men’s fields all my life, and partly just because I’m an alpha female. So what that means is that I know that it’s not really about me when people anonymously complain that I’m annoying or dumb. To be honest, when I see something like that, which isn’t a specific criticism that might help me get better but is rather a vague attack on my character, I immediately discount it as sexism if not misogyny, and I feel pity for the women in that guy’s life. Sometimes I also feel pity for the guy too, because he’s stunted and that’s sad.

But there’s one other thing I conclude when I piss people off: that I’m getting under their skin, which means what I’m saying is getting out there, to a wider audience than just people who already agree with me, and if that guy hates me then maybe 100 other people are listening and not quite hating me. They might even be agreeing with me. They might even be changing their minds about some things because of my arguments.

So, I realize this sounds twisted, but when people hate me, I feel like I must be doing something right.

One other thing I’ll say, which the article brings up. It is a luxury indeed to be a woman who can afford to be hated. I am not at risk, or at least I don’t feel at all at risk, when other people hate me. They are entitled to hate me, and I don’t need to bother myself about getting them to like me. It’s a deep and wonderful fact about our civilization that I can say that, and I am very glad to be living here and now, where I can be a provocative and opinionated intellectual woman.

Fuck yes! Let’s do this, people! Let’s have ideas and argue about them and disagree! It’s what freedom is all about.

Categories: musing, statistics

Women not represented in clinical trials

September 26, 2014 13 comments

This recent NYTimes article entitled Health Researchers Will Get $10.1 Million to Counter Gender Bias in Studies spelled out a huge problem that kind of blows me away as a statistician (and as a woman!).

Namely, they have recently decided over at the NIH, which funds medical research in this country, that we should probably check to see how women’s health are affected by drugs, and not just men’s. They’ve decided to give “extra money” to study this special group, namely females.

Here’s the bizarre and telling explanation for why most studies have focused on men and excluded women:

Traditionally many investigators have worked only with male lab animals, concerned that the hormonal cycles of female animals would add variability and skew study results.

Let’s break down that explanation, which I’ve confirmed with a medical researcher is consistent with the culture.

If you are afraid that women’s data would “skew study results,” that means you think the “true result” is the result that works for men. Because adding women’s data would add noise to the true signal, that of the men’s data. What?! It’s an outrageous perspective. Let’s take another look at this reasoning, from the article:

Scientists often prefer single-sex studies because “it reduces variability, and makes it easier to detect the effect that you’re studying,” said Abraham A. Palmer, an associate professor of human genetics at the University of Chicago. “The downside is that if there is a difference between male and female, they’re not going to know about it.”

Ummm… yeah. So instead of testing the effect on women, we just go ahead and optimize stuff for men and let women just go ahead and suffer the side effects of the treatment we didn’t bother to study. After all, women only comprise 50.8% of the population, they won’t mind.

This is even true for migraines, where 2/3rds of migraine sufferers are women.

One reason they like to exclude women: they have periods, and they even sometimes get pregnant, which is confusing for people who like to have clean statistics (on men’s health). In fact my research contact says that traditionally, this bias towards men in clinical trials was said to protect women because they “could get pregnant” and then they’d be in a clinical trial while pregnant. OK.

I’d like to hear more about who is and who isn’t in clinical trials, and why.

Categories: modeling, news, rant, statistics

Get every new post delivered to your Inbox.

Join 2,778 other followers