journalism | mathbabe

Stuff I’m reading

June 23, 2015 Cathy O'Neil, mathbabe 8 comments

A fascinating conversation with Gerald Posner, author of God’s Bankers: a History of Money and Power at the Vatican, with crazy and horrible details of the Vatican bank’s dealings with the Nazis (hat tip Aryt Alasti). Also a review of the book in the New York Times.
Nerding out on an interesting blog post by Laura McClay, who describes her involvement researching flood insurance (hat tip Jordan Ellenberg). One of my favorite points about insurance comes up in this piece, namely that if you price insurance too accurately, it fails in its most basic function and gets too expensive for those at highest risk.
There’s a new social network created specifically to get people more involved in politics. It’s called Brigade, and it gets users to answer a bunch of questions about their beliefs. The business model hasn’t been unveiled yet, but this is information that political campaigns would find very valuable. Also see Alex Howard’s take. Could be scary, could be useful.

Categories: data science, finance, journalism

Driving While Black in the Bronx

April 24, 2015 Cathy O'Neil, mathbabe 10 comments

This is the story of Q, a black man living in the Bronx, who kindly allowed me to interview him about his recent experience. The audio recording of my interview with him is available below as well.

Q was stopped in the Bronx driving a new car, the fourth time that week, by two rookie officers on foot. The officers told Q to “give me your fucking license,” and Q refused to produce his license, objecting to the tone of the officer’s request. When Q asked him why he was stopped, the officer told him that it was because of his tinted back windows, in spite of there being many other cars on the same block, and even next to him, with similarly tinted windows. Q decided to start recording the interaction on his phone after one of the cops used the n-word.

After a while seven cop cars came to the scene, and eventually a more polite policeman asked Q to produce his license, which he did. They brought him in, claiming they had a warrant for him. Q knew he didn’t actually have a warrant, but when he asked, they said it was a warrant for littering. It sounded like an excuse to arrest him because Q was arguing. He recorded them saying, “We should just lock this black guy up.”

They brought him to the precinct and Q asked him for a phone call. He needed to unlock his phone to get the phone number, and when he did, the policeman took his phone and ran out of the room. Q later found out his recordings had been deleted.

After a while he was assigned a legal aid lawyer, to go before a judge. Q asked the legal aid why he was locked up. She said there was no warrant on his record and that he’d been locked up for disorderly conduct. This was the third charge he’d heard about.

He had given up his car keys, his cell phone, his money, his watch and his house keys, all in different packages. When he went back to pick up his property while his white friend waited in the car, the people inside the office claimed they couldn’t find anything except his cell phone. They told him to come back at 9pm when the arresting officer would come in. Then Q’s white friend came in, and after Q explained the situation to him in front of the people working there, they suddenly found all of his possessions. Q thinks they assumed his friend was a lawyer because he was white and well dressed.

They took the starter plug out of his car as well, and he got his cell phone back with no videos. The ordeal lasted 12 hours altogether.

“The sad thing about it,” Q said, “is that it happens every single day. If you’re wearing a suit and tie it’s different, but when you’re wearing something fitted and some jeans, you’re treated as a criminal. It’s sad that people have to go through this on a daily basis, for what?”

Here’s the raw audio file of my interview with Q:

Categories: #OWS, discrimination, feedback loop, journalism, news, white privilege

Putting the dick pic on the Snowden story

April 17, 2015 Cathy O'Neil, mathbabe 8 comments

I’m on record complaining about how journalists dumb down stories in blind pursuit of “naming the victim” or otherwise putting a picture on the story.

But then again, sometimes that’s exactly what you need to do, especially when the story is super complicated. Case in point: the Snowden revelations story.

In the past 2 weeks I’ve seen the Academy Award winning feature length film CitizenFour, I’ve read Bruce Schneier’s recent book, Data and Goliath: The Hidden Battles To Collect Your Data And Control Your World, and finally I watched John Oliver’s recent Snowden episode.

They were all great in their own way. I liked Schneier’s book, it was a quick read, and I’d recommend it to people who want to know more than Oliver’s interview shows us. He’s very very smart, incredibly well informed, and almost completely reasonable (unlike this review).

To be honest, though, when I recommend something to other people, I pick John Oliver’s approach; he cleverly puts the dick pic on the story (you have to reset it to the beginning):

Here’s the thing that I absolutely love about Oliver’s interview. He’s not absolutely smitten by Snowden, but he recognizes Snowden’s goal, and makes it absolutely clear what it means to people using the handy use case of how nude pictures get captured in the NSA dragnets. It is really brilliant.

Compared to Schneier’s book, Oliver is obviously not as informational. Schneier is a world-wide expert on security, and gives us real details on which governmental programs know what and how. But honestly, unless you’re interested in becoming a security expert, that isn’t so important. I’m a tech nerd and even for me the details were sometimes overwhelming.

Here’s what I want to concentrate on. In the last part of the book, Schneier suggests all sorts of ways that people can protect their own privacy, using all sorts of encryption tools and so on. He frames it as a form of protest, but it seems like a LOT of work to me.

Compare that to my favorite part of the Oliver interview, when Oliver asks Snowden (starting at minute 30:28 in the above interview) if we should “just stop taking dick pics.” Snowden’s answer is no: changing what we normally do because of surveillance is a loss of liberty, even if it’s dumb.

I agree, which is why I’m not going to stop blabbing my mouth off everywhere (I don’t actually send naked pictures of myself to people, I think that’s a generational thing).

One last thing I can’t resist saying, and which Schneier discusses at length: almost every piece of data collected about us by our government is more or less for sale anyway. Just think about that. It is more meaningful for people worried about large scale discrimination, like me, than it is for people worried about case-by-case pinpointed governmental acts of power and suppression.

Or, put it this way: when we are up in arms about the government having our dick pics, we forget that so do our phones, and so does Facebook, or Snapchat, not to mention all the backups on the cloud somewhere.

Categories: data science, discrimination, journalism, news

Fingers crossed – book coming out next May

April 15, 2015 Cathy O'Neil, mathbabe 14 comments

As it turns out, it takes a while to write a book, and then another few months to publish it.

I’m very excited today to tentatively announce that my book, which is tentatively entitled Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy, will be published in May 2016, in time to appear on summer reading lists and well before the election.

Fuck yeah! I’m so excited.

p.s. Fight for 15 is happening now.

Categories: arms race, credit scores, data journalism, data science, discrimination, economics, education, feedback loop, finance, journalism, law, math education, modeling, musing, open source tools, rant, statistics

A critique of a review of a book by Bruce Schneier

March 17, 2015 Cathy O'Neil, mathbabe 6 comments

I haven’t yet read Bruce Schneier’s new book, Data and Goliath: The Hidden Battles To Collect Your Data and Control Your World. I plan to in the coming days, while I’m traveling with my kids for spring break.

Even so, I already feel capable of critiquing this review of his book (hat tip Jordan Ellenberg), written by Columbia Business School Professor and Investment Banker Jonathan Knee. You see, I’m writing a book myself on big data, so I feel like I understand many of the issues intimately.

The review starts out flattering, but then it hits this turn:

When it comes to his specific policy recommendations, however, Mr. Schneier becomes significantly less compelling. And the underlying philosophy that emerges — once he has dispensed with all pretense of an evenhanded presentation of the issues — seems actually subversive of the very democratic principles that he claims animates his mission.

That’s a pretty hefty charge. Let’s take a look into Knee’s evidence that Schneier wants to subvert democratic principles.

NSA

First, he complains that Schneier wants the government to stop collecting and mining massive amounts of data in its search for terrorists. Knee thinks this is dumb because it would be great to have lots of data on the “bad guys” once we catch them.

Any time someone uses the phrase “bad guys,” it makes me wince.

But putting that aside, Knee is either ignorant of or is completely ignoring what mass surveillance and data dredging actually create: the false positives, the time and money and attention, not to mention the potential for misuse and hacking. Knee’s opinion on that is simply that we normal citizens just don’t know enough to have an opinion on whether it works, including Schneier, and in spite of Schneier knowing Snowden pretty well.

It’s just like waterboarding – Knee says – we can’t be sure it isn’t a great fucking idea.

Wait, before we move on, who is more pro-democracy, the guy who wants to stop totalitarian social control methods, or the guy who wants to leave it to the opaque authorities?

Corporate Data Collection

Here’s where Knee really gets lost in Schneier’s logic, because – get this – Schneier wants corporate collection and sale of consumer data to stop. The nerve. As Knee says:

Mr. Schneier promotes no less than a fundamental reshaping of the media and technology landscape. Companies with access to large amounts of personal data would be “automatically classified as fiduciaries” and subject to “special legal restrictions and protections.”

That these limits would render illegal most current business models — under which consumers exchange enhanced access by advertisers for free services – does not seem to bother Mr. Schneier.

I can’t help but think that Knee cannot understand any argument that would threaten the business world as he knows it. After all, he is a business professor and an investment banker. Things seem pretty well worked out when you live in such an environment.

By Knee’s logic, even if the current business model is subverting democracy – which I also argue in my book – we shouldn’t tamper with it because it’s a business model.

The way Knee paints Schneier as anti-democratic is by using the classic fallacy in big data which I wrote about here:

Although professing to be primarily preoccupied with respect of individual autonomy, the fact that Americans as a group apparently don’t feel the same way as he does about privacy appears to have little impact on the author’s radical regulatory agenda. He actually blames “the media” for the failure of his positions to attract more popular support.

Quick summary: Americans as a group do not feel this way because they do not understand what they are trading when they trade their privacy. Commercial and governmental interests, meanwhile, are all united in convincing Americans not to think too hard about it. There are very few people devoting themselves to alerting people to the dark side of big data, and Schneier is one of them. It is a patriotic act.

Also, yes Professor Knee, “the media” generally speaking writes down whatever a marketer in the big data world says is true. There are wonderful exceptions, of course.

So, here’s a question for Knee. What if you found out about a threat on the citizenry, and wanted to put a stop to it? You might write a book and explain the threat; the fact that not everyone already agrees with you wouldn’t make your book anti-democratic, would it?

MLK

The rest of the review basically boils down to, “you don’t understand the teachings of the Reverend Dr. Martin Luther King Junior like I do.”

Do you know about Godwin’s law, which says that as soon as someone invokes the Nazis in an argument about anything, they’ve lost the argument?

I feel like we need another, similar rule, which says, if you’re invoking MLK and claiming the other person is misinterpreting him while you have him nailed, then you’ve lost the argument.

Categories: data science, economics, journalism, modeling, rant

Guest post: Be more careful with the vagina stats in teaching

February 20, 2015 Cathy O'Neil, mathbabe 7 comments

This is a guest post by Courtney Gibbons, an assistant professor of mathematics at Hamilton College. You can see her teaching evaluations on ratemyprofessor.com. She would like you to note that she’s been tagged as “hilarious.” Twice.

Lately, my social media has been blowing up with stories about gender bias in higher ed, especially course evaluations. As a 30-something, female math professor, I’m personally invested in this kind of issue. So I’m gratified when I read about well-designed studies that highlight the “vagina tax” in teaching (I didn’t coin this phrase, but I wish I had).

These kinds of studies bring the conversation about bias to the table in a way that academics can understand. We can geek out on experimental design, the fact that the research is peer-reviewed and therefore passes some basic legitimacy tests.

Indeed, the conversation finally moves out of the realm of folklore, where we have “known” for some time that students expect women to be nurturing in addition to managing the class, while men just need to keep class on track.

Let me reiterate: as a young woman in academia, I want deans and chairs and presidents to take these observed phenomena seriously when evaluating their professors. I want to talk to my colleagues and my students about these issues. Eventually, I’d like to “fix” them, or at least game them to my advantage. (Just kidding. I’d rather fix them.)

However, let me speak as a mathematician for a minute here: bad interpretations of data don’t advance the cause. There’s beautiful link-bait out there that justifies its conclusions on the flimsy “hey, look at this chart” understanding of big data. Benjamin M. Schmidt created a really beautiful tool to visualize data he scraped from the website ratemyprofessor.com through a process that he sketches on his blog. The best criticisms and caveats come from Schmidt himself.

What I want to examine is the response to the tool, both in the media and among my colleagues. USAToday, HuffPo, and other sites have linked to it, citing it as yet more evidence to support the folklore: students see men as “geniuses” and women as “bossy.” It looks like they found some screenshots (or took a few) and decided to interpret them as provocatively as possible. After playing with the tool for a few minutes, which wasn’t even hard enough to qualify as sleuthing, I came to a very different conclusion.

If you look at the ratings for “genius” and then break them down further to look at positive and negative reviews separately, it occurs predominantly in negative reviews. I found a few specific reviews, and they read, “you have to be a genius to pass” or along those lines.

[Don’t take my word for it — search google for:

rate my professors “you have to be a genius”

and you’ll see how students use the word “genius” in reviews of professors. The first page of hits is pretty much all men.]

Here’s the breakdown for “genius”:

So yes, the data shows that students are using the word “genius” in more evaluations of men than women. But there’s not a lot to conclude from this; we can’t tell from the data if the student is praising the professor or damning him. All we can see is that it’s a word that occurs in negative reviews more often than positive ones. From the data, we don’t even know if it refers to the professor or not.

Similar results occur with “brilliant”:

Now check out “bossy” and negative reviews:

Okay, wow, look at how far to the right those orange dots are… and now look at the x-axis. We’re talking about fewer than 5 uses per million words of text. Not exactly significant compared to some of the other searches you can do.

I thought that the phrase “terrible teacher” was more illuminating, because it’s more likely in reference to the subject of the review, and we’ve got some meaningful occurrences:

And yes, there is a gender imbalance, but it’s not as great as I had feared. I’m more worried about the disciplinary breakdown, actually. Check out math — we have the worst teachers, but we spread it out across genders, with men ranking 187 uses of “terrible teacher” per million words; women score 192. Compare to psychology, where profs receive a score of 110. Ouch.
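For readers who want to see what a "uses per million words" figure means, here is a minimal sketch of that kind of normalization, split by gender and by positive or negative review. It is not Schmidt's actual pipeline; the function name, the tuple layout, and the toy reviews are all made up for illustration.

```python
from collections import defaultdict


def uses_per_million(reviews, phrase):
    """Occurrences of `phrase` per million words, keyed by (gender, sentiment).

    `reviews` is an iterable of (gender, sentiment, text) tuples. This layout
    is an assumption for the sketch, not the format of any real dataset.
    """
    hits = defaultdict(int)
    words = defaultdict(int)
    needle = phrase.lower()
    for gender, sentiment, text in reviews:
        key = (gender, sentiment)
        lowered = text.lower()
        # Plain substring count: "genius" would also match "geniuses".
        hits[key] += lowered.count(needle)
        words[key] += len(lowered.split())
    return {key: 1_000_000 * hits[key] / words[key] for key in words if words[key]}


# Toy, invented reviews. Real rates are far smaller because real corpora are huge.
sample_reviews = [
    ("male", "negative", "You have to be a genius to pass this class"),
    ("male", "positive", "Clear lectures and fair exams"),
    ("female", "negative", "Terrible teacher and unclear grading"),
    ("female", "positive", "She is a genius and explains everything well"),
]

rates = uses_per_million(sample_reviews, "genius")
for key in sorted(rates):
    print(key, rates[key])
```

The point of dividing by word count is that a raw tally says nothing on its own: a word can show up more often for one group simply because that group has more text written about it.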

Who’s doing this reporting, and why aren’t we reading these reports more critically? Journalists, get your shit together and report data responsibly. Academics, be a little more skeptical of stories that simply post screenshots of a chart coupled with inciting prose from conclusions drawn, badly, from hastily scanned data.

Is this tool useless? No. Is it fun to futz around with? Yes.

Is it being reported and understood well? Resounding no!

I think even our students would agree with me: that’s just f*cked up.

Categories: guest post, journalism, math, math education, statistics, women in math

Video cameras won’t solve the #EricGarner situation, but they will help

December 10, 2014 Cathy O'Neil, mathbabe 12 comments

As many thoughtful people have pointed out already, Eric Garner’s case proves that video evidence is not a magic bullet to combat and punish undue police brutality. The Grand Jury deemed such evidence insufficient for an indictment, even if the average person watching the video cannot understand that point of view.

Even so, it would be a mistake to dismiss video cameras on police as entirely a bad idea. We shouldn’t assume no progress could be made simply because there’s an example which lets us down. I am no data evangelist, but neither am I someone who dismisses data. It can be powerful and we should use its power when we can.

And before I try to make the general case for video cameras on cops, let me make one other point. The Eric Garner video has already made progress in one arena, namely public opinion. Without the video, we wouldn’t be seeing nationwide marches protesting the outrageous police conduct.

A few of my data nerd thoughts:

If cops were required to wear cameras, we’d have more data. We should think of that as building evidence, with the potential to use it to sway grand juries, criminal juries, judges, or public opinion.
One thing I said time after time to my students this summer at the data journalism program I directed is the following: a number by itself is usually meaningless. What we need is to compare that number to a baseline. The baseline could be the average number for a population, or the median, or some range of 5th to 95th percentiles, or how it’s changed over time, or whatnot. But in order to gauge any baseline you need data.
So in the case of police videotapes, we’d need to see how cops usually handle a situation, or how cops from other precincts handle similar situations, or the extremes of procedures in such situations, or how police have changed their procedures over time. And if we think the entire approach is heavy handed, we can also compare the data to the police manual, or to other countries, or what have you. More data is better for understanding aggregate approaches, and aggregate understanding makes it easier to fit a given situation into context.
Finally, the cameras might also change cops’ behavior when they are policing, since they would know they are being taped. That’s believable but we shouldn’t depend on it.
And also, we have to be super careful about how we use video evidence, and make sure it isn’t incredibly biased due to careful and unfair selectivity by the police. So, some cops are getting in trouble for turning off their cameras at critical moments, or not turning them on ever.
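The baseline idea in the list above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names are hypothetical, and the history is a placeholder run of numbers standing in for whatever measurement you would actually pull from footage.

```python
import statistics


def baseline_summary(values):
    """Summarize a set of past measurements so a new number can be put in context."""
    # n=20 yields 19 cut points; the first is the 5th percentile, the last the 95th.
    cuts = statistics.quantiles(values, n=20, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "p05": cuts[0],
        "p95": cuts[-1],
    }


def outside_typical_range(value, summary):
    """True if `value` falls outside the 5th-to-95th percentile band."""
    return value < summary["p05"] or value > summary["p95"]


# Placeholder history (1 through 21), not real policing data.
history = list(range(1, 22))
summary = baseline_summary(history)

print(summary)
print(outside_typical_range(30, summary))
print(outside_typical_range(10, summary))
```

A single number only becomes informative once it is set against a summary like this, which is exactly why the data has to be collected first.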

Let’s take a step back and think about how large-scale data collection and mining works, for example in online advertising. A marketer collects a bunch of data. And knowing a lot about one person doesn’t necessarily help them, but if they know a lot about most people, it statistically speaking does help them sell stuff. A given person might not be in the mood to buy, or might be broke, but if you dangle a desirable good in front of a whole slew of people, you make sales. It’s a statistical play which, generally speaking, works.

In this case, we are the marketer, and the police are the customers. We want a lot of information about how they do their job so when the time comes we have some sense of “normal police behavior” and something to compare a given incident to or a given cop to. We want to see how they do or don’t try to negotiate peace, and with whom. We want to see the many examples of good and great policing as well as the few examples of terrible, escalating policing.

Taking another step back, if the above analogy seems weird, there’s a reason for that. In general data is being collected on the powerless, on the consumers, on the citizens, or the job applicants, and we should be pushing for more and better data to be collected instead on the powerful, on the police, on the corporations, and on the politicians. There’s a reason there is a burgeoning privacy industry for rich and powerful people.

For example, we want to know how many people have been killed by the police, but even a statistic that important is incredibly hard to come by (see this and this for more on that issue). However, it’s never been easier for the police to collect data on us and act on suspicions of troublemakers, however that is defined.

Another example – possibly the most extreme example of all – comes this very week from the reports on the CIA and torture. That is data and evidence we should have gotten much earlier, and as the New York Times demands, we should be able to watch videos of waterboarding and decide for ourselves whether it constitutes torture.

So yes, let’s have video cameras on every cop. It is not a panacea, and we should not expect it to solve our problems overnight. In fact video evidence, by itself, will not solve any problem. We should think of it as a mere evidence collecting device, and use it in the public discussion of how the most powerful among us treat the least powerful. But more evidence is better.

Finally, there’s the very real question of who will have access to the video footage, and whether the public will be allowed to see it at all. It’s a tough question, which will take a while to sort out (FOIL requests!), but until then, everyone should know that it is perfectly legal to videotape police in every place in this country. So go ahead and make a video with your camera when you suspect weird behavior.

Categories: #OWS, data science, journalism, statistics

Alt Banking in Huffington Post #OWS

November 11, 2014 Cathy O'Neil, mathbabe 1 comment

Great news! The Alt Banking group had a piece published today in the Huffington Post entitled With Economic Justice For All, about our hopes for the next Attorney General.

For the sake of the essay, we coined the term “marble columns” to mean the opposite of “broken windows.” Instead of getting arrested for nothing, you never get arrested, as long as you work at a company with marble columns. For more, take a look at the whole piece!

Also, my good friend and bandmate Tom Adams (our band, the Tomtown Ramblers, is named after him) will be covering for me on mathbabe for the next few days while I’m away in Haiti. Please make him feel welcome!

Categories: #OWS, economics, finance, journalism, rant

Guest post: The dangers of evidence-based sentencing

October 21, 2014 Cathy O'Neil, mathbabe 15 comments

This is a guest post by Luis Daniel, a research fellow at The GovLab at NYU where he works on issues dealing with tech and policy. He tweets @luisdaniel12. Crossposted at the GovLab.

What is Evidence-based Sentencing?

For several decades, parole and probation departments have been using research-backed assessments to determine the best supervision and treatment strategies for offenders to try and reduce the risk of recidivism. In recent years, state and county justice systems have started to apply these risk and needs assessment tools (RNA’s) to other parts of the criminal process.

Of particular concern is the use of automated tools to determine imprisonment terms. This relatively new practice of applying RNA information into the sentencing process is known as evidence-based sentencing (EBS).

What the Models Do

The different parameters used to determine risk vary by state, and most EBS tools use information that has been central to sentencing schemes for many years such as an offender’s criminal history. However, an increasing number of states have been utilizing static factors such as gender, age, marital status, education level, employment history, and other demographic information to determine risk and inform sentencing. Especially alarming is the fact that the majority of these risk assessment tools do not take an offender’s particular case into account.

This practice has drawn sharp criticism from Attorney General Eric Holder who says “using static factors from a criminal’s background could perpetuate racial bias in a system that already delivers 20% longer sentences for young black men than for other offenders.” In the annual letter to the US Sentencing Commission, the Attorney General’s Office states that “utilizing such tools for determining prison sentences to be served will have a disparate and adverse impact on offenders from poor communities already struggling with social ills.” Other concerns cite the probable unconstitutionality of using group-based characteristics in risk assessments.

Where the Models Are Used

It is difficult to precisely quantify how many states and counties currently implement these instruments, although at least 20 states have implemented some form of EBS. Some of the states or states with counties that have implemented some sort of EBS (any type of sentencing: parole, imprisonment, etc) are: Pennsylvania, Tennessee, Vermont, Kentucky, Virginia, Arizona, Colorado, California, Idaho, Indiana, Missouri, Nebraska, Ohio, Oregon, Texas, and Wisconsin.

The Role of Race, Education, and Friendship

Overwhelmingly states do not include race in the risk assessments since there seems to be a general consensus that doing so would be unconstitutional. However, even though these tools do not take race into consideration directly, many of the variables used such as economic status, education level, and employment correlate with race. African-Americans and Hispanics are already disproportionately incarcerated and determining sentences based on these variables might cause further racial disparities.

The very socioeconomic characteristics such as income and education level used in risk assessments are the characteristics that are already strong predictors of whether someone will go to prison. For example, high school dropouts are 47 times more likely to be incarcerated than people in their similar age group who received a four-year college degree. It is reasonable to suspect that courts that include education level as a risk predictor will further exacerbate these disparities.

Some states, such as Texas, take into account peer relations and consider associating with other offenders a “salient problem”. Considering that Texas is in 4th place in the rate of people under some sort of correctional control (parole, probation, etc) and that the rate is 1 in 11 for black males in the United States, it is likely that this metric would disproportionately affect African-Americans.

Sonja Starr’s paper

Even so, in some cases, socioeconomic and demographic variables receive significant weight. In her forthcoming paper in the Stanford Law Review, Sonja Starr provides a telling example of how these factors are used in presentence reports. From her paper:

For instance, in Missouri, pre-sentence reports include a score for each defendant on a scale from -8 to 7, where “4-7 is rated ‘good,’ 2-3 is ‘above average,’ 0-1 is ‘average’, -1 to -2 is ‘below average,’ and -3 to -8 is ‘poor.’ Unlike most instruments in use, Missouri’s does not include gender. However, an unemployed high school dropout will score three points worse than an employed high school graduate—potentially making the difference between “good” and “average,” or between “average” and “poor.” Likewise, a defendant under age 22 will score three points worse than a defendant over 45. By comparison, having previously served time in prison is worth one point; having four or more prior misdemeanor convictions that resulted in jail time adds one point (three or fewer adds none); having previously had parole or probation revoked is worth one point; and a prison escape is worth one point. Meanwhile, current crime type and severity receive no weight.
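The rating bands in that passage can be written out as a small lookup, which makes it easy to see how a three-point swing moves a defendant between categories. This is a sketch of the band boundaries quoted above only; the function name is hypothetical, and the individual point weights are not modeled.

```python
def missouri_rating(score):
    """Map a score on the -8 to 7 scale to the rating band quoted from Starr's paper."""
    if not -8 <= score <= 7:
        raise ValueError("score must be between -8 and 7")
    if score >= 4:
        return "good"
    if score >= 2:
        return "above average"
    if score >= 0:
        return "average"
    if score >= -2:
        return "below average"
    return "poor"


# A three-point drop, as in the unemployed-dropout comparison, can shift the band:
print(missouri_rating(4), "->", missouri_rating(4 - 3))
print(missouri_rating(0), "->", missouri_rating(0 - 3))
```

Starting at 4, three points takes a defendant from "good" to "average"; starting at 0, the same three points takes them from "average" to "poor".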

Starr argues that such simple point systems may “linearize” a variable’s effect. In the underlying regression models used to calculate risk, some of the variable’s effects do not translate linearly into changes in probability of recidivism, but they are treated as such by the model.
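To make the "linearize" point concrete, here is a small sketch that assumes the underlying model is a logistic regression. That is an assumption for illustration; the source only says "regression models". The coefficient and the baseline log-odds values are invented.

```python
import math


def logistic(z):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def probability_change(baseline_log_odds, effect):
    """Change in predicted probability when `effect` is added to the log-odds."""
    return logistic(baseline_log_odds + effect) - logistic(baseline_log_odds)


effect = 1.0  # hypothetical coefficient for some factor

# The same coefficient moves the probability by different amounts
# depending on where the defendant starts.
low_risk = probability_change(-3.0, effect)  # baseline probability about 5%
mid_risk = probability_change(0.0, effect)   # baseline probability 50%

print(round(low_risk, 3), round(mid_risk, 3))
```

Under this assumption the same factor adds roughly 7 percentage points for the low-baseline defendant and roughly 23 for the mid-baseline one, whereas a fixed point value treats the two cases identically.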

Another criticism Starr makes is that they often make predictions on an individual based on averages of a group. Starr says these predictions can predict with reasonable precision the average recidivism rate for all offenders who share the same characteristics as the defendant, but that does not make it necessarily useful for individual predictions.

The Future of EBS Tools

The Model Penal Code is currently in the process of being revised and is set to include these risk assessment tools in the sentencing process. According to Starr, this is a serious development because it reflects the increased support of these practices and because of the Model Penal Code’s great influence in guiding penal codes in other states. Attorney General Eric Holder has already spoken against the practice, but it will be interesting to see whether his successor will continue this campaign.

Even if EBS can accurately measure risk of recidivism (which is uncertain according to Starr), does that mean that a greater prison sentence will result in fewer future offenses after the offender is released? EBS does not seek to answer this question. Further, if knowing there is a harsh penalty for a particular crime is a deterrent to committing said crime, wouldn’t adding more uncertainty to sentencing (EBS tools are not always transparent and sometimes proprietary) effectively remove this deterrent?

Even though many questions remain unanswered and while several people have been critical of the practice, it seems like there is great support for the use of these instruments. They are especially easy to support when they are overwhelmingly regarded as progressive and scientific, something Starr refutes. While there is certainly a place for data analytics and actuarial methods in the criminal justice system, it is important that such research be applied with the appropriate caution. Or perhaps not at all. Even if the tools had full statistical support, the risk of further exacerbating an already disparate criminal justice system should be enough to halt this practice.

Both Starr and Holder believe there is a strong case to be made that the risk prediction instruments now in use are unconstitutional. But EBS has strong advocates, so it’s a difficult subject. Ultimately, evidence-based sentencing is used to determine a person’s sentencing not based on what the person has done, but who that person is.

Categories: data science, guest post, journalism, modeling

Bad Paper by Jake Halpern

October 16, 2014 Cathy O'Neil, mathbabe 7 comments

Yesterday I finished Jake Halpern’s new book, Bad Paper: Chasing Debt From Wall Street To The Underground.

It’s an interesting series of close-up descriptions of the people who have been buying and selling revolving debt since the credit crisis, as well as the actual business of debt collecting. He talks about the very real problem, for debt collectors, of having no proof of debt, of having other people who have stolen your debt trying to collect on it at the same time, and of course the fact that some debt collectors resort to illegal threats and misleading statements to get debtors – or possibly ex-debtors, it’s never entirely clear – to pay up or suffer the consequences. It's an arms race of quasi-legal and illegal cultural practices.

Halpern does a good job explaining the plight of the debt collectors, including the people hired for the call centers. It’s the poor pitted against the poorer here, a dirty fight where information asymmetry is absolutely essential to the profit margin of any given tier of the system.

Halpern outlines those tiers well, as well as the interesting lingo created by this subculture centered, at least until recently, in Buffalo, New York. People at the top are credit card companies themselves or hedge fund buyers from credit card companies; in other words, people who get “fresh debt” lists in the form of Excel spreadsheets, where the people listed have recently stopped paying and might have some resources to pull. Then there are people who deal in older debt, which is harder to collect on. After that are people who have yet older debt which may or may not be stolen, so other collectors might simultaneously be picking over the carcasses. At the very bottom of the pile, from Halpern’s perspective, come the lawyers. They bring debtors to court and try to garnish wages.

Somewhat buried at the very end of Halpern’s book is some quite useful information for the debtors. So for example, if you ever get dragged to court by a debt collection lawyer,

definitely show up (or else they will just garnish your wages)
ask for proof that they own the debt and how you spent it. They will likely not have such documentation and will dismiss your case.

Overall Bad Paper is a good book, and it explains a lot of interesting and useful information, but from my perspective, being firmly on the side of (most of) the debtors, everyone who gets a copy of the book should also get a copy of Strike Debt’s Debt Resistors’ Operation Manual, which has way more useful information, and even form letters, for the debtor.

As far as real solutions, we see the usual problems: underfunded and impotent regulators in the FTC, the CFPB, and the Attorney General’s office, as well as ridiculously small fines when actually caught that amount to fractions of the profit already made by illegal tactics. Everyone is feasting, even when they don’t find much meat on the bones.

Given how big a problem this is, and how many people are being pursued by debt collectors, you’d think they might set up a system of incentives so lawyers can make money by nailing illegal actions instead of just leveraging outdated information and trying to squeeze poor people out of their paychecks.

The bigger problem, once again, is that so many people are flat broke and largely go into debt for things like emergency expenses. And yes, of course there are people who buy a bunch of things they don’t need and then refuse to pay off their debts – Halpern profiles one such person – but the vast majority of the people we’re talking about are the struggling poor. It would be nice to see our country become a place where we don’t need so much damn debt in the first place; then the scavengers wouldn’t have so many rubbish piles to live off of.

Categories: #OWS, economics, journalism

Reverse-engineering Chinese censorship

October 9, 2014 Cathy O'Neil, mathbabe 1 comment

This recent paper written by Gary King, Jennifer Pan, and Margaret Roberts explores the way social media posts are censored in China. It’s interesting, take a look, or read this article on their work.

Here’s their abstract:

Existing research on the extensive Chinese censorship organization uses observational methods with well-known limitations. We conducted the first large-scale experimental study of censorship by creating accounts on numerous social media sites, randomly submitting different texts, and observing from a worldwide network of computers which texts were censored and which were not. We also supplemented interviews with confidential sources by creating our own social media site, contracting with Chinese firms to install the same censoring technologies as existing sites, and—with their software, documentation, and even customer support—reverse-engineering how it all works. Our results offer rigorous support for the recent hypothesis that criticisms of the state, its leaders, and their policies are published, whereas posts about real-world events with collective action potential are censored.

Interesting that they got so much help from the Chinese to censor their posts. Also keep in mind a caveat from the article:

Yu Xie, a sociologist at the University of Michigan, Ann Arbor, says that although the study is methodologically sound, it overemphasizes the importance of coherent central government policies. Political outcomes in China, he notes, often rest on local officials, who are evaluated on how well they maintain stability. Such officials have a “personal interest in suppressing content that could lead to social movements,” Xie says.

I’m a sucker for reverse-engineering powerful algorithms, even when there are major caveats.

Categories: data journalism, journalism, modeling, news

When the story IS the interaction with the public

August 21, 2014 Cathy O'Neil, mathbabe 1 comment

Here at the Lede Program we’ve been getting lots of different perspectives on what data journalism is and what it could be. As usual I will oversimplify for the sake of clarity, and apologies in advance to anyone I might offend.

The old school version of data journalism, which is called computer assisted reporting, maintains that a data story is first and foremost a story and should be viewed as such: you are investigating and interrogating the data as you would a witness, but the data isn’t itself a story, but rather a way of gathering evidence for the claims posed in the story. Every number cited needs to be independently supported with a secondary source.

Really important journalism lives in this context and is supported by the data, and the journalists in this realm are FOIA experts and speak truth to power in an exciting way. Think leaks and whistleblowers.

The new school vision of data journalism – again, entirely oversimplified – is about creating interesting data interactives that allow people to see how the news affects them, whether that means a map of “stuff happening” where they can see the stuff happening near them, or a big dataset that people can interact with in a tailored way, or a jury duty quiz that allows people to see how answers might get them kicked off or kept on a jury.

I imagine that some of these new-fangled approaches don’t even seem like stories at all to the old-school journalists, who want to see a bad guy caught, or a straight-up story told with a twist and a surprise and a “human face”. I’m not sure many of them would even get past the pitch stage if proffered to a curmudgeonly editor (and all editors are curmudgeonly, that’s just a fact).

The new interactive stories do not tell one story. Instead, they tell a bunch of stories to a bunch of people, and that interaction itself becomes the story. They also educate the public in a somewhat untamed way: by interacting with a database a reader can see variations in time, or in space, or in demographic, at least if the data is presented carefully.

Similarly, by seeing how each question on a jury duty quiz nudges you towards the plaintiff or the defendant, you can begin to see how seemingly innocuous information collected about you accumulates, which is how profiles are formed, on and offline.

Categories: data journalism, journalism

The problem with charter schools

July 29, 2014 Cathy O'Neil, mathbabe 13 comments

Today I read this article written by Allie Gross (hat tip Suresh Naidu), a former Teach for America teacher whose former idealism has long been replaced by her experiences in the reality of education in this country. Her article is entitled The Charter School Profiteers.

It’s really important, and really well written, and just one of the articles in the online magazine Jacobin that I urge you to read and to subscribe to. In fact that article is part of a series (here’s another which focuses on charter schools in New Orleans) and it comes with a booklet called Class Action: An Activist Teacher’s Handbook. I just ordered a couple of hard copies.

I’d really like you to read the article, but as a teaser here’s one excerpt, a rant which she completely backs up with facts on the ground:

You haven’t heard of Odeo, the failed podcast company the Twitter founders initially worked on? Probably not a big deal. You haven’t heard about the failed education ventures of the person now running your district? Probably a bigger deal.

When we welcome schools that lack democratic accountability (charter school boards are appointed, not elected), when we allow public dollars to be used by those with a bottom line (such as the for-profit management companies that proliferate in Michigan), we open doors for opportunism and corruption. Even worse, it’s all justified under a banner of concern for poor public school students’ well-being.

While these issues of corruption and mismanagement existed before, we should be wary of any education reformer who claims that creating an education marketplace is the key to fixing the ills of DPS or any large city’s struggling schools. Letting parents pick from a variety of schools does not weed out corruption. And the lax laws and lack of accountability can actually exacerbate the socioeconomic ills we’re trying to root out.

Categories: education, journalism, modeling, rant

Surveillance in NYC

July 14, 2014 Cathy O'Neil, mathbabe 8 comments

There’s a CNN video news story explaining how the NYC Mayor’s Office of Data Analytics is working with private start-up Placemeter to count and categorize New Yorkers, often with the help of private citizens who install cameras in their windows. Here’s a screenshot from the Placemeter website:

From placemeter.com

You should watch the video and decide for yourself whether this is a good idea.

Personally, it disturbs me, but perhaps because of my priors on how much we can trust other people with our data, especially when it’s in private hands.

To be more precise, there is, in my opinion, a contradiction coming from the Placemeter representatives. On the one hand they try to make us feel safe by saying that, after gleaning a body count with their video tapes, they dump the data. But then they turn around and say that, in addition to counting people, they will also categorize people: gender, age, whether they are carrying a shopping bag or pushing strollers.

That’s what they are talking about anyway, but who knows what else? Race? Weight? Will they use face recognition software? Who will they sell such information to? At some point, after mining videos enough, it might not matter if they delete the footage afterwards.

Since they are a private company I don’t think such information on their data methodologies will be accessible to us via Freedom of Information Laws either. Or, let me put that another way. I hope that MODA sets up their contract so that such information is accessible via FOIL requests.

Categories: data science, journalism, modeling, news

What constitutes evidence?

July 7, 2014 Cathy O'Neil, mathbabe 30 comments

My most recent Slate Money podcast with Felix Salmon and Jordan Weissmann was more than usually combative. I mean, we pretty much always have disagreements, but Friday it went beyond the usual political angles.

Specifically, Felix thought I was jumping too quickly towards a dystopian future with regards to medical data. My claim was that, now that the ACA has motivated hospitals and hospital systems to keep populations healthy – a good thing in itself – we’re seeing dangerous side-effects involving the proliferation of health profiling and things like “health scores” attached to people much like we now have credit scores. I’m worried that such scores, which are created using data not covered under HIPAA, will be used against people when they try to get a job.

Felix asked me to point to evidence of such usage.

Of course, it’s hard to do that, partly because it’s just the beginning of such data collection – although the FTC’s recent report pointed to data warehouses that already put people into categories such as “diabetes interest” – and also because it’s proprietary all the way down. In other words, web searches and the like are being legally collected and legally sold and then it’s legal to use risk scores or categories to filter job applications. What’s illegal is to use HIPAA-protected data such as disability status to remove someone from consideration for a job, but that’s not what’s happening.

Anyhoo, it’s made me think. Am I a conspiracy theorist for worrying about this? Or is Felix lacking imagination if he requires evidence to believe it? Or some combination? This is super important to me because if I can’t get Felix, or someone like Felix, to care about this issue, I’m afraid it will be ignored.

This kind of thing came up a second time on that same show, when Felix complained that the series of articles (for example this one from NY Magazine) talking about money laundering in New York real estate also lacked evidence. But that’s also tricky since the disclosure requirements on real estate are not tight. In other words, they are avoiding collecting evidence of money laundering, so it’s hard to complain there’s a lack of data. From my perspective the journalists investigating this article did a good job finding examples of laundering and showing it was easy to set up (especially in Delaware). But Felix wasn’t convinced.

It’s a general question I have, actually, and I’m glad to be involved with the Lede Program because it’s actually my job to think about this kind of thing, especially in the context of journalism. Namely, when do we require data – versus anecdotal evidence – to believe in something? And especially when the data is being intentionally obscured?

Categories: data journalism, education, journalism

Update on the Lede Program

June 11, 2014 Cathy O'Neil, mathbabe 6 comments

My schedule nowadays is to go to the Lede Program classes every morning from 10am until 1pm, then office hours, when I can, from 2-4pm. The students are awesome and are learning a huge amount in a super short time.

So for instance, last time I mentioned we set up IPython notebooks on the cloud, on Amazon EC2 servers. After getting used to the various kinds of data structures in Python like integers and strings and lists and dictionaries, and some simple for loops and list comprehensions, we started examining regular expressions and we searched the old Enron emails for things like social security numbers and words that had four or more vowels in a row (turns out that always means you’re really happy as in “woooooohooooooo!!!” or really sad as in “aaaaaaarghghgh”).
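For the curious, here is roughly what those two searches might look like. This is a sketch, not the actual class notebook: the sample text, the pattern names, and the helper functions are all made up for illustration, and the social security pattern only checks the familiar three-two-four digit shape, not whether the number is valid.

```python
import re

# Hypothetical stand-in for an email body; this is not text from the Enron corpus.
sample = (
    "Deal closed, woooooohooooooo!!! My SSN is 123-45-6789, "
    "call 555-1234. Then the server died, aaaaaaarghghgh"
)

# Strings shaped like a social security number: 3 digits, 2 digits, 4 digits.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Whole words that contain four or more vowels in a row, ignoring case.
VOWEL_RUN_PATTERN = re.compile(r"\b\w*[aeiou]{4,}\w*\b", re.IGNORECASE)


def find_ssn_like(text):
    """Return every substring that has the shape of a social security number."""
    return SSN_PATTERN.findall(text)


def find_vowel_runs(text):
    """Return every word containing a run of at least four vowels."""
    return VOWEL_RUN_PATTERN.findall(text)


print(find_ssn_like(sample))
print(find_vowel_runs(sample))
```

Note that the phone-number fragment is skipped because it doesn't have the three-two-four shape, and ordinary words like "closed" never come close to four vowels in a row.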

Then this week we installed git and started working in an editor and using the command line, which is exciting, and then we imported pandas and started to understand dataframes and series and boolean indexes. At some point we also plotted something in matplotlib. We had a nice discussion about unsupervised learning and how such techniques relate to surveillance.

My overall conclusion so far is that when you have a class of 20 people installing git, everything that can go wrong does (versus if you do it yourself, then just anything that could go wrong might), and also that there really should be a better viz tool than matplotlib. Plus my Lede students are awesome.

Categories: journalism, open source tools

Was Jill Abramson’s firing a woman thing?

May 15, 2014 Cathy O'Neil, mathbabe 30 comments

Need to be both nerdy and outraged today.

I’ve noticed something. When something shitty happens to me, and I’m complaining to a group of friends about it, I sometimes say something like “that only happened to me because I’m a woman.”

Now, first of all, I want to be clear, I’m no victim. I don’t let sexism get me down. In fact when I say something like that it usually is a coping mechanism to separate that person’s actions from my own actions, and to help me figure out what to do next. Usually I let it slide off of me and continue on my merry way.

But here’s where it’s weird. If I’m with a bunch of women friends, their immediate reaction is always the same: “hell yes, that bitch/ bastard is just a sexist fuck.” But if I’m with a bunch of male friends of mine, the reaction is very likely to be different: “oh, I don’t think there’s any reason to assume it was sexist. That guy/ girl is just an asshole.”

What it comes down to is priors. My prior is that there is sexism in the world, and it happens all the fucking time, especially to women with perceived power (or to women with no power whatsoever), and so when someone treats me or someone else badly, I do assume we should look into the sexism angle. It’s a natural choice, and Occam’s razor suggests it is involved.

So when Jill Abramson got fired, a bunch of the world’s women were like, those fuckers fired her because she is a powerful, take-no-bullshit woman, and if she’d been a man she would have been expected to act like a dick, but because she’s a woman they couldn’t handle it.

And a bunch of the world’s men were like, wow, I wonder what happened?

So, yes, now I have a prior on people’s priors on sexism, and I think men’s and women’s sexism priors are totally different. I can even explain it.

Men are men, so they don’t experience sexism. So they don’t update their priors like women do. Plus, because there is rarely a moment when an event or reaction is officially deemed “sexist,” men even categorize events differently than women (as discussed above), so even when they do update their prior, it is differently updated, partly because their prior is that nothing is sexist unless proven to be, since it’s so freaking unlikely, according to their prior.

Ezra Klein isn’t speculating, for example, but Emily Bell is, and I’m with her. That just strengthened my priors about other people’s priors.

Categories: journalism, statistics

The Lede Program has awesome faculty

April 18, 2014 Cathy O'Neil, mathbabe 7 comments

A few weeks ago I mentioned that I’m the Program Director for the new Lede Program at the Columbia Graduate School of Journalism. I’m super excited to announce that I’ve found amazing faculty for the summer part of the program, including:

Jonathan Soma, who will be the primary instructor for Basic Computing and for Algorithms
Dennis Tenen, who will be helping Soma in the first half of the summer with Basic Computing
Chris Wiggins, who will be helping Soma in the second half of the summer with Algorithms
An amazing primary instructor for Databases who I will announce soon,
Matthew Jones, who will help that amazing yet-to-be-announced instructor in Data and Databases
Three amazing TAs: Charles Berret, Sophie Chou, and Josh Vekhter (who doesn’t have a website!).

I’m planning to teach The Platform with the help of a bunch of generous guest lecturers (please make suggestions or offer your services!).

Applications are open now, and we’re hoping to get amazing students to enjoy these amazing faculty and the truly innovative plan they have for the summer (and I don’t use the word “innovative” lightly!). We’ve already gotten some super strong applications and made a couple offers of admission.

Also, I was very pleased yesterday to see a blogpost I wrote about the genesis and the goals of the program be published in PBS’s MediaShift.

Finally, it turns out I’m a key influencer, according to The Big Roundtable.

Categories: data journalism, journalism, open source tools

Journalism after Snowden

January 31, 2014 Cathy O'Neil, mathbabe 8 comments

Last night I was lucky enough to grab a seat across Broadway at an event put on by Columbia Journalism School’s Tow Center called “Journalism after Snowden.”

It featured four distinguished panelists:

Jill Abramson Executive Editor, The New York Times
Janine Gibson Editor-in-Chief, Guardian U.S.
David Schulz Outside Counsel to The Guardian and Partner, Levine, Sullivan Koch & Schulz LLP
Cass Sunstein Member, President Obama’s Review Group on Intelligence and Communications Technologies and Robert Walmsley University Professor, Harvard University

First Janine talked about receiving the documents from Snowden, or “the source” as he was called, and spending a bunch of time with her team in verifying the documents as well as focusing on exactly two questions:

Is this story true?
Is this story in the public’s interest?

She and her team decided it passed both those tests and they published it. Then Jill Abramson chimed in to talk about how the New York Times got in on the story as well.

David Schulz, and also Lee Bollinger who started out the evening, framed the legal issues around newspapers publishing things in the context of national security here in the U.S., and although much of it was over my head I came away with the distinct impression that in this country, journalists have historically had a protected space.

However, there have been exceptions recently, and very recently Director of National Intelligence James Clapper insinuated that dozens of journalists reporting on documents leaked by NSA whistleblower Edward Snowden were “accomplices” to a crime.

Those recent events, and Obama’s general campaign against whistleblowers, which is in direct contradiction to his campaign promises, have had a chilling effect on reporting and on reporters who work on national security issues, according to NY Times Executive Editor Jill Abramson.

There was some discussion about how difficult it was to have secure communication between Snowden and journalists, given the situation, and how crucial it is to be able to do so for journalists in order to protect their sources. The question came up of whether it even makes sense for a journalist to suggest to a source that they’d be protected, given how much surveillance now exists.

My favorite line of the night came when David Schulz pointed out that we normal citizens might not think we care about having secure communications, since we don’t intend to do top secret messaging, but even so the lack of secure messaging systems for other people affects what we learn about the world.

Finally, there was a poll taken by the moderator Emily Bell: are we better off because of Snowden? Not all of the panelists agreed, or rather Jill, Janine, and David seemed to think it was obvious but Cass demurred, which I guess was consistent with his being on a Review Group for Obama.

Personally, I don’t think it’s super cut and dried, but I do think we need to have people like Snowden, and whistleblowers more generally, and that in any case journalists absolutely need legal protection to do their jobs.

One last personal comment: I find it absolutely amazing that an entire profession like journalism would actually consider the public good as a major question they put before them before they choose what to work on. I’m coming from inside the tech industry and finance, where the only question that is ever asked is whether an idea is profitable and, secondarily, legal. It’s a refreshing perspective, although I’m guessing somewhat misleading.

Categories: journalism, news
