“Data Science” is one of my least favorite tech buzzwords, second to probably “Big Data”, which in my opinion should be always printed followed by a winky face (after all, my data is bigger than yours). It’s mostly a marketing ploy used by companies to attract talented scientists, statisticians, and mathematicians, who, at the end of the day, will probably be working on some sort of advertising problem or the other.
Still, you have to admit, it does have a nice ring to it. Thus the title Democratizing Data Science, a vision paper which I co-authored with two cool Ph.D students at MIT CSAIL, William Li and Ramesh Sridharan.
The paper focuses on the latter part of the situation mentioned above. Namely, how can we direct these data scientists, aka scientists who interact with the data pipeline throughout the problem-solving process (whether they be computer scientists or programmers or statisticians or mathematicians in practice) towards problems focused on societal issues?
In the paper, we briefly define Data Science (asking ourselves what the heck it even means), then question what it means to democratize the field, and to what end that may be achieved. In other words, the current applications of Data Science, a new but growing field, in both research and industry, has the potential for great social impact, but in reality, resources are rarely distributed in a way to optimize the social good.
We’ll be presenting the paper at the KDD Conference next Sunday, August 24th at 11am as a highlight talk in the Bloomberg Building, 731 Lexington Avenue, NY, NY. It will be more like an open conversation than a lecture and audience participation and opinion is very welcome.
The conference on Sunday at Bloomberg is free, although you do need to register. There are three “tracks” going on that morning, “Data Science & Policy”, “Urban Computing”, and “Data Frameworks”. Ours is in the 3rd track. Sign up here!
If you don’t have time to make it, give the paper a skim anyway, because if you’re on Mathbabe’s blog you probably care about some of these things we talk about.
Everyone I know who codes uses stackoverflow.com for absolutely everything.
Just yesterday I met a cool coding chick who was learning python and pandas (of course!) with the assistance of stackoverflow. It is exactly what you need to get stuff working, and it’s better than having a friend to ask, even a highly knowledgable friend, because your friend might be busy or might not know the answer, or even if your friend knew the answer her answer isn’t cut-and-paste-able.
If you are someone who has never used stackoverflow for help, then let me explain how it works. Say you want to know how to load a JSON file into python but you don’t want to write a script for that because you’re pretty sure someone already has. You just search for “import json into python” and you get results with vote counts:
Also, every math nerd I know uses and contributes to mathoverflow.net. It’s not just for math facts and questions, either, there are interesting discussions going on there all the time. Here’s an example of a comment in response to understanding the philosophy behind the claimed proof of the ABC Conjecture:
OK well hold on tight because now there’s a new online forum, but not about coding and not about math. It’s about all the other STEM subjects, which since we’ve removed math might need to be called STE subjects, which is not catchy.
So far only statistics is open, but other stuff is coming very soon. Specifically it covers, or soon will cover, the following fields:
- Cognitive Sciences
- Computer Sciences
- Earth and Planetary Sciences
- Science & Math Education
- History of Science and Mathematics
- Applied Mathematics, and
I’m super excited for this site, it has serious potential to make peoples’ lives better. I wish it had a category for Data Sciences, and for Data Journalism, because I’d probably be more involved in those categories than most of the above, but then again most data science-y questions could be inserted into one of the above. I’ll try to be patient on this one.
Here’s a screen shot of an existing Stats question on the site:
There’s an interesting and horrible New York Time story by Jessica Silver-Greenberg about a PayDay loan syndicate being run out of New York State. The syndicate consists of twelve companies owned by a single dude, Carey Vaughn Brown, with help from a corrupt lawyer and another corrupt COO. Manhattan District Attorneys are charging him and his helpers with usury under New York law.
The complexity of the operation was deliberate and intended to obscure the chain of events that would start with a New Yorker online looking for quick cash online and end with a predatory loan. They’d interface with a company called MyCashNow.com, which would immediately pass their application on to a bunch of other companies in different states or overseas.
Important context: in New York, the usury law caps interest rates at 25 percent annually, and these PayDay operations were charging between 350 and 650 percent annually. Also key, the usury laws apply to where the borrower is, not where the lender is, so even though some of the companies were located (at least on paper) in the West Indies, they were still breaking the law.
They don’t know exactly how big the operation was in New York, but one clue is that in 2012, one of the twelve companies had $50 million in proceeds from New York.
Here’s my question: how did MyCashNow.com advertise? Did it use Google ads, or Facebook ads, or something else, and if so, what were the attributes of the desperate New Yorkers that it looked for to do its predatory work?
One side of this is that vulnerable people were somehow targeted. The other side is that well-off people were not, which meant they didn’t see ads like this, which makes it harder for people like the Manhattan District Attorney to even know about shady operations like this.
There was a recent New York Times op-ed by Sonja Starr entitled Sentencing, by the Numbers (hat tip Jordan Ellenberg and Linda Brown) which described the widespread use – in 20 states so far and growing – of predictive models in sentencing.
The idea is to use a risk score to help inform sentencing of offenders. The risk is, I guess, supposed to tell us how likely the person is to commit another act in the future, although that’s not specified. From the article:
The basic problem is that the risk scores are not based on the defendant’s crime. They are primarily or wholly based on prior characteristics: criminal history (a legitimate criterion), but also factors unrelated to conduct. Specifics vary across states, but common factors include unemployment, marital status, age, education, finances, neighborhood, and family background, including family members’ criminal history.
I knew about the existence of such models, at least in the context of prisoners with mental disorders in England, but I didn’t know how widespread it had become here. This is a great example of a weapon of math destruction and I will be using this in my book.
A few comments:
- I’ll start with the good news. It is unconstitutional to use information such as family member’s criminal history against someone. Eric Holder is fighting against the use of such models.
- It is also presumably unconstitutional to jail someone longer for being poor, which is what this effectively does. The article has good examples of this.
- The modelers defend this crap as “scientific,” which is the worst abuse of science and mathematics imaginable.
- The people using this claim they only use it for as a way to mitigate sentencing, but letting a bunch of rich white people off easier because they are not considered “high risk” is tantamount to sentencing poor minorities more.
- It is a great example of confused causality. We could easily imagine a certain group that gets arrested more often for a given crime (poor black men, marijuana possession) just because the police have that practice for whatever reason (Stop & Frisk). Then model would then consider any such man at a higher risk of repeat offending, but that’s not because any particular person is actually more likely to do it, but because the police are more likely to arrest that person for it.
- It also creates a negative feedback loop on the most vulnerable population: the model will impose longer sentencing on the population it considers most risky, which will in turn make them even riskier in the future, if “length of time in prison previously” is used as an attribute in the model, which is surely is.
- Not to be cynical, but considering my post yesterday, I’m not sure how much momentum will be created to stop the use of such models, considering how discriminatory it is.
- Here’s an extreme example of preferential sentencing which already happens: rich dude Robert H Richards IV raped his 3-year-old daughter and didn’t go to jail because the judge ruled he “wouldn’t fare well in prison.”
- How great would it be if we used data and models to make sure rich people went to jail just as often and for just as long as poor people for the same crime, instead of the other way around?
There’s a CNN video news story explaining how the NYC Mayor’s Office of Data Analytics is working with private start-up Placemeter to count and categorize New Yorkers, often with the help of private citizens who install cameras in their windows. Here’s a screenshot from the Placemeter website:
You should watch the video and decide for yourself whether this is a good idea.
Personally, it disturbs me, but perhaps because of my priors on how much we can trust other people with our data, especially when it’s in private hands.
To be more precise, there is, in my opinion, a contradiction coming from the Placemeter representatives. On the one hand they try to make us feel safe by saying that, after gleaning a body count with their video tapes, they dump the data. But then they turn around and say that, in addition to counting people, they will also categorize people: gender, age, whether they are carrying a shopping bag or pushing strollers.
That’s what they are talking about anyway, but who knows what else? Race? Weight? Will they use face recognition software? Who will they sell such information to? At some point, after mining videos enough, it might not matter if they delete the footage afterwards.
Since they are a private company I don’t think such information on their data methodologies will be accessible to us via Freedom of Information Laws either. Or, let me put that another way. I hope that MODA sets up their contract so that such information is accessible via FOIL requests.
This course begins with the idea that computing tools are the products of human ingenuity and effort. They are never neutral and carry with them the biases of their designers and their design process. “Platform studies” is a new term used to describe investigations into these relationships between computing technologies and the creative or research products that they help to generate. How you understand how data, code, and algorithms affect creative practices can be an effective first step toward critical thinking about technology. This will not be purely theoretical, however, and specific case studies, technologies, and project work will make the ideas concrete.
Since my first class is coming soon, I’m actively thinking about what to talk about and which readings to assign. I’ve got wonderful guest lecturers coming, and for the most part the class will focus on those guest lecturers and their topics, but for the first class I want to give them an overview of a very large subject.
I’ve decided that danah boyd and Kate Crawford’s recent article, Critical Questions for Big Data, is pretty much perfect for this goal. I’ve read and written a lot about big data but even so I’m impressed by how clearly and comprehensively they have laid out their provocations. And although I’ve heard many of the ideas and examples before, some of them are new to me, and are directly related to the theme of the class, for example:
Twitter and Facebook are examples of Big Data sources that offer very poor archiving and search functions. Consequently, researchers are much more likely to focus on something in the present or immediate past – tracking reactions to an election, TV finale, or natural disaster – because of the sheer difficulty or impossibility of accessing older data.
Of course the students in the Lede are journalists, not academic researchers, which the article mostly addresses, and moreover they are not necessarily working with big data per se, but even so they are increasingly working with social media data, and moreover they are probably covering big data even if they don’t directly analyze it. So I think it’s still relevant to them. Or another way to express this is that one thing we will attempt to do in class is examine the extent to which their provocations are relevant.
Here’s another gem, directly related to the Facebook experiment I discussed yesterday:
As computational scientists have started engaging in acts of social science, there is a tendency to claim their work as the business of facts and not interpretation. A model may be mathematically sound, an experiment may seem valid, but as soon as a researcher seeks to understand what it means, the process of interpretation has begun. This is not to say that all interpretations are created equal, but rather that not all numbers are neutral.
In fact, what with this article and that case study, I’m pretty much set for my first day, after combining them with a discussion of the students’ projects and some related statistical experiments.
I also hope to invite at least one of the authors to come talk to the class, although I know they are both incredibly busy. Danah boyd, who recently came out with a book called It’s Complicated: the social lives of networked teens, also runs the Data & Society Research Institute, a NYC-based think/do tank focused on social, cultural, and ethical issues arising from data-centric technological development. I’m hoping she comes and talks about the work she’s starting up there.
I’m super excited about the recent “mood study” that was done on Facebook. It constitutes a great case study on data experimentation that I’ll use for my Lede Program class when it starts mid-July. It was first brought to my attention by one of my Lede Program students, Timothy Sandoval.
My friend Ernest Davis at NYU has a page of handy links to big data articles, and at the bottom (for now) there are a bunch of links about this experiment. For example, this one by Zeynep Tufekci does a great job outlining the issues, and this one by John Grohol burrows into the research methods. Oh, and here’s the original research article that’s upset everyone.
It’s got everything a case study should have: ethical dilemmas, questionable methodology, sociological implications, and questionable claims, not to mention a whole bunch of media attention and dissection.
By the way, if I sound gleeful, it’s partly because I know this kind of experiment happens on a daily basis at a place like Facebook or Google. What’s special about this experiment isn’t that it happened, but that we get to see the data. And the response to the critiques might be, sadly, that we never get another chance like this, so we have to grab the opportunity while we can.