NYCLU: Stop Question and Frisk data

October 15, 2011 Cathy O'Neil, mathbabe

As I mentioned yesterday, I’m the data wrangler for the Data Without Borders datadive this weekend. There are three NGOs participating: NYCLU (mine), MIX, and UN Global Pulse. The organizations all pitched their data and their questions to the crowd of nerds last night, and this morning we are meeting bright and early (8am) to start crunching.

I’m particularly psyched to be working with NYCLU on Stop and Frisk data. The women I met from NYCLU last night had spent time at Occupy Wall Street the previous day giving out water and information to the protesters. How cool!

The data is available here. It’s zipped in .por format, the SPSS portable file format, which is to say it was collected and used in SPSS, a statistics package that’s not open source. I wanted to get it into csv format for the data miners this morning, but I have been having trouble. R can sometimes handle .por files, but my install of R, at least, is having trouble with the years 2006-2009. Then we tried installing PSPP, an open source alternative to SPSS. It seemed to be able to import the .por files and export them as csv, in the sense that it didn’t throw any errors, but when we actually looked at the output we saw major flaws. Finally we found a program called Stat/Transfer, which seems to work (you can download a trial version for free), but unless you pay $179 for the package, it doesn’t transfer all of the lines of the file for you.
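Since an export can finish without errors and still be broken, it helps to sanity-check the csv before handing it to anyone. Here is a minimal sketch of the kind of check that would catch the problems described above, using only Python's standard library. The function name `check_csv_export` and the column names in the sample are made up for illustration and are not from the NYCLU files; the check also assumes one record per line.

```python
import csv
import io
from collections import Counter


def check_csv_export(text):
    """Return basic health stats for csv text exported from a stats package."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]

    # Rows whose field count differs from the header usually mean broken quoting.
    # Row numbers count the header as row 1.
    ragged = [i for i, row in enumerate(body, start=2) if len(row) != len(header)]

    # Exact duplicate rows can be legitimate, but many of them is suspicious.
    counts = Counter(tuple(row) for row in body)
    duplicates = sum(n - 1 for n in counts.values() if n > 1)

    # Non-printable characters are a common sign of an encoding problem.
    odd = [
        i
        for i, row in enumerate(body, start=2)
        if any(not ch.isprintable() for field in row for ch in field)
    ]

    return {
        "rows": len(body),
        "ragged_rows": ragged,
        "duplicate_rows": duplicates,
        "odd_char_rows": odd,
    }


# Hypothetical sample: one duplicated row, one short row, one stray control character.
sample = "year,pct,frisked\n2006,1,Y\n2006,1,Y\n2007,2\n2008,3,N\x07\n"
report = check_csv_export(sample)
print(report)
```

A clean report does not prove the export is right, but a bad one tells you quickly not to trust the file. Comparing the row count against the record count reported by the original tool would be a sensible extra check.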

If anyone knows how to help, please leave a comment; I’ll be checking them. Of course there could easily be someone at the datadive with SPSS on their computer, which would solve everything. On the other hand, it could also be a major pain, and we could waste lots of precious analyzing time on formatting issues. I may just buckle down and pay the $179, but I’d prefer to find an open source solution.

UPDATE (9:00am): Someone has SPSS! We’re totally getting that data into csv format. Next step: set up Dropbox account to share it.

UPDATE (9:21am): Have met about 5 or 6 adorable nerds who are eager to work on this sexy data set. YES!

UPDATE (10:02am): People are starting to work in small groups. One guy is working on turning the x- and y-coordinates into latitude and longitude so we can use mapping tools more easily. These guys are awesome.

UPDATE (11:37am): Now have a mapping team of 4. Really interesting conversations going on about statistically rigorous techniques for studying human rights abuses. Looking for publicly available data on crime rates, no luck so far… Also looking for police officer IDs in the data set, but they seem to be missing. We’re also looking to extend some basic statistics to the whole data set, aggregated by month rather than by year so we can plot trends. See it all take place on our wiki!
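For the monthly aggregation, here is a rough sketch of the idea in plain Python rather than the code anyone at the datadive actually wrote. It assumes the csv has a column holding the date of each stop; the column name `date`, the ISO date format, and the sample rows are all placeholders, so adjust `date_column` and `date_format` to whatever the real files use.

```python
import csv
import io
from collections import Counter
from datetime import datetime


def stops_per_month(csv_text, date_column="date", date_format="%Y-%m-%d"):
    """Count rows per (year, month), sorted chronologically."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        day = datetime.strptime(row[date_column], date_format)
        counts[(day.year, day.month)] += 1
    # Sorting the (year, month) tuples puts the months in time order for plotting.
    return sorted(counts.items())


# Hypothetical rows, not taken from the real data set.
sample_stops = (
    "date,precinct\n"
    "2006-01-03,75\n"
    "2006-01-19,40\n"
    "2006-02-02,75\n"
    "2007-01-15,23\n"
)
monthly = stops_per_month(sample_stops)
for (year, month), n in monthly:
    print(f"{year}-{month:02d}: {n}")
```

Keying on the (year, month) pair rather than the month alone keeps January 2006 separate from January 2007, which is what you want for a trend line across several years.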

UPDATE (12:24pm): Oh my god, we have a map. We have officer IDs (maybe). We have awesome discussions around what Bayesian priors are reasonable. This is awesome! Lunch soon, where we will discuss our morning, plan for the afternoon, and regroup. Exciting!

UPDATE (2:18pm): Nice. We just had lunch, and I managed to get a sound bite about every current project, and it’s just amazing how many different things are being tried. Awesome. Will update soon.

UPDATE (7:10pm): Holy shit I’ve been inside crunching data all day while the world explodes around me.

Categories: #OWS, data science, open source tools, rant

Comments (5)

Mike Maltz

October 15, 2011 at 11:46 am

A few years ago Jeff Fagan and Andrew Gelman did a study of the NYPD’s stop and frisk tactics: http://www.stat.columbia.edu/~gelman/research/published/frisk9.pdf. You might find it useful to compare your (and your co-analysts’) results with theirs.

- Cathy O'Neil, mathbabe
  
  October 15, 2011 at 11:49 am
  
  Cool, yes! We are looking at that paper and it’s on our wiki, thanks!
  
Roger Witte

October 16, 2011 at 2:58 am

I don’t have personal experience of using it, but a former colleague of mine always recommended CrimeStat for analysing the spatial aspects of geographical data.

Nathaniel Erd

October 16, 2011 at 5:42 am

What exactly were the “Major Flaws” with PSPP? It’s always worked perfectly for me.

- Cathy O'Neil, mathbabe
  
  October 16, 2011 at 6:12 am
  
  We just saw lots of extra characters, and some duplicate lines in the resulting exported csv file.
  