Yesterday was the end of the first half of the Lede Program, and the students presented their projects, which were really impressive. I am hoping some of them will be willing to put them up on a WordPress site or something like that in order to showcase them and so I can brag about them more explicitly. Since I didn’t get anyone’s permission yet, let me just say: wow.
During the second half of the program the students will do another project (or continue their first) as homework for my class. We’re going to start planning for that on the first day, so the fact that they’ve all dipped their toes into data projects is great. For example, during presentations yesterday I heard the following a number of times: “I spent most of my time cleaning my data” or “next time I will spend more time thinking about how to drill down in my data to find an interesting story”. These are key phrases for people learning lessons with data.
Since they are journalists (I’ve learned a thing or two about journalists and their mindset in the past few months) they love projects because they love deadlines and they want something they can add to their portfolio. Recently they’ve been learning lots of geocoding stuff, and coming up they’ll be learning lots of algorithms as well. So they’ll be well equipped to do some seriously cool shit for their final project. Yeah!
In addition to the guest lectures I’m having in The Platform, I’ll also be reviewing prerequisites for the classes many of them will be taking in the Computer Science department in the fall, so for example linear algebra, calculus, and basic statistics. I just bought them all a copy of How to Lie with Statistics as well as The Cartoon Guide to Statistics, both of which I adore. I’m also making them aware of Statistics Done Wrong, which is online. I am also considering The Cartoon Guide to Calculus, which I have but I haven’t read yet.
Keep an eye out for some of their amazing projects! I’ll definitely blog about them once they’re up.
My schedule nowadays is to go to the Lede Program classes every morning from 10am until 1pm, then office hours, when I can, from 2-4pm. The students are awesome and are learning a huge amount in a super short time.
So for instance, last time I mentioned we set up iPython notebooks on the cloud, on Amazon EC2 servers. After getting used to the various kinds of data structures in python like integers and strings and lists and dictionaries, and some simple for loops and list comprehensions, we started examining regular expressions and we played around with the old enron emails for things like social security numbers and words that had four or more vowels in a row (turns out that always means you’re really happy as in “woooooohooooooo!!!” or really sad as in “aaaaaaarghghgh”).
Then this week we installed git and started working in an editor and using the command line, which is exciting, and then we imported pandas and started to understand dataframes and series and boolean indexes. At some point we also plotted something in matplotlib. We had a nice discussion about unsupervised learning and how such techniques relate to surveillance.
My overall conclusion so far is that when you have a class of 20 people installing git, everything that can go wrong does (versus if you do it yourself, then just anything that could go wrong might), and also that there really should be a better viz tool than matplotlib. Plus my Lede students are awesome.
I get asked pretty often whether I “believe” in open data. I tend to murmur a response along the lines of “it depends,” which doesn’t seem too satisfying to me or to the person I’m talking about. But this morning, I’m happy to say, I’ve finally come up with a kind of rule, which isn’t universal. It focuses on power.
Namely, I like data that shines light on powerful people. Like the Sunlight Foundation tracks money and politicians, and that’s good. But I tend to want to protect powerless people, like people who are being surveilled with sensors and their phones. And the thing is, most of the open data focuses on the latter. How people ride the subway or how they use the public park or where they shop.
Something in the middle is crime data, where you have compilation of people being stopped by the police (powerless) and the police themselves (powerful). But here as well you’ll notice an asymmetry on identifying information. Looking at Stop and Frisk data, for example, there’s a precinct to identify the police officer, but no badge number, whereas there’s a bunch of identifying information about the person being stopped which is recorded.
A lot of the time you won’t even find data about powerful people. Worker bees get scored but the managers are somehow above scoring. Bloomberg never scored his lieutenants or himself even when he insisted that teachers should be scored. I like to keep an eye on who gets data collected about them. The power is where the data isn’t.
I guess my point is this. Data and data modeling are not magical tools. They are in fact crude tools, and so to focus on them is misleading and distracting from the real show, which is always about power (and/or money). It’s a boondoggle to think about data when we should be thinking about when and how a model is being wielded and who gets to decide.
One of the biggest problem we face is that all this data is being collected and saved now and the models haven’t even been invented yet. That’s why there’s so much urgency in getting reasonable laws in place to protect the powerless.
A few weeks ago I mentioned that I’m the Program Director for the new Lede Program at the Columbia Graduate School of Journalism. I’m super excited to announce that I’ve found amazing faculty for the summer part of the program, including:
- Jonathan Soma, who will be the primary instructor for Basic Computing and for Algorithms
- Dennis Tenen, who will be helping Soma in the first half of the summer with Basic Computing
- Chris Wiggins, who will be helping Soma in the second half of the summer with Algorithms
- An amazing primary instructor for Databases who I will announce soon,
- Matthew Jones, who will help that amazing yet-to-be-announced instructor in Data and Databases
- Three amazing TA’s: Charles Berret, Sophie Chou, and Josh Vekhter (who doesn’t have a website!).
I’m planning to teach The Platform with the help of a bunch of generous guest lecturers (please make suggestions or offer your services!).
Applications are open now, and we’re hoping to get amazing students to enjoy these amazing faculty and the truly innovative plan they have for the summer (and I don’t use the word “innovative” lightly!). We’ve already gotten some super strong applications and made a couple offers of admission.
Also, I was very pleased yesterday to see a blogpost I wrote about the genesis and the goals of the program be published in PBS’s MediaShift.
Finally, it turns out I’m a key influencer, according to The Big Roundtable.
By now most of you have read about the major bug that was found in OpenSSL, an open source security software toolkit. The bug itself is called the Heartbleed Bug, and there’s lots of information about it and how to fix it here. People are super upset about this, and lots of questions remain.
For example, was it intentionally undermined? Has the NSA deliberately inserted weaknesses into this as well? It seems like the jury is out right now, but if I’m the guy who put in the bug, I’m changing my name and going undercover just in case.
Next, how widely was the weakness exploited? If you’re super worried about stuff, or if you are a particular target of attack, the answer is probably “widely.” The frustrating thing is that there’s seemingly no way to measure or test that assumption, since the attackers would leave no trace.
Here’s what I find interesting the most interesting question: what will the long-term reaction be to open source software? People might think that open source code is a bust after this. They will complain that something like this should never have been allowed to happen – that the whole point of open software is that people should be checking this stuff as it comes in – and it never would have happened if there were people getting paid to test the software.
First of all, it did work as intended, even though it took two years instead of two days like people might have wanted. And maybe this shouldn’t have happened like it did, but I suspect that people will learn this particular lesson really well as of now.
But in general terms, bugs are everywhere. Think about Knight Capital’s trading debacle or the ObamaCare website, just two famous recent problems with large-scale coding projects that aren’t open source.
Even when people are paid to fix bugs, they fix the kind of bugs that cause the software to stop a lot sooner than the kind of bug that doesn’t make anything explode, lets people see information they shouldn’t see, and leaves no trace. So for every Knight’s Capital there are tons of other bugs in software that continue to exist.
In other words it’s more a question of who knows about the bugs and who can exploit them. And of course, whether those weaknesses will ever be exposed to the public at all.
It would be great to see the OpenSSL bug story become, over time, a success story. This would mean that, on the one hand the nerds becoming more vigilant in checking vitally important code, and learning to think like assholes, but also the public would need to acknowledge how freaking hard it is to program.
Today I’d like to mention two ideas I’ve been having recently on how to make being a research mathematician (even) more fun.
1) Mathematicians should consider holding public discussions about papers
First, math nerds, did you know that in statistics they have formal discussions about papers? It’s been a long-standing tradition by the Royal Statistical Society, whose motto is “Advancing the science and application of statistics, and promoting use and awareness for public benefit,” to choose papers by some criterion and then hold regular public discussions about those papers by a few experts who are not the author, about the paper. Then the author responds to their points and the whole conversation is published for posterity.
I think this is a cool idea for math papers too. One thing that kind of depressed me about math is how rarely you’d find people reading the same papers unless you specifically got a group of people together to do so, which was a lot of work. This way the work is done mostly by other people and more importantly the payoff is much better for them since everyone gets a view into the discussion.
Note I’m sidestepping who would organize this whole thing, and how the papers would be chosen exactly, but I’d expect it would improve the overall feeling that I had of being isolated in a tiny math community, especially if the conversations were meant to be penetrable.
2) There should be a good clustering method for papers around topics
This second idea may already be happening, but I’m going to say it anyway, and it could easily be a thesis for someone in CS.
Namely, the idea of using NLP and other such techniques to cluster math papers by topic. Right now the most obvious way to find a “nearby” paper is to look at the graph of papers by direct reference, but you’re probably missing out on lots of stuff that way. I think a different and possibly more interesting way would be to use the text in the title, abstract, and introduction to find papers with similar subjects.
This might be especially useful when you want to know the answer to a question like, “has anyone proved that such-and-such?” and you can do a text search for the statement of that theorem.
The good news here is that mathematicians are in love with terminology, and give weird names to things that make NLP techniques very happy. My favorite recent example which I hear Johan muttering under his breath from time to time is Flabby Sheaves. There’s no way that’s not a distinctive phrase.
The bad news is that such techniques won’t help at all in finding different fields who have come across the same idea but have different names for the relevant objects. But that’s OK, because it means there’s still lots of work for mathematicians.
By the way, back to the question of whether this has already been done. My buddy Max Lieblich has a website called MarXiv which is a wrapper over the math ArXiv and has a “similar” button. I have no idea what that button actually does though. In any case I totally dig the design of the similar button, and what I propose is just to have something like that work with NLP.
This is a guest post by Marc Joffe, the principal consultant at Public Sector Credit Solutions, an organization that provides data and analysis related to sovereign and municipal securities. Previously, Joffe was a Senior Director at Moody’s Analytics.
As Cathy has argued, open source models can bring much needed transparency to scientific research, finance, education and other fields plagued by biased, self-serving analytics. Models often need large volumes of data, and if the model is to be run on an ongoing basis, regular data updates are required.
Unfortunately, many data sets are not ready to be loaded into your analytical tool of choice; they arrive in an unstructured form and must be organized into a consistent set of rows and columns. This cleaning process can be quite costly. Since open source modeling efforts are usually low dollar operations, the costs of data cleaning may prove to be prohibitive. Hence no open model – distortion and bias continue their reign.
Much data comes to us in the form of PDFs. Say, for example, you want to model student loan securitizations. You will be confronted with a large number of PDF servicing reports that look like this. A corporation or well funded research institution can purchase an expensive, enterprise-level ETL (Extract-Transform-Load) tool to migrate data from the PDFs into a database. But this is not much help to insurgent modelers who want to produce open source work.
Data journalists face a similar challenge. They often need to extract bulk data from PDFs to support their reporting. Examples include IRS Form 990s filed by non-profits and budgets issued by governments at all levels.
The data journalism community has responded to this challenge by developing software to harvest usable information from PDFs. Examples include Tabula, a tool written by Knight-Mozilla OpenNews Fellow Manuel Aristarán, extracts data from PDF tables in a form that can be readily imported to a spreadsheet – if the PDF was “printed” from a computer application. Introduced earlier this year, Tabula continues to evolve thanks to the volunteer efforts of Manuel, with help from OpenNews Fellow Mike Tigas and New York Times interactive developer Jeremy Merrill. Meanwhile, DocHive, a tool whose continuing development is being funded by a Knight Foundation grant, addresses PDFs that were created by scanning paper documents. DocHive is a project of Raleigh Public Record and is led by Charles and Edward Duncan.
These open source tools join a number of commercial offerings such as Able2Extract and ABBYY Fine Reader that extract data from PDFs. A more comprehensive list of open source and commercial resources is available here.
Unfortunately, the free and low cost tools available to modelers, data journalists and transparency advocates have limitations that hinder their ability to handle large scale tasks. If, like me, you want to submit hundreds of PDFs to a software tool, press “Go” and see large volumes of cleanly formatted data, you are out of luck.
It is for this reason that I am working with The Sunlight Foundation and other sponsors to stage the PDF Liberation Hackathon from January 17-19, 2014. We’ll have hack sites at Sunlight’s Washington DC office and at RallyPad in San Francisco. Developers can also join remotely because we will publish a number of clearly specified PDF extraction challenges before the hackathon.
Participants can work on one of the pre-specified challenges or choose their own PDF extraction projects. Ideally, hackathon teams will use (and hopefully improve upon) open source tools to meet the hacking challenges, but they will also be allowed to embed commercial tools into their projects as long as their licensing cost is less than $1000 and an unlimited trial is available.
Prizes of up to $500 will be awarded to winning entries. To receive a prize, a team must publish their source code on a GitHub public repository. To join the hackathon in DC or remotely, please sign up at Eventbrite; to hack with us in SF, please sign up via this Meetup. Please also complete our Google Form survey. Also, if anyone reading this is associated with an organization in New York or Chicago that would like to organize an additional hack space, please contact me.
The PDF Liberation Hackathon is going to be a great opportunity to advance the state of the art when it comes to harvesting data from public documents. I hope you can join us.