I’ve been reading Head First Java this past week and I’m super impressed and want to tell you guys about it if you don’t already know.
I wanted to learn what the big fuss was about object-oriented programming, plus it seems like all the classes my Lede students are planning to take either require python or java, so this seemed like a nice bridge.
But the book is outstanding, with quirky cartoons and a super fun attitude, and I’m on page 213 after less than a week, and yes that’s out of more than 600 pages but what I’m saying is that it’s a thrilling read.
My one complaint is how often the book talks about motivating programmers with women in tight sweaters. And no, I don’t think they were assuming the programmers were lesbians, but I could be wrong and I hope I am. At the beginning they made the point that people remember stuff better when there is emotional attachment to things, so I’m guessing they’re getting me annoyed to help me remember details on reference types.
Here’s another Head First book which my nerd mom recommended to me some time ago, and I bought but haven’t read yet, but now I really plan to: Head First Design Patterns. Because ultimately, programming is just a tool set and you need to learn how to think about constructing stuff with those tools. Exciting!
And by the way, there is a long list of Head First books, and I head good things about the whole series. Honestly I will never write a technical book in the old-fashioned dry way again.
Yesterday was a day filled with secrets and codes. In the morning, at The Platform, we had guest speaker Columbia history professor Matthew Connelly, who came and talked to us about his work with declassified documents. Two big and slightly depressing take-aways for me were the following:
- As records have become digitized, it has gotten easy for people to get rid of archival records in large quantities. Just press delete.
- As records have become digitized, it has become easy to trace the access of records, and in particular the leaks. Connelly explained that, to some extent, Obama’s harsh approach to leakers and whistleblowers might be explained as simply “letting the system work.” Yet another way that technology informs the way we approach human interactions.
After class we had section, in which we discussed the Computer Science classes some of the students are taking next semester (there’s a list here) and then I talked to them about prime numbers and the RSA crypto system.
I got really into it and wrote up an iPython Notebook which could be better but is pretty good, I think, and works out one example completely, encoding and decoding the message “hello”.
Yesterday was the end of the first half of the Lede Program, and the students presented their projects, which were really impressive. I am hoping some of them will be willing to put them up on a WordPress site or something like that in order to showcase them and so I can brag about them more explicitly. Since I didn’t get anyone’s permission yet, let me just say: wow.
During the second half of the program the students will do another project (or continue their first) as homework for my class. We’re going to start planning for that on the first day, so the fact that they’ve all dipped their toes into data projects is great. For example, during presentations yesterday I heard the following a number of times: “I spent most of my time cleaning my data” or “next time I will spend more time thinking about how to drill down in my data to find an interesting story”. These are key phrases for people learning lessons with data.
Since they are journalists (I’ve learned a thing or two about journalists and their mindset in the past few months) they love projects because they love deadlines and they want something they can add to their portfolio. Recently they’ve been learning lots of geocoding stuff, and coming up they’ll be learning lots of algorithms as well. So they’ll be well equipped to do some seriously cool shit for their final project. Yeah!
In addition to the guest lectures I’m having in The Platform, I’ll also be reviewing prerequisites for the classes many of them will be taking in the Computer Science department in the fall, so for example linear algebra, calculus, and basic statistics. I just bought them all a copy of How to Lie with Statistics as well as The Cartoon Guide to Statistics, both of which I adore. I’m also making them aware of Statistics Done Wrong, which is online. I am also considering The Cartoon Guide to Calculus, which I have but I haven’t read yet.
Keep an eye out for some of their amazing projects! I’ll definitely blog about them once they’re up.
My schedule nowadays is to go to the Lede Program classes every morning from 10am until 1pm, then office hours, when I can, from 2-4pm. The students are awesome and are learning a huge amount in a super short time.
So for instance, last time I mentioned we set up iPython notebooks on the cloud, on Amazon EC2 servers. After getting used to the various kinds of data structures in python like integers and strings and lists and dictionaries, and some simple for loops and list comprehensions, we started examining regular expressions and we played around with the old enron emails for things like social security numbers and words that had four or more vowels in a row (turns out that always means you’re really happy as in “woooooohooooooo!!!” or really sad as in “aaaaaaarghghgh”).
Then this week we installed git and started working in an editor and using the command line, which is exciting, and then we imported pandas and started to understand dataframes and series and boolean indexes. At some point we also plotted something in matplotlib. We had a nice discussion about unsupervised learning and how such techniques relate to surveillance.
My overall conclusion so far is that when you have a class of 20 people installing git, everything that can go wrong does (versus if you do it yourself, then just anything that could go wrong might), and also that there really should be a better viz tool than matplotlib. Plus my Lede students are awesome.
I get asked pretty often whether I “believe” in open data. I tend to murmur a response along the lines of “it depends,” which doesn’t seem too satisfying to me or to the person I’m talking about. But this morning, I’m happy to say, I’ve finally come up with a kind of rule, which isn’t universal. It focuses on power.
Namely, I like data that shines light on powerful people. Like the Sunlight Foundation tracks money and politicians, and that’s good. But I tend to want to protect powerless people, like people who are being surveilled with sensors and their phones. And the thing is, most of the open data focuses on the latter. How people ride the subway or how they use the public park or where they shop.
Something in the middle is crime data, where you have compilation of people being stopped by the police (powerless) and the police themselves (powerful). But here as well you’ll notice an asymmetry on identifying information. Looking at Stop and Frisk data, for example, there’s a precinct to identify the police officer, but no badge number, whereas there’s a bunch of identifying information about the person being stopped which is recorded.
A lot of the time you won’t even find data about powerful people. Worker bees get scored but the managers are somehow above scoring. Bloomberg never scored his lieutenants or himself even when he insisted that teachers should be scored. I like to keep an eye on who gets data collected about them. The power is where the data isn’t.
I guess my point is this. Data and data modeling are not magical tools. They are in fact crude tools, and so to focus on them is misleading and distracting from the real show, which is always about power (and/or money). It’s a boondoggle to think about data when we should be thinking about when and how a model is being wielded and who gets to decide.
One of the biggest problem we face is that all this data is being collected and saved now and the models haven’t even been invented yet. That’s why there’s so much urgency in getting reasonable laws in place to protect the powerless.
A few weeks ago I mentioned that I’m the Program Director for the new Lede Program at the Columbia Graduate School of Journalism. I’m super excited to announce that I’ve found amazing faculty for the summer part of the program, including:
- Jonathan Soma, who will be the primary instructor for Basic Computing and for Algorithms
- Dennis Tenen, who will be helping Soma in the first half of the summer with Basic Computing
- Chris Wiggins, who will be helping Soma in the second half of the summer with Algorithms
- An amazing primary instructor for Databases who I will announce soon,
- Matthew Jones, who will help that amazing yet-to-be-announced instructor in Data and Databases
- Three amazing TA’s: Charles Berret, Sophie Chou, and Josh Vekhter (who doesn’t have a website!).
I’m planning to teach The Platform with the help of a bunch of generous guest lecturers (please make suggestions or offer your services!).
Applications are open now, and we’re hoping to get amazing students to enjoy these amazing faculty and the truly innovative plan they have for the summer (and I don’t use the word “innovative” lightly!). We’ve already gotten some super strong applications and made a couple offers of admission.
Also, I was very pleased yesterday to see a blogpost I wrote about the genesis and the goals of the program be published in PBS’s MediaShift.
Finally, it turns out I’m a key influencer, according to The Big Roundtable.
By now most of you have read about the major bug that was found in OpenSSL, an open source security software toolkit. The bug itself is called the Heartbleed Bug, and there’s lots of information about it and how to fix it here. People are super upset about this, and lots of questions remain.
For example, was it intentionally undermined? Has the NSA deliberately inserted weaknesses into this as well? It seems like the jury is out right now, but if I’m the guy who put in the bug, I’m changing my name and going undercover just in case.
Next, how widely was the weakness exploited? If you’re super worried about stuff, or if you are a particular target of attack, the answer is probably “widely.” The frustrating thing is that there’s seemingly no way to measure or test that assumption, since the attackers would leave no trace.
Here’s what I find interesting the most interesting question: what will the long-term reaction be to open source software? People might think that open source code is a bust after this. They will complain that something like this should never have been allowed to happen – that the whole point of open software is that people should be checking this stuff as it comes in – and it never would have happened if there were people getting paid to test the software.
First of all, it did work as intended, even though it took two years instead of two days like people might have wanted. And maybe this shouldn’t have happened like it did, but I suspect that people will learn this particular lesson really well as of now.
But in general terms, bugs are everywhere. Think about Knight Capital’s trading debacle or the ObamaCare website, just two famous recent problems with large-scale coding projects that aren’t open source.
Even when people are paid to fix bugs, they fix the kind of bugs that cause the software to stop a lot sooner than the kind of bug that doesn’t make anything explode, lets people see information they shouldn’t see, and leaves no trace. So for every Knight’s Capital there are tons of other bugs in software that continue to exist.
In other words it’s more a question of who knows about the bugs and who can exploit them. And of course, whether those weaknesses will ever be exposed to the public at all.
It would be great to see the OpenSSL bug story become, over time, a success story. This would mean that, on the one hand the nerds becoming more vigilant in checking vitally important code, and learning to think like assholes, but also the public would need to acknowledge how freaking hard it is to program.