The Future of Data Journalism, I hope
I’m really excited about the Lede Program I’ve been working on at the Journalism School at Columbia. We now have a full and wonderful faculty and a full pilot class of 16 brilliant and excited students.
So now that we’ve gotten set up, what are we going to do?
Well, the classes are listed here, but let me say it in a few words: for the first half of the summer, we’re going to teach the students how to use data and build models in context. We’ll teach them to script in Python and use GitHub for their code and homework. We’ll teach them how to use APIs, how to scrape data when there are no APIs, and by the end of the first half of the summer they will know how to build their own API. They will submit projects as IPython notebooks to meet the highest standard of reproducibility and transparency.
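To give a flavor of the scraping skill described above (this is my own illustrative sketch, not the Lede Program’s actual curriculum — the HTML snippet and field values are invented), here is a minimal Python example that pulls rows out of an HTML table using only the standard library, the kind of thing you do when a site offers no API:

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows = []        # completed rows
        self._row = []        # cells of the row being parsed
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

# Invented sample page, standing in for one fetched with urllib.
html = """
<table>
  <tr><td>McDonald's</td><td>440,000</td></tr>
  <tr><td>Wendy's</td><td>44,000</td></tr>
</table>
"""
scraper = TableScraper()
scraper.feed(html)
print(scraper.rows)  # [["McDonald's", '440,000'], ["Wendy's", '44,000']]
```

In a notebook, this parsed list would then be cleaned and documented alongside the source URL, so a reader can rerun the whole pipeline from raw page to published number.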
In the second half of the summer, they will learn more about algorithms, on the one hand, and how deeply to distrust algorithms, on the other. I’ll be teaching them a class invented by Mark Hansen which he called “the Platform”:
This begins with the idea that computing tools are the products of human ingenuity and effort. They are never neutral, and they carry with them the biases of their designers and their design process. “Platform studies” is a new term used to describe investigations into the relationships between computing technologies and the creative or research products they help generate. Understanding how data, code, and algorithms affect creative practices can be an effective first step toward critical thinking about technology. This will not be purely theoretical, however; specific case studies (technologies) and project work will make the ideas concrete.
In order to teach this I’ll need lots of guest lecturers on bias and, in particular, on the politics behind modeling. Emanuel Derman has kindly offered to give one of the first guest lectures. Please suggest more!
Now, it’s easier to criticize than to create, and I don’t want to teach a whole generation of journalists that they should just swear off mathematical modeling altogether. But I do want to make sure they are skeptical and understand the need for robustness and transparency. For that reason I’m also looking for great examples of reproducible data journalism (please provide them!).
For example, this is a great video, but where are the calculations that support it? And what assumptions went into it?
In other words, to make this a truly great video, we would need to be able to scrutinize those calculations and for that matter the data sources and the data. Then we could have a conversation about under what conditions private companies should be allowed to rely on food stamp programs for their workers.
Now I’m not claiming that all journalism is necessarily data journalism. Sometimes we’re simply talking about one person with one set of facts around them, and that’s also hugely important. For example, and in the same vein as the above video, take a look at this Reuters blog post written by Danish McDonald’s worker and activist Louis Marie Rantzau, who earns $21 per hour and has great benefits. Pretty much all you need to know is that she exists.
So here’s what I hope: that we start having conversations that are somewhat more based on evidence, which relies crucially on a separate discussion about what constitutes evidence. I’m hoping that we stop hiding misleading arguments behind opaque calculations and start talking about which assumptions are valid, and why we chose one model or algorithm over another, and how sensitive the conclusions are to different reasonable assumptions. I hope that, as we share our code and try out different approaches, we find ourselves acknowledging certain ground-level truths that we can agree on and then – not that we’ll stop arguing – but we might better understand why we disagree on other things.