Archive for the ‘open source tools’ Category

SEC Roundtable on credit rating agency models today

I’ve discussed the broken business model that is the credit rating agency system in this country on a few occasions. It directly contributed to the opacity and fraud in the MBS market and to the ensuing financial crisis, for example. And in this post and then this one, I suggest that someone should start an open source version of credit rating agencies. Here’s my explanation:

The system of credit ratings undermines the trust of even the most fervently pro-business entrepreneur out there. The models are knowingly games by both sides, and it’s clearly both corrupt and important. It’s also a bipartisan issue: Republicans and Democrats alike should want transparency when it comes to modeling downgrades- at the very least so they can argue against the results in a factual way. There’s no reason I can see why there shouldn’t be broad support for a rule to force the ratings agencies to make their models publicly available. In other words, this isn’t a political game that would score points for one side or the other.

Well, it wasn’t long before Marc Joffe, who had started an open source credit rating agency, contacted me and came to my Occupy group to explain his plan, which I blogged about here. That was almost a year ago.

Today the SEC is going to have something they’re calling a Credit Ratings Roundtable. This is in response to an amendment that Senator Al Franken put on Dodd-Frank which requires the SEC to examine the credit rating industry. From their webpage description of the event:

The roundtable will consist of three panels:

  • The first panel will discuss the potential creation of a credit rating assignment system for asset-backed securities.
  • The second panel will discuss the effectiveness of the SEC’s current system to encourage unsolicited ratings of asset-backed securities.
  • The third panel will discuss other alternatives to the current issuer-pay business model in which the issuer selects and pays the firm it wants to provide credit ratings for its securities.

Marc is going to be one of something like 9 people in the third panel. He wrote this op-ed piece about his goal for the panel, a key excerpt being the following:

Section 939A of the Dodd-Frank Act requires regulatory agencies to replace references to NRSRO ratings in their regulations with alternative standards of credit-worthiness. I suggest that the output of a certified, open source credit model be included in regulations as a standard of credit-worthiness.

Just to be clear: the current problem is that not only is there wide-spread gaming, but there’s also a near monopoly by the “big three” credit rating agencies, and for whatever reason that monopoly status has been incredibly well protected by the SEC. They don’t grant “NRSRO” status to credit rating agencies unless the given agency can produce something like 10 letters from clients who will vouch for them providing credit ratings for at least 3 years. You can see why this is a hard business to break into.

The Roundtable was covered yesterday in the Wall Street Journal as well: Ratings Firms Steer Clear of an Overhaul - an unfortunate title if you are trying to be optimistic about the event today. From the WSJ article:

Mr. Franken’s amendment requires the SEC to create a board that would assign a rating firm to evaluate structured-finance deals or come up with another option to eliminate conflicts.

While lawsuits filed against S&P in February by the U.S. government and more than a dozen states refocused unflattering attention on the bond-rating industry, efforts to upend its reliance on issuers have languished, partly because of a lack of consensus on what to do.

I’m just kind of amazed that, given how dirty and obviously broken this industry is, we can’t do better than this. SEC, please start doing your job. How could allowing an open-source credit rating agency hurt our country? How could it make things worse?

E-discovery and the public interest (part 2)

Yesterday I wrote this short post about my concerns about the emerging field of e-discovery. As usual the comments were amazing and informative. By the end of the day yesterday I realized I needed to make a much more nuanced point here.

Namely, I see a tacit choice being made, probably by judges or court-appointed “experts”, on how machine learning is used in discovery, and I think that the field could get better or worse. I think we need to urgently discuss this matter, before we wander into a crazy place.

And to be sure, the current discovery process is fraught with opacity and human judgment, so complaining about those features being present in a machine learning version of discovery is unreasonable – the question is whether it’s better or worse than the current system.

Making it worse: private code, opacity

The way I see it, if we allow private companies to build black box machines that we can’t peer into, nor keep track of as they change versions, then we’ll never know why a given set of documents was deemed “relevant” in a given case. We can’t, for example, check to see if the code was modified to be more friendly to a given side.

Besides the healthy response to this new revenue source of competition for clients, the resulting feedback loop will likely be a negative one, whereby private companies use the cheapest version they can get away with to achieve the best results (for their clients) that they can argue for.

Making it better: open source code, reproducibility

What we should be striving for is to use only open source software, saved in a repository so we can document exactly what happened with a given corpus and a given version of the tools. It will still be an industry to clean the data and feed in the documents, train the algorithm (whilst documenting how that works), and interpreting the results. Data scientists will still get paid.

In other words, instead of asking for interpretability, which is a huge ask considering the massive scale of the work being done, we should, at the very least, be able to ask for reproducibility of the e-discovery, as well as transparency in the code itself.

Why reproducibility? Then we can go back in time, or rather scholars can, and test how things might have changed if a different version of the code were used, for example. This could create a feedback loop crucial to improve the code itself over time, and to improve best practices for using that code.

E-discovery and the public interest

Today I want to bring up a few observations and concerns I have about the emergence of a new field in machine learning called e-discovery. It’s the algorithmic version of discovery, so I’ll start there.

Discovery is part of the process in a lawsuit where relevant documents are selected, pored over, and then handed to the other side. Nowadays, of course, there are more and more documents, almost all electronic, typically including lots of e-mails.

If you’re talking about a big lawsuit, there could be literally millions of documents to wade through, and that takes a lot of time for humans to do, and it can be incredibly expensive and time-consuming. Enter the algorithm.

With advances in Natural Language Processing (NLP), a machine algorithm can sort emails or documents by topic (after getting the documents into machine-readable form, cleaning, and deduping) and can in general do a pretty good job of figuring out whether a given email is “relevant” to the case.

And this is already happening – the Wall Street Journal recently reported that the Justice Department allowed e-discovery for a case involving the merger of two beer companies. From the article:

With the blessing of the Justice Department’s antitrust division, the lawyers loaded the documents into a program and manually reviewed a batch to train the software to recognize relevant documents. The manual review was repeated until the Justice Department and Constellation were satisfied that the program could accurately predict relevance in the rest of the documents. Lawyers for Constellation and Crown Imports used software developed by kCura Corp., which lists the Justice Department as a client.

In the end, Constellation and Crown Imports turned over hundreds of thousands of documents to antitrust investigators.

Here are some of my questions/ concerns:

  • These algorithms are typically not open source – companies like kCura make good money doing these jobs.
  • That means that they could be wrong, possibly in subtle ways.
  • Or maybe not so subtle ways: maybe they’ve been trained to find documents that are both “relevant” and “positive” for a given side.
  • In any case, the laws of this country will increasingly depend on a black box algorithm that is no accessible to the average citizen.
  • Is that in the public’s interest?
  • Is that even constitutional?

Interview with Chris Wiggins: don’t send me another $^%& shortcut alias!

When I first met Chris Wiggins of Columbia and hackNY back in 2011, he immediately introduced me to about a hundred other people, which made it obvious that his introductions were highly stereotyped. I thought he was some kind of robot, especially when I started getting emails from his phone which all had the same (long) phrases in them, like “I’m away from my keyboard right now, but when I get back to my desk I’ll calendar prune and send you some free times.”

Finally I was like “what the hell, are you sending me whole auto-generated emails”? To which he replied “of course.”

Chris posted the code to his introduction script last week so now I have proof that some of my favorite emails I thought were from him back in 2011 were actually from tcsh.

Feeling cheated, I called him to tell him he has an addiction to shell scripting. Here’s a brief interview, rewritten to make me sound smarter and cooler than I am.


CO: Ok, let’s start with these iphone shortcuts. Sometimes the whole email from you reads like a bunch of shortcuts.

CW: Yup, lots of times.

CO: What the hell? Don’t you want to personalize things for me at least a little?

CW: I do! But I also want to catch the subway.

CO: Ugh. How many shortcuts do you have on that thing?

CW: Well.. (pause)..38.

CO: Ok now I’m officially worried about you. What’s the longest one?

CW: Probably this one I wrote for Sandy: If I write “sandy” it unpacks to

“Sorry for delay and brevity in reply. Sandy knocked out my phone, power, water, and internet so I won’t be replying as quickly as usual. Please do not hesitate to email me again if I don’t reply soon.”

CO: You officially have a problem. What’s the shortest one?

CW: Well, when I type “plu” it becomes “+1”

CO: Ok, let me apply the math for you: your shortcut is longer than your longcut.

CW: I know but not if you include switching from letters to numbers on the iphone, which is annoying.

CO: How did you first become addicted to shortcuts?

CW: I got introduced to UNIX in the 80s and, in my frame of reference at the time, the closest I had come to meeting a wizard was the university’s sysadmin. I was constantly breaking things by chomping cpu with undead processes or removing my $HOME or something, and he had to come in and fix things. I learned a lot over his shoulder. In the summer before I started college, my dream was to be a university sysadmin. He had to explain to me patiently that I shouldn’t spend college in a computercave.

CO: Good advice, but now that you’re a grownup you can do that.

CW: Exactly. Anyway, everytime he would fix whatever incredible mess I had made he would sign off with some different flair and walk out, like he was dropping the mic and walking off stage. He never signed out “logout” it was always “die” or “leave” or “ciao” (I didn’t know that word at the time). So of course by the time he got back to his desk one day there was an email from me asking how to do this and he replied:

“RTFM. alias

CO: That seems like kind of a mean thing to do to you at such a young age.

CW: It’s true. UNIX alias was clearly the gateway drug that led me to writing shell scripts for everything.

CO: How many aliases do you have now?

CW: According to “alias | wc -l “, I have 1137. So far.

CO: So you’ve spent countless hours making aliases to save time.

CW: Yes! And shell scripts!

CO: Ok let’s talk about this script for introducing me to people. As you know I don’t like getting treated like a small cog. I’m kind of a big deal.

CW: Yes, you’ve mentioned that.

CO: So how does it work?

CW: I have separate biography files for everyone, and a file called nfile.asc that has first name, lastname@tag, and email address. Then I can introduce people via

% ii oneil@mathbabe schutt

It strips out the @mathbabe part (so I can keep track of multiple people named oneil) from the actual email, reads in and reformats the biographies, grepping out the commented lines, and writes an email I can pipe to mutt. The whole thing can be done in a few seconds.

CO: Ok that does sound pretty good. How many shell scripts do you have?

CW: Hundreds. A few of them are in my public mise-en-place repository, which I should update more. I’m not sure which of them I really use all the time, but it’s pretty rare I type an actual legal UNIX command at the command line. That said I try never to leave the command line. Students are always teaching me fancypants tricks for their browsers or some new app, but I spend a lot of time at the command line getting and munging data, and for that, sed, awk, and grep are here to stay.

CO: That’s kinda sad and yet… so true. Ok here’s the only question I really wanted to ask though: will you promise me you’ll never send me any more auto-generated emails?

CW: no.

Categories: news, open source tools

Guest post by Julia Evans: How I got a data science job

This is a guest post by Julia Evans. Julia is a data scientist & programmer who lives in Montréal. She spends her free time these days playing with data and running events for women who program or want to — she just started a Montréal chapter of pyladies to teach programming, and co-organize a monthly meetup called Montréal All-Girl Hack Night for women who are developers.

asked mathbabe a question a few weeks ago saying that I’d recently started a data science job without having too much experience with statistics, and she asked me to write something about how I got the job. Needless to say I’m pretty honoured to be a guest blogger here :) Hopefully this will help someone!

Last March I decided that I wanted a job playing with data, since I’d been playing with datasets in my spare time for a while and I really liked it. I had a BSc in pure math, a MSc in theoretical computer science and about 6 months of work experience as a programmer developing websites. I’d taken one machine learning class and zero statistics classes.

In October, I left my web development job with some savings and no immediate plans to find a new job. I was thinking about doing freelance web development. Two weeks later, someone posted a job posting to my department mailing list looking for a “Junior Data Scientist”. I wrote back and said basically “I have a really strong math background and am a pretty good programmer”. This email included, embarrassingly, the sentence “I am amazing at math”. They said they’d like to interview me.

The interview was a lunch meeting. I found out that the company (Via Science) was opening a new office in my city, and was looking for people to be the first employees at the new office. They work with clients to make predictions based on their data.

My interviewer (now my manager) asked me about my role at my previous job (a little bit of everything — programming, system administration, etc.), my math background (lots of pure math, but no stats), and my experience with machine learning (one class, and drawing some graphs for fun). I was asked how I’d approach a digit recognition problem and I said “well, I’d see what people do to solve problems like that, and I’d try that”.

I also talked about some data visualizations I’d worked on for fun. They were looking for someone who could take on new datasets and be independent and proactive about creating model, figuring out what is the most useful thing to model, and getting more information from clients.

I got a call back about a week after the lunch interview saying that they’d like to hire me. We talked a bit more about the work culture, starting dates, and salary, and then I accepted the offer.

So far I’ve been working here for about four months. I work with a machine learning system developed inside the company (there’s a paper about it here). I’ve spent most of my time working on code to interface with this system and make it easier for us to get results out of it quickly. I alternate between working on this system (using Java) and using Python (with the fabulous IPython Notebook) to quickly draw graphs and make models with scikit-learn to compare our results.

I like that I have real-world data (sometimes, lots of it!) where there’s not always a clear question or direction to go in. I get to spend time figuring out the relevant features of the data or what kinds of things we should be trying to model. I’m beginning to understand what people say about data-wrangling taking up most of their time. I’m learning some statistics, and we have a weekly Friday seminar series where we take turns talking about something we’ve learned in the last few weeks or introducing a piece of math that we want to use.

Overall I’m really happy to have a job where I get data and have to figure out what direction to take it in, and I’m learning a lot.

Nerd Nite: A Drunken Venue for Ideas

March 26, 2013 2 comments

MathBabe recently wrote an article critical of the elitist nature of Ted Talks, which you can read here. Fortunately for her, and for the hoi polloi everywhere clamoring for populist science edutainment, there is an alternative: Nerd Nite.  Once a month, in cities all over the globe, nerds herd into a local bar and turn it into a low-brow forum for innovative science ideas. Think Ted Talks on tequila.

Each month, three speakers present talks for 20-30 minutes, followed by questions and answers from the invariably sold-out audience. The monthly forum gives professional and amateur scientists an opportunity to explain their fairly abstruse specialties accessibly to a lay audience – a valuable skill. Since the emphasis is on science entertainment, it also gives the speakers a chance to present their ideas in a more engaging way: in iambic pentameter, in drag with a tuba, in three-part harmony, or via interpretive dance – an invaluable skill. The resulting atmosphere is informal, delightfully debauched, and refreshingly pro-science.

Slaking our thirst for both science education and mojitos, Nerd Nite started small but quickly went viral. Nerd Nites are now being held in 50 cities, from San Francisco to Kansas City and Auckland to Liberia. You can find the full listing of cities here; if you don’t see one near you, start one!

Last Wednesday night I was twitterpated to be one of three guest nerds sharing the stage at San Francisco’s Nerd Nite. I put the chic back into geek with a biology talk entitled “Genital Plugs, Projectile Penises, and Gay Butterflies: A Naturalist Explains the Birds and the Bees.”


A video recording of the presentation will be available online soon, but in the meantime, here’s a tantalizing clip from the talk, in which Isabella Rossellini explains the mating habits of the bee. Warning: this is scientifically sexy.

I shared the stage with Chris Anderson, who gave a fascinating talk on how the DIY community is building drones out of legos and open-source software. These DIY drones fly below government regulation and can be used for non-military applications, something we hear far too little of in the daily war digest that passes for news. The other speaker was Mark Rosin of the UK-based Guerrilla Science project. This clever organization reaches out to audiences at non-science venues, such as music concerts, and conducts entertaining presentations that teach core science ideas.  As part of his presentation Mark used 250 inflated balloons and a bass amp to demonstrate the physics concept of resonance.

If your curiosity has been piqued and you’d like to check out an upcoming Nerd Nite, consider attending the upcoming Nerdtacular, the first Nerd Nite Global Festival, to be held this August 16-18th in Brooklyn, New York.

The global Nerdtacular: Now that’s an idea worth spreading.



March 25, 2013 1 comment


This Friday, I’ll be participating at HackPrinceton.

My team will be training an EEG to recognize yes and no thoughts for particular electromechanical devices and creating general human brain interface (HBI) architecture.

We’ll be working on allowing you to turn on your phone and navigate various menus with your mind!

There’s lots of cool swag and prizes – the best being jobs at Google and Microsoft. Everyone on the team has experience in the field,* but of course the more the merrier and you’re welcome no matter what you bring (or don’t bring!) to the table.

If you’re interested, email ASAP!

*So far we’ve got a math Ph.D., a mech engineer, some CS/Operations Research guys and while my field is finance I picked up some neuro/machine learning along the way. If you have nothing to do for the next three days and want to learn something specifically for this competition, I recommend checking out my personal favorites:, or

Team Turnstile: how do NYC neighborhoods recover from extreme weather events?

I wanted to give you the low-down on a data hackathon I participated in this weekend, which was sponsored by the NYU Institute for Public Knowledge on the topic of climate change and social information. We were assigned teams and given a very broad mandate. We had only 24 hours to do the work, so it had to be simple.

Our team consisted of Venky Kannan, Tom Levine, Eric Schles, Aaron Schumacher, Laura Noren, Stephen Fybish, and me.

We decided to think about the effects of super storms on different neighborhoods. In particular, to measure the recovery time of the subway ridership in various neighborhoods using census information. Our project was inspired by this “nofarehikes” map of New York which tries to measure the impact of a fare hike on the different parts of New York. Here’s a copy of our final slides.

Also, it’s not directly related to climate change, but rather rests on the assumption that with climate change comes more frequent extreme weather events, which seems to be an existing myth (please tell me if the evidence is or isn’t there for that myth).

We used three data sets: subway ridership by turnstile, which only exists since May 2010, the census of 2010 (which is kind of out of date but things don’t change that quickly) and daily weather observations from NOAA.

Using the weather map and relying on some formal definitions while making up some others, we came up with a timeline of extreme weather events:

Screen Shot 2013-03-11 at 6.50.04 AM

Then we looked at subway daily ridership to see the effect of the storms or the recovery from the storms:

Screen Shot 2013-03-11 at 6.50.19 AMWe broke it down to individual stations. Here’s a closeup around Sandy:

Screen Shot 2013-03-11 at 6.51.05 AM

Then we used the census tracts to understand wealth in New York:

Screen Shot 2013-03-11 at 6.51.50 AMAnd of course we had to know which subway stations were in which census tracts. This isn’t perfect because we didn’t have time to assign “empty” census tracts to some nearby subway station. There are on the order of 2,000 census tracts but only on the order of 800 subway stations. But again, 24 hours isn’t alot of time, even to build clustering algorithms.

Finally, we attempted to put the data together to measure which neighborhoods have longer-than-expected recovery times after extreme weather events. This is our picture:

Screen Shot 2013-03-11 at 6.51.59 AM

Interestingly, it looks like the neighborhoods of Manhattan are most impacted by severe weather events, which is not in line with our prior [Update: I don't think we actually computed the impact on a given resident, but rather just the overall change in rate of ridership versus normal. An impact analysis would take into account the relative wealth of the neighborhoods and would probably look very different].

There are tons of caveats, I’ll mention only a few here:

  • We didn’t have time to measure the extent to which the recovery time took longer because the subway stopped versus other reasons people might not sure the subway. But our data is good enough to do this.
  • Our data might have been overwhelmingly biased by Sandy. We’d really like to do this with much longer-term data, but the granular subway ridership data has not been available for long. But the good news is we can do this from now on.
  • We didn’t have bus data at the same level, which is a huge part of whether someone can get to work, especially in the outer boroughs. This would have been great and would have given us a clearer picture.
  • When someone can’t get to work, do they take a car service? How much does that cost? We’d love to have gotten our hands on the alternative ways people got to work and how that would impact them.
  • In general we’d have like to measure the impact relative to their median salary.
  • We would also have loved to have measured the extent to which each neighborhood consisted of salary versus hourly wage earners to further understand how a loss of transportation would translate into an impact on income.

Unintended Consequences of Journal Ranking

I just read this paper, written by Björn Brembs and Marcus Munafò and entitled “Deep Impact: Unintended consequences of journal rank”. It was recently posted on the Computer Science arXiv (h/t Jordan Ellenberg).

I’ll give you a rundown on what it says, but first I want to applaud the fact that it was written in the first place. We need more studies like this, which examine the feedback loop of modeling at a societal level. Indeed this should be an emerging scientific or statistical field of study in its own right, considering how many models are being set up and deployed on the general public.

Here’s the abstract:

Much has been said about the increasing bureaucracy in science, stifling innovation, hampering the creativity of researchers and incentivizing misconduct, even outright fraud. Many anecdotes have been recounted, observations described and conclusions drawn about the negative impact of impact assessment on scientists and science. However, few of these accounts have drawn their conclusions from data, and those that have typically relied on a few studies. In this review, we present the most recent and pertinent data on the consequences that our current scholarly communication system has had on various measures of scientific quality (such as utility/citations, methodological soundness, expert ratings and retractions). These data confirm previous suspicions: using journal rank as an assessment tool is bad scientific practice. Moreover, the data lead us to argue that any journal rank (not only the currently-favored Impact Factor) would have this negative impact. Therefore, we suggest that abandoning journals altogether, in favor of a library-based scholarly communication system, will ultimately be necessary. This new system will use modern information technology to vastly improve the filter, sort and discovery function of the current journal system.

The key points in the paper are as follows:

  • There’s a growing importance of science and trust in science
  • There’s also a growing rate (x20 from 2000 to 2010) of retractions, with scientific misconduct cases growing even faster to become the majority of retractions (to an overall rate of 0.02% of published papers)
  • There’s a larger and growing “publication bias” problem – in other words, an increasing unreliability of published findings
  • One problem: initial “strong effects” get published in high-ranking journal, but subsequent “weak results” (which are probably more reasonable) are published in low-ranking journals
  • The formal “Impact Factor” (IF) metric for rank is highly correlated to “journal rank”, defined below.
  • There’s a higher incidence of retraction in high-ranking (measured through “high IF”) journals.
  • “A meta-analysis of genetic association studies provides evidence that the extent to which a study over-estimates the likely true effect size is positively correlated with the IF of the journal in which it is published”
  • Can the higher retraction error in high-rank journal be explained by higher visibility of those journals? They think not. Journal rank is bad predictor for future citations for example. [mathbabe inserts her opinion: this part needs more argument.]
  • “…only the most highly selective journals such as Nature and Science come out ahead over unselective preprint repositories such as ArXiv and RePEc”
  • Are there other measures of excellence that would correlate with IF? Methodological soundness? Reproducibility? No: “In fact, the level of reproducibility was so low that no relationship between journal rank and reproducibility could be detected.
  • More about Impact Factor: The IF is a metric for the number of citations to articles in a journal (the numerator), normalized by the number of articles in that journal (the denominator). Sounds good! But:
  • For a given journal, IF is not calculated but is negotiated – the publisher can (and does) exclude certain articles (but not citations). Even retroactively!
  • The IF is also not reproducible – errors are found and left unexplained.
  • Finally, IF is likely skewed by the fat-tailedness of citations (certain articles get lots, most get few). Wouldn’t a more robust measure be given by the median?


  1. Journal rank is a weak to moderate predictor of scientific impact
  2. Journal rank is a moderate to strong predictor of both intentional and unintentional scientific unreliability
  3. Journal rank is expensive, delays science and frustrates researchers
  4. Journal rank as established by IF violates even the most basic scientific standards, but predicts subjective judgments of journal quality

Long-term Consequences

  • “IF generates an illusion of exclusivity and prestige based on an assumption that it will predict subsequent impact, which is not supported by empirical data.”
  • “Systemic pressures on the author, rather than increased scrutiny on the part of the reader, inflate the unreliability of much scientific research. Without reform of our publication system, the incentives associated with increased pressure to publish in high-ranking journals will continue to encourage scientiststo be less cautious in their conclusions (or worse), in an attempt to market their research to the top journals.”
  • “It is conceivable that, for the last few decades, research institutions world-wide may have been hiring and promoting scientists who excel at marketing their work to top journals, but who are not necessarily equally good at conducting their research. Conversely, these institutions may have purged excellent scientists from their ranks, whose marketing skills did not meet institutional requirements. If this interpretation of the data is correct, we now have a generation of excellent marketers (possibly, but not necessarily also excellent scientists) as the leading figures of the scientific enterprise, constituting another potentially major contributing factor to the rise in retractions. This generation is now in charge of training the next generation of scientists, with all the foreseeable consequences for the reliability of scientific publications in the future.

The authors suggest that we need a new kind of publishing platform. I wonder what they’d think of the Episciences Project.

NYC data hackathons, past and future: Politics, Occupy, and Climate change (#OWS)

The past: Money in politics

First thing’s first, I went to the Bicoastal Datafest a few weekends ago and haven’t reported back. Mostly that’s because I got sick and didn’t go on the second day, but luckily other people did, like Kathy Kiely from the Sunlight Foundation, who wrote up this description of the event and the winning teams’ projects.

And hey, it turns out that my new company shares an office with Harmony Institute, whose data scientist Burton DeWilde was on the team that won “Best in Show” for their orchestral version of the federal government’s budget.

Another writeup of the event comes by way of Michael Lawson, who worked on the team that set up an accounting fraud detection system through Benford’s Law. I might be getting a guest blog post about this project through another one of its team members soon.

And we got some good progress on our DataKind/ Sunlight Foundation money-in-politics project as well, thanks to DataKind intern Pete Darche and math nerds Kevin Wilson and Johan de Jong.

The future one week from now: Occupy

Next up, on March 1st and 2nd at CUNY Graduate Center is this data hackathon called OccupyData (note this is a Friday and Saturday, which is unusual). You can register for the event here.

It’s a combination of an Occupy event and a datafest, so obviously I am going to try to go. The theme is general – data for the 99% – but there’s a discussion on this listserv as to the various topics people might want to focus on (Aaron Swartz and Occupy Sandy are coming up for example). I’m looking forward to reporting back (or reporting other people’s report-backs if my kids don’t let me go).

The future two weeks from now: Climate change

Finally, there’s this datathon, which doesn’t look open to registration, but which I’ll be participating in through my work. It’s stated goal is “to explore how social and meteorological data can be combined to enhance social science research on climate change and cities.”  The datathon will run Saturday March 9th – Sunday March 10th, 2013, starting noon Saturday, with final presentations at noon Sunday. I’ll try to report back on that as well.

The Sandy Hook Project

I wanted to share with you guys a project I’ve been involved with started by John Spens of Thoughtworks regarding data collection and open analysis around guns and gun-related violence. John lives in Connecticut and has friends who were directly affected by the massacre in Newtown. Here is John’s description of the project:

I initiated the Sandy Hook Project in response to this need for information.  The purpose of this project is to produce rigorous and transparent analyses of data pertaining to gun-related violence.  My goal is to move beyond the rhetoric and emotion and produce (hopefully) objective insight  into the relationship between guns and violence in the US.  I realize that objectivity will be challenging, which is why I want to share the methods and the data openly so others can validate or refute my findings as well as contribute their own.

I’ve put the project on GitHub. (  While it’s not designed as a data repository, I think the ubiquitous nature of GitHub and the control enforced through the code repository model will support effective collaboration.

John has written a handful of posts about statistics and guns, including A Brief Analysis of Firearm-related Homicide Rates and Investigating Statistics Regarding Right to Carry Laws.

In addition to looking into the statistics that exist, John wants to address the conversation itself. As he said in his most recent post:

What limited data and analysis that exists is often misrepresented and abused, and is given much less attention than anecdotal evidence.  It is relatively simple to produce a handful of cases that support either side in this debate.  What we really need is to understand the true impact of guns on our society.  Push back by the NRA that any such research would be “political opinion masquerading as medical science.” is unacceptable.  We can only make intelligent decisions when we have the fullest picture possible.

John is looking for nerd collaborators who can help him with data collection and open analysis. He’s also hoping to have a weekend datafest to work on this project in March, so stay tuned if you want to be part of that!

Google search is already open source

I’ve been talking a lot recently, with various people and on this blog, about data and model privacy. It seems like individuals, who should have the right to protect their data, don’t seem to, but huge private companies, with enormous powers over the public, do.

Another example: models working on behalf of the public, like Fed stress tests and other regulatory models, seem essentially publicly known, which is useful indeed to the financial insiders, the very people who are experts on gaming systems.

Google search has a deeply felt power over the public, and arguably needs to be understood for the consistent threat it poses to people’s online environment. It’s a scary thought experiment to imagine what could be done with it, and after all why should we blindly trust a corporation to have our best intentions in mind? Maybe it’s time to call for the Google search model to be open source.

But what would that look like? At first blush we might imagine forcing them to actually opening up their source code. But at this point that code must be absolutely enormous, unreadable, and written specifically for their uniquely massive machine set-up. In other words, totally overwhelming and useless (as my friend Suresh might say, the singularity has already happened and this is what it looks like (update: Suresh credits Cosma)).

Considering how many people would actually be able to make sense of the underlying code base, then you quickly realize that opening it up would be meaningless for the task of protecting the public. Instead, we’d want to make the code accessible in some way.

But I claim that’s exactly what Google does, by allowing everyone to search using the model from anywhere. In other words, it’s on us, the public, to run experiments to undertand what the underlying model actually does. We have the tools, let’s get going.

If we think there’s inherent racism in google searches, then we should run experiments like Nathan Newman recently did, examining the different ads that pop up when someone writes an email about buying a car, for example, with different names and in different zip codes. We should organize to change our zip codes, our personas (which would mean deliberately creating personas and gmail logins, etc.), and our search terms, and see how the Google search results change as our inputs change.

After all, I don’t know what’s in the code base but I’m pretty sure there’s no sub-routine that’s called “add_racism_to_search”; instead, it’s a complicated Rube-Goldberg machine that should be judged by its outputs, in a statistical way, rather than expected to prescriptively describe how it treats things on a case-by-case basis.

Another thing: I don’t think there are bad intentions on the part of the modelers, but that doesn’t mean there aren’t bad consequences – the model is too complicated for anyone to anticipate exactly how it acts unless they perform experiments to test them. In the meantime, until people undertand that, we need to distinguish between the intentions and the results. So, for example, in the update to Nathan Newman’s experiments with Google mail, Google responded with this:

This post relies on flawed methodology to draw a wildly inaccurate conclusion. If Mr. Newman had contacted us before publishing it, we would have told him the facts: we do not select ads based on sensitive information, including ethnic inferences from names.

And then Newman added this:

Now, I’m happy to hear Google doesn’t “select ads” on this basis, but Google’s words seem chosen to allow a lot of wiggle room (as such Google statements usually seem to). Do they mean that Google algorithms do not use the ethnicity of names in ad selection or are they making the broader claim that they bar advertisers from serving up different ads to people with different names?

My point is that it doesn’t matter what Google says it does or doesn’t do, if statistically speaking the ads change depending on ethnicity. It’s a moot argument what they claim to do if what actually happens, the actual output of their Rube-Goldberg machine, is racist.

And I’m not saying Google’s models are definitively racist, by the way, since Newman’s efforts were small, the efforts of one man, and there were not thousands and thousands of tests but only a few. But his approach to understanding the model was certainly correct, and it’s a cause that technologists and activists should take up and expand on.

Mathematically speaking, it’s already as open-source as we need it to be to understand it, although in a dual way than people are used to thinking about. Actually, it defines the gold standard of open-source: instead of getting a bunch of gobbly-gook that we can’t process and that depends on enormously large data that changes over time, we get real-time access to the newest version that even a child can use.

I only wish that other public-facing models had such access. Let’s create a large-scale project like SETI to understand the Google search model.

R is mostly like python but sometimes like SQL

I’m learning a bit of R in my current stint at ThoughtWorks. Coming from python, I was happy to see most of the plotting functions are very similar, as well as many of the vector-level data handling functions. Besides the fact that lists start at 1 instead of 0, things were looking pretty familiar.

But then I came across something that totally changed my mind. In R they have these data frames, which are like massive excel spreadsheets: very structured matrices with named columns and rows, on which you can perform parallelized operations.

One thing I noticed right away about these rigid data structures is that they make handling missing data very easy. So if you have a huge data frame where a few rows are missing a few data points, then one command, na.omit, gets rid of your problem. Sometimes you don’t even need that, you can just perform your operation on your NA’s, and you just get back more NA’s where appropriate.

This ease-of-use for crappy data is good and bad: good because it’s convenient, bad because you never feel the pain of missing data. When I use python, I rely on dictionaries of dictionaries (of dictionaries) to store my data, and I have to make specific plans for missing data, which means it’s a pain but also that I have to face up to bad data directly.

But that’s not why I think R is somewhat like SQL. It’s really because of how bad “for” loops are in R.

So I was trying to add a new column to my rather large (~65,000 row) dataframe. Adding a column is very easy indeed, if the value in the new column is a simple function of the values in the current columns, because of the way you can parallelize operations. So if the new value is the square of the first column value plus the second column value, it can do it on the whole columns all at once and it’s super fast.

In my case, though, the new value required a look-up in the table itself, which may or may not work, and then required a decision depending on whether it worked. For the life of me I couldn’t figure out how to do it using iterated “apply” or “lapply” functions in the existing dataframe. Of course it’s easy to do using a “for” loop, but that is excruciatingly slow.

Finally I realized I needed to think like a SQL programmer, and build a new dataframe which consisted of the look-up row, if it existed, along with a unique identifier in common with the row I start with. Then I merged the two dataframes, which is like a SQL join, using that unique identifier as the pivot. This would never happen in python with a dataset of this size, because dictionaries are very unstructured and fast.

Easy peasy lemon squeazy, once you understand it, but it made me realize that the approach to learning a new language by translating each word really doesn’t work. You need to think like a Parisian to really speak French.

Categories: open source tools

Quantifying the pull of poverty traps

In yesterday’s New York Times Science section, there was an article called “Life in the Red” (hat tip Becky Jaffe) about people’s behavior when they are in debt, summed up by this:

The usual explanations for reckless borrowing focus on people’s character, or social norms that promote free spending and instant gratification. But recent research has shown that scarcity by itself is enough to cause this kind of financial self-sabotage.

“When we put people in situations of scarcity in experiments, they get into poverty traps,” said Eldar Shafir, a professor of psychology and public affairs at Princeton. “They borrow at high interest rates that hurt them, in ways they knew to avoid when there was less scarcity.”

The psychological burden of debt not only saps intellectual resources, it also reinforces the reckless behavior, and quickly, Dr. Shafir and other experts said. Millions of Americans have been keeping the lights on through hard times with borrowed money, running a kind of shell game to keep bill collectors away.

So what we’ve got here is a feedback loop of poverty, which certainly jives with my observations of friends and acquaintances I’ve seen who are in debt.

I’m guessing the experiments described in the article are not as bad as real life, however.

I say that because I’ve been talking on this blog as well as in my recent math talks about a separate feedback loop involving models, namely the feedback loop whereby people who are judged poor by the model are offered increasingly bad terms on their loans. I call it the death spiral of modeling.

If you think about how these two effects work together – the array of offers gets worse as your vulnerability to bad deals increases – then you start to understand what half of our country is actually living through on a day-to-day basis.

As an aside, I have an enormous amount of empathy for people experiencing this poverty trap. I don’t think it’s a moral issue to be in debt: nobody wants to be poor, and nobody plans it that way.

This opinion article (hat tip Laura Strausfeld), also in yesterday’s New York Times, makes the important point that listening to a bunch of rich, judgmental people like David Bach, Dave Ramsey, and Suze Orman telling us it’s our fault we haven’t finished saving for retirement isn’t actually useful, and suggest we individually choose a money issue to take charge and sort out.

So my empathetic nerd take on poverty traps is this: how can we quantitatively measure this phenomenon, or more precisely these phenomena, since we’ve identified at least two feedback loops?

One reason it’s hard is that it’d be hard to perform natural tests where some people are submitted to the toxic environment but other people aren’t – it’s the “people who aren’t” category that’s the hard part, of course.

For the vulnerability to bad terms, the article describes the level of harassment that people receive from bill collectors as a factor in how they react, which doesn’t surprise anyone who’s ever dealt with a bill collector. Are there certain people who don’t get harassed for whatever reason, and do they fall prey to bad deals at a different rate? Are there local laws in some places prohibiting certain harassment? Can we go to another country where the bill collectors are reined in and see how people in debt behave there?

Also, in terms of availability of loans, it might be relatively easy to start out with people who live in states with payday loans versus people who don’t, and see how much faster the poverty spiral overtakes people with worse options. Of course, as crappy loans get more and more available online, this proximity study will become moot.

It’s also going to be tricky to tease out the two effects from each other. One is a question of supply and the other is a question of demand, and as we know those two are related.

I’m not answering these questions today, it’s a long-term project that I need your help on, so please comment below with ideas. Maybe if we have a few good ideas and if we find some data we can plan a data hackathon.

Should the U.S. News & World Reports college ranking model be open source?

I had a great time giving my “Weapons of Math Destruction” talk in San Diego, and the audience was fantastic and thoughtful.

One question that someone asked was whether the US News & World Reports college ranking model should be forced to be open sourced – wouldn’t that just cause colleges to game the model?

First of all, colleges are already widely gaming the model and have been for some time. And that gaming is a distraction and has been heading colleges in directions away from good instruction, which is a shame.

And if you suggest that they change the model all the time to prevent this, then you’ve got an internal model of this model that needs adjustment. They might be tinkering at the edges but overall it’s quite clear what’s going into the model: namely, graduation rates, SAT scores, number of Ph.D’s on staff, and so on. The exact percentages change over time but not by much.

The impact that this model has had on education and how universities apportion resources has been profound. Academic papers have been written on the law school version of this story.

Moreover, the tactics that US News & World Reports uses to enforce their dominance of the market are bullying, as you can learn from the President of Reed College, which refuses to be involved.

Back to the question. Just as I realize that opening up all data is not reasonable or desirable, because first of all there are serious privacy issues but second of all certain groups have natural advantages to openly shared resources, it is also true that opening up all models is similarly problematic.

However, certain data should surely be open: for example, the laws of our country, that we are all responsible to know, should be freely available to us (something that Aaron Swartz understood and worked towards). How can we be held responsible for laws we can’t read?

Similarly, public-facing models, such as credit scoring models and teacher value-added models, should absolutely be open and accessible to the public. If I’m being judged and measured and held accountable by some model in my daily life as a citizen, that has real impact on how my future will unfold, then I should know how that process works.

And if you complain about the potential gaming of those public-facing models, I’d answer: if they are gameable then they shouldn’t be used, considering the impact they have on so many people’s lives. Because a gameable model is a weak model, with proxies that fail.

Another way to say this is we should want someone to “game” the credit score model if it means they pay their bills on time every month (I wrote about this here).

Back to the US News & World Report model. Is it public facing? I’m no lawyer but I think a case can be made that it is, and that the public’s trust in this model makes it a very important model indeed. Evidence can be gathered by measuring  the extent to which colleges game the model, which they only do because the public cares so much about the rankings.

Even so, what difference would that make, to open it up?

In an ideal world, where the public is somewhat savvy about what models can and cannot do, opening up the US News & World Reports college ranking model would result in people losing faith in it. They’d realize that it’s no more valuable than an opinion from a highly vocal uncle of theirs who is obsessed with certain metrics and blind to individual eccentricities and curriculums that may be a perfect match for a non-conformist student. It’s only one opinion among many, and not to be religiously believed.

But this isn’t an ideal world, and we have a lot of work to do to get people to understand models as opinions in this sense, and to get people to stop trusting them just because they’re mathematically presented.

Data scientists and engineers needed for a weekend datafest exploring money and politics

I just signed up for an upcoming datafest called “Big Data, Big Money, and You” which will be co-hosted at Columbia University and Stanford University on February 2nd and 3rd.

The idea is to use data from:

and open source tools such as R, python, and various api’s to model and explore various issues in the intersection of money and politics. Among those listed are things like: “look for correlation between the subject of bills introduced to state legislatures to big companies within those districts and campaign donations” and “comparing contributions per and post redistricting”.

As usual, a weekend-long datafest is just the beginning of a good data exploration: if you’re interested in this, think of this as an introduction to the ideas and the people involved; it’s just as much about networking with like-minded people as it is about finding an answer in two days.

So sign up, come on by, and get ready to roll up your sleeves and have a great time for that weekend, but also make sure you get people’s email addresses so you can keep in touch as things continue to develop down the road.

Open data and the emergence of data philanthropy

This is a guest post. Crossposted at aluation.

I’m a bit late to this conversation, but I was reminded by Cathy’s post over the weekend on open data – which most certainly is not a panacea – of my own experience a couple of years ago with a group that is trying hard to do the right thing with open data.

The UN funded a new initiative in 2009 called Global Pulse, with a mandate to explore ways of using Big Data for the rapid identification of emerging crises as well as for crafting more effective development policy in general. Their working hypothesis at its most simple is that the digital traces individuals leave in their electronic life – whether through purchases, mobile phone activity, social media or other sources – can reveal emergent patterns that can help target policy responses. The group’s website is worth a visit for anyone interested in non-commercial applications of data science – they are absolutely the good guys here, doing the kind of work that embodies the social welfare promise of Big Data.

With that said, I think some observations about their experience in developing their research projects may shed some light on one of Cathy’s two main points from her post:

  1. How “open” is open data when there are significant differences in both the ability to access the data, and more important, in the ability to analyze it?
  2. How can we build in appropriate safeguards rather than just focusing on the benefits and doing general hand-waving about the risks?

I’ll focus on Cathy’s first question here since the second gets into areas beyond my pay grade.

The Global Pulse approach to both sourcing and data analytics has been to rely heavily on partnerships with academia and the private sector. To Cathy’s point above, this is true of both closed data projects (such as those that rely on mobile phone data) as well as open data projects (those that rely on blog posts, news sites and other sources). To take one example, the group partnered with two firms in Cambridge to build a real-time indicator of bread prices in Latin America in order. The data in this case was open, while the web-scraping analytics (generally using grocery-story website prices) were developed and controlled by the vendors. As someone who is very interested in food prices, I found their work fascinating. But I also found it unsettling that the only way to make sense of this open data – to turn it into information, in other words – was through the good will of a private company.

The same pattern of open data and closed analytics characterized another project, which tracked Twitter in Indonesia for signals of social distress around food, fuel prices, health and other issues. The project used publicly available Twitter data, so it was open to that extent, though the sheer volume of data and the analytical challenges of teasing meaningful patterns out of it called for a powerful engine. As we all know, web-based consumer analytics are far ahead of the rest of the world in terms of this kind of work. And that was precisely where Global Pulse rationally turned – to a company that has generally focused on analyzing social media on behalf of advertisers.

Does this make them evil? Of course not – as I said above, Global Pulse are the good guys here. My point is not about the nature of their work but about its fragility.

The group’s Director framed their approach this way in a recent blog post:

We are asking companies to consider a new kind of CSR – call it “data philanthropy.” Join us in our efforts by making anonymized data sets available for analysis, by underwriting technology and research projects, or by funding our ongoing efforts in Pulse Labs. The same technologies, tools and analysis that power companies’ efforts to refine the products they sell, could also help make sure their customers are continuing to improve their social and economic wellbeing. We are asking governments to support our efforts because data analytics can help the United Nations become more agile in understanding the needs of and supporting the most vulnerable populations around the globe, which in terms boosts the global economy, benefiting people everywhere.

What happens when corporate donors are no longer willing to be data philanthropists? And a question for Cathy – how can we ensure that these new Data Science programs like the one at Columbia don’t end up just feeding people into consumer analytics firms, in the same way that math and econ programs ended up feeding people into Wall Street jobs?

I don’t have any answers here, and would be skeptical of anyone who claimed to. But the answers to these questions will likely define a lot of the gap between the promise of open data and whatever it ends up becoming.

Open data is not a panacea

I’ve talked a lot recently about how there’s an information war currently being waged on consumers by companies that troll the internet and collect personal data, search histories, and other “attributes” in data warehouses which then gets sold to the highest bidders.

It’s natural to want to balance out this information asymmetry somehow. One such approach is open data, defined in Wikipedia as the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyrightpatents or other mechanisms of control.

I’m going to need more than one blog post to think this through, but I wanted to make two points this morning.

The first is my issue with the phrase “freely available to everyone to use”. What does that mean? Having worked in futures trading, where we put trading machines and algorithms in close proximity with exchanges for large fees so we can get to the market data a few nanoseconds before anyone else, it’s clear to me that availability and access to data is an incredibly complicated issue.

And it’s not just about speed. You can have hugely important, rich, and large data sets sitting in a lump on a publicly available website like wikipedia, and if you don’t have fancy parsing tools and algorithms you’re not going to be able to make use of it.

When important data goes public, the edge goes to the most sophisticated data engineer, not the general public. The Goldman Sachs’s of the world will always know how to make use of “freely available to everyone” data before the average guy.

Which brings me to my second point about open data. It’s general wisdom that we should hope for the best but prepare for the worst. My feeling is that as we move towards open data we are doing plenty of the hoping part but not enough of the preparing part.

If there’s one thing I learned working in finance, it’s not to be naive about how information will be used. You’ve got to learn to think like an asshole to really see what to worry about. It’s a skill which I don’t regret having.

So, if you’re giving me information on where public schools need help, I’m going to imagine using that information to cut off credit for people who live nearby. If you tell me where environmental complaints are being served, I’m going to draw a map and see where they aren’t being served so I can take my questionable business practices there.

I’m not saying proponents of open data aren’t well-meaning, they often seem to be. And I’m not saying that the bad outweighs the good, because I’m not sure. But it’s something we should figure out how to measure, and in this information war it’s something we should keep a careful eye on.

Making math beautiful with XyJax

December 17, 2012 1 comment

My husband A. Johan de Jong has an open source algebraic geometry project called the stacks project. It’s hosted at Columbia, just like his blog which is aptly named the stacks project blog.

The stacks project is awesome: it explains the theory of stacks thoroughly, assuming only that you have a basic knowledge of algebra and a shitload of time to read. It’s about three thousand update: it’s exactly 3,452 pages, give or take, and it has a bunch of contributors besides Johan. I’m on the list most likely because of the fact that I helped him develop the tag system which allows permanent references to theorems and lemmas even within an evolving latex manuscript.

He even has pictures of tags, and hands out t-shirts with pictures of tags when people find mistakes in the stacks project.

Speaking of latex, that’s what I wanted to mention today.

Recently a guy named Pieter Belmans has been helping Johan out with development for the site: spiffing it up and making it look more professional. The most recent thing he did was to render the latex into human readable form using XyJax package, which is an “almost xy-pic compatible package for MathJax“. I think they are understating the case; it looks great to me:

Categories: math, open source tools

MOOC is here to stay, professors will have to find another job

I find myself every other day in a conversation with people about the massive online open course (MOOC) movement.

People often want to complain about the quality of this education substitute. They say that students won’t get the one-on-one interaction between the professor and student that is required to really learn. They complain that we won’t know if someone really knows something if they only took a MOOC or two.

First of all, this isn’t going away, nor should it: it’s many people’s only opportunity to learn this stuff. It’s not like MIT has plans to open 4,000 campuses across the world. It’s really awesome that rural villagers (with internet access) all over the world can now take MIT classes anyway through edX.

Second, if we’re going to put this new kind of education under the microscope, let’s put the current system under the microscope too. Many of the people fretting about the quality of MOOC education are themselves products of super elite universities, and probably don’t know what the average student’s experience actually is. Turns out not everyone gets a whole lot of attention from their professors.

Even at elite institutions, there are plenty of masters programs which are treated as money machines for the university and where the quality and attention of the teaching is a secondary concern. If certain students decide to forgo the thousands of dollars and learn the stuff just as well online, then that would be a good thing (for them at least).

Some things I think are inevitable:

  1. Educational institutions will increasingly need to show they add value beyond free MOOC experiences. This will be an enormous market force for all but the most elite universities.
  2. Instead of seeing where you went to school, potential employers will directly test knowledge of candidates. This will mean weird things like you never actually have to learn a foreign language or study Shakespeare to get a job, but it will be good for the democratization of education in general.
  3. Professors will become increasingly scarce as the role of the professor is decreased.
  4. One-on-one time with masters of a subject will become increasingly rare and expensive. Only truly elite students will have the mythological education experience.
Categories: musing, open source tools

Get every new post delivered to your Inbox.

Join 886 other followers