Archive for the ‘open source tools’ Category

Does the OpenSSL bug prove that open source code doesn’t work?

By now most of you have read about the major bug that was found in OpenSSL, an open source security software toolkit. The bug itself is called the Heartbleed Bug, and there’s lots of information about it and how to fix it here. People are super upset about this, and lots of questions remain.

For example, was it intentionally undermined? Has the NSA deliberately inserted weaknesses into this as well? It seems like the jury is out right now, but if I’m the guy who put in the bug, I’m changing my name and going undercover just in case.

Next, how widely was the weakness exploited? If you’re super worried about stuff, or if you are a particular target of attack, the answer is probably “widely.” The frustrating thing is that there’s seemingly no way to measure or test that assumption, since the attackers would leave no trace.

Here’s what I find the most interesting question: what will the long-term reaction to open source software be? People might think that open source code is a bust after this. They will complain that something like this should never have been allowed to happen – that the whole point of open software is that people should be checking this stuff as it comes in – and that it never would have happened if there were people getting paid to test the software.

First of all, the open source process did work as intended, even though it took two years instead of the two days people might have wanted. And maybe this shouldn’t have happened the way it did, but I suspect that people will take this particular lesson to heart from now on.

But in general terms, bugs are everywhere. Think about Knight Capital’s trading debacle or the ObamaCare website, to name just two famous recent problems with large-scale coding projects that aren’t open source.

Even when people are paid to fix bugs, they fix the kind of bug that makes the software crash a lot sooner than the kind that doesn’t make anything explode, quietly lets people see information they shouldn’t see, and leaves no trace. So for every Knight Capital there are tons of other bugs in software that simply continue to exist.

In other words it’s more a question of who knows about the bugs and who can exploit them. And of course, whether those weaknesses will ever be exposed to the public at all.

It would be great to see the OpenSSL bug story become, over time, a success story. That would mean, on the one hand, the nerds becoming more vigilant in checking vitally important code and learning to think like assholes, and on the other hand the public acknowledging how freaking hard it is to program.

Categories: musing, open source tools

Two thoughts on math research papers

Today I’d like to mention two ideas I’ve been having recently on how to make being a research mathematician (even) more fun.

1) Mathematicians should consider holding public discussions about papers

First, math nerds, did you know that in statistics they have formal discussions about papers? The Royal Statistical Society, whose motto is “Advancing the science and application of statistics, and promoting use and awareness for public benefit,” has a long-standing tradition of choosing papers by some criterion and then holding regular public discussions of those papers, led by a few experts who are not the author. Then the author responds to their points and the whole conversation is published for posterity.

I think this is a cool idea for math papers too. One thing that kind of depressed me about math is how rarely you’d find people reading the same papers unless you specifically got a group together to do so, which was a lot of work. This way the work is mostly done by a few designated readers, and more importantly the payoff is much bigger for them, since everyone gets a view into the discussion.

Note that I’m sidestepping who would organize this whole thing, and how exactly the papers would be chosen, but I’d expect it to ease the sense of isolation I felt in a tiny math community, especially if the conversations were meant to be penetrable.

2) There should be a good clustering method for papers around topics

This second idea may already be happening, but I’m going to say it anyway, and it could easily be a thesis for someone in CS.

Namely, the idea of using NLP and other such techniques to cluster math papers by topic. Right now the most obvious way to find a “nearby” paper is to look at the graph of papers by direct reference, but you’re probably missing out on lots of stuff that way. I think a different and possibly more interesting way would be to use the text in the title, abstract, and introduction to find papers with similar subjects.

This might be especially useful when you want to know the answer to a question like, “has anyone proved that such-and-such?” and you can do a text search for the statement of that theorem.
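To make this concrete, here is a minimal sketch of what such a similarity engine might look like, using scikit-learn’s TF-IDF vectorizer and cosine similarity. To be clear, the paper titles below are made-up stand-ins for real title/abstract/introduction text; a real system would need the actual arXiv corpus.

```python
# Minimal sketch: TF-IDF over paper text, cosine similarity for
# "nearby papers" and for free-text queries. The papers are fake.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

papers = [
    "flabby sheaves and the vanishing of higher cohomology",
    "perverse sheaves on moduli stacks of vector bundles",
    "heat kernel estimates on manifolds with ricci curvature bounds",
    "a central limit theorem for random matrices with heavy tails",
]

# Bigrams help capture the distinctive compound terminology
# mathematicians love, which is exactly what makes NLP happy here.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(papers)

# "Has anyone proved such-and-such?" becomes a nearest-neighbor query:
# embed the question the same way and rank papers by cosine similarity.
query = vectorizer.transform(["cohomology of flabby sheaves"])
scores = cosine_similarity(query, X).ravel()
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {papers[idx]}")
```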

The good news here is that mathematicians are in love with terminology, and give weird names to things that make NLP techniques very happy. My favorite recent example which I hear Johan muttering under his breath from time to time is Flabby Sheaves. There’s no way that’s not a distinctive phrase.

The bad news is that such techniques won’t help at all in finding different fields that have come across the same idea but use different names for the relevant objects. But that’s OK, because it means there’s still lots of work for mathematicians.

By the way, back to the question of whether this has already been done. My buddy Max Lieblich has a website called MarXiv which is a wrapper over the math ArXiv and has a “similar” button. I have no idea what that button actually does though. In any case I totally dig the design of the similar button, and what I propose is just to have something like that work with NLP.

Categories: math, musing, open source tools

PDF Liberation Hackathon: January 17-19

This is a guest post by Marc Joffe, the principal consultant at Public Sector Credit Solutions, an organization that provides data and analysis related to sovereign and municipal securities. Previously, Joffe was a Senior Director at Moody’s Analytics.

As Cathy has argued, open source models can bring much needed transparency to scientific research, finance, education and other fields plagued by biased, self-serving analytics. Models often need large volumes of data, and if the model is to be run on an ongoing basis, regular data updates are required.

Unfortunately, many data sets are not ready to be loaded into your analytical tool of choice; they arrive in an unstructured form and must be organized into a consistent set of rows and columns. This cleaning process can be quite costly. Since open source modeling efforts are usually low dollar operations, the costs of data cleaning may prove to be prohibitive. Hence no open model – distortion and bias continue their reign.

Much data comes to us in the form of PDFs. Say, for example, you want to model student loan securitizations. You will be confronted with a large number of PDF servicing reports that look like this. A corporation or well funded research institution can purchase an expensive, enterprise-level ETL (Extract-Transform-Load) tool to migrate data from the PDFs into a database. But this is not much help to insurgent modelers who want to produce open source work.

Data journalists face a similar challenge. They often need to extract bulk data from PDFs to support their reporting. Examples include IRS Form 990s filed by non-profits and budgets issued by governments at all levels.

The data journalism community has responded to this challenge by developing software to harvest usable information from PDFs. One example is Tabula, a tool written by Knight-Mozilla OpenNews Fellow Manuel Aristarán that extracts data from PDF tables in a form that can be readily imported into a spreadsheet – if the PDF was “printed” from a computer application. Introduced earlier this year, Tabula continues to evolve thanks to the volunteer efforts of Manuel, with help from OpenNews Fellow Mike Tigas and New York Times interactive developer Jeremy Merrill. Meanwhile, DocHive, a tool whose continuing development is being funded by a Knight Foundation grant, addresses PDFs that were created by scanning paper documents. DocHive is a project of Raleigh Public Record and is led by Charles and Edward Duncan.

These open source tools join a number of commercial offerings such as Able2Extract and ABBYY Fine Reader that extract data from PDFs. A more comprehensive list of open source and commercial resources is available here.

Unfortunately, the free and low cost tools available to modelers, data journalists and transparency advocates have limitations that hinder their ability to handle large scale tasks. If, like me, you want to submit hundreds of PDFs to a software tool, press “Go” and see large volumes of cleanly formatted data, you are out of luck.
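To make the “press Go” goal concrete, here is a rough sketch of what that batch pipeline might look like in python, using the open source pdfplumber library as one illustrative option (it is not one of the tools named above, and the directory and file layout here are invented). Real servicing reports vary wildly in layout, which is exactly why this remains hard.

```python
# Rough sketch of a "hundreds of PDFs, press Go" pipeline using the
# open source pdfplumber library. File paths here are hypothetical.
import csv
import pathlib

import pdfplumber

def extract_tables(pdf_path, csv_path):
    """Dump every table pdfplumber can find in one PDF into one CSV."""
    with pdfplumber.open(pdf_path) as pdf, open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        for page in pdf.pages:
            for table in page.extract_tables():
                # Skip rows that are entirely empty cells.
                writer.writerows(row for row in table if any(row))

for pdf_file in pathlib.Path("servicing_reports").glob("*.pdf"):
    extract_tables(pdf_file, pdf_file.with_suffix(".csv"))
```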

It is for this reason that I am working with The Sunlight Foundation and other sponsors to stage the PDF Liberation Hackathon from January 17-19, 2014. We’ll have hack sites at Sunlight’s Washington DC office and at RallyPad in San Francisco. Developers can also join remotely because we will publish a number of clearly specified PDF extraction challenges before the hackathon.

Participants can work on one of the pre-specified challenges or choose their own PDF extraction projects. Ideally, hackathon teams will use (and hopefully improve upon) open source tools to meet the hacking challenges, but they will also be allowed to embed commercial tools into their projects as long as their licensing cost is less than $1000 and an unlimited trial is available.

Prizes of up to $500 will be awarded to winning entries. To receive a prize, a team must publish their source code on a GitHub public repository. To join the hackathon in DC or remotely, please sign up at Eventbrite; to hack with us in SF, please sign up via this Meetup. Please also complete our Google Form survey. Also, if anyone reading this is associated with an organization in New York or Chicago that would like to organize an additional hack space, please contact me.

The PDF Liberation Hackathon is going to be a great opportunity to advance the state of the art when it comes to harvesting data from public documents. I hope you can join us.

Cool open-source models?

I’m looking to develop my idea of open models, which I motivated here and started to describe here. I wrote the post in March 2012, but the need for such a platform has only become more obvious.

I’m lucky to be working with a super fantastic python guy on this, and the details are under wraps, but let’s just say it’s exciting.

So I’m looking to showcase a few good models to start with, preferably in python, but the critical ingredient is that they’re open source. They don’t have to be great, because the point is to see their flaws and possibly to improve them.

  1. For example, I put in a FOIA request a couple of days ago to get the current teacher value-added model from New York City.
  2. A friend of mine, Marc Joffe, has an open source municipal credit rating model. It’s not in python but I’m hopeful we can work with it anyway.
  3. I’m in search of an open source credit scoring model for individuals. Does anyone know of something like that?
  4. They don’t have to be creepy! How about a Nate Silver-style weather model?
  5. Or something that relies on open government data?
  6. Can we get the Reinhart-Rogoff model?

The idea here is to get the model, not necessarily the data (although even better if it can be attached to data and updated regularly). And once we get a model, we’d build interactives with the model (like this one), or at least the tools to do so, so other people could build them.

At its core, the point of open models is this: you don’t really know what a model does until you can interact with it. You don’t know if a model is robust unless you can fiddle with its parameters and check. And finally, you don’t know if a model is the best possible unless you’ve let people try to improve it.
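As a toy illustration of the parameter-fiddling point, here is a completely made-up two-parameter “credit score” whose verdict on the same applicant flips as one weight moves through a plausible range. Every number below is invented; the point is only that you can’t see this kind of fragility without running the model yourself.

```python
# Toy model: nothing here comes from any real credit scoring system.
def score(income, late_payments, w_income=2.0, w_late=15.0):
    """Higher is better; 600 is an arbitrary approval cutoff."""
    return 550 + w_income * (income / 1000.0) - w_late * late_payments

applicant = {"income": 40_000, "late_payments": 2}

# Sweep one weight and watch the approve/deny decision flip.
for w_late in (10.0, 15.0, 20.0):
    s = score(applicant["income"], applicant["late_payments"], w_late=w_late)
    verdict = "approve" if s >= 600 else "deny"
    print(f"w_late={w_late}: score={s:.0f} -> {verdict}")
```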

Crisis Text Line: Using Data to Help Teens in Crisis

This morning I’m helping out at a datadive event set up by DataKind (apologies to Aunt Pythia lovers).

The idea is that we’re analyzing metadata around a texting hotline for teens in crisis. We’re trying to see if we can use the information we have on these texts (timestamps, character length, topic – which is most often suicide – and outcome reported by both the texter and the counselor) to help the counselors improve their responses.

For example, right now counselors can be in up to 5 conversations at a time – is that too many? Can we figure that out from the data? Is there too much waiting between texts? Other questions are listed here.
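As a sketch of how one of these questions might be attacked, here’s the kind of groupby you’d run once the metadata is in a dataframe. To be clear, the column names and numbers below are invented; the real Crisis Text Line schema may look nothing like this.

```python
# Hypothetical sketch: does the counselor-reported outcome degrade as
# the number of simultaneous conversations goes up? All data invented.
import pandas as pd

texts = pd.DataFrame({
    "counselor_id":      [1, 1, 1, 2, 2, 3],
    "concurrent_convos": [1, 3, 5, 2, 4, 5],
    "wait_minutes":      [2.0, 4.5, 9.0, 3.0, 6.5, 8.0],
    "outcome_good":      [1, 1, 0, 1, 0, 0],
})

# Average outcome and waiting time at each concurrency level.
by_load = texts.groupby("concurrent_convos").agg(
    good_rate=("outcome_good", "mean"),
    avg_wait=("wait_minutes", "mean"),
    n=("outcome_good", "size"),
)
print(by_load)
```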

Our “hackpad” is located here, and will hopefully be updated like a wiki with results and visuals from the exploration of our group. It looks like we have a pretty amazing group of nerds over here looking into this (mostly python users!), and I’m hopeful that we will be helping the good people at Crisis Text Line.

A.F.R. Transparency Panel coming up on Friday in D.C.

I’m preparing for a short trip to D.C. this week to take part in a day-long event held by Americans for Financial Reform. You can get the announcement here online, but I’m not sure what the finalized schedule of the day is going to be. Also, I believe it will be recorded, but I don’t know the details yet.

In any case, I’m psyched to be joining this, and the AFR are great guys doing important work in the realm of financial reform.

——

 Opening Wall Street’s Black Box: Pathways to Improved Financial Transparency 

Sponsored By Americans for Financial Reform and Georgetown University Law Center

Keynote Speaker: Gary Gensler, Chair, Commodity Futures Trading Commission

October 11, 2013 10 AM – 3 PM

Georgetown Law Center, Gewirz Student Center, 12th Floor

120 F Street NW, Washington, DC (Judiciary Square Metro) (Space is limited. Please RSVP to AFRtransparencyrsvp@gmail.com)

The 2008 financial crisis revealed that regulators and many sophisticated market participants were in the dark about major risks and exposures in our financial system. The lack of financial transparency enabled large-scale fraud and deception of investors, weakened the stability of the financial system, and contributed to the market failure after the collapse of Lehman Brothers. Five years later, despite regulatory efforts, it’s not clear how much the situation has improved.

Join regulators, market participants, and academic experts for an exploration of the progress made – and the work that remains to be done – toward meaningful transparency on Wall Street. How can better information and disclosure make the financial system both fairer and safer?

Panelists include:

Jesse Eisinger, Pulitzer Prize-winning reporter for the New York Times and Pro Publica
Zach Gast, Head of financial sector research, Center on Financial Research and Analysis
Amias Gerety, Deputy Assistant Secretary for the FSOC, United States Treasury
Henry Hu, Alan Shivers Chair in the Law of Banking and Finance, University of Texas Law School
Albert “Pete” Kyle, Charles E. Smith Professor of Finance, University of Maryland
Adam Levitan, Professor of Law, Georgetown University Law Center
Antoine Martin, Vice President, New York Federal Reserve Bank
Brad Miller, Former Representative from North Carolina; Of Counsel, Grais & Ellsworth
Cathy O’Neil, Senior Data Scientist, Johnson Research Labs; Occupy Alternative Banking
Gene Phillips, Director, PF2 Securities Evaluation
Greg Smith, Author of “Why I Left Goldman Sachs”; former Goldman Sachs Executive Director

A Code of Conduct for data scientists from the Bellagio Fellows

September 25, 2013

The 2013 PopTech & Rockefeller Foundation Bellagio Fellows – Kate Crawford, Patrick Meier, Claudia Perlich, Amy Luers, Gustavo Faleiros, and Jer Thorp – yesterday published “Seven Principles for Big Data and Resilience Projects” on Patrick Meier’s blog iRevolution.

Although they say these principles are meant as “best practices for resilience building projects that leverage Big Data and Advanced Computing,” I think they’re more general than that (although I’m not sure exactly what a resilience building project is), and I really like them. They are looking for public comments too. Go to the post for the full description of each, but here is a summary:

1. Open Source Data Tools

Wherever possible, data analytics and manipulation tools should be open source, architecture independent and broadly prevalent (R, python, etc.).

2. Transparent Data Infrastructure

Infrastructure for data collection and storage should operate based on transparent standards to maximize the number of users that can interact with the infrastructure.

3. Develop and Maintain Local Skills

Make “Data Literacy” more widespread. Leverage local data labor and build on existing skills.

4. Local Data Ownership

Use Creative Commons and licenses that state that data is not to be used for commercial purposes.

5. Ethical Data Sharing

Adopt existing data sharing protocols like the ICRC’s (2013). Permission for sharing is essential. How the data will be used should be clearly articulated. An opt-in approach should be preferred wherever possible, and the ability for individuals to remove themselves from a data set after it has been collected must always be an option.

6. Right Not To Be Sensed

Local communities have a right not to be sensed. Large scale city sensing projects must have a clear framework for how people are able to be involved or choose not to participate.

7. Learning from Mistakes

Big Data and Resilience projects need to be open to face, report, and discuss failures.
