PDF Liberation Hackathon: January 17-19

December 6, 2013 Cathy O'Neil, mathbabe

This is a guest post by Marc Joffe, the principal consultant at Public Sector Credit Solutions, an organization that provides data and analysis related to sovereign and municipal securities. Previously, Joffe was a Senior Director at Moody’s Analytics.

As Cathy has argued, open source models can bring much-needed transparency to scientific research, finance, education and other fields plagued by biased, self-serving analytics. Models often need large volumes of data, and if a model is to be run on an ongoing basis, it also needs regular data updates.

Unfortunately, many data sets are not ready to be loaded into your analytical tool of choice; they arrive in an unstructured form and must be organized into a consistent set of rows and columns. This cleaning process can be quite costly. Since open source modeling efforts are usually low-budget operations, the cost of data cleaning may prove prohibitive. The result: no open model, and distortion and bias continue their reign.
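To make "rows and columns" concrete, here is a minimal sketch of one cleaning step, assuming the text has already been pulled out of a document and that columns are separated by runs of two or more spaces. The sample report, the function names and the column count are all hypothetical; real servicing reports are messier and would need more rules than this.

```python
import re

def split_columns(line):
    """Split one line of extracted text into cells on runs of 2+ whitespace characters."""
    return [cell.strip() for cell in re.split(r"\s{2,}", line.strip())]

def to_rows(text, expected_columns):
    """Keep only the lines that split into the expected number of cells."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        cells = split_columns(line)
        if len(cells) == expected_columns:
            rows.append(cells)
    return rows

# Hypothetical sample, not taken from any real report.
sample = """Servicing Report (hypothetical sample)
Pool      Balance       Delinquent
A-1       1,250,000     2.1%
A-2       980,500       3.4%
"""

rows = to_rows(sample, expected_columns=3)
```

The title line is dropped because it does not split into three cells, which is the kind of rule a cleaning script ends up accumulating one document layout at a time.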

Much data comes to us in the form of PDFs. Say, for example, you want to model student loan securitizations. You will be confronted with a large number of PDF servicing reports that look like this. A corporation or well-funded research institution can purchase an expensive, enterprise-level ETL (Extract-Transform-Load) tool to migrate data from the PDFs into a database. But this is not much help to insurgent modelers who want to produce open source work.

Data journalists face a similar challenge. They often need to extract bulk data from PDFs to support their reporting. Examples include IRS Form 990s filed by non-profits and budgets issued by governments at all levels.

The data journalism community has responded to this challenge by developing software to harvest usable information from PDFs. One example is Tabula, a tool written by Knight-Mozilla OpenNews Fellow Manuel Aristarán, which extracts data from PDF tables in a form that can be readily imported into a spreadsheet – if the PDF was “printed” from a computer application. Introduced earlier this year, Tabula continues to evolve thanks to the volunteer efforts of Manuel, with help from OpenNews Fellow Mike Tigas and New York Times interactive developer Jeremy Merrill. Meanwhile, DocHive, a tool whose continuing development is being funded by a Knight Foundation grant, addresses PDFs that were created by scanning paper documents. DocHive is a project of Raleigh Public Record and is led by Charles and Edward Duncan.

These open source tools join a number of commercial offerings, such as Able2Extract and ABBYY FineReader, that extract data from PDFs. A more comprehensive list of open source and commercial resources is available here.

Unfortunately, the free and low-cost tools available to modelers, data journalists and transparency advocates have limitations that hinder their ability to handle large-scale tasks. If, like me, you want to submit hundreds of PDFs to a software tool, press “Go” and see large volumes of cleanly formatted data, you are out of luck.
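The "press Go" workflow can be sketched as a small batch driver. This is only an illustration under stated assumptions: `extract_tables_stub` is a hypothetical placeholder for whatever extraction tool you plug in (it does not read the PDF at all), and `liberate_folder` is an invented name, not part of any existing tool.

```python
import csv
from pathlib import Path

def extract_tables_stub(pdf_path):
    """Hypothetical placeholder: a real extractor would return the rows found in the PDF."""
    return [["source", "pages"], [pdf_path.name, "unknown"]]

def liberate_folder(in_dir, out_dir, extract=extract_tables_stub):
    """Run `extract` on every *.pdf in in_dir and write one CSV per PDF to out_dir.

    Returns (written, failed) so that one unreadable file does not stop the batch.
    """
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written, failed = [], []
    for pdf_path in sorted(in_dir.glob("*.pdf")):
        try:
            rows = extract(pdf_path)
        except Exception as exc:
            # Record the failure and move on to the next document.
            failed.append((pdf_path.name, str(exc)))
            continue
        csv_path = out_dir / (pdf_path.stem + ".csv")
        with csv_path.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)
        written.append(csv_path.name)
    return written, failed
```

Calling `liberate_folder("reports", "csv_out", extract=my_extractor)` with your own extractor function would process the whole folder in one pass; the list of failures tells you which documents still need manual attention.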

It is for this reason that I am working with The Sunlight Foundation and other sponsors to stage the PDF Liberation Hackathon from January 17-19, 2014. We’ll have hack sites at Sunlight’s Washington DC office and at RallyPad in San Francisco. Developers can also join remotely because we will publish a number of clearly specified PDF extraction challenges before the hackathon.

Participants can work on one of the pre-specified challenges or choose their own PDF extraction projects. Ideally, hackathon teams will use (and hopefully improve upon) open source tools to meet the hacking challenges, but they will also be allowed to embed commercial tools into their projects as long as the licensing cost is less than $1,000 and an unlimited trial is available.

Prizes of up to $500 will be awarded to winning entries. To receive a prize, a team must publish its source code in a public GitHub repository. To join the hackathon in DC or remotely, please sign up at Eventbrite; to hack with us in SF, please sign up via this Meetup. Please also complete our Google Form survey. Finally, if anyone reading this is associated with an organization in New York or Chicago that would like to organize an additional hack space, please contact me.

The PDF Liberation Hackathon is going to be a great opportunity to advance the state of the art when it comes to harvesting data from public documents. I hope you can join us.

Categories: data science, finance, news, open source tools

Comments (4)

mb

December 6, 2013 at 8:13 am

You know who is doing a great job addressing some of these issues? The Republicans in the House. You know who is not? I’ll let you guess. I’ll say a couple of other things: do you think Sallie Mae or government entities have the data in Excel spreadsheets or databases? Why don’t they allow access to that data? What do you think they gain by not being transparent? Open source projects are great and all that, but the core issue is that the government is purposefully releasing data that can’t be used, in order to avoid accountability. Yet some people trust them. I’ll never get that.

Guest2

December 6, 2013 at 12:50 pm

I’ve used OmniPage for OCR in the past, but I guess it isn’t up to the challenge of massive piles of data. With OmniPage there is no need for the data to have been “printed” from a spreadsheet. What other OCR packages are out there?

- Marc
  
  December 6, 2013 at 8:44 pm
  
  I prefer ABBYY FineReader to OmniPage, but I have not looked at OmniPage recently. The open source alternative is Tesseract.
  