Today I’d like to mention two ideas I’ve been having recently on how to make being a research mathematician (even) more fun.
1) Mathematicians should consider holding public discussions about papers
First, math nerds, did you know that in statistics they have formal discussions about papers? It’s a long-standing tradition of the Royal Statistical Society, whose motto is “Advancing the science and application of statistics, and promoting use and awareness for public benefit,” to choose papers by some criterion and then hold regular public discussions in which a few experts who are not the author discuss each paper. Then the author responds to their points and the whole conversation is published for posterity.
I think this is a cool idea for math papers too. One thing that kind of depressed me about math is how rarely you’d find people reading the same papers unless you specifically got a group of people together to do so, which was a lot of work. This way the work is done mostly by other people and more importantly the payoff is much better for them since everyone gets a view into the discussion.
Note I’m sidestepping who would organize this whole thing, and how the papers would be chosen exactly, but I’d expect it would improve the overall feeling that I had of being isolated in a tiny math community, especially if the conversations were meant to be penetrable.
2) There should be a good clustering method for papers around topics
This second idea may already be happening, but I’m going to say it anyway, and it could easily be a thesis for someone in CS.
Namely, the idea of using NLP and other such techniques to cluster math papers by topic. Right now the most obvious way to find a “nearby” paper is to look at the graph of papers by direct reference, but you’re probably missing out on lots of stuff that way. I think a different and possibly more interesting way would be to use the text in the title, abstract, and introduction to find papers with similar subjects.
This might be especially useful when you want to know the answer to a question like, “has anyone proved that such-and-such?” and you can do a text search for the statement of that theorem.
The good news here is that mathematicians are in love with terminology, and give weird names to things that make NLP techniques very happy. My favorite recent example which I hear Johan muttering under his breath from time to time is Flabby Sheaves. There’s no way that’s not a distinctive phrase.
The bad news is that such techniques won’t help at all in finding different fields that have come across the same idea but have different names for the relevant objects. But that’s OK, because it means there’s still lots of work for mathematicians.
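For concreteness, here is a minimal sketch of the kind of text-based "nearby paper" search described above. It uses plain TF-IDF weights and cosine similarity, which is only one of many possible NLP techniques and is not a claim about what any existing site does. The three abstracts are invented for illustration, and the names `PAPERS`, `tfidf_vectors`, `cosine` and `most_similar` are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical abstracts, invented for illustration; not real papers.
PAPERS = {
    "A": "Flabby sheaves and the vanishing of sheaf cohomology on a topological space",
    "B": "Cohomology of flabby sheaves via injective resolutions",
    "C": "Random walks on expander graphs and the spectral gap",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Map each document id to a dict of term -> TF-IDF weight."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))
    vectors = {}
    for doc_id, words in tokens.items():
        counts = Counter(words)
        # Rare terms get a large log(n_docs / doc_freq) factor;
        # terms that appear in every document get weight zero.
        vectors[doc_id] = {
            term: (count / len(words)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse term -> weight dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(doc_id, docs):
    """Return the id of the other document closest to doc_id."""
    vectors = tfidf_vectors(docs)
    scores = {
        other: cosine(vectors[doc_id], vec)
        for other, vec in vectors.items()
        if other != doc_id
    }
    return max(scores, key=scores.get)


print(most_similar("A", PAPERS))
```

In this toy corpus the two sheaf abstracts pair up with each other because they share distinctive terms. The sketch also shows the bad news: two abstracts describing the same idea in different vocabulary would share no terms and score zero.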
By the way, back to the question of whether this has already been done. My buddy Max Lieblich has a website called MarXiv which is a wrapper over the math ArXiv and has a “similar” button. I have no idea what that button actually does though. In any case I totally dig the design of the similar button, and what I propose is just to have something like that work with NLP.
This is a guest post by Marc Joffe, the principal consultant at Public Sector Credit Solutions, an organization that provides data and analysis related to sovereign and municipal securities. Previously, Joffe was a Senior Director at Moody’s Analytics.
As Cathy has argued, open source models can bring much needed transparency to scientific research, finance, education and other fields plagued by biased, self-serving analytics. Models often need large volumes of data, and if the model is to be run on an ongoing basis, regular data updates are required.
Unfortunately, many data sets are not ready to be loaded into your analytical tool of choice; they arrive in an unstructured form and must be organized into a consistent set of rows and columns. This cleaning process can be quite costly. Since open source modeling efforts are usually low dollar operations, the costs of data cleaning may prove to be prohibitive. Hence no open model – distortion and bias continue their reign.
Much data comes to us in the form of PDFs. Say, for example, you want to model student loan securitizations. You will be confronted with a large number of PDF servicing reports that look like this. A corporation or well funded research institution can purchase an expensive, enterprise-level ETL (Extract-Transform-Load) tool to migrate data from the PDFs into a database. But this is not much help to insurgent modelers who want to produce open source work.
Data journalists face a similar challenge. They often need to extract bulk data from PDFs to support their reporting. Examples include IRS Form 990s filed by non-profits and budgets issued by governments at all levels.
The data journalism community has responded to this challenge by developing software to harvest usable information from PDFs. One example is Tabula, a tool written by Knight-Mozilla OpenNews Fellow Manuel Aristarán, which extracts data from PDF tables in a form that can be readily imported to a spreadsheet – if the PDF was “printed” from a computer application. Introduced earlier this year, Tabula continues to evolve thanks to the volunteer efforts of Manuel, with help from OpenNews Fellow Mike Tigas and New York Times interactive developer Jeremy Merrill. Meanwhile, DocHive, a tool whose continuing development is being funded by a Knight Foundation grant, addresses PDFs that were created by scanning paper documents. DocHive is a project of Raleigh Public Record and is led by Charles and Edward Duncan.
These open source tools join a number of commercial offerings such as Able2Extract and ABBYY FineReader that extract data from PDFs. A more comprehensive list of open source and commercial resources is available here.
Unfortunately, the free and low cost tools available to modelers, data journalists and transparency advocates have limitations that hinder their ability to handle large scale tasks. If, like me, you want to submit hundreds of PDFs to a software tool, press “Go” and see large volumes of cleanly formatted data, you are out of luck.
It is for this reason that I am working with The Sunlight Foundation and other sponsors to stage the PDF Liberation Hackathon from January 17-19, 2014. We’ll have hack sites at Sunlight’s Washington DC office and at RallyPad in San Francisco. Developers can also join remotely because we will publish a number of clearly specified PDF extraction challenges before the hackathon.
Participants can work on one of the pre-specified challenges or choose their own PDF extraction projects. Ideally, hackathon teams will use (and hopefully improve upon) open source tools to meet the hacking challenges, but they will also be allowed to embed commercial tools into their projects as long as their licensing cost is less than $1000 and an unlimited trial is available.
Prizes of up to $500 will be awarded to winning entries. To receive a prize, a team must publish their source code on a GitHub public repository. To join the hackathon in DC or remotely, please sign up at Eventbrite; to hack with us in SF, please sign up via this Meetup. Please also complete our Google Form survey. Also, if anyone reading this is associated with an organization in New York or Chicago that would like to organize an additional hack space, please contact me.
The PDF Liberation Hackathon is going to be a great opportunity to advance the state of the art when it comes to harvesting data from public documents. I hope you can join us.
I’m lucky to be working with a super fantastic python guy on this, and the details are under wraps, but let’s just say it’s exciting.
So I’m looking to showcase a few good models to start with, preferably in python, but the critical ingredient is that they’re open source. They don’t have to be great, because the point is to see their flaws and possibly to improve them.
- For example, I put in a FOIA request a couple of days ago to get the current teacher value-added model from New York City.
- A friend of mine, Marc Joffe, has an open source municipal credit rating model. It’s not in python but I’m hopeful we can work with it anyway.
- I’m in search of an open source credit scoring model for individuals. Does anyone know of something like that?
- They don’t have to be creepy! How about a Nate Silver-style weather model?
- Or something that relies on open government data?
- Can we get the Reinhart-Rogoff model?
The idea here is to get the model, not necessarily the data (although even better if it can be attached to data and updated regularly). And once we get a model, we’d build interactives with the model (like this one), or at least the tools to do so, so other people could build them.
At its core, the point of open models is this: you don’t really know what a model does until you can interact with it. You don’t know if a model is robust unless you can fiddle with its parameters and check. And finally, you don’t know if a model is the best possible unless you’ve let people try to improve it.
The idea is that we’re analyzing metadata around a texting hotline for teens in crisis. We’re trying to see if we can use the information we have on these texts (timestamps, character length, topic – which is most often suicide – and outcome reported by both the texter and the counselor) to help the counselors improve their responses.
For example, right now counselors can be in up to 5 conversations at a time – is that too many? Can we figure that out from the data? Is there too much waiting between texts? Other questions are listed here.
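As a starting point for the "how many conversations at once" question, here is a minimal sketch that computes each counselor's peak number of simultaneous conversations. It assumes each conversation can be reduced to a start and end time, which is my assumption and not a description of the actual data. The records and the names `CONVERSATIONS`, `max_concurrent` and `peak_load_by_counselor` are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (counselor_id, start_minute, end_minute).
# Invented for illustration; the real data layout may differ.
CONVERSATIONS = [
    ("c1", 0, 30),
    ("c1", 10, 40),
    ("c1", 20, 25),
    ("c1", 30, 50),
    ("c2", 0, 15),
    ("c2", 15, 45),
]


def max_concurrent(intervals):
    """Return the peak number of overlapping (start, end) intervals."""
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a conversation opens
        events.append((end, -1))    # a conversation closes
    # Tuples sort by time first; at equal times -1 sorts before +1,
    # so a conversation ending at t does not overlap one starting at t.
    events.sort()
    peak = current = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


def peak_load_by_counselor(records):
    """Group conversations by counselor and compute each one's peak load."""
    by_counselor = defaultdict(list)
    for counselor, start, end in records:
        by_counselor[counselor].append((start, end))
    return {c: max_concurrent(iv) for c, iv in by_counselor.items()}


print(peak_load_by_counselor(CONVERSATIONS))
```

With a load figure like this attached to each conversation, one could then compare reported outcomes at different load levels, which is the comparison the question is really asking for.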
Our “hackpad” is located here, and will hopefully be updated like a wiki with results and visuals from the exploration of our group. It looks like we have a pretty amazing group of nerds over here looking into this (mostly python users!), and I’m hopeful that we will be helping the good people at Crisis Text Line.
I’m preparing for a short trip to D.C. this week to take part in a day-long event held by Americans for Financial Reform. You can get the announcement here online, but I’m not sure what the finalized schedule of the day is going to be. Also, I believe it will be recorded, but I don’t know the details yet.
In any case, I’m psyched to be joining this, and the AFR are great guys doing important work in the realm of financial reform.
Opening Wall Street’s Black Box: Pathways to Improved Financial Transparency
Sponsored By Americans for Financial Reform and Georgetown University Law Center
Keynote Speaker: Gary Gensler, Chair, Commodity Futures Trading Commission
October 11, 2013 10 AM – 3 PM
Georgetown Law Center, Gewirz Student Center, 12th Floor
120 F Street NW, Washington, DC (Judiciary Square Metro) (Space is limited. Please RSVP to AFRtransparencyrsvp@gmail.com)
The 2008 financial crisis revealed that regulators and many sophisticated market participants were in the dark about major risks and exposures in our financial system. The lack of financial transparency enabled large-scale fraud and deception of investors, weakened the stability of the financial system, and contributed to the market failure after the collapse of Lehman Brothers. Five years later, despite regulatory efforts, it’s not clear how much the situation has improved.
Join regulators, market participants, and academic experts for an exploration of the progress made – and the work that remains to be done – toward meaningful transparency on Wall Street. How can better information and disclosure make the financial system both fairer and safer?
Jesse Eisinger, Pulitzer Prize-winning reporter for the New York Times and ProPublica
Zach Gast, Head of financial sector research, Center on Financial Research and Analysis
Amias Gerety, Deputy Assistant Secretary for the FSOC, United States Treasury
Henry Hu, Allan Shivers Chair in the Law of Banking and Finance, University of Texas Law School
Albert “Pete” Kyle, Charles E. Smith Professor of Finance, University of Maryland
Adam Levitin, Professor of Law, Georgetown University Law Center
Antoine Martin, Vice President, New York Federal Reserve Bank
Brad Miller, Former Representative from North Carolina; Of Counsel, Grais & Ellsworth
Cathy O’Neil, Senior Data Scientist, Johnson Research Labs; Occupy Alternative Banking
Gene Phillips, Director, PF2 Securities Evaluation
Greg Smith, Author of “Why I Left Goldman Sachs”; former Goldman Sachs Executive Director
The 2013 PopTech & Rockefeller Foundation Bellagio Fellows – Kate Crawford, Patrick Meier, Claudia Perlich, Amy Luers, Gustavo Faleiros and Jer Thorp – yesterday published “Seven Principles for Big Data and Resilience Projects” on Patrick Meier’s blog iRevolution.
Although they claim that these principles are meant for “best practices for resilience building projects that leverage Big Data and Advanced Computing,” I think they’re more general than that (although I’m not sure exactly what a resilience building project is), and I really like them. They are looking for public comments too. Go to the post for the full description of each, but here is a summary:
1. Open Source Data Tools
Wherever possible, data analytics and manipulation tools should be open source, architecture independent and broadly prevalent (R, python, etc.).
2. Transparent Data Infrastructure
Infrastructure for data collection and storage should operate based on transparent standards to maximize the number of users that can interact with the infrastructure.
3. Develop and Maintain Local Skills
Make “Data Literacy” more widespread. Leverage local data labor and build on existing skills.
4. Local Data Ownership
Use Creative Commons and licenses that state that data is not to be used for commercial purposes.
5. Ethical Data Sharing
Adopt existing data sharing protocols like the ICRC’s (2013). Permission for sharing is essential. How the data will be used should be clearly articulated. An opt in approach should be the preference wherever possible, and the ability for individuals to remove themselves from a data set after it has been collected must always be an option.
6. Right Not To Be Sensed
Local communities have a right not to be sensed. Large scale city sensing projects must have a clear framework for how people are able to be involved or choose not to participate.
7. Learning from Mistakes
Big Data and Resilience projects need to be open to face, report, and discuss failures.
There are lots of things I know nothing at all about. It annoys me not to understand a subject at all, because it often means I can’t follow a conversation that I care about. The list includes, just as a start: accounting, law, and politics.
Of those three, accounting seems like the easiest thing to tackle by far. This is partly because the space between what it’s theoretically supposed to be and how it’s practiced is smaller than with law or politics. Or maybe the kind of tricks accountants use seem closer to the kind of tricks I know about from being a quant, so that space seems easier to navigate for me personally.
Anyway, I might be wrong, but my impression is that my lack of understanding of accounting is mostly a language barrier, rather than a conceptual problem. There are expenses, and revenue, and lots of tax issues. There are categories. I’m working on the assumption that none of this stuff is exactly mathematical either; it’s all about knowing what things are called. And I don’t know any of it.
So I just signed up to learn at least some of it on a free Coursera course from the Wharton MBA Foundation Series. Here’s the introductory video; the professor seems super nerdy and goofy, which is a good start.
So in my copious free time I’ll be watching videos explaining the language of tax deferment and the like. Or at least that’s the fantasy – the thing about Coursera is that it’s free, so there’s not much incentive to keep up with the course. And the fact that all four Wharton 1st-year courses are being given away for free is proof of something, by the way – possibly that what you’re really paying for in business school is the connections you make while you’re there.