This is a guest post by Marc Joffe, the principal consultant at Public Sector Credit Solutions, an organization that provides data and analysis related to sovereign and municipal securities. Previously, Joffe was a Senior Director at Moody’s Analytics.
As Cathy has argued, open source models can bring much needed transparency to scientific research, finance, education and other fields plagued by biased, self-serving analytics. Models often need large volumes of data, and if the model is to be run on an ongoing basis, regular data updates are required.
Unfortunately, many data sets are not ready to be loaded into your analytical tool of choice; they arrive in an unstructured form and must be organized into a consistent set of rows and columns. This cleaning process can be quite costly. Since open source modeling efforts are usually low dollar operations, the costs of data cleaning may prove to be prohibitive. Hence no open model – distortion and bias continue their reign.
Much data comes to us in the form of PDFs. Say, for example, you want to model student loan securitizations. You will be confronted with a large number of PDF servicing reports that look like this. A corporation or well funded research institution can purchase an expensive, enterprise-level ETL (Extract-Transform-Load) tool to migrate data from the PDFs into a database. But this is not much help to insurgent modelers who want to produce open source work.
Data journalists face a similar challenge. They often need to extract bulk data from PDFs to support their reporting. Examples include IRS Form 990s filed by non-profits and budgets issued by governments at all levels.
The data journalism community has responded to this challenge by developing software to harvest usable information from PDFs. One example is Tabula, a tool written by Knight-Mozilla OpenNews Fellow Manuel Aristarán, which extracts data from PDF tables in a form that can be readily imported to a spreadsheet – if the PDF was “printed” from a computer application. Introduced earlier this year, Tabula continues to evolve thanks to the volunteer efforts of Manuel, with help from OpenNews Fellow Mike Tigas and New York Times interactive developer Jeremy Merrill. Meanwhile, DocHive, a tool whose continuing development is being funded by a Knight Foundation grant, addresses PDFs that were created by scanning paper documents. DocHive is a project of Raleigh Public Record and is led by Charles and Edward Duncan.
These open source tools join a number of commercial offerings such as Able2Extract and ABBYY Fine Reader that extract data from PDFs. A more comprehensive list of open source and commercial resources is available here.
Unfortunately, the free and low cost tools available to modelers, data journalists and transparency advocates have limitations that hinder their ability to handle large scale tasks. If, like me, you want to submit hundreds of PDFs to a software tool, press “Go” and see large volumes of cleanly formatted data, you are out of luck.
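For a sense of what the missing “press Go” workflow would look like, here’s a minimal python sketch that loops over a folder of PDFs and pools every extracted table row into one CSV. It assumes the pdfplumber library for the actual table extraction; the folder layout and function names are hypothetical:

```python
import csv
import glob

def extract_rows(pdf_path):
    """Yield every table row found in one PDF.

    This leans on the pdfplumber library (an assumption on my part --
    any extractor with a similar page/table interface would do).
    """
    import pdfplumber
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                for row in table:
                    yield row

def batch_extract(pdf_paths, extractor=extract_rows):
    """Run the extractor over many PDFs, tagging each row with its source file."""
    for path in pdf_paths:
        for row in extractor(path):
            yield [path] + list(row)

def write_csv(rows, out_path):
    with open(out_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

if __name__ == "__main__":
    # Hypothetical layout: a folder of servicing reports in, one combined CSV out.
    write_csv(batch_extract(sorted(glob.glob("reports/*.pdf"))), "combined.csv")
```

The point is the shape of the pipeline, not the particular extractor: swapping in Tabula’s output or an OCR step would only change `extract_rows`.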
It is for this reason that I am working with The Sunlight Foundation and other sponsors to stage the PDF Liberation Hackathon from January 17-19, 2014. We’ll have hack sites at Sunlight’s Washington DC office and at RallyPad in San Francisco. Developers can also join remotely because we will publish a number of clearly specified PDF extraction challenges before the hackathon.
Participants can work on one of the pre-specified challenges or choose their own PDF extraction projects. Ideally, hackathon teams will use (and hopefully improve upon) open source tools to meet the hacking challenges, but they will also be allowed to embed commercial tools into their projects as long as their licensing cost is less than $1000 and an unlimited trial is available.
Prizes of up to $500 will be awarded to winning entries. To receive a prize, a team must publish their source code on a GitHub public repository. To join the hackathon in DC or remotely, please sign up at Eventbrite; to hack with us in SF, please sign up via this Meetup. Please also complete our Google Form survey. Also, if anyone reading this is associated with an organization in New York or Chicago that would like to organize an additional hack space, please contact me.
The PDF Liberation Hackathon is going to be a great opportunity to advance the state of the art when it comes to harvesting data from public documents. I hope you can join us.
I’m lucky to be working with a super fantastic python guy on this, and the details are under wraps, but let’s just say it’s exciting.
So I’m looking to showcase a few good models to start with, preferably in python, but the critical ingredient is that they’re open source. They don’t have to be great, because the point is to see their flaws and possibly to improve them.
- For example, I put in a FOIA request a couple of days ago to get the current teacher value-added model from New York City.
- A friend of mine, Marc Joffe, has an open source municipal credit rating model. It’s not in python, but I’m hopeful we can work with it anyway.
- I’m in search of an open source credit scoring model for individuals. Does anyone know of something like that?
- They don’t have to be creepy! How about a Nate Silver-style weather model?
- Or something that relies on open government data?
- Can we get the Reinhart-Rogoff model?
The idea here is to get the model, not necessarily the data (although even better if it can be attached to data and updated regularly). And once we get a model, we’d build interactives with the model (like this one), or at least the tools to do so, so other people could build them.
At its core, the point of open models is this: you don’t really know what a model does until you can interact with it. You don’t know if a model is robust unless you can fiddle with its parameters and check. And finally, you don’t know if a model is best possible unless you’ve let people try to improve it.
The idea is that we’re analyzing metadata around a texting hotline for teens in crisis. We’re trying to see if we can use the information we have on these texts (timestamps, character length, topic – which is most often suicide – and outcome reported by both the texter and the counselor) to help the counselors improve their responses.
For example, right now counselors can be in up to 5 conversations at a time – is that too many? Can we figure that out from the data? Is there too much waiting between texts? Other questions are listed here.
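One of these questions can be attacked with timestamps alone: if each conversation can be reduced to a (start, end) pair per counselor (an assumption about the data on my part, not a description of the actual feed), the peak simultaneous load is a standard sweep:

```python
def max_concurrent(intervals):
    """Peak number of overlapping (start, end) conversations.

    Feed it one counselor's conversations to see how close she
    actually gets to the 5-conversation cap.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a conversation opens
        events.append((end, -1))    # a conversation closes
    # Sort by time; at a tie, process closes before opens.
    events.sort(key=lambda e: (e[0], e[1]))
    load = peak = 0
    for _, delta in events:
        load += delta
        peak = max(peak, load)
    return peak
```

Comparing that peak (and the time spent at it) against the reported outcomes would be one way to ask whether five at a time is too many.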
Our “hackpad” is located here, and will hopefully be updated like a wiki with results and visuals from the exploration of our group. It looks like we have a pretty amazing group of nerds over here looking into this (mostly python users!), and I’m hopeful that we will be helping the good people at Crisis Text Line.
I’m preparing for a short trip to D.C. this week to take part in a day-long event held by Americans for Financial Reform. You can get the announcement here online, but I’m not sure what the finalized schedule of the day is going to be. Also, I believe it will be recorded, but I don’t know the details yet.
In any case, I’m psyched to be joining this, and the AFR are great guys doing important work in the realm of financial reform.
Opening Wall Street’s Black Box: Pathways to Improved Financial Transparency
Sponsored By Americans for Financial Reform and Georgetown University Law Center
Keynote Speaker: Gary Gensler Chair, Commodity Futures Trading Commission
October 11, 2013 10 AM – 3 PM
Georgetown Law Center, Gewirz Student Center, 12th Floor
120 F Street NW, Washington, DC (Judiciary Square Metro) (Space is limited. Please RSVP to AFRtransparencyrsvp@gmail.com)
The 2008 financial crisis revealed that regulators and many sophisticated market participants were in the dark about major risks and exposures in our financial system. The lack of financial transparency enabled large-scale fraud and deception of investors, weakened the stability of the financial system, and contributed to the market failure after the collapse of Lehman Brothers. Five years later, despite regulatory efforts, it’s not clear how much the situation has improved.
Join regulators, market participants, and academic experts for an exploration of the progress made – and the work that remains to be done – toward meaningful transparency on Wall Street. How can better information and disclosure make the financial system both fairer and safer?
|Jesse Eisinger, Pulitzer Prize-winning reporter for the New York Times and Pro Publica|
|Zach Gast, Head of financial sector research, Center on Financial Research and Analysis|
|Amias Gerety, Deputy Assistant Secretary for the FSOC, United States Treasury|
|Henry Hu, Alan Shivers Chair in the Law of Banking and Finance, University of Texas Law School|
|Albert “Pete” Kyle, Charles E. Smith Professor of Finance, University of Maryland|
|Adam Levitin, Professor of Law, Georgetown University Law Center|
|Antoine Martin, Vice President, New York Federal Reserve Bank|
|Brad Miller, Former Representative from North Carolina; Of Counsel, Grais & Ellsworth|
|Cathy O’Neil, Senior Data Scientist, Johnson Research Labs; Occupy Alternative Banking|
|Gene Phillips, Director, PF2 Securities Evaluation|
|Greg Smith, Author of “Why I Left Goldman Sachs”; former Goldman Sachs Executive Director|
The 2013 PopTech & Rockefeller Foundation Bellagio Fellows - Kate Crawford, Patrick Meier, Claudia Perlich, Amy Luers, Gustavo Faleiros and Jer Thorp - yesterday published “Seven Principles for Big Data and Resilience Projects” on Patrick Meier’s blog iRevolution.
Although they claim that these principles are meant as “best practices for resilience building projects that leverage Big Data and Advanced Computing,” I think they’re more general than that (although I’m not sure exactly what a resilience building project is), and I really like them. They are looking for public comments too. Go to the post for the full description of each, but here is a summary:
1. Open Source Data Tools
Wherever possible, data analytics and manipulation tools should be open source, architecture independent and broadly prevalent (R, python, etc.).
2. Transparent Data Infrastructure
Infrastructure for data collection and storage should operate based on transparent standards to maximize the number of users that can interact with the infrastructure.
3. Develop and Maintain Local Skills
Make “Data Literacy” more widespread. Leverage local data labor and build on existing skills.
4. Local Data Ownership
Use Creative Commons and licenses that state that data is not to be used for commercial purposes.
5. Ethical Data Sharing
Adopt existing data sharing protocols like the ICRC’s (2013). Permission for sharing is essential. How the data will be used should be clearly articulated. An opt in approach should be the preference wherever possible, and the ability for individuals to remove themselves from a data set after it has been collected must always be an option.
6. Right Not To Be Sensed
Local communities have a right not to be sensed. Large scale city sensing projects must have a clear framework for how people are able to be involved or choose not to participate.
7. Learning from Mistakes
Big Data and Resilience projects need to be open to face, report, and discuss failures.
There are lots of things I know nothing at all about. It annoys me not to understand a subject at all, because it often means I can’t follow a conversation that I care about. The list includes, just as a start: accounting, law, and politics.
Of those three, accounting seems like the easiest thing to tackle by far. This is partly because the space between what it’s theoretically supposed to be and how it’s practiced is smaller than with law or politics. Or maybe the kind of tricks accountants use seem closer to the kind of tricks I know about from being a quant, so that space seems easier to navigate for me personally.
Anyway, I might be wrong, but my impression is that my lack of understanding of accounting is mostly a language barrier, rather than a conceptual problem. There are expenses, and revenue, and lots of tax issues. There are categories. I’m working on the assumption that none of this stuff is exactly mathematical either, it’s all about knowing what things are called. And I don’t know any of it.
So I just signed up to learn at least some of it on a free Coursera course from the Wharton MBA Foundation Series. Here’s the introductory video, the professor seems super nerdy and goofy, which is a good start.
So in my copious free time I’ll be watching videos explaining the language of tax deferment and the like. Or at least that’s the fantasy – the thing about Coursera is that it’s free, so there’s not much incentive to keep up with the course. And the fact that all four Wharton 1st-year courses are being given away for free is proof of something, by the way – possibly that what you’re really paying for in business school is the connections you make while you’re there.
I want to bring up two quick topics this morning that I’ve been mulling over lately, both related to this recent post by economist Rajiv Sethi from Barnard (h/t Suresh Naidu), who happened to be my assigned faculty mentor when I was an assistant prof there. I have mostly questions and few answers right now.
In his post, Sethi talks about former computer nerd for Goldman Sachs Sergey Aleynikov and his trial, which was chronicled by Michael Lewis recently. See also this related interview with Lewis, h/t Chris Wiggins.
I haven’t read Lewis’s piece yet, only his interview and Sethi’s reaction. But I can tell it’ll be juicy and fun, as Lewis usually is. He’s got a way with words and he’s bloodthirsty, always an entertaining combination.
So, the two topics.
First off, let’s talk a bit about high frequency trading, or HFT. My first two questions are, who does HFT benefit and what does HFT cost? For both of these, there’s the easy answer and then there’s the hard answer.
Easy answer for HFT benefitting someone: primarily the people who make loads of money off of it, including the hardware industry and the people who get paid to drill through mountains with cables to make connections between Chicago and New York faster.
Secondarily, market participants whose fees have been lowered because of the tight market-making brought about by HFT, although that savings may be partially undone by the way HFT’ers operate to pick off “dumb money” participants. After all, you say market making, I say arbing. Sorting out the winners, especially when you consider times of “extreme market conditions”, is where it gets hard.
The easy answer for the costs of HFT: the money spent by companies on the IT, infrastructure, and people needed to do the work – although, to be sure, they wouldn’t be willing to make that investment if they didn’t expect it to pay off.
A harder and more complete answer would involve how much risk we take on as a society when we build black boxes that we don’t understand and let them collide with each other with our money, as well as possibly a guess at what those people and resources now doing HFT might be doing otherwise.
And that brings me to my second topic, namely the interaction between the open source community and the finance community, but mostly the HFTers.
Sethi said it well (Cathy: see bottom of this post for an update) in his post:

Aleynikov relied routinely on open-source code, which he modified and improved to meet the needs of the company. It is customary, if not mandatory (Cathy: see bottom of this post for an update), for these improvements to be released back into the public domain for use by others. But his attempts to do so were blocked:
Serge quickly discovered, to his surprise, that Goldman had a one-way relationship with open source. They took huge amounts of free software off the Web, but they did not return it after he had modified it, even when his modifications were very slight and of general rather than financial use. “Once I took some open-source components, repackaged them to come up with a component that was not even used at Goldman Sachs,” he says. “It was basically a way to make two computers look like one, so if one went down the other could jump in and perform the task.” He described the pleasure of his innovation this way: “It created something out of chaos. When you create something out of chaos, essentially, you reduce the entropy in the world.” He went to his boss, a fellow named Adam Schlesinger, and asked if he could release it back into open source, as was his inclination. “He said it was now Goldman’s property,” recalls Serge. “He was quite tense. When I mentioned it, it was very close to bonus time. And he didn’t want any disturbances.”
This resonates with my experience at D.E. Shaw. We used lots of python stuff, and as a community we were working at the edges of its capabilities (not me – I didn’t do fancy HFT stuff; my models worked on a much longer time frame, with at least a few hours between trades).
The urge to give back to the OS community was largely thwarted, when it came up at all, because there was a fear, or at least an argument, that somehow our competition would use it against us, to eliminate our edge, even if it was an invention or tool completely sanitized from the actual financial algorithm at hand.
A few caveats: First, I do think that stuff – python technology and the like – eventually gets out to the open source domain even if people are consistently thwarting it. But it’s incredibly slow compared to what you might expect.
Second, it might be the case that python developers working outside of finance are actually much better at developing good tools for python, especially if they have some interaction with finance but don’t work inside it. I’m guessing this because, as a modeler, you have a very selfish outlook and only want to develop tools for your particular situation. In other words, if a bunch of tools did come out of finance, you might see some really weird-looking ones.
Finally, I think I should mention that quite a few people I knew at D.E. Shaw have now left and are actively contributing to the open source community now. So it’s a lagged contribution but a contribution nonetheless, which is nice to see.
Update: from my Facebook page, a discussion of the “mandatoriness” of giving back to the OS community with my brother Eugene O’Neil, super nerd, and my friend William Stein, another super nerd:
Eugene O’Neil: the GPL says that if you give someone a binary executable compiled with GPL source code, you also have to provide them free access to all the source code used to generate that binary, under the terms of the GPL. This makes the commercial sale of GPL binaries without source code illegal. However, if you DON’T give anyone outside your organization a binary, you are not legally required to give them the modified source code for the binary you didn’t give them. That being said, any company policy that tries to explicitly PROHIBIT employees from redistributing modified GPL code is in a legal gray area: the loophole works best if you completely trust everyone who has the modified code to simply not want to distribute it.
William Stein: Eugene — You are absolutely right. The “mandatory” part of the quote: “It is customary, if not mandatory, for these improvements to be released back into the public domain for use by others.” from Cathy’s article is misleading. I frequently get asked about this sort of thing (because of people using Sage (http://sagemath.org) for web backends, trading, etc.). I’m not aware of any popular open source license that make it mandatory to give back changes if you use a project internally in an organization (let alone the GPL, which definitely doesn’t). The closest is AGPL, which involves external use for a website. Cathy — you might consider changing “Sethi said it well…”, since I think his quote is misleading at best. I’m personally aware of quite a few people that do use Sage right now who wouldn’t otherwise if Sethi’s statement were correct.
Crossposted on Not Even Wrong.
Here’s a completely biased interview I did with my husband A. Johan de Jong, who has been working with Pieter Belmans on a very cool online math project using d3js. I even made up some of his answers (with his approval).
Q: What is the Stacks Project?
A: It’s an open source textbook and reference for my field, which is algebraic geometry. It builds foundations starting from elementary college algebra and going up to algebraic stacks. It’s a self-contained exposition of all the material there, which makes it different from a research textbook or the experience you’d have reading a bunch of papers.
We were quite neurotic setting it up – everything has a proof, other results are referenced explicitly, and it’s strictly linear, which is to say there’s a strict ordering of the text so that all references are always to earlier results.
Of course the field itself has different directions, some of which are represented in the stacks project, but we had to choose a way of presenting it which allowed for this idea of linearity (of course, any mathematician thinks we can do that for all of mathematics).
Q: How has the Stacks Project website changed?
A: It started out as just a place you could download the pdf and tex files, but then Pieter Belmans came on board and he added features such as full text search, tag look-up, and a commenting system. In this latest version, we’ve added a whole bunch of features, but the most interesting one is the dynamic generation of dependency graphs.
We’ve had some crude visualizations for a while, and we made t-shirts from those pictures. I even had this deal where, if people found mathematical mistakes in the Stacks Project, they’d get a free t-shirt, and I’m happy to report that I just last week gave away my last t-shirt. Here’s an old picture of me with my adorable son (who’s now huge).
Q: Talk a little bit about the new viz.
A: First a word about the tags, which we need to understand the viz.
Every mathematical result in the Stacks Project has a “tag”, which is a four letter code, and which is a permanent reference for that result, even as other results are added before or after that one (by the way, Cathy O’Neil figured this system out).
The graphs show the logical dependencies between these tags, represented by arrows between nodes. You can see this structure in the above picture already.
So for example, if tag ABCD refers to Zariski’s Main Theorem, and tag ADFG refers to Nakayama’s Lemma, then since Zariski depends on Nakayama, there’s a logical dependency, which means the node labeled ABCD points to the node labeled ADFG in the entire graph.
Of course, we don’t really look at the entire graph, we look at the subgraph of results which a given result depends on. And we don’t draw all the arrows either, we only draw the arrows corresponding to direct references in the proofs. Which is to say, in the subgraph for Zariski, there will be a path from node ABCD to node ADFG, but not necessarily a direct link.
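The subgraph computation can be sketched in a few lines of python (the house language around here): given a map from each tag to the tags its proof references directly, collect everything reachable from a root. The deps map below is invented for illustration; the real one comes from the Stacks Project’s own data.

```python
def dependency_subgraph(root, deps):
    """All tags a root result depends on, directly or indirectly.

    `deps` maps each tag to the tags its proof cites directly.
    """
    seen = set()
    stack = [root]
    while stack:
        tag = stack.pop()
        for dep in deps.get(tag, []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

# With ABCD citing EFGH and EFGH citing ADFG, ABCD's subgraph
# contains ADFG even though there is no direct ABCD -> ADFG edge.
```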
Q: Can we see an example?
A: Let’s move to an example for result 01WC, which refers to the proof that “a locally projective morphism is proper”.
First, there are two kinds of heat maps. Here’s one that defines distance as the maximum (directed) distance from the root node. In other words, how far down in the proof is this result needed? In this case the main result 01WC is bright red with a black dotted border, and any result that 01WC depends on is represented as a node. The edges are directed, although the arrows aren’t drawn, but you can figure out the direction by how the color changes. The dark blue colors are the leaf nodes that are farthest away from the root.
Another way of saying this is that the redder results are the ones closer to the root result in meaning and sophistication level.
Note if we had defined the distance as the minimum distance from the root node (to come soon hopefully), then we’d have a slightly different and also meaningful way of thinking about “redness” as “relevance” to the root node.
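Both colorings are just distances on a directed acyclic graph. Here’s a rough python version of the maximum-distance one (the minimum-distance variant would be a plain breadth-first search); the function and data layout are mine, not the site’s actual code:

```python
def max_depths(root, deps):
    """Longest directed distance from `root` to each result it depends on.

    Assumes the graph is acyclic, which holds here because proofs
    only ever cite strictly earlier results.
    """
    depth = {root: 0}

    def visit(tag):
        for dep in deps.get(tag, []):
            d = depth[tag] + 1
            if depth.get(dep, -1) < d:
                depth[dep] = d
                visit(dep)  # re-relax everything below with the longer path

    visit(root)
    return depth
```

A node’s color is then a function of its depth: depth 0 is the bright red root, and the largest depths are the dark blue leaves.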
This is a screenshot but feel free to play with it directly here. For all of the graphs, hovering over a result will cause the statement of the result to appear, which is awesome.
Next, let’s look at another kind of heat map where the color is defined as maximum distance from some leaf node in the overall graph. So dark blue nodes are basic results in algebra, sheaves, sites, cohomology, simplicial methods, and other chapters. The link is the same; you can just toggle between the different metrics.
Next we delved further into how results depend on those different topics. Here, again for the same result, we can see the extent to which that result depends on results from the various chapters. If you scroll over the nodes you can see more details. This is just a screenshot but you can play with it yourself here, and you can collapse it in various ways corresponding to the internal hierarchy of the project.
Finally, we have a way of looking at the logical dependency graph directly, where each result node is labeled with a tag and colored by “type”: whether it’s a lemma, proposition, theorem, or something else. It also annotates the results which have separate names. Again, this is a screenshot, but play with it here – it rotates!
Check out the whole project here, and feel free to leave comments using the comment feature!
Not much time because I’m giving a keynote talk at the PyData 2013 conference in Cambridge today, which is being held at the Microsoft NERD conference center.
It’s gonna be videotaped so I’ll link to that when it’s ready.
My title is “Storytelling With Data” but for whatever reason on the schedule handed out yesterday the name had been changed to “Scalable Storytelling With Data”. I’m thinking of addressing this name change in my talk – one of the points of the talk, in fact, is that with great tools, we don’t need to worry too much about the scale.
Plus since it’s Sunday morning I’m going to make an effort to tie my talk into an old testament story, which is totally bizarre since I’m not at all religious but for some reason it feels right. Please wish me luck.
Have you read this recent article in Slate about how San Jose State University canceled its online courses after more than half the students failed? The failure rate ranged from 56 to 76 percent for five basic undergrad classes, each with an enrollment limit of 100 students.
Personally, I’m impressed that so many people passed, considering how lightweight the connection is in such courses. Maybe it’s because they weren’t free – they cost $150.
It all depends on what you were expecting, I guess. It raises the question of what college is for anyway.
I was talking to a business guy about the MOOC potential for disruption, and he mentioned that, as a Yale undergrad himself, he never learned a thing in classes, that in fact he skipped most of his classes to hang out with his buddies. He somehow thought MOOCs would be a fine replacement for that experience. However, when I asked him whether he still knew any of his buddies from college, he acknowledged that he does business with them all the time.
Personally, this confirms my theory that education is more about making connections than learning per se, and although I learned a lot of math in college, I also made a friend who helped me get into grad school and even introduced me to my thesis advisor.
I’ve blogged before about how I find it outrageous that the credit scoring models are proprietary, considering the impact they have on so many lives.
The argument given for keeping them secret is that otherwise people would game the models, but that really doesn’t make sense.
After all, the models that the big banks have to deal with through regulation aren’t secret, and they game those models all the time. It’s one of the main functions of the banks, in fact, to figure out how to game the models. So either we don’t mind gaming or we don’t hold up our banks to the same standards as our citizens.
Plus, let’s say the models were open and people started gaming the credit score models – what would that look like? A bunch of people paying their electricity bill on time?
Let’s face it: the real reason the models are secret is that the companies who set them up make more money that way, pretending to have some kind of secret sauce. What they really have, of course, is a pretty simple model and access to an amazing network of up-to-date personal financial data, as well as lots of clients.
Their fear is that, if their model gets out, anyone could start a credit scoring agency, but actually it wouldn’t be so easy – if I wanted to do it, I’d have to get all that personal data on everyone. In fact, if I could get all that personal data on everyone, including the historical data, I could easily build a credit scoring model.
So anyhoo, it’s all about money, that and the fact that we’re living under the assumption that it’s appropriate for credit scoring companies to wield all this power over people’s lives, including their love lives.
It’s like we have a secondary system of secret laws where we don’t actually get to see the rules, nor do we get to point out mistakes or reasonably refute them. And if you’re thinking “free credit report,” let’s be clear that that only tells you what data goes in to the model, it doesn’t tell you how it’s used.
As it turns out, though, it’s now more than just like a secondary system of laws – it’s become embedded in our actual laws. Somehow the proprietary credit scoring company Equifax is now explicitly part of our healthcare laws. From this New York Times article (hat tip Matt Stoller):
Federal officials said they would rely on Equifax — a company widely used by mortgage lenders, social service agencies and others — to verify income and employment and could extend the initial 12-month contract, bringing its potential value to $329.4 million over five years.
Contract documents show that Equifax must provide income information “in real time,” usually within a second of receiving a query from the federal government. Equifax says much of its information comes from data that is provided by employers and updated each payroll period.
Under the contract, Equifax can use sources like credit card applications but must develop a plan to indicate the accuracy of data and to reduce the risk of fraud.
Thanks Equifax, I guess we’ll just trust you on all of this.
This is a guest post by Peter Darche, an engineer at DataKind and recent graduate of NYU’s ITP program. At ITP he focused primarily on using personal data to improve personal social and environmental impact. Prior to graduate school he taught in NYC public schools with Teach for America and Uncommon Schools.
We all ‘know’ that money influences the way congressmen and women legislate; at least we certainly believe it does. According to a poll conducted by law professor Larry Lessig for his book Republic, Lost, 75% of respondents (Republican and Democrat) said that ‘money buys results in Congress.’
But what does that explanation really tell us? Yes, a congresswoman’s receiving millions of dollars from an industry and then voting with that industry’s interests reeks of corruption. But when that industry is responsible for 80% of her constituents’ jobs, the causation becomes much less clear and the explanation much less informative.
The real devil is in the details. It is in the ways that money has shaped her legislative worldview over time and in the small, particular actions that tilt her policy one way rather than another.
In the past finding these many and subtle ways would have taken a herculean effort: untold hours collecting campaign contributions, voting records, speeches, and so on. Today however, due to the efforts of organizations like the Sunlight Foundation and Center for Responsive Politics, this information is online and programmatically accessible; you can write a few lines of code and have a computer gather it all for you.
For the last few months, Cathy O’Neil, Lee Drutman (a Senior Fellow at the Sunlight Foundation), I, and others have been working on a project that leverages these data sources to attempt to unearth some of these particular facts. By connecting the avenues by which influence is exerted on the legislative process to the actions taken by legislators, we’re hoping to find some of the detailed ways money changes behavior over time.
The idea is this: first, find and aggregate what data exists on the ways influence can be exerted on the legislative process (campaign contributions, lobbying contributions, etc.); then find data that might track influence manifesting itself in the legislative process (bill sponsorships, co-sponsorships, speeches, votes, committee memberships, etc.). Finally, connect the interest group or industry behind the influence to the policies and see how they change over time.
One immediate and attainable goal for this project, for example, is to create an affinity score between legislators and industries, or in other words a metric that would indicate the extent to which a given legislator is influenced by and acts in the interest of a given industry.
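As a sketch of what such a metric might look like, here is a toy affinity score in Python. The records, field names, and equal weighting below are all invented for illustration; they are not the project's actual data or method:

```python
# Toy affinity score between a legislator and an industry.
# All records below are invented; the real project draws on FEC
# contribution data and records of legislative actions.

def affinity_score(contributions, actions, industry):
    """Combine the share of a legislator's money that came from an
    industry with the share of their actions aligned with it."""
    industry_dollars = sum(c["amount"] for c in contributions
                           if c["industry"] == industry)
    total_dollars = sum(c["amount"] for c in contributions)
    aligned = sum(1 for a in actions if a["aligned_with"] == industry)
    money_share = industry_dollars / total_dollars if total_dollars else 0.0
    action_share = aligned / len(actions) if actions else 0.0
    # Equal weighting is an arbitrary choice for this sketch.
    return 0.5 * money_share + 0.5 * action_share

contributions = [
    {"industry": "health", "amount": 5000},
    {"industry": "energy", "amount": 3000},
    {"industry": "health", "amount": 2000},
]
actions = [
    {"bill": "HR1", "aligned_with": "health"},
    {"bill": "HR2", "aligned_with": "energy"},
]
print(round(affinity_score(contributions, actions, "health"), 2))  # 0.6
```

A real version would have to handle questions this sketch dodges, such as what counts as an "aligned" action and how to normalize across legislators with very different fundraising totals.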
So far most of our efforts have focused on finding, collecting, and connecting the records of influence and legislative behavior. We’ve pulled in lobbying and campaign contribution data, as well as sponsored legislation, co-sponsored legislation, speeches and votes. We’ve connected the instances of influence to legislative actions for a given legislator and visualized it on a timeline showing the entirety of a legislator’s career.
Here’s an example of how one might use the timeline. The example below is of Nancy Pelosi’s career. Each green circle represents a campaign contribution she received, and is grouped within a larger circle by the month it was recorded by the FEC. Above are colored rectangles representing legislative actions she took during the time-period in focus (indigo are votes, orange speeches, red co-sponsored bills, blue sponsored bills). Some of the green circles are highlighted because the events have been filtered for connection to health professionals.
Changing the filter to Health Services/HMOs, we see different contributions coming from that industry as well as a co-sponsored bill related to that industry.
Mousing over the bill indicates it’s a proposal to amend the Social Security Act to provide Medicaid coverage to low-income individuals with HIV. Further, looking at the speeches, one can see a relevant speech about children’s health insurance. Clicking on the speech reveals the text.
By combining data about various events, and allowing users to filter and dive into them, we’re hoping to leverage our natural pattern-seeking capabilities to find specific hypotheses to test. Once an interesting pattern has been found, the tool would allow one to download the data and conduct analyses.
Again, it’s just a start, and the timeline and other project-related code are internal prototypes created to start seeing some of the connections. We wanted to open it up to see what you all think and get some feedback. So, with its pre-alphaness in mind, what do you think about the project generally and the timeline specifically? What works well (helps you gain insights or generate hypotheses about the connection between money and politics), and what other functionality would you like to see?
The demo version can be found here, with data for the following legislators:
- Nancy Pelosi
- John Boehner
- Cathy McMorris Rodgers
- Eric Cantor
- James Lankford
- John Cornyn
- James Clyburn
- Kevin McCarthy
- Steny Hoyer
Note: when the timeline is revealed, click and drag over content at the bottom of the timeline to reveal the focus events.
So here’s the thing about being a parent of benign neglect: it’s no walk in the park. I talk a big game, but the truth is I’ve had trouble getting to sleep from the anxiety. To distract myself I’ve been watching Law & Order episodes on Netflix until the wee hours of the night.
Two things about this plan suck. First, my husband is in Amsterdam, which means he’s 6 time zones away from our oldest son whereas I’m only 3, but somehow that means I’m shouldering 99.5% of the responsibility to worry (there’s some universal geographic law of parenting at work there but I don’t know how to formulate it). Second, half of the L&O episodes involve either children getting maimed or killed or child killers. Not restful but I freaking can’t stop!
In any case, not much extra energy to spring out of bed and write the blog, so apologies for a sparse period for mathbabe. For whatever reason I woke up this morning in time to blog, however, so as to not miss an opportunity it’s gonna be in list form:
- I’ve been invited to keynote at PyData in Cambridge, MA at the end of the month – me and Travis Oliphant! I’m still coming up with the title and abstract for my talk, but it’s going to be something about storytelling with data using the iPython Notebook. Please make suggestions!
- I was in a Wall Street Journal article about Larry Summers, talking about whether he’s got a good personality to take over from Ben Bernanke, i.e. should we trust our lives and our future with him. I say nope. What’s funny is that my uncle, economist Bob Hall, is also referred to in the same article. The journalist didn’t know we’re related until after the article came out and Uncle Bob informed him.
- Hey, can we give it up for Eliot Spitzer? The powers that be are down on that guy ostensibly for having sex with prostitutes but really because he’s a threat. I say legalize prostitution, unionize the prostitutes à la the Dutch, and put Spitzer in charge of something involving money and corruption; he’s smart and fearless. Who’s with me?
- It looks like good news: the Consumer Financial Protection Bureau might be cracking down on illegal debt collector tactics. Update: wait, the fines are fractions of 1% of the revenue these guys made on their unfair practices. Can we please have a rule that when you get caught breaking the law, the fine will be large enough so it’s no longer profitable?
I’m psyched to see Suresh Naidu tonight in the first Data Skeptics Meetup. He’s talking about Political Uses and Abuses of Data and his abstract is this:
While a lot has been made of the use of technology for election campaigns, little discussion has focused on other political uses of data. From targeting dissidents and tax-evaders to organizing protests, the same datasets and analytics that let data scientists do prediction of consumer and voter behavior can also be used to forecast political opponents, mobilize likely leaders, solve collective problems and generally push people around. In this discussion, Suresh will put this in a 1000 year government data-collection perspective, and talk about how data science might be getting used in authoritarian countries, both by regimes and their opponents.
Given the recent articles highlighting this kind of stuff, I’m sure the topic will provoke a lively discussion – my favorite kind!
Unfortunately the Meetup is full but I’d love you guys to give suggestions for more speakers and/or more topics.
This is a guest post by Rachel Law, a conceptual artist, designer and programmer living in Brooklyn, New York. She recently graduated from Parsons MFA Design&Technology. Her practice is centered around social myths and how technology facilitates the creation of new communities. Currently she is writing a book with McKenzie Wark called W.A.N.T, about new ways of analyzing networks and debunking ‘mapping’.
Let’s start with a timely question. How would you like to be able to change how you are identified by online networks? We’ll talk more about how you’re currently identified below, but for now just imagine having control over that process for once – how would that feel? Vortex is something I’ve invented that will try to make that happen.
Namely, Vortex is a data management game that allows players to swap cookies, change IPs and disguise their locations. Through play, individuals experience how their browser changes in real time when different cookies are equipped. Vortex is a proof of concept that illustrates how network collisions in gameplay expose contours of a network determined by consumer behavior.
What happens when users are allowed to swap cookies?
These cookies, placed by marketers to track behavioral patterns, are stored on our personal devices from mobile phones to laptops to tablets, as a symbolic and data-driven signifier of who we are. In other words, to the eyes of the database, the cookies are us. They are our identities, controlling the way we use, browse and experience the web. Depending on cookie type, they might follow us across multiple websites, save entire histories about how we navigate and look at things and pass this information to companies while still living inside our devices.
If we have the ability to swap cookies, the debate on privacy shifts from relying on corporations to follow regulations to empowering users by giving them the opportunity to manage how they want to be perceived by the network.
What are cookies?
The corporate technological ability to track customers and piece together entire personal histories is a recent development. While there are several ways of doing so, the most common and prevalent method is the HTTP cookie. Invented in 1994 by a computer programmer, Lou Montulli, HTTP cookies were originally created for the shopping cart system, as a way for the computer to store the current state of a session (i.e., how many items existed in the cart) without overloading the company’s server. These session histories were saved inside each user’s computer or individual device, where companies accessed and updated consumer history constantly as a form of ‘internet history’. Information such as where you clicked, how you clicked, what you clicked first, and your general purchasing history and preferences was all saved in your browsing history and accessed by companies through cookies.
Cookies were deployed to the general public without their knowledge until the Financial Times published an article, on February 12th, 1996, about how they were created and utilized on websites without user knowledge. This revelation led to a public outcry over privacy issues, especially since data was being gathered without the knowledge or consent of users. In addition, corporations had access to information stored on personal computers, since the cookie sessions were stored on your computer and not on their servers.
At the center of the debate was the issue of third-party cookies, also known as “persistent” or “tracking” cookies. When you are browsing a webpage, there may be components on the page that are served from a different domain than the page itself. These external objects can then set cookies of their own when you click an image, link or article. Such cookies are used by advertising and media-mining corporations to track users across multiple sites, building up knowledge of each user’s browsing patterns to create more specific and targeted advertising.
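To make the mechanics concrete, here is a minimal sketch using Python's standard-library `http.cookies` module. The cookie names, values, and the `.adnetwork.example` domain are invented for illustration:

```python
# How cookie state is expressed in Set-Cookie headers, using only
# the Python standard library. All names and values are invented.
from http.cookies import SimpleCookie

# A first-party site sets a session cookie for its shopping cart:
first_party = SimpleCookie()
first_party.load('cart_id=abc123; Path=/; Max-Age=3600')

# An ad server embedded on the same page sets its own cookie, scoped
# to its own domain, which it can read back from any other page that
# embeds it: the "third-party" tracking mechanism described above.
third_party = SimpleCookie()
third_party.load('tracker=user-98765; Domain=.adnetwork.example; Max-Age=31536000')

print(first_party['cart_id'].value)      # abc123
print(third_party['tracker']['domain'])  # .adnetwork.example
```

The key point the sketch illustrates is that the state lives on your machine, keyed to a domain, and is sent back to that domain on every matching request.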
In 2012, the Wall Street Journal ran an article on how Mac users were being unfairly targeted by travel site Orbitz with offerings that were 13% more expensive than those shown to PC users. The New York Times followed up with a similar article in November 2012 about how the data collected was re-sold to advertisers. These advertisers would analyze users’ buying habits to create micro-categories, with personal experiences tailored to maximize potential profits.
What does that mean for us?
The current state of today’s internet is no longer that of the carefree ’90s of ‘internet democracy’ and utopian ‘cyberspace’. Media mining exploits invasive technologies such as IP tracking, geolocation and cookies to create specific advertisements targeted to individuals. Browsing is now determined by your consumer profile: what you see, what you hear and the feeds you receive are tailored from your friends’ lists, emails, online purchases, etc. The ‘Internet’ does not exist. Instead, there are many overlapping filter bubbles which selectively curate us into data objects to be consumed and purchased by advertisers.
This information, though anonymous, is built up over time and used to track and trace an individual’s history, sometimes spanning an entire lifetime. Who you are and your real name are irrelevant at the overall scale of collected data, which depersonalizes and dehumanizes you into nothing but a list of numbers on a spreadsheet.
The superstore Target provides a useful case study in data profiling through its use of statisticians on its marketing teams. In 2002, Target realized that when a couple is expecting a child, the way they shop and purchase products changes. But it needed a tool to be able to see and take advantage of the pattern, so it asked mathematicians to come up with algorithms to identify behavioral patterns that would indicate a newly expectant mother and push direct marketing materials her way. In a public relations fiasco, Target sent maternity and infant care advertisements to a household, inadvertently revealing that a teenage daughter was pregnant before she had told her parents.
This build-up of information creates a ‘database of ruin’, enough information that marketers and advertisers know more about your life and predictive patterns than any single entity. Databases that can predict whether you’re expecting, or when you’ve moved, or what stage of your life or income level you’re at… information that you have no control over where it goes to, who is reading it or how it is being used. More importantly, these databases have collected enough information that they know secrets such as family history of illness, criminal or drug records or other private information that could potentially cause harm upon the individual data point if released – without ever needing to know his or her name.
What happens now is two terrifying possibilities:
- Corporate databases with information about you, your family and friends that you have zero control over, including sensitive information such as health, criminal/drug records etc. that are bought and re-sold to other companies for profit maximization.
- New forms of discrimination where your buying/consumer habits determine which level of internet you can access, or what kind of internet you can experience. This discrimination is so insidious because it happens on a user account level which you cannot see unless you have access to other people’s accounts.
Here’s a visual describing this process:
What can Vortex do, and where can I download a copy?
As Vortex lives in the browser, it can manage both pseudo-identities (invented) and ‘real’ identities shared with you by other users. These identity profiles are created by mining websites for cookies, swapping them with friends, and arranging and re-arranging them to create new experiences. By swapping identities, you are essentially ‘disguised’ as someone else; the network or website will not be able to recognize you. The idea is that being completely anonymous is difficult, but being someone else and hiding behind misinformation is easy.
This does not mean a death knell for online shopping or e-commerce industries. For instance, if a user decides to go shoe-shopping for summer, he or she could equip the browser with the cookies most associated with shopping, shoes and summer. Targeted advertising becomes a targeted choice for both advertisers and users. Advertisers will not have to worry about misinterpreting or mis-targeting inappropriate advertisements (e.g. showing tampon advertisements to a boyfriend who happened to borrow his girlfriend’s laptop), and at the same time users can choose what kind of advertisements they want to see. (Summer is coming: maybe it’s time to load up all those cookies linked to shoes and summer and beaches and see what websites have to offer, or disable cookies completely if you hate summer apparel.)
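Here is a toy sketch of the swapping idea in Python. Vortex itself lives in the browser, and the profile names, cookie values, and the `targeted_ad` stand-in below are all invented for illustration:

```python
# Sketch of cookie-swapping: identity "profiles" are just named bags
# of cookies, and whichever one is equipped decides what a site sees.
# All names and values are invented; this is not Vortex's actual code.

profiles = {
    "summer-shopper": {"interest": "shoes-summer", "seen_ads": "beachwear"},
    "night-owl":      {"interest": "crime-drama",  "seen_ads": "streaming"},
}

def equip(profile_name):
    """Return the cookies a site would receive from this identity."""
    return dict(profiles[profile_name])

def targeted_ad(cookies):
    """A stand-in for an ad server choosing content from cookies."""
    catalog = {"shoes-summer": "sandal sale", "crime-drama": "box set offer"}
    return catalog.get(cookies.get("interest"), "generic ad")

# Swapping the equipped profile changes what the "network" shows you:
print(targeted_ad(equip("summer-shopper")))  # sandal sale
print(targeted_ad(equip("night-owl")))       # box set offer
```

The point of the sketch is the asymmetry it inverts: today the ad server picks the profile; under the swapping model the user does.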
Currently the game is a working prototype/demo. The code is licensed under Creative Commons and will be available on GitHub by the end of summer. I am trying to get funding to make it free, safe and easy to use, but right now I’m broke from grad school, and a proper back-end still needs to be built so that accounts can be created safely and cannot be intercepted. If you have any questions on technical specs, or an interest in collaborating to make it happen (I’m particularly looking for people versed in python/mongodb), please email me: Rachel@milkred.net.
This is a guest post by Marc Joffe, the principal consultant at Public Sector Credit Solutions, an organization that provides data and analysis related to sovereign and municipal securities. Previously, Joffe was a Senior Director at Moody’s Analytics for more than a decade.
Note to readers: for a bit of background on the SEC Credit Ratings Roundtable and the Franken Amendment see this recent mathbabe post.
I just returned from Washington after participating in the SEC’s Credit Ratings Roundtable. The experience was very educational, and I wanted to share what I’ve learned with readers interested in financial industry reform.
First and foremost, I learned that the Franken Amendment is dead. While I am not a proponent of this idea – under which the SEC would have set up a ratings agency assignment authority – I do welcome its intentions and mourn its passing. Thus, I want to take some time to explain why I think this idea is dead, and what financial reformers need to do differently if they want to see serious reforms enacted.
The Franken Amendment, as revised by the Dodd Frank conference committee, tasked the SEC with investigating the possibility of setting up a ratings assignment authority and then executing its decision. Within the SEC, the responsibility for Franken Amendment activities fell upon the Office of Credit Ratings (OCR), a relatively new creature of the 2006 Credit Rating Agency Reform Act.
OCR circulated a request for comments – posting the request on its web site and in the federal register – a typical SEC procedure. The majority of serious comments OCR received came from NRSROs and others with a vested interest in perpetuating the status quo or some close approximation thereof. Few comments came from proponents of the Franken Amendment, and some of those that did were inarticulate (e.g., a note from Joe Sixpack of Anywhere, USA saying that rating agencies are terrible and we just gotta do something about them).
OCR summarized the comments in its December 2012 study of the Franken Amendment. Progressives appear to have been shocked that OCR’s work product was not an originally-conceived comprehensive blueprint for a re-imagined credit rating business. Such an expectation is unreasonable. SEC regulators sit in Washington and New York; not Silicon Valley. There is little upside and plenty of political downside to taking major risks. Regulators are also heavily influenced by the folks they regulate, since these are the people they talk to on a day-to-day basis.
Political theorists Charles Lindblom and Aaron Wildavsky developed a theory that explains the SEC’s policymaking process quite well: it is called incrementalism. Rather than implement brand new ideas, policymakers prefer to make marginal changes by building upon and revising existing concepts.
While I can understand why Progressives think the SEC should “get off its ass” and really fix the financial industry, their critique is not based in the real world. The SEC is what it is. It will remain under budget pressure for the foreseeable future because campaign donors want to restrict its activities. Staff will always be influenced by financial industry players, and out-of-the-box thinking will be limited by the prevailing incentives.
Proponents of the Franken Amendment and other Progressive reforms have to work within this system to get their reforms enacted. How? The answer is simple: when a request for comment arises they need to stuff the ballot box with varying and well informed letters supporting reform. The letters need to place proposed reforms within the context of the existing system, and respond to anticipated objections from status quo players. If 20 Progressive academics and Occupy-leaning financial industry veterans had submitted thoughtful, reality-based letters advocating the Franken Amendment, I believe the outcome would have been very different. (I should note that Occupy the SEC has produced a number of comment letters, but they did not comment on the Franken Amendment and I believe they generally send a single letter).
While the Franken Amendment may be dead, I am cautiously optimistic about the lifecycle of my own baby: open source credit rating models. I’ll start by explaining how I ended up on the panel and then conclude by discussing what I think my appearance achieved.
The concept of open source credit rating models is extremely obscure. I suspect that no more than a few hundred people worldwide understand this idea, and fewer than a dozen have any serious investment in it. Your humble author and one person on his payroll are probably the world’s only two people who dedicated more than 100 hours to this concept in 2012.
That said, I do want to acknowledge that the idea of open source credit rating models is not original to me – although I was not aware of other advocacy before I embraced it. Two Bay Area technologists started FreeRisk, a company devoted to open source risk models, in 2009. They folded the company without releasing a product and went on to more successful pursuits. FreeRisk left a “paper” trail for me to find including an article on the P2P Foundation’s wiki. FreeRisk’s founders also collaborated with Cate Long, a staunch advocate of financial markets transparency, to create riski.us – a financial regulation wiki.
In 2011, Cathy O’Neil (a.k.a. Mathbabe), an influential Progressive blogger with a quantitative finance background, ran a post about the idea of open source credit ratings, generating several positive comments. Cathy also runs the Alternative Banking group, an affiliate of Occupy Wall Street that attracts a number of financially literate activists.
I stumbled across Cathy’s blog while Googling “open source credit ratings”, sent her an email, had a positive phone conversation and got an invitation to address her group. Cathy then blogged about my open source credit rating work. This too was picked up on the P2P Foundation wiki, leading ultimately to a Skype call with the leader of the P2P Foundation, Michel Bauwens. Since then, Michel – a popularizer of progressive, collaborative concepts – has offered a number of suggestions about organizations to contact and made a number of introductions.
Most of my outreach attempts on behalf of this idea – either made directly or through an introduction – are ignored or greeted with terse rejections. I am not a proven thought leader, am not affiliated with a major research university and lack a resume that includes any position of high repute or authority. Consequently, I am only a half-step removed from the many “crackpots” that send around their unsolicited ideas to all and sundry.
Thus, it is surprising that I was given the chance to address the SEC Roundtable on May 14. The fact that I was able to get an invitation speaks well of the SEC’s process and is thus worth recounting. In October 2012, SEC Commissioner Dan Gallagher spoke at the Stanford Rock Center on Corporate Governance. He mentioned that the SEC was struggling with the task of implementing Dodd Frank Section 939A, which calls for the replacement of credit ratings in federal regulations, such as those that govern asset selection by money market funds.
After his talk, I pitched him the idea of open source credit ratings as an alternative creditworthiness standard that would satisfy the intentions of 939A. He suggested that I write to Tom Butler, head of the Office of Credit Ratings (OCR), and copy him. This led to a number of phone calls and ultimately a presentation to OCR staff in New York in January. Staff members who joined the meeting were engaged and asked good questions. I connected my proposal to an earlier SEC draft regulation which would have required structured finance issuers to publish cash flow waterfall models in Python – a popular open source language.
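To give a flavor of what such a published model might look like, here is a toy sequential-pay waterfall in Python. The tranche names, balances, and collection amount are invented, and real waterfalls are far more elaborate:

```python
# A toy sequential-pay waterfall: available collections pay the
# senior tranche in full before the junior tranche receives anything.
# Tranche balances and the collection amount are invented examples.

def run_waterfall(collections, tranches):
    """Distribute collections to tranches in priority order.

    tranches: list of (name, outstanding_balance), senior first.
    Returns the payment to each tranche plus any residual.
    """
    remaining = collections
    payments = {}
    for name, balance in tranches:
        paid = min(remaining, balance)
        payments[name] = paid
        remaining -= paid
    payments["residual"] = remaining  # anything left flows to equity
    return payments

tranches = [("senior A", 80.0), ("junior B", 30.0)]
print(run_waterfall(100.0, tranches))
# {'senior A': 80.0, 'junior B': 20.0, 'residual': 0.0}
```

The draft regulation's appeal was precisely this: with the logic published as runnable code rather than buried in a prospectus, anyone can feed in their own collection scenarios and see who absorbs a shortfall.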
I walked away from the meeting with the perception that, while they did not want to reinvent the industry, OCR staff were sincerely interested in new ideas that might create incremental improvements. That meeting led to my inclusion in the third panel of the Credit Ratings Roundtable.
For me, the panel discussion itself was mostly positive. Between the opening statement, questions and discussion, I probably had about 8 minutes to express my views. I put across all the points I hoped to make and even received a positive comment from one of the other panelists. On the downside, only one commissioner attended my panel – whereas all five had been present at the beginning of the day when Al Franken, Jules Kroll, Doug Peterson and other luminaries held the stage.
The roundtable generated less media attention than I expected, but I got an above average share of the limited coverage relative to the day’s other 25 panelists. The highlight was a mention in the Wall Street Journal in its pre-roundtable coverage.
Perhaps the fact that I addressed the SEC will make it easier for me to place op-eds and get speaking engagements to promote the open source ratings concept. Only time will tell. Ultimately, someone with a bigger reputation than mine will need to advocate this concept before it can progress to the next level.
Also, the idea is now part of the published record of SEC deliberations. The odds of it getting into a proposed regulation remain long in the near future, but these odds are much shorter than they were prior to the roundtable.
Political scientist John Kingdon coined the term “policy entrepreneurs” to describe people who look for and exploit opportunities to inject new ideas into the policy discussion. I like to think of myself as a policy entrepreneur, although I have a long way to go before I become a successful one. If you have read this far and also have strongly held beliefs about how the financial system should improve, I suggest you apply the concepts of incrementalism and policy entrepreneurship to your own activism.
This is a guest post by Adam Obeng, a Ph.D. candidate in the Sociology Department at Columbia University. His work encompasses computational social science, social network analysis and sociological theory (basically anything which constitutes an excuse to sit in front of a terminal for unadvisably long periods of time). This post is Copyright Adam Obeng 2013 and licensed under a (Creative Commons Attribution-ShareAlike 3.0 Unported License). Crossposted on adamobeng.com.
Eben Moglen’s delivery leaves you in no doubt as to the sincerity of this sentiment. Stripy-tied, be-hatted and pocket-squared, he took to the stage at last week’s IDSE Seminar Series event without slides, but with engaging — one might say, prosecutorial — delivery. Lest anyone doubt his neckbeard credentials, he let slip that he had participated in the development of almost certainly the first networked email system in the United States, as well as mentioning his current work for the Freedom Box Foundation and the Software Freedom Law Center.
A superorganism called humankind
The content was no less captivating than the delivery: we were invited to consider the world where every human consciousness is connected by an artificial extra-skeletal nervous system, linking everyone into a new superorganism. What we refer to as data science is the nascent study of flows of neural data in that network. And having access to the data will entirely transform what the social sciences can explain: we will finally have a predictive understanding of human behaviour, based not on introspection but empirical science. It will do for the social sciences what Newton did for physics.
The reason the science of the nervous system – “this wonderful terrible art” – is optimised to study human behaviour is that consumption and entertainment are a large part of economic activity. The subjects of the network don’t own it. In a society which is more about consumption than production, the technology of economic power will be that which affects consumption. Indeed, what we produce becomes information about consumption which is itself used to drive consumption. Moglen is matter-of-fact: this will happen, and is happening.
And it’s also ineluctable that this science will be used to extend the reach of political authority, and it has the capacity to regiment human behaviour completely. It’s not entirely deterministic that it should happen at a particular place and time, but extrapolation from history suggests that somewhere, that’s how it’s going to be used, that’s how it’s going to come out, because it can. Whatever is possible to engineer will eventually be done. And once it’s happened somewhere, it will happen elsewhere. Unlike the components of other super-organisms, humans possess consciousness. Indeed, it is the relationship between sociality and consciousness that we call the human condition. The advent of the human species-being threatens that balance.
The Oppenheimer moment
Moglen’s vision of the future is, as he describes it, both familiar and strange. But his main point is, as he puts it, very modest: unless you are sure that this future is absolutely 0% possible, you should engage in the discussion of its ethics.
First, when the network is wrapped around every human brain, privacy will be nothing more than a relic of the human past. He believes that privacy is critical to creativity and freedom, but really the assumption that privacy – the ability to make decisions independent of the machines – should be preserved is axiomatic.
What is crucial about privacy is that it is not personal, or even bilateral; it is ecological: how others behave determines the meaning of the actions I take. As such, dealing with privacy requires an ecological ethics. It is irrelevant whether you consent to be delivered poisonous drinking water; we don’t regulate such resources by allowing individuals to make decisions about how unsafe they can afford their drinking water to be. Similarly, whether you opt in or opt out of being tracked online is irrelevant.
The existing questions of ethics that science has had to deal with – how to handle human subjects – are of no use here: informed consent is sufficient only when the risks produced by investigating a human subject apply only to that individual.
These ethical questions are for citizens, but perhaps even more so for those in the business of making products from personal information. Whatever is produced from your data can be trivially traced back to you. Whatever finished product you are used to make, you do not disappear from it. What’s more, the scientists are beholden to the very few secretive holders of the data.
Consider, says Moglen, the question of whether punishment deters crime: there will be increasing amounts of data about it, but we’re not even going to ask, because no advertising sale depends on it. Consider also the prospect of machines training humans, which is already beginning to happen. The Coursera business model is set to do to the global labour market what Google did to the global advertising market: auctioning off the good learners, found via their learning patterns, to employers. Granted, defeating ignorance on a global scale is within grasp. But there are still ethical questions here, and evil is ethics undealt with.
One of the criticisms often levelled at techno-utopians is that the enabling power of technology can very easily be stymied by the human factors, the politics, the constants of our species, which cannot be overwritten by mere scientific progress. Moglen could perhaps be called a techno-dystopian, but he has recognised that while the technology is coming, inevitably, how it affects us depends on how we decide to use it.
But these decisions cannot just be made at the individual level, Moglen pointed out: we’ve changed everything except the way people think. I can’t say that I wholeheartedly agree with either Moglen’s assumptions or his conclusions, but he is obviously asking important questions, and he has shown the form in which they need to be asked.
Another doubt: as a social scientist, I’m also not convinced that having all these data available will make all human behaviour predictable. We’ve catalogued a billion stars, the Large Hadron Collider has produced a hundred thousand million million bytes of data, and yet we’re still trying to find new specific solutions to the three-body problem. I don’t think that just having more data is enough. I’m not convinced, but I don’t think it’s 0% possible.
This post is Copyright Adam Obeng 2013 and licensed under a (Creative Commons Attribution-ShareAlike 3.0 Unported License).
I’ve discussed the broken business model that is the credit rating agency system in this country on a few occasions. It directly contributed to the opacity and fraud in the MBS market and to the ensuing financial crisis, for example. And in this post and then this one, I suggested that someone should start an open source version of credit rating agencies. Here’s my explanation:
The system of credit ratings undermines the trust of even the most fervently pro-business entrepreneur out there. The models are knowingly gamed by both sides, and it’s clearly both corrupt and important. It’s also a bipartisan issue: Republicans and Democrats alike should want transparency when it comes to modeling downgrades – at the very least so they can argue against the results in a factual way. There’s no reason I can see why there shouldn’t be broad support for a rule to force the ratings agencies to make their models publicly available. In other words, this isn’t a political game that would score points for one side or the other.
Well, it wasn’t long before Marc Joffe, who had started an open source credit rating agency, contacted me and came to my Occupy group to explain his plan, which I blogged about here. That was almost a year ago.
Today the SEC is going to have something they’re calling a Credit Ratings Roundtable. This is in response to an amendment that Senator Al Franken put on Dodd-Frank which requires the SEC to examine the credit rating industry. From their webpage description of the event:
The roundtable will consist of three panels:
- The first panel will discuss the potential creation of a credit rating assignment system for asset-backed securities.
- The second panel will discuss the effectiveness of the SEC’s current system to encourage unsolicited ratings of asset-backed securities.
- The third panel will discuss other alternatives to the current issuer-pay business model in which the issuer selects and pays the firm it wants to provide credit ratings for its securities.
Marc is going to be one of something like 9 people on the third panel. He wrote this op-ed piece about his goal for the panel, a key excerpt being the following:
Section 939A of the Dodd-Frank Act requires regulatory agencies to replace references to NRSRO ratings in their regulations with alternative standards of credit-worthiness. I suggest that the output of a certified, open source credit model be included in regulations as a standard of credit-worthiness.
Just to be clear: the current problem is that not only is there widespread gaming, but there’s also a near monopoly by the “big three” credit rating agencies, and for whatever reason that monopoly status has been incredibly well protected by the SEC. They don’t grant “NRSRO” status to credit rating agencies unless the given agency can produce something like 10 letters from clients who will vouch for them providing credit ratings for at least 3 years. You can see why this is a hard business to break into.
The Roundtable was covered yesterday in the Wall Street Journal as well: Ratings Firms Steer Clear of an Overhaul - an unfortunate title if you are trying to be optimistic about the event today. From the WSJ article:
Mr. Franken’s amendment requires the SEC to create a board that would assign a rating firm to evaluate structured-finance deals or come up with another option to eliminate conflicts.
While lawsuits filed against S&P in February by the U.S. government and more than a dozen states refocused unflattering attention on the bond-rating industry, efforts to upend its reliance on issuers have languished, partly because of a lack of consensus on what to do.
I’m just kind of amazed that, given how dirty and obviously broken this industry is, we can’t do better than this. SEC, please start doing your job. How could allowing an open-source credit rating agency hurt our country? How could it make things worse?
Yesterday I wrote this short post about my concerns about the emerging field of e-discovery. As usual the comments were amazing and informative. By the end of the day yesterday I realized I needed to make a much more nuanced point here.
Namely, I see a tacit choice being made, probably by judges or court-appointed “experts”, on how machine learning is used in discovery, and I think that the field could get better or worse. I think we need to urgently discuss this matter, before we wander into a crazy place.
And to be sure, the current discovery process is fraught with opacity and human judgment, so complaining about those features being present in a machine learning version of discovery is unreasonable – the question is whether it’s better or worse than the current system.
Making it worse: private code, opacity
The way I see it, if we allow private companies to build black box machines that we can’t peer into, nor keep track of as they change versions, then we’ll never know why a given set of documents was deemed “relevant” in a given case. We can’t, for example, check to see if the code was modified to be more friendly to a given side.
Beyond the healthy competition for clients that this new revenue source would invite, the resulting feedback loop will likely be a negative one, whereby private companies use the cheapest version they can get away with to achieve the best results (for their clients) that they can argue for.
Making it better: open source code, reproducibility
What we should be striving for is to use only open source software, saved in a repository so we can document exactly what happened with a given corpus and a given version of the tools. There will still be an industry around cleaning the data, feeding in the documents, training the algorithm (whilst documenting how that works), and interpreting the results. Data scientists will still get paid.
In other words, instead of asking for interpretability, which is a huge ask considering the massive scale of the work being done, we should, at the very least, be able to ask for reproducibility of the e-discovery, as well as transparency in the code itself.
Why reproducibility? Then we can go back in time, or rather scholars can, and test how things might have changed if a different version of the code were used, for example. This could create a feedback loop crucial to improve the code itself over time, and to improve best practices for using that code.
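To make the idea concrete, here is a minimal sketch of what that reproducibility could look like in practice: fingerprint the corpus and pin the code version together in a manifest, so a run can be re-created and audited later. The function names and manifest fields are my own hypothetical choices, not any existing tool’s format.

```python
# Sketch: record enough about an e-discovery run that scholars could
# later re-run it with the same corpus and the same code version.
import hashlib
import json

def corpus_fingerprint(documents):
    """SHA-256 over the sorted documents, so the same corpus always
    yields the same fingerprint regardless of input order."""
    h = hashlib.sha256()
    for doc in sorted(documents):
        h.update(doc.encode("utf-8"))
    return h.hexdigest()

def run_manifest(documents, code_version, relevant_ids):
    """Bundle corpus hash, code version, and results into one record."""
    return json.dumps({
        "corpus_sha256": corpus_fingerprint(documents),
        "code_version": code_version,  # e.g. a git commit hash
        "relevant_ids": sorted(relevant_ids),
    }, indent=2)

docs = ["email about the merger terms", "office party menu"]
manifest = run_manifest(docs, code_version="a1b2c3d", relevant_ids=[0])
```

Given such a manifest and an open repository, anyone could check out commit `a1b2c3d`, verify the corpus hash, and replay the run – which is exactly the kind of audit trail a black-box vendor cannot provide.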
Today I want to bring up a few observations and concerns I have about the emergence of a new field in machine learning called e-discovery. It’s the algorithmic version of discovery, so I’ll start there.
Discovery is part of the process in a lawsuit where relevant documents are selected, pored over, and then handed to the other side. Nowadays, of course, there are more and more documents, almost all electronic, typically including lots of e-mails.
If you’re talking about a big lawsuit, there could be literally millions of documents to wade through, and that takes a lot of time for humans to do, and it can be incredibly expensive and time-consuming. Enter the algorithm.
With advances in Natural Language Processing (NLP), a machine algorithm can sort emails or documents by topic (after getting the documents into machine-readable form, cleaning, and deduping) and can in general do a pretty good job of figuring out whether a given email is “relevant” to the case.
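For readers who haven’t seen this kind of model, here is a minimal sketch of a relevance classifier of the sort involved, using scikit-learn. The training documents and labels are invented for illustration; commercial e-discovery products are proprietary and surely far more elaborate.

```python
# Sketch: train on a few hand-labeled documents, then rank the rest of
# the corpus by predicted probability of relevance to the case.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hand-labeled training set: 1 = relevant, 0 = not relevant.
train_docs = [
    "pricing discussion for the upcoming beer merger",
    "agenda: merger antitrust review with counsel",
    "lunch menu for the office party",
    "IT notice: password reset reminder",
]
train_labels = [1, 1, 0, 0]

# TF-IDF turns each document into a vector of term weights;
# logistic regression learns which terms signal relevance.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_docs, train_labels)

# Score the unlabeled corpus and sort it, most-likely-relevant first,
# so human reviewers start with the strongest candidates.
corpus = [
    "draft term sheet for the merger",
    "reminder: fantasy football picks due",
]
scores = model.predict_proba(corpus)[:, 1]
ranked = sorted(zip(corpus, scores), key=lambda pair: -pair[1])
```

The iterative loop described in the WSJ story amounts to repeating this: humans label a batch, the model is retrained, and the cycle continues until its predictions are deemed accurate enough.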
And this is already happening – the Wall Street Journal recently reported that the Justice Department allowed e-discovery for a case involving the merger of two beer companies. From the article:
With the blessing of the Justice Department’s antitrust division, the lawyers loaded the documents into a program and manually reviewed a batch to train the software to recognize relevant documents. The manual review was repeated until the Justice Department and Constellation were satisfied that the program could accurately predict relevance in the rest of the documents. Lawyers for Constellation and Crown Imports used software developed by kCura Corp., which lists the Justice Department as a client.
In the end, Constellation and Crown Imports turned over hundreds of thousands of documents to antitrust investigators.
Here are some of my questions/ concerns:
- These algorithms are typically not open source – companies like kCura make good money doing these jobs.
- That means that they could be wrong, possibly in subtle ways.
- Or maybe not so subtle ways: maybe they’ve been trained to find documents that are both “relevant” and “positive” for a given side.
- In any case, the laws of this country will increasingly depend on a black box algorithm that is not accessible to the average citizen.
- Is that in the public’s interest?
- Is that even constitutional?