Open Models (part 1)

January 11, 2012 Cathy O'Neil, mathbabe

A few days ago I posted about how riled up I was to see the Heritage Foundation publish a study about teacher pay which was obviously politically motivated. In the comment section a skeptical reader challenged me on a few things. He had some great points, and I’d love to address them all, but today I will only address the most important one, namely:

…the criticism about this particular study could be leveled at any study funded by any think tank, from the lowly ones, to the more prestigious ones, which have near-academic status (e.g. Brookings or Hoover). But indeed, most social scientists have a political bias. Piketty advised Ségolène Royal. Does it invalidate his study on inequality in America? Rogoff is a Republican. Should one dismiss his work on debt crises? I think the best reaction is not to dismiss any study, or any author for that matter, on the basis of their political opinion, even if we dislike their pre-made tweets (which may have been prepared by editors that have nothing to do with the authors, by the way). Instead, the paper should be judged on its own merit. Even if we know we’ll disagree, a good paper can sharpen and challenge our prior convictions.

Agreed! Let’s judge papers on their own merits. However, how can we do that well? Especially when the data is secret and/or the model itself is only vaguely described, it’s impossible. I claim we need to demand more information in such cases, especially when the results of the study are taken seriously and policy decisions are potentially made based on them.

What should we do?

Addressing this problem of verifying modelling results is my goal with defining open source models. I’m not really inventing something new, but rather crystallizing and standardizing something that is already in the air (see below) among modelers who are sufficiently skeptical of the underlying incentives that modelers and their institutions have to look confident.

The basic idea is that we cannot and should not trust models that are opaque. We should all realize how sensitive models are to design decisions and tuning parameters. In the best case, this means we, the public, should have access to the model itself, manifested as a kind of app that we can play with.

Specifically, this means we can play around with the parameters and see how the model changes. We can input new data and see what the model spits out. We can retrain the model altogether with a slightly different assumption, or with new data, or with a different cross validation set.
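
To make that concrete, here is a minimal sketch in Python of what "playing with" a model could look like. Everything in it is an assumption made for illustration: the synthetic data, the one-parameter ridge regression, and the function names (make_data, fit_ridge_slope, holdout_error) are hypothetical and are not taken from any study discussed here. The point is only that the fitted result moves when you change a tuning parameter or the holdout split.

```python
import random


def make_data(n=200, seed=0):
    """Synthetic data, y = 2x + noise. Purely illustrative, not real data."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]
    return xs, ys


def fit_ridge_slope(xs, ys, penalty):
    """Fit y = b*x by minimizing sum((y - b*x)^2) + penalty * b^2.

    The closed-form solution is b = sum(x*y) / (sum(x*x) + penalty).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


def holdout_error(xs, ys, penalty, split_seed, holdout_fraction=0.25):
    """Train on one random part of the data, score on the held-out rest."""
    idx = list(range(len(xs)))
    random.Random(split_seed).shuffle(idx)
    cut = int(len(idx) * holdout_fraction)
    test, train = idx[:cut], idx[cut:]
    slope = fit_ridge_slope([xs[i] for i in train], [ys[i] for i in train], penalty)
    mse = sum((ys[i] - slope * xs[i]) ** 2 for i in test) / len(test)
    return slope, mse


if __name__ == "__main__":
    xs, ys = make_data()
    # Vary the tuning parameter and the holdout split, and watch the answer move.
    for penalty in (0.0, 10.0, 100.0):
        for split_seed in (1, 2):
            slope, mse = holdout_error(xs, ys, penalty, split_seed)
            print(f"penalty={penalty:6.1f} split={split_seed} "
                  f"slope={slope:.3f} holdout_mse={mse:.3f}")
```

With an open model, a skeptical reader could run a loop like this on the real data and see for themselves how much of the headline result survives a different penalty or a different holdout set.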

The technology to allow us to do this all exists – even the various ways we can anonymize sensitive data so that it can still be semi-public. I will go further into how we can put this together in later posts. For now let me give you some indication of how badly this is needed.

Already in the Air

I was heartened yesterday to read this article from Bloomberg written by Victoria Stodden and Samuel Arbesman. In it they complain about how much of science depends on modeling and data, and how difficult it is to confirm studies when the data (and modeling) is being kept secret. They call on federal agencies to insist on data sharing:

Many people assume that scientists the world over freely exchange not only the results of their experiments but also the detailed data, statistical tools and computer instructions they employed to arrive at those results. This is the kind of information that other scientists need in order to replicate the studies. The truth is, open exchange of such information is not common, making verification of published findings all but impossible and creating a credibility crisis in computational science.

Federal agencies that fund scientific research are in a position to help fix this problem. They should require that all scientists whose studies they finance share the files that generated their published findings, the raw data and the computer instructions that carried out their analysis.

The ability to reproduce experiments is important not only for the advancement of pure science but also to address many science-based issues in the public sphere, from climate change to biotechnology.

How bad is it now?

You may think I’m exaggerating the problem. Here’s an article that you should read, in which the case is made that most published research is false. Now, open source modeling won’t fix all of that problem, since a large part of it is the underlying bias that you only publish something that looks important (you never publish results explaining all the things you tried that didn’t look statistically significant).

But think about it: that’s most published research. I’d like to posit that it’s the unpublished research that we should be really worried about. Note that banks and hedge funds don’t ever publish their research, obviously, for proprietary reasons, but that this secrecy doesn’t improve the verifiability problem.

Indeed, my experience is that very few people in a bank or hedge fund actually vet the underlying models, partly because they don’t want information to leak and partly because those models are really hard. You may argue that the models are carefully vetted, since big money is often at stake. But I’d reply that actually, you’d be surprised.

How about on the internet? Again, the models are not published, and we have no reason to believe that they are more correct than published scientific models. And those models are being used day in and day out, drawing conclusions about you (what your credit score is, whether you deserve a certain loan) every time you click.

We need a better way to verify models. I will attempt to outline specific ideas of how this should work in further posts.

Categories: data science, finance, open source tools

Comments (7)

hazu chan

January 11, 2012 at 7:30 pm

Universities should set up data commons for all published work, then make it available to the masses. We’re in an epoch where it’s entirely possible to semi-automate the ability to take a model and replicate the results in a paper, let’s frickin’ do it!

- Cathy O'Neil, mathbabe
  
  January 12, 2012 at 3:20 pm
  
  Amen!
  
Kevin Wilson

January 11, 2012 at 10:26 pm

I once took the NIH online course for human research (http://phrp.nihtraining.com/users/login.php) and I remember that there was a whole section on sharing data. But the training focused on when it was OK to share data with other researchers when you hadn’t explicitly notified the participants that their data would be shared with another researcher.

Of note, though, is the beginning of the third paragraph of this document (http://grants.nih.gov/grants/guide/notice-files/NOT-OD-03-032.html). Here it seems to indicate that researchers who get grants in excess of half a million dollars *already* have to have a data sharing policy in place. I’d be curious as to what this means in practice, considering that there’s a lot of wiggle room in the statement.

thedukeofurl

January 12, 2012 at 4:10 pm

Re the vetting of financial modeling, it is even worse than you suggest. A primary reason that these models are not vetted is that almost no one but their creators can understand them within a reasonable time frame, which in the investment/casino community is very short. Another reason they aren’t vetted is that no one cares as long as it works in the way intended, which may include fraud. These are not research organizations. Their only concern is profit. And they produce nothing of value.

Luke Lea

January 12, 2012 at 5:42 pm

Should models even be used? They all depend upon equations, i.e., putative mathematical functions linked by equal signs. But in the real world of markets nothing can be measured with much precision. Not even prices. For one thing the yardstick of money itself is not rigid but rubbery (cf., Fisher’s The Money Illusion). For another prices vary from place to place at any given moment, to very few of which times and places (often only one) we have access. For a third, commodities vary significantly in quality, which cannot be specified. And for a fourth, human behavior is labile.

No model has ever been validated to my knowledge.

So if mathematical “functions” are ruled out, what does that leave? Blurry geometrical relationships of “convexity” and “concavity” that ultimately boil down to the law of diminishing returns. Not that that is nothing. The tendency to general equilibrium can be deduced from them. General policy prescriptions can be intuited by experienced observers.

Nadia Hassan

January 13, 2012 at 12:08 pm

Hey Cathy, you might find this post at Andrew Gelman’s blog interesting, though I know he’s on your blogroll.

http://andrewgelman.com/2012/01/what-are-the-important-issues-in-ethics-and-statistics-im-looking-for-your-input/

Gappy

January 16, 2012 at 12:34 pm

I am the reader that inspired this post, and broadly agree with it. Model usage, validation, and interpretation in the social sciences (to which finance also belongs) is the central issue in the methodology of social sciences, and yet most academics and graduate students don’t think much about it. Really, hardly anyone cares about the methodology of economics, and researchers publishing on it are considered mostly second-rate. Before discussing methods, I agree that a preliminary step should be to shed sunlight on data, Brandeis-style. Not that this is a new idea: the concept of “reproducible research” (http://reproducibleresearch.net/) has been around for a while. To quote Donoho:

“An article about computational science in a scientific publication is not the scholarship itself, it is merely *advertising* of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.”

For computational science this is the beginning and the end of the story. In the case of social sciences, data collection, validation and any preprocessing of the data should also be described in great detail. That won’t solve the problem, because the goal is not to reproduce a result, but to ascertain the truth.

Having said that, I don’t hold my breath. First, the authors have no incentive to publish the data, as long as their papers are accepted in journals under the current regime. Nobody likes to run the risk of being proven wrong. Second, I believe that the vast majority of these studies are simply used to reinforce one’s bias. Finally, the degrees of indirection from the original source lose important details and qualifications of each study. Nowhere is this more evident than in descriptive studies of economic inequalities and mobility. Two examples: If you think that unions are bad, you’ll love the AEI/EF on teachers’ pay, even if it’s obviously based on proxies of what it purports to measure, and the data are not available. If your mom was a teacher, you may think otherwise. On the other side, a chart showing the relationship between Gini index and intergenerational elasticity of income made the news (see e.g. http://krugman.blogs.nytimes.com/2012/01/15/the-great-gatsby-curve/). Lost in the media machine is that it relies on a) meta-analyses b) of small surveys; c) of interviews of sons about the occupations of their parents; d) at different ages. There is enough uncertainty to fit an elephant.

The best one can hope for is that some form of model auditing may help uncover the obvious frauds, educate the readers’ prejudices, and make them aware of the fact that knowledge is provisional and very inaccurate, way more than the experts would like you to believe.

Final note: all of this applies less to financial modeling, where often datasets are large, of relatively good quality and nonstationary, and the studies are more inferential in nature.
