Open data is not a panacea

December 29, 2012 Cathy O'Neil, mathbabe

I’ve talked a lot recently about how there’s an information war currently being waged on consumers by companies that troll the internet and collect personal data, search histories, and other “attributes” in data warehouses, which then get sold to the highest bidders.

It’s natural to want to balance out this information asymmetry somehow. One such approach is open data, defined in Wikipedia as the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control.

I’m going to need more than one blog post to think this through, but I wanted to make two points this morning.

The first is my issue with the phrase “freely available to everyone to use”. What does that mean? Having worked in futures trading, where we put trading machines and algorithms in close proximity with exchanges for large fees so we can get to the market data a few nanoseconds before anyone else, it’s clear to me that availability and access to data is an incredibly complicated issue.

And it’s not just about speed. You can have hugely important, rich, and large data sets sitting in a lump on a publicly available website like Wikipedia, and if you don’t have fancy parsing tools and algorithms you’re not going to be able to make use of them.

When important data goes public, the edge goes to the most sophisticated data engineer, not the general public. The Goldman Sachses of the world will always know how to make use of “freely available to everyone” data before the average guy.

Which brings me to my second point about open data. It’s general wisdom that we should hope for the best but prepare for the worst. My feeling is that as we move towards open data we are doing plenty of the hoping part but not enough of the preparing part.

If there’s one thing I learned working in finance, it’s not to be naive about how information will be used. You’ve got to learn to think like an asshole to really see what to worry about. It’s a skill which I don’t regret having.

So, if you’re giving me information on where public schools need help, I’m going to imagine using that information to cut off credit for people who live nearby. If you tell me where environmental complaints are being served, I’m going to draw a map and see where they aren’t being served so I can take my questionable business practices there.

I’m not saying proponents of open data aren’t well-meaning; they often seem to be. And I’m not saying that the bad outweighs the good, because I’m not sure. But it’s something we should figure out how to measure, and in this information war it’s something we should keep a careful eye on.

Categories: data science, finance, open source tools

Comments (21)

charles sereno

December 29, 2012 at 8:49 am

Let’s call the complement of open data “closed data.” Doesn’t Goldman have an even greater advantage there?

- Cathy O'Neil, mathbabe
  
  December 29, 2012 at 11:44 am
  
  Absolutely, but we are aware of that war already.
  
  - Lou Puls (@MonkeeRench)
    
    December 29, 2012 at 11:55 am
    
    If the closed or proprietary data includes meaningful market data unavailable to retail investors, then Banksters like G$ can take advantage of such a failure of so-called “efficient markets” and use the best data “scientists” to compete among other Banksters. The retailers are excluded by (de)regulation and the market is fixed for enshafting, worldwide.
    
  - Mike
    
    December 29, 2012 at 8:34 pm
    
    “So, if you’re giving me information on where public schools need help, I’m going to imagine using that information to cut off credit for people who live nearby. If you tell me where environmental complaints are being served, I’m going to draw a map and see where they aren’t being served so I can take my questionable business practices there.”
    
    You don’t have any guarantee this information won’t be collected and used in a closed-data system. The only thing you’re guaranteeing is that the public will not be aware of it.
    
themusicgod1

December 29, 2012 at 9:28 am

“the edge goes to the most sophisticated data engineer, not the general public.”

While it is a truism that there are assholes that will use the data…there is no good reason why the general public cannot be organized to use the data better than even they can. Open Data realistically has only been around for a fraction of a generation. We have not yet figured out, in every aspect of life, how important it is to have access to it, so we’re not yet demanding it or taking full advantage of it; and the hardware for processing it, while starting to get widely distributed, is still not powered by Free Software sufficient to do the job.

Hackerspaces are only one of many parts of society that are capable of using this data: farmer cooperatives, libraries, elementary schools…each of these can both set about trying to use the data and develop tools to do so, in both large & small projects. At a certain point, the Goldman Sachses of the world will not have an advantage, because there are simply too many of ‘us’ and too few of ‘them’. Tools to process the data are a bit of a missing piece, but things like Google Correlate are very much the light at the end of the tunnel here. We’re a long way off from perfect use of this data in a socially optimal way, but Open Data is very much the second or third step.

- charles sereno
  
  December 29, 2012 at 9:51 am
  
  Plus, more scope of development when aiming for everyone’s good rather than just “my” good.
  
- Mike
  
  December 29, 2012 at 8:39 pm
  
  Not to mention open-source software can be used to give even a novice engineer an edge in data analytics. Some of the most common statistical analysis packages are open-source.
  
mdb

December 29, 2012 at 10:22 am

You may want to look up arbitrage; that is what you are describing. Much has been written about it, it happens everywhere, and, as you point out, not just with prices. Much ink was spilled on regulatory arbitrage leading to the financial crisis (which regulator had the most favorable rules for what the particular bank was doing).

Mike Loukides

December 29, 2012 at 10:24 am

You make a good point; certainly open data isn’t pure goodness. And certainly “thinking like an asshole” is a good way to anticipate how data might be misused.

But I still think open data is, on the whole, better than closed data. (And I wrote about this somewhere on Radar, though I’m too lazy to look up the URL.) Most data (school performance, environmental complaints to take your specific examples) is already available, if not openly, to people with the money to exploit it. So they already have the data to cut off credit to people living near poor schools, and I bet they’re already doing it. I’m sure that companies with questionable environmental practices locate new facilities in places where data shows they’re less likely to run into enforcement problems. (Louisiana, this means you.)

So Goldman Sachs already has that edge. They can buy all the data they need. The real question is whether the public good of forcing Goldman Sachs, Exxon, and their ilk to pay for access to data outweighs the public good of open access to data. I think the answer to that question is clear, largely because I doubt Sachs and Exxon are paying much, on their scale, for data access. Though this way of phrasing the question suggests that there’s another way to balance the equation, and that’s for data users to pay what the data is actually worth to them.

- Cathy O'Neil, mathbabe
  
  December 29, 2012 at 11:48 am
  
  There’s no doubt we’d need to do the math here, but my experience is that those big guys do actually pay a lot for data – in fact they’d pay more for better and cleaner data, and they see their data sources as their biggest impediment to making even more money. So a bunch of free data is a huge win for them.
  
  There’s no obvious solution to this – the public should do a better job taking advantage of free data. The problem is (and it’s possibly a temporary problem) that the big money and time is going to the for-profit big money guys, and the public just doesn’t have the resources to compete (yet).
  
  - charles sereno
    
    December 29, 2012 at 11:55 am
    
    “the public just doesn’t have the resources to compete (yet).”
    You mean like the situation with our government?
    
Cynicism

December 29, 2012 at 11:04 am

This post reminds me of the writings of Jaron Lanier and the way one can have mountains of data that doesn’t amount to an eyedropper full of information because there’s no one thinking about it or who knows how to parse it.

He also argues against open data in governments as you argue against open data in finance. The upshot is that for governments, a WikiLeaks-type organization does not make everyone freer, but punishes those who are for neither absolute transparency nor absolute opacity. Therefore, slightly open governments like that of the USA are pushed towards becoming less transparent. See for instance: http://www.theatlantic.com/technology/archive/2010/12/the-hazards-of-nerd-supremacy-the-case-of-wikileaks/68217/#

- Cathy O'Neil, mathbabe
  
  December 29, 2012 at 11:49 am
  
  There’s definitely a dynamic system in effect here. I’d like to think about ways of defining the boundary conditions carefully.
  
- charles sereno
  
  December 29, 2012 at 12:08 pm
  
  For my part, I am even less sympathetic to governments when they use their power arrogantly and non-constructively.
  
mathematrucker

December 29, 2012 at 12:30 pm

I’ve always been curious about the incentive (or lack thereof) the insurance industry has to be the least bit cutting-edge with data analysis. Despite being cited for speeding one year ago while on duty as a truck driver and having that citation go on my personal driving record, my auto premiums have yet to rise. In a perfect world this would be because insurers treat citations received by long-haul truckers on the job differently than ones received by ordinary drivers, due to vast differences in exposure: long-haul truckers drive roughly ten times the number of miles ordinary drivers do.

But why on Earth would any insurance company bother to get so picky when it means less revenue? Even though “truck driver” is the most common job held among men in America, there aren’t THAT many of us! Insurers obviously care a lot about data; but aren’t their sniffers inherently biased towards profit?

We don’t live in a perfect world, and I’m still keeping my fingers crossed about that ticket.

Evelyn

December 29, 2012 at 2:18 pm

http://www.guardian.co.uk/commentisfree/2012/dec/29/fbi-coordinated-crackdown-occupy
Data at: http://www.justiceonline.org/commentary/fbi-files-ows.html#documents

south northern

December 29, 2012 at 5:16 pm

When I was 26 and began going bald I went to the Public Health System. Back then, the (young) doctor told me that there was nothing that could be done, that going bald was inevitable. I believed him.
As months passed and my baldness was getting worse, I began investigating on my own and found a working solution that had been on the market for years. Now, I think that it was impossible that the PHS doctor I asked didn’t know about it. He probably thought that giving out that kind of information might give me an artificial advantage and end up hurting society as a whole if everyone did the same.
But he didn’t prevent me from obtaining that information, and if he didn’t stop me, he didn’t stop anyone who made a reasonable effort to obtain it. He only delayed the process and made it more expensive, since I had to go to a (very rich) private doctor to get the drugs prescribed.
Sure, there are lots of ways to look at it and probably no one is “right”, but where I think the PHS doctor was wrong is in that there are lots of people that don’t mind getting bald, but I did.

Paddy3118

December 31, 2012 at 3:29 am

Maybe we could create a better than open data license. A license under which users of the data are obliged, by law, to make their parsers and data-miners on that data publicly available under an open-source license?

That would help with the wealth of different formats that data is kept in. It would do something to redress the balance in favour of those with novel uses and ways to combine different data-sets.

David Petraitis

December 31, 2012 at 4:21 pm

“think like an asshole…”
Cathy’s Axiom of Financial Thought

I think it is one for posterity. 😉

islandletters

January 1, 2013 at 1:07 pm

Well, maybe one approach would be to have less data, in total that is. The European Commission is currently discussing “The Right to be Forgotten”. In other words: a right to get one’s data deleted (for instance after a pre-defined time). Practically there might be too many loopholes to really get this off the ground, but I maintain it’s worth pondering.
Secondly, something like a Creative Commons license for open data might alleviate some of the exploitation dangers you are afraid of. If you want to use information about the needs of schools because you are a charity looking for a suitable project, you are welcome; if you want to use the information for commercial gains (any commercial gains, that is), you are charged. Depending on the charge collected, this approach could at least stifle casual, large-scale harvesting of data for exploitation purposes.

Jason Hare

January 2, 2013 at 8:17 am

Open data professionals are currently debating the ethics of exposing data sets to the public. The main issues are privacy of the individual and the usability of the data. The transformation of data into information and the ethical exposure of “anonymized” data sets are what we are after.

I work at the municipal level as a city open data program manager. In my role my task is to deliver a framework of collaboration between our city government and the citizens. The city wants collaboration and input on ideas related to social issues, economic development issues as well as exposing performance indicators on department spending and performance.

Data, like any asset within government, is part of infrastructure. Data alone does not equal transparency. Certainly machine-readable data sets are available for anyone to use and analyze. The transformation from data to information to democratize city information is absolutely necessary.

One very sticky issue is that open data principles for the most part are based on European privacy laws. The Open Knowledge Foundation’s “Open Data Handbook” articulates what is fit for open data and what is not. This flies in the face of Freedom of Information Act requests which can identify individuals. A second sticky issue is the use of public data to expose individuals by the media. Recently a gun ownership map that was released showed an individual’s name and address along with identifying them as a gun owner. This is not about gun ownership but about exposing public data and individual identity. That map was released by a newspaper and not a public sector agency. I question the ethics of releasing any individual’s name or any linked data sets that can be cross referenced to identify an individual.

I am glad writers like yourself are discussing open data.

My views are my own but are reflected in the policies I am proposing to my municipality.
