The bursting of the big data bubble

September 20, 2013 Cathy O'Neil, mathbabe

It’s been a good ride. I’m not gonna lie, it’s been a good time to be a data whiz, a quant-turned-data scientist. I get lots of attention and LinkedIn emails just for my title and my math Ph.D., and it’s flattering. But all of that is going to change, starting now.

You see, there are some serious headwinds. They started a while ago but they’re picking up speed, and the magical wave of hype propelling us forward is giving way. I can tell, I’ve got a nose for sinking ships and sailing metaphors.

First, the hype and why it’s been so strong.

It seems like data and the ability to use data is the secret sauce in so many of the big success stories. Look at Google. They managed to think of the entire web as their data source, and have earned quite a bit of respect and advertising money for their chore of organizing it like a huge-ass free library for our benefit. That took some serious data handling and modeling know-how.

We humans are pretty good at detecting patterns, so after a few companies made it big with the secret data sauce, we inferred that, when you take a normal tech company and sprinkle on data, you get the next Google.

Next, a few reasons it’s unsustainable

Most companies don’t have the data that Google has, and can never hope to cash in on stuff at the scale of the ad traffic that Google sees. Even so, there are lots of smaller but real gains that lots of companies – but not all – could potentially realize if they collected the right kind of data and had good data people helping them.

Unfortunately, this process rarely actually happens the right way, often because the business people ask their data people the wrong questions to begin with, and since they think of their data people as little more than pieces of software – data in, magic out – they don’t get their data people sufficiently involved with working on something that data can address.

Also, since there are absolutely no standards for what constitutes a data scientist, and anyone who’s taken a machine learning class at college can claim to be one, the data scientists walking around often have no clue how to actually form the right questions to ask anyway. They are lopsided data people, and only know how to answer already well-defined questions like the ones that Kaggle comes up with. That’s less than half of what a good data scientist does, but people have no idea what a good data scientist does.

Plus, it’s super hard to accumulate hard evidence that you have a crappy data science team. If you’ve hired one or more unqualified data scientists, how can you tell? They still might be able to implement crappy models which don’t answer the right question, but in order to see that you’d need to also have a good data scientist who implements a better solution to the right question. But you only have one. It’s a counterfactual problem.

Here’s what I see happening. People have invested some real money in data, and they’ve gotten burned with a lack of medium-term results. Now they’re getting impatient for proof that data is an appropriate place to invest what little money their VCs have offered them. That means they want really short-term results, which means they’re lowballing data science expertise, which means they only attract people who’ve taken one machine learning class and fancy themselves experts.

In other words, data science expertise has been commodified, and it’s a race to the bottom. Who will solve my business-critical data problem on a short-term consulting basis for less than $5000? Less than $4000?

What’s next?

There really is a difference between A) crude models that someone constructs not really knowing what they’re doing and B) thoughtful models which gain an edge along the margin. It requires someone who actually knows what they’re doing to get the latter kind of model. But most people are unaware of even the theoretical difference between type A and type B models, nor would they recognize which type they’ve got once they get one.

Even so, over time, type B models outperform type A models, and if you care enough about the marginal edge between the two types, say because you’re in a competitive environment, then you will absolutely need type B to make money. And by the way, if you don’t care about that marginal edge, then by all means you should use a type A solution. But you should at least know the difference and make that choice deliberately.

My forecast is that, once the hype wave of big data is dead and gone, there will emerge reasonable standards of what a data scientist should actually be able to do, and moreover a standard of when and how to hire a good one. It’ll be a rubric, and possibly some tests, of both problem solving and communication.

Personally, I’m looking forward to a more reasonable and realistic vision of how data and data expertise can help with things. I might have to change my job title, but I’m used to it.

Categories: data science

Comments (33)

Marcus Kirsch

September 20, 2013 at 7:15 am

Ever heard of Creative Technology or Strategic Technology? Same problem. One could insert many ‘new’ things like apps, etc. in here to prove that just hiring a skill into production-side of the business creates innovation in the business. And then, yes, you might have someone hired, who doesn’t come from the hard-knock growth process that created the new area of expertise, but just someone who puts the same title on themselves, because ‘they do something sort of like that’. Before you know it, businesses end up becoming risk-averse and hate the ‘new stuff’. Oh wait, that just happened.

Eroteme

September 20, 2013 at 7:22 am

But do you have data to support your hypothesis of the impending puncturing of the bubble? 😉

Greg Taylor

September 20, 2013 at 8:31 am

This is the first big data bubble bursting forecast I’ve seen. Based on previous forecasts of bursting bubbles, the first forecast is typically 3-5 years premature. So, I’d say we can plan on at least another 3 years of bubble and another order of magnitude of growth before Cathy’s forecast pans out;)

- gwern
  
  September 20, 2013 at 3:05 pm
  
  Indeed. Let’s remember Amara’s law: we often overestimate a technology in the short run, and underestimate it in the long run.
  
Josh

September 20, 2013 at 8:35 am

Yes, there is a lot of hype about big data and so surely it will be seen to be overblown some day (though Greg has a good point regarding when that is likely to happen).

But I am surprised that you say the edge between Type A and Type B models is marginal. I suspect there are lots of circumstances where it is very big and will become obvious.

- Cathy O'Neil, mathbabe
  
  September 20, 2013 at 8:50 am
  
  Not marginal to me and you. Marginally understood by the typical startup entrepreneur.
  
  - Tim
    
    September 22, 2013 at 3:19 pm
    
    Oh, I’ve gotta agree with your reply here Cathy. Marginally misunderstood by my previous startup CEO who was once quoted to trust her gut over her data and subsequently lost a well talented data team, only to begin her descent to the bottom grade data “jockeys”. Oh, not to mention, I’ve decided to take a class in “data science” but ready to quit it as well, since most of the folks there have no statistical grounding nor programming prowess and they’ll graduate calling themselves data scientists for employability’s sake.
    
  - john
    
    January 10, 2014 at 5:58 pm
    
    As a startup entrepreneur getting into big data, who do I surround myself with, or who should I be looking to, in order to know that we are setting up type B models and not type A models?
    
- Kevin
  
  September 20, 2013 at 2:29 pm
  
  Marginal doesn’t mean small. It means how much you gain due to the upgrade from A to B. http://en.wikipedia.org/wiki/Marginal_utility
  
Charlie Board

September 20, 2013 at 9:18 am

“there will emerge reasonable standards of what a data scientist should actually be able to do,”

Something along the lines of this? https://www.informs.org/Certification-Continuing-Ed/Analytics-Certification

Abe Kohen

September 20, 2013 at 9:43 am

Many people call themselves painters, but few can create a Mona Lisa or even a copy of a Rembrandt. Some can only paint a kitchen wall. But that does not take away from the true artist who calls her/him-self a painter.

Many people call themselves programmers. Some have credentials such as a degree in Computer Science. Others are self taught. Yet many write the same line of code 200 times, rather than a for loop (such was the case with the initial crop of DE Shaw’s Hyderabad hires). And yet some programmers are true magicians and are more productive than 100 less talented ones.

There are always going to be quacks in any new (or even pseudo-new) area. The shakeout will come, as you predict.

But there is much more data than just Google. Think NSA. Think financial data. Think credit bureaus. Think aggregators. And you don’t have to be a google to get your hands on google data – like the people who worked on reelecting Obama.

Victor3

September 20, 2013 at 10:10 am

I think the big data bubble may be more like one of those un-popable soap bubbles that my kids have that spend days stuck in the bushes till I pick them out or till the rain knocks them off. As long as there is money to be made, carpet baggers will squeeze every last dollar out of those they can fool into believing in their special sauce. Big data will remain useful for medical research, but has been imploding from the start in education despite claims to the contrary. When Rupert Murdoch and Bill Gates team up, lock your wallets in the trunk.

j2kun

September 20, 2013 at 10:15 am

On a related note, there is a workshop going on at Berkeley about coming up with a theory of data science, or rather, trying to address the question of what such a theory should say. I’m quite interested to see what big insights come out of the workshop, and we can hope that it will help with the formation of data science standards after the bubble bursts.

Dave Baum

September 20, 2013 at 10:51 am

Substitute “software” for “big data” and most of your observations still hold. Remember in the dot-com boom when anyone that could spell HTML was a programmer? Those days are gone, yet software engineering continues to grow. Most organizations figure out how to lurch along with crappy programmers, a few find a great programmer and manage not to drive them away. Even fewer organizations create a culture where great programmers can flourish. I’m convinced those latter organizations develop a competitive advantage in whatever business they are engaged in.

Google is a great example. Yes, “search” is essentially a big data problem. It is also a constantly moving target, and the hardware and software needed to support it is astounding. They couldn’t succeed with crappy engineering teams, and they know it.

(Disclaimer: I work at Google so I’m biased. But prior to Google I spent many years working in organizations that didn’t understand software. I’ve seen both sides and the difference is incredible.)

Perhaps big data will shake out the same way, perhaps not. But as long as there is economic value in data analysis the field is not going to disappear.

Patrick Morrison

September 20, 2013 at 12:04 pm

“People have invested some real money in X, and they’ve gotten burned with a lack of medium-term results.”

I’ve seen the same phenomenon happen where X = PC’s, LAN’s and the Web. In the medium term there have been downturns, but over the long run all have survived, grown and prospered. Where ideas are hard to understand and easy to mis-convey, there will always be snake oil salesmen… but as long as there’s real value in there someplace, there will be a market.

Zathras

September 20, 2013 at 1:08 pm

One of the hallmarks of a bubble is the rapid expansion of people who claim to be experts. Several people have said that the sign of the decadence of the tech bubble was that taxi drivers were giving stock tips on which tech stock to buy.

Where do you see this today? You cannot understand the state of the big data bubble without looking at the visualization industry. The visualization folks claim to bring big data to the level that anyone can do it. I have heard this sales pitch now from SAS, Tableau, and SAP. It’s all BS. And bringing the big data capabilities to the masses is the equivalent of the taxi driver giving stock tips on tech companies. It means we are near the end, and soon we will have to pay the piper.

Zathras

September 20, 2013 at 1:16 pm

“There really is a difference between A) crude models that someone constructs not really knowing what they’re doing and B) thoughtful models which gain an edge along the margin. It requires someone who actually knows what they’re doing to get the latter kind of model. But most people are unaware of even the theoretical difference between type A and type B models, nor would they recognize which type they’ve got once they get one.

Even so, over time, type B models outperform type A models, and if you care enough about the marginal edge between the two types, say because you’re in a competitive environment, then you will absolutely need type B to make money. ”

This is all true, but the real question is whether the decision-makers can actually tell which model is performing better. It is not at all obvious which side would win this dispute. What if the builders of the Type A models communicate better to executives than the builders of Type B? Type B might be better, but if you have professional salesmen for Type A, who wins? I have very low confidence in executives’ being able to tell the difference.

Muradin Bronzebeard

September 20, 2013 at 1:35 pm

I have genuine interest in this field and am generally glad more resources are surfacing. I took a few courses, but I do not dare call myself an expert; rather, I am now lost. I know enough to be overwhelmed and not know where to go next.

So can you attribute to the standards and guide me to what do I need to know/do/be in order to produce and tell the difference between A and B ?

Thomas Nyberg

September 20, 2013 at 2:05 pm

Sounds like great timing for me to try to break into this industry! 😦

Actually I think I’d be happier with the industry in general if it came down to earth a bit. It would be unfortunate if suddenly it turned into cheap one-off data consulting jobs as you describe, but if that’s the value of the work provided (which seems right if the market is flooded with fakes), then I guess that’s the pay that would be deserved. Of course it makes it harder for people to prove themselves worthy of more money/respect/whatever.

Balance would come back to the system eventually though there would probably be fewer opportunities as a whole as many people throwing money wildly now don’t come back to the game. Hopefully the storm isn’t too bad.

medicalquackblog

September 20, 2013 at 2:05 pm

Good article and in the software development area we have an explosion of this as well with what I call “Cash for Code”, open up an incubator, hang out some carrots and “write code for our platform”..same model seems to be creeping up the ladder as I see it:) One thing to add as well is that coding is being done on “platforms” and is not from the ground up which means that you are relying on 1, 2, 3, 4 or maybe more platforms underneath yours to be accurate and effective, and sometimes that works and sometimes it doesn’t, but the end consumer will focus on you when there’s problems. I had that with a customized video simple platform I wrote (actually just modified) with a video company. My stuff was solid but the platforms beneath had issues and what I wrote was not any good until they fixed theirs and don’t think I didn’t catch you know what over it:)

Big companies are cashing in on this and even if the platform written is a little rough, they will buy and perfect it for a few thousand or so and the programmer gets his few thousand dollars by winning the carrot and they are left to start all over with writing another carrot for some other firm, that is until they burn out and can’t pay their rent anymore:)

Jim Bender

September 20, 2013 at 2:07 pm

I might define “big data” as the vast amount of information collected that in some sense violates user privacy when interacting with websites that do the collection. There are many sites that do the collection and exploit the data. Even the Weather Channel website knows about me and when I interact with the site from a new device, they immediately know my home location. You have huge amounts of data about users and save it in non-relational databases. I am not sure that even modeling is what we might be doing, but rather we would be finding new and innovative ways to exploit data that we collect. Do we need a PhD in Math or do we just need clever programmers (the difference between “neats” and “scruffies”)?

Lou Puls (@MonkeeRench)

September 20, 2013 at 2:11 pm

The root of the problem seems to me to be the ad hoc and highly subjective “foundations” of widely accepted and appallingly-misused statistical inference, whether it’s based on 200 years of lack of understanding of Frequentist vs. Bayesian inference, or the more recent idiotic compromises between Fisher’s “significance testing” vs Neyman-Pearson “hypothesis testing”.

“All models are wrong, but some are useful.” — George Box.

Randomly useful, perhaps?

medicalquackblog

September 20, 2013 at 4:52 pm

I wrote my opinion on this too as it’s right up there with the same thing with programming except the next level down if you will..anyway I got a comment back from one I know who works for a huge Fortune 500 company in network/server/database administration and he’s one of the top folks that takes care of all their world wide data..interesting as he’s looking at the maintenance and size of the data he has to maintain.. His quote:

“If this is true, then I know where all of the users are coming from who access my Oracle Reporting Databases using such poor SQL that it requires 3-4 times the capacity that would be necessary if someone who really understood how to properly structure a query or write non-abusive SQL.”

csrollyson

September 20, 2013 at 6:01 pm

@cathy, thanks for your observations. I am not in the field but have been studying it intensely because my passion is using social business for digital transformation, which led me to found the Chief Digital Office. I have significant experience with enterprise transformation, so these adoption patterns feel very familiar to me. My studies are in line with your viewpoint here, but I am optimistic. There are people who will do it right, but they are in the severe minority. I perceive that orgs see that they have all this data, and they dream of “machine intelligence” to help them provide “customer experience.” This will not be the magic that people believe. Machines can’t *touch* people, but analytics can organize and deliver relevant information to people who can touch customers. The firms that succeed will learn to channel many of their touches through people. I like the “lean data” idea, and using social business pilots to develop and validate big data hypotheses as to what kind of customer outcomes we want to nurture using our data. It’s funny, but few execs have experienced deep interaction/collaboration online, so they have no concept of how easy and powerful it is. They are accustomed to “designing for a concept” without involving the people whom they’re supposedly trying to serve! That’s unnecessary now. I riff longer on social/big data at chiefdigitaloffice.com/bigdata

isomorphismes

September 20, 2013 at 7:43 pm

Wow, it barely had any time to inflate yet.

isomorphismes

September 20, 2013 at 7:44 pm

Cathy, do you think A.N.T. has any applications to data analysis? Maybe you’ve already answered that in another post.

isomorphismes

September 20, 2013 at 7:45 pm

https://twitter.com/gappy3000/status/381032868085891072

Sounds like then we’re back to “statistician”.

Michael Edesess

September 21, 2013 at 2:54 am

Cathy, I wonder if you or someone else can provide a reference to or provide a quick definition of Type A and Type B models for those of us who haven’t even taken a machine learning course. It didn’t respond to a Google search.

E.L. Wisty

September 21, 2013 at 3:49 pm

Reblogged this on Pink Iguana and commented:
Calling a top in Big Data

Artem Kaznatcheev

September 21, 2013 at 8:19 pm

I shared your post on r/MachineLearning yesterday, and there has been some very strong discussion there. Most of it on the skeptical side, some of it asking you to give positive definitions of what a good data scientist is, etc. They also seem to be questioning how important consultants versus integrated employees (that also wear many other hats at the company they work) are to the big data ecology. I thought maybe some of those questions could serve as fodder for a future post? I would definitely like to see your answers to some of them.

Greta Roberts

September 22, 2013 at 9:14 am

Agreed! Great article. We’re starting to see some indication that the hype around no “Supply” of Data Scientists is in fact more of a “Demand” problem. Where organizations assume Data Scientists will not only solve the problems but come up with the problems to solve. Demand needs to come first – challenging problems that require Data Scientists to solve them. Our work with organizations shows there really isn’t enough Demand to justify the hype over the need for all these Data Scientists. Thoughts?

Tim

September 22, 2013 at 3:25 pm

The other thing I will have to mention is that whilst bigger companies (here in the UK) like Barclays are currently hiring data scientists in the UK and the expertise from quants in the UK, this race to the bottom as you describe and I agree with will ultimately create new ways of off-shoring IT just like they’ve done with their BI teams, especially as “data science” becomes commoditised.

I’ve left the startup world, only to want to create a startup of my own that, like you, uses data and quant models for the good of the human race. I can only hope to learn from your blog and read about your adventures so far.

a friend

September 24, 2013 at 8:01 pm

“Also, since there are absolutely no standards for what constitutes a data scientist, and anyone who’s taken a machine learning class at college can claim to be one, the data scientists walking around often have no clue how to actually form the right questions to ask anyway. They are lopsided data people, and only know how to answer already well-defined questions like the ones that Kaggle comes up with. That’s less than half of what a good data scientist does, but people have no idea what a good data scientist does.

Plus, it’s super hard to accumulate hard evidence that you have a crappy data science team. If you’ve hired one or more unqualified data scientists, how can you tell? They still might be able to implement crappy models which don’t answer the right question, but in order to see that you’d need to also have a good data scientist who implements a better solution to the right question. But you only have one. It’s a counterfactual problem.”

I LOL’d at this. I work for a well-known Big Data consultancy and on my current engagement we have a team that constructed a Monte Carlo model that takes distribution parameters and, after FOUR DAYS of simulations, returns those distributional parameters.

We are paid $$$ for this.
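
For anyone who wants to see why that is circular, here is a minimal sketch in Python. It is not the team's actual model: the normal distribution, the parameter values, and the function name `simulate_and_refit` are all assumptions made up for illustration. It draws samples from a distribution with known parameters, then estimates those same parameters back from the samples.

```python
import random
import statistics


def simulate_and_refit(mean, stdev, n_draws=100_000, seed=0):
    """Draw from Normal(mean, stdev), then estimate the parameters from the draws.

    Hypothetical illustration only: the distribution family and sample size
    are assumptions, not details taken from the comment above.
    """
    rng = random.Random(seed)  # seeded so the run is repeatable
    draws = [rng.gauss(mean, stdev) for _ in range(n_draws)]
    # Sample mean and sample standard deviation of the simulated draws.
    return statistics.fmean(draws), statistics.stdev(draws)


est_mean, est_stdev = simulate_and_refit(mean=10.0, stdev=2.0)
print("input:     mean=10.00, stdev=2.00")
print(f"recovered: mean={est_mean:.2f}, stdev={est_stdev:.2f}")
```

The recovered values land close to the inputs, and more draws only bring them closer, so the simulation tells you nothing you did not already type in.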
