Statisticians aren’t the problem for data science. The real problem is too many posers

July 31, 2012 Cathy O'Neil, mathbabe

I recently was hugely flattered by my friend Cosma Shalizi’s articulate argument against my position that data science distinguishes itself from statistics in various ways.

Cosma is a well-read, broadly educated guy, and a role model for what a statistician can be, not that every statistician lives up to his standard. I’ve enjoyed talking to him about data, big data, and working in industry, and I’ve blogged about his blog posts as well.

That’s not to say I agree with absolutely everything Cosma says in his post: in particular, there’s a difference between being a master at visualizations for the statistics audience and being able to put together a PowerPoint presentation for a board meeting, which some data scientists in the internet start-up scene definitely need to do (mostly this is a study in how to dumb stuff down without letting it become vapid, and in reading other people’s minds in advance to see what they find sexy).

And communication skills are a funny thing: in my experience, communicating with an academic or a quant is a different kettle of fish from communicating with the Head of Product. Each audience has its own dialect.

But I totally believe that any statistician who willingly gets a job entitled “Data Scientist” would be able to do these things; it’s a self-selection process after all.

Statistics and Data Science are on the same team

I think that casting statistics as the enemy of data science is a straw man play. The truth is, an earnest, well-trained and careful statistician in a data scientist role would adapt very quickly to it and flourish as well, if he or she could learn to stomach the business-speak and hype (which changes depending on the role, and for certain data science jobs is really not a big part of it, but for others may be).

It would be a petty argument indeed to try to make this into a real fight. As long as academic statisticians are willing to admit they don’t typically spend just as much time (which isn’t to say they never spend as much time) worrying about how long it will take to train a model as they do wondering about the exact conditions under which a paper will be published, and as long as data scientists admit that they mostly just redo linear regression in weirder and weirder ways, then there’s no need for a heated debate at all.

Let’s once and for all shake hands and agree that we’re here together, and it’s cool, and we each have something to learn from the other.

Posers

What I really want to rant about today though is something else, namely posers. There are far too many posers out there in the land of data scientists, and it’s getting to the point where I’m starting to regret throwing my hat into that ring.

Without naming names, I’d like to characterize problematic pseudo-mathematical behavior that I witness often enough that I’m consistently riled up. I’ll put aside hyped-up, bullshit publicity stunts and generalized political maneuvering because I believe that stuff speaks for itself.

My basic mathematical complaint is that it’s not enough to just know how to run a black box algorithm. You actually need to know how and why it works, so that when it doesn’t work, you can adjust. Let me explain this a bit by analogy with respect to the Rubik’s cube, which I taught my beloved math nerd high school students to solve using group theory just last week.

First we solved the “position problem” for the 3-by-3-by-3 cube using 3-cycles, and proved it worked, by exhibiting the group acting on the cube, understanding it as a subgroup of $S_8 \times S_{12}$, and thinking hard about things like the sign of basic actions to prove we’d thought of and resolved everything that could happen. We solved the “orientation problem” similarly, with 3-cycles.

I did this three times, with the three classes, and each time a student would ask me if the algorithm is efficient. No, it’s not efficient, it takes about 4 minutes, and other people can solve it way faster, I’d explain. But the great thing about this algorithm is that it seamlessly generalizes to other problems. Using similar sign arguments and basic 3-cycle moves, you can solve the 7-by-7-by-7 (or any of them actually) and many other shaped Rubik’s-like puzzles as well, which none of the “efficient” algorithms can do.
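To make the sign argument concrete, here is a minimal Python sketch. It is an illustration only, not the proof from the class: the helper names (`cycle`, `sign`) and the piece labels 0 to 7 for corners and 0 to 11 for edges are arbitrary assumptions. It checks that a quarter turn of one face acts as a 4-cycle (an odd permutation) on the corners and as a 4-cycle on the edges, while a 3-cycle is even.

```python
def cycle(n, *points):
    """Permutation of range(n) sending points[0] -> points[1] -> ... -> points[0]."""
    perm = list(range(n))
    for a, b in zip(points, points[1:] + points[:1]):
        perm[a] = b
    return perm


def sign(perm):
    """Sign of a permutation (perm[i] is the image of i), via its cycle lengths."""
    seen = [False] * len(perm)
    result = 1
    for start in range(len(perm)):
        if seen[start]:
            continue
        length = 0
        j = start
        while not seen[j]:
            seen[j] = True
            j = perm[j]
            length += 1
        if length % 2 == 0:  # a cycle of even length is an odd permutation
            result = -result
    return result


# One quarter turn of a face: 4 corners cycle and 4 edges cycle (labels are arbitrary).
corner_turn = cycle(8, 0, 1, 2, 3)
edge_turn = cycle(12, 0, 1, 2, 3)
print(sign(corner_turn), sign(edge_turn))  # -1 -1

# A 3-cycle of corners is even, so it never disturbs the parity bookkeeping.
three_cycle = cycle(8, 0, 1, 2)
print(sign(three_cycle))  # 1

# Swapping two corners while leaving every edge alone has sign product -1.
swap_two_corners = cycle(8, 0, 1)
edges_untouched = list(range(12))
print(sign(swap_two_corners) * sign(edges_untouched))  # -1
```

Every face turn flips both signs at once, so the product of the corner sign and the edge sign stays at +1 in any position you can reach from the solved cube. A lone swap of two corners has product -1 and so can’t happen, which is part of why even moves like 3-cycles are the natural building blocks.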

Something I could have mentioned but didn’t is that the efficient algorithms are memorized by their users and are basically black-box algorithms. I don’t think people understand to any degree why they work. And when they are confronted with a new puzzle, some of those tricks generalize but not all of them, and they need new tricks to deal with centers that get scrambled with “invisible orientations”. And it’s not at all clear they can solve a tetrahedron puzzle, for example, with any success.

Democratizing algorithms: good and bad

Back to data science. It’s a good thing that data algorithms are getting democratized, and I’m all for there being packages in R or Octave that let people run clustering algorithms or steepest descent.

But, contrary to the message sent by much of Andrew Ng’s class on machine learning, you actually do need to understand how to invert a matrix at some point in your life if you want to be a data scientist. And, I’d add, if you’re not smart enough to understand the underlying math, then you’re not smart enough to be a data scientist.

I’m not being a snob. I’m not saying this because I want people to work hard. It’s not a laziness thing, it’s a matter of knowing your shit and being for real. If your model fails, you want to be able to figure out why it failed. The only way to do that is to know how it works to begin with. Even if it worked in a given situation, when you train on slightly different data you might run into something that throws it for a loop, and you’d better be able to figure out what that is. That’s your job.
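Here is a small sketch of the kind of failure that paragraph has in mind, using plain linear regression. Everything in it is an assumption made for illustration: the toy data, the function name `fit_two_features`, and the choice to solve the 2-by-2 normal equations by Cramer’s rule. Two features are nearly identical, so the matrix that has to be inverted is almost singular, and nudging the targets by 0.01 sends the fitted coefficients from about (1, 1) to about (-9, 11).

```python
def fit_two_features(x1, x2, y):
    """Least-squares fit of y ~ b1*x1 + b2*x2 (no intercept) via the normal equations."""
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    t1 = sum(a * c for a, c in zip(x1, y))
    t2 = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12  # determinant of the 2x2 matrix X^T X
    b1 = (t1 * s22 - t2 * s12) / det
    b2 = (s11 * t2 - s12 * t1) / det
    return b1, b2, det


# Toy data (made up): x2 is x1 plus a tiny wiggle, so the features are almost collinear.
x1 = [1.0, 2.0, 3.0, 4.0]
wiggle = [0.001, -0.001, 0.001, -0.001]
x2 = [a + w for a, w in zip(x1, wiggle)]

# Targets that are exactly x1 + x2, then the same targets nudged by +/- 0.01.
y = [a + b for a, b in zip(x1, x2)]
y_nudged = [c + 10 * w for c, w in zip(y, wiggle)]

b1, b2, det = fit_two_features(x1, x2, y)
c1, c2, _ = fit_two_features(x1, x2, y_nudged)

print("determinant of X^T X:", det)            # tiny compared with its entries (about 30)
print("coefficients, original targets:", b1, b2)  # close to (1, 1)
print("coefficients, nudged targets:", c1, c2)    # close to (-9, 11)
```

Someone who only knows which button to press sees two wildly different models and no explanation. Someone who knows that the fit goes through inverting $X^T X$ can look at the determinant, notice the two features carry the same information, and drop or combine them.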

As I see it, there are three problems with the democratization of algorithms:

1. As described already, it lets people who can load data and press a button describe themselves as data scientists.
2. It tempts companies to never hire anyone who actually knows how these things work, because they don’t see the point. This is a mistake, and could have dire consequences, both for the company and for the world, depending on how widely their crappy models get used.
3. Businesses might think they have awesome data scientists when they don’t. That’s not an easy problem to fix from the business side: posers can be fantastically successful exactly because non-data scientists who hire data scientists in business, i.e. business people, don’t know how to test for real understanding.

How do we purge the posers?

We need to come up with a plan to purge the posers: they are annoying and are giving data science a bad name.

One thing that will be helpful in this direction is Rachel Schutt’s Data Science class at Columbia next semester, which is going to be a much-needed bullshit-free zone. Note there’s been a time change that hasn’t been reflected on the announcement yet: it’s going to be once a week, Wednesdays for three hours starting at 6:15pm. I’m looking forward to blogging on the contents of these lectures.

Categories: data science, rant

Comments (18)

Clive Jones

July 31, 2012 at 9:41 am

MathBabe – good stuff, also other posts on challenges and requirements for a data scientist. Made mention of your blog and opinions at http://www.businessforecastblog.com

kwm

July 31, 2012 at 10:06 am

Speaking of visualizing data:
http://demonocracy.info/infographics/usa/derivatives/bank_exposure.html
The ‘risk’ is so high that Muppets can’t play.

EB

July 31, 2012 at 1:10 pm

No argument here that posers should be given some, ah, *friendly encouragement* to step up the rigor.

But it is both a fact and a good thing that many more people will use libraries than implement them. So while it’s important to know how the algorithms work, the necessary first step in defusing black box explosions is making sure people know how to evaluate model performance: are their packaged predictions reliable and significant?

An analogy: it would be great if all drivers understood internal combustion engines, but it would improve on the status quo if all simply knew how to check their oil and fluid levels.

James

July 31, 2012 at 7:29 pm

“you actually do need to understand how to invert a matrix at some point in your life”… ah, but you must also have the wisdom to not invert the matrix. 😉

http://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/

G Lau

August 2, 2012 at 12:03 pm

Reblogged this on Data Meaning….

Ryan

August 2, 2012 at 12:36 pm

Excellent article. The publicity stunts and self-promotion get old. We all know who these people are.

I do agree with you though that statistics and data science are different. It could be just my graduate program, but there is a stark difference between the two fields. Data scientists have to be competent coders with a background in algorithms and data structures IMHO.

rbjarnason

August 2, 2012 at 1:33 pm

Statisticians, mathematicians, computer scientists, data scientists … even the posers – we are all on the same side. And we are all going to have to come to terms with this.

The future demand for data-science skills will outstrip the available talent so badly that *most* companies simply won’t be able to find a “non-poser” data scientist (nyti.ms/yxv9zR). They are going to have to do the best they can with what they have. Most of the people calling themselves “data scientist” will do so for no other reason than that’s what the HR manager was instructed to hire.

Those who you call “posers” are doing the best they can. No, they don’t have the same math/statistics background as you. Instead, they’ve been given basic instructions for a limited set of tools and no time to catch up … even if they knew where to start. In the end, we can’t hate on them because there simply aren’t enough data scientists to go around.

This is a big playground. There are lots of sandboxes. We can all get along.

- uclamathguy
  
  August 2, 2012 at 4:45 pm
  
  I don’t think she is calling people who are simply less educated posers. There are a ton of people that play around with data with passion and want to learn more. I am pretty sure she is referring to bandwagoners: those that have found a soapbox in data science to talk big, but don’t actually produce. They make a lot of noise. If you are on Twitter, you’ve probably seen what she is referring to. I see it every day and it is irritating.
  
  Amateurs are fine. Amateurs that talk themselves up as professionals are not.
  
  - Cathy O'Neil, mathbabe
    
    August 3, 2012 at 6:56 am
    
    Thanks, UCLAMathGuy.
    
    I love amateurs. I’m a teacher, and I love teaching. I just don’t like people who pretend to be experts in stuff they don’t understand.
    
Ian

August 3, 2012 at 12:11 am

Reminds me of automobiles. There are 3 categories of people:
1. You have engineers who design the cars.
2. You have mechanics who can fix them
3. You have drivers who drive them

You’re claiming that data scientists all need to be engineers who are capable of building cars from the ground up. This was true a couple of years ago, but the technologies and frameworks have matured significantly.

I’d argue at this stage of the cycle, for most environments you only need mechanics: people who are familiar with the algorithms and can tweak them a bit, but are nowhere near capable of building one from scratch. Andrew Ng’s class isn’t about training engineers, it is about training mechanics. Ideally they are smart enough to know when to call in the engineers when they need help, but for most problems a mechanic is all that is needed.

As the field matures, and services like google analytics become more robust, regular drivers will be doing more and more of “data science” type problems, utilizing the building blocks designed by engineers, and maintained by mechanics.

- Cathy O'Neil, mathbabe
  
  August 3, 2012 at 6:58 am
  
  Do you think mechanics should build new cars? Would you trust those cars on the road? Do you think mechanics can actually be trusted to call in the engineers when they’re stuck? How about if their title is “engineer” even though they’ve only been trained as a mechanic?
  
- araybold
  
  August 5, 2012 at 10:20 pm
  
  Ian: General software development and ‘software engineering’ have followed the route you propose for several decades, trading rigor and in-depth understanding for ‘masses of asses’. It is one of the main reasons, I believe, why large development projects continue to get into serious trouble at a disturbing rate, and why egregious security errors are so common (consider the state of electronic voting, for example). This experience also shows, as Cathy suspects, that expert help is rarely sought even when it is needed; as the distinction between technician and engineer has been erased from the field, this is hardly surprising.
  
slime shady

August 3, 2012 at 10:58 pm

Bit of a tangent here, but I like to remind people that often jobs go to people because they DON’T know stuff and WON’T do certain things. Any hyped field is especially prone to this; to really be capable and willing to act on that is to threaten the field’s scam. If you highlight the difficulty in making bold pronouncements from data all the time, would one of these ‘big data’ businesses selling hope to ignoramuses really want you?

Jim

August 16, 2012 at 5:47 pm

I enjoyed reading this post. But I think the term you want is “poseur” not “poser” …

Rhiannon

August 19, 2012 at 10:06 am

I have been reading the literature lately on the use of Bayesian Belief Nets, from their infancy in the late 80’s through to the PR papers by what I call the “for profit risk analysts” like AgenaRisk. It is admirable to want to bring the methods into the realm of utility for people who don’t know what they really are, but caution and expertise are still needed. Now for example, I have had four co-workers in the last three months come to me and essentially say: “I have problem X and I need to solve it with BBNs! So, what’s a BBN?” I have to explain to them that telling me “BBNs will solve my problem” is like me telling them “C++ will solve my problem. So tell me again what C++ is and why it will solve my problem?” They are familiar with the term, and they know about things like “neural networks”, “Hidden Markov Models” and possibly, e.g., decision trees. None of them would have any clue whatsoever what a random effects or generalized linear model is, or how all of those things are related or possibly actually better suited to solve their problem. So they could build the best BBN in the world with the software out there and it could still be inefficient, inelegant, or inadequate.

Jennifer

September 2, 2012 at 11:12 am

Hi Cathy! I took your math course at Columbia just before you left for DE Shaw a few years ago. I totally agree about the posers and it’s nice to hear someone else say it for once. I’m taking Rachel’s course this semester so I hope I’ll get a chance to see you after all these years.

TC

October 16, 2012 at 9:22 pm

How about some forum for recognizing individuals, teams, companies, and academic departments that follow best practices with respect to analytical rigor, ability to produce accurate results, and a reputation for ethics and integrity? Publicly highlighting the best should serve to down-weight the worst.

Also consider how other professions handle this: word-of-mouth reputation (e.g. software developers, home contractors), test-based professional designations (e.g. actuaries), registration/certification (barbers, social workers, securities firm representatives), licensure & accreditation (doctors, hospitals, universities).

Is there a zero-sum continuum between (open source + many poseurs) versus (closed guild + fewer poseurs)? Or can we come up with a more clever, consistent, and objective way around this?

George

November 5, 2012 at 11:03 pm

Hi,

Definitely one of the poseurs :). I’m interested in the data science field, because I think the algorithms and analysis are really interesting, but I’d hardly say I had the requisite skills to do this field properly (I’m like a 5 year old looking at a coloring book). The one thing I’ve done is learn more statistics, learn programs like R, SQL and Python, and pick up some books on statistical learning and linear algebra.

That said :). I’m thinking that the best way to do this properly is through a master’s degree or even a Ph.D, because even my data analysis skills are immature and I’m constantly worried about making bad conclusions (I’m not even in Machine Learning).

So, my big question, coming from a non-CS, non-math bachelor’s background, what would you recommend I do to get experience in the field or do you think the best way is to get a master’s or Ph.D?

Do you think business approaches things differently when it comes to data science?

Thanks,

George.
