## How to Lie With Statistics (in the Age of Big Data)

February 3, 2014

When I emailed my mom last month to tell her the awesome news about the book I’m writing, she emailed me back the following:

i.e, A modern-day How to Lie with Statistics (1954), avail on Amazon
for \$9.10.  Love, Mom

That was her whole email. She’s never been very verbose, in person or electronically. Too busy hacking.

Even so, she gave me enough to go on, and I bought the book and recently read it. It was awesome and I recommend it to anyone who hasn’t read it – or read it recently. It’s a quick read and available as a free pdf download here.

The goal of the book is to demonstrate all the ways marketers, journalists, accountants, and sometimes even statisticians can bias your interpretation of statistical facts or even just confuse you into thinking something is true when it’s not. It’s illustrated as well, which is fun and often funny.

The author shows, for example, how graphs can be drawn to mislead. My favorite, because it happens to be my pet peeve, is the “growth chart” where the y-axis runs from 1400 to 1402, so things look like they’ve grown a huge amount because “0” isn’t represented anywhere. Or, of course, the chart that has no numbers at all, so you don’t know what you’re looking at.
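To make the truncated-axis trick concrete, here is a minimal sketch in Python. Only the 1400 to 1402 axis range comes from the example above; the two data values and the function names are made up for illustration.

```python
def apparent_ratio(start, end, axis_min):
    """Ratio of the drawn bar heights when the y-axis begins at axis_min."""
    return (end - axis_min) / (start - axis_min)


def actual_growth(start, end):
    """Relative change between the two values, independent of any chart."""
    return (end - start) / start


# Hypothetical readings that would fit on a 1400-to-1402 axis.
start, end = 1400.5, 1401.5

truncated = apparent_ratio(start, end, axis_min=1400)  # bars drawn from 1400
honest = apparent_ratio(start, end, axis_min=0)        # bars drawn from 0
growth = actual_growth(start, end)

print(f"bar height ratio, axis from 1400: {truncated:.2f}x")
print(f"bar height ratio, axis from 0:    {honest:.4f}x")
print(f"actual growth:                    {growth:.4%}")
```

On the truncated axis the second bar is drawn three times as tall as the first, even though the underlying quantity grew by well under a tenth of a percent.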

There are a few things that don’t translate. For example, he makes a big deal of how people say “average” without specifying whether they mean “arithmetic mean” or “median.” Nowadays “average” is usually taken to mean the former (am I wrong?).
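A quick sketch of why the mean/median distinction still matters, using Python's standard `statistics` module. The income figures are invented for illustration: nine people earning the same modest salary plus one extremely high earner.

```python
import statistics

# Hypothetical incomes: nine ordinary earners and one outlier.
incomes = [40_000] * 9 + [1_000_000_000]

mean_income = statistics.mean(incomes)      # pulled far upward by the outlier
median_income = statistics.median(incomes)  # unaffected by the outlier

print(f"mean:   {mean_income:,.0f}")
print(f"median: {median_income:,.0f}")
```

Both numbers are a legitimate “average” of the same ten incomes, but only the median describes what a typical person in the group earns.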

It’s also fascinating to see how culture has changed. Many of his examples involving race, and the issues around women, would be very different nowadays. And the idea that you could run a randomized experiment that gives half the people polio vaccines and withholds them from the other half, when polio is a real threat that leaves children paralyzed, is really strange.

Also, many of the examples (there are hundreds) refer to the Great Depression and the recovery since then, and the assumptions of 1954 are bizarrely different from what you see in 2014 (and, I’d guess, from how it will be in 2024, but I hope I’m wrong). Specifically, it seems that many of the lies people propagated with statistics were meant to downplay their profits so as not to seem excessive. Can you imagine?!

One of the reasons I read this book, of course, was to see if my book really is a modern version of that one. And I have to say that many of the issues do not translate, but some of them do, in interesting ways.

Even the reason that many of them don’t is kind of interesting: in the age of big data, we often don’t even see charts of data so how can we be misled by them? In other words, the presumption is that the data is so big as to be inaccessible. Google doesn’t bother showing us the numbers. Plus they don’t have to since we use their services anyway.

The most transferable tips on how to lie with statistics probably stem from discussions on the following topics:

• Selection bias (things like, of the people who responded to our poll, they are all happy with our service)
• Survivorship bias (things like, companies that have been in the S&P for 30 years have great stock performance)
• Confusing people about topic A by discussing a related but not directly relevant topic B. This is described in the book as a “semi-attached figure”
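The survivorship-bias item can be illustrated with a small simulation. This is a sketch under invented assumptions, not a model of any real index: every firm gets yearly returns drawn uniformly from -30% to +30% (zero on average), and a firm counts as a survivor only if it never lost half its starting value. All names and parameters here are hypothetical.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

YEARS = 30
N_FIRMS = 5000
DELIST_BELOW = 0.5  # hypothetical rule: dropped once value falls below half


def simulate_firm():
    """Return (final_value, survived) for one firm starting at value 1.0."""
    value = 1.0
    survived = True
    for _ in range(YEARS):
        value *= 1.0 + rng.uniform(-0.3, 0.3)  # yearly return, zero on average
        if value < DELIST_BELOW:
            survived = False  # would have left the index at this point
    return value, survived


firms = [simulate_firm() for _ in range(N_FIRMS)]

all_mean = sum(value for value, _ in firms) / len(firms)
survivors = [value for value, survived in firms if survived]
survivor_mean = sum(survivors) / len(survivors)

print(f"firms that survived {YEARS} years: {len(survivors)} of {N_FIRMS}")
print(f"mean final value, all firms:      {all_mean:.2f}")
print(f"mean final value, survivors only: {survivor_mean:.2f}")
```

No firm in this toy world has any edge, yet looking only at the ones still standing after 30 years makes the group look like it performed well, because the losers were filtered out before anyone measured.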

The last one is the most relevant, I believe. In the age of big data, and partly because the data is “too big” to take a real look at, we spend an amazing amount of time talking about how a model is measuring something we care about (teachers’ value, or how good a candidate is for a job) when in fact the model is doing something quite different (test scores, demographic data).

If we were aware of those discrepancies we’d have way more skepticism, but we’re intimidated by the size of the data and the complexity of the models.

A final point. For the most part that crucial big data issue of complexity isn’t addressed in the book. It kind of makes me pine for the olden days, except not really if I’m black, a woman, or at risk of being exposed to polio.

UPDATES: First, my bad for not understanding that, at the time, the polio vaccine wasn’t known to work, and might even have been harmful, so of course there were trials. I was speaking from the perspective of the present day, when it seems obvious that it works. For that matter, I’m not even sure the vaccine being tested was the particular one that ended up working.

Second, I showed my mom this post and her response was perfect:

Glad you liked it! Love, Mom

Categories: musing, statistics
1. February 3, 2014 at 8:03 am

Excellent commentary. Thanks!

2. February 3, 2014 at 8:18 am

Mean, median or mode: I’ve seen all used to mean ‘average’. The biggest problem is that the appropriate meaning is not used. The classic example is using mean for ‘average’ income when Bill Gates is passing through.

3. February 3, 2014 at 9:35 am

A classic. The syntax of the title of my book is an intentional echo of this one.

Interesting stuff from Andrew Gelman on Huff and tobacco statistics:

• February 3, 2014 at 5:53 pm

By way of warning others – I remember reading this post from Andrew Gelman when it first went up; it totally ruined my enjoyment of HTLWS and Huff’s other book How To Take a Chance.

4. February 3, 2014 at 9:36 am

If I were you, I’d be tempted to lead your book with the HG Wells quote that Huff put 2nd in his: “Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.”

5. February 3, 2014 at 10:20 am

“My mom the hacker”…that is so funny, and how many can claim that? It’s cute :) You are right on target as usual, and yes, the “perception” issues that numbers are creating, be it in a survey, a news article, etc., are messing with people big time. I spent 25 years in sales, and an old sales manager from years ago told me, “you want to make that sale, throw numbers at them,” so that concept has been around for years. And yes, I did throw numbers out there, harmless of course compared to today’s world. That book was written the same year I was born, so you make the case that this process is not new.

I’m sure you have noticed how the world of journalism is changing too: a lot of reporters are now on pay-for-performance compensation, based on clicks. So I say it’s not really their fault but rather the business model they have to work with to keep their jobs. Online newspapers are struggling to find revenue streams, and yesterday, in addition to what I have been pounding on at my blog, one of the longtime journalists at the LA Times wrote about it too. He’s a former Pulitzer Prize winner and he’s concerned too.

“Supply of news is dwindling amid the digital media transformation”

I think Bezos from Amazon bought the Washington Post to make sure there’s at least one news agency that will carry enough news; of course they are a public company and the stock bots read the news, so there have to be enough news sources for the stock bots to read too :) If you look at journalists that are leaving major papers, they are hiring more statisticians to write their news now; Nate Silver is a good example. The control over consumers is massive here, and so we end up with news where a reporter interviews a database, or the OMG news that jerks your emotional jugular by being so outrageous, as news agencies no longer have the budgets for good investigative reporting.

People are getting confused, as I see it, and can no longer tell where the virtual software worlds of stats and so on leave off and where the real world kicks in, and the movie HER kind of brings this to light. Heck, the trailer is enough to get the idea. When I can fry a Facebook like button in a skillet for breakfast, then I’ll know the virtual worlds have won :)

http://ducknetweb.blogspot.com/2014/01/movie-her-good-example-on-how-folks.html

Keeping the virtual world out there to create a lot of gray is profit, though; we know that and so do a lot of billionaires. Keep an eye on Facebook too, with their new partnership with Barclays in the UK…let’s get everybody on the web and get more folks writing code. That is happening anyway, but why do a bank and Facebook need to back a start-up? Well, they might get some cheap code and buy what somebody writes out there on the Facebook platform. That process has been going on for a while: they get everyone excited to participate, and we know that very few start-ups make it. Let’s say the developers don’t have enough to launch a company but wrote some good code that will have some use somewhere, so they get a big prize, recognition, and a couple thousand dollars for their code, and off they go to the next contest, if they can still pay their rent. I call it the cash for code module, and companies like Verizon, United Healthcare and a ton of others have been doing this for a while.

Anyway, perhaps one of these days somehow some of the “lies” will come to surface if we ever get anyone in government that doesn’t run for the hills when “math” is mentioned.

6. February 3, 2014 at 10:26 am

I think people use “average” to describe median often when talking about things that are ordered but not obviously numbered … “I guess I’m an average dater/it was an average meal”

7. February 3, 2014 at 10:28 am

Before you know whether a vaccine works, it is moral (and essential) to test it in a double-blind experiment. The control participants should be getting whatever is the current standard preventative treatment, which (if this is the first vaccine, or if universal vaccination is not standard) may very well be nothing. At the time of the experiment, not getting the vaccine didn’t make you worse off than not participating in the experiment at all.

• February 3, 2014 at 10:28 am

I do not believe that is the current standard however.

• February 4, 2014 at 8:16 am

The current standards of bio-ethics have changed, but valid statistical methods like RCT remain crucial. In a case like polio, I suspect the decision would be the same today. You have a high risk vaccine, with great potential to cause serious harm, that might provide no immunity or worsen the risks. Yet, if it works, it removes a serious danger. The early vaccine consisted of giving the child a very mild case of polio using a weakened live virus. That is high risk for both alternatives.

• February 4, 2014 at 5:52 pm

Here’s a 2011 HIV vaccine tested by the same method: the control group got a placebo,
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0024254

• February 4, 2014 at 8:54 pm

Oh yes it is. Vaccines are given to millions of healthy children–you want to be very very sure that not only does it work, but that it’s safe. The Jenny McCarthy autism stuff is bogus, but vaccines do have the capacity to harm–remember the improperly manufactured polio vaccine that gave hundreds of children polio. Until you know for sure that it’s both effective AND safe, the only ethical thing to do is to test it rigorously (and give half the children the chance to NOT be exposed).

8. February 3, 2014 at 11:05 am

I use the occasional example from How to Lie with Statistics, and the ideas of selection bias etc., when I teach Elementary Statistics at the college level. Many texts give little space to how to recognize non-randomness in surveys and experiments, and our syllabus leaves little time for it. Most students who take a “quantitative literacy” class end up in Statistics. So my question is: what should the modern college-educated person really know about data science? How much time should be spent critiquing studies? Should the Bayesian point of view have prominence? What would such a curriculum look like? Are there any texts suitable for general studies students?

9. February 3, 2014 at 11:24 am

How to Lie with Statistics is a classic book. Still valuable today as all those tricks are still in widespread use.

It probably can use an update but I am assuming your book will go far beyond that sort of thing. For instance, ratings for sub-prime were not “lying with statistics” in that sense but surely also important.

Other crucial subjects (that I think you already plan to address) include:
How use of models erodes their value

10. February 3, 2014 at 1:39 pm

Startup Employees Cash In Stock Options Early, about the Lending Club, from the WSJ today..employees got to sell their stock to Google as Google became a new investor..geez..done to reduce the pressure of going public…what next..

http://blogs.wsj.com/accelerators/2014/02/03/startup-employees-cash-in-stock-options-early/?mod=WSJBlog

11. February 4, 2014 at 12:44 am

What I like most about that book is that it absolutely, straightforwardly, tells you _how_ to lie with statistics.

12. February 4, 2014 at 7:06 pm

I read it…finally…very good and would like to embed it with a little bit about your post here Cathy as I have quite a bit of discussion going on about healthcare stats if that’s ok with you? The book is the same age as me:) Still get a kick out of “my mom was busy hacking”..priceless…

13. February 5, 2014 at 9:43 pm

I would suspect that when many people say “average” they mean “typical.”

14. February 7, 2014 at 1:42 am

Cathy, I am a former student of yours and read up on the site. Not too long ago, I implemented a pricing strategy at a company and tested its efficacy in a variety of ways. The results suggested the strategy was a moderate improvement on the previous pricing model. However, I should have lied with statistics. It turns out that in business no one cares what’s correct; it’s all about fishing for data that tells the story that makes you look like an all-time success. Silly me for trying to be rigorous and honest. Just my two cents on the warped world outside of dear Barnard.

15. February 15, 2014 at 1:04 pm

Be sure to include this detailed paper on problems with statistical significance testing. I had no idea!

Click to access orlitzky2012orm.pdf
