
Is Big Data Evil?

October 27, 2011

Back when I was growing up, your S.A.T. score was a big deal, but I feel like I lived in a relatively unfettered world of anonymity compared to what we are creating now. Imagine if your SAT score decided your entire future.

Two days ago I wrote about Emanuel Derman’s excellent new book “Models. Behaving. Badly.” and mentioned his Modeler’s Hippocratic Oath, which I may have to restate on every post from now on:

  • I will remember that I didn’t make the world, and it doesn’t satisfy my equations.
  • Though I will use models boldly to estimate value, I will not be overly impressed by mathematics.
  • I will never sacrifice reality for elegance without explaining why I have done so.
  • Nor will I give the people who use my model false comfort about its accuracy. Instead, I will make explicit its assumptions and oversights.
  • I understand that my work may have enormous effects on society and the economy, many of them beyond my comprehension.

I mentioned that every data scientist should sign at the bottom of this page. Since then I’ve read three disturbing articles about big data. First, this article in the New York Times, which basically says that big data is a bubble:

This is a common characteristic of technology that its champions do not like to talk about, but it is why we have so many bubbles in this industry. Technologists build or discover something great, like railroads or radio or the Internet. The change is so important, often world-changing, that it is hard to value, so people overshoot toward the infinite. When it turns out to be merely huge, there is a crash, in railroad bonds, or RCA stock, or Pets.com. Perhaps Big Data is next, on its way to changing the world.

In a way I agree, but let’s emphasize the “changing the world” part, and ignore the hype. The truth is that, beyond the hype, the depth of big data’s reach is not really understood yet by most people, especially people inside big data. I’m not talking about the technological reach, but rather the moral and philosophical reach.

Let me illustrate my point by explaining the gist of the other two articles, both from the Wall Street Journal. The second article describes a model which uses information about people’s credit card purchases to direct online advertising at them:

MasterCard earlier this year proposed an idea to ad executives to link Internet users to information about actual purchase behaviors for ad targeting, according to a MasterCard document and executives at some of the world’s largest ad companies who were involved in the talks. “You are what you buy,” the MasterCard document says.

MasterCard doesn’t collect people’s names or addresses when processing credit-card transactions. That makes it tricky to directly link people’s card activity to their online profiles, ad executives said. The company’s document describes its “extensive experience” linking “anonymized purchased attributes to consumer names and addresses” with the help of third-party companies.

MasterCard has since backtracked on this plan:

The MasterCard spokeswoman also said the idea described in MasterCard’s April document has “evolved significantly” and has “changed considerably” since August. After the company’s conversations with ad agencies, MasterCard said, it found there was “no feasible way” to connect Internet users with its analysis of their purchase history. “We cannot link individual transaction data,” MasterCard said.

How loudly can you hear me say “bullshit”? Even if they decide not to do this because of bad public relations, there are always smaller third-party companies who don’t even have a PR department:

Credit-card issuers including Discover Financial Services’ Discover Card, Bank of America Corp., Capital One Financial Corp. and J.P. Morgan Chase & Co. disclose in their privacy policies that they can share personal information about people with outside companies for marketing. They said they don’t make transaction data or purchase-history information available to outside companies for digital ad targeting.

The third article talks about using credit scores, among other “scoring” systems, to track and forecast people’s behavior. They model all sorts of things, like the likelihood you will take your pills:

Experian PLC, the credit-report giant, recently introduced an Income Insight score, designed to estimate the income of a credit-card applicant based on the applicant’s credit history. Another Experian score attempts to gauge the odds that a consumer will file for bankruptcy.

Rival credit reporter Equifax Inc. offers an Ability to Pay Index and a Discretionary Spending Index that purports to indicate whether people have extra money burning a hole in their pocket.

Understood: this is all about money. It is, in fact, all about companies ranking you in terms of your potential profitability to them. Just to make sure we’re all clear on the goal, then:

The system “has been incredibly powerful for consumers,” said Mr. Wagner.

Ummm… well, at least it’s nice to see that it’s understood there is some error in the modeling:

Eric Rosenberg, director of state-government relations for credit bureau TransUnion LLC, told Oregon state lawmakers last year that his company can’t show “any statistical correlation” between the contents of a credit report and job performance.

But wait, let’s see what the CEO of Fair Isaac Corp., one of the companies creating the scores, says about his new system:

“We know what you’re going to do tomorrow”

This is not well aligned with the fourth part of the Modeler’s Hippocratic Oath (MHO). The article goes on to expose some of the questionable morality that stems from such models:

Use of credit histories also raises concerns about racial discrimination, because studies show blacks and Hispanics, on average, have lower credit scores than non-Hispanic whites. The U.S. Equal Employment Opportunity Commission filed suit last December against the Kaplan Higher Education unit of Washington Post Co., claiming it discriminated against black employees and applicants by using credit-based screens that were “not job-related.”

Let me make the argument for these models before I explain why I think they’re flawed.

First, in terms of the credit card information, you should all be glad that the ads coming to you online are so beautifully tailored to your needs and desires: it’s so convenient, almost like someone read your mind and anticipated you’d be needing more vacuum cleaner bags at just the right time! And in terms of the scoring, it’s also very convenient that people and businesses somehow know to trust you, know that you’ve been raised with good (firm) middle-class values and ethics. You don’t have to argue your way into a new credit card or a car purchase, because the model knows you’re good for it. Okay, I’m done.

The flip side of this is that, if you don’t happen to look good to the models, you are funneled into a shitty situation, where you will continue to look bad. It’s a game of chutes and ladders, played on an enormous scale.

[If there’s one thing about big data that we all need to understand, it’s the enormous scale of these models.]

Moreover, this kind of cyclical effect will actually decrease the apparent error of the models. If we forecast that you’re not credit-worthy, and your life sucks from then on (you have trouble getting a job or a credit card, and when you do get one you pay high fees), then you are far more likely to be a credit risk in the future.
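This feedback loop is easy to simulate. In the toy sketch below (all numbers are invented for illustration), the model labels low scorers as risky, and the act of denying them credit itself raises their default rate; the model’s apparent accuracy goes up without the model getting any better:

```python
import random

random.seed(0)

def simulate(feedback_penalty, trials=10000):
    """Fraction of people for whom the model's 'risky' label
    matches whether they actually default."""
    correct = 0
    for _ in range(trials):
        score = random.gauss(600, 100)   # hypothetical credit score
        predicted_risky = score < 550    # the model's forecast
        p_default = 0.2 if predicted_risky else 0.1
        if predicted_risky:
            # The denial itself worsens outcomes: higher fees, fewer options.
            p_default += feedback_penalty
        if (random.random() < p_default) == predicted_risky:
            correct += 1
    return correct / trials

print(simulate(0.0))  # apparent accuracy with no feedback effect
print(simulate(0.3))  # noticeably higher once denial makes life harder
```

The second number comes out higher than the first even though the forecasting rule is identical in both runs; the model manufactures some of its own accuracy.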

One last word about errors: it’s always scary to see someone on the one hand admit that the forecasting abilities of a model may be weak, but on the other hand say things like “we know what you’re going to do tomorrow”. It’s human nature to want something to work better than it does, and that’s why we need the MHO (especially the fifth part).

This all makes me think of the movie Blade Runner, with its oppressive sense of corporate control, where the seedy underground economy of artificial eyeballs was the last place on earth you didn’t need to show ID. There aren’t any robots to kill (yet) but I’m getting the feeling more and more that we are sorting people at birth, or soon after, to be winners or losers in this culture.

Of course, collecting information about people isn’t new. Why am I all upset about it? Here are a few reasons, which I will expand on in another post:

  1. There’s way more information about people nowadays than their Social Security Number; the field of consumer information gathering is huge and growing exponentially
  2. All of those quants who left Wall Street are now working in data science and have real skills (myself included)
  3. They also typically don’t have any qualms; they justify models like this by saying, hey we’re just using correlations, we’re not forcing people to behave well or badly, and anyway if I don’t make this model someone else will
  4. The real bubble is this: thinking these things work, and advocating their bulletproof convenience and profitability (in the name of mathematics)
  5. Who suffers when these models fail? Answer: not the corporations that use them, but rather the invisible people who are designated as failures.
  1. October 28, 2011 at 12:19 am

    Is chemistry evil? How about physics? This particular application of “Big Data” is likely to be of very little value to credit card customers, but appears to have potentially high value to credit card companies. I don’t think that makes “Big Data” evil.


    • October 28, 2011 at 8:50 am

      I don’t know enough about chemistry or physics to answer that. Do you think they are evil? Please explain! But even if they are (or aren’t), does that really interfere with my worries about Big Data?


      • October 28, 2011 at 10:22 am

        Chemistry, physics, and “Big Data” are all powerful tools that can be used both to help us and to hurt us. The tool itself is not “evil”; it’s the user of the tool who can apply it to perform “evil” deeds. This doesn’t interfere with your worries about the way Big Data can be used to hurt us.


  2. October 28, 2011 at 1:19 am

    “We know what you’re going to do tomorrow.” Hmmmm… I actually do know what most single 30-year-old New Yorkers will do tomorrow: bitch about the weather, then their boss, then leave work and drink while trying to get laid… Sort of a joke, but my point is that the models will perform pretty well for the majority and miss the outliers: the people who got low S.A.T. scores but can perform well above average. The models used are constrained by the need for computational tractability, e.g. linear regression with “big data” is still just linear regression, and the world isn’t governed by y = Ax + normal error. Moore’s law isn’t going to change this.
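The commenter’s point about linear regression can be made concrete. In this sketch (invented numbers), 95% of the population follows one linear rule and 5% follows a different one; an ordinary least-squares fit, computed by hand, looks fine on the majority and badly mispredicts the minority:

```python
import random

random.seed(1)

# Majority follow y = 2x + noise; a minority (the "low SAT score,
# high performer" cases) follow a different rule entirely.
xs, ys = [], []
for i in range(1000):
    x = random.uniform(0, 10)
    if i % 20 == 0:                              # 5% outliers
        y = 30 - 2 * x + random.gauss(0, 1)
    else:
        y = 2 * x + random.gauss(0, 1)
    xs.append(x)
    ys.append(y)

# Ordinary least squares for y = a*x + b, by hand.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

def rmse(points):
    """Root-mean-square error of the fitted line on a subset of points."""
    return (sum((y - (a * x + b)) ** 2 for x, y in points) / len(points)) ** 0.5

majority = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % 20 != 0]
outliers = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % 20 == 0]
print(rmse(majority), rmse(outliers))  # small error for the majority, large for the rest
```

Adding more data points drawn from the same mixture sharpens the fit to the majority; it does nothing for the 5% the linear form can’t represent.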


  3. October 28, 2011 at 10:44 am

    Some references on the technical side to begin with
    Scale: How Large Quantities of Information Change Everything
    Basics: Correlation
    http://scienceblogs.com/goodmath/2008/07/petabyte_scale_dataanalysis_an.php

    Then, as Nassim Taleb said:
    When we are dealing with time-series, we may never see the tails, because the rate of change of the underlying process may be significant in comparison with the frequency of tail events, and the tails may be fat, meaning second and higher moments don’t exist

    (roughly speaking … the first moment is the mean, the second central moment is the variance, …)

    So, if we don’t adjust our notions of significance upward to reflect the data volume, or we don’t check that assumptions such as independence of the random variables and finite moments (i.e. no fat tails) are explicit or justified (preferably both), the models will mislead.
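Roughly, “second and higher moments don’t exist” means summary statistics like the standard deviation never settle down, no matter how much data you collect. A minimal sketch, comparing draws from a normal distribution with draws from a Pareto distribution with tail index 1.2, whose variance is infinite:

```python
import random

random.seed(2)

def sample_std(draw, n):
    """Naive sample standard deviation of n draws from a distribution."""
    xs = [draw() for _ in range(n)]
    mean = sum(xs) / n
    return (sum((x - mean) ** 2 for x in xs) / n) ** 0.5

for n in (1000, 10000, 100000):
    thin = sample_std(lambda: random.gauss(0, 1), n)        # finite variance
    fat = sample_std(lambda: random.paretovariate(1.2), n)  # infinite variance
    print(n, round(thin, 3), round(fat, 3))
```

As n grows, the normal column converges to 1, while the Pareto column keeps jumping around and tends to grow, because single extreme draws dominate the whole sum. More data doesn’t rescue a statistic that has no population value to converge to.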

    And when the models are used in socially significant ways, the consequences of those errors could cause injustice.

    That doesn’t make modelling evil, but like other professions, data science needs to figure out an appropriate ethical code for its practitioners.


  4. October 31, 2011 at 3:44 pm

    I’ve usually felt that tools were neutral. They could be used for good or for evil. But I’m starting to wonder if – despite the fact that every tool could potentially be used for good – maybe some tools are slanted toward evil? Perhaps some tools are more tainted by evil than others. Maybe Big Data has good uses, but it seems like it is more weighted toward evil uses.

    Of course, this dilemma runs throughout any study of humanity or human behavior. How much can one study human behavior without influencing it? If I had a model to perfectly predict the stock market – and assuming I released it – wouldn’t it so deeply affect the way stocks were traded that I would, in the end, be changing the very behavior I was trying to scientifically study?

    I’m kind of an amateur on these matters so feel free to ignore this.


    • October 31, 2011 at 5:28 pm

      I want to argue that data science can be used to benefit humanity (and in passing that women are mathematical pioneers more often than they are given credit for) by citing an example from history. My heroine for this story is the great pioneering statistician, Florence Nightingale. Florence was by no means the only lady to go out to the Crimea to nurse the British wounded soldiers; Mary Seacole was another famous example. However the lady with the lamp was the one to collect and analyse data that demonstrated that the field hospitals were killing more British soldiers than the enemy was! She was also able to prove that this could be avoided by having good standards of hygiene. Although not the inventor of pie charts, she was the first to present any to parliament and thus persuade the government that investing in adequate medical care for soldiers would enable them to fight wars more cheaply. She went on to study sanitation and public health in India.

      Check the illustration at http://en.wikipedia.org/wiki/Florence_Nightingale#Statistics_and_sanitary_reform

      Many people pay tribute to her contribution to nursing and public health; few bother to note that she did so primarily by inventing techniques for the collection, analysis, and presentation of statistical data!


  5. FogOfWar
    November 5, 2011 at 11:43 pm

    I did not know that & it’s a wonderful example!



  6. February 24, 2014 at 5:50 pm

    Big Data means a big computing grid. This means distributed processing and/or distributed persistence – http://gridwizard.wordpress.com/2014/02/18/big-data-grid-computing-distributed-computation-vs-distributed-persistence/
    There’s nothing new in distributed processing (running code on multiple nodes); what’s new in Hadoop is really just MapReduce plus distributed persistence, which anyone can replicate with multiple database instances or a database cluster – it doesn’t have to be HDFS or NoSQL/MongoDB.
    There’s been a lot of confusion about what’s meant by Big Data and grid computing, and people lose sight of the point: it’s not the data platform, it’s what you run on top of the data platform that matters. And what runs on a big (or small) data platform, however sophisticated your BI, should come from good intuition.
    What’s happening now is that everyone is busy trying to figure out how to make Hadoop work, while neglecting the spirit of the analytic computations themselves. Besides, how complicated should a data platform be?
    It’s not the Big Data tool that’s evil – it’s the hype, the big name, that is.
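For what it’s worth, the MapReduce pattern at the heart of Hadoop is simple enough to sketch on a single machine; what Hadoop adds is running the same map, shuffle, and reduce steps across many nodes over distributed storage. A toy word count:

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    """Map step: emit a (word, 1) pair for every word in a line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle step: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce step: collapse each key's values into a total."""
    return key, sum(values)

lines = ["big data is big", "data is data"]
mapped = chain.from_iterable(mapper(line) for line in lines)
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 2, 'data': 3, 'is': 2}
```

Each stage only ever sees its own slice of the data, which is exactly what makes the pattern easy to spread across a cluster.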

