The dark matter of big data

June 25, 2014 Cathy O'Neil, mathbabe

A tiny article recently published in The Cap Times (hat tip Jordan Ellenberg) describes a big data model that claims to help filter and rank school teachers based on their ability to raise student test scores. I guess it’s a kind of pre-VAM (value-added model) filtering system, and if it was hard to imagine a more vile model than the VAM, here you go. The article mentioned that the Madison School Board was deliberating on whether to spend $273K on this model.

One of the teachers in the district wrote her concerns about this model in her blog and then there was a debate at the school board meeting, and a journalist covered the meeting, so we know about it. But it was a close call, and this one could have easily slipped under the radar, or at least my radar.

Even so, now I know about it, and once I looked at the website of the company promoting this model, I found a link to an article in which they name customers, for example the Charlotte-Mecklenburg School District of North Carolina. They claim they only filter applications with their tool and don’t make hiring decisions. Cold comfort for people whose applications were removed by some random black-box algorithm.

I wonder how many of the teachers applying to that district knew their application was being filtered through such a model? I’m going to guess none. For that matter, there are all sorts of application screening algorithms being regularly used of which applicants are generally unaware.

It’s just one example of the dark matter of big data. And by that I mean the enormous and growing clusters of big data models that are only inadvertently detectable by random small-town or small-city budget meeting journalism, or word-of-mouth reports coming out of conferences or late-night drinking parties with VCs.

The vast majority of big data dark matter is still there in the shadows. You can only guess at its existence and its usage. Since the models themselves are proprietary, and are generally deployed secretly, there’s no reason for the public to be informed.

Let me give you another example, this time speculative, but not at all unlikely.

Namely, big data health models arising from the quantified self movement data. This recent Wall Street Journal article entitled Can Data From Your Fitbit Transform Medicine? articulated the issue nicely:

A recent review of 43 health- and fitness-tracking apps by the advocacy group Privacy Rights Clearinghouse found that roughly one-third of apps tested sent data to a third party not disclosed by the developer. One-third of the apps had no privacy policy. “For us, this is a big trust issue,” said Kaiser’s Dr. Young.

Consumer wearables fall into a regulatory gray area. Health-privacy laws that prevent the commercial use of patient data without consent don’t apply to the makers of consumer devices. “There are no specific rules about how those vendors can use and share data,” said Deven McGraw, a partner in the health-care practice at Manatt, Phelps, and Phillips LLP.

The key is that phrase “regulatory gray area”; it should make you think “big data dark matter lives here”.

When you have unprotected data that can be used as a proxy for HIPAA-protected medical data, there’s no reason it won’t be. So anyone who stands to benefit from knowing health-related information about you – think future employers who might help pay for future insurance claims – will be interested in using big data dark matter models gleaned from this kind of unregulated data.

To be sure, most people nowadays who wear fitbits are athletic, trying to improve their 5K run times. But the article explained that the medical profession is on the verge of suggesting a much larger population of patients use such devices. So it could get ugly real fast.

Secret big data models aren’t new, of course. I remember a friend of mine working for a credit card company a few decades ago. Her job was to model which customers to offer subprime credit cards to, and she was specifically told to target those customers who would end up paying the most in fees. But it’s become much much easier to do this kind of thing with the proliferation of so much personal data, including social media data.

I’m interested in the dark matter, partly as research for my book, and I’d appreciate help from my readers in trying to spot it when it pops up. For example, I remember being told that a certain kind of online credit score is used to keep people on hold for customer service longer, but now I can’t find a reference to it anywhere. We should really compile a list at the boundaries of this dark matter. Please help! And if you don’t feel comfortable commenting, my email address is on the About page.

Categories: data science, modeling, statistics

Comments (28)

Thomas W. Dinsmore

June 25, 2014 at 8:07 am

By “data model”, I presume that you mean “predictive model.” A data model is something else:

http://en.wikipedia.org/wiki/Data_model

Organizations use predictive models to screen applicants for employment, credit, insurance and many other things. No surprises there. Nor is there anything wrong with the practice.

- Cathy O'Neil, mathbabe
  
  June 25, 2014 at 8:37 am
  
  The word “model” is certainly overloaded. I meant big data models that people claim are predictive. The problem is they often aren’t, and they are also often unintentionally discriminatory. That means in particular that there are potential very serious problems with the practice.
  
  - Thomas W. Dinsmore
    
    June 25, 2014 at 8:50 am
    
    A predictive model that predicts poorly is still a predictive model, just as a Yugo is still a car. Calling a predictive model a data model is simply wrong.
    
    - Cathy O'Neil, mathbabe
      
      June 25, 2014 at 8:53 am
      
      Yo, stay respectful. You are parsing my words wrong. Think “big data” model. I never say “data model” without the “big” in front.
      
    - JSE
      
      June 25, 2014 at 11:55 am
      
      A Yugo is indeed a car, but if my school district is spending three hundred grand on one, I’d like a little more public oversight.
      
  - Bill Raynor
    
    June 25, 2014 at 9:52 am
    
    Hello Cathy,
    
    What do you mean by “unintentionally discriminatory”? Discrimination is the purpose of prediction. If a predictive model can associate past behavior with increased probabilities of good or bad future outcomes, where is the rub?
    
    - Cathy O'Neil, mathbabe
      
      June 25, 2014 at 10:04 am
      
      I mean racist or sexist discrimination, the illegal kind. And I don’t think much of that is intentional, but it could easily sneak in. See for example this recent post.
      
    - Min
      
      June 25, 2014 at 11:49 am
      
      Suppose that you run a restaurant in an area with a visible minority that is discriminated against. This minority is on the whole poor. Every other restaurant in the area does not serve members of the minority, except those owned by minority members, and the restaurants that do serve minority members do not make as much money on average as the restaurants that do not, because their patrons cannot afford more expensive food.
      
      Data on customers will tell you that serving certain customers will be less lucrative than not serving them. The data only have to be correlated with minority membership to do so. Thus the data will advise you to discriminate against minority members, even though that discrimination is not built into the model being used.
      
    - Bill Raynor
      
      June 25, 2014 at 4:55 pm
      
      Min,
      That is hardly a big data prediction model. If I were considering entering a market, I’d be collecting data on neighborhood demographics and disposable income, along with food preferences and on-the-ground studies of what it takes to be successful there, and what the local spoilage and shrinkage rates are. That would be used to forecast an ROI for use in the inevitable judgement calls.
      
      I’d be surprised at any model that would “advise you to discriminate…” The situation you describe is a little data problem, akin to the “Law of Small Numbers” problem.
      
      Bill
      
Richard Foxworthy

June 25, 2014 at 8:33 am

This old TomTom scandal springs to mind – http://www.theguardian.com/technology/2011/apr/28/tomtom-satnav-data-police-speed-traps

- Cathy O'Neil, mathbabe
  
  June 25, 2014 at 8:38 am
  
  Great example, thanks!
  
Meta Brown

June 25, 2014 at 8:37 am

Here are two cases that really disturb me:

1) The news that telephone metadata is not viewed as private by the courts, because we share it with the phone company. There are obvious practical reasons why the phone company must know what numbers I call and when in order to provide service, just as my doctor has to see me naked. But my calls, like my health information, should be private.

2) Social media applications are following viewing activity on other sites through “share” icons. This kind of tracking was bad enough with third party advertisers, but we’re getting deeper and deeper into the creepy zone.

moosesnsquirrels

June 25, 2014 at 10:03 am

The irony here is quite rich. The original Progressives included many utopians who were very much in thrall with the sort of rationalized, Panglossian society that science, using statistics, would enable. This led to all sorts of disasters like the eugenics movement, but the idea was always using these techniques for the greater public good. Read Edward Bellamy’s Looking Backward some time.

Now, we’re living in a nightmarish world in which the rich are using statistics and the pseudo-science of Big Data to further a rentier economy in which we’re all charged for the privilege of just living.

Jon Awbrey

June 25, 2014 at 10:15 am

When you have a market — and Moneytheists are determind [sic❢ ] that all creation shalt never be anything but a market, then it becomes a pressing matter that every commodity shall have a market value, but it remains a matter of indifference to Moneytheists whether the market value has any significant relation to any conceivable actual value.

☞ Moneytheism

JSE

June 25, 2014 at 11:56 am

Cathy, I must protest: Madison is a small city, not a small town. A quarter-million people!

- Cathy O'Neil, mathbabe
  
  June 25, 2014 at 11:57 am
  
  Sorry! Duly corrected.
  
Manoel Galdino

June 25, 2014 at 3:02 pm

Have you seen this?
http://marginalrevolution.com/marginalrevolution/2014/06/measuring-worker-value-the-new-average-is-over.html

- Cathy O'Neil, mathbabe
  
  June 25, 2014 at 3:05 pm
  
  Perfect example, thanks!
  
Richard Séguin

June 26, 2014 at 12:24 am

Sorry, this is a little rambling. It’s getting late, I’m tired, and I have to shut down.

I too live in Madison, and I’m appalled by the possibility that the school board would use this software, let alone purchase it for that much money. When I saw the article a few days ago I went to the website of the software company, and learned — virtually nothing. There is so little information there about how their system works that it could be vaporware for all we know. And there is no information about licensing. Is this a one-time purchase, or does the $273,000 repeat every three years? Are there additional maintenance fees? Does the school district have to pay extra if they require changes? Will they even make requested changes specific for us, or is this like canned software? If it’s canned, that’s awfully expensive software. They do claim that the system may work better after a few years of use, which means that if we purchase it and don’t like it then they could keep us on the hook for years with the hope that it will improve, and by that time we would be so dependent on it that it would be difficult to cut ties with them. I know how that works. I’ve seen it. Is their database designed to obscure the meaning of data to the point where it is difficult to migrate it to someone else’s system if need be? I know how that works. I’ve seen it. I’m sure that the school board was given a really slick sales presentation by the company. I know how that works. I’ve seen it multiple times. And the customer, or rather the school board, was probably straining to think of what to ask about it, and doesn’t yet realize how inflexible it could be. The company makes vague reference to help with teacher development beyond hiring, but is that something that has to be paid extra for?

And worst of all is that dehumanizing 100 question multiple choice test.

Also must see:

http://readforeign.wordpress.com/2013/03/21/fishy-teachers-matching-site/

Richard Séguin

June 26, 2014 at 1:50 am

I can’t quite leave this alone tonight. As Cathy called attention to, we have to ask: who owns this data that they are collecting? It’s clear from what I’ve seen that the questions that prospective teachers are asked are stored on the company’s servers. Can they pass that information on to other parties (for profit or otherwise)? I’m guessing that their proprietary data model includes some psychological profiling. We have to also allow for the possibility of political profiling. Just what are the questions that the teacher candidates have to answer? Shouldn’t these be available to the affected community for evaluation? Who are the supposedly real teachers they consulted with, and who are the “university professors” they consulted with? Don’t we also need to know this in order to know where this company is coming from?

- Meta Brown
  
  June 26, 2014 at 7:45 am
  
  Similar issues come up with “plagiarism detection” software used on student work. Bill Franks wrote a post about such software being used on student work in his area. The software’s analysis determined that all the students entering AP English had plagiarized their work. The school believed this result and was about to record a plagiarism offense on every single student’s transcript.
  
  Some parents looked into what the software actually did. Each three word combination that the software had encountered more than once was tagged as an instance of plagiarism. Three words!! Every student was tagged as a plagiarist because the English language includes a lot of common three word combinations. At the time the post was written, the parents had succeeded in getting the offense removed from the official records, but had not been able to reverse a grade of 0% for the assignment.
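
To see why a rule like that ends up flagging nearly everyone, here is a minimal Python sketch of the behavior described above. It is only an illustration under stated assumptions, not the vendor's actual code: the function names and the two sample essays are hypothetical, and it reads "encountered more than once" as "appears in more than one submission".

```python
from collections import Counter


def trigrams(text):
    """Return every run of three consecutive words, lowercased."""
    words = text.lower().split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


def flag_repeated_trigrams(essays):
    """Map each essay name to the trigrams it shares with another essay.

    Assumption: a trigram is "flagged" when it occurs in more than one
    submission, which is one reading of the rule described in the comment.
    """
    seen = Counter()
    for text in essays.values():
        # Count each trigram once per essay, so repeats inside a single
        # essay do not count as matches against itself.
        seen.update(set(trigrams(text)))
    return {
        name: {t for t in set(trigrams(text)) if seen[t] > 1}
        for name, text in essays.items()
    }


# Hypothetical sample data: two unrelated sentences sharing a common idiom.
essays = {
    "student_a": "At the end of the day the hero returns home",
    "student_b": "The villain escapes at the end of the novel",
}
flags = flag_repeated_trigrams(essays)
for name, hits in flags.items():
    print(name, len(hits), "flagged trigrams")
```

Both hypothetical essays get flagged purely because they share the stock phrase "at the end of the", which is the failure mode the parents uncovered.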
  
  What’s more, the software evidently saves all the text it encounters, meaning that the developers were using the students’ work without consent, let alone compensation. What’s worse, once a student’s work is in that mill, future work which builds on, or is similar to, a previous assignment, is even more likely to be tagged.
  
  Here is the original post:
  
  The Dire Consequences of Analytics Gone Wrong: Ruining Kids’ Futures by Bill Franks
  
  http://smartdatacollective.com/billfranks/42699/dire-consequences-analytics-gone-wrong-ruining-kids-futures
  
  Bill is Chief Analytics Officer for Teradata and author of Taming The Big Data Tidal Wave
  
  - Cathy O'Neil, mathbabe
    
    June 26, 2014 at 7:55 am
    
    Great example!
    
Jon Awbrey

June 26, 2014 at 2:18 pm

Being a pragmatist and a systems thinker, one of the first questions I ask about any thing that homo faber fabricates is, “What is its purpose?”

When it comes to accounting systems, I always find myself recurring to the idea — of ancient origin, Plato, at least — that Max Weber applied to economics and society, namely, that the purpose of an accounting system is to provide a true account of reality, or persistent and pressing phenomena if you prefer that to “reality”.

Barbara Santry

July 5, 2014 at 4:14 pm

I just heard this week’s Slate Money and want to thank you and your panelists for a terrific show. I enthusiastically recommend it to any who read this comment. I heard your comments about health care big data. Health care reimbursement is my field and, while I appreciate your vigilance about privacy and our health care data, I’m with Felix that your jump to dystopia was not quite evidence based.

The telephone follow up that you cited is being done by hospitals (in the case of preventing preventable heart failure readmissions) or by health plans (in the case of reminding covered patients to get preventive or follow up care). Hospitals and health plans are the main focus of HIPAA, which has significant penalties. Any HIPAA breach is very serious, but these call lists are among the more benign bits of data in hospital and health plan data bases.

With respect to fitbits, you cite that the central data base with physiologic measures may not be covered by HIPAA. I suspect that this is the case. (There will be no fitbit industry if they have to be HIPAA compliant since it is expensive.) There is not enough predictive information that is not otherwise obtainable in the fitbit universe to be (IMHO) an employment discrimination problem. The not otherwise obtainable modifier is that excess weight is predictive of future health status, but any employer who wants to discriminate against overweight people can do so without fitbit data. The other detail is that fitbits are not FDA approved, so they cannot make claims that their measurements are accurate (and many probably are loaded with measurement error).

- Cathy O'Neil, mathbabe
  
  July 6, 2014 at 1:58 pm
  
  Thanks, I’m always hoping to be more informed about this stuff, it’s very important.
  
  However, I’d like to take issue with a few of your comments. First, whether or not the information available through non-HIPAA covered means is predictive is largely a matter of a) the amount of such data floating around (which can grow massively if and when we start putting fitbits on entire neighborhoods) and b) the extent to which quants and data scientists are paid to collect such data and create models. I don’t see anything preventing either of those things, so when you say “there is not enough predictive information” I’d say “yet.”
  
  Also, it’s very important to understand that measurements don’t have to be totally accurate to achieve a modest goal of removing people who are at high risk of future disease from an employment pool. There are so many applicants nowadays, the employers are actively looking for reasons to exclude and filter applicants. Why not this one? It doesn’t have to be perfectly accurate to give them a true incentive to use such filters. It just has to be cost-saving in aggregate, which is a much lower bar.
  
  Do you agree? I’d love to hear your thoughts.
  
  Cathy
  
  - Meta Brown
    
    July 6, 2014 at 2:07 pm
    
    Models don’t actually have to be right to be used. Even if the data is of poor quality and the model is junk, someone may use it. That’s the essence of discrimination.
    
  - bsantry
    
    July 6, 2014 at 6:51 pm
    
    Thanks for the reply. I will try to respond with a public policy flavor, to my best ability. Of course, if every employer knew every blood pressure every job candidate ever had, the risk, while tiny, is greater than zero. Let me say that I believe that job discrimination is a real problem that cries out for public policy response. I will discuss the public policy response in place with ACA at the end of the post.
    
    My assumption is that the physiologic measurements in these fitbit devices are pulse, respiratory rate and blood pressure. Let us have a thought experiment about blood pressure. Blood pressure rises with age and we have risk assessment guidelines about which factors, in addition to the measurement itself, would lead a clinician to begin anti-hypertensive drugs. The major factors that add risk (beyond the actual measurement) are other evidence of cardiovascular disease, diabetes, kidney failure, and being African American. Treatment with anti-hypertensives reduces an individual patient’s risk of cardiovascular related mortality, but successfully treated people with the risk factors have a higher risk than those without them.
    
    If I wanted to discriminate and I had a massive data base of blood pressure, what would I do?
    
    1. I would try, and fail, to make sense of the within day variations. All research about how blood pressure is related to risk is with a periodic (usually semi-annual) measurement taken after a patient has been at rest for a while and then is sitting with both feet on the floor, during normal business hours. There is no research basis for making risk-related sense of the variations which occur when standing and during exercise and the like. (This has been studied with respect to whether continuous blood pressure measurements over 24X7 periods would help in managing hypertension, due to “white coat hypertension” or unusually high measurements due to the stress of being measured. These trials did show that some people, about 20%, do have white coat hypertension, but they failed to show that knowing it gave a clinical benefit in treating hypertension. This failure is probably due to clinicians considering enough other factors whose significance muted that of the somewhat elevated measurement.)
    
    2. If I just look at the highest blood pressures, I will produce a population of older people with many African Americans in it. I already knew that health care spending rises with age and that the major sources of working life morbidity are cardiovascular disease and cancer and that they are more common in middle aged and older people and in African Americans. Knowing blood pressure likely hasn’t added anything beyond what I already knew about aging and the reduced treatment access of African Americans.
    
    3. I am not able to distinguish a successfully treated person from a person who does not need treatment. The successfully treated person has a higher risk (lower after treatment, but still higher because there is an underlying disease process) in comparison with the person with a similar, but untreated blood pressure. This matters because the successfully treated are a large portion of the middle aged population. About 60% of middle aged men are within the treatment guidelines and about half of them (30% of the total) are currently treated.
    
    I could go on. But what you see is that the very real problem in giving all Americans a fair chance at employment is not made worse by massive blood pressure data bases. Since public policy energy is limited, I would want to focus it away from safeguarding fitbit data bases and toward the real problem.
    
    BTW, we have made very significant recent progress on the real problem. ACA has made a huge and positive change in the economics behind employment discrimination related to expected health expenses. It will take a while for bad employers to learn, but the mechanics have already happened.
    
    To go into the poorly understood weeds. ACA requires health plans to “community rate” all policies marketed to small employers. These policies had, in most states, been “experience rated”. Small employers cover about 80% of our work force. ACA requires health plans to take all applicants and ACA allows premiums to be differentially influenced only by age (limited to a 3 to 1 ratio), premium rating area, family composition, and tobacco use. What this means is that every small employer will pay the same relative premium for a 45 year old worker with the same family composition, whether or not his other workers have experienced higher or lower actual health care spending, and whether or not that worker has experienced higher or lower actual spending, or has higher or lower risk of such spending in the future. This change has significant economic benefit. It replaces what we had been doing which was making the small employers subsidize the large employers, while also being subject to unpredictable and significant premium hits if an employee got sick.
    
    - Cathy O'Neil, mathbabe
      
      July 6, 2014 at 9:49 pm
      
      I don’t think that’s what would happen at all. In the world where you have that kind of data, you can infer all kinds of things – like how much exercise you get, when you get stressed out and how often, and various behavioral patterns – which you can then correlate to long-term health outcomes.
      
      In other words, I think you are assuming that big data miners would just attempt to replicate what epidemiologists already study. But they won’t. Instead, they will look at the micro level of human behavior, much like they already do when they measure your clicks and your “attention minutes” on a given website, and they will build complicated and intricate models of what you might do next. They will segment the population based on these behaviors, some of which I can’t anticipate and in fact nobody can. The more data they have the better these kinds of models will get, as in more predictive.
      
      And the good news is, such data mining can save lots of lives. The worse news is that, without some rules of usage, we could easily see it being used unfairly against people.
      