Search Results

Keyword: ‘big data’

Critical Questions for Big Data by danah boyd & Kate Crawford

I’m teaching a class this summer in the Lede Program, starting in mid-July, which is called The Platform. Here’s the course description:

This course begins with the idea that computing tools are the products of human ingenuity and effort. They are never neutral and carry with them the biases of their designers and their design process. “Platform studies” is a new term used to describe investigations into these relationships between computing technologies and the creative or research products that they help to generate. How you understand how data, code, and algorithms affect creative practices can be an effective first step toward critical thinking about technology. This will not be purely theoretical, however, and specific case studies, technologies, and project work will make the ideas concrete.

Since my first class is coming soon, I’m actively thinking about what to talk about and which readings to assign. I’ve got wonderful guest lecturers coming, and for the most part the class will focus on those guest lecturers and their topics, but for the first class I want to give them an overview of a very large subject.

I’ve decided that danah boyd and Kate Crawford’s recent article, Critical Questions for Big Data, is pretty much perfect for this goal. I’ve read and written a lot about big data but even so I’m impressed by how clearly and comprehensively they have laid out their provocations. And although I’ve heard many of the ideas and examples before, some of them are new to me, and are directly related to the theme of the class, for example:

Twitter and Facebook are examples of Big Data sources that offer very poor archiving and search functions. Consequently, researchers are much more likely to focus on something in the present or immediate past – tracking reactions to an election, TV finale, or natural disaster – because of the sheer difficulty or impossibility of accessing older data.

Of course the students in the Lede are journalists, not academic researchers, which the article mostly addresses, and moreover they are not necessarily working with big data per se, but even so they are increasingly working with social media data, and moreover they are probably covering big data even if they don’t directly analyze it. So I think it’s still relevant to them. Or another way to express this is that one thing we will attempt to do in class is examine the extent to which their provocations are relevant.

Here’s another gem, directly related to the Facebook experiment I discussed yesterday:

As computational scientists have started engaging in acts of social science, there is a tendency to claim their work as the business of facts and not interpretation. A model may be mathematically sound, an experiment may seem valid, but as soon as a researcher seeks to understand what it means, the process of interpretation has begun. This is not to say that all interpretations are created equal, but rather that not all numbers are neutral.

In fact, what with this article and that case study, I’m pretty much set for my first day, after combining them with a discussion of the students’ projects and some related statistical experiments.

I also hope to invite at least one of the authors to come talk to the class, although I know they are both incredibly busy. Danah boyd, who recently came out with a book called It’s Complicated: the social lives of networked teensalso runs the Data & Society Research Institute, a NYC-based think/do tank focused on social, cultural, and ethical issues arising from data-centric technological development. I’m hoping she comes and talks about the work she’s starting up there.

The dark matter of big data

A tiny article in The Cap Times was recently published (hat tip Jordan Ellenberg) which describes the existence of a big data model which claims to help filter and rank school teachers based on their ability to raise student test scores. I guess it’s a kind of pre-VAM filtering system, and if it was hard to imagine a more vile model than the VAM, here you go. The article mentioned that the Madison School Board was deliberating on whether to spend $273K on this model.

One of the teachers in the district wrote her concerns about this model in her blog and then there was a debate at the school board meeting, and a journalist covered the meeting, so we know about it. But it was a close call, and this one could have easily slipped under the radar, or at least my radar.

Even so, now I know about it, and once I looked at the website of the company promoting this model, I found links to an article where they name a customer, for example in the Charlotte-Mecklenburg School District of North Carolina. They claim they only filter applications using their tool, they don’t make hiring decisions. Cold comfort for people who got removed by some random black box algorithm.

I wonder how many of the teachers applying to that district knew their application was being filtered through such a model? I’m going to guess none. For that matter, there are all sorts of application screening algorithms being regularly used of which applicants are generally unaware.

It’s just one example of the dark matter of big data. And by that I mean the enormous and growing clusters of big data models that are only inadvertently detectable by random small-town or small-city budget meeting journalism, or word-of-mouth reports coming out of conferences or late-night drinking parties with VC’s.

The vast majority of big data dark matter is still there in the shadows. You can only guess at its existence and its usage. Since the models themselves are proprietary, and are generally deployed secretly, there’s no reason for the public to be informed.

Let me give you another example, this time speculative, but not at all unlikely.

Namely, big data health models arising from the quantified self movement data. This recent Wall Street Journal article entitled Can Data From Your Fitbit Transform Medicine? articulated the issue nicely:

A recent review of 43 health- and fitness-tracking apps by the advocacy group Privacy Rights Clearinghouse found that roughly one-third of apps tested sent data to a third party not disclosed by the developer. One-third of the apps had no privacy policy. “For us, this is a big trust issue,” said Kaiser’s Dr. Young.

Consumer wearables fall into a regulatory gray area. Health-privacy laws that prevent the commercial use of patient data without consent don’t apply to the makers of consumer devices. “There are no specific rules about how those vendors can use and share data,” said Deven McGraw, a partner in the health-care practice at Manatt, Phelps, and Phillips LLP.

The key is that phrase “regulatory gray area”; it should make you think “big data dark matter lives here”.

When you have unprotected data that can be used as a proxy of HIPAA-protected medical data, there’s no reason it won’t be. So anyone who wants stands to benefit from knowing health-related information about you – think future employers who might help pay for future insurance claims – will be interested in using big data dark matter models gleaned from this kind of unregulated data.

To be sure, most people nowadays who wear fitbits are athletic, trying to improve their 5K run times. But the article explained that the medical profession is on the verge of suggesting a much larger population of patients use such devices. So it could get ugly real fast.

Secret big data models aren’t new, of course. I remember a friend of mine working for a credit card company a few decades ago. Her job was to model which customers to offer subprime credit cards to, and she was specifically told to target those customers who would end up paying the most in fees. But it’s become much much easier to do this kind of thing with the proliferation of so much personal data, including social media data.

I’m interested in the dark matter, partly as research for my book, and I’d appreciate help from my readers in trying to spot it when it pops up. For example, I remember begin told that a certain kind of online credit score is used to keep people on hold for customer service longer, but now I can’t find a reference to it anywhere. We should really compile a list at the boundaries of this dark matter. Please help! And if you don’t feel comfortable commenting, my email address is on the About page.

The business of big data audits: monetizing fairness

I gave a talk to the invitation-only NYC CTO Club a couple of weeks ago about my fears about big data modeling, namely:

  • that big data modeling is discriminatory,
  • that big data modeling increases inequality, and
  • that big data modeling threatens democracy.

I had three things on my “to do” list for the audience of senior technologists, namely:

  • test internal, proprietary models for discrimination,
  • help regulators like the CFPB develop reasonable audits, and
  • get behind certain models being transparent and publicly accessible, including credit scoring, teacher evaluations, and political messaging models.

Given the provocative nature of my talk, I was pleasantly surprised by the positive reception I was given. Those guys were great – interactive, talkative, and very thoughtful. I think it helped that I wasn’t trying to sell them something.

Even so, I shouldn’t have been surprised when one of them followed up with me to talk about a possible business model for “fairness audits.” The idea is that, what with the recent bad press about discrimination in big data modeling (some of the audience had actually worked with the Podesta team), there will likely be a business advantage to being able to claim that your models are fair. So someone should develop those tests that companies can take. Quick, someone, monetize fairness!

One reason I think this might actually work – and more importantly, be useful – is that I focused on “effects-based” discrimination, which is to say testing a model by treating it like a black box and seeing how it works on different inputs and gives different outputs. In other words, I want to give a resume-sorting algorithm different resumes with similar qualifications but different races. An algorithmically induced randomized experiment, if you will.

From the business perspective, a test that allows a model to remain a black box feels safe, because it does not require true transparency, and allows the “secret sauce” to remain secret.

One thing, though. I don’t think it makes too much sense to have a proprietary model for fairness auditing. In fact the way I was imagining this was to develop an open-source audit model that the CFPB could use. What I don’t want, and which would be worse than nothing, would be if some private company developed a proprietary “fairness audit” model that we cannot trust and would claim to solve the very real problems listed above.

Update: something like this is already happening for privacy compliance in the big data world (hat tip David Austin).

Inside the Podesta Report: Civil Rights Principles of Big Data

I finished reading Podesta’s Big Data Report to Obama yesterday, and I have to say I was pretty impressed. I credit some special people that got involved with the research of the report like Danah Boyd, Kate Crawford, and Frank Pasquale for supplying thoughtful examples and research that the authors were unable to ignore. I also want to thank whoever got the authors together with the civil rights groups that created the Civil Rights Principles for the Era of Big Data:

  1. Stop High-Tech Profiling. New surveillance tools and data gathering techniques that can assemble detailed information about any person or group create a heightened risk of profiling and discrimination. Clear limitations and robust audit mechanisms are necessary to make sure that if these tools are used it is in a responsible and equitable way.
  2. Ensure Fairness in Automated Decisions. Computerized decisionmaking in areas such as employment, health, education, and lending must be judged by its impact on real people, must operate fairly for all communities, and in particular must protect the interests of those that are disadvantaged or that have historically been the subject of discrimination. Systems that are blind to the preexisting disparities faced by such communities can easily reach decisions that reinforce existing inequities. Independent review and other remedies may be necessary to assure that a system works fairly.
  3. Preserve Constitutional Principles. Search warrants and other independent oversight of law enforcement are particularly important for communities of color and for religious and ethnic minorities, who often face disproportionate scrutiny. Government databases must not be allowed to undermine core legal protections, including those of privacy and freedom of association.
  4. Enhance Individual Control of Personal Information. Personal information that is known to a corporation — such as the moment-to-moment record of a person’s movements or communications — can easily be used by companies and the government against vulnerable populations, including women, the formerly incarcerated, immigrants, religious minorities, the LGBT community, and young people. Individuals should have meaningful, flexible control over how a corporation gathers data from them, and how it uses and shares that data. Non-public information should not be disclosed to the government without judicial process.
  5. Protect People from Inaccurate Data. Government and corporate databases must allow everyone — including the urban and rural poor, people with disabilities, seniors, and people who lack access to the Internet — to appropriately ensure the accuracy of personal information that is used to make important decisions about them. This requires disclosure of the underlying data, and the right to correct it when inaccurate.

This was signed off on by multiple civil rights groups listed here, and it’s a great start.

One thing I was not impressed by: the only time the report mentioned finance was to say that, in finance, they are using big data to combat fraud. In other words, finance was kind of seen as an industry standing apart from big data, and using big data frugally. This is not my interpretation.

In fact, I see finance as having given birth to big data. Many of the mistakes we are making as modelers in the big data era, which require the Civil Rights Principles as above, were made first in finance. Those modeling errors – and when not errors, politically intentional odious models – were created first in finance, and were a huge reason we first had the mortgage-backed-securities rated with AAA ratings and then the ensuing financial crisis.

In fact finance should have been in the report standing as a worst case scenario.

One last thing. The recommendations coming out of the Podesta report are lukewarm and are even contradicted by the contents of the report, as I complained about here. That’s interesting, and it shows that politics played a large part of what the authors could include as acceptable recommendations to the Obama administration.

Categories: data science, modeling

Podesta’s Big Data report to Obama: good but not great

This week I’m planning to read Obama’s new big data report written by John Podesta. So far I’ve only scanned it and read the associated recommendations.

Here’s one recommendation related to discrimination:

Expand Technical Expertise to Stop Discrimination. The detailed personal profiles held about many consumers, combined with automated, algorithm-driven decision-making, could lead—intentionally or inadvertently—to discriminatory outcomes, or what some are already calling “digital redlining.” The federal government’s lead civil rights and consumer protection agencies should expand their technical expertise to be able to identify practices and outcomes facilitated by big data analytics that have a discriminatory impact on protected classes, and develop a plan for investigating and resolving violations of law.

First, I’m very glad this has been acknowledged as an issue; it’s a big step forward from the big data congressional subcommittee meeting I attended last year for example, where the private-data-for-services fallacy was leaned on heavily.

So yes, a great first step. However, the above recommendation is clearly insufficient to the task at hand.

It’s one thing to expand one’s expertise – and I’d be more than happy to be a consultant for any of the above civil rights and consumer protection agencies, by the way – but it’s quite another to expect those groups to be able to effectively measure discrimination, never mind combat it.

Why? It’s just too easy to hide discrimination: the models are proprietary, and some of them are not even apparent; we often don’t even know we’re being modeled. And although the report brings up discriminatory pricing practices, it ignores redlining and reverse-redlining issues, which are even harder to track. How do you know if you haven’t been made an offer?

Once they have the required expertise, we will need laws that allow institutions like the CFPB to deeply investigate these secret models, which means forcing companies like Larry Summer’s Lending Club to give access to them, where the definition of “access” is tricky. That’s not going to happen just because the CFPB asks nicely.

Categories: modeling, news

Let’s not replace the SAT with a big data approach

The big news about the SAT is that the College Boards, which makes the SAT, has admitted there is a problem, which is widespread test-prep and gaming. As I talked about in this post, the SAT mainly serves to sort people by income.

It shouldn’t be a surprise to anyone when a weak proxy gets gamed. Yesterday I discussed this very thing in the context of Google’s PageRank algorithm, and today it’s student learning aptitude. The question is, what do we do next?

Rick Bookstaber wrote an interesting post yesterday (hat tip Marcos Carreira) with an idea to address the SAT problem with the same approach that I’m guessing Google is addressing the PageRank problem, namely by abandoning the poor proxy and getting a deeper, more involved one. Here’s Bookstaber’s suggestion:

You would think that in the emerging world of big data, where Amazon has gone from recommending books to predicting what your next purchase will be, we should be able to find ways to predict how well a student will do in college, and more than that, predict the colleges where he will thrive and reach his potential.  Colleges have a rich database at their disposal: high school transcripts, socio-economic data such as household income and family educational background, recommendations and the extra-curricular activities of every applicant, and data on performance ex post for those who have attended. For many universities, this is a database that encompasses hundreds of thousands of students.

There are differences from one high school to the next, and the sample a college has from any one high school might be sparse, but high schools and school districts can augment the data with further detail, so that the database can extend beyond those who have applied. And the data available to the colleges can be expanded by orders of magnitude if students agree to share their admission data and their college performance on an anonymized basis. There already are common applications forms used by many schools, so as far as admission data goes, this requires little more than adding an agreement in the college applications to share data; the sort of agreement we already make with Facebook or Google.

The end result, achievable in a few years, is a vast database of high school performance, drilling down to the specific high school, coupled with the colleges where each student applied, was accepted and attended, along with subsequent college performance. Of course, the nature of big data is that it is data, so students are still converted into numerical representations.  But these will cover many dimensions, and those dimensions will better reflect what the students actually do. Each college can approach and analyze the data differently to focus on what they care about.  It is the end of the SAT version of standardization. Colleges can still follow up with interviews, campus tours, and reviews of musical performances, articles, videos of sports, and the like.  But they will have a much better filter in place as they do so.

Two things about this. First, I believe this is largely already happening. I’m not an expert on the usage of student data at colleges and universities, but the peek I’ve had into this industry tells me that the analytics are highly advanced (please add related comments and links if you have them!). And they have more to do with admissions and college aid – and possibly future alumni giving – than any definition of academic success. So I think Bookstaber is being a bit naive and idealistic if he thinks colleges will use this information for good. They already have it and they’re not.

Secondly, I want to think a little bit harder about when the “big, deeper data” approach makes sense. I think it does for teachers to some extent, as I talked about yesterday, because after all it’s part of a job to get evaluated. For that matter I expect this kind of thing to be part of most jobs soon (but it will be interesting to see when and where it stops – I’m pretty sure Bloomberg will never evaluate himself quantitatively).

I don’t think it makes sense to evaluate children in the same way, though. After all, we’re basically talking about pre-consensual surveillance, not to mention the collection and mining of information far beyond the control of the individual child. And we’re proposing to mine demographic and behavioral data to predict future success. This is potentially much more invasive than just one crappy SAT test. Childhood is a time which we should try to do our best to protect, not quantify.

Also, the suggestion that this is less threatening because “the data is anonymized” is misleading. Stripping out names in historical data doesn’t change or obscure the difference between coming from a rich high school or a poor one. In the end you will be judged by how “others like you” performed, and in this regime the system gets off the hook but individuals are held accountable. If you think about it, it’s exactly the opposite of the American dream.

I don’t want to be naive. I know colleges will do what they can to learn about their students and to choose students to make themselves look good, at least as long as the US News & World Reports exists. I’d like to make it a bit harder for them to do so.

How to Lie With Statistics (in the Age of Big Data)

When I emailed my mom last month to tell her the awesome news about the book I’m writing she emailed me back the following:

i.e, A modern-day How to Lie with Statistics (1954), avail on Amazon
for $9.10.  Love, Mom

That was her whole email. She’s never been very verbose, in person or electronically. Too busy hacking.

Even so, she gave me enough to go on, and I bought the book and recently read it. It was awesome and I recommend it to anyone who hasn’t read it – or read it recently. It’s a quick read and available as a free pdf download here.

The goal of the book is to demonstrate all the ways marketers, journalists, accountants, and sometimes even statisticians can bias your interpretation of statistical facts or even just confuse you into thinking something is true when it’s not. It’s illustrated as well, which is fun and often funny.

Screen Shot 2014-02-03 at 7.02.22 AM

The author does things like talk about how you can present graphs to be very misleading – my favorite, because it happens to be my pet peeve, is the “growth chart” where the y-axis goes from 1400 to 1402 so things look like they’ve grown a huge amount because “0” isn’t represented anywhere. Or of course the chart that has no numbers at all so you don’t know what you’re looking at.

Screen Shot 2014-02-03 at 6.43.24 AM

There are a few things that don’t translate: so for example, he has a big thing about how people say “average” but they don’t specify whether they mean “arithmetic mean” or “median.” Nowadays this is taken to mean the former (am I wrong?).

And also, it’s fascinating to see how culture has changed – many of his examples that involve race would be very different nowadays, and issues around women, and the idea that you could run a randomized experiment to give half the people polio vaccines and withhold them from the other half, when polio is a real threat that leaves children paralyzed, is really strange.

Also, many of the examples – there are hundreds – refer to the Great Depression and the recovery since then, and the assumptions are bizarrely different in 1954 than you see in 2014 (and I’d guess different than how it will be in 2024 but I hope I’m wrong). Specifically, it seems that many of the lies that people are propagating with statistics are to downplay their profits so as to not seem excessive. Can you imagine?!

One of the reasons I read this book, of course, was to see if my book really is a modern version of that one. And I have to say that many of the issues do not translate, but some of them do, in interesting ways.

Even the reason that many of them don’t is kind of interesting: in the age of big data, we often don’t even see charts of data so how can we be misled by them? In other words, the presumption is that the data is so big as to be inaccessible. Google doesn’t bother showing us the numbers. Plus they don’t have to since we use their services anyway.

The most transferrable tips on how to lie with statistics probably stem from discussions on the following topics:

  • Selection bias (things like, of the people who responded to our poll, they are all happy with our service)
  • Survivorship bias (things like, companies that have been in the S&P for 30 years have great stock performance)
  • Confusing people about topic A by discussing a related but not directly relevant topic B. This is described in the book as a “semi-attached figure”

The last one is the most relevant, I believe. In the age of big data, and partly because the data is “too big” to take a real look at, we spend an amazing amount of time talking about how a model is measuring something we care about (teachers’ value, or how good a candidate is for a job) when in fact the model is doing something quite different (test scores, demographic data).

If we were aware of those discrepancies we’d have way more skepticism, but we’re intimidated by the size of the data and the complexity of the models.

A final point. For the most part that crucial big data issue of complexity isn’t addressed in the book. It kind of makes me pine for the olden days, except not really if I’m black, a woman, or at risk of being exposed to polio.

UPDATES: First, my bad for not understanding that, at the time, the polio vaccine wasn’t known to work, or even be harmful, so of course there were trials. I was speaking from the perspective of the present day when it seems obvious that it works. For that matter I’m not even sure it was the particular vaccine that ended up working that was being tested.

Second, I showed my mom this post and her response was perfect:

Glad you liked it! Love, Mom

Categories: musing, statistics

The bursting of the big data bubble

September 20, 2013 42 comments

It’s been a good ride. I’m not gonna lie, it’s been a good time to be a data whiz, a quant-turned-data scientist. I get lots of attention and LinkedIn emails just for my title and my math Ph.D., and it’s flattering. But all of that is going to change, starting now.

You see, there are some serious headwinds. They started a while ago but they’re picking up speed, and the magical wave of hype propelling us forward is giving way. I can tell, I’ve got a nose for sinking ships and sailing metaphors.

First, the hype and why it’s been so strong.

It seems like data and the ability to use data is the secret sauce in so many of the big success stories. Look at Google. They managed to think of the entire web as their data source, and have earned quite a bit of respect and advertising money for their chore of organizing it like a huge-ass free library for our benefit. That took some serious data handling and modeling know-how.

We humans are pretty good at detecting patterns, so after a few companies made it big with the secret data sauce, we inferred that, when you take a normal tech company and sprinkle on data, you get the next Google.

Next, a few reasons it’s unsustainable

Most companies don’t have the data that Google has, and can never hope to cash in on stuff at the scale of the ad traffic that Google sees. Even so, there are lots of smaller but real gains that lots of companies – but not all – could potentially realize if they collected the right kind of data and had good data people helping them.

Unfortunately, this process rarely actually happens the right way, often because the business people ask their data people the wrong questions to being with, and since they think of their data people as little more than pieces of software – data in, magic out – they don’t get their data people sufficiently involved with working on something that data can address.

Also, since there are absolutely no standards for what constitutes a data scientist, and anyone who’s taken a machine learning class at college can claim to be one, the data scientists walking around often have no clue how to actually form the right questions to ask anyway. They are lopsided data people, and only know how to answer already well-defined questions like the ones that Kaggle comes up with. That’s less than half of what a good data scientist does, but people have no idea what a good data scientist does.

Plus, it’s super hard to accumulate hard evidence that you have a crappy data science team. If you’ve hired one or more unqualified data scientists, how can you tell? They still might be able to implement crappy models which don’t answer the right question, but in order to see that you’d need to also have a good data scientist who implements a better solution to the right question. But you only have one. It’s a counterfactual problem.

Here’s what I see happening. People have invested some real money in data, and they’ve gotten burned with a lack of medium-term results. Now they’re getting impatient for proof that data is an appropriate place to invest what little money their VC’s have offered them. That means they want really short-term results, which means they’re lowballing data science expertise, which means they only attract people who’ve taken one machine learning class and fancy themselves experts.

In other words, data science expertise has been commodified, and it’s a race to the bottom. Who will solve my business-critical data problem on a short-term consulting basis for less than $5000? Less than $4000?

What’s next?

There really is a difference between A) crude models that someone constructs not really knowing what they’re doing and B) thoughtful models which gain an edge along the margin. It requires someone who actually knows what they’re doing to get the latter kind of model. But most people are unaware of even the theoretical difference between type A and type B models, nor would they recognize which type they’ve got once they get one.

Even so, over time, type B models outperform type A models, and if you care enough about the marginal edge between the two types, say because you’re in a competitive environment, then you will absolutely need type B to make money. And by the way, if you don’t care about that marginal edge, then by all means you should use a type A solution. But you should at least know the difference and make that choice deliberately.

My forecast is that, once the hype wave of big data is dead and gone, there will emerge reasonable standards of what a data scientist should actually be able to do, and moreover a standard of when and how to hire a good one. It’ll be a rubrik, and possibly some tests, of both problem solving and communication.

Personally, I’m looking forward to a more reasonable and realistic vision of how data and data expertise can help with things. I might have to change my job title, but I’m used to it.

Categories: data science

When big data goes bad in a totally predictable way

Three quick examples this morning in the I-told-you-so category. I’d love to hear Kenneth Neil Cukier explain how “objective” data science is when confronted with this stuff.

1. When an unemployed black woman pretends to be white her job offers skyrocket (Urban Intellectuals, h/t Mike Loukides). Excerpt from the article: “Two years ago, I noticed that had added a “diversity questionnaire” to the site.  This gives an applicant the opportunity to identify their sex and race to potential employers. guarantees that this “option” will not jeopardize your chances of gaining employment.  You must answer this questionnaire in order to apply to a posted position—it cannot be skipped.  At times, I would mark off that I was a Black female, but then I thought, this might be hurting my chances of getting employed, so I started selecting the “decline to identify” option instead.  That still had no effect on my getting a job.  So I decided to try an experiment:  I created a fake job applicant and called her Bianca White.”

2. How big data could identify the next felon – or blame the wrong guy (Bloomberg). From the article: “The use of physical characteristics such as hair, eye and skin color to predict future crimes would raise ‘giant red privacy flags’ since they are a proxy for race and could reinforce discriminatory practices in hiring, lending or law enforcement, said Chi Chi Wu, staff attorney at the National Consumer Law Center.”

3. How algorithms magnify misbehavior (the Guardian, h/t Suresh Naidu). From the article: “For one British university, what began as a time-saving exercise ended in disgrace when a computer model set up to streamline its admissions process exposed – and then exacerbated – gender and racial discrimination.”

This is just the beginning, unfortunately.

Categories: data science, modeling

What’s the difference between big data and business analytics?

I offend people daily. People tell me they do “big data” and that they’ve been doing big data for years. Their argument is that they’re doing business analytics on a larger and larger scale, so surely by now it must be “big data”.


There’s an essential difference between true big data techniques, as actually performed at surprisingly few firms but exemplified by Google, and the human-intervention data-driven techniques referred to as business analytics.

No matter how big the data you use is, at the end of the day, if you’re doing business analytics, you have a person looking at spreadsheets or charts or numbers, making a decision after possibly a discussion with 150 other people, and then tweaking something about the way the business is run.

If you’re really doing big data, then those 150 people probably get fired laid off, or even more likely are never hired in the first place, and the computer is programmed to update itself via an optimization method.

That’s not to say it doesn’t also spit out monitoring charts and numbers, and it’s not to say no person takes a look every now and then to make sure the machine is humming along, but there’s no point at which the algorithm waits for human intervention.

In other words, in a true big data setup, the human has stepped outside the machine and lets the machine do its thing. That means, of course, that it takes way more to set up that machine in the first place, and probably people make huge mistakes all the time in doing this, but sometimes they don’t. Google search got pretty good at this early on.

So with a business analytics set up we might keep track of the number of site visitors and a few sales metrics so we can later try to (and fail to) figure out whether a specific email marketing campaign had the intended effect.

But in a big data set-up it’s typically much more microscopic and detail oriented, collecting everything it can, maybe 1,000 attributed of a single customer, and figuring out what that guy is likely to do next time, how much they’ll spend, and the magic question, whether there will even be a next time.

So the first thing I offend people about is that they’re not really part of the “big data revolution”. And the second thing is that, usually, their job is potentially up for grabs by an algorithm.

Categories: data science, modeling

Technocrats and big data

Today I’m finally getting around to reporting on the congressional subcommittee I went to a few weeks ago on big data and analytics. Needless to say it wasn’t what I’d hoped.

My observations are somewhat disjointed, since there was no coherent discussion, so I guess I’ll just make a list:

  1. The Congressmen and women seem to know nothing more about the “Big Data Revolution” than what they’d read in the now-famous McKinsey report which talks about how we’ll need 180,000 data scientists in the next decade and how much money we’ll save and how competitive it will make our country.
  2. In other words, with one small exception I’ll discuss below, the Congresspeople were impressed, even awed, at the intelligence and power of the panelists. They were basically asking for advice on how to let big data happen on a bigger and better scale. Regulation never came up, it was all about, “how do we nurture this movement that is vital to our country’s health and future?”
  3. There were three useless panelists, all completely high on big data and making their money being like that. First there was a schmuck from the NSF who just said absolutely nothing, had been to a million panels before, and was simply angling to be invited to yet more.
  4. Next there was a guy who had started training data-ready graduates in some masters degree program. All he ever talked about is how programs like his should be funded, especially his, and how he was talking directly with employers in his area to figure out what to train his students to know.
  5. It was especially interesting to see how this second guy reacted when the single somewhat thoughtful and informed Congressman, whose name I didn’t catch because he came in and left quickly and his name tag was miniscule, asked him about whether or not he taught his students to be skeptical. The guy was like, I teach my students to be ready to deal with big data just like their employers want. The congressman was like, no that’s not what I asked, I asked whether they can be skeptical of perceived signals versus noise, whether they can avoid making huge costly mistakes with big data. The guy was like, I teach my students to deal with big data.
  6. Finally there was the head of IBM Research who kept coming up with juicy and misleading pro-data tidbits which made him sound like some kind of saint for doing his job. For example, he brought up the “premature infants are being saved” example I talked about in this post.
  7. The IBM guy was also the only person who ever mentioned privacy issues at all, and he summarized his, and presumably everyone else’s position on this subject, by saying “people are happy to give away their private information for the services they get in return.” Thanks, IBM guy!
  8. One more priceless moment was when one of the Congressmen asked the panel if industry has enough interaction with policy makers. The head of IBM Research said, “Why yes, we do!” Thanks, IBM guy!

I was reminded of this weird vibe and power dynamic, where an unchallenged mysterious power of big data rules over reason, when I read this New York Times column entitled Some Cracks in the Cult of Technocrats (hat tip Suresh Naidu). Here’s the leading paragraph:

We are living in the age of the technocrats. In business, Big Data, and the Big Brains who can parse it, rule. In government, the technocrats are on top, too. From Washington to Frankfurt to Rome, technocrats have stepped in where politicians feared to tread, rescuing economies, or at least propping them up, in the process.

The column was written by Chrystia Freeland and it discusses a recent paper entitled Economics versus Politics: Pitfalls of Policy Advice by Daron Acemoglu from M.I.T. and James Robinson from Harvard. A description of the paper from Freeland’s column:

Their critique is not the standard technocrat’s lament that wise policy is, alas, politically impossible to implement. Instead, their concern is that policy which is eminently sensible in theory can fail in practice because of its unintended political consequences.

In particular, they believe we need to be cautious about “good” economic policies that have the side effect of either reinforcing already dominant groups or weakening already frail ones.

“You should apply double caution when it comes to policies which will strengthen already powerful groups,” Dr. Acemoglu told me. “The central starting point is a certain suspicion of elites. You really cannot trust the elites when they are totally in charge of policy.”

Three examples they discuss in the paper: trade unions, financial deregulation in the U.S., privatization in Russia. Examples where something economists suggested would make the system better also acted to reinforce power of already powerful people.

If there’s one thing I might infer from my trip to Washington, it’s that the technocrats in charge nowadays, whose advice is being followed, may have subtly shifted away from deregulation economists and towards big data folks. Not that I’m holding my breath for Bob Rubin to be losing his grip any time soon.

Categories: data science, finance, news

The rise of big data, big brother

I recently read an article off the newsstand called The Rise of Big Data.

It was written by Kenneth Neil Cukier and Viktor Mayer-Schoenberger and it was published in the May/June 2013 edition of Foreign Affairs, which is published by the Council on Foreign Relations (CFR). I mention this because CFR is an influential think tank, filled with powerful insiders, including people like Robert Rubin himself, and for that reason I want to take this view on big data very seriously: it might reflect the policy view before long.

And if I think about it, compared to the uber naive view I came across last week when I went to the congressional hearing about big data and analytics, that would be good news. I’ll write more about it soon, but let’s just say it wasn’t everything I was hoping for.

At least Cukier and Mayer-Schoenberger discuss their reservations regarding “big data” in this article. To contrast this with last week, it seemed like the only background material for the hearing, at least for the congressmen, was the McKinsey report talking about how sexy data science is and how we’ll need to train an army of them to stay competitive.

So I’m glad it’s not all rainbows and sunshine when it comes to big data in this article. Unfortunately, whether because they’re tied to successful business interests, or because they just haven’t thought too deeply about the dark side, their concerns seem almost token, and their examples bizarre.

The article is unfortunately behind the pay wall, but I’ll do my best to explain what they’ve said.


First they discuss the concept of datafication, and their example is how we quantify friendships with “likes”: it’s the way everything we do, online or otherwise, ends up recorded for later examination in someone’s data storage units. Or maybe multiple storage units, and maybe for sale.

They formally define later in the article as a process:

… taking all aspect of life and turning them into data. Google’s augmented-reality glasses datafy the gaze. Twitter datafies stray thoughts. LinkedIn datafies professional networks.

Datafication is an interesting concept, although as far as I can tell they did not coin the word, and it has led me to consider its importance with respect to intentionality of the individual.

Here’s what I mean. We are being datafied, or rather our actions are, and when we “like” someone or something online, we are intending to be datafied, or at least we should expect to be. But when we merely browse the web, we are unintentionally, or at least passively, being datafied through cookies that we might or might not be aware of. And when we walk around in a store, or even on the street, we are being datafied in an completely unintentional way, via sensors or Google glasses.

This spectrum of intentionality ranges from us gleefully taking part in a social media experiment we are proud of to all-out surveillance and stalking. But it’s all datafication. Our intentions may run the gambit but the results don’t.

They follow up their definition in the article, once they get to it, with a line that speaks volumes about their perspective:

Once we datafy things, we can transform their purpose and turn the information into new forms of value

But who is “we” when they write it? What kinds of value do they refer to? As you will see from the examples below, mostly that translates into increased efficiency through automation.

So if at first you assumed they mean we, the American people, you might be forgiven for re-thinking the “we” in that sentence to be the owners of the companies which become more efficient once big data has been introduced, especially if you’ve recently read this article from Jacobin by Gavin Mueller, entitled “The Rise of the Machines” and subtitled “Automation isn’t freeing us from work — it’s keeping us under capitalist control.” From the article (which you should read in its entirety):

In the short term, the new machines benefit capitalists, who can lay off their expensive, unnecessary workers to fend for themselves in the labor market. But, in the longer view, automation also raises the specter of a world without work, or one with a lot less of it, where there isn’t much for human workers to do. If we didn’t have capitalists sucking up surplus value as profit, we could use that surplus on social welfare to meet people’s needs.

The big data revolution and the assumption that N=ALL

According to Cukier and Mayer-Schoenberger, the Big Data revolution consists of three things:

  1. Collecting and using a lot of data rather than small samples.
  2. Accepting messiness in your data.
  3. Giving up on knowing the causes.

They describe these steps in rather grand fashion, by claiming that big data doesn’t need to understand cause because the data is so enormous. It doesn’t need to worry about sampling error because it is literally keeping track of the truth. The way the article frames this is by claiming that the new approach of big data is letting “N = ALL”.

But here’s the thing, it’s never all. And we are almost always missing the very things we should care about most.

So for example, as this InfoWorld post explains, internet surveillance will never really work, because the very clever and tech-savvy criminals that we most want to catch are the very ones we will never be able to catch, since they’re always a step ahead.

Even the example from their own article, election night polls, is itself a great non-example: even if we poll absolutely everyone who leaves the polling stations, we still don’t count people who decided not to vote in the first place. And those might be the very people we’d need to talk to to understand our country’s problems.

Indeed, I’d argue that the assumption we make that N=ALL is one of the biggest problems we face in the age of Big Data. It is, above all, a way of excluding the voices of people who don’t have the time or don’t have the energy or don’t have the access to cast their vote in all sorts of informal, possibly unannounced, elections.

Those people, busy working two jobs and spending time waiting for buses, become invisible when we tally up the votes without them. To you this might just mean that the recommendations you receive on Netflix don’t seem very good because most of the people who bother to rate things are Netflix are young and have different tastes than you, which skews the recommendation engine towards them. But there are plenty of much more insidious consequences stemming from this basic idea.

Another way in which the assumption that N=ALL can matter is that it often gets translated into the idea that data is objective. Indeed the article warns us against not assuming that:

… we need to be particularly on guard to prevent our cognitive biases from deluding us; sometimes, we just need to let the data speak.

And later in the article,

In a world where data shape decisions more and more, what purpose will remain for people, or for intuition, or for going against the facts?

This is a bitch of a problem for people like me who work with models, know exactly how they work, and know exactly how wrong it is to believe that “data speaks”.

I wrote about this misunderstanding here, in the context of Bill Gates, but I was recently reminded of it in a terrifying way by this New York Times article on big data and recruiter hiring practices. From the article:

“Let’s put everything in and let the data speak for itself,” Dr. Ming said of the algorithms she is now building for Gild.

If you read the whole article, you’ll learn that this algorithm tries to find “diamond in the rough” types to hire. A worthy effort, but one that you have to think through.

Why? If you, say, decided to compare women and men with the exact same qualifications that have been hired in the past, but then, looking into what happened next you learn that those women have tended to leave more often, get promoted less often, and give more negative feedback on their environments, compared to the men, your model might be tempted to hire the man over the woman next time the two showed up, rather than looking into the possibility that the company doesn’t treat female employees well.

In other words, ignoring causation can be a flaw, rather than a feature. Models that ignore causation can add to historical problems instead of addressing them. And data doesn’t speak for itself, data is just a quantitative, pale echo of the events of our society.

Some cherry-picked examples

One of the most puzzling things about the Cukier and Mayer-Schoenberger article is how they chose their “big data” examples.

One of them, the ability for big data to spot infection in premature babies, I recognized from the congressional hearing last week. Who doesn’t want to save premature babies? Heartwarming! Big data is da bomb!

But if you’re going to talk about medicalized big data, let’s go there for reals. Specifically, take a look at this New York Times article from last week where a woman traces the big data footprints, such as they are, back in time after receiving a pamphlet on living with Multiple Sclerosis. From the article:

Now she wondered whether one of those companies had erroneously profiled her as an M.S. patient and shared that profile with drug-company marketers. She worried about the potential ramifications: Could she, for instance, someday be denied life insurance on the basis of that profile? She wanted to track down the source of the data, correct her profile and, if possible, prevent further dissemination of the information. But she didn’t know which company had collected and shared the data in the first place, so she didn’t know how to have her entry removed from the original marketing list.

Two things about this. First, it happens all the time, to everyone, but especially to people who don’t know better than to search online for diseases they actually have. Second, the article seems particularly spooked by the idea that a woman who does not have a disease might be targeted as being sick and have crazy consequences down the road. But what about a woman is actually is sick? Does that person somehow deserve to have their life insurance denied?

The real worries about the intersection of big data and medical records, at least the ones I have, are completely missing from the article. Although they did mention that “improving and lowering the cost of health care for the world’s poor” inevitable  will lead to “necessary to automate some tasks that currently require human judgment.” Increased efficiency once again.

To be fair, they also talked about how Google tried to predict the flu in February 2009 but got it wrong. I’m not sure what they were trying to say except that it’s cool what we can try to do with big data.

Also, they discussed a Tokyo research team that collects data on 360 pressure points with sensors in a car seat, “each on a scale of 0 to 256.” I think that last part about the scale was added just so they’d have more numbers in the sentence – so mathematical!

And what do we get in exchange for all these sensor readings? The ability to distinguish drivers, so I guess you’ll never have to share your car, and the ability to sense if a driver slumps, to either “send an alert or atomatically apply brakes.” I’d call that a questionable return for my investment of total body surveillance.

Big data, business, and the government

Make no mistake: this article is about how to use big data for your business. It goes ahead and suggests that whoever has the biggest big data has the biggest edge in business.

Of course, if you’re interested in treating your government office like a business, that’s gonna give you an edge too. The example of Bloomberg’s big data initiative led to efficiency gain (read: we can do more with less, i.e. we can start firing government workers, or at least never hire more).

As for regulation, it is pseudo-dealt with via the discussion of market dominance. We are meant to understand that the only role government can or should have with respect to data is how to make sure the market is working efficiently. The darkest projected future is that of market domination by Google or Facebook:

But how should governments apply antitrust rules to big data, a market that is hard to define and is constantly changing form?

In particular, no discussion of how we might want to protect privacy.

Big data, big brother

I want to be fair to Cukier and Mayer-Schoenberger, because they do at least bring up the idea of big data as big brother. Their topic is serious. But their examples, once again, are incredibly weak.

Should we find likely-to-drop-out boys or likely-to-get-pregnant girls using big data? Should we intervene? Note the intention of this model would be the welfare of poor children. But how many models currently in production are targeting that demographic with that goal? Is this in any way at all a reasonable example?

Here’s another weird one: they talked about the bad metric used by US Secretary of Defense Robert McNamara in the Viet Nam War, namely the number of casualties. By defining this with the current language of statistics, though, it gives us the impression that we could just be super careful about our metrics in the future and: problem solved. As we experts in data know, however, it’s a political decision, not a statistical one, to choose a metric of success. And it’s the guy in charge who makes that decision, not some quant.


If you end up reading the Cukier and Mayer-Schoenberger article, please also read Julie Cohen’s draft of a soon-to-be published Harvard Law Review article called “What Privacy is For” where she takes on big data in a much more convincing and skeptical light than Cukier and Mayer-Schoenberger were capable of summoning up for their big data business audience.

I’m actually planning a post soon on Cohen’s article, which contains many nuggets of thoughtfulness, but for now I’ll simply juxtapose two ideas surrounding big data and innovation, giving Cohen the last word. First from the Cukier and Mayer-Schoenberger article:

Big data enables us to experiment faster and explore more leads. These advantages should produce more innovation

Second from Cohen, where she uses the term “modulation” to describe, more or less, the effect of datafication on society:

When the predicate conditions for innovation are described in this way, the problem with characterizing privacy as anti-innovation becomes clear: it is modulation, not privacy, that poses the greater threat to innovative practice. Regimes of pervasively distributed surveillance and modulation seek to mold individual preferences and behavior in ways that reduce the serendipity and the freedom to tinker on which innovation thrives. The suggestion that innovative activity will persist unchilled under conditions of pervasively distributed surveillance is simply silly; it derives rhetorical force from the cultural construct of the liberal subject, who can separate the act of creation from the fact of surveillance. As we have seen, though, that is an unsustainable fiction. The real, socially-constructed subject responds to surveillance quite differently—which is, of course, exactly why government and commercial entities engage in it. Clearing the way for innovation requires clearing the way for innovative practice by real people, by preserving spaces within which critical self-determination and self-differentiation can occur and by opening physical spaces within which the everyday practice of tinkering can thrive.

Big data and surveillance

You know how, every now and then, you hear someone throw out a statistic that implies almost all of the web is devoted to porn?

Well, that turns out to be a false myth, which you can read more about here – although once upon a time it was kind of true, before women started using the web in large numbers and before there was Netflix streaming.

Here’s another myth along the same lines which I think might actually be true: almost all of big data is devoted to surveillance.

Of course, data is data, and you could define “surveillance” broadly (say as “close observation”), to make the above statement a tautology. To what extent is Google’s data, collected about you, a surveillance database, if they only use it to tailor searches and ads?

On the other hand, something that seems unthreatening now can become creepy soon: recall the NSA whistleblower who last year described how the government stores an enormous amount of the “electronic communications” in this country to keep close tabs on us.

The past

Back in 2011, published an article entitled “Big data to drive a surveillance society” and makes the case that there is a natural competition among corporations with large databases to collect more data, have it more interconnected (knowing now only a person’s shopping habits but also their location and age, say) and have the analytics work faster, even real-time, so they can peddle their products faster and better than the next guy.

Of course, not everyone agrees to talk about this “natural competition”. From the article:

Todd Papaioannou, vice president of cloud architecture at Yahoo, said instead of thinking about big data analytics as a weapon that empowers corporate Big Brothers, consumers should regard it as a tool that enables a more personalized Web experience.

“If someone can deliver a more compelling, relevant experience for me as a consumer, then I don’t mind it so much,” he said.

Thanks for telling us consumers how great this is, Todd. Later in the same article Todd says, “Our approach is not to throw any data away.”

The present

Fast forward to 2013, when defence contractor Raytheon is reported to have a new piece of software, called Riot, which is cutting-edge in the surveillance department.

The name Riot refers to “Rapid Information Overlay Technology” and it can locate individuals with longitude and latitudes, using cell phone data, and make predictions as well, using data scraped from Facebook, Twitter, and Foursquare. A video explains how they do it. From the op-ed:

The possibilities for RIOT are hideous at consumer level. This really is the stalker’s dream technology. There’s also behavioural analysis to predict movements in the software. That’s what Big Data can do, and if it’s not foolproof, there are plenty of fools around the world to try it out on.

US employers, who have been creating virtual Gulags of surveillance for employees with much less effective technology, will love this. “We know what you do” has always been a working option for coercion. The fantastic levels of paranoia involved in the previous generations of surveillance technology will be truly gratified by RIOT.

The future

Lest we think that our children are not as affected by such stalking software, since they don’t spend as much time on social media and often don’t have cellphones, you should also be aware that educational data is now being collected about individual learners in the U.S. at an enormous scale and with very little oversight.

This report from (hat tip Matthew Cunningham-Cook) explains recent changes in privacy laws for children, which happen to coincide with how much data is being collected (tons) and how much money is in the analysis of that data (tons):

Schools are a rich source of personal information about children that can be legally and illegally accessed by third parties.With incidences of identity theft, database hacking, and sale of personal information rampant, there is an urgent need to protect students’ rights under FERPA and raise awareness of aspects of the law that may compromise the privacy of students and their families.

In 2008 and 2011, amendments to FERPA gave third parties, including private companies,increased access to student data. It is significant that in 2008, the amendments to FERPA expanded the definitions of “school  officials” who have access to student data to include “contractors, consultants, volunteers, and other parties to whom an educational agency or institution has outsourced institutional services or functions it would otherwise use employees to perform.” This change has the effect of increasing the market for student data.

There are lots of contractors and consultants, for example inBloom, and they are slightly less concerned about data privacy issues than you might be:

inBloom has stated that it “cannot guarantee the security of the information stored … or that the information will not be intercepted when it is being transmitted.”

The article ends with this:

The question is: Should we compromise and endanger student privacy to support a centralized and profit-driven education reform initiative? Given this new landscape of an information and data free-for-all, and the proliferation of data-driven education reform initiatives like CommonCore and huge databases of student information, we’ve arrived at a time when once a child enters a public school,their parents will never again know who knows what about their children and about their families. It is now up to individual states to find ways to grant students additional privacy protections.

No doubt about it: our children are well on their way to being the most stalked generation.

Privacy policy

One of the reasons I’m writing this post today is that I’m on a train to D.C. to sit in a Congressional hearing where Congressmen will ask “big data experts” questions about big data and analytics. The announcement is here, and I’m hoping to get into it.

The experts present are from IBM, the NSF, and North Carolina State University. I’m wondering how they got picked and what their incentives are. If I get in I will write a follow-up post on what happened.

Here’s what I hope happens. First, I hope it’s made clear that anonymization doesn’t really work with large databases. Second, I hope it’s clear that there’s no longer a very clear dividing line between sensitive data and nonsensitive data – you’d be surprised how much can be inferred about your sensitive data using only nonsensitive data.

Next, I hope it’s clear that the very people who should be worried the most about their data being exposed and freely available are the ones who don’t understand the threat. This means that merely saying that people should protect their data more is utterly insufficient.

Next, we should understand what policies already in place look like in Europe:

Screen Shot 2013-04-24 at 6.55.44 AM

Finally, we should focus not only the collection of data, but on the usage of data. Just because you have a good idea of my age, race, education level, income, and HIV status doesn’t mean you should be able to use that information against me whenever you want.

In particular, it should not be legal for companies that provide loans or insurance to use whatever information they can buy from Acxiom about you. It should be a highly regulated set of data that allows for such decisions.

Categories: data science, modeling

The smell test for big data

The other day I was chatting with a data scientist (who didn’t know me), and I asked him what he does. He said that he used social media graphs to see how we might influence people to lose weight.

Whaaaa? That doesn’t pass the smell test.

If I can imagine it happening in real life, between people, then I can imagine it happening in a social medium. If it doesn’t happen in real life, it doesn’t magically appear on the internet.

So if I have a huge crush on LeBron James (true), and if he tweets that I should go out and watch “Life of Pi” because it’s a great movie (true), then I’d do it, because I’d imagine he is here with me in my living room suggesting that I see that movie, and I’d do anything that man says if he’s in my living room, especially if he’s jamming with me.

Not actually my living room.

Not actually my living room.

But if LeBron James tells me to lose weight while we’re hanging, then I just feel bad and weird. Because nobody can influence someone else to lose weight in person*.

Bottomline: there’s a smell test, and it states that real influence happening inside a social graph isn’t magical just because it’s mathematically formulated. It is at best an echo of the actual influence exerted in real life. I have yet to see a counter-example to that. If you have one, please challenge me on this.

Any data scientist going around claiming they’re going to surpass this smell test should stop right now, because it adds to the hype and adds to the noise around big data without adding to the conversation.

* I’ll make an exception if they’re a doctor wielding a surgical knife about to remove my stomach or something, which doesn’t translate well into social media, and might not always work long-term. And to be fair, you (or LeBron) can influence me to not eat a given thing on a given day, or even to go on a diet, but by now we should know that doesn’t have long term effects. There’s a reason Weight Watchers either doesn’t publish their results or relies on survivorship bias for fake results.

Categories: data science, modeling, rant

Updating your big data model

When you are modeling for the sake of real-time decision-making you have to keep updating your model with new data, ideally in an automated fashion. Things change quickly in the stock market or the internet, and you don’t want to be making decisions based on last month’s trends.

One of the technical hurdles you need to overcome is the sheer size of the dataset you are using to first train and then update your model. Even after aggregating your model with MapReduce or what have you, you can end up with hundreds of millions of lines of data just from the past day or so, and you’d like to use it all if you can.

The problem is, of course, that over time the accumulation of all that data is just too unwieldy, and your python or Matlab or R script, combined with your machine, can’t handle it all, even with a 64 bit setup.

Luckily with exponential downweighting, you can update iteratively; this means you can take your new aggregated data (say a day’s worth), update the model, and then throw it away altogether. You don’t need to save the data anywhere, and you shouldn’t.

As an example, say you are running a multivariate linear regression. I will ignore bayesian priors (or, what is an example of the same thing in a different language, regularization terms) for now. Then in order to have an updated coefficient vector \beta, you need to update your “covariance matrix” X^{\tau} X and the other term (which must have a good name but I don’t know it) X^{\tau} y and simply compute

\beta = (X^{\tau} X)^{-1} X^{\tau} y.

So the problem simplifies to, how can we update X^{\tau} X and X^{\tau} y?

As I described before in this post for example, you can use exponential downweighting. Whereas before I was expounding on how useful this method is for helping you care about new data more than old data, today my emphasis is on the other convenience, which is that you can throw away old data after updating your objects of interest.

So in particular, we will follow the general rule in updating an object $T$ that it’s just some part old, some part new:

T(t+1) = \lambda T(t) + (1-\lambda) T(t, t+1),

where by T(t) I mean the estimate of the thing T at time t, and by T(t, t+a) I mean the estimate of the thing T given just the data between time t and time t+a.

The speed at which I forget data is determined by my choice of \lambda, and should be determined by the market this model is being used in. For example, currency trading is fast-paced, and long-term bonds not as much. How long does it take the market to forget news or to acclimate to new news? The same kind of consideration should be used in modeling the internet. How quickly do users change their behaviors? This could depend on the season as well- things change quickly right after Christmas shopping season is done compared to the lazy summer months.

Specifically, I want to give an example of this update rule for the covariance matrix X^{\tau}X, which really isn’t a true covariance matrix because I’m not scaling it correctly, but I’ll ignore that because it doesn’t matter for this discussion.

Namely, I claim that after updating X^{\tau}X with the above exponential downweighting rule, I have the covariance matrix of data that was itself exponentially downweighted. This is totally trivial but also kind of important- it means that we are not creating some kind of new animal when we add up covariance matrices this way.

Just to be really dumb, start with a univariate regression example, so where we have a single signal x and a single response y. Say we get our first signal x_1 and our first reponse y_1. Our first estimate for the covariance matrix is x_1^2.

Now we get a new piece of data (x_2, y_2), and we want to downweight the old stuff, so we multiply x_1 and y_1 by some number \mu. Then our signal vector looks like [\mu x_1 x_2] and the new estimate for the covariance matrix is

M(2) = \mu^2 x_1^2 + x_2^2 = \mu^2 M(1) + M(1, 2),

where by M(t) I mean the estimate of the covariance matrix at time t as above. Up to scaling this is the exact form from above, where \lambda = \frac{\mu^2}{1+\mu^2}.

Things to convince yourself of:

  1. This works when we move from n pieces of data to n+1 pieces of data.
  2. This works when we move from a univariate regression to a multivariate regression and we’re actually talking about square matrices.
  3. Same goes for the X^{\tau} y term in the same exact way (except it ends up being a column matrix rather than a square matrix).
  4. We don’t really have to worry about scaling; this uses the fact that everything in sight is quadratic in \mu, the downweighting scalar, and the final product we care about is \beta =(X^{\tau}X)^{-1} X^{\tau}y, where, if we did decide to care about scalars, we would mutliply X^{\tau} y by the appropriate scalar but then end up dividing by that same scalar when we find the inverse of X^{\tau} X.
  5. We don’t have to update one data point at a time. We can instead compute the `new part’ of the covariance matrix and the other thingy for a whole day’s worth of data, downweight our old estimate of the covariance matrix and other thingy, and then get a new version for both.
  6. We can also incorporate bayesian priors into the updating mechanism, although you have decide whether the prior itself needs to be downweighted or not; this depends on whether the prior is coming from a fading prior belief (like, oh I think the answer is something like this because all the studies that have been done say something kind of like that, but I’d be convinced otherwise if the new model tells me otherwise) or if it’s a belief that won’t be swayed (like, I think newer data is more important, so if I use lagged values of the quarterly earnings of these companies then the more recent earnings are more important and I will penalize the largeness of their coefficients less).

End result: we can cut our data up into bite-size chunks our computer can handle, compute our updates, and chuck the data. If we want to maintain some history we can just store the `new parts’ of the matrix and column vector per day. Then if we later decide our downweighting was too aggressive or not sufficiently aggressive, we can replay the summation. This is much more efficient as storage than holding on to the whole data set, because it depends only on the number of signals in the model (typically under 200) rather than the number of data points going into the model. So for each day you store a 200-by-200 matrix and a 200-by-1 column vector.

Is Big Data Evil?

Back when I was growing up, your S.A.T. score was a big deal, but I feel like I lived in a relatively unfettered world of anonymity compared to what we are creating now. Imagine if your SAT score decided your entire future.

Two days ago I wrote about Emanuel Derman’s excellent new book “Models. Behaving. Badly.” and mentioned his Modeler’s Hippocratic Oath, which I may have to restate on every post from now on:

  • I will remember that I didn’t make the world, and it doesn’t satisfy my equations.
  • Though I will use models boldly to estimate value, I will not be overly impressed by mathematics.
  • I will never sacrifice reality for elegance without explaining why I have done so.
  • Nor will I give the people who use my model false comfort about its accuracy. Instead, I will make explicit its assumptions and oversights.
  • I understand that my work may have enormous effects on society and the economy, many of them beyond my comprehension.

I mentioned that every data scientist should sign at the bottom of this page. Since then I’ve read three disturbing articles about big data. First, this article in the New York Times, which basically says that big data is a bubble:

This is a common characteristic of technology that its champions do not like to talk about, but it is why we have so many bubbles in this industry. Technologists build or discover something great, like railroads or radio or the Internet. The change is so important, often world-changing, that it is hard to value, so people overshoot toward the infinite. When it turns out to be merely huge, there is a crash, in railroad bonds, or RCA stock, or Perhaps Big Data is next, on its way to changing the world.

In a way I agree, but let’s emphasize the “changing the world” part, and ignore the hype. The truth is that, beyond the hype, the depth of big data’s reach is not really understood yet by most people, especially people inside big data. I’m not talking about the technological reach, but rather the moral and philosophical reach.

Let me illustrate my point by explaining the gist of the other two articles, both from the Wall Street Journal. The second article describes a model which uses the information on peoples’ credit card purchases to direct online advertising at them:

MasterCard earlier this year proposed an idea to ad executives to link Internet users to information about actual purchase behaviors for ad targeting, according to a MasterCard document and executives at some of the world’s largest ad companies who were involved in the talks. “You are what you buy,” the MasterCard document says.

MasterCard doesn’t collect people’s names or addresses when processing credit-card transactions. That makes it tricky to directly link people’s card activity to their online profiles, ad executives said. The company’s document describes its “extensive experience” linking “anonymized purchased attributes to consumer names and addresses” with the help of third-party companies.

MasterCard has since backtracked on this plan:

The MasterCard spokeswoman also said the idea described in MasterCard’s April document has “evolved significantly” and has “changed considerably” since August. After the company’s conversations with ad agencies, MasterCard said, it found there was “no feasible way” to connect Internet users with its analysis of their purchase history. “We cannot link individual transaction data,” MasterCard said.

How loudly can you hear me say “bullshit”? Even if they decide not to do this because of bad public relations, there are always smaller third-party companies who don’t even have a PR department:

Credit-card issuers including Discover Financial Services’ Discover Card, Bank of America Corp., Capital One Financial Corp. and J.P. Morgan Chase & Co. disclose in their privacy policies that they can share personal information about people with outside companies for marketing. They said they don’t make transaction data or purchase-history information available to outside companies for digital ad targeting.

The third article talks about using credit scores, among other “scoring” systems, to track and forecast peoples’ behavior. They model all sorts of things, like the likelihood you will take your pills:

Experian PLC, the credit-report giant, recently introduced an Income Insight score, designed to estimate the income of a credit-card applicant based on the applicant’s credit history. Another Experian score attempts to gauge the odds that a consumer will file for bankruptcy.

Rival credit reporter Equifax Inc. offers an Ability to Pay Index and a Discretionary Spending Index that purports to indicate whether people have extra money burning a hole in their pocket.

Understood, this is all about money. This is, in fact, all about companies ranking you in terms of your potential profitability to them. Just to make sure we’re all clear on the goal then:

The system “has been incredibly powerful for consumers,” said Mr. Wagner.

Ummm… well, at least it’s nice to see that it’s understood there is some error in the modeling:

Eric Rosenberg, director of state-government relations for credit bureau TransUnion LLC, told Oregon state lawmakers last year that his company can’t show “any statistical correlation” between the contents of a credit report and job performance.

But wait, let’s see what the CEO of Fair Isaac Co, one of the companies creating the scores, says about his new system:

“We know what you’re going to do tomorrow”

This is not well aligned with the fourth part of the Modeler’s Hippocratic Oath (MHO). The article goes on to expose some of the questionable morality that stems from such models:

Use of credit histories also raises concerns about racial discrimination, because studies show blacks and Hispanics, on average, have lower credit scores than non-Hispanic whites. The U.S. Equal Employment Opportunity Commission filed suit last December against the Kaplan Higher Education unit of Washington Post Co., claiming it discriminated against black employees and applicants by using credit-based screens that were “not job-related.”

Let me make the argument for these models before I explain why I think they’re flawed.

First, in terms of the credit card information, you should all be glad that the ads coming to us online are so beautifully tailored to your needs and desires- it’s so convenient, almost like someone read your mind and anticipated you’d be needing more vacuum cleaner bags at just the right time! And in terms of the scoring, it’s also very convenient that people and businesses somehow know to trust you, know that you’ve been raised with good (firm) middle-class values and ethics. You don’t have to argue my way into a new credit card or a car purchase, because the model knows you’re good for it. Okay, I’m done.

The flip side of this is that, if you don’t happen to look good to the models, you are funneled into a shitty situation, where you will continue to look bad. It’s a game of chutes and ladders, played on an enormous scale.

[If there’s one thing about big data that we all need to understand, it’s the enormous scale of these models.]

Moreover, this kind of cyclical effect will actually decrease the apparent error of the models: this is because if we forecast you as being uncredit-worthy, and your life sucks from now on and you have trouble getting a job or a credit card and when you do you have to pay high fees, then you are way more likely to be a credit risk in the future.

One last word about errors: it’s always scary to see someone on the one hand admit that the forecasting abilities of a model may be weak, but on the other hand say things like “we know what you’re going to do tomorrow”. It’s a human nature thing to want something to work better than it does, and that’s why we need the IMO (especially the fifth part).

This all makes me think of the movie Blade Runner, with its oppressive sense of corporate control, where the seedy underground economy of artificial eyeballs was the last place on earth you didn’t need to show ID. There aren’t any robots to kill (yet) but I’m getting the feeling more and more that we are sorting people at birth, or soon after, to be winners or losers in this culture.

Of course, collecting information about people isn’t new. Why am I all upset about it? Here are a few reasons, which I will expand on in another post:

  1. There’s way more information about people nowadays than their Social Security Number; the field of consumer information gathering is huge and growing exponentially
  2. All of those quants who left Wall Street are now working in data science and have real skills (myself included)
  3. They also typically don’t have any qualms; they justify models like this by saying, hey we’re just using correlations, we’re not forcing people to behave well or badly, and anyway if I don’t make this model someone else will
  4. The real bubble is this: thinking these things work, and advocating their bulletproof convenience and profitability (in the name of mathematics)
  5. Who suffers when these models fail? Answer: not the corporations that use them, but rather the invisible people who are designated as failures.

Bigger Data Isn’t Always Better Data

My newest piece on Bloomberg:

Bigger Data Isn’t Always Better Data

Categories: Uncategorized

Links about big bad data

There have been a lot of great articles recently on my beat, the dark side of big data. I wanted to share some of them with you today:

  1. An interview with Cynthia Dwork by Clair Cain Miller (h/t Marc Sobel). Describes how fairness is not automatic in algorithms, and the somewhat surprising fact that, in order to make sure an algorithm isn’t racist, for example, you must actually take race into consideration when testing it.
  2. How Google Could Rig the 2016 Election by Robert Epstein (h/t Ernie Davis). This describes the unreasonable power of search rank in terms political trust. Namely, when a given candidate was artificially lifted in terms of rank, people started to trust them more. Google’s meaningless response: “Providing relevant answers has been the cornerstone of Google’s approach to search from the very beginning. It would undermine the people’s trust in our results and company if we were to change course.”
  3. Big Data, Machine Learning, and the Social Sciences: Fairness, Accountability, and Transparency by Hannah Wallach (h/t Arnaud Sahuguet). She addresses the need for social scientists to work alongside computer scientists when working with human behavior data, as well as a prioritization on the question rather than data availability. She also promotes the idea of including a concept of uncertainty when possible.
  4. How Big Data Is Unfair by Moritz Hardt. This isn’t new but it is a fantastic overview of fairness issues in big data, specifically how data mining techniques deal with minority groups.
  5. How Social Bias Creeps Into Web Technology by Elizabeth Dwoskin (h/t Ernie Davis). Unfortunately behind the pay wall, this article talks about negative unintended consequences of data mining.
  6. A somewhat different topic but great article, The MOOC revolution that wasn’t, by Audrey Watters (h/t Ernie Davis). This article traces the fall of the mighty MOOC ideals. Best quote in the article: “High failure rates and dropouts are features, not bugs,” Caulfield suggests, “because they represent a way to thin pools of applicants for potential employers.”
Categories: Uncategorized

How Big Pharma Cooks Data: The Case of Vioxx and Heart Disease

This is cross posted from Naked Capitalism.

Yesterday I caught a lecture at Columbia given by statistics professor David Madigan, who explained to us the story of Vioxx and Merck. It’s fascinating and I was lucky to get permission to retell it here.


Madigan has been a paid consultant to work on litigation against Merck. He doesn’t consider Merck to be an evil company by any means, and says it does lots of good by producing medicines for people. According to him, the following Vioxx story is “a line of work where they went astray”.

Yet Madigan’s own data strongly suggests that Merck was well aware of the fatalities resulting from Vioxx, a blockbuster drug that earned them $2.4b in 2003, the year before it “voluntarily” pulled it from the market in September 2004. What you will read below shows that the company set up standard data protection and analysis plans which they later either revoked or didn’t follow through with, they gave the FDA misleading statistics to trick them into thinking the drug was safe, and set up a biased filter on an Alzheimer’s patient study to make the results look better. They hoodwinked the FDA and the New England Journal of Medicine and took advantage of the public trust which ultimately caused the deaths of thousands of people.

The data for this talk came from published papers, internal Merck documents that he saw through the litigation process, FDA documents, and SAS files with primary data coming from Merck’s clinical trials. So not all of the numbers I will state below can be corroborated, unfortunately, due to the fact that this data is not all publicly available. This is particularly outrageous considering the repercussions that this data represents to the public.


The process for getting a drug approved is lengthy, requires three phases of clinical trials before getting FDA approval, and often takes well over a decade. Before the FDA approved Vioxx, less than 20,000 people tried the drug, versus 20,000,000 people after it was approved. Therefore it’s natural that rare side effects are harder to see beforehand. Also, it should be kept in mind that for the sake of clinical trials, they choose only people who are healthy outside of the one disease which is under treatment by the drug, and moreover they only take that one drug, in carefully monitored doses. Compare this to after the drug is on the market, where people could be unhealthy in various ways and could be taking other drugs or too much of this drug.

Vioxx was supposed to be a new “NSAID” drug without the bad side effects. NSAID drugs are pain killers like Aleve and ibuprofen and aspirin, but those had the unfortunate side effects of gastro-intestinal problems (but those are only among a subset of long term users, such as people who take painkillers daily to treat chronic pain, such as people with advanced arthritis). The goal was to find a pain-killer without the GI side effects. The underlying scientific goal was to find a COX-2 inhibitor without the COX-1 inhibition, since scientists had realized in 1991 that COX-2 suppression corresponded to pain relief whereas COX-1 suppression corresponded to GI problems.

Vioxx introduced and withdrawn from the market

The timeline for Vioxx’s introduction to the market was accelerated: they started work in 1991 and got approval in 1999. They pulled Vioxx from the market in 2004 in the “best interest of the patient”. It turned out that it caused heart attacks and strokes. The stock price of Merck plummeted and $30 billion of its market cap was lost. There was also an avalanche of lawsuits, one of the largest resulting in a $5 billion settlement which was essentially a victory for Merck, considering they made a profit of $10 billion on the drug while it was being sold.

The story Merck will tell you is that they “voluntarily withdrew” the drug on September 30, 2004. In a placebo-controlled study of colon polyps in 2004, it was revealed that over a time period of 1200 days, 4% of the Vioxx users suffered a “cardiac, vascular, or thoracic event” (CVT event), which basically means something like a heart attack or stroke, whereas only 2% of the placebo group suffered such an event. In a group of about 2400 people, this was statistically significant, and Merck had no choice but to pull their drug from the market.

It should be noted that, on the one hand Merck should be applauded for checking for CVT events on a colon polyps study, but on the other hand that in 1997, at the International Consensus Meeting on COX-2 Inhibition, a group of leading scientists issued a warning in their Executive Summary that it was “… important to monitor cardiac side effects with selective COX-2 inhibitors”. Moreover, in an internal Merck email as early as 1996, it was stated there was a “… substantial chance that CVT will be observed.” In other words, Merck knew to look out for such things. Importantly, however, there was no subsequent insert in the medicine’s packaging that warned of possible CVT side-effects.

What the CEO of Merck said

What did Merck say to the world at that point in 2004? You can look for yourself at the four and half hour Congressional hearing (seen on C-SPAN) which took place on November 18, 2004. Starting at 3:27:10, the then-CEO of Merck, Raymond Gilmartin, testifies that Merck “puts patients first” and “acted quickly” when there was reason to believe that Vioxx was causing CVT events. Gilmartin also went on the Charlie Rose show and repeated these claims, even go so far as stating that the 2004 study was the first time they had a study which showed evidence of such side effects.

How quickly did they really act though? Were there warning signs before September 30, 2004?

Arthritis studies

Let’s go back to the time in 1999 when Vioxx was FDA approved. In spite of the fact that it was approved for a rather narrow use, mainly for arthritis sufferers who needed chronic pain management and were having GI problems on other meds (keeping in mind that Vioxx was way more expensive than ibuprofen or aspirin, so why would you use it unless you needed to), Merck nevertheless launched an ad campaign with Dorothy Hamill and spent $160m (compare that with Budweiser which spent $146m or Pepsi which spent $125m in the same time period).

As I mentioned, Vioxx was approved faster than usual. At the time of its approval, the completed clinical studies had only been 6- or 12-week studies; no longer term studies had been completed. However, there was one underway at the time of approval, namely a study which compared Aleve with Vioxx for people suffering from osteoarthritis and rheumatoid arthritis.

What did the arthritis studies show? These results, which were available in late 2003, showed that the CVT events were more than twice as likely with Vioxx as with Aleve (CVT event rates of 32/1304 = 0.0245 with Vioxx, 6/692 = 0.0086 with Aleve, with a p-value of 0.01). As we see this is a direct refutation of the fact that CEO Gilmartin stated that they didn’t have evidence until 2004 and acted quickly when they did.

In fact they had evidence even before this, if they bothered to put it together (in fact they stated a plan to do such statistical analyses but it’s not clear if they did them- or in any case there’s so far no evidence that they actually did these promised analyses).

In a previous study (“Table 13”), available in February of 2002, the could have seen that, comparing Vioxx to placebo, we saw a CVT event rate of 27/1087 = 0.0248 with Vioxx versus 5/633 = 0.0079 with placebo, with a p-value of 0.01. So, three times as likely.

In fact, there was an even earlier study (“1999 plan”), results of which were available in July of 2000, where the Vioxx CVT event rate was 10/427 = 0.0234 versus a placebo event rate of 1/252 = 0.0040, with a p-value of 0.05 (so more than 5 times as likely). This p-value can be taken to be the definition of statistically significant. So actually they knew to be very worried as early as 2000, but maybe they… forgot to do the analysis?

The FDA and pooled data

Where was the FDA in all of this?

They showed the FDA some of these numbers. But they did something really tricky. Namely, they kept the “osteoarthritis study” results separate from the “rheumatoid arthritis study” results. Each alone were not quite statistically significant, but together were amply statistically significant. Moreover, they introduced a third category of study, namely the “Alzheimer’s study” results, which looked pretty insignificant (more on that below though). When you pooled all three of these study types together, the overall significance was just barely not there.

It should be mentioned that there was no apparent reason to separate the different arthritic studies, and there is evidence that they did pool such study data in other places as a standard method. That they didn’t pool those studies for the sake of their FDA report is incredibly suspicious. That the FDA didn’t pick up on this is probably due to the fact that they are overworked lawyers, and too trusting on top of that. That’s unfortunately not the only mistake the FDA made (more below).

Alzheimer’s Study

So the Alzheimer’s study kind of “saved the day” here. But let’s look into this more. First, note that the average age of the 3,000 patients in the Alzheimer’s study was 75, it was a 48-month study, and that the total number of deaths for those on Vioxx was 41 versus 24 on placebo. So actually on the face of it it sounds pretty bad for Vioxx.

There were a few contributing reasons why the numbers got so mild by the time the study’s result was pooled with the two arthritis studies. First, when really old people die, there isn’t always an autopsy. Second, although there was supposed to be a DSMB as part of the study, and one was part of the original proposal submitted to the FDA, this was dropped surreptitiously in a later FDA update. This meant there was no third party keeping an eye on the data, which is not standard operating procedure for a massive drug study and was a major mistake, possibly the biggest one, by the FDA.

Third, and perhaps most importantly, Merck researchers created an added “filter” to the reported CVT events, which meant they needed the doctors who reported the CVT event to send their info to the Merck-paid people (“investigators”), who looked over the documents to decide whether it was a bonafide CVT event or not. The default was to assume it wasn’t, even though standard operating procedure would have the default assuming that there was such an event. In all, this filter removed about half the initially reported CVT events, and about twice as often the Vioxx patients had their CVT event status revoked as for the placebo patients. Note that the “investigator” in charge of checking the documents from the reporting doctors is paid $10,000 per patient. So presumably they wanted to continue to work for Merck in the future.

The effect of this “filter” was that, instead of it seeming 1.5 times as likely to have a CVT event if you were taking Voixx, it seemed like it was only 1.03 as likely, with a high p-score.

If you remove the ridiculous filter from the Alzheimer’s study, then you see that as of November 2000 there was statistically significant evidence that Vioxx caused CVT events in Alzheimer patients.

By the way, one extra note. Many of the 41 deaths in the Vioxx group were dismissed as “bizarre” and therefore unrelated to Vioxx. Namely, car accidents, falling of ladders, accidentally eating bromide pills. But at this point there’s evidence that Vioxx actually accelerates Alzheimer’s disease itself, which could explain those so-called bizarre deaths. This is not to say that Merck knew that, but rather that one should not immediately dismiss the concept of statistically significant just because it doesn’t make intuitive sense.

VIGOR and the New England Journal of Medicine 

One last chapter in this sad story. There was a large-scale study, called the VIGOR study, with 8,000 patients. It was published in the New England Journal of Medicine on November 23, 2000. See also this NPR timeline for details. They didn’t show the graphs which would have emphasized this point, but they admitted, in a deceptively round-about way, that Vioxx has 4 times the number of CVT events than Aleve. They hinted that this is either because Aleve is protective against CVT events or that Vioxx is bad for it, but left it open.

But Bayer, which owns Aleve, issued a press release saying something like, “if Aleve is protective for CVT events then it’s news to us.” Bayer, it should be noted, has every reason to want people to think that Aleve is protective against CVT events. This problem, and the dubious reasoning explaining it away, was completely missed by the peer review system; if it had been spotted, Vioxx would have been forced off the market then and there. Instead, Merck purchased 900,000 preprints of this article from the NE Journal of Medicine, which is more than the number of practicing doctors in the U.S.. In other words, the Journal was used as a PR vehicle for Merck.

The paper emphasized that Aleve has twice the rate of ulcers and bleeding, at 4%, whereas Vioxx had a rate of only 2% among chronic users. When you compare that to the elevated rate of heart attack and death (0.4% to 1.2%) of Vioxx over Aleve, though, the reduced ulcer rate doesn’t seem all that impressive.

A bit more color on this paper. It was written internally by Merck, after which non-Merck authors were found. One of them is Loren Laine. Loren helped Merck develop a sound-bite interview which was 30 seconds long and was sent to the news media and run like a press interview, even though it actually happened in Merck’s New Jersey office (with a backdrop to look like a library) with a Merck employee posing as a neutral interviewer. Some smart lawyer got the outtakes of this video made available as part of the litigation against Merck. Check out this youtube video, where Laine and the fake interviewer scheme about spin and Laine admits they were being “cagey” about the renal failure issues that were poorly addressed in the article.

The damage done

Also on the Congress testimony I mentioned above is Dr. David Graham, who speaks passionately from minute 41:11 to minute 53:37 about Vioxx and how it is a symptom of a broken regulatory system. Please take 10 minutes to listen if you can.

He claims a conservative estimate is that 100,000 people have had heart attacks as a result of using Vioxx, leading to between 30,000 and 40,000 deaths (again conservatively estimated). He points out that this 100,000 is 5% of Iowa, and in terms people may understand better, this is like 4 aircraft falling out of the sky every week for 5 years.

According to this blog, the noticeable downwards blip in overall death count nationwide in 2004 is probably due to the fact that Vioxx was taken off the market that year.


Let’s face it, nobody comes out looking good in this story. The peer review system failed, the FDA failed, Merck scientists failed, and the CEO of Merck misled Congress and the people who had lost their husbands and wives to this damaging drug. The truth is, we’ve come to expect this kind of behavior from traders and bankers, but here we’re talking about issues of death and quality of life on a massive scale, and we have people playing games with statistics, with academic journals, and with the regulators.

Just as the financial system has to be changed to serve the needs of the people before the needs of the bankers, the drug trial system has to be changed to lower the incentives for cheating (and massive death tolls) just for a quick buck. As I mentioned before, it’s still not clear that they would have made less money, even including the penalties, if they had come clean in 2000. They made a bet that the fines they’d need to eventually pay would be smaller than the profits they’d make in the meantime. That sounds familiar to anyone who has been following the fallout from the credit crisis.

One thing that should be changed immediately: the clinical trials for drugs should not be run or reported on by the drug companies themselves. There has to be a third party which is in charge of testing the drugs and has the power to take the drugs off the market immediately if adverse effects (like CVT events) are found. Hopefully they will be given more power than risk firms are currently given in finance (which is none)- in other words, it needs to be more than reporting, it needs to be an active regulatory power, with smart people who understand statistics and do their own state-of-the-art analyses – although as we’ve seen above even just Stats 101 would sometimes do the trick.

Categories: data science, news

Using Data Science to do Good: A Conversation

This is a guest post by Roger Stanev and Chris French. Roger Stanev is a data scientist and lecturer at the University of Washington. His work focuses on ethical and epistemic issues concerning the nature and application of statistical modeling and inference, and relationship between science and democracy. Chris French is a data science enthusiast, and an advocate for social justice. He’s worked on the history of statistics and probability, and writes science fiction in his spare time.

Calling Data Scientists, Data Science Enthusiasts, and Advocates for Civic Liberties and Social Justice. Please join us for an information and preliminary discussion about how Data Science can be used to do Good!

Throughout Seattle/Tacoma, the state of Washington and the other forty-nine states in America, many non-profit organizations promote causes that are vital to the health, safety and humanity of our friends, families and communities. For the next several years, these social and civic groups will need all the help they can get to resist the increase of fear and hatred – of racism, sexism, xenophobia and bigotry – in our country.

Data Scientists have a unique skill set. They are trained to transform vague and difficult questions – typically questions about human behavior – into empirical, solvable problems.

So here is the question we want to have a conversation about: How can Data Scientists & IT Professionals use their expertise to help answer the current human questions which social and policy-based organizations are currently struggling to address?

What problems will minority and other vulnerable communities face in the coming years? What resources, tools and activities are currently being employed to address these questions? What can data science do, if anything, to help address these questions? Do data scientists or computer professionals have an obligation to assist in promoting social justice? What can we, as data scientists, do to help add and expand the digital tool-belt for these non-profit organizations?

If you’d like to join the conversation, RSVP to

Saturday, January 14
11am to 1pm @ King County Library (Lake Forest)
17171 Bothell Way NE, Lake Forest Park, WA 98155

Saturday, January 21
11am to 1pm @ Tacoma Public Library
1102 Tacoma Ave S, Tacoma, WA 98402

Saturday, January 28
1 to 3pm @ Seattle Public Library (Capitol Hill)
425 Harvard Ave E, Seattle, WA 98102

Categories: Uncategorized