Archive for the ‘data science’ Category

Critical Questions for Big Data by danah boyd & Kate Crawford

I’m teaching a class this summer in the Lede Program, starting in mid-July, which is called The Platform. Here’s the course description:

This course begins with the idea that computing tools are the products of human ingenuity and effort. They are never neutral and carry with them the biases of their designers and their design process. “Platform studies” is a new term used to describe investigations into these relationships between computing technologies and the creative or research products that they help to generate. Understanding how data, code, and algorithms affect creative practices can be an effective first step toward critical thinking about technology. This will not be purely theoretical, however, and specific case studies, technologies, and project work will make the ideas concrete.

Since my first class is coming soon, I’m actively thinking about what to talk about and which readings to assign. I’ve got wonderful guest lecturers coming, and for the most part the class will focus on those guest lecturers and their topics, but for the first class I want to give them an overview of a very large subject.

I’ve decided that danah boyd and Kate Crawford’s recent article, Critical Questions for Big Data, is pretty much perfect for this goal. I’ve read and written a lot about big data but even so I’m impressed by how clearly and comprehensively they have laid out their provocations. And although I’ve heard many of the ideas and examples before, some of them are new to me, and are directly related to the theme of the class, for example:

Twitter and Facebook are examples of Big Data sources that offer very poor archiving and search functions. Consequently, researchers are much more likely to focus on something in the present or immediate past – tracking reactions to an election, TV finale, or natural disaster – because of the sheer difficulty or impossibility of accessing older data.

Of course the students in the Lede are journalists, not the academic researchers the article mostly addresses, and they are not necessarily working with big data per se. Even so, they are increasingly working with social media data, and they will probably be covering big data even if they don’t directly analyze it. So I think the article is still relevant to them. Or, to put it another way: one thing we will attempt to do in class is examine the extent to which these provocations are relevant to them.

Here’s another gem, directly related to the Facebook experiment I discussed yesterday:

As computational scientists have started engaging in acts of social science, there is a tendency to claim their work as the business of facts and not interpretation. A model may be mathematically sound, an experiment may seem valid, but as soon as a researcher seeks to understand what it means, the process of interpretation has begun. This is not to say that all interpretations are created equal, but rather that not all numbers are neutral.

In fact, what with this article and that case study, I’m pretty much set for my first day, after combining them with a discussion of the students’ projects and some related statistical experiments.

I also hope to invite at least one of the authors to come talk to the class, although I know they are both incredibly busy. danah boyd, who recently came out with a book called It’s Complicated: The Social Lives of Networked Teens, also runs the Data & Society Research Institute, a NYC-based think/do tank focused on social, cultural, and ethical issues arising from data-centric technological development. I’m hoping she comes and talks about the work she’s starting up there.

Thanks for a great case study, Facebook!

I’m super excited about the recent “mood study” that was done on Facebook. It constitutes a great case study on data experimentation that I’ll use for my Lede Program class when it starts mid-July. It was first brought to my attention by one of my Lede Program students, Timothy Sandoval.

My friend Ernest Davis at NYU has a page of handy links to big data articles, and at the bottom (for now) there are a bunch of links about this experiment. For example, this one by Zeynep Tufekci does a great job outlining the issues, and this one by John Grohol burrows into the research methods. Oh, and here’s the original research article that’s upset everyone.

It’s got everything a case study should have: ethical dilemmas, questionable methodology, sociological implications, and dubious claims, not to mention a whole bunch of media attention and dissection.

By the way, if I sound gleeful, it’s partly because I know this kind of experiment happens on a daily basis at a place like Facebook or Google. What’s special about this experiment isn’t that it happened, but that we get to see the data. And the response to the critiques might be, sadly, that we never get another chance like this, so we have to grab the opportunity while we can.

The dark matter of big data

A tiny article recently published in The Cap Times (hat tip Jordan Ellenberg) describes a big data model that claims to help filter and rank school teachers based on their ability to raise student test scores. I guess it’s a kind of pre-VAM filtering system, and if you thought it was hard to imagine a more vile model than the VAM, here you go. The article mentions that the Madison School Board was deliberating whether to spend $273K on this model.

One of the teachers in the district wrote up her concerns about the model on her blog, there was a debate at the school board meeting, and a journalist covered that meeting, so we know about it. But it was a close call, and this one could easily have slipped under the radar, or at least my radar.

Even so, now I know about it, and once I looked at the website of the company promoting this model, I found links to an article naming one of its customers: the Charlotte-Mecklenburg School District in North Carolina. The company claims it only filters applications with its tool; it doesn’t make hiring decisions. Cold comfort for the people whose applications get removed by some random black-box algorithm.

I wonder how many of the teachers applying to that district knew their application was being filtered through such a model? I’m going to guess none. For that matter, there are all sorts of application screening algorithms being regularly used of which applicants are generally unaware.

It’s just one example of the dark matter of big data. And by that I mean the enormous and growing clusters of big data models that are only inadvertently detectable by random small-town or small-city budget meeting journalism, or word-of-mouth reports coming out of conferences or late-night drinking parties with VCs.

The vast majority of big data dark matter is still there in the shadows. You can only guess at its existence and its usage. Since the models themselves are proprietary, and are generally deployed secretly, there’s no reason for the public to be informed.

Let me give you another example, this time speculative, but not at all unlikely.

Namely, big data health models built from data generated by the quantified-self movement. This recent Wall Street Journal article entitled Can Data From Your Fitbit Transform Medicine? articulated the issue nicely:

A recent review of 43 health- and fitness-tracking apps by the advocacy group Privacy Rights Clearinghouse found that roughly one-third of apps tested sent data to a third party not disclosed by the developer. One-third of the apps had no privacy policy. “For us, this is a big trust issue,” said Kaiser’s Dr. Young.

Consumer wearables fall into a regulatory gray area. Health-privacy laws that prevent the commercial use of patient data without consent don’t apply to the makers of consumer devices. “There are no specific rules about how those vendors can use and share data,” said Deven McGraw, a partner in the health-care practice at Manatt, Phelps, and Phillips LLP.

The key is that phrase “regulatory gray area”; it should make you think “big data dark matter lives here”.

When you have unprotected data that can be used as a proxy for HIPAA-protected medical data, there’s no reason it won’t be. So anyone who stands to benefit from knowing health-related information about you – think future employers who might help pay for future insurance claims – will be interested in using big data dark matter models gleaned from this kind of unregulated data.

To be sure, most people who wear Fitbits nowadays are athletic, trying to improve their 5K run times. But the article explained that the medical profession is on the verge of suggesting that a much larger population of patients use such devices. So it could get ugly real fast.

Secret big data models aren’t new, of course. I remember a friend of mine working for a credit card company a few decades ago. Her job was to model which customers to offer subprime credit cards to, and she was specifically told to target those customers who would end up paying the most in fees. But it’s become much much easier to do this kind of thing with the proliferation of so much personal data, including social media data.

I’m interested in the dark matter, partly as research for my book, and I’d appreciate help from my readers in trying to spot it when it pops up. For example, I remember being told that a certain kind of online credit score is used to keep people on hold for customer service longer, but now I can’t find a reference to it anywhere. We should really compile a list of what we can see at the boundaries of this dark matter. Please help! And if you don’t feel comfortable commenting, my email address is on the About page.

The business of big data audits: monetizing fairness

I gave a talk to the invitation-only NYC CTO Club a couple of weeks ago about my fears about big data modeling, namely:

  • that big data modeling is discriminatory,
  • that big data modeling increases inequality, and
  • that big data modeling threatens democracy.

I had three things on my “to do” list for the audience of senior technologists, namely:

  • test internal, proprietary models for discrimination,
  • help regulators like the CFPB develop reasonable audits, and
  • get behind certain models being transparent and publicly accessible, including credit scoring, teacher evaluations, and political messaging models.

Given the provocative nature of my talk, I was pleasantly surprised by the positive reception I was given. Those guys were great – interactive, talkative, and very thoughtful. I think it helped that I wasn’t trying to sell them something.

Even so, I shouldn’t have been surprised when one of them followed up with me to talk about a possible business model for “fairness audits.” The idea is that, what with the recent bad press about discrimination in big data modeling (some of the audience had actually worked with the Podesta team), there will likely be a business advantage to being able to claim that your models are fair. So someone should develop those tests that companies can take. Quick, someone, monetize fairness!

One reason I think this might actually work – and, more importantly, be useful – is that I focused on “effects-based” discrimination, which is to say testing a model by treating it as a black box and seeing whether different inputs lead to systematically different outputs. In other words, I want to give a resume-sorting algorithm different resumes with similar qualifications but different races. An algorithmically induced randomized experiment, if you will.
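To make that concrete, here’s a minimal sketch of what such an effects-based audit could look like. Everything in it is hypothetical – `score_resume` stands in for whatever proprietary model is under test, and `fake_black_box` is a made-up placeholder – but the point is that the audit only compares outputs across paired inputs that differ in a protected attribute, without ever opening the box.

```python
import statistics

def audit_paired_resumes(score_resume, resume_pairs):
    """Effects-based audit of a black-box resume scorer.

    score_resume is the opaque model under test; resume_pairs is a list of
    (resume_a, resume_b) dicts that are identical except for a protected
    attribute (here, a name used as a race proxy). We only compare outputs.
    """
    gaps = [score_resume(a) - score_resume(b) for a, b in resume_pairs]
    return {
        "mean_gap": statistics.mean(gaps),
        "share_favoring_a": sum(g > 0 for g in gaps) / len(gaps),
    }

# Hypothetical usage with a made-up black box and otherwise-identical resumes.
def fake_black_box(resume):
    # Stand-in for a proprietary scorer; a real audit would call the vendor's system.
    return 0.9 if resume["name"] == "Emily" else 0.6

pairs = [({"name": "Emily", "years_experience": 5},
          {"name": "Lakisha", "years_experience": 5})]
print(audit_paired_resumes(fake_black_box, pairs))
# A large, consistent gap across many such pairs is evidence of effects-based discrimination.
```

A regulator like the CFPB could run exactly this kind of paired-input test without ever seeing the model’s internals.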

From the business perspective, a test that allows a model to remain a black box feels safe, because it does not require true transparency, and allows the “secret sauce” to remain secret.

One thing, though. I don’t think it makes much sense to have a proprietary model for fairness auditing. In fact, the way I was imagining this was to develop an open-source audit model that the CFPB could use. What I don’t want, and what would be worse than nothing, is for some private company to develop a proprietary “fairness audit” model that we cannot trust but that claims to solve the very real problems listed above.

Update: something like this is already happening for privacy compliance in the big data world (hat tip David Austin).

Inside the Podesta Report: Civil Rights Principles of Big Data

I finished reading Podesta’s Big Data Report to Obama yesterday, and I have to say I was pretty impressed. I credit some special people who got involved with the research for the report, like danah boyd, Kate Crawford, and Frank Pasquale, for supplying thoughtful examples and research that the authors were unable to ignore. I also want to thank whoever got the authors together with the civil rights groups that created the Civil Rights Principles for the Era of Big Data:

  1. Stop High-Tech Profiling. New surveillance tools and data gathering techniques that can assemble detailed information about any person or group create a heightened risk of profiling and discrimination. Clear limitations and robust audit mechanisms are necessary to make sure that if these tools are used it is in a responsible and equitable way.
  2. Ensure Fairness in Automated Decisions. Computerized decisionmaking in areas such as employment, health, education, and lending must be judged by its impact on real people, must operate fairly for all communities, and in particular must protect the interests of those that are disadvantaged or that have historically been the subject of discrimination. Systems that are blind to the preexisting disparities faced by such communities can easily reach decisions that reinforce existing inequities. Independent review and other remedies may be necessary to assure that a system works fairly.
  3. Preserve Constitutional Principles. Search warrants and other independent oversight of law enforcement are particularly important for communities of color and for religious and ethnic minorities, who often face disproportionate scrutiny. Government databases must not be allowed to undermine core legal protections, including those of privacy and freedom of association.
  4. Enhance Individual Control of Personal Information. Personal information that is known to a corporation — such as the moment-to-moment record of a person’s movements or communications — can easily be used by companies and the government against vulnerable populations, including women, the formerly incarcerated, immigrants, religious minorities, the LGBT community, and young people. Individuals should have meaningful, flexible control over how a corporation gathers data from them, and how it uses and shares that data. Non-public information should not be disclosed to the government without judicial process.
  5. Protect People from Inaccurate Data. Government and corporate databases must allow everyone — including the urban and rural poor, people with disabilities, seniors, and people who lack access to the Internet — to appropriately ensure the accuracy of personal information that is used to make important decisions about them. This requires disclosure of the underlying data, and the right to correct it when inaccurate.

This was signed off on by multiple civil rights groups listed here, and it’s a great start.

One thing I was not impressed by: the only time the report mentioned finance was to say that, in finance, they are using big data to combat fraud. In other words, finance was kind of seen as an industry standing apart from big data, and using big data frugally. This is not my interpretation.

In fact, I see finance as having given birth to big data. Many of the mistakes we are making as modelers in the big data era – the mistakes that make the Civil Rights Principles above necessary – were made first in finance. Those modeling errors – and, when not errors, politically intentional and odious models – were a huge reason we first had mortgage-backed securities rated AAA and then the ensuing financial crisis.

In fact, finance should have stood in the report as a worst-case scenario.

One last thing. The recommendations coming out of the Podesta report are lukewarm and are even contradicted by the contents of the report, as I complained about here. That’s interesting, and it shows that politics played a large part in what the authors could include as acceptable recommendations to the Obama administration.

Categories: data science, modeling

No, Sandy Pentland, let’s not optimize the status quo

It was bound to happen. Someone was inevitably going to have to write this book, entitled Social Physics, and now someone has just up and done it. Namely, Alex “Sandy” Pentland, data scientist evangelist, director of MIT’s Human Dynamics Laboratory, and co-founder of the MIT Media Lab.

A review by Nicholas Carr

This article, entitled The Limits of Social Engineering, published in MIT’s Technology Review and written by Nicholas Carr (hat tip Billy Kaos), is more or less a review of the book. From the article:

Pentland argues that our greatly expanded ability to gather behavioral data will allow scientists to develop “a causal theory of social structure” and ultimately establish “a mathematical explanation for why society reacts as it does” in all manner of circumstances. As the book’s title makes clear, Pentland thinks that the social world, no less than the material world, operates according to rules. There are “statistical regularities within human movement and communication,” he writes, and once we fully understand those regularities, we’ll discover “the basic mechanisms of social interactions.”

By collecting all the data – credit card records, sensor readings, cell phones that can pick up your moods, and so on – Pentland seems to think we can put the science into the social sciences. He thinks we can predict a person the way we now predict planetary motion.

OK, let’s just take a pause here to say: eeeew. How invasive does that sound? And how insulting is its premise? But wait, it gets way worse.

The next thing Pentland wants to do is use micro-nudges to affect people’s actions – paying them to act a certain way, exerting social and peer pressure. It’s like Nudge in overdrive.

Vomit. But also not the worst part.

Here’s the worst part about Pentland’s book, from the article:

Ultimately, Pentland argues, looking at people’s interactions through a mathematical lens will free us of time-worn notions about class and class struggle. Political and economic classes, he contends, are “oversimplified stereotypes of a fluid and overlapping matrix of peer groups.” Peer groups, unlike classes, are defined by “shared norms” rather than just “standard features such as income” or “their relationship to the means of production.” Armed with exhaustive information about individuals’ habits and associations, civic planners will be able to trace the full flow of influences that shape personal behavior. Abandoning general categories like “rich” and “poor” or “haves” and “have-nots,” we’ll be able to understand people as individuals—even if those individuals are no more than the sums of all the peer pressures and other social influences that affect them.

Kill. Me. Now.

The good news is that the author of the article, Nicholas Carr, doesn’t buy it, and makes all sorts of reasonable complaints about this theory, like privacy concerns, and structural sources of society’s ills. In fact Carr absolutely nails it (emphasis mine):

Pentland may be right that our behavior is determined largely by social norms and the influences of our peers, but what he fails to see is that those norms and influences are themselves shaped by history, politics, and economics, not to mention power and prejudice. People don’t have complete freedom in choosing their peer groups. Their choices are constrained by where they live, where they come from, how much money they have, and what they look like. A statistical model of society that ignores issues of class, that takes patterns of influence as givens rather than as historical contingencies, will tend to perpetuate existing social structures and dynamics. It will encourage us to optimize the status quo rather than challenge it.

How to see how dumb this is in two examples

This brings to mind examples of models that do or do not combat sexism.

First, the orchestra audition example: in order to avoid nepotism, orchestras started having candidates audition from behind a screen. The result has been way more women in orchestras.

This is a model, even if it’s not a big data model. It is the “orchestra audition” model, and the most important thing about this example is that they defined success very carefully and made it all about one thing: sound. They decided to define the requirement for the job as “makes good-sounding music,” and they decided that other information, like how the candidate looks, would by definition not be used. It is explicitly non-discriminatory.

By contrast, let’s think about how most big data models work. They take historical information about successes and failures and automate it – rather than challenging the past definition of success and making it deliberately fair, they are, if anything, codifying discriminatory practices.

My standard made-up example of this is close to the kind of thing actually happening and being evangelized in big data: a resume-sorting model that helps out HR. Using historical training data, this model notices that women haven’t fared well historically at the made-up company as computer programmers – they often leave after only six months and they never get promoted. The model will interpret that to mean they are bad employees, without ever looking into structural causes, and as a result of this historical data it will discard women’s resumes. Yay, big data!
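To see how that codification happens, here’s a toy simulation – the company, the numbers, and the promotion rule are all invented – in which the historical labels already encode a structural bias against women, and a model that simply learns from those labels reproduces it.

```python
import random

random.seed(0)

def make_historical_record():
    """One made-up historical employee record with a biased outcome."""
    gender = random.choice(["F", "M"])
    skill = random.random()
    # Invented history: promotion depended on skill AND on being male,
    # standing in for the structural problems described above.
    promoted = skill > 0.5 and (gender == "M" or random.random() < 0.2)
    return {"gender": gender, "skill": skill, "promoted": promoted}

history = [make_historical_record() for _ in range(10_000)]

# "Train" the simplest possible model: score applicants by the historical
# promotion rate of people who share their gender.
rate = {
    g: sum(r["promoted"] for r in history if r["gender"] == g)
       / sum(1 for r in history if r["gender"] == g)
    for g in ("F", "M")
}
print(rate)  # women score lower purely because the training history was biased
```

The model isn’t measuring anyone’s ability; it’s faithfully echoing the biased history it was trained on, which is exactly the problem with automating past definitions of success.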

Thanks, Pentland

I’m kind of glad Pentland has written such an awful book, because it gives me an enemy to rail against in this big data hype world. I don’t think most people are as far along the “big data will solve all our problems” spectrum as he is, but he and his book present a convenient target. And it honestly cannot surprise anyone that he is a successful white dude, given that he’s the one arguing big data will optimize the status quo if we’d just all wear sensors to work and to bed.

Categories: data science, modeling, rant

Great news: InBloom is shutting down

I’m trying my hardest to resist talking about Piketty’s Capital because I haven’t read it yet, even though I’ve read a million reviews and discussions about it, and I saw him a couple of weeks ago on a panel with my buddy Suresh Naidu. Suresh, who was great on the panel, wrote up his notes here.

So I’ll hold back from talking directly about Piketty, but let me talk about one of Suresh’s big points that was inspired in part by Piketty: namely, that it’s a great time to be rich. It’s even better to be rich now than it was in the past, even at times when rates of inequality were similar. Why? Because so many things have become commodified. Here’s how Suresh puts it:

We live in a world where much more of everyday life occurs on markets, large swaths of extended family and government services have disintegrated, and we are procuring much more of everything on markets. And this is particularly bad in the US. From health care to schooling to philanthropy to politicians, we have put up everything for sale. Inequality in this world is potentially much more menacing than inequality in a less commodified world, simply because money buys so much more. This nasty complementarity of market society and income inequality maybe means that the social power of rich people is higher today than in the 1920s, and one response to increasing inequality of market income is to take more things off the market and allocate them by other means.

I think about this sometimes in the field of education in particular, and to that point I’ve got a tiny bit of good news today.

Namely, InBloom is shutting down (hat tip Linda Brown). You might not remember what InBloom is, but I blogged about the company a while back in my post Big Data and Surveillance, and about the ongoing fight against InBloom by parents in New York State here.

The basic idea is that InBloom, which was started in cooperation with the Bill and Melinda Gates Foundation and Rupert Murdoch’s Amplify, would collect huge piles of data on students and their learning and allow third party companies to mine that data to improve learning. From this New York Times article:

InBloom aimed to streamline personalized learning — analyzing information about individual students to customize lessons to them — in public schools. It planned to collect and integrate student attendance, assessment, disciplinary and other records from disparate school-district databases, put the information in cloud storage and release it to authorized web services and apps that could help teachers track each student’s progress.

It’s not unlike the idea that Uber has, of connecting drivers with people needing rides, or that AirBNB has, of connecting people needing a room with people with rooms: they are platforms, not cab companies or hoteliers, and they can use that matchmaking status as a way to duck regulations.

The problem here is that the relevant child data protection regulation, called FERPA, is actually pretty strong, and InBloom and companies like it were largely bypassing that law, as was discovered by a Fordham Law study led by Joel Reidenberg. In particular, the study found that InBloom and other companies were offering what seemed like “free” educational services, but of course the deal really was in exchange for the children’s data, and the school officials who were agreeing to the deals had no clue as to what they were signing. The parents were bypassed completely. Much of the time the contracts were in direct violation of FERPA, but often the school officials didn’t even have copies of the contracts and hadn’t heard of FERPA.

Because of that report and other bad publicity, we saw growing resistance in New York State from parents, school board members, and privacy lawyers. And thanks to that resistance, the New York State Legislature recently passed a budget that prohibits state education officials from releasing student data to amalgamators like InBloom. InBloom has subsequently decided to close down.

I’m not saying that the urge to privatize education – and profit off of it – isn’t going to continue after a short pause. For that matter look at the college system. Even so, let’s take a moment to appreciate the death of one of the more egregious ideas out there.
