Search Results

Keyword: ‘big data’

The Era of Plausible Deniability in Big Data Continues

Today I published a new Bloomberg Opinion piece on how Amazon’s sexist recruiting algorithm is not a surprise to anyone, but is framed as one because the tech bros are trying to maintain plausible deniability:

Amazon’s Gender-Biased Algorithm Is Not Alone


For my other Bloomberg pieces, go here.

Categories: Uncategorized

Big Data Is Coming to Take Your Health Insurance

Hey people! I’m back to business, with a new Bloomberg View column about healthcare:

Big Data Is Coming to Take Your Health Insurance


President Bannon and Big Data Juries #Resist

April 6, 2017

I’m super happy to say that, according to the New York Times, Bannon was demoted yesterday in part because the President Bannon/ #PostcardToBannon campaign – which I wrote about back in February – really got under Trump’s skin. From the article:

[Screenshot: excerpt from the New York Times article]

Obviously Bannon hasn’t been kicked out of the White House, but I think we can all agree this is a step in the right direction.

Also, my newest Bloomberg View column is out, where I describe the idea behind Big Data jury selection and decide against it:

Big Data Won’t Make Juries Better


Insurance and Big Data Are Incompatible

My newest Bloomberg View piece is about how that Fitbit could be bad for your health:

That Free Health Tracker Could Cost You


A good use of big data: to help struggling students

There’s an article by Anya Kamenetz, entitled How One University Used Big Data To Boost Graduation Rates, that a bunch of people have forwarded to me (I think Becky Jaffe was first).

The article centers on an algorithm being used by Georgia State University to identify students in danger of dropping out of school. Once identified, the school pairs those wobbly students with advisers to try to help them succeed. From the article:

A GPS alert doesn’t put a student on academic probation or trigger any automatic consequence. Instead, it’s the catalyst for a conversation.

The system prompted 51,000 in-person meetings between students and advisers in the past 12 months. That’s three or four times more than was happening before, when meetings were largely up to the students.

The real work was in those face-to-face encounters, as students made plans with their advisers to get extra tutoring help, take a summer class or maybe switch majors.

I recently wrote a book about powerful, secret, destructive algorithms that I call WMDs, short for Weapons of Math Destruction. Naturally, a bunch of people have written to ask whether I think the algorithm from this article qualifies as a WMD.

In a word, no.

Here’s the thing. One of the hallmark characteristics of a WMD is that it punishes the poor, the unlucky, the sick, or the marginalized. This algorithm does the opposite – it offers them help.

Now, I’m not saying it’s perfect. There could easily be flaws in this model, so that some people who really need help are not being offered it. That can be seen as a kind of injustice, if others are receiving that help. But that’s the worst-case scenario, it’s not exactly tragic, and it’s a mistake that might well be caught if the algorithm is retrained over time and adjusted to new data.

According to the article, the new algorithmic advising system has resulted in quite a few pieces of really good news:

  • Graduation rates are up 6 percentage points since 2013.
  • Graduates are getting that degree an average half a semester sooner than before, saving an estimated $12 million in tuition.
  • Low-income, first-generation and minority students have closed the graduation rate gap.
  • And those same students are succeeding at higher rates in tough STEM majors.


But to be clear, the real “secret sauce” in this system is the extraordinary amount of advising that’s been given to the students. The algorithm just directed that work.

A final word. This algorithm, which identifies struggling students and helps them, is an example I often use in explaining that an algorithm is not inherently good or evil.

In other words, this same algorithm could be used for evil, to punish the badly off, and a similar one nearly was in the case of Mount St. Mary’s University in Maryland. I wrote about that case as well, in a post entitled The Mount St. Mary’s Story is just so terrible.


3 Terrible Big Data Ideas

Yesterday was, for some reason, a big day for terrible ideas in the big data space.

First, there’s this article (via Matt Stoller) which explains how Cable One is data mining its customers and, in particular, rating potential customers by their FICO scores. If you don’t have a high enough FICO score, they won’t bother selling you pay-TV.

No wait, that’s not completely fair. Here’s how they put it:

“We don’t turn people away,” Might said, but the cable company’s technicians aren’t going to “spend 15 minutes setting up an iPhone app” for a customer who has a low FICO score.

Second, the Chicago Police Department applies data mining techniques to social media to determine who is in gangs. Then they arrest scores of people on their list, and finally they tout the accuracy of the list in part because a high percentage of the people arrested were also on it. I’d like to see a slightly more scientific audit of this system. ProPublica?
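To see why that accuracy claim is circular, here is a minimal simulation sketch. Everything in it is invented for illustration (the population size, the list size, the 90% figure); it is not a model of what the Chicago police actually do. The list is drawn completely at random, so it has no predictive value at all, and yet the overlap with arrests looks impressive as soon as arrests are driven by the list itself.

```python
import random

def overlap_rate(arrested, listed):
    """Share of arrested people who also appear on the list."""
    return sum(p in listed for p in arrested) / len(arrested)

rng = random.Random(0)
population = list(range(10_000))            # hypothetical residents
listed = set(rng.sample(population, 400))   # list picked at random: zero predictive value
listed_pool = sorted(listed)

# Scenario A (assumption): 90% of arrests come from working through the list.
arrests_from_list = (
    [rng.choice(listed_pool) for _ in range(900)]
    + [rng.choice(population) for _ in range(100)]
)

# Scenario B (assumption): arrests have nothing to do with the list.
arrests_independent = [rng.choice(population) for _ in range(1_000)]

overlap_a = overlap_rate(arrests_from_list, listed)
overlap_b = overlap_rate(arrests_independent, listed)
print(f"overlap when arrests follow the list: {overlap_a:.0%}")
print(f"overlap when arrests ignore the list: {overlap_b:.0%}")
```

The high overlap in the first scenario says something about how arrests were chosen, not about whether the list is right. A real audit would need an outcome that is measured independently of the list.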

Finally, and this is absolutely amazing, there’s an extremely terrible new start-up in town called Faception (h/t Ernie Davis). Describing itself as a “Facial Personality Profiling company”, Faception promises to “use science” to figure out who is a terrorist based on photographs. Or, as my friend Eugene Stern snarkily summarized, “personality is influenced by genes, facial features are influenced by genes, therefore facial features can be used to predict personality.”

Here’s a screenshot from their website, I promise I didn’t make this up:

[Screenshot from the Faception website]

Also, here’s a 2-minute advertisement from their founder:


I think my previous claim that Big Data is the New Phrenology was about a year too early.


White House report on big data and civil rights

Last week the White House issued a report entitled Big Risks, Big Opportunities: the Intersection of Big Data and Civil Rights. Specifically, the authors were United States C.T.O. Megan Smith, Chief Data Scientist DJ Patil, and Cecilia Munoz, who is Assistant to the President and Director of the Domestic Policy Council.

It is a remarkable report, and it covers a lot in 24 readable pages. I was especially excited to see the following paragraph in the summary of the report:

Using case studies on credit lending, employment, higher education, and criminal justice, the report we are releasing today illustrates how big data techniques can be used to detect bias and prevent discrimination. It also demonstrates the risks involved, particularly how technologies can deliberately or inadvertently perpetuate, exacerbate, or mask discrimination.

The report itself includes an abstract discussion of algorithms, which for example debunks the widely held assumption that algorithms are objective, and covers problems of biased training data, opacity, unfairness, and disparate impact. It’s similar in its chosen themes, if not in length, to my upcoming book.

The case studies are well-chosen. Each case study is split into three sections: the problem, the opportunity, and the challenge. They do a pretty good job of warning of things that could theoretically go wrong, even if they are sparse on examples of how such things are already happening (for such examples, please read my book).

Probably the most exciting part of the report, from my perspective, is in the conclusion, which discusses Things That Should Happen:

Promote academic research and industry development of algorithmic auditing and external testing of big data systems to ensure that people are being treated fairly. One way these issues can be tackled is through the emerging field of algorithmic systems accountability, where stakeholders and designers of technology “investigate normatively significant instances of discrimination involving computer algorithms” and use nascent tools and approaches to proactively avoid discrimination through the use of new technologies employing research-based behavior science. These efforts should also include an analysis identifying the constituent elements of transparency and accountability to better inform the ethical and policy considerations of big data technologies. There are other promising avenues for research and development that could address fairness and discrimination in algorithmic systems, such as those that would enable the design of machine learning systems that constrain disparate impact or construction of algorithms that incorporate fairness properties into their design and execution.

Readers, this is what I want to do with my life, after my book tour. So it’s nice to see people calling for it like this.


Big data, technology, and the elderly

The last time I visited my in-laws in Holland, I noticed my father-in-law, who is hard of hearing, was having serious trouble communicating with his wife. My husband was able to communicate with his father by patiently writing things on a piece of paper and waiting for him to write back, but his mother doesn’t have that patience or the motor control to do that.

But here’s the thing. We have technology that could help them communicate with each other. Why not have the mother-in-law speak to an iPad, or Siri, or some other voice-recognition software, and then that transcription could be sent to her husband? And he could, in turn, use a touch screen or his voice to communicate back to her.

This is just one simple example, but it made me wonder what the world of technology and big data is doing for the elderly, and more generally for people with specific limited faculties.

There was a recent New York Times article that investigated so-called “Silver Tech,” and it painted a pretty dire picture: most of the tools being developed are essentially surveillance devices, monitors to allow caregivers more freedom. They had ways of monitoring urine in diapers, open refrigerators, blood sugar, or falls. They often failed or had too much set-up time. And more generally, the wearables industry is ignoring people who might actually benefit from their use.

I’m more interested in tools for older people to use that would make their lives more interactive, not merely so that they can be safely left alone for longer periods of time. And there have been tools made specifically for older people to use, but they are often too difficult to use or to charge or even to turn off and on. They don’t seem to be designed with the end-user in mind very often.

Of course, I should be the first person to point out that there’s a corner of the big data industry that’s already hard at work thinking about the elderly, but it’s in the realm of predatory consumer offers: tailoring ads and services that prey on confused older people. Data warehousing companies like Acxiom help by selling lists of names, addresses, and email addresses, with list titles like “Suffering Seniors” and “Aching and Ailing,” for 15 cents per person.

I know we talk about the Boomers too much, but let me just say: the Boomers are retiring, and they won’t want their quality of life to be diminished to the daytime soap opera watching that my grandmother put up with. They’re going to want iPads that help them stay in touch with their kids and each other. We should make that work.

And as the world lives longer, we’ll have more and more people who are perfectly healthy in all sorts of ways except one or two, and who could really benefit from some thoughtful and non-predatory technology solution. I’d love to hear your thoughts.


Big Data community, please don’t leave underrepresented students behind

This is a guest post by Nii Attoh-Okine, Professor of Civil and Environmental Engineering and Director of the Big Data Center at the University of Delaware. Nii, originally from Ghana, does research in Resilience Engineering and Data Science. His new book, Resilience Engineering: Models and Analysis, will be out in December 2016 with Cambridge Press. Nii is also working on a second book, Big Data and Differential Privacy: Analysis Strategies for Railway Track Engineering, which will be out Fall 2016 with John Wiley & Sons.

Big data has been a major revolutionary area of research in the last few years—although one may argue that the name change has created at least part of the hype. Only time will tell how much. In any case, with all the opportunities, hype, and advancement, it is very clear that underrepresented minority students are virtually missing in the big data revolution.

What do I mean? The big data revolution is addressing and tackling issues within the banking, engineering and technology, health and medical sciences, social sciences, humanities, music, and fashion industries, among others. But visit conferences, seminars, and other activities related to big data: underrepresented minority students are missing.

At a recent Strata and Hadoop conference in New York, one of the premier big data events, it was very disappointing and even alarming that underrepresented minority students (participants and presenters) were virtually nonexistent. The logical question that comes to mind is whether the big data community is not reaching out to underrepresented minority students or if underrepresented minority students are not reaching out to the big data community.

To see why tackling these issues matters, there are two critical facts to know, the first on the supply side, the other on the demand side:

  1. The demographics of the US population are undergoing a dramatic shift. Minority groups underrepresented in STEM fields will soon make up the majority of school-age children in the states (Frey, 2012). This means that currently underrepresented minorities are a rich pool of STEM talent, if we figure out how to tap into it.
  2. “‘Human resource inputs are a critical component to our scientific enterprise. We look to scientists for creative sparks to expand our knowledge base and deepen our understanding of natural and social phenomena. Their contributions provide the basis for technological advances that improve our productivity and the quality of lives. It is not surprising, therefore, that concern about the adequacy of the talent pool, both in number and quality, is a hardy perennial that appears regularly as an important policy issue.’ This statement, borrowed from Pearson and Fechter’s book, Who will Do Science?: Educating the Next Generation, remains a topic of serious debate” (A. James Hicks, Ph.D., NSF/LSAMP Program Director).

The issue at large is how the big data community can involve the underrepresented minority students. On that front I have some suggestions. The big data community can:

  • Develop ‘invested mentors’ from the big data community who show a genuine interest in advising underrepresented minority students about big data.
  • Forge partnerships with colleges and universities, especially minority-serving institutions.
  • Identify professors who have genuine interest in working with underrepresented students in big data related research.
  • Invite some students and researchers from underrepresented minorities to big data conferences and workshops.
  • Attend and organize information sessions during conferences oriented toward underrepresented minority students.

The major advice to the big data community is this: please do make the effort to engage and include underrepresented minority students because there is so much talent within this group.


Big data, disparate impact, and the neoliberal mindset

When you’re writing a book for the general public’s consumption, you have to keep things pretty simple. You can’t spend a lot of time theorizing about why some stuff is going on; you have to focus on what’s happening, and how bad it is, and who’s getting screwed. Anything beyond that and you’ll be called a conspiracy theorist by some level of your editing team.

But the good thing about writing a blog is that you can actually say anything you like. That’s one reason I cling so strongly to mathbabe; I need to be able to write stuff that’s mildly conspiracy-theoretical. After all, just because you’re paranoid doesn’t mean nobody’s out to get you, right?

Anyhoo, I’m going to throw out a theory about big data, disparate impact, and the neoliberal mindset. First I need to set it up a bit.

Did you hear about this recent story whereby Facebook just got a patent to measure someone’s creditworthiness by looking at who their friends are and what their credit scores are? The idea is, you are more likely to be able to pay back your loans if the people you’re friends with pay back their loans.

On the one hand, it sounds possibly true: richer people tend to have richer friends, and so if there’s not very much information about someone, but that person is nevertheless inferred to be “friends with rich people,” then they might be a better bet for paying back loans.

On the other hand, it also sounds like an unfair way to distribute loans: most of us are friends with a bunch of people from high school, and if I happened to go to a high school filled with poor kids, then loans for me would be ruled out by this method.

This leads to the concept of disparate impact, which was beautifully explained in this recent article called When Big Data Becomes Bad Data (hat tip Marc Sobel). The idea is, when your process (or algorithm) favors one group of people over another, intentionally or not, it might be considered unfair and thus illegal. There’s lots of precedent for this in the courts, and recently the Supreme Court upheld it as a legitimate argument in Fair Housing Act cases.
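One common way to put a number on disparate impact is to compare approval rates between groups. The sketch below uses the “four-fifths” rule of thumb from US employment guidelines, which is my choice of illustration here and is not taken from the article above, and it is not a legal test for lending. The loan decisions are made up, and the function names are hypothetical.

```python
def selection_rate(decisions):
    """Fraction of applicants approved; decisions is a list of booleans."""
    return sum(decisions) / len(decisions)

def impact_ratio(group_a, group_b):
    """Lower selection rate divided by the higher one (1.0 means parity)."""
    low, high = sorted((selection_rate(group_a), selection_rate(group_b)))
    return low / high if high > 0 else 1.0

# Hypothetical loan decisions, invented for illustration only.
group_a = [True] * 60 + [False] * 40   # 60 of 100 approved
group_b = [True] * 30 + [False] * 70   # 30 of 100 approved

ratio = impact_ratio(group_a, group_b)
flagged = ratio < 0.8   # four-fifths rule of thumb (an assumption, see above)
print(f"impact ratio = {ratio:.2f}, flagged = {flagged}")
```

Note that the calculation never asks why the rates differ. That is the point of the “intentionally or not” part: the measure looks at outcomes for groups, not at anyone’s motives.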

It’s still not clear whether a “disparate impact” argument can be used in the case of algorithms, though. And there are plenty of people who work in the field of big data who dismiss this possibility altogether, and who even claim that things like the Facebook idea above are entirely legitimate. I had an argument on my Slate Money podcast last Friday about this very question.

Here’s my theory as to why it’s so hard for people to understand. They have been taken over in these matters by a neoliberal thought process, whereby every person is told to behave rationally, as an individual, and to seek maximum profit. It’s like an invisible hand on a miniature scale, acting everywhere and at all times.

Since this ideology has us acting as individuals, and ignoring group dynamics, the disparate impact argument is difficult if not impossible to understand. Why would anyone want to loan money to a poor person? That wouldn’t make economic sense. Or, more relevantly, why would anyone not distinguish between a poor person and a rich person before making a loan? That’s the absolute heart of how the big data movement operates. Changing that would be like throwing away money.

Since every interaction boils down to game theory and strategies for winning, “fairness” doesn’t come into the equation (note, the more equations the better!) of an individual’s striving for more opportunity and more money. Fairness isn’t even definable unless you give context, and context is exactly what this mindset ignores.

Here’s how I talk to someone when this subject comes up. I right away distinguish between the goal of the loaner – namely, accuracy and profit – and the goal of the public at large, namely that we have a reasonable financial system that doesn’t exacerbate the current inequalities or send people into debt spirals. This second goal has a lot to do with fairness and definitely pertains broadly to groups of people. Then, after setting that up, we can go ahead and discuss the newest big data idea, as long as we remember to look at it through both lenses.


China announces it is scoring its citizens using big data

Please go read the article in the Dutch newspaper de Volkskrant entitled China rates its own citizens – including online behavior (hat tip Ernie Davis).

The article describes China’s plan to use big data techniques to score all of its citizens, with the help of Chinese internet giants Alibaba, Baidu, and Tencent, in a kind of expanded credit score that includes behavior and reputation. So what you buy, who you’re friends with, and whether you seem sufficiently “socialist” are factors that affect your overall score.

Here’s an incredibly creepy quote from a person working on the model, from the Chinese Academy of Social Sciences:

When people’s behavior isn’t bound by their morality, a system must be used to restrict their actions

And here’s another quote from Rogier Creemers, an academic at Oxford who specializes in China:

Government and big internet companies in China can exploit ‘Big Data’ together in a way that is unimaginable in the West

I guess I’m wondering whether that’s really true. Given my research over the past couple of years, I see this kind of “social credit scoring” being widely implemented here in the United States.

Looking for big data reading suggestions

I have been told by my editor to take a look at the books already out there on big data to make sure my book hasn’t already been written. For example, today I’m set to read Robert Scheer’s They Know Everything About You: how data-collecting corporations and snooping government agencies are destroying democracy.

This book, like others I’ve already read and written about (Bruce Schneier’s Data and Goliath, Frank Pasquale’s Black Box Society, and Julia Angwin’s Dragnet Nation), is primarily concerned with individual freedom and privacy, whereas my book is primarily concerned with social justice issues; each chapter gives an example of how big data is being used as a tool against the poor, against minorities, against the mentally ill, or against public school teachers.

Not that my book is entirely different from the above books, but the relationship is something like what I spelled out last week when I discussed the four political camps in the big data world. So far the books I’ve found are focused on the corporate angle or the privacy angle. There may also be books focused on the open data angle, but I’m guessing they have even less in common with my book, which focuses on the ways big data increase inequality and further alienate already alienated populations.

If any of you know of a book I should be looking at, please tell me!

Four political camps in the big data world

Last Friday I was honored to be part of a super interesting and provocative conference at UC Berkeley’s Law School called Open Data: Addressing Privacy, Security, and Civil Rights Challenges.

What I loved about this conference is that it explicitly set out to talk across boundaries of the data world. That’s unusual.

Broadly speaking, there are four camps in the “big data” world:

  1. The corporate big data camp. This involves the perspective that we use data to know our customers, tailor our products to their wants and needs, and, generally speaking, keep our data secret so as to maximize profits. The other side of this camp is the public, seen as consumers.
  2. The security crowd. These are people like Bruce Schneier, whose book I recently read. They worry about individual freedom and liberty, and how mass surveillance and dragnets are degrading our existence. I have a lot of sympathy for their view, although their focus is not mine. The other side of this camp is the NSA, on the one hand, and hackers, on the other, who exploit weak data and privacy protections.
  3. The open data crowd. The people involved with this movement are split into two groups. The first consists of activists like Aaron Swartz and Carl Malamud, whose basic goal is to make publicly available things that theoretically, and often by law, should be publicly available, like court proceedings and scientific research, and the Sunlight Foundation, which focuses on data about politics. The second group of “open data” folks come from government itself, and are constantly espousing the win-win-win aspects of opening up data: win for companies, who make more profit, win for citizens, who have access to more and better information, and win for government, which benefits from more informed citizenry and civic apps. The other side of this camp is often security folks, who point out how much personal information often leaks through the cracks of open data.
  4. Finally, the camp I’m in, which is either the “big data and civil rights” crowd, or more broadly the people who worry about how this avalanche of big data is affecting the daily lives of citizens, not only when we are targeted by the NSA or by someone stealing our credit cards, but when we are born poor versus rich, and so on. The other side of this camp is represented by the big data brokers who sell information and profiles about everyone in the country, and sometimes the open data folks who give out data about citizens that can be used against them.

The thing is, all of these camps have their various interests, and can make good arguments for them. Even more importantly, they each have their own definition of the risks, as well as the probability of those risks.

For example, I care about hackers and people unreasonably tracked and targeted by the NSA, but I don’t think about that nearly as much as I think about how easy it is for poor people to be targeted by scam operations when they google for “how do I get food stamps”. As another example, when I saw Carl Malamud talk the other day, he obviously puts some attention into having social security numbers of individuals protected when he opens up court records, but it’s not obvious that he cares as much about that issue as someone who is a real privacy advocate would.

Anyway, we didn’t come to many conclusions in one day, but it was great for us all to be in one room and start the difficult conversation. To be fair, the “corporate big data camp” was not represented in that room as far as I know, but that’s because they’re too busy lobbying for a continuation of little to no regulation in Washington.

And given that we all have different worries, we also have different suggestions for how to address those worries; there is no one ideal regulation that will fix everything, and for that matter some people involved don’t believe that government regulations can ever work, and that we need citizen involvement above all, especially when it comes to big data in politics. A mishmash, in other words, but still an important conversation to begin.

I’d like it to continue! I’d like to see some public debates between different representatives of these groups.


Big Data Is The New Phrenology

Have you ever heard of phrenology? It was, once upon a time, the “science” of measuring someone’s skull to understand their intellectual capabilities.

This sounds totally idiotic but was a huge fucking deal in the mid-1800s, and really didn’t stop getting some credit until much later. I know that because I happen to own the 1911 edition of the Encyclopedia Britannica, which was written by the top scholars of the time but is now horribly and fascinatingly outdated.

For example, the entry for “Negro” is famously racist. Wikipedia has an excerpt: “Mentally the negro is inferior to the white… the arrest or even deterioration of mental development [after adolescence] is no doubt very largely due to the fact that after puberty sexual matters take the first place in the negro’s life and thoughts.”

But really that one line doesn’t tell the whole story. Here’s the whole thing, it’s long:

Pages 1 and 2

Pages 3 and 4

Pages 5 and 6

As you can see, they really go into it, with all sorts of data and speculative theories. But near the beginning there’s straight up racist phrenology:

From page 1

To be clear: this was produced by a culture that was using pseudo-scientific nonsense to validate an underlying toxic and racist mindset. There was nothing more to it, but because people become awed and confused around scientific facts and figures, it seemed to work as a validating argument in 1911.

Anyhoo, I thought this was an interesting backdrop to the NPR story I wanted to share with you (hat tip Yves Smith) entitled Recruiting Better Talent With Brain Games And Big Data. You can read the transcript as well, you don’t have to listen. Basically the idea is you play video games and the machine takes note of how you play and the choices you make and comes back to you with a personality profile. That profile will help you get a job or will exclude you from a job if the company believes in the results. There have been no scientific tests to see if or how this stuff works; we’re supposed to just believe in it because, you know, data is objective and everything.

Here’s the thing. What we’ve got is a new kind of awful pseudo-science, which replaces measurements of skulls with big data. There’s no reason to think this stuff is any less biased or discriminatory either: given that there’s no actual science behind it, we might simply be replicating a selection method to get people who we like and who remind us of ourselves. To be sure, it might not be as deliberate as what we saw above, but that doesn’t mean it’s not happening.

The NPR reporter who introduced this story did so by saying, “let’s start this hour with a look at an innovation in something that’s gone unchanged, it seems, forever.” That one sentence already gets it wrong, though. This is, unfortunately, not innovative. This is just the big data version of phrenology.


AAPOR Big Data Report

February 18, 2015

I was recently part of a task force for understanding the practices of “big data” from the perspective of the American Association for Public Opinion Research (AAPOR), which is an organization that promotes good standards for studying public opinion.

So for example, AAPOR has a code of ethics for how to track public opinion, and a set of understood methodologies for correctly using surveys. They involved themselves last year when they criticized the New York Times and CBS for releasing the results of a nationwide poll on Senate races where the opt-in survey method had “little grounding in theory” and for a lack of transparency.

But here’s the thing: the biggest problem facing the world of public opinion research isn’t online opt-in polls, but rather the temptation to troll Twitter to “see what people are thinking.” And that’s exactly what’s happening, in large part because it’s cheaper. Thus the AAPOR Big Data Report that I helped with.

I think we did a decent job of describing some of the intrinsic difficulties with using big data, specifically around quality control issues, and for that reason I recommend this report to anyone entering the field, or even people already in the field who haven’t thought through this stuff. If you don’t have time to read the full report, here are our recommendations:

1. Surveys and Big Data are complementary data sources not competing data sources.

There are differences between the approaches, but this should be seen as an advantage rather than a disadvantage. Research is about answering questions, and one way to answer questions is to start utilizing all information available. The availability of Big Data to support research provides a new way to approach old questions as well as an ability to address some new questions that in the past were out of reach. However, the findings that are generated based on Big Data inevitably generate more questions, and some of those questions tend to be best addressed by traditional survey research methods.

2. AAPOR should develop standards for the use of Big Data in survey research when more knowledge has been accumulated.

Using Big Data in statistically valid ways is a challenge. One common misconception is the belief that volume of data can compensate for any other deficiency in the data. AAPOR should develop standards of disclosure and transparency when using Big Data in survey research. AAPOR’s transparency initiative is a good role model that should be extended to other data sources besides surveys.

3. AAPOR should start working with the private sector and other professional organizations to educate its members on Big Data.

The current pace of the Big Data development in itself is a challenge. It is very difficult to keep up with the research and development in the Big Data area. Research on new technology tends to become outdated very fast. There is currently insufficient capacity in the AAPOR community. AAPOR should tap other professional associations, such as the American Statistical Association and the Association for Computing Machinery, to help understand these issues and provide training for other AAPOR members and non-members.

4. AAPOR should inform the public of the risks and benefits of Big Data.

Most users of digital services are unaware of the fact that data formed out of their digital behavior may be reused for other purposes, for both public and private good. AAPOR should be active in public debates and provide training for journalists to improve data-driven journalism. AAPOR should also update its Code of Professional Ethics and Practice to include the collection of digital data outside of surveys. It should work with Institutional Review Boards to facilitate the research use of such data in an ethical fashion.

5. AAPOR should help remove the barrier associated with different uses of terminology.

Effective use of Big Data usually requires a multidisciplinary team consisting of e.g., a domain expert, a researcher, a computer scientist, and a system administrator. Because of the interdisciplinary nature of Big Data, there are many concepts and terms that are defined differently by people with different backgrounds. AAPOR should help remove this barrier by informing its community about the different uses of terminology. Short courses and webinars are successful instruments that AAPOR can use to accomplish this task.

6. AAPOR should take a leading role in working with federal agencies in developing a necessary infrastructure for the use of Big Data in survey research.

Data ownership is not well defined and there is no clear legal framework for the collection and subsequent use of Big Data. There is a need for public-private partnerships to ensure data access and reproducibility. The Office of Management and Budget (OMB) is very much involved in federal surveys since they develop guidelines for those and research funded by government should follow these guidelines. It is important that AAPOR work together with federal statistical agencies on Big Data issues and build capacity in this field. AAPOR’s involvement could include the creation or propagation of shared cloud computing resources.
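The misconception named in recommendation 2, that volume can make up for other deficiencies, is easy to see in a toy simulation. What follows is a minimal sketch and not anything from the report: the population, the 30% and 15% posting rates, and the function name compare_samples are all made-up assumptions. It compares a huge self-selected sample with a small random one.

```python
import random

def compare_samples(seed=0, population_size=200_000):
    """Compare a big self-selected sample with a small random survey."""
    rng = random.Random(seed)

    # Hypothetical population: about half hold opinion A.
    population = [rng.random() < 0.5 for _ in range(population_size)]
    true_share = sum(population) / population_size

    # "Big data" sample: people opt in. Assumption for illustration only:
    # holders of opinion A are twice as likely to post (30% vs 15%).
    big = [holds_a for holds_a in population
           if rng.random() < (0.30 if holds_a else 0.15)]

    # Small survey: 1,000 people drawn uniformly at random.
    small = rng.sample(population, 1000)

    return {
        "true_share": true_share,
        "big_n": len(big),
        "big_estimate": sum(big) / len(big),
        "small_n": len(small),
        "small_estimate": sum(small) / len(small),
    }

result = compare_samples()
for key, value in result.items():
    print(key, round(value, 3))
```

Under these assumptions the big sample settles near two thirds no matter how large it gets, while the small random sample lands near the true one half, give or take sampling noise. More rows shrink the noise but do nothing about the selection bias.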

Categories: Uncategorized

Creepy big data health models

There’s an excellent Wall Street Journal article by Joseph Walker, entitled Can a Smartphone Tell if You’re Depressed? that describes a lot of creepy new big data projects going on now in healthcare, in partnership with hospitals and insurance companies.

Some of the models come in the form of apps, created and managed by private, third-party companies that try to predict depression in, for example, postpartum women. They don’t disclose what they are doing to many of the women, or the extent of what they’re doing, according to the article. They own the data they’ve collected at the end of the day and, presumably, can sell it to anyone interested in whether a woman is depressed. For example, future employers. To be clear, this data is generally not covered by HIPAA.

Perhaps the creepiest example is a voice analysis model:

Nurses employed by Aetna have used voice-analysis software since 2012 to detect signs of depression during calls with customers who receive short-term disability benefits because of injury or illness. The software looks for patterns in the pace and tone of voices that can predict “whether the person is engaged with activities like physical therapy or taking the right kinds of medications,” Michael Palmer, Aetna’s chief innovation and digital officer, says.

Patients aren’t informed that their voices are being analyzed, Tammy Arnold, an Aetna spokeswoman, says. The company tells patients the calls are being “recorded for quality,” she says.

“There is concern that with more detailed notification, a member may alter his or her responses or tone (intentionally or unintentionally) in an effort to influence the tool or just in anticipation of the tool,” Ms. Arnold said in an email.

In other words, in the name of “fear of gaming the model,” we are not disclosing the creepy methods we are using. Also, considering that the targets of this model are receiving disability benefits, I’m wondering if the real goal is to catch someone off their meds and disqualify them for further benefits or something along those lines. Since they don’t know they are being modeled, they will never know.

Conclusion: we need more regulation around big data in healthcare.

Categories: data journalism, modeling, rant

Big data and class

About a month ago there was an interesting article in the New York Times entitled Blowing Off Class? We Know. It discusses the “big data” movement in colleges around the country. For example, at Ball State, they track which students go to parties at the student center. Presumably to help them study for tests, or maybe to figure out which ones to hit up for alumni gifts later on.

There’s a lot to discuss in this article, but I want to focus today on one piece:

Big data has a lot of influential and moneyed advocates behind it, and I’ve asked some of them whether their enthusiasm might also be tinged with a little paternalism. After all, you don’t see elite institutions regularly tracking their students’ comings and goings this way. Big data advocates don’t dispute that, but they also note that elite institutions can ensure that their students succeed simply by being very selective in the first place.

The rest “get the students they get,” said William F. L. Moses, the managing director of education programs at the Kresge Foundation, which has given grants to the innovation alliance and to bolster data-analytics efforts at other colleges. “They have a moral obligation to help them succeed.”

This is a sentiment I’ve noticed a lot, although it’s not usually this obvious. Namely, the elite don’t need to be monitored, but the rabble does. The rich and powerful get to be quirky philosophers but the rest of the population need to be ranked and filed. And, by the way, we are spying on them for their own good.

In other words, never mind how big data creates and expands classism; classism already helps decide who is put into the realm of big data in the first place.

It feeds into the larger question of who is entitled to privacy. If you want to be strict about your definition of privacy, you might say “nobody.” But if you recognize that privacy is a spectrum, where we have a variable amount of information being collected on people, and also a variable amount of control over people whose information we have collected, then upon study, you will conclude that privacy, or at least relative privacy, is for the rich and powerful. And it starts early.

Wage Gaps Don’t Magically Get Smaller Because Big Data

Today, just a rant. Sorry. I mean, I’m not a perfect person either, and of course that’s glaringly obvious, but this fluff piece from Wired, written by Pam Wickham of Raytheon, is just aggravating.

The title is Big Data, Smaller Wage Gap? and, you know, it almost gives us the impression that she has a plan to close the wage gap using big data, or alternatively an argument that the wage gap will automatically close with the advent of big data techniques. It turns out to be the former, but not really.

After complaining about the wage gap for women in general, and after we get to know how much she loves her young niece, here’s the heart of the plan (emphasis mine, on the actual plan parts of the plan):

Analytics and microtargeting aren’t just for retailers and politicians — they can help us grow the ranks of executive women and close the gender wage gap. Employers analyze who clicked on internal job postings, and we can pursue qualified women who looked but never applied. We can go beyond analyzing the salary and rank histories of women who have left our companies. We can use big data analytics to tell us what exit interviews don’t.

Facebook posts, Twitter feeds and LinkedIn groups provide a trove of valuable intel from ex-employees. What they write is blunt, candid and useful. All the data is there for the taking — we just have to collect it and figure out what it means. We can delve deep into whether we’re promoting the best people, whether we’re doing enough to keep our ranks diverse, whether potential female leaders are being left behind and, importantly, why.

That’s about it; after that she goes back to her niece.

Here’s the thing, I’m not saying it’s not an important topic, but that plan doesn’t seem worthy of the title of the piece. It’s super vague and fluffy and meaningless. I guess, if I had to give it meaning, it would be that she’s proposing to understand internal corporate sexism using data, rather than assuming “data is objective” and that all models will make things better. And that’s one tiny step, but it’s not much. It’s really not enough.

Here’s an idea, and it kind of uses big data, or at least small data, so we might be able to sell it. Ask people in your corporate structure what the actual characteristics are of people they promote, and how they are measured, or if they are measured, and look at the data to see if what they say is consistent with what they do, and whether those characteristics are inherently sexist. It’s a very specific plan and no fancy mathematical techniques are necessary, but we don’t have to tell anyone that.

What combats sexism is a clarification and transparent description of job requirements and a willingness to follow through. Look at blind orchestra auditions for a success story there. By contrast, my experience with the corporate world is that, when hiring or promoting, they often list a long series of unmeasurable but critical properties like “good cultural fit” and “leadership qualities” that, for whatever reason, more men are rated high on than women.

Categories: data science, rant

Fairness, accountability, and transparency in big data models

As I wrote about already, last Friday I attended a one day workshop in Montreal called FATML: Fairness, Accountability, and Transparency in Machine Learning. It was part of the NIPS conference for computer science, and there were tons of nerds there, and I mean tons. I wanted to give a report on the day, as well as some observations.

First of all, I am super excited that this workshop happened at all. When I left my job at Intent Media in 2011 with the intention of studying these questions and eventually writing a book about them, they were, as far as I know, on nobody else’s radar. Now, thanks to the organizers Solon and Moritz, there are communities of people, coming from law, computer science, and policy circles, coming together to exchange ideas and strategies to tackle the problems. This is what progress feels like!

OK, so on to what the day contained and my copious comments.

Hannah Wallach

Sadly, I missed the first two talks, and an introduction to the day, because of two airplane cancellations (boo American Airlines!). I arrived in the middle of Hannah Wallach’s talk, the abstract of which is located here. Her talk was interesting, and I liked her idea of having social scientists partnered with data scientists and machine learning specialists, but I do want to mention that, although there’s a remarkable history of social scientists working within tech companies – say at Bell Labs and Microsoft and such – we don’t see that in finance at all, nor does it seem poised to happen. So in other words, we certainly can’t count on social scientists to be on hand when important mathematical models are getting ready for production.

Also, I liked Hannah’s three categories of models: predictive, explanatory, and exploratory. Even though I don’t necessarily think that a given model will fall neatly into one category or the other, they still give you a way to think about what we do when we make models. As an example, we think of recommendation models as ultimately predictive, but they are (often) predicated on the ability to understand people’s desires as made up of distinct and consistent dimensions of personality (like when we use PCA or something equivalent). In this sense we are also exploring how to model human desire and consistency. For that matter I guess you could say any model is at its heart an exploration into whether the underlying toy model makes any sense, but that question is dramatically less interesting when you’re using linear regression.

Anupam Datta and Michael Tschantz

Next up Michael Tschantz reported on work with Anupam Datta that they’ve done on Google profiles and Google ads. They started with Google’s privacy policy, which I can’t find but which claims you won’t receive ads based on things like your health problems. Starting with a bunch of browsers with no cookies, and thinking of each of them as fake users, they did experiments to see what actually happened both to the ads for those fake users and to the Google ad profiles for each of those fake users. They found that, at least sometimes, they did get the “wrong” kind of ad, although whether Google can be blamed or whether the advertiser had broken Google’s rules isn’t clear. Also, they found that fake “women” and “men” (who did not differ by any other variable, including their searches) were offered drastically different ads related to job searches, with men being offered way more ads to get $200K+ jobs, although these were basically coaching sessions for getting good jobs, so again the advertisers could have decided that men are more willing to pay for such coaching.

An issue I enjoyed talking about was brought up in this talk, namely the question of whether such a finding is entirely evanescent or whether we can call it “real.” Since Google constantly updates its algorithm, and since ad budgets are coming and going, even the same experiment performed an hour later might have different results. In what sense can we then call any such experiment statistically significant or even persuasive? Also, IRL we don’t have clean browsers, so what happens when we have dirty browsers and we’re logged into Gmail and Facebook? By then there are so many variables it’s hard to say what leads to what, but should that make us stop trying?
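One standard way to ask whether a gap between two groups of fake users is bigger than chance, within a single run of the experiment, is a permutation test. Here is a minimal sketch. It is not presented as the speakers’ actual method, the ad counts are invented for illustration, and exact_permutation_p_value is a hypothetical helper name.

```python
from itertools import combinations

def exact_permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)

    def stat(sum_a):
        # Proportional to |mean_a - mean_b|, kept in integers to avoid
        # floating point ties.
        return abs(sum_a * n_b - (total - sum_a) * n_a)

    observed = stat(sum(group_a))
    hits = 0
    count = 0
    # Try every way of relabeling which browsers were the "men".
    for idx in combinations(range(len(pooled)), n_a):
        count += 1
        if stat(sum(pooled[i] for i in idx)) >= observed:
            hits += 1
    return hits / count

# Invented data: number of high-paying-job ads seen by each fake browser.
ads_seen_men = [12, 15, 11, 14, 13]
ads_seen_women = [2, 4, 3, 1, 5]
p_value = exact_permutation_p_value(ads_seen_men, ads_seen_women)
print(p_value)
```

With these made-up counts only 2 of the 252 possible relabelings produce a gap as large as the observed one. Note what this does and does not address: it speaks to chance within one snapshot, and says nothing about whether the same gap would show up an hour later.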

From my perspective, I’d like to see more research into questions like, of the top 100 advertisers on Google, who saw the majority of the ads? What was the economic, racial, and educational makeup of those users? A similar but different (because of the auction) question would be to reverse-engineer the advertisers’ Google ad targeting methodologies.

Finally, the speakers mentioned a failure of transparency on Google’s part. In your advertising profile, for example, you cannot see (and therefore cannot change) your marriage status, but advertisers can target you based on that variable.

Sorelle Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian

Next up we had Sorelle talk to us about her work with two guys with enormous names. They think about how to make stuff fair, the heart of the question of this workshop.

First, if we included race in a resume sorting model, we’d probably see negative impact because of historical racism. Even if we removed race but included other attributes correlated with race (say zip code) this effect would remain. And it’s hard to know exactly when we’ve removed the relevant attributes, but one thing these guys did was define that precisely.

Second, say you now have some idea of the categories that are given unfair treatment. What can you do? One thing suggested by Sorelle et al. is to first rank people in each category – to assign each person a percentile in their given category – and then to use the “forgetful function” and only consider that percentile. So, if we decided at a math department that we want 40% women graduate students, to achieve this goal with this method we’d independently rank the men and women, and we’d offer enough spots to top women to get our quota and separately we’d offer enough spots to top men to get our quota. Note that, although it comes from a pretty fancy setting, this is essentially affirmative action. That’s not, in my opinion, an argument against it. It’s in fact yet another argument for it: if we know women are systemically undervalued, we have to fight against it somehow, and this seems like the best and simplest approach.
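The rank-within-category idea can be sketched in a few lines. This is a toy version under stated assumptions, not the authors’ implementation: the applicants, scores, and the function names within_group_percentiles and admit are all hypothetical, and ties are ignored.

```python
def within_group_percentiles(applicants):
    """Map each name to (group, percentile within that group)."""
    by_group = {}
    for name, group, score in applicants:
        by_group.setdefault(group, []).append((score, name))
    percentiles = {}
    for group, members in by_group.items():
        members.sort()  # lowest score first
        for rank, (_, name) in enumerate(members, start=1):
            # After this step the raw score is "forgotten".
            percentiles[name] = (group, rank / len(members))
    return percentiles

def admit(applicants, spots, share_by_group):
    """Fill each group's quota from the top of its own ranking."""
    percentiles = within_group_percentiles(applicants)
    admitted = []
    for group, share in share_by_group.items():
        quota = round(spots * share)
        ranked = sorted(
            (name for name, (g, _) in percentiles.items() if g == group),
            key=lambda name: percentiles[name][1],
            reverse=True,
        )
        admitted.extend(ranked[:quota])
    return admitted

# Made-up applicants whose raw scores systematically undervalue women.
applicants = [
    ("W1", "women", 70), ("W2", "women", 65), ("W3", "women", 60),
    ("M1", "men", 90), ("M2", "men", 85), ("M3", "men", 80), ("M4", "men", 75),
]
admitted = admit(applicants, spots=5, share_by_group={"women": 0.4, "men": 0.6})
by_raw_score = [name for name, _, _ in
                sorted(applicants, key=lambda a: a[2], reverse=True)[:5]]
print(admitted)
print(by_raw_score)
```

In this toy data, ranking on raw scores admits four men and one woman, while the within-group version admits the top two women and the top three men.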

Ed Felten and Josh Kroll

After lunch Ed Felten and Josh Kroll jointly described their work on making algorithms accountable. Basically they suggested a trustworthy and encrypted system of paper trails that would support a given algorithm (doesn’t really matter which) and create verifiable proofs that the algorithm was used faithfully and fairly in a given situation. Of course, we’d really only consider an algorithm to be used “fairly” if the algorithm itself is fair, but putting that aside, this addressed the question of whether the same algorithm was used for everyone, and things like that. In lawyer speak, this is called “procedural fairness.”

So for example, if we thought we could, we might want to run the algorithm for punishment for drug use through this system, and we might find that the rules are applied differently to different people. This system would catch that kind of problem, at least ideally.

David Robinson and Harlan Yu

Next up we talked to David Robinson and Harlan Yu about their work in Washington D.C. with policy makers and civil rights groups around machine learning and fairness. These two have been active with civil rights groups and were an important part of both the Podesta Report, which I blogged about here, and the drafting of the Civil Rights Principles of Big Data.

The question of what policy makers understand and how to communicate with them came up several times in this discussion. We decided that, to combat cherry-picked examples we see in Congressional Subcommittee meetings, we need to have cherry-picked examples of our own to illustrate what can go wrong. That sounds bad, but put it another way: people respond to stories, especially to stories with innocent victims that have been wronged. So we are on the look-out.

Closing panel with Rayid Ghani and Foster Provost

I was on the closing panel with Rayid Ghani and Foster Provost, and we each had a few minutes to speak and then there were lots of questions and fun arguments. To be honest, since I was so in the moment during this panel, and also because I was jonesing for a beer, I can’t remember everything that happened.

As I remember, Foster talked about an algorithm he had created that does its best to “explain” the decisions of a complicated black box algorithm. So in real life our algorithms are really huge and messy and uninterpretable, but this algorithm does its part to add interpretability to the outcomes of that huge black box. The example he gave was to understand why a given person’s Facebook “likes” made a black box algorithm predict they were gay: by displaying, in order of importance, which likes added the most predictive power to the algorithm.
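To give a feel for what “in order of importance” could mean, here is a minimal leave-one-out sketch: drop each like in turn and see how much the black box’s score falls. This is a swapped-in stand-in and not Foster’s actual algorithm. The scoring function, its weights, and the page names are all invented.

```python
def black_box_score(likes):
    """Hypothetical stand-in for an opaque model: returns a score for a set of likes."""
    weights = {"page_a": 0.5, "page_b": 0.2, "page_c": 0.05}
    score = sum(weights.get(like, 0.0) for like in likes)
    # An interaction term, so the model is not just a sum of weights.
    if "page_a" in likes and "page_b" in likes:
        score += 0.1
    return score

def rank_likes_by_importance(score_fn, likes):
    """Rank likes by how much the score drops when each one is removed."""
    base = score_fn(likes)
    drops = []
    for like in likes:
        without = [other for other in likes if other != like]
        drops.append((base - score_fn(without), like))
    drops.sort(reverse=True)  # biggest drop first
    return [(like, drop) for drop, like in drops]

person_likes = ["page_c", "page_a", "page_b", "page_d"]
ranking = rank_likes_by_importance(black_box_score, person_likes)
for like, drop in ranking:
    print(like, round(drop, 2))
```

The explanation only needs to call the model, not look inside it, which is the appeal. The catch is that removing one like at a time can understate likes that only matter in combination.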

[Aside, can anyone explain to me what happens when such an algorithm comes across a person with very few likes? I’ve never understood this very well. I don’t know about you, but I have never “liked” anything on Facebook except my friends’ posts.]

Rayid talked about his work trying to develop a system for teachers to understand which students were at risk of dropping out, and for that system to be fair, and he discussed the extent to which that system could or should be transparent.

Oh yeah, and that reminds me that, after I described my book, we had a pretty great argument about whether credit scoring models should be open source, and what that would mean, and what feedback loops that would engender, and who would benefit.

Altogether a great day, and a fantastic discussion. Thanks again to Solon and Moritz for their work in organizing it.

Big Data’s Disparate Impact

Take a look at this paper by Solon Barocas and Andrew D. Selbst entitled Big Data’s Disparate Impact.

It deals with the question of whether current anti-discrimination law is equipped to handle the kind of unintentional discrimination and digital redlining we see emerging in some “big data” models (and that we suspect are hidden in a bunch more). See for example this post for more on this concept.

The short answer is no, our laws are not equipped.

Here’s the abstract:

This article addresses the potential for disparate impact in the data mining processes that are taking over modern-day business. Scholars and policymakers had, until recently, focused almost exclusively on data mining’s capacity to hide intentional discrimination, hoping to convince regulators to develop the tools to unmask such discrimination. Recently there has been a noted shift in the policy discussions, where some have begun to recognize that unintentional discrimination is a hidden danger that might be even more worrisome. So far, the recognition of the possibility of unintentional discrimination lacks technical and theoretical foundation, making policy recommendations difficult, where they are not simply misdirected. This article provides the necessary foundation about how data mining can give rise to discrimination and how data mining interacts with anti-discrimination law.

The article carefully steps through the technical process of data mining and points to different places within the process where a disproportionately adverse impact on protected classes may result from innocent choices on the part of the data miner. From there, the article analyzes these disproportionate impacts under Title VII. The Article concludes both that Title VII is largely ill equipped to address the discrimination that results from data mining. Worse, due to problems in the internal logic of data mining as well as political and constitutional constraints, there appears to be no easy way to reform Title VII to fix these inadequacies. The article focuses on Title VII because it is the most well developed anti-discrimination doctrine, but the conclusions apply more broadly because they are based on the general approach to anti-discrimination within American law.

I really appreciate this paper, because it’s an area I know almost nothing about: discrimination law and the standards for evidence of discrimination.

Sadly, what this paper explains to me is how very far away we are from anything resembling what we need to actually address the problems. For example, even in this paper, where the writers are well aware that training on historical data can unintentionally codify discriminatory treatment, they still seem to assume that the people who build and deploy models will “notice” this treatment. From my experience working in advertising, that’s not actually what happens. We don’t measure the effects of our models on our users. We only see whether we have gained an edge in terms of profit, which is very different.

Essentially, as modelers, we don’t humanize the people on the other side of the transaction, which prevents us from worrying about discrimination or even being aware of it as an issue. It’s so far from “intentional” that it’s almost a ridiculous accusation to make. Even so, it may well be a real problem and I don’t know how we as a society can deal with it unless we update our laws.