
Links about big bad data

August 26, 2015

There have been a lot of great articles recently on my beat, the dark side of big data. I wanted to share some of them with you today:

  1. An interview with Cynthia Dwork by Claire Cain Miller (h/t Marc Sobel). Describes how fairness is not automatic in algorithms, and the somewhat surprising fact that, in order to make sure an algorithm isn’t racist, for example, you must actually take race into consideration when testing it.
  2. How Google Could Rig the 2016 Election by Robert Epstein (h/t Ernie Davis). This describes the unreasonable power of search rank over political trust. Namely, when a given candidate was artificially lifted in rank, people started to trust them more. Google’s meaningless response: “Providing relevant answers has been the cornerstone of Google’s approach to search from the very beginning. It would undermine the people’s trust in our results and company if we were to change course.”
  3. Big Data, Machine Learning, and the Social Sciences: Fairness, Accountability, and Transparency by Hanna Wallach (h/t Arnaud Sahuguet). She addresses the need for social scientists to work alongside computer scientists when working with human behavior data, as well as the need to prioritize the question being asked over data availability. She also promotes the idea of including a concept of uncertainty when possible.
  4. How Big Data Is Unfair by Moritz Hardt. This isn’t new but it is a fantastic overview of fairness issues in big data, specifically how data mining techniques deal with minority groups.
  5. How Social Bias Creeps Into Web Technology by Elizabeth Dwoskin (h/t Ernie Davis). Unfortunately behind a paywall, this article talks about negative unintended consequences of data mining.
  6. A somewhat different topic but great article, The MOOC revolution that wasn’t, by Audrey Watters (h/t Ernie Davis). This article traces the fall of the mighty MOOC ideals. Best quote in the article: “High failure rates and dropouts are features, not bugs,” Caulfield suggests, “because they represent a way to thin pools of applicants for potential employers.”
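
To make the point in item 1 concrete: you can only check whether outcomes differ by group if the group label is available at test time, even when the algorithm itself never sees it. Here is a minimal sketch of such a check. The function name, the toy decisions, and the group labels "a" and "b" are all made up for illustration, and comparing approval rates is just one of many possible fairness checks, not the method described in the interview.

```python
from collections import defaultdict


def positive_rate_by_group(decisions, groups):
    """Return the share of positive decisions (1s) within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for decision, group in zip(decisions, groups):
        totals[group] += 1
        positives[group] += int(decision)
    return {g: positives[g] / totals[g] for g in totals}


# Hypothetical audit data: 1 = approved, 0 = denied.
# The group labels are used only for testing, not by the algorithm.
decisions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = positive_rate_by_group(decisions, groups)
gap = max(rates.values()) - min(rates.values())
print(rates, gap)
```

Without the `groups` column there is nothing to compare, which is the surprising part: leaving race out of the data entirely also removes the ability to detect a disparity.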
  1. August 26, 2015 at 10:55 am

    I work for Google, but I am not speaking for Google.

    The response is saying, yes, maybe Google could influence things, including something as important[1] as the presidential election. But the incentives are aligned for Google: subpar search results will cost Google in the long run. In this case you don’t even need to trust them to be good; it’s in their own interest.

    [1] Did I say important? I meant irrelevant. Let’s face it, those candidates are lying through their teeth, telling you whatever the fuck they think you want to hear to get elected, and then will proceed to do whatever the fuck they wanted to do once they are elected, almost independent of what they said they would do.


    • August 26, 2015 at 12:12 pm

      But what does “subpar search results” even mean for politically disputed subjects? The statement was meaningless largely because that’s undefined.

      By way of comparison, I grew up in a farming community and got weaned on these “don’t need to trust them, it’s in their own interest” arguments (in terms of land, animals, etc.) But there are still always cases of pesticide runoffs into rivers, animal abuse, etc. The “it’s in their interest” argument has been shown to be pretty much empty.


  2. RTG
    August 26, 2015 at 11:19 am

    Cathy, thanks for these links. Haven’t read them yet, but the descriptions sound compelling. It seems like there is a growing backlash against blind, “data-driven” search. I know in my own field, industrial data science where data-derived results must stand up to results from physics-based models that account for over a century of human knowledge, there is increasing awareness that seeking to resolve questions through all means possible, including mining large data sets, should trump a blind faith in the data-driven results without any thought to completeness and bias.

    At least in my field, a lot of the problem seems to be driven by an inability to bridge the chasm between the knowledge-base of the engineers who design and operate industrial equipment and the data scientists/machine learners 😉 who analyze the sensor data. I don’t know the solution to this (I happen to have degrees in the former but worked on large data sets derived from sensor measurements long before the term data science emerged), but I do think it’s critical to view data science as one of many skills to be brought to bear on a problem instead of its own microcosm.


  3. August 29, 2015 at 8:22 am

    Regarding distance learning, I don’t think the comparison is fair because the cost of dropping out is much lower. I suspect many, like me, started things they would never have started otherwise because of work and family time constraints. I’ve dropped out of one but completed another. For me, at the individual level, this translates into success, not into some 50% dropout rate.

