Guest post: Rage against the algorithms

October 8, 2013 Cathy O'Neil, mathbabe

This is a guest post by Nicholas Diakopoulos, a Tow Fellow at the Columbia University Graduate School of Journalism where he is researching the use of data and algorithms in the news. You can find out more about his research and other projects on his website or by following him on Twitter. Crossposted from engenhonetwork with permission from the author.

How can we know the biases of a piece of software? By reverse engineering it, of course.

When was the last time you read an online review about a local business or service on a platform like Yelp? Of course you want to make sure the local plumber you hire is honest, or that even if the date is a dud, at least the restaurant isn’t lousy. A recent survey found that 76 percent of consumers check online reviews before buying, so a lot can hinge on a good or bad review. Such sites have become so important to local businesses that it’s not uncommon for scheming owners to hire shills to boost themselves or put down their rivals.

To protect users from getting duped by fake reviews, Yelp employs an algorithmic review reviewer that constantly scans reviews and relegates suspicious ones to a “filtered reviews” page, effectively de-emphasizing them without deleting them entirely. But of course that algorithm is not perfect: it sometimes de-emphasizes legitimate reviews and leaves actual fakes intact. Oops. Some businesses have complained, alleging that the filter can incorrectly remove all of their most positive reviews, leaving them with a lowly one- or two-star average.

This is just one example of how algorithms are becoming ever more important in society, for everything from search engine personalization, discrimination, defamation, and censorship online, to how teachers are evaluated, how markets work, how political campaigns are run, and even how something like immigration is policed. Algorithms, driven by vast troves of data, are the new power brokers in society, in the corporate world as well as in government.

They have biases like the rest of us. And they make mistakes. But they’re opaque, hiding their secrets behind layers of complexity. How can we deal with the power that algorithms may exert on us? How can we better understand where they might be wronging us?

Transparency is the vogue response to this problem right now. The big “open data” transparency-in-government push that started in 2009 was largely the result of an executive memo from President Obama. And of course corporations are on board too; Google publishes a biannual transparency report showing how often they remove or disclose information to governments. Transparency is an effective tool for inculcating public trust and is even the way journalists are now trained to deal with the hole where mighty Objectivity once stood.

But transparency knows some bounds. For example, though the Freedom of Information Act facilitates the public’s right to relevant government data, it has no legal teeth for compelling the government to disclose how that data was algorithmically generated or used in publicly relevant decisions (extensions worth considering).

Moreover, corporations have self-imposed limits on how transparent they want to be, since exposing too many details of their proprietary systems may undermine a competitive advantage (trade secrets), or leave the system open to gaming and manipulation. Furthermore, whereas transparency of data can be achieved simply by publishing a spreadsheet or database, transparency of an algorithm can be much more complex, resulting in additional labor costs in both the creation and the consumption of that information: a cognitive overload that keeps all but the most determined at bay. Methods for usable transparency need to be developed so that the relevant aspects of an algorithm can be presented in an understandable way.

Given the challenges to employing transparency as a check on algorithmic power, a new and complementary alternative is emerging. I call it algorithmic accountability reporting. At its core it’s really about reverse engineering—articulating the specifications of a system through a rigorous examination drawing on domain knowledge, observation, and deduction to unearth a model of how that system works.

As interest grows in understanding the broader impacts of algorithms, this kind of accountability reporting is already happening in some newsrooms, as well as in academic circles. At the Wall Street Journal a team of reporters probed e-commerce platforms to identify instances of potential price discrimination in dynamic and personalized online pricing. By polling different websites they were able to spot several, such as Staples.com, that were adjusting prices dynamically based on the location of the person visiting the site. At the Daily Beast, reporter Michael Keller dove into the iPhone spelling correction feature to help surface patterns of censorship and see which words, like “abortion,” the phone wouldn’t correct if they were misspelled. In my own investigation for Slate, I traced the contours of the editorial criteria embedded in search engine autocomplete algorithms. By collecting hundreds of autocompletions for queries relating to sex and violence I was able to ascertain which terms Google and Bing were blocking or censoring, uncovering mistakes in how these algorithms apply their editorial criteria.

All of these stories share a more or less common method. Algorithms are essentially black boxes, exposing an input and output without betraying any of their inner organs. You can’t see what’s going on inside directly, but if you vary the inputs in enough different ways and pay close attention to the outputs, you can start piecing together some likeness for how the algorithm transforms each input into an output. The black box starts to divulge some secrets.
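
The probing method described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_black_box` is a made-up stand-in for whatever system is under study (it is not how any real site prices anything), and the field names and values are hypothetical. The idea is to hold a baseline input fixed, vary one field at a time, and note which fields change the output.

```python
def toy_black_box(query):
    """Hypothetical stand-in for the system under study.

    The prober is assumed not to know this rule; it is invented here
    purely so the sketch has something to probe.
    """
    price = 20.0
    if query["zip_code"].startswith("9"):
        price += 2.0
    return price


def probe(black_box, baseline, variations):
    """Vary one input field at a time and record the output for each value."""
    results = {}
    for field, values in variations.items():
        outputs = {}
        for value in values:
            # Copy the baseline and change only the field being tested.
            query = dict(baseline)
            query[field] = value
            outputs[value] = black_box(query)
        results[field] = outputs
    return results


def sensitive_fields(results):
    """Return the fields for which different values produced different outputs."""
    return sorted(
        field for field, outputs in results.items()
        if len(set(outputs.values())) > 1
    )


# Hypothetical baseline and variations, chosen only for illustration.
baseline = {"zip_code": "10001", "browser": "firefox", "product": "stapler"}
variations = {
    "zip_code": ["10001", "94103", "60601"],
    "browser": ["firefox", "chrome", "safari"],
}

results = probe(toy_black_box, baseline, variations)
print(sensitive_fields(results))
```

Changing one field at a time keeps the attribution simple, but it would miss rules that only fire on combinations of inputs. A real investigation would also have to cope with noise, rate limits, and outputs that change over time.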

Algorithmic accountability is also gaining traction in academia. At Harvard, Latanya Sweeney has looked at how online advertisements can be biased by the racial association of names used as queries. In searches for “black names” as opposed to “white names,” ads using the word “arrest” appeared more often for the online background check service Instant Checkmate. She thinks the disparity in the use of “arrest” suggests a discriminatory connection between race and crime. Her method, as with all of the other examples above, does point to a weakness though: Is the discrimination caused by Google, by Instant Checkmate, or simply by pre-existing societal biases? We don’t know, and correlation does not equal intention. As much as algorithmic accountability can help us diagnose the existence of a problem, we have to go deeper and do more journalistic-style reporting to understand the motivations or intentions behind an algorithm. We still need to answer the question of why.
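
To make the "diagnose the existence of a problem" step concrete, here is a minimal sketch of how a disparity between two groups of queries might be checked. It is not Sweeney's actual analysis: the two-proportion z-test is one common choice swapped in for illustration, and the counts are invented placeholders, not her data.

```python
import math


def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Compare two observed rates with a pooled two-proportion z-test.

    Returns both rates, the z statistic, and a two-sided p-value
    based on the normal approximation.
    """
    rate_a = hits_a / n_a
    rate_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value


# INVENTED counts for illustration only: how many of 100 searches in each
# name group showed an ad containing the word "arrest".
rate_a, rate_b, z, p_value = two_proportion_z(60, 100, 40, 100)
print(f"group A rate={rate_a:.2f}, group B rate={rate_b:.2f}, "
      f"z={z:.2f}, p={p_value:.4f}")
```

A small p-value here would only say the gap is unlikely to be chance. It says nothing about whether the search engine, the advertiser, or user behavior produced it, which is exactly the question that still needs reporting.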

And this is why it’s absolutely essential to have computational journalists not just engaging in the reverse engineering of algorithms, but also reporting and digging deeper into the motives and design intentions behind algorithms. Sure, it can be hard to convince companies running such algorithms to open up in detail about how their algorithms work, but interviews can still uncover details about larger goals and objectives built into an algorithm, better contextualizing a reverse-engineering analysis. Transparency is still important here too, as it adds to the information that can be used to characterize the technical system.

Despite the fact that forward thinkers like Larry Lessig have been writing for some time about how code is a lever on behavior, we’re still in the early days of developing methods for holding that code and its influence accountable. “There’s no conventional or obvious approach to it. It’s a lot of testing or trial and error, and it’s hard to teach in any uniform way,” noted Jeremy Singer-Vine, a reporter and programmer who worked on the WSJ price discrimination story. It will always be a messy business with lots of room for creativity, but given the growing power that algorithms wield in society it’s vital to continue to develop, codify, and teach more formalized methods of algorithmic accountability. In the absence of new legal measures, it may just provide a novel way to shed light on such systems, particularly in cases where transparency doesn’t or can’t offer much clarity.

Categories: data science, guest post, modeling

Comments (6)

Dave

October 8, 2013 at 7:51 am

1262 words that imply Yelp’s algorithms are both deep and honest? Yelp has to be loving this, after the other attention they’re getting. Don’t follow the bits, follow the money. A mobster has a better chance of understanding what’s going on here than an MIT computer scientist. I believe the following, because I’ve heard it myself from so many small business owners: “The second reason I believe Yelp is doomed? More and more stories are popping up every single day about how completely biased they actually are. Want positive reviews? Pay them. This isn’t me talking. This is thousands of businesses who have gotten a few good reviews, then been contacted by Yelp to advertise. When they don’t, those reviews mysteriously disappear, only to be replaced by negatives.” – See more at: http://shankman.com/why-i-believe-yelp-is-doomed-to-fail/#sthash.0LLDGqqR.dpuf

Higby

October 8, 2013 at 8:37 am

Nick said: “Algorithms are essentially black boxes, exposing an input and output without betraying any of their inner organs. You can’t see what’s going on inside directly, but if you vary the inputs in enough different ways and pay close attention to the outputs, you can start piecing together some likeness for how the algorithm transforms each input into an output. The black box starts to divulge some secrets.”

And this is exactly the problem — the “trade secrets” defense against reverse engineering — which is why corporations resist and circumvent disclosure to begin with.

- medicalquackblog
  
  October 8, 2013 at 11:57 am
  
  Exactly, the “trade secrets” defense, and the average consumer really does not get this at all. Trade secrets can be an algorithm that denies some type of service or care with insurers. That’s been my focus for a long time, as I saw a lot of that not only when writing my own software years back but also when integrating billing software from someone else. I did this back in the early days before it all moved to the web; the same principles are there, but things are more complex today by a long shot.
  
  I ended up writing a series of blog posts called “The Attack of the Killer Algorithms” to draw attention to this, thus the catchy name. Later I wrote a few other posts about what I called “Algo Duping,” addressing how the Killer Algorithms gain power, again to draw some consumer attention and serve as wake-up calls, if you will. I’m still doing it :) I created a page called “Algo Duping” and put some videos on there, Cathy’s included, to put it out there with some additional links to some of my other posts on the topic. I need to add some more videos, but I preface the page with the fact that the videos are all created by people smarter than me and I’m just the one connecting educational video dots. My audience is the average consumer, so I try to keep the collection at a level where most can get “something” out of watching the videos. Maybe it’s a crazy idea, but I keep it going because, again, the average consumer has no clue how the power of models and algorithms running on servers 24/7 behind closed doors impacts all of us in almost every portion of our lives.
  
  http://www.ducknet.net/attack-of-the-killer-algorithms/
  
Anton Jeffery Rasmussen

October 8, 2013 at 10:01 am

What’s to stop paid reviewers or ghost-writers from biasing Yelp reviews? At a certain point, *some* black-hat stuff is gonna get through.

To me this is a knowledge problem. Forget trying to determine what’s in the box and focus more on incentives. If a business really wants to have positive reviews that aren’t being achieved organically they’ll just pay for them . . . basic economics.

That said, the niche of computational journalism is intriguing.

n8chz

October 8, 2013 at 1:43 pm

You say that “transparency of data can be achieved simply by publishing a spreadsheet or database,” to which I would add (after data) “by the curator of said data.” Reverse engineering from publicly available data points to tabular (spreadsheet or database) data is itself a largely unaccomplished feat. As for the cognitive overload in reverse engineering algorithms, doesn’t disassembly or decompilation violate the EULA?

Richard P. F.

October 8, 2013 at 5:36 pm

Transparency-in-government push that started in 2009?

Two Liberal Media Giants Call Obama Administration ‘Most Closed, Control-Freak, Manipulative, And Secretive’ Ever Covered
Posted by Andrea Ryan on Saturday, October 5, 2013, 2:59 PM

The most Liberal operators in the media are calling out Obama for the creepy, scheming manipulator that he is.

A New York Times reporter called our transparent president the “most closed, control-freak administration I’ve ever covered.” Add paranoid. Government employees are forced to take lie detector tests to ensure they’re not talking to reporters.

Big Lib, Bob Schieffer, of CBS News agreed. But he used the words the “most manipulative and secretive administration” he’s ever covered.

Hot Air reports,

Remember when the media rushed to talk about transparency in the Barack Obama “Hope and Change” era? Good times, good times. Leonard Downie, who once worked as the executive editor of the Washington Post and wrote a novel about Washington corruption and the Iraq War, finds a bigger and non-fictional problem in the successor to George W. Bush. Downie gives the Post a preview of his report from the Committee to Protect Journalists which outlines the Obama war on reporters and their sources:

“A memo went out from the chief of staff a year ago to White House employees and the intelligence agencies that told people to freeze and retain any e-mail, and presumably phone logs, of communications with me,” Sanger said. As a result, longtime sources no longer talk to him. “They tell me: ‘David, I love you, but don’t e-mail me. Let’s don’t chat until this blows over.’ ”

Sanger, who has worked for the Times in Washington for two decades, said, “This is the most closed, control-freak administration I’ve ever covered.”

Many leak investigations include lie-detector tests for government officials with access to the information at issue. “Reporters are interviewing sources through intermediaries now,” Barr told me, “so the sources can truthfully answer on polygraphs that they didn’t talk to reporters.”

The investigations have been “a kind of slap in the face” for reporters and their sources, said Smith of the Center for Public Integrity. “It means you have to use extraordinary measures for contacts with officials speaking without authorization.”

Amusingly, Downie posits this question at the end of the essay:

Will Obama recognize that all this threatens his often-stated but unfulfilled goal of making government more transparent and accountable? None of the Washington news media veterans I talked to were optimistic.

“Whenever I’m asked what is the most manipulative and secretive administration I’ve covered, I always say it’s the one in office now,” Bob Schieffer, CBS News anchor and chief Washington correspondent, told me.

There you have it.
