This is a guest post by Nicholas Diakopoulos, a Tow Fellow at the Columbia University Graduate School of Journalism where he is researching the use of data and algorithms in the news. You can find out more about his research and other projects on his website or by following him on Twitter. Crossposted from engenhonetwork with permission from the author.
How can we know the biases of a piece of software? By reverse engineering it, of course.
When was the last time you read an online review about a local business or service on a platform like Yelp? Of course you want to make sure the local plumber you hire is honest, or that even if the date is a dud, at least the restaurant isn’t lousy. A recent survey found that 76 percent of consumers check online reviews before buying, so a lot can hinge on a good or bad review. Such sites have become so important to local businesses that it’s not uncommon for scheming owners to hire shills to boost themselves or put down their rivals.
To protect users from getting duped by fake reviews, Yelp employs an algorithmic review reviewer which constantly scans reviews and relegates suspicious ones to a “filtered reviews” page, effectively de-emphasizing them without deleting them entirely. But of course that algorithm is not perfect, and it sometimes de-emphasizes legitimate reviews and leaves actual fakes intact (oops). Some businesses have complained, alleging that the filter can incorrectly remove all of their most positive reviews, leaving them with a lowly one- or two-star average.
This is just one example of how algorithms are becoming ever more important in society, for everything from search engine personalization, discrimination, defamation, and censorship online, to how teachers are evaluated, how markets work, how political campaigns are run, and even how something like immigration is policed. Algorithms, driven by vast troves of data, are the new power brokers in society, both in the corporate world and in government.
They have biases like the rest of us. And they make mistakes. But they’re opaque, hiding their secrets behind layers of complexity. How can we deal with the power that algorithms may exert on us? How can we better understand where they might be wronging us?
Transparency is the vogue response to this problem right now. The big “open data” transparency-in-government push that started in 2009 was largely the result of an executive memo from President Obama. And of course corporations are on board too; Google publishes a biannual transparency report showing how often they remove or disclose information to governments. Transparency is an effective tool for inculcating public trust and is even the way journalists are now trained to deal with the hole where mighty Objectivity once stood.
But transparency knows some bounds. For example, though the Freedom of Information Act facilitates the public’s right to relevant government data, it has no legal teeth for compelling the government to disclose how that data was algorithmically generated or used in publicly relevant decisions (extensions worth considering).
Moreover, corporations have self-imposed limits on how transparent they want to be, since exposing too many details of their proprietary systems may undermine a competitive advantage (trade secrets), or leave the system open to gaming and manipulation. Furthermore, whereas transparency of data can be achieved simply by publishing a spreadsheet or database, transparency of an algorithm can be much more complex, resulting in additional labor costs both in creation as well as consumption of that information—a cognitive overload that keeps all but the most determined at bay. Methods for usable transparency need to be developed so that the relevant aspects of an algorithm can be presented in an understandable way.
Given the challenges to employing transparency as a check on algorithmic power, a new and complementary alternative is emerging. I call it algorithmic accountability reporting. At its core it’s really about reverse engineering—articulating the specifications of a system through a rigorous examination drawing on domain knowledge, observation, and deduction to unearth a model of how that system works.
As interest grows in understanding the broader impacts of algorithms, this kind of accountability reporting is already happening in some newsrooms, as well as in academic circles. At the Wall Street Journal a team of reporters probed e-commerce platforms to identify instances of potential price discrimination in dynamic and personalized online pricing. By polling different websites they were able to spot several, such as Staples.com, that were adjusting prices dynamically based on the location of the person visiting the site. At the Daily Beast, reporter Michael Keller dove into the iPhone spelling correction feature to help surface patterns of censorship and see which words, like “abortion,” the phone wouldn’t correct if they were misspelled. In my own investigation for Slate, I traced the contours of the editorial criteria embedded in search engine autocomplete algorithms. By collecting hundreds of autocompletions for queries relating to sex and violence I was able to ascertain which terms Google and Bing were blocking or censoring, uncovering mistakes in how these algorithms apply their editorial criteria.
All of these stories share a more or less common method. Algorithms are essentially black boxes, exposing an input and output without betraying any of their inner organs. You can’t see what’s going on inside directly, but if you vary the inputs in enough different ways and pay close attention to the outputs, you can start piecing together some likeness for how the algorithm transforms each input into an output. The black box starts to divulge some secrets.
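As a rough illustration of that shared method, here is a minimal Python sketch of input-variation probing. It is not the code behind any of the stories above: `probe` and `toy_price` are hypothetical names, and `toy_price` is a made-up stand-in for an opaque system whose output happens to depend on ZIP code.

```python
def probe(black_box, baseline, variations):
    """Vary one input field at a time and record which changes alter the output."""
    base_out = black_box(**baseline)
    sensitive = {}
    for field, values in variations.items():
        changed = []
        for value in values:
            # Copy the baseline and override a single field.
            trial = dict(baseline, **{field: value})
            out = black_box(**trial)
            if out != base_out:
                changed.append((value, out))
        if changed:
            sensitive[field] = changed
    return sensitive


# Hypothetical black box: we pretend we cannot read this function's body.
def toy_price(zip_code, browser):
    return 14.29 if zip_code.startswith("1") else 15.79


baseline = {"zip_code": "10027", "browser": "firefox"}
variations = {
    "zip_code": ["10001", "60614", "94110"],
    "browser": ["chrome", "safari"],
}
findings = probe(toy_price, baseline, variations)
print(findings)  # {'zip_code': [('60614', 15.79), ('94110', 15.79)]}
```

A real investigation would need many more inputs, repeated trials to rule out noise, and care about interacting variables. Varying one field at a time is only the simplest possible design.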
Algorithmic accountability is also gaining traction in academia. At Harvard, Latanya Sweeney has looked at how online advertisements can be biased by the racial association of names used as queries. When you search for “black names” as opposed to “white names,” ads using the word “arrest” appeared more often for online background check service Instant Checkmate. She thinks the disparity in the use of “arrest” suggests a discriminatory connection between race and crime. Her method, as with all of the other examples above, does point to a weakness though: Is the discrimination caused by Google, by Instant Checkmate, or simply by pre-existing societal biases? We don’t know, and correlation does not equal intention. As much as algorithmic accountability can help us diagnose the existence of a problem, we have to go deeper and do more journalistic-style reporting to understand the motivations or intentions behind an algorithm. We still need to answer the question of why.
And this is why it’s absolutely essential to have computational journalists not just engaging in the reverse engineering of algorithms, but also reporting and digging deeper into the motives and design intentions behind algorithms. Sure, it can be hard to convince companies running such algorithms to open up in detail about how their algorithms work, but interviews can still uncover details about larger goals and objectives built into an algorithm, better contextualizing a reverse-engineering analysis. Transparency is still important here too, as it adds to the information that can be used to characterize the technical system.
Despite the fact that forward thinkers like Larry Lessig have been writing for some time about how code is a lever on behavior, we’re still in the early days of developing methods for holding that code and its influence accountable. “There’s no conventional or obvious approach to it. It’s a lot of testing or trial and error, and it’s hard to teach in any uniform way,” noted Jeremy Singer-Vine, a reporter and programmer who worked on the WSJ price discrimination story. It will always be a messy business with lots of room for creativity, but given the growing power that algorithms wield in society it’s vital to continue to develop, codify, and teach more formalized methods of algorithmic accountability. In the absence of new legal measures, it may just provide a novel way to shed light on such systems, particularly in cases where transparency doesn’t or can’t offer much clarity.
Crossposted from /var/null, a blog written by Aditya Mukerjee. Aditya graduated from Columbia with a degree in CS and statistics, was a hackNY Fellow, worked in data at OkCupid, and on the server team at foursquare. He currently serves as the Hacker-in-Residence at Quotidian Ventures.
A couple of weeks ago, I was scheduled to take a trip from New York (JFK) to Los Angeles on JetBlue. Every year, my family goes on a one-week pilgrimage, where we put our work on hold and spend time visiting temples, praying, and spending time with family and friends. To my Jewish friends, I often explain this trip as vaguely similar to the Sabbath, except we take one week of rest per year, rather than one day per week.
Our family is not Muslim, but by coincidence, this year, our trip happened to be during the last week of Ramadan.
By further coincidence, this was also the same week that I was moving out of my employer-provided temporary housing (at NYU) and moving into my new apartment. The night before my trip, I enlisted the help of two friends and we took most of my belongings, in a couple of suitcases, to my new apartment. The apartment was almost completely unfurnished – I planned on getting new furniture upon my return – so I dropped my few bags (one containing an air mattress) in the corner. Even though I hadn’t decorated the apartment yet, in accordance with Hindu custom, I taped a single photograph to the wall in my bedroom — a long-haired saint with his hands outstretched in pronam (a sign of reverence and respect).
The next morning, I packed the rest of my clothes into a suitcase and took a cab to the airport. I didn’t bother to eat breakfast, figuring I would grab some yogurt in the terminal while waiting to board.
I got in line for security at the airport and handed the agent my ID. Another agent came over and handed me a paper slip, which he said was being used to track the length of the security lines. He said, “Just hand this to someone when your stuff goes through the x-ray machines, and we’ll know how long you were in line.” I looked at the timestamp on the paper: 10:40.
When going through the security line, I opted out (as I always used to) of the millimeter wave detectors. I fly often enough, and have opted out often enough, that I was prepared for what comes next: a firm pat-down by a TSA employee wearing non-latex gloves, who uses the back of his hand when patting down the inside of the thighs.
After the pat-down, the TSA agent swabbed his hands with some cotton-like material and put the swab in the machine that supposedly checks for explosive residue. The machine beeped. “We’re going to need to pat you down again, this time in private,” the agent said.
Having been selected before for so-called “random” checks, I assumed that this was another such check.
“What do you mean, ‘in private’? Can’t we just do this out here?”
“No, this is a different kind of pat-down, and we can’t do that in public.” When I asked him why this pat-down was different, he wouldn’t tell me. When I asked him specifically why he couldn’t do it in public, he said “Because it would be obscene.”
Naturally, I balked at the thought of going somewhere behind closed doors where a person I just met was going to touch me in “obscene” ways. I didn’t know at the time (and the agent never bothered to tell me) that the TSA has a policy that requires two agents to be present during every private pat-down. I’m not sure if that would make me feel more or less comfortable.
Noticing my hesitation, the agent offered to have his supervisor explain the procedure in more detail. He brought over his supervisor, a rather harried man who, instead of explaining the pat-down to me, rather rudely explained to me that I could either submit immediately to a pat-down behind closed-doors, or he could call the police.
At this point, I didn’t mind having to leave the secure area and go back through security again (this time not opting out of the machines), but I didn’t particularly want to get the cops involved. I told him, “Okay, fine, I’ll leave”.
“You can’t leave here.”
“Are you detaining me, then?” I’ve been through enough “know your rights” training to know how to handle police searches; however, TSA agents are not law enforcement officials. Technically, they don’t even have the right to detain you against your will.
“We’re not detaining you. You just can’t leave.” My jaw dropped.
“Either you’re detaining me, or I’m free to go. Which one is it?” I asked.
He glanced for a moment at my backpack, then snatched it out of the conveyor belt. “Okay,” he said. “You can leave, but I’m keeping your bag.”
I was speechless. My bag had both my work computer and my personal computer in it. The only way for me to get it back from him would be to snatch it back, at which point he could simply claim that I had assaulted him. I was trapped.
While we waited for the police to arrive, I took my phone and quickly tried to call my parents to let them know what was happening. Unfortunately, my mom’s voicemail was full, and my dad had never even set his up.
“Hey, what’s he doing?” One of the TSA agents had noticed I was touching my phone. “It’s probably fine; he’s leaving anyway,” another said.
The cops arrived a few minutes later, spoke with the TSA agents for a moment, and then came over and gave me one last chance to submit to the private examination. “Otherwise, we have to escort you out of the building.” I asked him if he could be present while the TSA agent was patting me down.
“No,” he explained, “because when we pat people down, it’s to lock them up.”
I only realized the significance of that explanation later. At this point, I didn’t particularly want to miss my flight. Foolishly, I said, “Fine, I’ll do it.”
The TSA agents and police escorted me to a holding room, where they patted me down again – this time using the front of their hands as they passed down the front of my pants. While they patted me down, they asked me some basic questions.
“What’s the purpose of your travel?”
“Personal,” I responded (as opposed to business).
“Are you traveling with anybody?”
“My parents are on their way to LA right now; I’m meeting them there.”
“How long is your trip?”
“What will you be doing?”
Mentally, I sighed. There wasn’t any other way I could answer this next question.
“We’ll be visiting some temples.” He raised his eyebrow, and I explained that the next week was a religious holiday, and that I was traveling to LA to observe it with my family.
After patting me down, they swabbed not only their hands, but also my backpack, shoes, wallet, and belongings, and then walked out of the room to put it through the machine again. After more than five minutes, I started to wonder why they hadn’t said anything, so I asked the police officer who was guarding the door. He called over the TSA agent, who told me,
“You’re still setting off the alarm. We need to call the explosives specialist”.
I waited for about ten minutes before the specialist showed up. He walked in without a word, grabbed the bins with my possessions, and started to leave. Unlike the other agents I’d seen, he wasn’t wearing a uniform, so I was a bit taken aback.
“What’s happening?” I asked.
“I’m running it through the x-ray again,” he snapped. “Because I can. And I’m going to do it again, and again, until I decide I’m done”. He then asked the TSA agents whether they had patted me down. They said they had, and he just said, “Well, try again”, and left the room. Again I was told to stand with my legs apart and my hands extended horizontally while they patted me down all over before stepping outside.
The explosives specialist walked back into the room and asked me why my clothes were testing positive for explosives. I told him, quite truthfully, “I don’t know.” He asked me what I had done earlier in the day.
“Well, I had to pack my suitcase, and also clean my apartment.”
“I moved my stuff from my old apartment to my new one”.
“What did you eat this morning?”
“Nothing,” I said. Only later did I realize that this made it sound like I was fasting, when in reality, I just hadn’t had breakfast yet.
“Are you taking any medications?”
The other TSA agents stood and listened while the explosives specialist asked about every medication I had taken “recently”, both prescription and over-the-counter, and asked me to explain any medical conditions for which any prescription medicine had been prescribed. Even though I wasn’t carrying any medication on me, he still asked for my complete “recent” medical history.
“What have you touched that would cause you to test positive for certain explosives?”
“I can’t think of anything. What does it say is triggering the alarm?” I asked.
“I’m not going to tell you! It’s right here on my sheet, but I don’t have to tell you what it is!” he exclaimed, pointing at his clipboard.
I was at a loss for words. The first thing that came to my mind was, “Well, I haven’t touched any explosives, but if I don’t even know what chemical we’re talking about, I don’t know how to figure out why the tests are picking it up.”
He didn’t like this answer, so he told them to run my belongings through the x-ray machine and pat me down again, then left the room.
I glanced at my watch. Boarding would start in fifteen minutes, and I hadn’t even had anything to eat. A TSA officer in the room noticed me craning my neck to look at my watch on the table, and he said, “Don’t worry, they’ll hold the flight.”
As they patted me down for the fourth time, a female TSA agent asked me for my baggage claim ticket. I handed it to her, and she told me that a woman from JetBlue corporate security needed to ask me some questions as well. I was a bit surprised, but agreed. After the pat-down, the JetBlue representative walked in and coolly introduced herself by name.
She explained, “We have some questions for you to determine whether or not you’re permitted to fly today. Have you flown on JetBlue before?”
“Maybe about ten times,” I guessed.
“Ten what? Per month?”
“No, ten times total.”
She paused, then asked,
“Will you have any trouble following the instructions of the crew and flight attendants on board the flight?”
“No.” I had no idea why this would even be in doubt.
“We have some female flight attendants. Would you be able to follow their instructions?”
I was almost insulted by the question, but I answered calmly, “Yes, I can do that.”
“Okay,” she continued, “and will you need any special treatment during your flight? Do you need a special place to pray on board the aircraft?”
Only here did it hit me.
“No,” I said with a light-hearted chuckle, trying to conceal any sign of how offensive her questions were. “Thank you for asking, but I don’t need any special treatment.”
She left the room, again, leaving me alone for another ten minutes or so. When she finally returned, she told me that I had passed the TSA’s inspection. “However, based on the responses you’ve given to questions, we’re not going to permit you to fly today.”
I was shocked. “What do you mean?” were the only words I could get out.
“If you’d like, we’ll rebook you for the flight tomorrow, but you can’t take the flight this afternoon, and we’re not permitting you to rebook for any flight today.”
I barely noticed the irony of the situation – that the TSA and NYPD were clearing me for takeoff, but JetBlue had decided to ground me. At this point, I could think of nothing else but how to inform my family, who were expecting me to be on the other side of the country, that I wouldn’t be meeting them for dinner after all. In the meantime, an officer entered the room and told me to continue waiting there. “We just have one more person who needs to speak with you before you go.” By then, I had already been “cleared” by the TSA and NYPD, so I couldn’t figure out why I still needed to be questioned. I asked them if I could use my phone and call my family.
“No, this will just take a couple of minutes and you’ll be on your way.” The time was 12:35.
He stepped out of the room – for the first time since I had been brought into the cell, there was no NYPD officer guarding the door. Recognizing my short window of opportunity, I grabbed my phone from the table and quickly texted three of my local friends – two who live in Brooklyn, and one who lives in Nassau County – telling them that I had been detained by the TSA and that I couldn’t board my flight. I wasn’t sure what was going to happen next, but since nobody had any intention of reading me my Miranda rights, I wanted to make sure people knew where I was.
After fifteen minutes, one of the police officers marched into the room and scolded, “You didn’t tell us you have a checked bag!” I explained that I had already handed my baggage claim ticket to a TSA agent, so I had in fact informed someone that I had a checked bag. Looking frustrated, he turned and walked out of the room, without saying anything more.
After about twenty minutes, another man walked in and introduced himself as representing the FBI. He asked me many of the same questions I had already answered multiple times – my name, my address, what I had done so far that day, etc.
He then asked, “What is your religion?”
“How religious are you? Would you describe yourself as ‘somewhat religious’ or ‘very religious’?”
I was speechless at the idea of being forced to talk about the extent of my religious beliefs to a complete stranger. “Somewhat religious”, I responded.
“How many times a day do you pray?” he asked. This time, my surprise must have registered on my face, because he quickly added, “I’m not trying to offend you; I just don’t know anything about Hinduism. For example, I know that people are fasting for Ramadan right now, but I don’t have any idea what Hindus actually do on a daily basis.”
I nearly laughed at the idea of being questioned by a man who was able to admit his own ignorance on the subject matter, but I knew enough to restrain myself. The questioning continued for another few minutes. At one point, he asked me what cleaning supplies I had used that morning.
“Well, some window cleaner, disinfectant -” I started, before he cut me off.
“This is important,” he said, sternly. “Be specific.” I listed the specific brands that I had used.
Suddenly I remembered something: the very last thing I had done before leaving was to take the bed sheets off of my bed, as I was moving out. Since this was a dorm room, to guard against bedbugs, my dad (a physician) had given me an over-the-counter spray to spray on the mattress when I moved in, over two months previously. Was it possible that that was still active and triggering their machines?
“I also have a bedbug spray,” I said. “I don’t know the name of it, but I knew it was over-the-counter, so I figured it probably contained permethrin.” Permethrin is an insecticide, sold over-the-counter to kill bed bugs and lice.
“Perm-what?” He asked me to spell it.
After he wrote it down, I asked him if I could have something to drink. “I’ve been here talking for three hours at this point,” I explained. “My mouth is like sandpaper”. He refused, saying
“We’ll just be a few minutes, and then you’ll be able to go.”
“Do you have any identification?” I showed him my driver’s license, which still listed my old address. “You have nothing that shows your new address?” he exclaimed.
“Well, no, I only moved there on Thursday.”
“What about the address before that?”
“I was only there for two months – it was temporary housing for work”. I pulled my NYU ID out of my wallet. He looked at it, then a police officer in the room took it from him and walked out.
“What about any business cards that show your work address?” I mentally replayed my steps from the morning, and remembered that I had left behind my business card holder, thinking I wouldn’t need it on my trip.
“No, I left those at home.”
“You have none?”
“Well, no, I’m going on vacation, so I didn’t refill them last night.” He scoffed. “I always carry my cards on me, even when I’m on vacation.” I had no response to that – what could I say?
“What about a direct line at work? Is there a phone number I can call where it’ll patch me straight through to your voicemail?”
“No,” I tried in vain to explain. “We’re a tech company; everyone just uses their cell phones”. To this day, I don’t think my company has a working landline phone in the entire office – our “main line” is a virtual assistant that just forwards calls to our cell phones. I offered to give him the name and phone number of one of our venture partners instead, which he reluctantly accepted.
Around this point, the officer who had taken my NYU ID stormed into the room.
“They put an expiration sticker on your ID, right?” I nodded. “Well then why did this ID expire in 2010?!” he accused.
I took a look at the ID and calmly pointed out that it said “August 2013” in big letters on the ID, and that the numbers “8/10” meant “August 10th, 2013”, not “August, 2010”. I added, “See, even the expiration sticker says 2013 on it above the date”. He studied the ID again for a moment, then walked out of the room again, looking a little embarrassed.
The FBI agent resumed speaking with me. “Do you have any credit cards with your name on them?” I was hesitant to hand them a credit card, but I didn’t have much of a choice. Reluctantly, I pulled out a credit card and handed it to him. “What’s the limit on it?” he said, and then, noticing that I didn’t laugh, quickly added, “That was a joke.”
He left the room, and then a series of other NYPD and TSA agents came in and started questioning me, one after the other, with the same questions that I’d already answered previously. In between, I was left alone, except for the officer guarding the door.
At one point, when I went to the door and asked the officer when I could finally get something to drink, he told me, “Just a couple more minutes. You’ll be out of here soon.”
“That’s what they said an hour ago,” I complained.
“You also said a lot of things, kid,” he said with a wink. “Now sit back down”.
I sat back down and waited some more. Another time, I looked up and noticed that a different officer was guarding the door. By this time, I hadn’t had any food or water in almost eighteen hours. I could feel the energy draining from me, both physically and mentally, and my head was starting to spin. I went to the door and explained the situation to the officer. “At the very least, I really need something to drink.”
“Is this a medical emergency? Are you going to pass out? Do we need to call an ambulance?” he asked, skeptically. His tone was almost mocking, conveying more scorn than actual concern or interest.
“No,” I responded. I’m not sure why I said that. I was lightheaded enough that I certainly felt like I was going to pass out.
“Are you diabetic?”
“No,” I responded.
Again he repeated the familiar refrain. “We’ll get you out of here in a few minutes.” I sat back down. I was starting to feel cold, even though I was sweating – the same way I often feel when a fever is coming on. But when I put my hand to my forehead, I felt fine.
One of the police officers who questioned me about my job was less-than-familiar with the technology field.
“What type of work do you do?”
“I work in venture capital.”
“Venture Capital – is that the thing I see ads for on TV all the time?” For a moment, I was dumbfounded – what venture capital firm advertises on TV? Suddenly, it hit me.
“Oh! You’re probably thinking of Capital One Venture credit cards.” I said this politely and with a straight face, but unfortunately, the other cop standing in the room burst out laughing immediately. Silently, I was shocked – somehow, this was the interrogation procedure for confirming that I actually had the job I claimed to have.
Another pair of NYPD officers walked in, and one asked me to identify some landmarks around my new apartment. One was, “When you’re facing the apartment, is the parking on the left or on the right?” I thought this was an odd question, but I answered it correctly. He whispered something in the ear of the other officer, and they both walked out.
The onslaught of NYPD agents was broken when a South Asian man with a Homeland Security badge walked in and said something that sounded unintelligible. After a second, I realized he was speaking Hindi.
“Sorry, I don’t speak Hindi.”
“Oh!” he said, noticeably surprised at how “Americanized” this suspect was. We chatted for a few moments, during which time I learned that his family was Pakistani, and that he was Muslim, though he was not fasting for Ramadan. He asked me the standard repertoire of questions that I had been answering for other agents all day.
Finally, the FBI agent returned.
“How are you feeling right now?” he asked. I wasn’t sure if he was expressing genuine concern or interrogating me further, but by this point, I had very little energy left.
“A bit nauseous, and very thirsty.”
“You’ll have to understand, when a person of your… background walks into here, travelling alone, and sets off our alarms, people start to get a bit nervous. I’m sure you’ve been following what’s been going on in the news recently. You’ve got people from five different branches of government all in here – we don’t do this just for fun.”
He asked me to repeat some answers to questions that he’d asked me previously, looking down at his notes the whole time, then he left. Finally, two TSA agents entered the room and told me that my checked bag was outside, and that I would be escorted out to the ticketing desks, where I could see if JetBlue would refund my flight.
It was 2:20PM by the time I was finally released from custody. My entire body was shaking uncontrollably, as if I were extremely cold, even though I wasn’t. I couldn’t identify the emotion I was feeling. Surprisingly, as far as I could tell, I was shaking out of neither fear nor anger – I felt neither of those emotions at the time. The shaking motion was entirely involuntary, and I couldn’t force my limbs to be still, no matter how hard I concentrated.
In the end, JetBlue did refund my flight, but they cancelled my entire round-trip ticket. Because I had to rebook on another airline that same day, it ended up costing me about $700 more for the entire trip. Ironically, when I went to the other terminal, I was able to get through security (by walking through the millimeter wave machines) with no problem.
I spent the week in LA, where I was able to tell my family and friends about the entire ordeal. They were appalled by the treatment I had received, but happy to see me safely with them, even if several hours later.
I wish I could say that the story ended there. It almost did. I had no trouble flying back to NYC on a red-eye the next week, in the wee hours of August 12th. But when I returned home the next week, opened the door to my new apartment, and looked around the room, I couldn’t help but notice that one of the suitcases sat several inches away from the wall. I could have sworn I pushed everything to the side of the room when I left, but I told myself that I may have just forgotten, since I was in a hurry when I dropped my bags off.
When I entered my bedroom, a chill went down my spine: the photograph on my wall had vanished. I looked around the room, but in vain. My apartment was almost completely empty; there was no wardrobe it could have slipped under, even on the off-chance it had fallen.
To this day, that photograph has not turned up. I can’t think of any “rational” explanation for it. Maybe there is one. Maybe a burglar broke into my apartment by picking the front door lock and, finding nothing of monetary value, took only my picture. In order to preserve my peace-of-mind, I’ve tried to convince myself that that’s what happened, so I can sleep comfortably at night.
But no matter how I’ve tried to rationalize this in the last week and a half, nothing can block out the memory of the chilling sensation I felt that first morning, lying on my air mattress, trying to forget the image of large, uniformed men invading the sanctuary of my home in my absence, wondering when they had done it, wondering why they had done it.
In all my life, I have only felt that same chilling terror once before – on one cold night in September twelve years ago, when I huddled in bed and tried to forget the terrible events in the news that day, wondering why they had happened, wondering whether everything would be okay ever again.
This is a guest post from Jordan Ellenberg, a professor of mathematics at the University of Wisconsin. Jordan’s book, How Not To Be Wrong, comes out in May 2014. It is crossposted from his blog, Quomodocumque, and tweeted about at @JSEllenberg.
Cathy posted some cool data yesterday coming from the new visualization features of the magnificent Stacks Project. Summary: you can make a directed graph whose vertices are the 10,445 tagged assertions in the Stacks Project, and whose edges are logical dependency. So this graph (hopefully!) doesn’t have any directed cycles. (Actually, Cathy tells me that the Stacks Project autovomits out any contribution that would create a logical cycle! I wish LaTeX could do that.)
Given any assertion v, you can construct the subgraph G_v of vertices which are the terminus of a directed path starting at v. And Cathy finds that if you plot the number of vertices and number of edges of each of these graphs, you get something that looks really, really close to a line.
Why is this so? Does it suggest some underlying structure? I tend to say no, or at least not much — my guess is that in some sense it is “expected” for graphs like this to have this sort of property.
Because I am trying to get strong at Sage, I coded some of this up this morning. One way to make a random directed graph with no cycles is as follows: start with N vertices, and a function f on natural numbers k that decays with k, and then connect each vertex n to vertex n-k (if there is such a vertex) with probability f(k). The decaying function f is supposed to mimic the fact that an assertion is presumably more likely to refer to something just before it than something “far away” (though of course the Stacks Project is not a strictly linear thing like a book).
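For readers who want to try this themselves, here is a minimal sketch of the construction in plain Python rather than Sage. It is not the code used for the plots in this post. The function names `random_dag` and `descendant_counts` are made up for illustration, and the sketch assumes that G_v includes v itself.

```python
import random


def random_dag(n_vertices, f, rng):
    """Out-neighbour lists: vertex n points to vertex n - k with probability f(k)."""
    out = [[] for _ in range(n_vertices)]
    for n in range(n_vertices):
        for k in range(1, n + 1):  # only targets that actually exist
            if rng.random() < f(k):
                out[n].append(n - k)
    return out


def descendant_counts(out):
    """For each vertex v, return (number of vertices, number of edges) of G_v.

    Assumption: G_v contains v itself plus everything reachable from v.
    """
    reach = []   # reach[v] = set of vertices reachable from v, including v
    counts = []
    # Every edge points to a smaller index, so by the time we reach v
    # the reachable sets of all its out-neighbours are already known.
    for v, neighbours in enumerate(out):
        r = {v}
        for w in neighbours:
            r |= reach[w]
        reach.append(r)
        # Out-edges of a reachable vertex always land on reachable vertices,
        # so the edge count of G_v is the sum of out-degrees over G_v.
        n_edges = sum(len(out[u]) for u in r)
        counts.append((len(r), n_edges))
    return counts


rng = random.Random(0)
graph = random_dag(1000, lambda k: (2 / 3) ** k, rng)
points = descendant_counts(graph)  # one (vertices, edges) pair per assertion
```

Plotting the pairs in `points` as a scatterplot gives the kind of picture discussed here; swapping in a different decay function only changes the `lambda`.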
Here’s how Cathy’s plot looks for a graph generated by N= 1000 and f(k) = (2/3)^k, which makes the mean out-degree 2 as suggested in Cathy’s post.
Pretty linear — though if you look closely you can see that there are really (at least) a couple of close-to-linear “strands” superimposed! At first I thought this was because I forgot to clear the plot before running the program, but no, this is the kind of thing that happens.
Is this because the distribution decays so fast, so that there are very few long-range edges? Here’s how the plot looks with f(k) = 1/k^2, a nice fat tail yielding many more long edges:
My guess: a random graph aficionado could prove that the plot stays very close to a line with high probability under a broad range of random graph models. But I don’t really know!
Update: Although you know what must be happening here? It’s not hard to check that in the models I’ve presented here, there’s a huge amount of overlap between the descendant graphs; in fact, a vertex is very likely to be connected to all but c of the vertices below it for a suitable constant c.
I would guess the Stacks Project graph doesn’t have this property (though it would be interesting to hear from Cathy to what extent this is the case) and that in her scatterplot we are not measuring the same graph again and again.
It might be fun to consider a model where vertices are pairs of natural numbers and (m,n) is connected to (m-k,n-l) with probability f(k,l) for some suitable decay. Under those circumstances, you’d have substantially less overlap between the descendant trees; do you still get the approximately linear relationship between edges and nodes?
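A rough way to experiment with that two-dimensional model is sketched below. It only sets the experiment up and does not answer the question. Several details are assumptions and not part of the proposal above: the grid is finite, k and l range over non-negative integers that are not both zero, the decay is f(k, l) = (1/2)^(k+l), and the names `lattice_dag` and `descendant_counts` are hypothetical.

```python
import random


def lattice_dag(size, f, rng):
    """Vertices are pairs (m, n) with 0 <= m, n < size.

    (m, n) points to (m - k, n - l) with probability f(k, l), where k and l
    are non-negative and not both zero (an assumption of this sketch).
    """
    out = {}
    for m in range(size):
        for n in range(size):
            out[(m, n)] = [
                (m - k, n - l)
                for k in range(m + 1)
                for l in range(n + 1)
                if (k, l) != (0, 0) and rng.random() < f(k, l)
            ]
    return out


def descendant_counts(out):
    """Map each vertex v to (vertices, edges) of its descendant graph, v included."""
    reach = {}
    counts = {}
    # Sorting pairs lexicographically puts every edge target before its source,
    # because targets are no larger in either coordinate and differ in at least one.
    for v in sorted(out):
        r = {v}
        for w in out[v]:
            r |= reach[w]
        reach[v] = r
        counts[v] = (len(r), sum(len(out[u]) for u in r))
    return counts


rng = random.Random(0)
graph = lattice_dag(30, lambda k, l: 0.5 ** (k + l), rng)  # assumed decay
points = list(descendant_counts(graph).values())
```

Scatterplotting `points` for a few different decay functions would show whether the near-linear relationship survives once the descendant graphs overlap less.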
This is a guest post by my friend Laura Strausfeld.
As an unlicensed psychotherapist, here’s my take on why Huma Abedin is supporting her husband Anthony Weiner’s campaign for mayor:
It’s all about the kid.
Jordan Weiner is 19 months old. When he’s 8 or 9—or 5, and wearing google glasses—maybe he’ll google his name and read about his father’s penis. Either that, or one of his buddies at school may ask him about his father’s penis. Jordan might then ask his mommy and daddy about his father’s penis and they’ll tell him either 1) your daddy was a great politician, but had to resign from Congress because he admitted to showing people his penis, which we recommend you don’t do, especially when you’re a grownup and on twitter; or 2) your daddy was a great politician and ran a very close race for mayor—that’s right, your daddy was almost mayor of New York City!—but he lost because people said he showed people his penis and that’s none of anybody’s business.
Let’s look at this from Huma’s perspective. She’s got a child for a husband, with a weird sexual addiction that on the positive side, doesn’t appear to carry the threat of STDs. But her dilemma is not about her marriage. The marriage is over. What she cares about is Jordan. And this is where she’s really fucked. Whatever happens, Anthony will always be her child’s father.
That bears repeating. You’ve got a child you love more than anything in the world, will sacrifice anything for, and will always now be stigmatized as the son of a celebrity-sized asshole. What are your choices?
The best scenario for Huma is if Anthony becomes mayor. Then she can divorce his ass, get primary custody and protect her child from growing up listening to penis jokes about his loser father. There will be jokes, but at least they’ll be about the mayor’s penis. And with a whole lot of luck, they might even be about how his father’s penis was a lot smaller in the mind of the public than his policies.
Weiner won’t get my vote, however. And for that, I apologize to you, Jordan. You have my sympathy, Huma.
This is a guest post by Peter Darche, an engineer at DataKind and recent graduate of NYU’s ITP program. At ITP he focused primarily on using personal data to improve personal social and environmental impact. Prior to graduate school he taught in NYC public schools with Teach for America and Uncommon Schools.
We all ‘know’ that money influences the way congressmen and women legislate; at least we certainly believe it does. According to a poll conducted by law professor Larry Lessig for his book Republic Lost, 75% of respondents (Republican and Democrat) said that ‘money buys results in Congress.’
But what does that explanation really tell us? Yes, a congresswoman’s receiving millions of dollars from an industry then voting with that industry’s interests reeks of corruption. But when that industry is responsible for 80% of her constituents’ jobs, the causation becomes much less clear and the explanation much less informative.
The real devil is in the details. It is in the ways that money has shaped her legislative worldview over time and in the small, particular actions that tilt her policy one way rather than another.
In the past finding these many and subtle ways would have taken a herculean effort: untold hours collecting campaign contributions, voting records, speeches, and so on. Today however, due to the efforts of organizations like the Sunlight Foundation and Center for Responsive Politics, this information is online and programmatically accessible; you can write a few lines of code and have a computer gather it all for you.
For the last few months, Cathy O’Neil, Lee Drutman (a Senior Fellow at the Sunlight Foundation), and I, along with others, have been working on a project that leverages these data sources to attempt to unearth some of these particular facts. By connecting all the avenues by which influence is exerted on the legislative process to the actions taken by legislators, we’re hoping to find some of the detailed ways money changes behavior over time.
The idea is this: first, find and aggregate what data exists related to the ways influence can be exerted on the legislative process (data on campaign contributions, lobbying contributions, etc.), then find data that might track influence manifesting itself in the legislative process (bill sponsorships, co-sponsorships, speeches, votes, committee memberships, etc.). Finally, connect the interest group or industry behind the influence to the policies and see how they change over time.
One immediate and attainable goal for this project, for example, is to create an affinity score between legislators and industries, or in other words a metric that would indicate the extent to which a given legislator is influenced by and acts in the interest of a given industry.
So far most of our efforts have focused on finding, collecting, and connecting the records of influence and legislative behavior. We’ve pulled in lobbying and campaign contribution data, as well as sponsored legislation, co-sponsored legislation, speeches and votes. We’ve connected the instances of influence to legislative actions for a given legislator and visualized it on a timeline showing the entirety of a legislator’s career.
Here’s an example of how one might use the timeline. The example below is of Nancy Pelosi’s career. Each green circle represents a campaign contribution she received, and is grouped within a larger circle by the month it was recorded by the FEC. Above are colored rectangles representing legislative actions she took during the time-period in focus (indigo are votes, orange speeches, red co-sponsored bills, blue sponsored bills). Some of the green circles are highlighted because the events have been filtered for connection to health professionals.
Changing the filter to Health Services/HMOs, we see different contributions coming from that industry as well as a co-sponsored bill related to that industry.
Mousing over the bill indicates it’s a proposal to amend the Social Security Act to provide Medicaid coverage to low-income individuals with HIV. Further, looking around at speeches, one can see a relevant speech about children’s health insurance. Clicking on the speech reveals the text.
By combining data about various events, and allowing users to filter and dive into them, we’re hoping to leverage our natural pattern-seeking capabilities to find specific hypotheses to test. Once an interesting pattern has been found, the tool would allow one to download the data and conduct analyses.
Again, it’s just a start, and the timeline and other project-related code are internal prototypes created to start seeing some of the connections. We wanted to open it up to you all, though, to see what you think and get some feedback. So, with its pre-alphaness in mind, what do you think about the project generally and the timeline specifically? What works well – helps you gain insights or generate hypotheses about the connection between money and politics – and what other functionality would you like to see?
The demo version can be found here with data for the following legislators:
- Nancy Pelosi
- John Boehner
- Cathy McMorris Rodgers
- Eric Cantor
- James Lankford
- John Cornyn
- James Clyburn
- Kevin McCarthy
- Steny Hoyer
Note: when the timeline is revealed, click and drag over content at the bottom of the timeline to reveal the focus events.
This is a guest post by Eugene Stern.
Now that I have kids in school, I’ve become a lot more familiar with high-stakes testing, which is the practice of administering standardized tests with major consequences for students who take them (you have to pass to graduate), their teachers (who are often evaluated based on standardized test results), and their school districts (state funding depends on test results). To my great chagrin, New Jersey, where I live, is in the process of putting such a teacher evaluation system in place (for a lot more detail and criticism, see here).
The excellent John Ewing pointed me to a pretty comprehensive survey of standardized testing called “Measuring Up,” by Harvard Ed School prof Daniel Koretz, who teaches a course there about this stuff. If you have any interest in the subject, the book is very much worth your time. But in case you don’t get to it, or just to whet your appetite, here are my top 10 takeaways:
Believe it or not, most of the people who write standardized tests aren’t idiots. Building effective tests is a difficult measurement problem! Koretz makes an analogy to political polling, which is a good reminder that a test result is really a sample from a distribution (if you take multiple versions of a test designed to measure the same thing, you won’t do exactly the same each time), and not an absolute measure of what someone knows. It’s also a good reminder that the way questions are phrased can matter a great deal.
The reliability of a test is inversely related to the standard deviation of this distribution: a test is reliable if your score on it wouldn’t vary very much from one instance to the next. That’s a function of both the test itself and the circumstances under which people take it. More reliability is better, but the big trade-off is that increasing the sophistication of the test tends to decrease reliability. For example, tests with free form answers can test for a broader range of skills than multiple choice, but they introduce variability across graders, and even the same person may grade the same test differently before and after lunch. More sophisticated tasks also take longer to do (imagine a lab experiment as part of a test), which means fewer questions on the test and a smaller cross-section of topics being sampled, again meaning more noise and less reliability.
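To make the link between test length and noise concrete, here is a small sketch under a deliberately crude assumption that is mine, not Koretz’s: a student answers each of n equally hard questions independently, getting each right with probability p. Under that model the standard deviation of the fraction correct is sqrt(p(1-p)/n), so cutting the number of questions by a factor of four doubles the noise. The function names are made up for illustration.

```python
import math
import random
import statistics


def score_sd_formula(p, n_questions):
    """Standard deviation of the fraction correct under the binomial model."""
    return math.sqrt(p * (1 - p) / n_questions)


def simulated_score_sd(p, n_questions, n_sittings, rng):
    """Estimate the same quantity by letting one student sit the test many times."""
    scores = []
    for _ in range(n_sittings):
        correct = sum(rng.random() < p for _ in range(n_questions))
        scores.append(correct / n_questions)
    return statistics.pstdev(scores)


rng = random.Random(0)
for n_questions in (10, 40, 160):
    print(
        n_questions,
        round(score_sd_formula(0.7, n_questions), 3),
        round(simulated_score_sd(0.7, n_questions, 2000, rng), 3),
    )
```

Real tests are messier than this (questions differ in difficulty, and graders add their own variability), so the sketch only illustrates the direction of the trade-off.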
A complementary issue is bias, which is roughly about people doing better or worse on a test for systematic reasons outside the domain being tested. Again, there are trade-offs: the more sophisticated the test, the more extraneous skills beyond those being tested it may be bringing in. One common way to weed out such questions is to look at how people who score the same on the overall test do on each particular question: if you get variability you didn’t expect, that may be a sign of bias. It’s harder to do this for more sophisticated tests, where each question is a bigger chunk of the overall test. It’s also harder if the bias is systematic across the test.
Beyond the (theoretical) distribution from which a single student’s score is a sample, there’s also the (likely more familiar) distribution of scores across students. This depends both on the test and on the population taking it. For example, for many years, students on the eastern side of the US were more likely to take the SAT than those in the west, where only students applying to very selective eastern colleges took the test. Consequently, the score distributions were very different in the east and the west (and average scores tended to be higher in the west), but this didn’t mean that there was bias or that schools in the west were better.
The shape of the score distribution across students carries important information about the test. If a test is relatively easy for the students taking it, scores will be clustered to the right of the distribution, while if it’s hard, scores will be clustered to the left. This matters when you’re interpreting results: the first test is worse at discriminating among stronger students and better at discriminating among weaker ones, while the second is the reverse.
The score distribution across students is an important tool in communicating results (you may not know right away what a score of 600 on a particular test means, but if you hear it’s one standard deviation above a mean of 500, that’s a decent start). It’s also important for calibrating tests so that the results are comparable from year to year. In general, you want a test to have similar means and variances from one year to the next, but this raises the question of how to handle year-to-year improvement. This is particularly significant when educational goals are expressed in terms of raising standardized test scores.
If you think in terms of the statistics of test score distributions, you realize that many of those goals of raising scores quickly are deluded. Koretz has a good phrase for this: the myth of the vanishing variance. The key observation is that test score distributions are very wide, on all tests, everywhere, including countries that we think have much better education systems than we do. The goals we set for student score improvement (typically, a high fraction of all students taking a test several years from now are supposed to score above some threshold) imply a great deal of compression at the lower end of this distribution – compression that has never been seen in any country, anywhere. It sounds good to say that every kid who takes a certain test in four years will score as proficient, but that corresponds to a score distribution with much less variance than you’ll ever see. Maybe we should stop lying to ourselves?
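A back-of-the-envelope calculation shows how much compression such a goal implies. The numbers below are hypothetical and chosen only for illustration: scores are treated as normally distributed with mean 500 and standard deviation 100, the “proficient” cutoff sits at 500, the mean is assumed to rise to 550, and the target is 95% of students above the cutoff. The helper names are invented for this sketch.

```python
from statistics import NormalDist


def share_above(cutoff, mean, sd):
    """Fraction of a normal score distribution that lies above the cutoff."""
    return 1 - NormalDist(mean, sd).cdf(cutoff)


def sd_needed(cutoff, mean, target_share):
    """Standard deviation at which exactly target_share of scores exceed the cutoff."""
    if mean <= cutoff or not 0.5 < target_share < 1:
        raise ValueError("need mean > cutoff and 0.5 < target_share < 1")
    z = NormalDist().inv_cdf(target_share)  # standard normal quantile
    return (mean - cutoff) / z


current = share_above(cutoff=500, mean=500, sd=100)
needed = sd_needed(cutoff=500, mean=550, target_share=0.95)
print(f"share above cutoff today: {current:.2f}")
print(f"SD needed for 95% above cutoff with mean 550: {needed:.1f}")
```

Under these made-up numbers the spread would have to shrink from 100 points to roughly 30, which is the kind of vanishing variance the goal quietly assumes.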
Koretz is highly critical of the recent trend to report test results in terms of standards (e.g., how many students score as “proficient”) instead of comparisons (e.g., your score is in the top 20% of all students who took the test). Standards and standard-based reporting are popular because it’s believed that American students’ performance as a group is inadequate. The idea is that being near the top doesn’t mean much if the comparison group is weak, so instead we should focus on making sure every student meets an absolute standard needed for success in life. There are three (at least) problems with this. First, how do you set a standard – i.e., what does proficient mean, anyway? Koretz gives enough detail here to make it clear how arbitrary the standards are. Second, you lose information: in the US, standards are typically expressed in terms of just four bins (advanced, proficient, partially proficient, basic), and variation inside the bins is ignored. Third, even standards-based reporting tends to slide back into comparisons: since we don’t know exactly what proficient means, we’re happiest when our school, or district, or state places ahead of others in the fraction of students classified as proficient.
Koretz’s other big theme is score inflation for high-stakes tests: if everyone is evaluated based on test scores, everyone has an incentive to get those scores up, whether or not that actually has much correlation with learning. If you remember anything from the book or from this post, remember this phrase: sawtooth pattern. The idea is that when a new high-stakes standardized test appears, average scores start at some base level, go up quickly as people figure out how to game the test, then plateau. If the test is replaced with another, the same thing happens: base, rapid growth, plateau. Repeat ad infinitum. Koretz and his collaborators did a nice experiment in which they went back to a school district in which one high-stakes test had been replaced with another and administered the first test several years later. Now that teachers weren’t teaching to the first test, scores on it reverted back to the original base level. Moral: score inflation is real, pervasive, and unavoidable, unless we bite the bullet and do away with high-stakes tests.
While Koretz is sympathetic toward test designers, who live the complexity of standardized testing every day, he is harsh on those who (a) interpret and report on test results and (b) set testing and education policy, without taking that complexity into account. Which, as he makes clear, is pretty much everyone who reports on results and sets policy.
If you think it’s a good idea to make high-stakes decisions about schools and teachers based on standardized test results, Koretz’s book offers several clear warnings.
First, we should expect any high-stakes test to be gamed. Worse yet, the more reliable tests, being more predictable, are probably easier to game (look at the SAT prep industry).
Second, the more (statistically) reliable tests, by their controlled nature, cover only a limited sample of the domain we want students to learn. Tests trying to cover more ground in more depth (“tests worth teaching to,” in the parlance of the last decade) will necessarily have noisier results. This noise is a huge deal when you realize that high-stakes decisions about teachers are made based on just two or three years of test scores.
Third, a test that aims to distinguish “proficiency” will do a worse job of distinguishing students elsewhere in the skills range, and may be largely irrelevant for teachers whose students are far away from the proficiency cut-off. (For a truly distressing example of this, see here.)
With so many obstacles to rating schools and teachers reliably based on standardized test scores, is it any surprise that we see results like this?
This is a guest post by Eugene Stern.
Sometimes you learn just as much from a bad analogy as from a good one. At least you learn what people are thinking.
The other day I read this response to this NYT article. The original article asked whether the Common Core-based school reforms now being put in place in most states are really a good idea. The blog post criticized the article for failing to break out four separate elements of the reforms: standards (the Core), curriculum (what’s actually taught), assessment (testing), and accountability (evaluating how kids and educators did). If you have an issue with the reforms, you’re supposed to say exactly which aspect you have an issue with.
But then, at the end of the blog post, we get this:
A track and field metaphor might help: The standard is the bar that students must jump over to be competitive. The curriculum is the training program coaches use to help students get over the bar. The assessment is the track meet where we find out how high everyone can jump. And the accountability system is what follows after it’s all over and we want to figure out what went right, what went wrong, and what it will take to help kids jump higher.
In track, jumping over the bar is the entire point. You’re successful if you clear the bar, you’ve failed if you don’t. There are no other goals in play. So the standard, the curriculum, and the assessment might be nominally different, but they’re completely interdependent. The standard is defined in terms of the assessment, and the only curriculum that makes sense is training for the assessment.
Education has a lot more to it. The Common Core is a standard covering two academic dimensions: math and English/language arts/literacy. But we also want our kids learning science, and history, and music, and foreign languages, and technology, as well as developing along non-academic dimensions: physically, socially, morally, etc. (If a school graduated a bunch of high academic achievers that couldn’t function in society, or all ended up in jail for insider trading, we probably wouldn’t call that school successful.)
In Cathy’s terminology from this blog post, the Common Core is a proxy for the sum total of what we care about, or even just for the academic component of what we care about.
Then there’s a second level of proxying when we go from the standard to the assessment. The Common Core requirements are written to require general understanding (for example: kindergarteners should understand the relationship between numbers and quantities and connect counting to cardinality). A test that tries to measure that understanding can only proxy it imperfectly, in terms of a few specific questions.
Think that’s obvious? Great! But hang on just a minute.
The real trouble with the sports analogy comes when we get to the accountability step and forget all the proxying we did. “After it’s all over and we want to figure out what went right (and) what went wrong,” we measure right and wrong in terms of the assessment (the test). In sports, where the whole point is to do well on the assessment, it may make sense to change coaches if the team isn’t winning. But when we deny tenure to or fire teachers whose students didn’t do well enough on standardized tests (already in place in New York, now proposed for New Jersey as well), we’re treating the test as the whole point, rather than a proxy of a proxy. That incentivizes schools to narrow the curriculum to what’s included in the standard, and to teach to the test.
We may think it’s obvious that sports and education are different, but the decisions we’re making as a society don’t actually distinguish them.