The Computer Fraud and Abuse Act (CFAA) is badly in need of reform. It currently criminalizes violations of websites' terms of service, even when those violations are narrow in scope and done for the public good.
Specifically, the CFAA keeps researchers from understanding how algorithms work. As an example, Julia Angwin's recent work on recidivism modeling, which I blogged about here, was likely a violation of the CFAA.
A more general case has been made for CFAA reform in this 2014 paper, Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms, written by Christian Sandvig, Kevin Hamilton, Karrie Karahalios, and Cedric Langbort.
They make the case that discrimination audits – wherein you send a bunch of black people and then white people to, say, try to rent an apartment from Donald Trump's real estate company in 1972 – have clearly violated standard ethical guidelines (by wasting people's time and not letting them in on the fact that they're involved in a study), but since they represent a clear public good, such guidelines should have been set aside.
Similarly, we are technically treating employers unethically when we send them fake (but similar) resumes from whites and blacks to see who gets an interview, but the point we've set out to prove is important enough to warrant such behavior.
Their argument for CFAA reform is a direct expansion of the aforementioned examples:
Indeed, the movement of unjust face-to-face discrimination into computer algorithms appears to have the net effect of protecting the wicked. As we have pointed out, algorithmic discrimination may be much more opaque and hard to detect than earlier forms of discrimination, while at the same time one important mode of monitoring—the audit study—has been circumscribed. Employing the “traditional” design of an audit study but doing so via computer would now waste far fewer resources in order to find discrimination. In fact, it is difficult to imagine that a major internet platform would even register a large amount of auditing by researchers. Although the impact of this auditing might now be undetectable, the CFAA treats computer processor time and a provider’s “authorization” as far more precious than the minutes researchers have stolen from honest landlords and employers over the last few decades. This appears to be fundamentally misguided.
As a consequence, we advocate for a reconceptualization of accountability on Internet platforms. Rather than regulating for transparency or misbehavior, we find this situation argues for “regulation toward auditability.” In our terms, this means both minor, practical suggestions as well as larger shifts in thinking. For example, it implies the reform of the CFAA to allow for auditing exceptions that are in the public interest. It implies revised scholarly association guidelines that subject corporate rules like the terms of service to the same cost-benefit analysis that the Belmont Report requires for the conduct of ethical research—this would acknowledge that there may be many instances where ethical researchers should disobey a platform provider’s stated wishes.
When it comes to alternative credit scoring systems, look for the phrase “we give consumers more access to credit!”
That’s code for a longer phrase: “we’re doing anything at all we want, with personal information, possibly discriminatory and destructive, but there are a few people who will benefit from this new system versus the old, so we’re ignoring costs and only counting the benefits for those people, in an attempt to distract any critics.”
Unfortunately, the propaganda works a lot of the time, especially because tech reporters aren’t sufficiently skeptical (and haven’t read my upcoming book).
The alt credit scoring field has recently been joined by another player, and it’s the stuff of my nightmares. Specifically, ZestFinance is joining forces with Baidu in China to assign credit scores to Chinese citizens based on the history of their browsing results, as reported in the LA Times.
- ZestFinance is the American company, led by ex-Googler Douglas Merrill who likes to say “all data is credit data” and claims he cannot figure out why people who spell, capitalize, and punctuate correctly are somehow better credit risks. Between you and me, I think he’s lying. I think he just doesn’t like to say he happily discriminates against poor people who have gone to bad schools.
- Baidu is the Google of China. So they have a shit ton of browsing history on people. Things like, “symptoms for Hepatitis” or “how do I get a job.” In other words, the company collects information on a person’s most vulnerable hopes and fears.
Now put these two together, which they already did thankyouverymuch, and you’ve got a toxic cocktail of personal information, on the one hand, and absolutely no hesitation in using information against people, on the other.
In the U.S. we have some pretty good anti-discrimination laws governing credit scores – albeit incomplete, especially in the age of big data. In China, as far as I know, there are no such rules. Anything goes.
So, for example, someone who recently googled for how to treat an illness might not get that loan, even if they were simply trying to help their friend or family member. Moreover, they will never know why they didn’t get the loan, nor will they be able to appeal the decision. Just as an example.
Am I being too suspicious? Maybe: at the end of the article announcing this new collaboration, after all, Douglas Merrill from ZestFinance is quoted touting the benefits:
“Today, three out of four Chinese citizens can’t get fair and transparent credit,” he said. “For a small amount of very carefully handled loss of privacy, to get more easily available credit, I think that’s going to be an easy choice.”
I’ve started a company called ORCAA, which stands for O’Neil Risk Consulting and Algorithmic Auditing and is pronounced “orcaaaaaa”. ORCAA will audit algorithms and conduct risk assessments for algorithms, first as a consulting entity and eventually, if all goes well, as a more formal auditing firm, with open methodologies and toolkits.
No worries! I’m busy learning everything I can about the field, small though it is. Today, for example, my friend Suresh Naidu suggested I read this fascinating study, referred to by those in the know as “Oaxaca’s decomposition,” which separates differences of health outcomes for two groups – referred to as “the poor” and the “nonpoor” in the paper – into two parts: first, the effect of “worse attributes” for the poor, and second, the effect of “worse coefficients.” There’s also a worked-out example of children’s health in Viet Nam which is interesting.
The specific formulas they use depend crucially on the fact that the underlying model is a linear regression, but the idea doesn't: in practice, we care about both issues. For example, with credit scores, it's obvious we'd care about the coefficients – the coefficients are the ingredients in the recipe that takes the input and gives the output, so if they fundamentally discriminate against blacks, for example, that would be bad (but it has to be carefully defined!). At the same time, though, we also care about which inputs we choose in the first place, which is why there are laws against using race or gender in credit scoring.
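To make the decomposition concrete, here's a minimal sketch in Python of an Oaxaca-Blinder-style decomposition on synthetic data. All the numbers and group labels below are made up for illustration; the paper's exact formulas may differ, but the split into an "attributes" part and a "coefficients" part is the same idea.

```python
# A minimal sketch of an Oaxaca-style decomposition using ordinary
# least squares. Data is synthetic; "poor"/"nonpoor" follow the
# paper's labels.
import numpy as np

rng = np.random.default_rng(0)

def fit_ols(X, y):
    """Return OLS coefficients (intercept first) via least squares."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

# Synthetic attributes (think sanitation, education) for two groups:
# the "poor" group has worse attributes on average.
n = 5000
X_poor = rng.normal(0.0, 1.0, size=(n, 2))
X_nonpoor = rng.normal(0.5, 1.0, size=(n, 2))

# Outcomes: the "nonpoor" group also gets better coefficients,
# i.e. more health payoff per unit of each attribute.
y_poor = 1.0 + X_poor @ np.array([0.5, 0.3]) + rng.normal(0, 0.1, n)
y_nonpoor = 1.0 + X_nonpoor @ np.array([0.8, 0.4]) + rng.normal(0, 0.1, n)

b_poor = fit_ols(X_poor, y_poor)
b_nonpoor = fit_ols(X_nonpoor, y_nonpoor)

xbar_poor = np.append(1.0, X_poor.mean(axis=0))
xbar_nonpoor = np.append(1.0, X_nonpoor.mean(axis=0))

gap = y_nonpoor.mean() - y_poor.mean()
# Part 1: gap explained by different attributes (at nonpoor coefficients).
attributes_part = (xbar_nonpoor - xbar_poor) @ b_nonpoor
# Part 2: gap explained by different coefficients (at poor attributes).
coefficients_part = xbar_poor @ (b_nonpoor - b_poor)

print(f"total gap:         {gap:.3f}")
print(f"attributes part:   {attributes_part:.3f}")
print(f"coefficients part: {coefficients_part:.3f}")
# The two parts sum exactly to the total gap, because OLS with an
# intercept fits each group's mean outcome exactly.
```

Note that which group's coefficients you use as the reference is itself a modeling choice, and flipping it changes how the gap splits between the two parts.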
And, importantly, this analysis won’t necessarily tell us what to do about the differences we pick up. Indeed many of the tests I’ve been learning about and studying have that same limitation: we can detect problems but we don’t learn how to address them.
If you have any suggestions for me on methods for either auditing algorithms or for how to modify problematic algorithms, I’d be very grateful if you’d share them with me.
Also, if there are any artists out there, I’m on the market for a logo.
This is a guest post by Brian D’Alessandro, who daylights as the Head of Data Science at Zocdoc and as an Adjunct Professor with NYU’s Center for Data Science. When not thinking probabilistically, he’s drumming with the indie surf rock quartet Coastgaard.
I’d like to address the recent study by Roland Fryer Jr from Harvard University, and associated NY Times coverage, that claims to show zero racial bias in police shootings. While this paper certainly makes an honest attempt to study this very important and timely problem, it ultimately suffers from issues of data sampling and subjective data preparation. Given the media attention it is receiving, and the potential policy and public perceptual implications of this attention, we as a community of data people need to comb through this work and make sure the headlines are consistent with the underlying statistics.
First things first: is there really zero bias in police shootings? The evidence for this claim is, notably, derived from data drawn from a single precinct. This is a statistical red flag and might well represent selection bias. Put simply, a police department with a culture that successfully avoids systematic racial discrimination may be more willing to share its data than one that doesn’t. That’s not proof of cherry-picking, but as a rule we should demand that any journalist or author citing this work preface any statistic with “In Houston, using self-reported data,…”.
For that matter, if the underlying analytic techniques hold up under scrutiny, we should ask other cities to run the same tests on their data and see what the results are more widely. If we’re right, and Houston is rather special, we should investigate what they’re doing right.
On to the next question: do those analytic techniques hold up? The short answer is: probably not.
How The Sampling Was Done
As discussed here by economist Rajiv Sethi and here by Justin Feldman, the means by which the data instances were sampled to measure racial bias in Houston police shootings is in itself potentially very biased.
Essentially, Fryer and his team sampled “all shootings” as their set of positively labeled instances, and then randomly sampled “arrests in which use of force may have been justified” (attempted murder of an officer, resisting/impeding arrest, etc.) as the negative instances. The analysis then measured racial bias using the union of these two sets.
Here is a simple Venn diagram representing the sampling scheme:
In other words, the positive population (arrests involving a shooting) is not drawn from the same distribution as the negative population (arrests where use of force was justified). The article implies that there is no racial bias conditional on there being an arrest where use of force was justified. However, the fact that they used shootings that fell outside this set of arrests means that this is not what they actually tested.
Instead, they only show that there was no racial bias in the set that was sampled. That’s different. And, it turns out, a biased sampling mechanism can in fact undo the bias that exists in the original data population (see below for a light mathematical explanation). This is why we take great pains in social science research to carefully design our sampling schemes. In this case, if the sampling is correlated with race (which it very likely is), all bets are off on analyzing the real racial biases in police shootings.
What Is Actually Happening
Let’s accept for now the two main claims of the paper: 1) black and hispanic people are more likely to endure some force from police, but 2) this bias doesn’t exist in an escalated situation.
Well, how could one make any claim without chaining these two events together? The idea of an escalation, or an arrest reason where force is justified, is unfortunately an often subjective concept reported after the fact. Could it be that an officer is more likely to find his/her life in danger when a black, as opposed to a white, suspect reaches for his wallet? Further, while unquestioned compliance is certainly the best life-preserving policy when dealing with an officer, I can imagine that an individual being roughed up by a cop is liable to push back with an adrenalized, self-preserving, and instinctual use of force. I’ll say that this is equally likely for black and white persons, but if the black person is more likely to be in that situation in the first place, the black person is more likely to get shot from a pre-stop position.
To sum up, the issue at hand is not whether cops are more likely to shoot at black suspects who are pointing guns straight back at the cop (which is effectively what is being reported about the study). The more important questions, which are not addressed, are: why are black men more likely to be pushed up against the wall by a cop in the first place, and does race matter when a cop decides his/her life is in danger and believes lethal force is necessary?
What Should Have Happened
While I empathize with the data prep challenges Fryer and team faced (the Times article mentions that they put a collective 3,000 person-hours into it), the language of the article and its ensuing coverage unfortunately does not fit the data distribution induced by the method of sampling.
I don’t want to suggest in any way that the data was manipulated to engineer a certain result; more likely, the analysis team mistakenly committed a fundamental sampling error. The paper does indeed caveat the challenge here, but given that admission, I wonder why the authors were so quick to release an un-peer-reviewed working version and push it out via the NY Times.
Peer review would likely have pointed out these issues and at least pushed the authors to temper their conclusions. For instance, the paper uses multiple sources to show that non-lethal violence is much more likely if you are black or hispanic, controlling for other factors. I see the causal chain being unreasonably bisected here, and this is a pretty significant conceptual error.
Overall, Fryer is fairly honest in the paper about the given data limitations. I’d love for him to take his responsibility to the next level and make his data, in both raw and encoded forms, public. Given the dependency on both subjective, manual encodings of police reports and a single, biased choice of sampling method, more sensitivity analysis should be done here. Also, anyone reporting on this (including Fryer himself) should make a better effort to connect the causal chain here.
Headlines are sticky, and first impressions are hard to undo. This study needs more scrutiny at all levels, with special attention to the data preparation that has been done. We need a better impression than the one already made.
The coverage of the results comes down to the following:
P(Shooting | Black, Escalation) = P(Shooting | White, Escalation)
(here I am using ‘Escalation’ as the set of arrests where use of force is considered justified. And for notational simplicity I have omitted the control variables from the conditional above).
However, the analysis actually shows that:
P(Shooting | Black, Sampled) = P(Shooting | White, Sampled),
where Sampled = True if the person was either shot, or the situation escalated and the person was not shot. This makes a huge difference, because with the right bias in the sampling, we could have a situation in which there is in fact bias in police shootings but not in the sampled data. We can show this with a little application of Bayes’ rule:
P(Shot|B, Samp) / P(Shot|W, Samp) = [P(Shot|B) / P(Shot|W)] * [P(Samp|W) / P(Samp|B)]
The above should be read as: the bias in the study depends on both the racial bias in the population (P(Shot|B) / P(Shot|W)) and the bias in the sampling (P(Samp|W) / P(Samp|B)). Any bias in the population can therefore effectively be undone by a sampling scheme that is also racially biased. Unfortunately, the data summarized in the study doesn’t allow us to back out the four terms on the right-hand side of the above equality.
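Here's a toy simulation of that identity. The rates below are entirely hypothetical; the point is only that a sampling rule correlated with race can flatten a real 2:1 disparity in shooting rates.

```python
# A toy simulation (hypothetical numbers) of how a racially biased
# sampling scheme can mask a racially biased shooting rate, per the
# Bayes-rule identity above.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population rates: among all police encounters, a
# shooting is twice as likely for group B as for group W.
p_shot = {"B": 0.02, "W": 0.01}
# Sampling: shootings are always included; non-shooting encounters
# enter only if flagged as "escalated" -- and suppose that flag is
# applied twice as often to group B.
p_escalation_flag = {"B": 0.40, "W": 0.20}

def sampled_shooting_rate(group, n=1_000_000):
    """Simulate n encounters and return P(Shot | group, Sampled)."""
    shot = rng.random(n) < p_shot[group]
    flagged = rng.random(n) < p_escalation_flag[group]
    in_sample = shot | flagged  # Sampled = shot OR escalated
    return shot[in_sample].mean()

rate_b = sampled_shooting_rate("B")
rate_w = sampled_shooting_rate("W")
print(f"P(Shot | B, Sampled) = {rate_b:.3f}")
print(f"P(Shot | W, Sampled) = {rate_w:.3f}")
# Both sampled rates come out near 0.048: the 2:1 population
# disparity nearly vanishes, because the sampling itself was
# correlated with race.
```

Plugging into the identity: the population ratio is 2, and the sampling ratio is P(Samp|W)/P(Samp|B) ≈ 0.208/0.412 ≈ 0.5, so the observed ratio in the sample is roughly 1 — no apparent bias.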
I just finished a neat little book called The Wellness Syndrome by Carl Cederström and André Spicer. They are (business school) professors in Stockholm and London, respectively, so the book has a welcome non-U.S. perspective.
The book defines the wellness syndrome to be an extension and a perversion of the concept of individual well-being. According to Cederström and Spicer, it’s not just that you are expected to care for yourself, it’s that you are blamed if you don’t, and conversely, if there’s anything at all wrong with your life, then it’s because you’ve failed to sufficiently take care of yourself. The result is that people have become utterly unaware of why things happen to them and how much power they actually have to change anything.
The wellness syndrome manifests itself in various ways:
- We are asked to “think positively” to make positive things happen to us. The funniest (read: saddest) section of the book relates to the fact that David Cameron was a big believer in this kind of positive thinking; he focused on good outcomes and ignored the bad ones, believing that somehow his personal willpower would make good things happen.
- We are asked to take care of ourselves in order to stay competitive in the workforce, to productize and commoditize ourselves. This could mean staying slim – because if you’re overweight you’re falling down on the self-optimization regimen – or it could mean engaging in the quantified self movement, keeping track of sleep, exercise, and even pooping schedules, and at the very least it requires us to monitor our attitudes.
- We are asked to enjoy ourselves while we take full personal responsibility for our own wellness, which in the age of the gig economy means we always appear happy to stay lively and infinitely employable under increasingly precarious economic conditions.
- If things don’t go well for us, if we cannot find that job or we cannot seem to lose the extra weight, we are expected to feel guilty and – this is crucial – not to blame the system for an inadequate supply of jobs, nor the racist, sexist, or otherwise discriminatory environment, but rather our own mindset. God forbid we ever accept any actual limit to our powers of reinvention, because that is equivalent to giving up.
- The authors point to Margaret Thatcher and Tony Blair in the UK and to Reagan and Bill Clinton in the US as creators of this notion of individual responsibility as a shield against governmental responsibility, and they frequently point out that the “positive mindset” self-help gurus thus represent a perfect pairing: a pairing, moreover, which manages to depoliticize itself as its power grows.
- The consequence: we don’t think of ourselves as political victims when we fall prey to a narcissistic worldview in which we are never fit enough, never eating enough organic kale, and never productive enough. Instead we engage in self-criticism, guilt, and renewed promises to try better next time. We internalize the shame and the definition of ourselves as “improperly optimized.”
- In the end, we all walk around with tiny little versions of Reagan’s welfare queens in our heads – or at the very least, the fear of becoming anything like her. In the UK it’s a slightly varied version called the Chav.
There are two rich topics that aren’t addressed in this book which I’d love to hear about, even if it’s just in an informal conversation with the authors. First, what about the online dating scene? How does that play into this and amplify it? From my perspective, online dating has a strong effect on how people create and wield data about themselves, and the extent to which they self-criticize, stemming from (I assume) the question of how they are being seen by potential lovers.
Second, to what extent does this concept of self-perfecting and quantifying encourage the subculture of futurism? Do people like Ray Kurzweil and others who believe they will live forever represent the most extreme version of the wellness syndrome, or do they suffer from some other disease?
I liked the book a lot. There are lots of topics in common with my upcoming book, in fact, including wellness programs and personal data collection, and other ways that employers have increasing control over our bodies and lives. And although we largely agree, it was interesting to read their more historical take on things. Also, it was a super fast read, at only 135 pages. I recommend it.
A decades-long focus on policing minor crimes and activities – a practice called Broken Windows policing – has led to the criminalization and over-policing of communities of color and excessive force in otherwise harmless situations. In 2014, police killed at least 287 people who were involved in minor offenses and harmless activities like sleeping in parks, possessing drugs, looking “suspicious” or having a mental health crisis. These activities are often symptoms of underlying issues of drug addiction, homelessness, and mental illness which should be treated by healthcare professionals and social workers rather than the police.
Having studied the effects of uneven policing myself, especially how it pertains to the data byproduct of “police events,” I could not agree more.
There was a recent New York Times article that got people’s attention. It claimed that there was no bias in police shootings of blacks over whites. What it didn’t talk about – crucially – was the chance that a given person would end up in an interaction with the police in the first place.
It’s much more likely for blacks, especially young black men, to end up in an interaction with cops. And that’s due in large part to the broken theory of Broken Windows policing.
New York City’s version of Broken Windows policing – Stop, Question, and Frisk – was particularly vile, and was eventually declared unconstitutional due to its disparate impact on minorities. The ACLU put some facts together when Stop and Frisk was at its height, including the following unbelievable statistics from 2011:
- The number of stops of young black men exceeded the entire city population of young black men (168,126 as compared to 158,406).
- In 70 out of 76 precincts, blacks and Latinos accounted for more than 50 percent of stops, and in 33 precincts they accounted for more than 90 percent of stops. In the 10 precincts with black and Latino populations of 14 percent or less (such as the 6th Precinct in Greenwich Village), black and Latino New Yorkers accounted for more than 70 percent of stops in six of those precincts.
What happens when this kind of uneven policing goes on? Lots of stupid arrests for petty crimes, for “resisting arrest,” and generally for being poor or having untreated mental health problems. About 1 in 1000 such stops are directly linked to a violent crime.
And again, since those stopped are overwhelmingly minority, it means that when City Hall decides to use predictive policing based on this data, they end up over policing the same neighborhoods, creating even more uneven and biased data. That continuing stream of data even ends up in sentencing and paroling algorithms, making it more likely for those same over-policed populations to stay in jail longer.
It’s high time we get rid of the root cause, the theory of Broken Windows, which was never proven in the first place, which optimizes on the wrong definition of success, and which further undermines community trust in the police.
I was invited last week to an event co-sponsored by the White House, Microsoft, and NYU called AI Now: The social and economic implications of artificial intelligence technologies in the near term. Many of the discussions were held under the Chatham House Rule, which means I get to talk about the ideas without attributing any given idea to any person.
Before I talk about some of the ideas that came up, I want to mention that the definition of “AI” was never discussed. After a while I took it to mean anything that was technological that had an embedded flow chart inside it. So, anything vaguely computerized that made decisions. Even a microwave that automatically detected whether your food was sufficiently hot – and kept heating if it wasn’t – would qualify as AI under these rules.
In particular, all of the algorithms I studied for my book certainly qualified. And some of them – predictive policing, recidivism risk models, Google search, and resume filtering algorithms – were absolutely talked about and referred to as AI.
One of the questions we posed was, when is AI appropriate? Is there a class of questions that AI should not be used for, and why? More interestingly, is there AI working and making decisions right now, in some context, that should be outlawed? Or at least put on temporary suspension?
[Aside: I’m so glad we’re actually finally discussing this. Up until now it seems like wherever I go it’s taken as a given that algorithms would be an improvement over human decision-making. People still automatically assume algorithms are more fair and objective than humans, and sometimes they are, but they are by no means perfect.]
We didn’t actually have time to thoroughly discuss this question, but I’m going to throw down the gauntlet anyway.
Take recidivism risk models. Julia Angwin and her team at ProPublica recently demonstrated that the COMPAS model, which was being used in Broward County Florida (as well as many other places around the country), is racist. In particular, it has very different errors for blacks and for whites, with high “false positive” rates for blacks and high “false negative” rates for whites. This ends up meaning that blacks go to jail for longer, since that’s how recidivism rates are being used.
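For concreteness, here's a sketch of the kind of group-wise error-rate check that analysis performed, using made-up risk flags and outcomes (not the actual COMPAS data):

```python
# A sketch of comparing false positive and false negative rates
# across groups, given predicted "high risk" flags and observed
# recidivism outcomes. All inputs here are tiny made-up examples.
import numpy as np

def error_rates(high_risk, reoffended):
    """Return (FPR, FNR) for one group."""
    high_risk = np.asarray(high_risk, dtype=bool)
    reoffended = np.asarray(reoffended, dtype=bool)
    # False positive rate: flagged high risk among those who did NOT reoffend.
    fpr = (high_risk & ~reoffended).sum() / max((~reoffended).sum(), 1)
    # False negative rate: flagged low risk among those who DID reoffend.
    fnr = (~high_risk & reoffended).sum() / max(reoffended.sum(), 1)
    return fpr, fnr

# Hypothetical illustration: two groups with identical outcomes but
# different risk flags, yielding the ProPublica-style error pattern.
outcomes = [1, 1, 0, 1, 0, 0]
fpr_b, fnr_b = error_rates([1, 1, 1, 0, 0, 0], outcomes)
fpr_w, fnr_w = error_rates([1, 0, 0, 0, 0, 0], outcomes)
print(f"group B: FPR={fpr_b:.2f}, FNR={fnr_b:.2f}")  # higher FPR
print(f"group W: FPR={fpr_w:.2f}, FNR={fnr_w:.2f}")  # higher FNR
```

The point of such a check is that a model can look "calibrated" overall while distributing its mistakes very differently across groups: in this toy example, group B absorbs the false positives (wrongly flagged high risk) and group W the false negatives.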
So, do we throw out recidivism modeling altogether? After all, judges by themselves are also racist; a model such as COMPAS might actually be improving the situation. Then again, it might be making it worse. We simply don’t know without a monitor in place. (So, let’s get some monitors in place, people! Let’s see some academic work in this area!)
I’ve heard people call for removing recidivism models altogether, but honestly I think that’s too simple. I think we should instead have a discussion on what they show, why they’re used the way they are, and how they can be improved to help people.
So, if we’re seeing way more black (men) with high recidivism risk scores, we need to ask ourselves: why are black men deemed so much more likely to return to jail? Is it because they’re generally poorer and don’t have the start-up funds necessary to start a new life? Or don’t have job opportunities when they get out of prison? Or because their families and friends don’t have a place for them to stay? Or because the cops are more likely to re-arrest them because they live in poor neighborhoods or are homeless? Or because the model’s design itself is flawed? In short, what are we measuring when we build recidivism scores?
Second, why are recidivism risk models used to further punish people who are already so disadvantaged? What is it about our punitive and vengeful justice system that makes us punish people in advance for crimes they have not yet committed? It keeps them away from society even longer and further casts them into a cycle of crime and poverty. If our goal were to permanently brand and isolate a criminal class, we couldn’t ask for a better tool. We need to do better.
Next, how can we retool recidivism models to help people rather than harm them? We could use the scores to figure out who needs resources the most in order to stay out of trouble after release, and to build evidence that we need to help people who leave jail rebuild their lives. How do investments in education inside prison help people land a job once they get out? Do states that make it hard for employers to discriminate based on prior convictions – or for that matter on race – see better results for recently released prisoners? To what extent does “broken windows policing” in a neighborhood affect the recidivism rates of its inhabitants? These are all questions we need to answer, but we cannot answer them without data. So let’s collect the data.
Back to the question: when is AI appropriate? I’d argue that building AI is almost never inappropriate in itself, but interpreting results of AI decision-making is incredibly complicated and can be destructive or constructive, depending on how well it is carried out.
And, as was discussed at the meeting, most data scientists/ engineers have little or no training in thinking about this stuff beyond optimization techniques and when to use linear versus logistic regression. That’s a huge problem, because part of AI – a big part – is the assumption that AI can solve every problem in essentially the same way. AI teams are, generally speaking, homogenous in gender, class, and often race, and that monoculture gives rise to massive misunderstandings and narrow ways of thinking.
The short version of my answer is: AI can be made appropriate if it’s thoughtfully done, but most AI shops are not set up to be at all thoughtful about how it’s done. So maybe, at the end of the day, AI really is inappropriate, at least for now, until we figure out how to involve more people and have a more principled discussion about what it is we’re really measuring with AI.