I gave a talk to the invitation-only NYC CTO Club a couple of weeks ago about my fears about big data modeling, namely:
- that big data modeling is discriminatory,
- that big data modeling increases inequality, and
- that big data modeling threatens democracy.
I had three things on my “to do” list for the audience of senior technologists, namely:
- test internal, proprietary models for discrimination,
- help regulators like the CFPB develop reasonable audits, and
- get behind certain models being transparent and publicly accessible, including credit scoring, teacher evaluations, and political messaging models.
Given the provocative nature of my talk, I was pleasantly surprised by the positive reception I was given. Those guys were great – interactive, talkative, and very thoughtful. I think it helped that I wasn’t trying to sell them something.
Even so, I shouldn’t have been surprised when one of them followed up with me to talk about a possible business model for “fairness audits.” The idea is that, what with the recent bad press about discrimination in big data modeling (some of the audience had actually worked with the Podesta team), there will likely be a business advantage to being able to claim that your models are fair. So someone should develop those tests that companies can take. Quick, someone, monetize fairness!
One reason I think this might actually work – and more importantly, be useful – is that I focused on “effects-based” discrimination, which is to say testing a model by treating it like a black box and seeing how it works on different inputs and gives different outputs. In other words, I want to give a resume-sorting algorithm different resumes with similar qualifications but different races. An algorithmically induced randomized experiment, if you will.
From the business perspective, a test that allows a model to remain a black box feels safe, because it does not require true transparency, and allows the “secret sauce” to remain secret.
One thing, though. I don’t think it makes too much sense to have a proprietary model for fairness auditing. In fact the way I was imagining this was to develop an open-source audit model that the CFPB could use. What I don’t want, and which would be worse than nothing, would be if some private company developed a proprietary “fairness audit” model that we cannot trust and would claim to solve the very real problems listed above.
Update: something like this is already happening for privacy compliance in the big data world (hat tip David Austin).
I finished reading Podesta’s Big Data Report to Obama yesterday, and I have to say I was pretty impressed. I credit some special people that got involved with the research of the report like Danah Boyd, Kate Crawford, and Frank Pasquale for supplying thoughtful examples and research that the authors were unable to ignore. I also want to thank whoever got the authors together with the civil rights groups that created the Civil Rights Principles for the Era of Big Data:
- Stop High-Tech Profiling. New surveillance tools and data gathering techniques that can assemble detailed information about any person or group create a heightened risk of profiling and discrimination. Clear limitations and robust audit mechanisms are necessary to make sure that if these tools are used it is in a responsible and equitable way.
- Ensure Fairness in Automated Decisions. Computerized decisionmaking in areas such as employment, health, education, and lending must be judged by its impact on real people, must operate fairly for all communities, and in particular must protect the interests of those that are disadvantaged or that have historically been the subject of discrimination. Systems that are blind to the preexisting disparities faced by such communities can easily reach decisions that reinforce existing inequities. Independent review and other remedies may be necessary to assure that a system works fairly.
- Preserve Constitutional Principles. Search warrants and other independent oversight of law enforcement are particularly important for communities of color and for religious and ethnic minorities, who often face disproportionate scrutiny. Government databases must not be allowed to undermine core legal protections, including those of privacy and freedom of association.
- Enhance Individual Control of Personal Information. Personal information that is known to a corporation — such as the moment-to-moment record of a person’s movements or communications — can easily be used by companies and the government against vulnerable populations, including women, the formerly incarcerated, immigrants, religious minorities, the LGBT community, and young people. Individuals should have meaningful, flexible control over how a corporation gathers data from them, and how it uses and shares that data. Non-public information should not be disclosed to the government without judicial process.
- Protect People from Inaccurate Data. Government and corporate databases must allow everyone — including the urban and rural poor, people with disabilities, seniors, and people who lack access to the Internet — to appropriately ensure the accuracy of personal information that is used to make important decisions about them. This requires disclosure of the underlying data, and the right to correct it when inaccurate.
This was signed off on by multiple civil rights groups listed here, and it’s a great start.
One thing I was not impressed by: the only time the report mentioned finance was to say that, in finance, they are using big data to combat fraud. In other words, finance was kind of seen as an industry standing apart from big data, and using big data frugally. This is not my interpretation.
In fact, I see finance as having given birth to big data. Many of the mistakes we are making as modelers in the big data era, which require the Civil Rights Principles as above, were made first in finance. Those modeling errors – and when not errors, politically intentional odious models – were created first in finance, and were a huge reason we first had the mortgage-backed-securities rated with AAA ratings and then the ensuing financial crisis.
In fact finance should have been in the report standing as a worst case scenario.
One last thing. The recommendations coming out of the Podesta report are lukewarm and are even contradicted by the contents of the report, as I complained about here. That’s interesting, and it shows that politics played a large part of what the authors could include as acceptable recommendations to the Obama administration.
It was bound to happen. Someone was inevitably going to have to write this book, entitled Social Physics, and now someone has just up and done it. Namely, Alex “Sandy” Pentland, data scientist evangelist, director of MIT’s Human Dynamics Laboratory, and co-founder of the MIT Media Lab.
A review by Nicholas Carr
Pentland argues that our greatly expanded ability to gather behavioral data will allow scientists to develop “a causal theory of social structure” and ultimately establish “a mathematical explanation for why society reacts as it does” in all manner of circumstances. As the book’s title makes clear, Pentland thinks that the social world, no less than the material world, operates according to rules. There are “statistical regularities within human movement and communication,” he writes, and once we fully understand those regularities, we’ll discover “the basic mechanisms of social interactions.”
By collecting all the data – credit card, sensor, cell phones that can pick up your moods, etc. – Pentland seems to think we can put the science into social sciences. He thinks we can predict a person like we now predict planetary motion.
OK, let’s just take a pause here to say: eeeew. How invasive does that sound? And how insulting is its premise? But wait, it gets way worse.
Vomit. But also not the worst part.
Here’s the worst part about Pentland’s book, from the article:
Ultimately, Pentland argues, looking at people’s interactions through a mathematical lens will free us of time-worn notions about class and class struggle. Political and economic classes, he contends, are “oversimplified stereotypes of a fluid and overlapping matrix of peer groups.” Peer groups, unlike classes, are defined by “shared norms” rather than just “standard features such as income” or “their relationship to the means of production.” Armed with exhaustive information about individuals’ habits and associations, civic planners will be able to trace the full flow of influences that shape personal behavior. Abandoning general categories like “rich” and “poor” or “haves” and “have-nots,” we’ll be able to understand people as individuals—even if those individuals are no more than the sums of all the peer pressures and other social influences that affect them.
Kill. Me. Now.
The good news is that the author of the article, Nicholas Carr, doesn’t buy it, and makes all sorts of reasonable complaints about this theory, like privacy concerns, and structural sources of society’s ills. In fact Carr absolutely nails it (emphasis mine):
Pentland may be right that our behavior is determined largely by social norms and the influences of our peers, but what he fails to see is that those norms and influences are themselves shaped by history, politics, and economics, not to mention power and prejudice. People don’t have complete freedom in choosing their peer groups. Their choices are constrained by where they live, where they come from, how much money they have, and what they look like. A statistical model of society that ignores issues of class, that takes patterns of influence as givens rather than as historical contingencies, will tend to perpetuate existing social structures and dynamics. It will encourage us to optimize the status quo rather than challenge it.
How to see how dumb this is in two examples
This brings to mind examples of models that do or do not combat sexism.
First, the orchestra audition example: in order to avoid nepotism, they started making auditioners sit behind a sheet. The result has been way more women in orchestras.
This is a model, even if it’s not a big data model. It is the “orchestra audition” model, and the most important thing about this example is that they defined success very carefully and made it all about one thing: sound. They decided to define the requirements for the job to be “makes good sounding music” and they decided that other information, like how they look, would be by definition not used. It is explicitly non-discriminatory.
By contrast, let’s think about how most big data models work. They take historical information about successes and failures and automate them – rather than challenging their past definition of success, and making it deliberately fair, they are if anything codifying their discriminatory practices in code.
My standard made-up example of this is close to the kind of thing actually happening and being evangelized in big data. Namely, a resume sorting model that helps out HR. But, using historical training data, this model notices that women don’t fare so well historically at a the made-up company as computer programmers – they often leave after only 6 months and they never get promoted. A model will interpret that to mean they are bad employees and never look into structural causes. And moreover, as a result of this historical data, it will discard women’s resumes. Yay, big data!
I’m kind of glad Pentland has written such an awful book, because it gives me an enemy to rail against in this big data hype world. I don’t think most people are as far on the “big data will solve all our problems” spectrum as he is, but he and his book present a convenient target. And it honestly cannot surprise anyone that he is a successful white dude as well when he talks about how big data is going to optimize the status quo if we’d just all wear sensors to work and to bed.
I’m trying my hardest to resist talking about Piketty’s Capital because I haven’t read it yet, even though I’ve read a million reviews and discussions about it, and I saw him a couple of weeks ago on a panel with my buddy Suresh Naidu. Suresh, who was great on the panel, wrote up his notes here.
So I’ll hold back from talking directly about Piketty, but let me talk about one of Suresh’s big points that was inspired in part by Piketty. Namely, the fact that it’s a great time to be rich. It’s even greater now to be rich than it was in the past, even when there were similar rates of inequality. Why? Because so many things have become commodified. Here’s how Suresh puts it:
We live in a world where much more of everyday life occurs on markets, large swaths of extended family and government services have disintegrated, and we are procuring much more of everything on markets. And this is particularly bad in the US. From health care to schooling to philanthropy to politicians, we have put up everything for sale. Inequality in this world is potentially much more menacing than inequality in a less commodified world, simply because money buys so much more. This nasty complementarity of market society and income inequality maybe means that the social power of rich people is higher today than in the 1920s, and one response to increasing inequality of market income is to take more things off the market and allocate them by other means.
I think about this sometimes in the field of education in particular, and to that point I’ve got a tiny bit of good news today.
Namely, InBloom is shutting down (hat tip Linda Brown). You might not remember what InBloom is, but I blogged about this company a while back in my post Big Data and Surveillance, as well as the ongoing fight against InBloom in New York state by parents here.
The basic idea is that InBloom, which was started in cooperation with the Bill and Melinda Gates Foundation and Rupert Murdoch’s Amplify, would collect huge piles of data on students and their learning and allow third party companies to mine that data to improve learning. From this New York Times article:
InBloom aimed to streamline personalized learning — analyzing information about individual students to customize lessons to them — in public schools. It planned to collect and integrate student attendance, assessment, disciplinary and other records from disparate school-district databases, put the information in cloud storage and release it to authorized web services and apps that could help teachers track each student’s progress.
It’s not unlike the idea that Uber has, of connecting drivers with people needing rides, or that AirBNB has, of connecting people needing a room with people with rooms: they are platforms, not cab companies or hoteliers, and they can use that matchmaking status as a way to duck regulations.
The problem here is that the relevant child data protection regulation, called FERPA, is actually pretty strong, and InBloom and companies like it were largely bypassing that law, as was discovered by a Fordham Law study led by Joel Reidenberg. In particular, the study found that InBloom and other companies were offering what seemed like “free” educational services, but of course the deal really was in exchange for the children’s data, and the school officials who were agreeing to the deals had no clue as to what they were signing. The parents were bypassed completely. Much of the time the contracts were in direct violation of FERPA, but often the school officials didn’t even have copies of the contracts and hadn’t heard of FERPA.
Because of that report and other bad publicity, we saw growing resistance in New York State by parents, school board members and privacy lawyers. And thanks to that resistance, New York State Legislature recently passed a budget that prohibited state education officials from releasing student data to amalgamators like inBloom. InBloom has subsequently decided to close down.
I’m not saying that the urge to privatize education – and profit off of it – isn’t going to continue after a short pause. For that matter look at the college system. Even so, let’s take a moment to appreciate the death of one of the more egregious ideas out there.
Interview on Junk Charts
Yesterday I was featured on Kaiser Fung’s Junk Charts blog in an interview where he kindly refers to me as a “Numbersense Pro”. Previous to this week, my strongest connection with Kaiser Fung was through Andrew Gelman’s meta-review of my review and Kaiser’s review of Nate Silver’s book The Signal And The Noise.
iPython Notebook in Data Journalism
Speaking of Nate Silver, Brian Keegan, a quantitative social scientist from Northeastern University, recently built a very cool iPython notebook (hat tip Ben Zaitlen), replete with a blog post in markdown on the need for openness in journalism (also available here), which revisited a fivethirtyeight article originally written by Walt Hickey on the subject of women in film. Keegan’s notebook is truly a model of open data journalism, and the underlying analysis is also interesting, so I hope you have time to read it.
The big news about the SAT is that the College Boards, which makes the SAT, has admitted there is a problem, which is widespread test-prep and gaming. As I talked about in this post, the SAT mainly serves to sort people by income.
It shouldn’t be a surprise to anyone when a weak proxy gets gamed. Yesterday I discussed this very thing in the context of Google’s PageRank algorithm, and today it’s student learning aptitude. The question is, what do we do next?
Rick Bookstaber wrote an interesting post yesterday (hat tip Marcos Carreira) with an idea to address the SAT problem with the same approach that I’m guessing Google is addressing the PageRank problem, namely by abandoning the poor proxy and getting a deeper, more involved one. Here’s Bookstaber’s suggestion:
You would think that in the emerging world of big data, where Amazon has gone from recommending books to predicting what your next purchase will be, we should be able to find ways to predict how well a student will do in college, and more than that, predict the colleges where he will thrive and reach his potential. Colleges have a rich database at their disposal: high school transcripts, socio-economic data such as household income and family educational background, recommendations and the extra-curricular activities of every applicant, and data on performance ex post for those who have attended. For many universities, this is a database that encompasses hundreds of thousands of students.
There are differences from one high school to the next, and the sample a college has from any one high school might be sparse, but high schools and school districts can augment the data with further detail, so that the database can extend beyond those who have applied. And the data available to the colleges can be expanded by orders of magnitude if students agree to share their admission data and their college performance on an anonymized basis. There already are common applications forms used by many schools, so as far as admission data goes, this requires little more than adding an agreement in the college applications to share data; the sort of agreement we already make with Facebook or Google.
The end result, achievable in a few years, is a vast database of high school performance, drilling down to the specific high school, coupled with the colleges where each student applied, was accepted and attended, along with subsequent college performance. Of course, the nature of big data is that it is data, so students are still converted into numerical representations. But these will cover many dimensions, and those dimensions will better reflect what the students actually do. Each college can approach and analyze the data differently to focus on what they care about. It is the end of the SAT version of standardization. Colleges can still follow up with interviews, campus tours, and reviews of musical performances, articles, videos of sports, and the like. But they will have a much better filter in place as they do so.
Two things about this. First, I believe this is largely already happening. I’m not an expert on the usage of student data at colleges and universities, but the peek I’ve had into this industry tells me that the analytics are highly advanced (please add related comments and links if you have them!). And they have more to do with admissions and college aid – and possibly future alumni giving – than any definition of academic success. So I think Bookstaber is being a bit naive and idealistic if he thinks colleges will use this information for good. They already have it and they’re not.
Secondly, I want to think a little bit harder about when the “big, deeper data” approach makes sense. I think it does for teachers to some extent, as I talked about yesterday, because after all it’s part of a job to get evaluated. For that matter I expect this kind of thing to be part of most jobs soon (but it will be interesting to see when and where it stops – I’m pretty sure Bloomberg will never evaluate himself quantitatively).
I don’t think it makes sense to evaluate children in the same way, though. After all, we’re basically talking about pre-consensual surveillance, not to mention the collection and mining of information far beyond the control of the individual child. And we’re proposing to mine demographic and behavioral data to predict future success. This is potentially much more invasive than just one crappy SAT test. Childhood is a time which we should try to do our best to protect, not quantify.
Also, the suggestion that this is less threatening because “the data is anonymized” is misleading. Stripping out names in historical data doesn’t change or obscure the difference between coming from a rich high school or a poor one. In the end you will be judged by how “others like you” performed, and in this regime the system gets off the hook but individuals are held accountable. If you think about it, it’s exactly the opposite of the American dream.
I don’t want to be naive. I know colleges will do what they can to learn about their students and to choose students to make themselves look good, at least as long as the US News & World Reports exists. I’d like to make it a bit harder for them to do so.
First there was Google Search, and then pretty quickly SEOs came into existence.
SEOs are marketing people hired by businesses to bump up the organic rankings for that business in Google Search results. That means they pay people to make their website more attractive and central to Google Search so they don’t have to pay for ads but will get visitors anyway. And since lots of customers come from search results, this is a big deal for those businesses.
Since Google Search was based on a pretty well-known, pretty open algorithm called PageRank which relies on ranking the interestingness of pages by their links, SEOs’ main jobs were to add links and otherwise fiddle with links to and from the websites of their clients. This worked pretty well at the beginning and the businesses got higher rank and they didn’t have to pay for it, except they did have to pay for the SEOs.
But after a while Google caught on to the gaming and adjusted its search algorithm, and SEOs responded by working harder at gaming the system (see more history here). It got more expensive but still kind of worked, and nowadays SEOs are a big business. And the algorithm war is at full throttle, with some claiming that Google Search results are nowadays all a bunch of crappy, low-quality ads.
This is to be expected, of course, when you use a proxy like “link” to indicate something much deeper and more complex like “quality of website”. Since it’s so high stakes, the gaming acts to decouple the proxy entirely from its original meaning. You end up with something that is in fact the complete opposite of what you’d intended. It’s hard to address except by giving up the proxy altogether and going for something much closer to what you care about.
Recently my friend Jordan Ellenberg sent me an article entitled The Future of PageRank: 13 Experts on the Dwindling Value of the Link. It’s an insider article, interviewing 13 SEO experts on how they expect Google to respond to the ongoing gaming of the Google Search algorithm.
The experts don’t all agree on the speed at which this will happen, but there seems to be some kind of consensus that Google will stop relying on links as such and will go to user behavior, online and offline, to rank websites.
If correct, this means that we can expect Google to pump all of our email, browsing, and even GPS data to understand our behaviors in a minute fashion in order to get at a deeper understanding of how we perceive “quality” and how to monetize that. Because, let’s face it, it’s all about money. Google wants good organic searches so that people won’t abandon its search engine altogether so it can sell ads.
So we’re talking GPS on your android, or sensor data, and everything else it can get its hands on through linking up various data sources (which as I read somewhere is why Google+ still exists at all, but I can’t seem to find that article on Google).
It’s kind of creepy all told, and yet I do see something good coming out of it. Namely, it’s what I’ve been saying we should be doing to evaluate teachers, instead of using crappy and gameable standardized tests. We should go deeper and try to define what we actually think makes a good teacher, which will require sensors in the classroom to see if kids are paying attention and are participating and such.
Maybe Google and other creepy tech companies can show us the way on this one, although I don’t expect them to explain their techniques in detail, since they want to stay a step ahead of SEO’s.