Google search is already open source
I’ve been talking a lot recently, with various people and on this blog, about data and model privacy. It seems that individuals, who should have the right to protect their data, don’t, while huge private companies, with enormous power over the public, do.
Another example: models working on behalf of the public, like the Fed’s stress tests and other regulatory models, seem to be essentially publicly known, which is useful mainly to financial insiders, the very people who are experts at gaming systems.
Google search has a deeply felt power over the public, and arguably needs to be understood for the consistent threat it poses to people’s online environment. It’s a scary thought experiment to imagine what could be done with it, and after all, why should we blindly trust a corporation to have our best interests in mind? Maybe it’s time to call for the Google search model to be open source.
But what would that look like? At first blush we might imagine forcing them to actually open up their source code. But at this point that code base must be absolutely enormous, unreadable, and written specifically for their uniquely massive machine set-up. In other words, totally overwhelming and useless. As my friend Suresh might say, the singularity has already happened and this is what it looks like (update: Suresh credits Cosma).
Once you consider how few people would actually be able to make sense of the underlying code base, you quickly realize that opening it up would do little to protect the public. Instead, we’d want to make the code accessible in some other way.
But I claim that’s exactly what Google already does, by allowing everyone to search using the model from anywhere. In other words, it’s on us, the public, to run experiments to understand what the underlying model actually does. We have the tools, so let’s get going.
If we think there’s inherent racism in Google searches, then we should run experiments like the ones Nathan Newman recently did, examining the different ads that pop up when someone writes an email about buying a car, for example, under different names and in different zip codes. We should organize to vary our zip codes, our personas (which would mean deliberately creating personas and Gmail logins, etc.), and our search terms, and see how the Google search results change as our inputs change.
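To make that concrete, here’s a minimal sketch of what such an audit harness might look like. Everything in it is hypothetical: the names and zip codes are invented, and fetch_ads is a placeholder for whatever collection method you actually use (fresh Gmail accounts, an instrumented browser session, and so on).

```python
from collections import Counter

# Hypothetical personas: same email text, different names and zip codes.
# The names and zips are invented examples.
PERSONAS = [
    {"name": "Greg Baker",    "zip": "60614"},
    {"name": "Greg Baker",    "zip": "60621"},
    {"name": "Darnell Baker", "zip": "60614"},
    {"name": "Darnell Baker", "zip": "60621"},
]

EMAIL_TEXT = "Thinking about buying a used car this weekend."

def fetch_ads(persona, email_text):
    """Placeholder: return the list of ad categories shown to this persona.

    In a real audit this would drive a fresh Gmail account or an
    instrumented browser session and record which ads actually appear.
    """
    raise NotImplementedError("plug in your collection method here")

def run_audit(n_trials=100):
    """Tally ad categories per persona over many repeated trials,
    since any single page load is noisy."""
    tallies = {}
    for persona in PERSONAS:
        counts = Counter()
        for _ in range(n_trials):
            counts.update(fetch_ads(persona, EMAIL_TEXT))
        tallies[(persona["name"], persona["zip"])] = counts
    return tallies
```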
After all, I don’t know what’s in the code base, but I’m pretty sure there’s no subroutine called “add_racism_to_search”; instead, it’s a complicated Rube Goldberg machine that should be judged by its outputs, statistically, rather than expected to spell out how it treats things on a case-by-case basis.
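Judging the outputs statistically can be as simple as a contingency-table test on the tallies from a harness like the one above. A sketch, with counts invented purely for illustration:

```python
from scipy.stats import chi2_contingency

# Invented illustrative counts: rows are two personas, columns are
# how often each ad category appeared across repeated trials.
#            car loans  subprime credit  background checks
observed = [[120,       30,              10],    # persona A
            [ 70,       60,              40]]    # persona B

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, p = {p_value:.4g}")
# A small p-value says the ad mix differs between personas by more
# than chance alone would explain. The output is what we measure,
# not the intentions behind it.
```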
Another thing: I don’t think there are bad intentions on the part of the modelers, but that doesn’t mean there aren’t bad consequences: the model is too complicated for anyone to anticipate exactly how it acts unless they perform experiments to test it. In the meantime, until people understand that, we need to distinguish between the intentions and the results. So, for example, in the update to Nathan Newman’s experiments with Gmail, Google responded with this:
This post relies on flawed methodology to draw a wildly inaccurate conclusion. If Mr. Newman had contacted us before publishing it, we would have told him the facts: we do not select ads based on sensitive information, including ethnic inferences from names.
And then Newman added this:
Now, I’m happy to hear Google doesn’t “select ads” on this basis, but Google’s words seem chosen to allow a lot of wiggle room (as such Google statements usually seem to). Do they mean that Google algorithms do not use the ethnicity of names in ad selection or are they making the broader claim that they bar advertisers from serving up different ads to people with different names?
My point is that it doesn’t matter what Google says it does or doesn’t do if, statistically speaking, the ads change depending on ethnicity. What they claim to do is moot if what actually happens, the actual output of their Rube Goldberg machine, is racist.
And I’m not saying Google’s models are definitively racist, by the way: Newman’s efforts were small, the efforts of one man, with only a handful of tests rather than thousands and thousands. But his approach to understanding the model was certainly correct, and it’s a cause that technologists and activists should take up and expand on.
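How many is “thousands and thousands”? A back-of-the-envelope power calculation gives a sense of the scale. The 10% and 15% ad rates below are invented purely for illustration:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Invented scenario: a predatory ad shows 15% of the time for one
# persona and 10% for another. How many trials per persona would we
# need to detect that gap at 5% significance with 80% power?
effect = proportion_effectsize(0.15, 0.10)
n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)
print(f"about {n:.0f} trials per persona")  # comes out around 680
```

Hundreds of trials per persona, across many personas, is exactly the kind of scale that calls for organized, collective experimentation rather than one person’s spare time.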
Mathematically speaking, it’s already as open source as we need it to be in order to understand it, although in a sense dual to the one people are used to. In fact, it defines the gold standard of open source: instead of getting a bunch of gobbledygook we can’t process, which depends on enormously large data that changes over time, we get real-time access to the newest version of the model, in a form even a child can use.
I only wish we had this kind of access to other public-facing models. Let’s create a large-scale project, like SETI, to understand the Google search model.