Google search is already open source

January 21, 2013

I’ve been talking a lot recently, with various people and on this blog, about data and model privacy. Individuals, who should have the right to protect their data, don’t seem to have it, while huge private companies, with enormous power over the public, do.

Another example: models working on behalf of the public, like Fed stress tests and other regulatory models, seem to be essentially public knowledge, which is useful indeed to financial insiders, the very people who are experts at gaming systems.

Google search has a deeply felt power over the public, and arguably needs to be understood for the consistent threat it poses to people’s online environment. It’s a scary thought experiment to imagine what could be done with it, and after all, why should we blindly trust a corporation to have our best interests in mind? Maybe it’s time to call for the Google search model to be open source.

But what would that look like? At first blush we might imagine forcing them to actually open up their source code. But at this point that code must be absolutely enormous, unreadable, and written specifically for their uniquely massive machine set-up. In other words, totally overwhelming and useless (as my friend Suresh might say, the singularity has already happened and this is what it looks like (update: Suresh credits Cosma)).

When you consider how few people could actually make sense of the underlying code base, you quickly realize that opening it up would be meaningless for the task of protecting the public. Instead, we’d want to make the model accessible in some other way.

But I claim that’s exactly what Google does, by allowing everyone to search using the model from anywhere. In other words, it’s on us, the public, to run experiments to understand what the underlying model actually does. We have the tools; let’s get going.

If we think there’s inherent racism in Google searches, then we should run experiments like Nathan Newman recently did, examining the different ads that pop up when someone writes an email about buying a car, for example, with different names and in different zip codes. We should organize to change our zip codes, our personas (which would mean deliberately creating personas and Gmail logins, etc.), and our search terms, and see how the Google search results change as our inputs change.
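
A minimal sketch of what such an experiment harness could look like, as a starting point. Everything here is hypothetical: the personas are invented, and fetch_ads is a stand-in for however the ads would actually be collected (by hand, by screenshot, or with a scraper):

```python
import csv
from itertools import product

# Hypothetical personas: in practice each would be a real Gmail login
# with its own deliberately constructed search history.
NAMES = ["Emily Walsh", "Latoya Jackson", "Jose Gonzalez", "Wei Kim"]
ZIP_CODES = ["60611", "60621", "90210", "90002"]
QUERIES = ["buy a car", "mortgage rates", "credit card offers"]

def fetch_ads(name, zip_code, query):
    """Hypothetical stand-in for the actual data-collection step:
    log in as the persona, run the query, record which ads appear."""
    raise NotImplementedError("collect by hand or with a scraper")

# Sweep every persona/query combination and log what comes back.
with open("ad_observations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "zip", "query", "ad"])
    for name, zip_code, query in product(NAMES, ZIP_CODES, QUERIES):
        for ad in fetch_ads(name, zip_code, query):
            writer.writerow([name, zip_code, query, ad])
```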

After all, I don’t know what’s in the code base, but I’m pretty sure there’s no sub-routine called “add_racism_to_search”; instead, it’s a complicated Rube Goldberg machine that should be judged by its outputs, in a statistical way, rather than expected to describe prescriptively how it treats things on a case-by-case basis.
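
To make “judged by its outputs, in a statistical way” concrete: tally the ad categories each persona group sees and run a chi-square test of independence. A minimal sketch with made-up counts:

```python
from scipy.stats import chi2_contingency

# Made-up tallies: rows are persona groups, columns are ad categories
# (say: payday loans, luxury goods, legal aid).
observed = [
    [120, 30, 15],   # persona group A
    [ 60, 10, 55],   # persona group B
]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, p = {p:.2g}")
# A tiny p-value says the ad mix depends on the persona group;
# it says nothing about anyone's intentions.
```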

Another thing: I don’t think there are bad intentions on the part of the modelers, but that doesn’t mean there aren’t bad consequences: the model is too complicated for anyone to anticipate exactly how it acts unless they perform experiments to test it. Until people understand that, we need to distinguish between the intentions and the results. So, for example, in the update to Nathan Newman’s experiments with Gmail, Google responded with this:

This post relies on flawed methodology to draw a wildly inaccurate conclusion. If Mr. Newman had contacted us before publishing it, we would have told him the facts: we do not select ads based on sensitive information, including ethnic inferences from names.

And then Newman added this:

Now, I’m happy to hear Google doesn’t “select ads” on this basis, but Google’s words seem chosen to allow a lot of wiggle room (as such Google statements usually seem to). Do they mean that Google algorithms do not use the ethnicity of names in ad selection or are they making the broader claim that they bar advertisers from serving up different ads to people with different names?

My point is that it doesn’t matter what Google says it does or doesn’t do if, statistically speaking, the ads change depending on ethnicity. It’s a moot argument what they claim to do if what actually happens, the actual output of their Rube Goldberg machine, is racist.

And I’m not saying Google’s models are definitively racist, by the way, since Newman’s efforts were small, the efforts of one man, and there were not thousands and thousands of tests but only a few. But his approach to understanding the model was certainly correct, and it’s a cause that technologists and activists should take up and expand on.

Mathematically speaking, it’s already as open source as we need it to be to understand it, although in a sense dual to the one people are used to thinking about. Actually, it defines the gold standard of open source: instead of getting a bunch of gobbledygook that we can’t process, and that depends on enormously large data that changes over time, we get real-time access to the newest version, which even a child can use.

I only wish that other public-facing models had such access. Let’s create a large-scale project like SETI to understand the Google search model.

  1. grwww
    January 21, 2013 at 9:31 am

    Google’s engine is based on statistics and your behaviors, plain and simple, from what I can see. So, if you act like a person from a particular ethnic background, then you will statistically see yourself treated that way. Your location controls the types of local goods and services that you will see advertised. What you buy online will control the types of businesses that advertise to you, as well as the “slew” of similar products and services that will be shown to you.

    You are what you are, and Google search is going to treat you, well, like yourself!

    • January 21, 2013 at 9:35 am

      I’m afraid that’s just too naive. Google’s model, like any other, is a coarse approximation to reality. There is a lot to understand about how it estimates reality and, even more importantly, how it affects reality.

      • grwww
        January 21, 2013 at 9:49 am

        It keeps a lot of statistics and has ranking algorithms, and people can, of course, buy the right to improve their match results, and perhaps there are lots of other specific controls. But there is no non-statistical way that I can imagine for them to go any farther than what the statistics show them. It makes no sense to waste time and precious ad space on non-productive ads.

        They need to provide click-through performance that meets Google’s needs and those of the advertisers. Divining ethnicity, as a particular statistical detail, seems completely odd. If I were of Asian ethnicity and I just bought a new HD TV, what would make more sense to advertise to me? A local Asian food marketplace, an Asian clothing marketplace, or an ad for a Blu-ray player, HDMI cabling, TV services, Dish Network, Apple TV, a DVR box, etc.?

        The statistics of my recent shopping would be the most realistic to look at, or some statistics that say about once a quarter I buy clothes online (at an Asian marketplace), or statistics that show that every Monday I open CNN.com, so why not throw up a Wall Street Journal ad?

        Surely there are lots of things that it considers, but it’s all numbers and statistics. It’s not just going to be the fact that my surname is “Kim” or “Wey” or “Sungli” or something else. Those might factor in for “locale” analysis, but that’s not going to be a large factor. And by locale, I mean they have my zip code as downtown Chicago, and so they might use some advertising from Chinatown as opposed to Michigan Avenue.

        • January 21, 2013 at 9:55 am

          You are clearly under the impression that this is all being done in the name of convenience for the user. May I suggest that you think more like a predatory asshole and consider how many obscene and fraudulent mortgages were pushed on minorities throughout the past 10 years, and how much easier that gets through this kind of personalization? In that light you might want to think twice about how certain people are given “offers” that others don’t see.

          Another way to say this is that it’s always easier to be the person benefitting from a situation than the person suffering. In fact it seems downright wonderful that we get offers of money back on our purchases if we have excellent credit. But keep in mind that no credit card company is in it to lose money, so to balance out your money-back offers there are plenty of other people who are directly victimized by the system.

        • grwww
          January 21, 2013 at 3:17 pm

          Part of the reason behind predatory lending is described here: http://www.thisamericanlife.org/radio-archives/episode/355/the-giant-pool-of-money.

          There were, of course, lots of “individuals” involved, each of whom took their own “path” in that debacle. As said there, some people understood what was going on and jumped in, sure that the benefits before bankruptcy would far exceed what would come after.

          But, yes, it’s possible that “predatory assholes” were present in all of this. The question is whether there is a clear and obvious path pointing from one end to the other, with Google Search standing in the middle with an outstretched hand in both directions.

          Google search enables “everything” that you can do with it. That doesn’t mean there is a “conscious decision” to make “everything” possible. Clearly there is a lot of speculation possible.

        • Harry
          January 22, 2013 at 3:53 pm

          Quite so. I don’t want to play poker with people who can see my cards, and I don’t want to do business with people who have access to my private data. It’s a way of appropriating the “consumer surplus” for the company, and that is directly harmful to my interests.

  2. January 21, 2013 at 1:35 pm

    I think you both may be absolutely right. If I wanted to prey, it wouldn’t be too hard to explore what kinds of keywords would help me connect with the people who have my target personalizations applied to them. The possibilities are endless… That doesn’t impose motivations on data or statistics… and I just don’t see how or why Google would use last names when there are so many easier and more accurate ways to glean information to categorize and personalize a person’s profile.

  3. JSE
    January 21, 2013 at 3:03 pm

    “I’m pretty sure there’s no sub-routine that’s called “add_racism_to_search””

    Have you tried “from racistpy import *”?

    • Not Larry Page
      January 21, 2013 at 5:48 pm

      If Google only shows you “what interests you” and nothing different then you will never know anything better.

      I don’t work at Google but I’m pretty sure that add-racism-to-search IS part of the subroutine. For example, take a look at this.

      Anyway, since I’m not a moralist, where is the limit when you must let it go and let people roll over their own shit? If someone doesn’t want to consider any alternatives to their own thinking and you know it will bring negative consequences, should you do something about it?

      In the past I used to take the role of the Devil’s Lawyer, partially because I liked it, but lately I’m growing quite tired of it.

  4. January 21, 2013 at 3:34 pm

    This is true to a limited extent. But as Google’s response suggests, you can only do small studies. Google has traditionally limited the number of queries any one person can shoot at it in one day, which has been a severe limitation on using it as an experimental tool. Linguists have long wanted to use Google as a very large corpus, but this policy has been a choke point. Have they changed their policy?

  5. January 21, 2013 at 9:56 pm

    Open source has a well-defined, clean, and clear definition. You need to be able to actually see, share, and allow modifications to the source code for something to be open source.

    The Google search engine is very far from open source. Even compared against, say, the Windows operating system, it’s very closed. At least with Windows, you can study the compiled code, break down its small functions, run them on inputs you control, etc. You have no such freedom with the Google search engine.

    As for studying Google (or language itself) from its outputs, I’ve chatted with statisticians about the #results estimation, and they say it’s very hacky and full of very crude approximations. So things that academia has produced, like the Normalized Google Distance (http://en.wikipedia.org/wiki/Normalized_Google_distance), don’t actually make that much sense in practice.
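
    For reference, the Normalized Google Distance is computed entirely from those result-count estimates, which is why the crudeness feeds straight into it. A sketch of the formula in code, where every input is one of those estimates:

    ```python
    from math import log

    def ngd(fx, fy, fxy, N):
        """Normalized Google Distance.
        fx, fy: result-count estimates for terms x and y alone
        fxy:    result-count estimate for a query containing both
        N:      total number of indexed pages (itself only a guess)
        """
        lx, ly, lxy = log(fx), log(fy), log(fxy)
        return (max(lx, ly) - lxy) / (log(N) - min(lx, ly))
    ```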

    When Google itself was trying to figure out whether Bing was directly copying its search results, it still required a very clever and concerted effort by Googlers to get definitive proof. See http://articles.cnn.com/2011-02-02/tech/google.bing.sting_1_experian-hitwise-google-engineers-search-engine?_s=PM:TECH . Trying to infer the guts of the search engine from its inputs and outputs is a very hard task, even for experts in the field.

    I am sure there would be enormous difficulty for people to understand the search engine code. But there would also be enormous value for those who committed the effort. So many people use it every day. If you could make it a tiny bit better, multiplied by the billions of queries it gets a day, you’d have done a great deal of good in aggregate.

    No one says that people shouldn’t publish solutions to the Clay Mathematics million-dollar open problems just because those solutions are certainly enormously complex and capable of being understood only by a very small number of very skilled people.

    Pondering whether a mainstream search engine is racist is somewhat absurd. It’s just algorithms optimized toward making results look good for human raters. There is really no space for racism. But things like how Google ranks YouTube videos vs. other videos on the web are interesting from a legal perspective.

    To be honest, the degree to which Google is intensively secretive about the ranking function and subsystems actually bothers me a bit. Even though I have access to some of the Google code (I’m an engineer on local search), I learn a lot about the field of information retrieval from papers written by researchers and engineers at Yahoo, Microsoft, and academia. It’s rare that I find papers written by Googlers about the nitty-gritty of web search, even though such papers written by people at Yahoo and Microsoft are plentiful. Google is hoarding its admittedly hard-earned knowledge for its own benefit.

    The best (maybe the only) argument that I can make for why you should have open source personal credit scoring, but not open source web search, is that the guys doing web search are actually good and benevolent.

    • January 22, 2013 at 9:00 am

      Your blind faith leaves me speechless.

      • January 22, 2013 at 12:56 pm

        Then it’s decided. If society as a whole can’t trust Google, then it’s pretty much indistinguishable except in importance from credit scoring systems.

        We need a “Too important to hoard” movement for decision systems that have enormous impact on society.

        Maybe it’s something that RMS would have come up with if he’d grown his hacking skills in the ’00s instead of the ’70s. Basically GNU for the internet age.

  6. NotRelevant
    January 22, 2013 at 9:21 am

    Do you worry that if you google “Jheri Curl” then you will never see another ad for Las Vegas hotel deals? Certainly don’t use it in an email. You know the bot reads every word of that.

    Placing text ads on search results generated by specific strings is what Google is all about, but once search history can be used, new opportunities emerge. For example, the minus sign. Maybe a retailer wants to serve up text ads to those who google the word “NASCAR” but exclude those who have ever googled “Obama” or “Romney.” Tailoring search needn’t be limited to the words in the search box, theoretically. It could someday include words that have ever appeared in the drop-down list of the search box. After all, the Holy Grail of search is to fix things so that if you google for a car dealer and then immediately google for a bus schedule, the machine will put two and two together and try to sell you a new bus.

    Use of the minus sign is a big opportunity for explicit racism that goes beyond simply red-lining geographic areas. It allows the user to red-line anyone who has ever searched on a term that is strongly associated with any group, including people grouped by race, nationality, sex, or religion. It would be an opportunity that existed for users, though, not a policy advanced by Google. If Google disallowed excluding on the basis of some search terms, it would quickly find itself employing an army of cyber-chimps to play whack-a-mole with the universe. That does not conjure up images we positively associate with the information age.

    Opportunities for racism occur anywhere humans become involved, and humans are very much involved in Google search results. The search results we see are frequently not “organic” search generated by code alone. Google employees tweak the search results on the most frequently searched strings. So on the big searches that matter, results can be quite bespoke, with a human providing the finishing touches that the machine can’t quite get. Additionally, although the search results may report something like 87,000,000 pages, the last time I tried (about a decade ago), I couldn’t get past the first thousand results listed. That is to say, Google would consistently block me, through web page failure, from accessing the search result listings past about page 100. Therefore, drilling down past the edited search into organic search may not be a reliable method of testing for consistent organic results regardless of a user’s geography, search history, or email content.

    When discussing discrimination and the Internet, don’t forget to include the intelligent sites like Amazon that use their own methods to determine what you see offered and at what price it is offered to you. Lord only knows the mischief that could occur if Amazon were able to recognize an IP address that is exclusive to the New York Times or FBI headquarters. I always check the aggregator retail sites (Google Shopping and Addall are two) to make sure I’m paying the rate that’s offered to every other nobody, but that only works on goods, not services.

  7. William
    January 22, 2013 at 10:12 am

    It seems to me that the concept “racism,” like “fascism,” is being used rather fast and loose these days. Associating the name “Gonzalez” with an advertiser whose product is of interest primarily to people of Spanish heritage is not racist! It is what non-Europeans in this country have always wanted: cultural sensitivity to their wants, needs, and desires.

    • January 22, 2013 at 10:13 am

      I agree. That’s why more data would be better, and a more nuanced description of what is actually happening to which people during their online experiences.

  8. thebrasstack
    February 4, 2013 at 8:51 am

    Let’s say a company price-discriminates by race.

    Suppose Google wants to be a helpful ad service to them, and help them segment their customers automatically.

    All they have to do is get a dataset of names, zipcodes, and prices. Classifying “expensive” and “cheap” names and zipcodes will turn out to be the same as classifying “black” and “white” names. Nobody at Google even needs to be *aware* that the company’s practice is to discriminate by race.
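
    A minimal sketch of that proxy effect on synthetic data (the names and numbers are invented; the only point is that the tier labels can be recovered without race ever appearing as a feature):

    ```python
    import random
    from collections import defaultdict

    random.seed(0)

    # Synthetic world: the advertiser's price tier secretly tracks race,
    # but race itself never appears in the data the ad service sees.
    NAMES_A = ["Emily", "Greg", "Anne"]      # names correlated with group A
    NAMES_B = ["Lakisha", "Jamal", "Aisha"]  # names correlated with group B

    rows = []
    for _ in range(1000):
        if random.random() < 0.5:
            rows.append((random.choice(NAMES_A), "60611", "cheap"))
        else:
            rows.append((random.choice(NAMES_B), "60621", "expensive"))

    # "Training": count which tier each (name, zip) pair is seen with.
    tier_counts = defaultdict(lambda: defaultdict(int))
    for name, zip_code, tier in rows:
        tier_counts[(name, zip_code)][tier] += 1

    def predict(name, zip_code):
        counts = tier_counts[(name, zip_code)]
        return max(counts, key=counts.get) if counts else "unknown"

    # The segmenter now reproduces the racial split it was never shown.
    print(predict("Jamal", "60621"))  # -> expensive
    print(predict("Greg", "60611"))   # -> cheap
    ```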

    Now, if I understand you correctly, you’re saying that in a scenario like this, it would be appropriate to call Google “racist.” Your line is “If what actually happens, the output of their Rube Goldberg machine, is racist.” But you should understand that you’re making an extraordinarily strong claim by doing that. You’re saying that it would have been their responsibility to proactively make sure that their market segmentation processes don’t discriminate by race.

    You could try to implement this concretely by, for instance, making sure that market segments don’t correlate unduly with races. It’s a little more complicated than that, because race correlates with legitimate reasons to product-differentiate, and most of the time we only say it’s racist if you price-discriminate on race *alone* — it’s not racist to show hip-hop ads to hip-hop fans, even if liking hip-hop correlates with being black. Testing for racism in all ad placements sounds like a computationally expensive, and possibly ill-defined, problem. I don’t know enough to know if it’s feasible; but it’s at least plausible that it’s simply enough of a burden that it makes it impossible to be an internet advertising company at all.
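
    One concrete shape such a test could take, purely as an illustration (this is not anything Google is known to do), is the four-fifths rule used in US employment-discrimination screening, applied to ad exposure rates:

    ```python
    def four_fifths_flag(shown_a, total_a, shown_b, total_b):
        """Flag an ad if the lower group's exposure rate falls below
        80% of the higher group's (the EEOC four-fifths heuristic)."""
        low, high = sorted([shown_a / total_a, shown_b / total_b])
        return low / high < 0.8

    # Made-up numbers: the ad is shown to 900 of 1000 group-A users
    # but only 400 of 1000 group-B users.
    print(four_fifths_flag(900, 1000, 400, 1000))  # True: flagged
    ```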

    What’s going on is that Google says “We’ll help you advertise and price your service so you make the most profit,” and sometimes the way companies would make the most profit is by overcharging black people. What you’re saying is that it’s Google’s responsibility to find out when that’s happening, and then say “Sorry, pal, we won’t help you discriminate.” Be aware that, by holding Google to this standard, you’re essentially asking Google to be the watchdog for racism in the entire economy, and bear the full cost of doing so.

    Now let’s imagine how this would play out. What happens if Google’s algorithms for ad placement for Acme Car Company test as “racist”? Does Google refuse to take on Acme as a customer at all? No, that would be wasteful. Google would probably choose the ad placement that scores best on profit-maximizing *without* being racist. Now, there are two problems with that.

    Problem 1: perhaps Google will discover that racial price discrimination is basically endemic in the economy, to the point that certain sectors can’t be profitable at all unless they’re racist. Imagine if mortgage lending, for example, was an industry that didn’t just have a *little* racial price discrimination, but was thoroughly *based* on it. In this scenario, all the mortgage lenders quit using AdWords, because Google’s anti-racism policy erodes all the advantage for them. They go back to being their old racist selves, slightly less efficiently than they would with Google’s help; but Google loses their entire business. In other words, the cost falls disproportionately on Google rather than the racist industries.

    Problem 2: Google will end up gaming the metric used to test for racism. The “next-best solution without racism” will, in the case that racial price-discrimination is a core part of the company’s business model, be simply a racial price-discrimination algorithm that your racism test fails to detect.

    This is all leaving aside the question of how you would get Google to do such a thing. In my thought experiment, Google is trying wholeheartedly to be a good citizen. In real life, whether you tried to persuade it to take on this responsibility through public opinion or through law, Google would only “test for racism” as much as it took to avoid looking bad in the press or getting sued. It’s probably much, much easier to hide racism than to rigorously avoid it. Unless the public or government is a lot more capable and passionate (and willing to spend computational resources) than I would expect, there would never be a *thorough* public-interest racism-testing project; there would only be a few statistical experiments here and there, like the one in your link. A SETI-like project would work, perhaps. But at this point I’m questioning the value of using all those resources and time and talent just to screw Google over, when you’ll have a negligible impact on *actual* price discrimination. (Employers, merchants, and lenders can racially discriminate just fine by themselves; they’ve been doing it without Google for centuries. If you eliminated racial price-discrimination from Google, the financial impact on the average black person would be non-zero but I doubt it would be noticeable.)

    • February 4, 2013 at 9:18 am

      Well I agree with most of what you are saying. But I don’t see “problem #1” as a problem!

  9. thebrasstack
    February 4, 2013 at 10:03 am

    What’s also conceivable is a tool that could show you, the user, what ads and prices you would see if you were white, or in a different neighborhood, or whatever. Fight information with information. Show people when they’re getting ripped off.

    Of course, this becomes an arms race. I’m not entirely sure how you could get an anti-ripoff service to grow to the scale where it can actually compete with Google; you might be able to get a modestly successful startup that charges people for the service (finding deals, avoiding scams), but it’s kind of got to be a paid service because you obviously can’t monetize it in the traditional way. (Ha. Imagine if companies paid to be at the top of your “non-scam” results. Second-order product differentiation. That would be a nightmare.) So…I doubt this really has legs, *unless* by some windfall it just happens to be really cheap to test for price discrimination and hard for Google to effectively curb testing. Some sufficiently clever distributed setup might be doable — it’s conceivable you could even do it for free and distribute it for free.

    In the happy case where this actually works, price discrimination just becomes *less effective*, because everybody knows that they could have gotten a better deal. They’ll be fine with the kind of price discrimination that works on them (I do not mind paying a premium for swanky lattes and would not be tempted by the Dunkin’ Donuts ads my neighbor gets), but they’ll push back on outright predation or racism (I’d kick up a stink if I knew I were being overcharged because of my zipcode.) Price-discrimination will trend towards the directions that people don’t mind even if they know about it. Google wouldn’t have to have a giant department of racism-testing; racism will just work less well, so they won’t wind up optimizing for it as much. Of course, it’s an imperfect solution (what about people who don’t choose to use the service? there will always be a cost to being uninformed, and as the pool of ignorant people shrinks, they’ll be preyed on more aggressively.) But, in the end, there really *isn’t* a universal solution short of an implausibly expensive and politically unlikely enforcement campaign.

    • February 4, 2013 at 10:05 am

      YES!!

      But it already *is* an arms race, so don’t tell me we’d be starting it.

      Do you wanna help make that tool??

      • thebrasstack
        February 4, 2013 at 10:09 am

        Right this minute I’m overbooked with projects beyond belief. I can try to recommend good people, though.
