Here’s what I’ve spent the last couple of days doing: alternatively reading Christian Rudder’s new book Dataclysm and proofreading a report by AAPOR which discusses the benefits, dangers, and ethics of using big data, which is mostly “found” data originally meant for some other purpose, as a replacement for public surveys, with their carefully constructed data collection processes and informed consent. The AAPOR folk have asked me to provide tangible examples of the dangers of using big data to infer things about public opinion, and I am tempted to simply ask them all to read Dataclysm as exhibit A.
Rudder is a co-founder of OKCupid, an online dating site. His book mainly pertains to how people search for love and sex online, and how they represent themselves in their profiles.
Here’s something that I will mention for context into his data explorations: Rudder likes to crudely provoke, as he displayed when he wrote this recent post explaining how OKCupid experiments on users. He enjoys playing the part of the somewhat creepy detective, peering into what OKCupid users thought was a somewhat private place to prepare themselves for the dating world. It’s the online equivalent of a video camera in a changing booth at a department store, which he defended not-so-subtly on a recent NPR show called On The Media, and which was written up here.
I won’t dwell on that aspect of the story because I think it’s a good and timely conversation, and I’m glad the public is finally waking up to what I’ve known for years is going on. I’m actually happy Rudder is so nonchalant about it because there’s no pretense.
Even so, I’m less happy with his actual data work. Let me tell you why I say that with a few examples.
Who are OKCupid users?
I spent a lot of time with my students this summer saying that a standalone number wouldn’t be interesting, that you have to compare that number to some baseline that people can understand. So if I told you how many black kids have been stopped and frisked this year in NYC, I’d also need to tell you how many black kids live in NYC for you to get an idea of the scope of the issue. It’s a basic fact about data analysis and reporting.
When you’re dealing with populations on dating sites and you want to conclude things about the larger culture, the relevant “baseline comparison” is how well the members of the dating site represent the population as a whole. Rudder doesn’t do this. Instead he just says there are lots of people OKCupid for the first few chapters, and then later on after he’s made a few spectacularly broad statements, on page 104 he compares the users of OKCupid to the wider internet users, but not to the general population.
It’s an inappropriate baseline, made too late. Because I’m not sure about you but I don’t have a keen sense of the population of internet users. I’m pretty sure very young kids and old people are not well represented, but that’s about it. My students would have know to compare a population to the census. It needs to happen.
How do you collect your data?
Let me back up to the very beginning of the book, where Rudder startles us by showing us that the men that women rate “most attractive” are about their age whereas the women that men rate “most attractive” are consistently 20 years old, no matter how old the men are.
Actually, I am projecting. Rudder never actually specifically tells us what the rating is, how it’s exactly worded, and how the profiles are presented to the different groups. And that’s a problem, which he ignores completely until much later in the book when he mentions that how survey questions are worded can have a profound effect on how people respond, but his target is someone else’s survey, not his OKCupid environment.
Words matter, and they matter differently for men and women. So for example, if there were a button for “eye candy,” we might expect women to choose more young men. If my guess is correct, and the term in use is “most attractive”, then for men it might well trigger a sexual concept whereas for women it might trigger a different social construct; indeed I would assume it does.
Since this isn’t a porn site, it’s a dating site, we are not filtering for purely visual appeal; we are looking for relationships. We are thinking beyond what turns us on physically and asking ourselves, who would we want to spend time with? Who would our family like us to be with? Who would make us be attractive to ourselves? Those are different questions and provoke different answers. And they are culturally interesting questions, which Rudder never explores. A lost opportunity.
Next, how does the recommendation engine work? I can well imagine that, once you’ve rated Profile A high, there is an algorithm that finds Profile B such that “people who liked Profile A also liked Profile B”. If so, then there’s yet another reason to worry that such results as Rudder described are produced in part as a result of the feedback loop engendered by the recommendation engine. But he doesn’t explain how his data is collected, how it is prompted, or the exact words that are used.
Here’s a clue that Rudder is confused by his own facile interpretations: men and women both state that they are looking for relationships with people around their own age or slightly younger, and that they end up messaging people slightly younger than they are but not many many years younger. So forty year old men do not message twenty year old women.
Is this sad sexual frustration? Is this, in Rudder’s words, the difference between what they claim they want and what they really want behind closed doors? Not at all. This is more likely the difference between how we live our fantasies and how we actually realistically see our future.
Need to control for population
Here’s another frustrating bit from the book: Rudder talks about how hard it is for older people to get a date but he doesn’t correct for population. And since he never tells us how many OKCupid users are older, nor does he compare his users to the census, I cannot infer this.
Here’s a graph from Rudder’s book showing the age of men who respond to women’s profiles of various ages:
We’re meant to be impressed with Rudder’s line, “for every 100 men interested in that twenty year old, there are only 9 looking for someone thirty years older.” But here’s the thing, maybe there are 20 times as many 20-year-olds as there are 50-year-olds on the site? In which case, yay for the 50-year-old chicks? After all, those histograms look pretty healthy in shape, and they might be differently sized because the population size itself is drastically different for different ages.
One of the worst examples of statistical mistakes is his experiment in turning off pictures. Rudder ignores the concept of confounders altogether, which he again miraculously is aware of in the next chapter on race.
To be more precise, Rudder talks about the experiment when OKCupid turned off pictures. Most people went away when this happened but certain people did not:
Some of the people who stayed on went on a “blind date.” Those people, which Rudder called the “intrepid few,” had a good time with people no matter how unattractive they were deemed to be based on OKCupid’s system of attractiveness. His conclusion: people are preselecting for attractiveness, which is actually unimportant to them.
But here’s the thing, that’s only true for people who were willing to go on blind dates. What he’s done is select for people who are not superficial about looks, and then collect data that suggests they are not superficial about looks. That doesn’t mean that OKCupid users as a whole are not superficial about looks. The ones that are just got the hell out when the pictures went dark.
This brings me to the most interesting part of the book, where Rudder explores race. Again, it ends up being too blunt by far.
Here’s the thing. Race is a big deal in this country, and racism is a heavy criticism to be firing at people, so you need to be careful, and that’s a good thing, because it’s important. The way Rudder throws it around is careless, and he risks rendering the term meaningless by not having a careful discussion. The frustrating part is that I think he actually has the data to have a very good discussion, but he just doesn’t make the case the way it’s written.
Rudder pulls together stats on how men of all races rate women of all races on an attractiveness scale of 1-5. It shows that non-black men find their own race attractive and non-black men find black women, in general, less attractive. Interesting, especially when you immediately follow that up with similar stats from other U.S. dating sites and – most importantly – with the face that outside the U.S., we do not see this pattern. Unfortunately that crucial fact is buried at the end of the chapter, and instead we get this embarrassing quote right after the opening stats:
And an unintentionally hilarious 84 percent of users answered this match question:
Would you consider dating someone who has vocalized a strong negative bias toward a certain race of people?
in the absolute negative (choosing “No” over “Yes” and “It depends”). In light of the previous data, that means 84 percent of people on OKCupid would not consider dating someone on OKCupid.
Here Rudder just completely loses me. Am I “vocalizing” a strong negative bias towards black women if I am a white man who finds white women and asian women hot?
Especially if you consider that, as consumers of social platforms and sites like OKCupid, we are trained to rank all the products we come across to ultimately get better offerings, it is a step too far for the detective on the other side of the camera to turn around and point fingers at us for doing what we’re told. Indeed, this sentence plunges Rudder’s narrative deeply into the creepy and provocative territory, and he never fully returns, nor does he seem to want to. Rudder seems to confuse provocation for thoughtfulness.
This is, again, a shame. A careful conversation about the issues of what we attracted to, what we can imagine doing, and how we might imagine that will look to our wider audience, and how our culture informs those imaginings, are all in play here, and could have been drawn out in a non-accusatory and much more useful way.
The Columbia J-School program that I have been directing, The Lede Program in Data Journalism, has wound down this past week and in four days my 6-month contract with Columbia will end. I’ve had a fantastic time and I am super proud of what we accomplished this past summer. The students from the program are awesome and many of them are now my friends. About half of them are still engaged in classes and will continue to work this semester with Jonathan Soma, who absolutely rocks, and of course my fabulous colleague Theresa Bradley, who will step in as Director now that I’m leaving.
So, what’s next? I am happy to say that as of today (or at least as of next Monday when my kids are really in school full-time) I’m writing my book Weapons of Math Destruction on a full-time basis. This comes as a huge relief, since the internal pressure I have to finish this book is reminiscent of how I felt when I needed to write my thesis: enormous, but maybe even worse than then since the timeliness of the book could not be overstated, and I want to get this book out before the moment passes.
In the meantime I have some cool talks I’m planning to go to (like this one I went to already!) and some I’m planning to give. So for example, I’m giving a keynote at The Yale Day of Data later this month, which is going to be fun and interesting.
My Yale talk is basically a meditation on what can be achieved by academic data science institutions, what presents cultural and technical obstacles to collaboration, and why we need to do it anyway. It’s no less than a plea for Yale to create a data science institute with a broad definition of data science – so including scholars from law and from journalism as well as the fields you think of already when you think of data science – and a broad mandate to have urgent conversations across disciplines about the “big data revolution.” That conversation has already begun at the Information Society Project at Yale Law School, which makes me optimistic.
I also plan to continue my weekly Slate Money podcasts with Felix Salmon and Jordan Weissmann. Today we’re discussing the economic implications of Scottish independence, Felix’s lifetime earnings calculator, and the Fed’s new liquidity rules and how they affect municipalities, which my friend Marc Joffe guest blogged about yesterday.
For a while now I’ve been thinking I should build a decision tree for deciding which algorithm to use on a given data project. And yes, I think it’s kind of cool that “decision tree” would be an outcome on my decision tree. Kind of like a nerd pun.
I’m happy to say that I finally started work on my algorithm decision tree, thanks to this website called gliffy.com which allows me to build flowcharts with an easy online tool. It was one of those moments when I said to myself, this morning at 6am, “there should be a start-up that allows me to build a flowchart online! Let me google for that” and it totally worked. I almost feel like I willed gliffy.com into existence.
So here’s how far I’ve gotten this morning:
I looked around the web to see if I’m doing something that’s already been done and I came up with this:
I appreciate the effort but this is way more focused on the size of the data than I intend to be, at least for now. And here’s another one that’s even less like the one I want to build but is still impressive.
Because here’s what I want to focus on: what kind of question are you answering with which algorithm? For example, with clustering algorithms you are, you know, grouping similar things together. That one’s easy, kind of, although plenty of projects have ended up being clustering or classifying algorithms whose motivating questions did not originally take on the form “how would we group these things together?”.
In other words, the process of getting at algorithms from questions is somewhat orthogonal to the normal way algorithms are introduced, and for that reason taking me some time to decide what the questions are that I need to ask in my decision tree. Right about now I’m wishing I had taken notes when my Lede Program students asked me to help them with their projects, because embedded in those questions were some great examples of data questions in search of an algorithm.
Please give me advice!
I’ve was away over the weekend (apologies to Aunt Pythia fans!) and super busy yesterday but this morning I finally had a chance to read Ethan Zuckerman’s Atlantic piece entitled The Internet’s Original Sin, which was sent to me by my friend Ernest Davis.
Here’s the thing, Zuckerman gets lots of things right in the article. Most importantly, the inherent conflict between privacy and the advertisement-based economy of the internet:
Demonstrating that you’re going to target more and better than Facebook requires moving deeper into the world of surveillance—tracking users’ mobile devices as they move through the physical world, assembling more complex user profiles by trading information between data brokers.
Once we’ve assumed that advertising is the default model to support the Internet, the next step is obvious: We need more data so we can make our targeted ads appear to be more effective.
This is well said, and important to understand.
Here’s where Zuckerman goes a little too far in my opinion:
Outrage over experimental manipulation of these profiles by social networks and dating companies has led to heated debates amongst the technologically savvy, but hasn’t shrunk the user bases of these services, as users now accept that this sort of manipulation is an integral part of the online experience.
It is a mistake to assume that “users accept this sort of manipulation” because not everyone has stopped using Facebook. Facebook is, after all, an hours-long daily habit for an enormous number of people, and it’s therefore sticky. People don’t give up addictive habits overnight. But it doesn’t mean they are feeling the same way about Facebook that they did 4 years ago. People are adjusting their opinion of the user experience as that user experience is increasingly manipulated and creepy.
An analogy should be drawn to something like smoking, where the rates have gone way down since we all found out it is bad for you. People stopped smoking even though it is really hard for most people (and impossible for some).
We should instead be thinking longer term about what people will be willing to leave Facebook for. What is the social networking model of the future? What kind of minimum privacy protections will convince people they are safe (enough)?
And, most importantly, will we even have reasonable minimum protections, or will privacy be entirely commoditized, whereby only premium pay members will be protected, while the rest of us will be thrown to the dogs?
“Data Science” is one of my least favorite tech buzzwords, second to probably “Big Data”, which in my opinion should be always printed followed by a winky face (after all, my data is bigger than yours). It’s mostly a marketing ploy used by companies to attract talented scientists, statisticians, and mathematicians, who, at the end of the day, will probably be working on some sort of advertising problem or the other.
Still, you have to admit, it does have a nice ring to it. Thus the title Democratizing Data Science, a vision paper which I co-authored with two cool Ph.D students at MIT CSAIL, William Li and Ramesh Sridharan.
The paper focuses on the latter part of the situation mentioned above. Namely, how can we direct these data scientists, aka scientists who interact with the data pipeline throughout the problem-solving process (whether they be computer scientists or programmers or statisticians or mathematicians in practice) towards problems focused on societal issues?
In the paper, we briefly define Data Science (asking ourselves what the heck it even means), then question what it means to democratize the field, and to what end that may be achieved. In other words, the current applications of Data Science, a new but growing field, in both research and industry, has the potential for great social impact, but in reality, resources are rarely distributed in a way to optimize the social good.
We’ll be presenting the paper at the KDD Conference next Sunday, August 24th at 11am as a highlight talk in the Bloomberg Building, 731 Lexington Avenue, NY, NY. It will be more like an open conversation than a lecture and audience participation and opinion is very welcome.
The conference on Sunday at Bloomberg is free, although you do need to register. There are three “tracks” going on that morning, “Data Science & Policy”, “Urban Computing”, and “Data Frameworks”. Ours is in the 3rd track. Sign up here!
If you don’t have time to make it, give the paper a skim anyway, because if you’re on Mathbabe’s blog you probably care about some of these things we talk about.
Everyone I know who codes uses stackoverflow.com for absolutely everything.
Just yesterday I met a cool coding chick who was learning python and pandas (of course!) with the assistance of stackoverflow. It is exactly what you need to get stuff working, and it’s better than having a friend to ask, even a highly knowledgable friend, because your friend might be busy or might not know the answer, or even if your friend knew the answer her answer isn’t cut-and-paste-able.
If you are someone who has never used stackoverflow for help, then let me explain how it works. Say you want to know how to load a JSON file into python but you don’t want to write a script for that because you’re pretty sure someone already has. You just search for “import json into python” and you get results with vote counts:
Also, every math nerd I know uses and contributes to mathoverflow.net. It’s not just for math facts and questions, either, there are interesting discussions going on there all the time. Here’s an example of a comment in response to understanding the philosophy behind the claimed proof of the ABC Conjecture:
OK well hold on tight because now there’s a new online forum, but not about coding and not about math. It’s about all the other STEM subjects, which since we’ve removed math might need to be called STE subjects, which is not catchy.
So far only statistics is open, but other stuff is coming very soon. Specifically it covers, or soon will cover, the following fields:
- Cognitive Sciences
- Computer Sciences
- Earth and Planetary Sciences
- Science & Math Education
- History of Science and Mathematics
- Applied Mathematics, and
I’m super excited for this site, it has serious potential to make peoples’ lives better. I wish it had a category for Data Sciences, and for Data Journalism, because I’d probably be more involved in those categories than most of the above, but then again most data science-y questions could be inserted into one of the above. I’ll try to be patient on this one.
Here’s a screen shot of an existing Stats question on the site:
There’s an interesting and horrible New York Time story by Jessica Silver-Greenberg about a PayDay loan syndicate being run out of New York State. The syndicate consists of twelve companies owned by a single dude, Carey Vaughn Brown, with help from a corrupt lawyer and another corrupt COO. Manhattan District Attorneys are charging him and his helpers with usury under New York law.
The complexity of the operation was deliberate and intended to obscure the chain of events that would start with a New Yorker online looking for quick cash online and end with a predatory loan. They’d interface with a company called MyCashNow.com, which would immediately pass their application on to a bunch of other companies in different states or overseas.
Important context: in New York, the usury law caps interest rates at 25 percent annually, and these PayDay operations were charging between 350 and 650 percent annually. Also key, the usury laws apply to where the borrower is, not where the lender is, so even though some of the companies were located (at least on paper) in the West Indies, they were still breaking the law.
They don’t know exactly how big the operation was in New York, but one clue is that in 2012, one of the twelve companies had $50 million in proceeds from New York.
Here’s my question: how did MyCashNow.com advertise? Did it use Google ads, or Facebook ads, or something else, and if so, what were the attributes of the desperate New Yorkers that it looked for to do its predatory work?
One side of this is that vulnerable people were somehow targeted. The other side is that well-off people were not, which meant they didn’t see ads like this, which makes it harder for people like the Manhattan District Attorney to even know about shady operations like this.