The title is Big Data, Smaller Wage Gap? and, you know, it almost gives us the impression that she has a plan to close the wage gap using big data, or alternatively an argument that the wage gap will automatically close with the advent of big data techniques. It turns out to be the former, but not really.
After complaining about the wage gap for women in general, and after we get to know how much she loves her young niece, here’s the heart of the plan (emphasis mine, on the actual plan parts of the plan):
Analytics and microtargeting aren’t just for retailers and politicians — they can help us grow the ranks of executive women and close the gender wage gap. Employers analyze who clicked on internal job postings, and we can pursue qualified women who looked but never applied. We can go beyond analyzing the salary and rank histories of women who have left our companies. We can use big data analytics to tell us what exit interviews don’t.
Facebook posts, Twitter feeds and LinkedIn groups provide a trove of valuable intel from ex-employees. What they write is blunt, candid and useful. All the data is there for the taking — we just have to collect it and figure out what it means. We can delve deep into whether we’re promoting the best people, whether we’re doing enough to keep our ranks diverse, whether potential female leaders are being left behind and, importantly, why.
That’s about it; after that she goes back to her niece.
Here’s the thing: I’m not saying it’s not an important topic, but that plan doesn’t seem worthy of the title of the piece. It’s super vague and fluffy and meaningless. If I had to give it meaning, it would be that she’s proposing to understand internal corporate sexism using data, rather than assuming “data is objective” and that all models will make things better. That’s one tiny step, but it’s not much. It’s really not enough.
Here’s an idea, and it kind of uses big data, or at least small data, so we might be able to sell it. Ask people in your corporate structure what the actual characteristics are of the people they promote, and how those characteristics are measured (or whether they are measured at all). Then look at the data to see if what they say is consistent with what they do, and whether those characteristics are inherently sexist. It’s a very specific plan, and no fancy mathematical techniques are necessary, but we don’t have to tell anyone that.
What combats sexism is a clear, transparent description of job requirements and a willingness to follow through. Look at blind orchestra auditions for a success story there. By contrast, my experience with the corporate world is that, when hiring or promoting, companies often list a long series of unmeasurable but critical properties like “good cultural fit” and “leadership qualities” that, for whatever reason, more men are rated highly on than women.
Recently I’ve seen two very different versions of what a more data-driven Congress would look like, both emerging from the recent cruddy Cromnibus bill mess.
First, there’s this Bloomberg article, written by the editors, about using data to produce evidence on whether a given policy is working or not. Given what I know about how data is produced, and how definitions of success are politically manipulated, I don’t have much hope for this idea.
Second, there was a reader’s comment on this New York Times article, also about the Cromnibus bill. Namely, the reader was calling on the New York Times not only to explore a few facts about what was contained in the bill, but to lay it out with more numbers and more consistency. I think this is a great idea. What if, when Congress gave us a shitty bill, we could see stuff like:
- how much money is allocated to each thing, both raw dollars and as a percentage of the whole bill,
- who put it in the omnibus bill,
- the history of that proposed spending, and the history of voting,
- which lobbyists were pushing it, and who gets paid by them, and ideally
- all of this presented in an easy-to-use interactive.
That’s the kind of data that I’d love to see. Data journalism is an emerging field, and we might not be there yet, but it’s something to strive for.
As I wrote about already, last Friday I attended a one-day workshop in Montreal called FATML: Fairness, Accountability, and Transparency in Machine Learning. It was part of the NIPS conference for computer science, and there were tons of nerds there, and I mean tons. I wanted to give a report on the day, as well as some observations.
First of all, I am super excited that this workshop happened at all. When I left my job at Intent Media in 2011 with the intention of studying these questions and eventually writing a book about them, they were, as far as I know, on nobody else’s radar. Now, thanks to the organizers Solon and Moritz, there are communities of people from law, computer science, and policy circles coming together to exchange ideas and strategies to tackle the problems. This is what progress feels like!
OK, so on to what the day contained and my copious comments.
Sadly, I missed the first two talks, and an introduction to the day, because of two airplane cancellations (boo American Airlines!). I arrived in the middle of Hannah Wallach’s talk, the abstract of which is located here. Her talk was interesting, and I liked her idea of having social scientists partnered with data scientists and machine learning specialists, but I do want to mention that, although there’s a remarkable history of social scientists working within tech companies – say at Bell Labs and Microsoft and such – we don’t see that in finance at all, nor does it seem poised to happen. So in other words, we certainly can’t count on social scientists to be on hand when important mathematical models are getting ready for production.
Also, I liked Hannah’s three categories of models: predictive, explanatory, and exploratory. Even though I don’t necessarily think that a given model will fall neatly into one category or the other, they still give you a way to think about what we do when we make models. As an example, we think of recommendation models as ultimately predictive, but they are (often) predicated on the ability to understand people’s desires as made up of distinct and consistent dimensions of personality (like when we use PCA or something equivalent). In this sense we are also exploring how to model human desire and consistency. For that matter I guess you could say any model is at its heart an exploration into whether the underlying toy model makes any sense, but that question is dramatically less interesting when you’re using linear regression.
Anupam Datta and Michael Tschantz
An issue I enjoyed talking about was brought up in this talk, namely the question of whether such a finding is entirely evanescent or whether we can call it “real.” Since Google constantly updates its algorithm, and since ad budgets are coming and going, even the same experiment performed an hour later might have different results. In what sense can we then call any such experiment statistically significant or even persuasive? Also, IRL we don’t have clean browsers, so what happens when we have dirty browsers and we’re logged into Gmail and Facebook? By then there are so many variables it’s hard to say what leads to what, but should that make us stop trying?
From my perspective, I’d like to see more research into questions like, of the top 100 advertisers on Google, who saw the majority of the ads? What was the economic, racial, and educational makeup of those users? A similar but different (because of the auction) question would be to reverse-engineer the advertisers’ Google ad targeting methodologies.
Finally, the speakers mentioned a failure of transparency on Google’s part. In your advertising profile, for example, you cannot see (and therefore cannot change) your marital status, but advertisers can target you based on that variable.
Sorelle Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian
Next up we had Sorelle talk to us about her work with two guys with enormous names. They think about how to make stuff fair, the heart of the question of this workshop.
First, if we included race in a resume-sorting model, we’d probably see negative impact because of historical racism. Even if we removed race but included other attributes correlated with race (say, zip code), this effect would remain. It’s hard to know exactly when we’ve removed the relevant attributes, but one thing these guys did was define that precisely.
Second, say you now have some idea of the categories that are given unfair treatment — what can you do? One thing Sorelle et al. suggest is to first rank people within each category — to assign each person a percentile in their given category — and then to apply the “forgetful function” and consider only that percentile. So, if we decided at a math department that we wanted 40% of our graduate students to be women, we’d independently rank the men and the women, and then offer enough spots to the top women and, separately, to the top men to meet our quotas. Note that, although it comes from a pretty fancy setting, this is essentially affirmative action. That’s not, in my opinion, an argument against it. It’s in fact yet another argument for it: if we know women are systemically undervalued, we have to fight against that somehow, and this seems like the best and simplest approach.
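The mechanics of that scheme are simple enough to sketch in a few lines. Everything below (names, scores, quotas) is invented for illustration, and the real proposal is defined much more carefully:

```python
# Sketch of the "forgetful" percentile ranking described above: rank
# candidates within each group, keep only the within-group percentile,
# then admit the top slice of each group to hit a target mix.

def within_group_percentiles(candidates):
    """candidates: list of (name, group, raw_score).
    Returns {name: percentile within its own group} (top person ~ 1.0)."""
    by_group = {}
    for name, group, score in candidates:
        by_group.setdefault(group, []).append((score, name))
    percentiles = {}
    for group, members in by_group.items():
        members.sort(reverse=True)  # best score first
        n = len(members)
        for rank, (_, name) in enumerate(members):
            percentiles[name] = 1.0 - rank / n
    return percentiles

def admit(candidates, quotas):
    """quotas: {group: number of offers}. Admit the top of each group
    independently, using only the within-group percentiles (the raw
    scores are 'forgotten' at this stage)."""
    pct = within_group_percentiles(candidates)
    offers = []
    for group, k in quotas.items():
        pool = [(pct[name], name) for name, g, _ in candidates if g == group]
        pool.sort(reverse=True)
        offers.extend(name for _, name in pool[:k])
    return sorted(offers)
```

The point of the forgetful step is that candidates are never compared across groups on raw scores, only on standing within their own group.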
Ed Felten and Josh Kroll
After lunch Ed Felten and Josh Kroll jointly described their work on making algorithms accountable. Basically, they suggested a trustworthy, encrypted system of paper trails that would support a given algorithm (it doesn’t really matter which) and create verifiable proofs that the algorithm was used faithfully and fairly in a given situation. Of course, we’d only really consider an algorithm to be used “fairly” if the algorithm itself is fair, but putting that aside, this addresses the question of whether the same algorithm was used for everyone, and things like that. In lawyer speak, this is called “procedural fairness.”
So for example, if we thought we could, we might want to run the algorithm for punishment for drug use through this system, and we might find that the rules are applied differently to different people. This system would catch that kind of problem, at least ideally.
David Robinson and Harlan Yu
Next up we talked to David Robinson and Harlan Yu about their work in Washington D.C. with policy makers and civil rights groups around machine learning and fairness. These two have been active with civil rights groups and were an important part of both the Podesta Report, which I blogged about here, and the drafting of the Civil Rights Principles of Big Data.
The question of what policy makers understand and how to communicate with them came up several times in this discussion. We decided that, to combat the cherry-picked examples we see in Congressional subcommittee meetings, we need cherry-picked examples of our own to illustrate what can go wrong. That sounds bad, but put it another way: people respond to stories, especially stories with innocent victims who have been wronged. So we are on the lookout.
Closing panel with Rayid Ghani and Foster Provost
I was on the closing panel with Rayid Ghani and Foster Provost, and we each had a few minutes to speak and then there were lots of questions and fun arguments. To be honest, since I was so in the moment during this panel, and also because I was jonesing for a beer, I can’t remember everything that happened.
As I remember, Foster talked about an algorithm he had created that does its best to “explain” the decisions of a complicated black box algorithm. So in real life our algorithms are really huge and messy and uninterpretable, but this algorithm does its part to add interpretability to the outcomes of that huge black box. The example he gave was to understand why a given person’s Facebook “likes” made a black box algorithm predict they were gay: by displaying, in order of importance, which likes added the most predictive power to the algorithm.
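To make the idea concrete, here is a hedged sketch of that kind of explanation — not Foster’s actual algorithm, just the simplest version of the general idea: rank each “like” by how much the black box’s predicted score drops when that like is removed. The black box here is a toy stand-in.

```python
# Leave-one-out attribution sketch: which inputs contribute most to a
# black-box prediction? (A simplification for illustration, not the
# actual published method.)

def explain_likes(black_box, likes):
    """black_box: any function from a list of likes to a score.
    Returns (contribution, like) pairs, largest contribution first."""
    base = black_box(likes)
    contributions = []
    for like in likes:
        without = [l for l in likes if l != like]
        contributions.append((base - black_box(without), like))
    return sorted(contributions, reverse=True)
```

Note that for a person with very few likes (the case in my aside below), there is simply very little to rank, which is part of why I find that situation puzzling.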
[Aside, can anyone explain to me what happens when such an algorithm comes across a person with very few likes? I’ve never understood this very well. I don’t know about you, but I have never “liked” anything on Facebook except my friends’ posts.]
Rayid talked about his work trying to develop a system for teachers to understand which students were at risk of dropping out, and for that system to be fair, and he discussed the extent to which that system could or should be transparent.
Oh yeah, and that reminds me that, after describing my book, we had a pretty great argument about whether credit scoring models should be open source, and what that would mean, and what feedback loops that would engender, and who would benefit.
Altogether a great day, and a fantastic discussion. Thanks again to Solon and Moritz for their work in organizing it.
As many thoughtful people have pointed out already, Eric Garner’s case proves that video evidence is not a magic bullet to combat and punish undue police brutality. The Grand Jury deemed such evidence insufficient for an indictment, even if the average person watching the video cannot understand that point of view.
Even so, it would be a mistake to dismiss video cameras on police as entirely a bad idea. We shouldn’t assume no progress could be made simply because there’s an example which lets us down. I am no data evangelist, but neither am I someone who dismisses data. It can be powerful and we should use its power when we can.
And before I try to make the general case for video cameras on cops, let me make one other point. The Eric Garner video has already made progress in one arena, namely public opinion. Without the video, we wouldn’t be seeing nationwide marches protesting the outrageous police conduct.
A few of my data nerd thoughts:
- If cops were required to wear cameras, we’d have more data. We should think of that as building evidence, with the potential to use it to sway grand juries, criminal juries, judges, or public opinion.
- One thing I said time after time to my students this summer at the data journalism program I directed is the following: a number by itself is usually meaningless. What we need is to compare that number to a baseline. The baseline could be the average number for a population, or the median, or some range of 5th to 95th percentiles, or how it’s changed over time, or whatnot. But in order to gauge any baseline you need data.
- So in the case of police videotapes, we’d need to see how cops usually handle a situation, or how cops from other precincts handle similar situations, or the extremes of procedures in such situations, or how police have changed their procedures over time. And if we think the entire approach is heavy handed, we can also compare the data to the police manual, or to other countries, or what have you. More data is better for understanding aggregate approaches, and aggregate understanding makes it easier to fit a given situation into context.
- Finally, cops might also change their behavior when they are policing, knowing they are being taped. That’s believable, but we shouldn’t depend on it.
- And we have to be super careful about how we use video evidence, making sure it isn’t incredibly biased by careful and unfair selectivity on the part of the police. Already, some cops are getting in trouble for turning off their cameras at critical moments, or for never turning them on at all.
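The baseline point above can be made concrete in a few lines of Python; all numbers here are invented for illustration (say, incidents per precinct per month):

```python
# A number by itself is meaningless; place it against the distribution.
import statistics

def baseline_report(value, population):
    """Compare one observation to a population of comparable numbers.
    Returns the population mean, median, and the percentile of value."""
    n = len(population)
    below = sum(1 for v in population if v < value)
    return {
        "mean": statistics.mean(population),
        "median": statistics.median(population),
        "percentile": 100.0 * below / n,
    }
```

So a precinct reporting 8 incidents against a city-wide spread of 2 to 11 sits around the 60th percentile — unremarkable — whereas the same 8 against a spread of 0 to 3 would be an outlier worth investigating. Same number, opposite conclusions, and only the baseline tells you which.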
Let’s take a step back and think about how large-scale data collection and mining works, for example in online advertising. A marketer collects a bunch of data. Knowing a lot about one person doesn’t necessarily help them, but if they know a lot about most people, it statistically speaking does help them sell stuff. A given person might not be in the mood to buy, or might be broke, but if you dangle desirable goods in front of a whole slew of people, you make sales. It’s a statistical play which, generally speaking, works.
In this case, we are the marketer, and the police are the customers. We want a lot of information about how they do their job so when the time comes we have some sense of “normal police behavior” and something to compare a given incident to or a given cop to. We want to see how they do or don’t try to negotiate peace, and with whom. We want to see the many examples of good and great policing as well as the few examples of terrible, escalating policing.
Taking another step back, if the above analogy seems weird, there’s a reason for that. In general data is being collected on the powerless, on the consumers, on the citizens, or the job applicants, and we should be pushing for more and better data to be collected instead on the powerful, on the police, on the corporations, and on the politicians. There’s a reason there is a burgeoning privacy industry for rich and powerful people.
For example, we want to know how many people have been killed by the police, but even a statistic that important is incredibly hard to come by (see this and this for more on that issue). Meanwhile, it’s never been easier for the police to collect data on us and act on suspicions of troublemakers, however that is defined.
Another example – possibly the most extreme example of all – comes this very week from the reports on the CIA and torture. That is data and evidence we should have gotten much earlier, and as the New York Times demands, we should be able to watch videos of waterboarding and decide for ourselves whether it constitutes torture.
So yes, let’s have video cameras on every cop. It is not a panacea, and we should not expect it to solve our problems overnight. In fact video evidence, by itself, will not solve any problem. We should think of it as a mere evidence-collecting device, and use it in the public discussion of how the most powerful among us treat the least powerful. But more evidence is better.
Finally, there’s the very real question of who will have access to the video footage, and whether the public will be allowed to see it at all. It’s a tough question, which will take a while to sort out (FOIL requests!), but until then, everyone should know that it is perfectly legal to videotape police in every place in this country. So go ahead and make a video with your camera when you suspect weird behavior.
I’ve got two girls in middle school. They are lovely and (in my opinion as a proud dad) smart. I wonder, on occasion, what college they will go to and what their higher education experience will be like. No matter how lovely or smart my daughters are, though, it will be hard to fork over all of that tuition money. It sure would be nice if college somehow got cheaper by the time my daughters are ready in 6 or 8 years!
How likely is this? There has been plenty of coverage about how the cost of college has risen so dramatically over the past decades. A number of smart people have argued that the reason tuition has increased so much is all of the amenities that schools have built in recent years. Others are unconvinced that’s the reason, pointing out that spending by universities grew at a lower rate than tuition did. Perhaps schools have been buoyed by a rising demographic trend – but it’s clear tuition increases have had a great run.
One way colleges have been able to keep increasing tuition is by competing aggressively for wealthy students who can pay the full price of tuition (which also enables the schools to offer more aid to less-than-wealthy students). The children of the wealthy overseas are particularly desirable targets, apparently. I heard a great quote yesterday about this from Brad DeLong – that his school, Berkeley, and presumably other top universities had become “finishing school[s] for the superrich of Asia.” It’s an odd sort of competition, though, where schools compete for a particular customer (wealthy students) by raising prices. Presumably, this suggests that colleges have had pricing power to raise tuition due to increased demand (perhaps aided by the increase in student loans, but that’s an argument for another day).
Will colleges continue to have this pricing power? For the optimistic future tuition payer, there are some signs that university pricing power may be eroding. Tuition increased at a slower rate this year (a bit more than 3%) but still at a rate that well exceeds inflation. And law schools are already resorting to price cutting after precipitous declines in applications – down 37% in 2014 compared to 2010!
College enrollment trends are a mixed bag and frequently obscured by studies from in-industry sources. Clearly, the 1990s and 2000s were a time of great growth for colleges – college enrollment grew by 48% from 1990 (12 million students) to 2012 (17.7 million). But 2010 appears to be the recent peak, and enrollment fell by 2% from 2010 to 2012. In addition, overall college enrollment declined by 2.3% in 2014, although this decline is attributed to the 9.6% decline in two-year colleges, while 4-year college enrollment actually increased by 1.2%.
It makes sense that the recent college enrollment trend would be down – the number of high school graduates appears to have peaked in 2010 at 3.3 million or so and is projected to decline to about 3.1 million in 2016 and stay lowish for the next few years. The US Census reports that there was a bulge of kids that are college age now (i.e. there were 22.04 million 14-19 year olds at the 2010 Census), but there are about 1.7 million fewer kids that are my daughters’ age (i.e., 5-9 year olds in the 2010 Census). That’s a pretty steep drop off (about 8%) in this pool of potential college students. These demographic trends have got some people worried. Moody’s, which rates the debt of a lot of colleges, has been downgrading a lot of smaller schools and says that this type of school has already been hit by declining enrollment and revenue. One analyst went so far as to warn of a “death spiral” at some schools due to declining enrollment. Moody’s analysis of declining revenue is an interesting factor, in light of reports of ever-increasing tuition. Last year Moody’s reported that 40% of colleges or universities (that were rated) faced stagnant or declining net tuition revenue.
Speaking strictly, again, as a future payer of my daughters’ college tuition, falling college age population and falling enrollment would seem to point to the possibility that tuition will be lower for my kids when the time comes. Plus there are a lot of other factors that seem to be lining up against the prospects for college tuition – like continued flat or declining wages, the enormous student loan bubble (it can’t keep growing, right?), the rise of online education…
And yet, I’m not feeling that confident. Elite universities (and it certainly would be nice if my girls could get into such a school) seem to have found a way to collect a lot of tuition from foreign students (it’s hard to find a good data source for that, though), which protects them from the adverse demographic and economic trends. I’ve wondered if US students could get turned off by the perception that top US schools have too many foreign students and are too much, as DeLong says, elite finishing schools. But that’s hard to predict and may take many years to reach a tipping point. Plus, if tuition and enrollment drop a lot, that may cripple the schools that have taken out a lot of debt to build all of those nice amenities. A Harvard Business School professor rather bearishly projects that as many as half of the 4,000 US colleges and universities may fail in the next 15 years. Would a sharp decrease in the number of colleges due to falling enrollment have the effect of reducing competition at the remaining schools? If so, what impact would that have on tuition?
Both college tuition and student loans have been described as bubbles thanks to their recent rate of growth. At some point, bubbles burst (in theory). As someone who watched, first hand and with great discomfort, the growth of the subprime and housing bubbles before the crisis, I’ve painfully learned that bubbles can last much longer than you would rationally expect. And despite all sorts of analysis and calculation about what should happen, the thing that triggers the bursting of the bubble is really hard to predict. As is when it will happen. To the extent I’ve learned a lesson from mortgage land, it’s that you shouldn’t do anything stupid in anticipation of the bubble either bursting or continuing. So, as much as I hope and even expect that the trend for increased college tuition will reverse in the coming years, I guess I’ll have to keep on trying to save for when my daughters will be heading off to college.
…the McAuliffe campaign invested heavily in both the data and the creative sides to ensure it could target key voters with specialized messages. Over the course of the campaign, he said, it reached out to 18 to 20 targeted voter groups, with nearly 4,000 Facebook ads, more than 300 banner display ads, and roughly three dozen different pre-roll ads — the ads seen before a video plays — on television and online.
Now I want you to close your eyes and imagine what kind of numbers we will see for the current races, not to mention the upcoming presidential election.
What’s crazy to me about the Times article is that it never questions the implications of this movement. The biggest problem, it seems, is that the analytics have surpassed the creative work of making ads: there are too many segments of populations to tailor the political message to, and not enough marketers to massage those particular messages for each particular segment. I’m guessing that there will be more money and more marketers in the presidential campaign, though.
Translation: politicians can and will send different messages to individuals on Facebook, depending on what they think we want to hear. Not that politicians follow through with all their promises now – they don’t, of course – but imagine what they will say when they can make a different promise to each group. We will all be voting for slightly different versions of a given story. We won’t even know when the politician is being true to their word – which word?
This isn’t the first manifestation of different messages to different groups, of course. Romney’s “47%” speech was a notorious example of tailored messaging to super-rich donors. But on the other hand, it was secretly recorded by a bartender working the event. There will be no such bartenders around when people read their emails and see ads on Facebook.
I’m not the only person worried about this. For example, ProPublica studied this in Obama’s last campaign (see this description). But given the scale of the big data political ad operations now in place, there’s no way they – or anyone, really – can keep track of everything going on.
There are lots of ways that “big data” is threatening democracy. Most of the time, it’s by removing open discussions of how we make decisions and handing them to anonymous and inaccessible quants; think evidence-based sentencing or value-added modeling for teachers. But these political campaign ads are a more direct attack on the concept of a well-informed public choosing its leaders.
Greetings fellow Mathbabers! At Cathy’s invitation, I am writing here about NYCTaxi.info, a public service web app my co-founder and I have developed. It overlays estimated taxi activity on a Google map around you, shown as the expected number of passenger pickups and dropoffs in the current hour. We modeled these estimates from the recently released 2013 NYC taxi trips dataset comprising 173 million trips, the same dataset that Cathy’s post last week on deanonymization referenced. Our work will not help you stalk your favorite NYC celebrity, but it may guide your search for a taxi and save you some commute time. My writeup below takes you through the four broad stages our work proceeded through: data extraction and cleaning, clustering, modeling, and visualization.
We extract three columns from the data: the longitude and latitude GPS coordinates of the passenger pickup or dropoff location, and the timestamp. We make no distinction between pickups and dropoffs, since both of these events imply an available taxicab at that location. The data was generally clean, with a very small fraction of a percent of coordinates looking bad, e.g. in the middle of the Hudson River. These coordinate errors get screened out by the clustering step that follows.
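As an aside, a crude bounding-box filter could also catch the obviously bad coordinates at extraction time (in our pipeline the clustering step handles them anyway). The tuple layout and bounding box below are illustrative assumptions, not our exact code:

```python
# Screen out implausible GPS coordinates, e.g. points in the middle of
# the Hudson River far outside NYC, or (0, 0) GPS junk.
NYC_BOUNDS = (-74.30, 40.45, -73.65, 41.00)  # (min_lon, min_lat, max_lon, max_lat)

def screen(rows, bounds=NYC_BOUNDS):
    """rows: iterable of (lon, lat, timestamp). Yield plausible rows."""
    min_lon, min_lat, max_lon, max_lat = bounds
    for lon, lat, ts in rows:
        if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
            yield (lon, lat, ts)
```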
We cluster the pickup and dropoff locations into areas of high density, i.e. where many pickups and dropoffs happen, to determine where on the map it is worth making and displaying estimates of taxi activity. We rolled our own algorithm, a variation on heatmap generation, after finding existing clustering algorithms such as K-means unsuitable—we are seeking centroids of areas of high density rather than cluster membership per se. See figure below which shows the cluster centers as identified by our algorithm on a square-mile patch of Manhattan. The axes represent the longitude and latitude of the area; the small blue crosses a random sample of pickups and dropoffs; and the red numbers the identified cluster centers, in descending order of activity.
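Here is a minimal grid-density sketch of this kind of centroid finding — our actual algorithm differs in its details, and the cell size below is an arbitrary choice for illustration:

```python
# Grid-density centroid finding: bin points into a fine lon/lat grid,
# then keep cells that are denser than all eight neighboring cells,
# returned in descending order of activity.
from collections import Counter

def density_centers(points, cell=0.001):
    """points: list of (lon, lat). Returns [(count, lon, lat), ...]
    for grid cells that are local density maxima."""
    counts = Counter((round(lon / cell), round(lat / cell)) for lon, lat in points)
    centers = []
    for (i, j), c in counts.items():
        neighbors = [counts.get((i + di, j + dj), 0)
                     for di in (-1, 0, 1) for dj in (-1, 0, 1)
                     if (di, dj) != (0, 0)]
        if c > max(neighbors):
            centers.append((c, i * cell, j * cell))
    return sorted(centers, reverse=True)
```

This also shows why the stray Hudson River coordinates wash out: an isolated bad point forms a cell with a count of one, far down the activity ranking, so no estimate gets made or displayed there.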
We then model taxi activity at each cluster. We discretize time into hourly intervals—for each cluster, we sum all pickups and dropoffs that occur each hour in 2013. So our datapoints now are triples of the form [<cluster>, <hour>, <activity>], with <hour> being some hour in 2013 and <activity> being the number of pickups and dropoffs that occurred in hour <hour> in cluster <cluster>. We then regress each <activity> against neighboring clusters’ and neighboring times’ <activity> values. This regression serves to smooth estimates across time and space, smoothing out effects of special events or weather in the prior year that don’t repeat this year. It required some tricky choices about arranging and aligning the various data elements; not technically difficult, or maybe even interesting, but it would nevertheless likely take the better part of an hour at a whiteboard to explain. In other words, typical data science. We then extrapolate these predictions to 2014 by mapping each hour in 2014 to the most similar hour in 2013. So we now have a prediction at each cluster location, for each hour in 2014, of the number of passenger pickups and dropoffs.
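A stripped-down sketch of the aggregation and smoothing: the real model regresses each [<cluster>, <hour>] activity against neighboring clusters and hours, but even a simple blend with adjacent hours shows the shape of the computation. The weight and structure below are illustrative, not our production choices:

```python
# Aggregate events into hourly counts per cluster, then smooth each
# count toward the mean of its temporal neighbors (a stand-in for the
# full space-and-time regression described above).
from collections import Counter

def hourly_counts(events):
    """events: list of (cluster_id, hour_index) pairs, one per pickup
    or dropoff. Returns a Counter keyed by (cluster_id, hour_index)."""
    return Counter(events)

def smooth(counts, cluster, hours, weight=0.5):
    """Blend each hour's count with the mean of its adjacent hours."""
    out = {}
    for h in hours:
        c = counts.get((cluster, h), 0)
        nbrs = [counts.get((cluster, h + d), 0) for d in (-1, 1)]
        out[h] = weight * c + (1 - weight) * sum(nbrs) / len(nbrs)
    return out
```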
We display these predictions by overlaying them on a Google map at the corresponding cluster locations. We round <activity> to values like 20 or 30 to avoid giving users number dyslexia. We color the labels based on these values, using black-body radiation color temperatures for the color scale, as that is one of two color scales where the ordering of change is perceptually intuitive.
If you live in New York, we hope you find NYCTaxi.info useful. Regardless, we look forward to receiving any comments.