
Archive for the ‘data science’ Category

Will Demographics Solve the College Tuition Problem? (A: I Don’t Know)

November 14, 2014

I’ve got two girls in middle school. They are lovely and (in my opinion as a proud dad) smart. I wonder, on occasion, what college they will go to and what their higher education experience will be like. No matter how lovely or smart my daughters are, though, it will be hard to fork over all of that tuition money. It sure would be nice if college somehow got cheaper by the time my daughters are ready in 6 or 8 years!

How likely is this? There has been plenty of coverage about how the cost of college has risen so dramatically over the past decades. A number of smart people have argued that tuition has increased so much because of all of the amenities that schools have built in recent years. Others are unconvinced that’s the reason, pointing out that university spending grew at a lower rate than tuition did. Perhaps schools have been buoyed by a rising demographic trend – but it’s clear tuition increases have had a great run.

One way colleges have been able to keep increasing tuition is by competing aggressively for wealthy students who can pay the full price of tuition (which also enables the schools to offer more aid to less-than-wealthy students). The children of the wealthy overseas are particularly desirable targets, apparently. I heard a great quote about this yesterday from Brad DeLong – that his school, Berkeley, and other top universities presumably had become “finishing school[s] for the superrich of Asia.” It’s an odd sort of competition, though, where schools compete for a particular customer (wealthy students) by raising prices. Presumably, this suggests that colleges have had the pricing power to raise tuition due to increased demand (perhaps aided by an increase in student loans, but that’s an argument for another day).

Will colleges continue to have this pricing power?  For the optimistic future tuition payer, there are some signs that university pricing power may be eroding.   Tuition increased at a slower rate this year (a bit more than 3%) but still at a rate that well exceeds inflation.   And law schools are already resorting to price cutting after precipitous declines in applications – down 37% in 2014 compared to 2010!

College enrollment trends are a mixed bag and frequently obscured by studies from in-industry sources. Clearly, the 1990s and 2000s were a time of great growth for colleges – college enrollment grew by 48% from 1990 (12 million students) to 2012 (17.7 million). But 2010 appears to be the recent peak, and enrollment fell by 2% from 2010 to 2012. In addition, overall college enrollment declined by 2.3% in 2014, although this decline is attributed to a 9.6% decline at two-year colleges, while four-year college enrollment actually increased by 1.2%.

It makes sense that the recent college enrollment trend would be down – the number of high school graduates appears to have peaked in 2010 at 3.3 million or so and is projected to decline to about 3.1 million in 2016 and stay lowish for the next few years. The US Census reports that there was a bulge of kids that are college age now (i.e. there were 22.04 million 14-19 year olds at the 2010 Census), but there are about 1.7 million fewer kids that are my daughters’ age (i.e., 5-9 year olds in the 2010 Census).  That’s a pretty steep drop off (about 8%) in this pool of potential college students.  These demographic trends have got some people worried.  Moody’s, which rates the debt of a lot of colleges, has been downgrading a lot of smaller schools and says that this type of school has already been hit by declining enrollment and revenue. One analyst went so far as to warn of a “death spiral” at some schools due to declining enrollment.  Moody’s analysis of declining revenue is an interesting factor, in light of reports of ever-increasing tuition. Last year Moody’s reported that 40% of colleges or universities (that were rated) faced stagnant or declining net tuition revenue.

Speaking strictly, again, as a future payer of my daughters’ college tuition, a falling college-age population and falling enrollment would seem to point to the possibility that tuition will be lower for my kids when the time comes. Plus there are a lot of other factors that seem to be lining up against the prospects for college tuition – like continued flat or declining wages, the enormous student loan bubble (it can’t keep growing, right?), the rise of online education…

And yet, I’m not feeling that confident.  Elite universities (and it certainly would be nice if my girls could get into such a school) seem to have found a way to collect a lot of tuition from foreign students (it’s hard to find a good data source for that though) which protects them from the adverse demographic and economic trends.  I’ve wondered if US students could get turned off by the perception that top US schools have too many foreign students and are too much, as Delong says, elite finishing schools.  But that’s hard to predict and may take many years to reach a tipping point.  Plus if tuition and enrollment drop a lot, that may cripple the schools that have taken out a lot of debt to build all of those nice amenities. A Harvard Business School professor rather bearishly projects that as many as half of the 4,000 US colleges and universities may fail in the next 15 years.  Would a sharp decrease in the number of colleges due to falling enrollment have the effect of reducing competition at the remaining schools?  If so, what impact would that have on tuition?

Both college tuition and student loans have been described as bubbles thanks to their recent rate of growth.  At some point, bubbles burst (in theory).  As someone who watched, first hand and with great discomfort, the growth of the subprime and housing bubbles before the crisis, I’ve painfully learned that bubbles can last much longer than you would rationally expect.  And despite all sorts of analysis and calculation about what should happen, the thing that triggers the bursting of the bubble is really hard to predict. As is when it will happen.  To the extent I’ve learned a lesson from mortgage land, it’s that you shouldn’t do anything stupid in anticipation of the bubble either bursting or continuing.  So, as much as I hope and even expect that the trend for increased college tuition will reverse in the coming years, I guess I’ll have to keep on trying to save for when my daughters will be heading off to college.

Categories: data science, education

Tailored political ads threaten democracy

Not sure if you saw this recent New York Times article on the new data-driven political ad machines. Consider, for example, the 2013 Virginia governor’s race won by Terry McAuliffe:

…the McAuliffe campaign invested heavily in both the data and the creative sides to ensure it could target key voters with specialized messages. Over the course of the campaign, he said, it reached out to 18 to 20 targeted voter groups, with nearly 4,000 Facebook ads, more than 300 banner display ads, and roughly three dozen different pre-roll ads — the ads seen before a video plays — on television and online.

Now I want you to close your eyes and imagine what kind of numbers we will see for the current races, not to mention the upcoming presidential election.

What’s crazy to me about the Times article is that it never questions the implications of this movement. The biggest problem, it seems, is that the analytics have surpassed the creative work of making ads: there are too many segments of populations to tailor the political message to, and not enough marketers to massage those particular messages for each particular segment. I’m guessing that there will be more money and more marketers in the presidential campaign, though.

Translation: politicians can and will send different messages to individuals on Facebook, depending on what they think we want to hear. Not that politicians follow through with all their promises now – they don’t, of course – but imagine what they will say when they can make a different promise to each group. We will all be voting for slightly different versions of a given story. We won’t even know when the politician is being true to their word – which word?

This isn’t the first manifestation of different messages to different groups, of course. Romney’s “47%” speech was a famous example of messaging tailored to super-rich donors. But on the other hand, it was secretly recorded by a bartender working the event. There will be no such bartenders around when people read their emails and see ads on Facebook.

I’m not the only person worried about this. For example, ProPublica studied this in Obama’s last campaign (see this description). But given the scale of the big data political ad operations now in place, there’s no way they – or anyone, really – can keep track of everything going on.

There are lots of ways that “big data” is threatening democracy. Most of the time, it’s by removing open discussions of how we make decisions and giving them to anonymous and inaccessible quants; think evidence-based sentencing or value-added modeling for teachers. But these tailored political ads are a more direct attack on the concept of a well-informed public choosing its leaders.

Categories: data science, modeling, rant

Guest post: Clustering and predicting NYC taxi activity

This is a guest post by Deepak Subburam, a data scientist who works at Tessellate.

[Screenshot from NYCTaxi.info]

Greetings fellow Mathbabers! At Cathy’s invitation, I am writing here about NYCTaxi.info, a public service web app my co-founder and I have developed. It overlays estimated taxi activity, as the expected number of passenger pickups and dropoffs in the current hour, on a Google map of the area around you. We modeled these estimates from the recently released 2013 NYC taxi trips dataset comprising 173 million trips, the same dataset that Cathy’s post last week on deanonymization referenced. Our work will not help you stalk your favorite NYC celebrity, but it will guide your search for a taxi and maybe save some commute time. My writeup below shall take you through the four broad stages our work proceeded through: data extraction and cleaning, clustering, modeling, and visualization.

We extract three columns from the data: the longitude and latitude GPS coordinates of the passenger pickup or dropoff location, and the timestamp. We make no distinction between pickups and dropoffs, since both of these events imply an available taxicab at that location. The data was generally clean, with a very small fraction of a percent of coordinates looking bad, e.g. in the middle of the Hudson River. These coordinate errors get screened out by the clustering step that follows.
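To make this concrete, here is a minimal pandas sketch of the extraction and screening step. The column names, file path, and bounding box below are illustrative assumptions rather than our production code:

```python
import pandas as pd

# Rough bounding box around NYC; anything outside it (e.g. a point in the
# middle of the Hudson River, or 0,0) is treated as a bad coordinate.
LON_MIN, LON_MAX = -74.30, -73.65
LAT_MIN, LAT_MAX = 40.45, 40.95

def load_events(path):
    trips = pd.read_csv(path, usecols=[
        "pickup_datetime", "pickup_longitude", "pickup_latitude",
        "dropoff_datetime", "dropoff_longitude", "dropoff_latitude",
    ], parse_dates=["pickup_datetime", "dropoff_datetime"])

    # Treat pickups and dropoffs identically: both imply an available cab there.
    pickups = trips[["pickup_datetime", "pickup_longitude", "pickup_latitude"]].copy()
    dropoffs = trips[["dropoff_datetime", "dropoff_longitude", "dropoff_latitude"]].copy()
    pickups.columns = ["timestamp", "lon", "lat"]
    dropoffs.columns = ["timestamp", "lon", "lat"]
    events = pd.concat([pickups, dropoffs], ignore_index=True)

    # Screen out the small fraction of clearly bad coordinates.
    ok = events.lon.between(LON_MIN, LON_MAX) & events.lat.between(LAT_MIN, LAT_MAX)
    return events[ok]
```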

We cluster the pickup and dropoff locations into areas of high density, i.e. where many pickups and dropoffs happen, to determine where on the map it is worth making and displaying estimates of taxi activity. We rolled our own algorithm, a variation on heatmap generation, after finding existing clustering algorithms such as K-means unsuitable—we are seeking centroids of areas of high density rather than cluster membership per se. See the figure below, which shows the cluster centers identified by our algorithm on a square-mile patch of Manhattan. The axes represent the longitude and latitude of the area; the small blue crosses, a random sample of pickups and dropoffs; and the red numbers, the identified cluster centers, in descending order of activity.

Taxi activity clusters
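For readers who want a feel for this step, here is a minimal grid-based sketch in the spirit of that heatmap variation. It is not our actual algorithm, and the cell size and count threshold are made-up values:

```python
import numpy as np

def heatmap_centers(lons, lats, cell=0.001, min_count=500):
    # Bin events onto a fine lon/lat grid (~100m cells at NYC's latitude).
    lon_bins = np.arange(lons.min(), lons.max() + cell, cell)
    lat_bins = np.arange(lats.min(), lats.max() + cell, cell)
    counts, _, _ = np.histogram2d(lons, lats, bins=[lon_bins, lat_bins])

    centers = []
    for i in range(1, counts.shape[0] - 1):
        for j in range(1, counts.shape[1] - 1):
            c = counts[i, j]
            # Keep a cell as a center if it clears the threshold and is a
            # local maximum of its 3x3 neighborhood.
            if c >= min_count and c == counts[i-1:i+2, j-1:j+2].max():
                centers.append((c, lon_bins[i] + cell / 2, lat_bins[j] + cell / 2))

    # Return centers in descending order of activity, as in the figure above.
    return sorted(centers, reverse=True)
```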

We then model taxi activity at each cluster. We discretize time into hourly intervals—for each cluster, we sum all pickups and dropoffs that occur in each hour of 2013. So our datapoints now are triples of the form [<cluster>, <hour>, <activity>], with <hour> being some hour in 2013 and <activity> being the number of pickups and dropoffs that occurred in hour <hour> in cluster <cluster>. We then regress each <activity> against neighboring clusters’ and neighboring times’ <activity> values. This regression serves to smooth estimates across time and space, smoothing out the effects of special events or weather in the prior year that don’t repeat this year. It required some tricky choices on arranging and aligning the various data elements; not technically difficult or maybe even interesting, but nevertheless likely the better part of an hour at a whiteboard to explain. In other words, typical data science. We then extrapolate these predictions to 2014 by mapping each hour in 2014 to the most similar hour in 2013. So we now have, for each cluster location and each hour in 2014, a prediction of the number of passenger pickups and dropoffs.
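Here is a rough sketch of what one such smoothing regression might look like. The feature choices, the neighbor lists, and the 2014-to-2013 hour mapping below are simplified stand-ins for our actual choices, not the real pipeline:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def smooth_cluster(c, activity, neighbors):
    """Regress cluster c's hourly counts on neighboring hours and clusters.
    `activity[c]` is a Series of hourly counts indexed by timestamp;
    `neighbors[c]` is a list of nearby cluster ids."""
    y = activity[c]
    features = pd.DataFrame({
        "prev_hour": y.shift(1),   # neighboring times
        "next_hour": y.shift(-1),
    })
    for n in neighbors[c]:         # neighboring clusters
        features["cluster_%d" % n] = activity[n]
    data = pd.concat([y.rename("y"), features], axis=1).dropna()

    model = LinearRegression().fit(data.drop(columns="y"), data["y"])
    return pd.Series(model.predict(data.drop(columns="y")), index=data.index)

def similar_2013_hour(ts_2014, hours_2013):
    """Map a 2014 hour to a 'most similar' 2013 hour: here, the first 2013 hour
    with the same month, weekday, and hour of day (one plausible rule)."""
    for h in hours_2013:
        if (h.month, h.weekday(), h.hour) == (ts_2014.month, ts_2014.weekday(), ts_2014.hour):
            return h
```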

We display these predictions by overlaying them on a Google map at the corresponding cluster locations. We round <activity> to values like 20 or 30 to avoid giving users number dyslexia. We color the labels based on these values, using black-body radiation color temperatures for the color scale, as that is one of two color scales where the ordering of change is perceptually intuitive.
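A toy version of that display logic might look like the following; the rounding step and the hand-rolled black-to-white ramp are crude approximations of what we actually use:

```python
def friendly_round(activity, step=10):
    # 23 -> 20, 37 -> 40: coarse values are easier to read at a glance.
    return int(round(activity / step) * step)

def label_color(activity, max_activity):
    # Crude black -> red -> yellow -> white ramp, a stand-in for a proper
    # black-body color-temperature scale.
    t = min(max(activity / max_activity, 0.0), 1.0)
    r = min(1.0, 3 * t)
    g = min(1.0, max(0.0, 3 * t - 1))
    b = min(1.0, max(0.0, 3 * t - 2))
    return "#%02x%02x%02x" % (int(255 * r), int(255 * g), int(255 * b))
```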

If you live in New York, we hope you find NYCTaxi.info useful. Regardless, we look forward to receiving any comments.

Links (with annotation)

I’ve been heads down writing this week but I wanted to share a bunch of great stuff coming out.

  1. Here’s a great interview with machine learning expert Michael Jordan on various things including the big data bubble (hat tip Alan Fekete). I had a similar opinion over a year ago on that topic. Update: here’s Michael Jordan ranting about the title for that interview (hat tip Akshay Mishra). I never read titles.
  2. Have you taken a look at Janet Yellen’s speech on inequality from last week? She was at a conference in Boston about inequality when she gave it. It’s a pretty amazing speech – she acknowledges the increasing inequality, for example, and points at four systems we can focus on as reasons: childhood poverty and public education, college costs, inheritances, and business creation. One thing she didn’t mention: quantitative easing, or anything else the Fed has actual control over. Plus she hid behind the language of economics in terms of how much to care about any of this or what she or anyone else could do. On the other hand, maybe it’s the most we could expect from her. The Fed has, in my opinion, already been overreaching with QE and we can’t expect it to do the job of Congress.
  3. There’s a cool event at the Columbia Journalism School tomorrow night called #Ferguson: Reporting a Viral News Story (hat tip Smitha Corona) which features sociologist and writer Zeynep Tufekci among others (see for example this article she wrote), with Emily Bell moderating. I’m going to try to go.
  4. Just in case you didn’t see this, Why Work Is More And More Debased (hat tip Ernest Davis).
  5. Also: Poor kids who do everything right don’t do better than rich kids who do everything wrong (hat tip Natasha Blakely).
  6. Jesse Eisenger visits the defense lawyers of the big banks and writes about his experience (hat tip Aryt Alasti).

After writing this list, with all the hat tips, I am once again astounded at how many awesome people send me interesting things to read. Thank you so much!!

Guest post: The dangers of evidence-based sentencing

This is a guest post by Luis Daniel, a research fellow at The GovLab at NYU where he works on issues dealing with tech and policy. He tweets @luisdaniel12. Crossposted at the GovLab.

What is Evidence-based Sentencing?

For several decades, parole and probation departments have been using research-backed assessments to determine the best supervision and treatment strategies for offenders to try to reduce the risk of recidivism. In recent years, state and county justice systems have started to apply these risk and needs assessment tools (RNAs) to other parts of the criminal process.

Of particular concern is the use of automated tools to determine imprisonment terms. This relatively new practice of applying RNA information into the sentencing process is known as evidence-based sentencing (EBS).

What the Models Do

The different parameters used to determine risk vary by state, and most EBS tools use information that has been central to sentencing schemes for many years, such as an offender’s criminal history. However, an increasing number of states have been utilizing static factors such as gender, age, marital status, education level, employment history, and other demographic information to determine risk and inform sentencing. Especially alarming is the fact that the majority of these risk assessment tools do not take an offender’s particular case into account.

This practice has drawn sharp criticism from Attorney General Eric Holder who says “using static factors from a criminal’s background could perpetuate racial bias in a system that already delivers 20% longer sentences for young black men than for other offenders.” In the annual letter to the US Sentencing Commission, the Attorney General’s Office states that “utilizing such tools for determining prison sentences to be served will have a disparate and adverse impact on offenders from poor communities already struggling with social ills.” Other concerns cite the probable unconstitutionality of using group-based characteristics in risk assessments.

Where the Models Are Used

It is difficult to precisely quantify how many states and counties currently implement these instruments, although at least 20 states have implemented some form of EBS. States that have implemented some sort of EBS, statewide or in some counties, for any type of sentencing decision (parole, imprisonment, etc.) include: Pennsylvania, Tennessee, Vermont, Kentucky, Virginia, Arizona, Colorado, California, Idaho, Indiana, Missouri, Nebraska, Ohio, Oregon, Texas, and Wisconsin.

The Role of Race, Education, and Friendship

Overwhelmingly, states do not include race in the risk assessments, since there seems to be a general consensus that doing so would be unconstitutional. However, even though these tools do not take race into consideration directly, many of the variables used, such as economic status, education level, and employment, correlate with race. African-Americans and Hispanics are already disproportionately incarcerated, and determining sentences based on these variables might cause further racial disparities.

The very socioeconomic characteristics used in risk assessments, such as income and education level, are already strong predictors of whether someone will go to prison. For example, high school dropouts are 47 times more likely to be incarcerated than people of a similar age who received a four-year college degree. It is reasonable to suspect that courts that include education level as a risk predictor will further exacerbate these disparities.

Some states, such as Texas, take peer relations into account and consider associating with other offenders a “salient problem.” Considering that Texas ranks 4th in the rate of people under some sort of correctional control (parole, probation, etc.), and that this rate is 1 in 11 for black males in the United States, it is likely that this metric would disproportionately affect African-Americans.

Sonja Starr’s paper

Even so, in some cases, socioeconomic and demographic variables receive significant weight. In her forthcoming paper in the Stanford Law Review, Sonja Starr provides a telling example of how these factors are used in presentence reports. From her paper:

For instance, in Missouri, pre-sentence reports include a score for each defendant on a scale from -8 to 7, where “4-7 is rated ‘good,’ 2-3 is ‘above average,’ 0-1 is ‘average’, -1 to -2 is ‘below average,’ and -3 to -8 is ‘poor.’ Unlike most instruments in use, Missouri’s does not include gender. However, an unemployed high school dropout will score three points worse than an employed high school graduate—potentially making the difference between “good” and “average,” or between “average” and “poor.” Likewise, a defendant under age 22 will score three points worse than a defendant over 45. By comparison, having previously served time in prison is worth one point; having four or more prior misdemeanor convictions that resulted in jail time adds one point (three or fewer adds none); having previously had parole or probation revoked is worth one point; and a prison escape is worth one point. Meanwhile, current crime type and severity receive no weight.

Starr argues that such simple point systems may “linearize” a variable’s effect. In the underlying regression models used to calculate risk, a variable’s effect does not translate linearly into a change in the probability of recidivism, but the point system treats it as if it did.
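To see what “linearizing” means here, consider an entirely hypothetical logistic risk model. The same fixed bump in the score corresponds to very different changes in the predicted probability of recidivism depending on the defendant’s baseline, yet a point scale treats every bump identically:

```python
import math

def prob(logit):
    # Convert log-odds to a probability.
    return 1 / (1 + math.exp(-logit))

coef_unemployed = 0.4  # hypothetical log-odds weight for "unemployed dropout"

for baseline_logit in (-2.0, 0.0, 2.0):
    before = prob(baseline_logit)
    after = prob(baseline_logit + coef_unemployed)
    print("baseline %.2f -> %.2f (change %+.2f)" % (before, after, after - before))

# Prints roughly:
#   baseline 0.12 -> 0.17 (change +0.05)
#   baseline 0.50 -> 0.60 (change +0.10)
#   baseline 0.88 -> 0.92 (change +0.04)
```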

Another criticism Starr makes is that these tools often make predictions about an individual based on the averages of a group. Starr says the tools can predict, with reasonable precision, the average recidivism rate of all offenders who share the same characteristics as the defendant, but that does not necessarily make them useful for individual predictions.

The Future of EBS Tools

The Model Penal Code is currently being revised and is set to include these risk assessment tools in the sentencing process. According to Starr, this is a serious development, both because it reflects the increased support for these practices and because of the Model Penal Code’s great influence in guiding states’ penal codes. Attorney General Eric Holder has already spoken against the practice, but it will be interesting to see whether his successor will continue this campaign.

Even if EBS can accurately measure the risk of recidivism (which is uncertain, according to Starr), does that mean that a greater prison sentence will result in fewer future offenses after the offender is released? EBS does not seek to answer this question. Further, if knowing there is a harsh penalty for a particular crime is a deterrent to committing said crime, wouldn’t adding more uncertainty to sentencing (EBS tools are not always transparent and are sometimes proprietary) effectively remove this deterrent?

Even though many questions remain unanswered and while several people have been critical of the practice, it seems like there is great support for the use of these instruments. They are especially easy to support when they are overwhelmingly regarded as progressive and scientific, something Starr refutes. While there is certainly a place for data analytics and actuarial methods in the criminal justice system, it is important that such research be applied with the appropriate caution. Or perhaps not at all. Even if the tools had full statistical support, the risk of further exacerbating an already disparate criminal justice system should be enough to halt this practice.

Both Starr and Holder believe there is a strong case to be made that the risk prediction instruments now in use are unconstitutional. But EBS has strong advocates, so it’s a difficult subject. Ultimately, evidence-based sentencing is used to determine a person’s sentencing not based on what the person has done, but who that person is.

Big Data’s Disparate Impact

Take a look at this paper by Solon Barocas and Andrew D. Selbst entitled Big Data’s Disparate Impact.

It deals with the question of whether current anti-discrimination law is equipped to handle the kind of unintentional discrimination and digital redlining we see emerging in some “big data” models (and that we suspect is hidden in a bunch more). See for example this post for more on this concept.

The short answer is no, our laws are not equipped.

Here’s the abstract:

This article addresses the potential for disparate impact in the data mining processes that are taking over modern-day business. Scholars and policymakers had, until recently, focused almost exclusively on data mining’s capacity to hide intentional discrimination, hoping to convince regulators to develop the tools to unmask such discrimination. Recently there has been a noted shift in the policy discussions, where some have begun to recognize that unintentional discrimination is a hidden danger that might be even more worrisome. So far, the recognition of the possibility of unintentional discrimination lacks technical and theoretical foundation, making policy recommendations difficult, where they are not simply misdirected. This article provides the necessary foundation about how data mining can give rise to discrimination and how data mining interacts with anti-discrimination law.

The article carefully steps through the technical process of data mining and points to different places within the process where a disproportionately adverse impact on protected classes may result from innocent choices on the part of the data miner. From there, the article analyzes these disproportionate impacts under Title VII. The Article concludes both that Title VII is largely ill equipped to address the discrimination that results from data mining. Worse, due to problems in the internal logic of data mining as well as political and constitutional constraints, there appears to be no easy way to reform Title VII to fix these inadequacies. The article focuses on Title VII because it is the most well developed anti-discrimination doctrine, but the conclusions apply more broadly because they are based on the general approach to anti-discrimination within American law.

I really appreciate this paper, because it covers an area I know almost nothing about: discrimination law and what the standards are for evidence of discrimination.

Sadly, what this paper explains to me is how very far away we are from anything resembling what we need to actually address the problems. For example, even in this paper, where the writers are well aware that training on historical data can unintentionally codify discriminatory treatment, they still seem to assume that the people who build and deploy models will “notice” this treatment. From my experience working in advertising, that’s not actually what happens. We don’t measure the effects of our models on our users. We only see whether we have gained an edge in terms of profit, which is very different.

Essentially, as modelers, we don’t humanize the people on the other side of the transaction, which prevents us from worrying about discrimination or even being aware of it as an issue. It’s so far from “intentional” that it’s almost a ridiculous accusation to make. Even so, it may well be a real problem and I don’t know how we as a society can deal with it unless we update our laws.

De-anonymizing what used to be anonymous: NYC taxicabs

Thanks to Artem Kaznatcheev, I learned yesterday about the recent work of Anthony Tockar in exploring the field of anonymization and deanonymization of datasets.

Specifically, he looked at the 2013 cab rides in New York City, which was provided under a FOIL request, and he stalked celebrities Bradley Cooper and Jessica Alba (and discovered that neither of them tipped the cabby). He also stalked a man who went to a slew of NYC titty bars: found out where the guy lived and even got a picture of him.

Previously, some other civic hackers had identified the cabbies themselves, because the original dataset had scrambled the medallions, but not very well.

The point he was trying to make was that we should not assume that “anonymized” datasets actually protect privacy. Instead we should learn to use more thoughtful approaches to anonymizing data, and he proposes using a framework called “differential privacy,” which he explains here. It involves adding noise to the data, in a carefully calibrated way, so that in the end any given person doesn’t risk much more of their privacy by being included in the dataset than by not being included.
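To give a flavor of the idea, here is a toy version of the Laplace mechanism, the simplest building block of differential privacy. Tockar’s actual proposal for the taxi data is more involved than this sketch:

```python
import numpy as np

def laplace_release(true_count, epsilon, sensitivity=1.0):
    # Adding or removing one person changes a simple count by at most 1,
    # so sensitivity = 1; a smaller epsilon means more noise and more privacy.
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# E.g. releasing "number of pickups at this corner between 2 and 3 a.m."
noisy_count = laplace_release(true_count=7, epsilon=0.5)
```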

Bottom line, it’s actually pretty involved mathematically, and although I’m a nerd and it doesn’t intimidate me, it does give me pause. Here are a few concerns:

  1. It means that most people, for example the person in charge of fulfilling FOIL requests, will not actually understand the algorithm.
  2. That means that, if there’s a requirement that such a procedure is used, that person will have to use and trust a third party to implement it. This leads to all sorts of problems in itself.
  3. Just to name one, depending on what kind of data it is, you have to implement differential privacy differently. There’s no doubt that a complicated mapping of datatype to methodology will be screwed up when the person doing it doesn’t understand the nuances.
  4. Here’s another: the third party may not be trustworthy and may have created a backdoor.
  5. Or they just might get it wrong, or do something lazy that doesn’t actually work, and they can get away with it because, again, the user is not an expert and cannot accurately evaluate their work.

Altogether I’m imagining that this is at best an expensive solution for very important datasets, and won’t be used for your everyday FOIL requests like taxicab rides unless the culture around privacy changes dramatically.

Even so, super interesting and important work by Anthony Tockar. Also, if you think that’s cool, take a look at my friend Luis Daniel’s work on de-anonymizing the Stop & Frisk data.
