The financial crisis has given rise to a series of catastrophes related to mathematical modeling.
Time after time you hear people speaking in baffled terms about mathematical models that somehow didn’t warn us in time, that were too complicated to understand, and so on. If you have somehow missed such public displays of throwing the model (and quants) under the bus, stay tuned below for examples.
A common response to these problems is to call for those models to be revamped, to add features that will cover previously unforeseen issues, and generally speaking, to make them more complex.
For a person like myself, who gets paid to “fix the model,” it’s tempting to do just that, to assume the role of the hero who is going to set everything right with a few brilliant ideas and some excellent training data.
Unfortunately, reality is staring me in the face, and it’s telling me that we don’t need more complicated models.
If I go to the trouble of fixing up a model, say by adding counterparty risk considerations, then I’m implicitly assuming the problem with the existing models is that they’re being used honestly but aren’t mathematically up to the task.
But this is far from the case – most of the really enormous failures of models are explained by people lying. Before I give three examples of the “big models failing because someone is lying” phenomenon, let me add one more important thing.
Namely, if we replace okay models with more complicated models, as many people are suggesting we do, without first addressing the lying problem, it will only allow people to lie even more. This is because the complexity of a model itself is an obstacle to understanding its results, and more complex models allow more manipulation.
Example 1: Municipal Debt Models
Many municipalities are in shit tons of trouble with their muni debt. This is in part because the big banks took advantage of them, but it’s also in part because they often lie with models.
Specifically, they know what their obligations for pensions and school systems will be over the next few years, and in order to pay for all that, they use a model which estimates how well their savings will pay off in the market, or however they’ve invested their money. But they plug vastly exaggerated return assumptions into these models, because that way they can minimize the amount of money they have to put into the pool each year. The result is that pension pools are being systematically and vastly underfunded.
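To see how much the assumed return matters, here’s a minimal sketch of the annuity arithmetic underneath a pension funding model. The numbers are invented for illustration – no real municipality’s figures:

```python
def required_annual_contribution(liability, assumed_return, years):
    """Annual payment that grows to `liability` in `years`,
    IF savings really earn `assumed_return` every year."""
    r = assumed_return
    return liability * r / ((1 + r) ** years - 1)

# A $100M obligation due in 30 years, under two return assumptions:
honest = required_annual_contribution(100e6, 0.05, 30)  # ~$1.5M/year
rosy   = required_annual_contribution(100e6, 0.08, 30)  # ~$0.9M/year
```

Nudge the assumed return from 5% to 8% and the stated contribution drops by roughly 40% – and the shortfall doesn’t show up for decades.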
Example 2: Wealth Management
I used to work at RiskMetrics, where I saw first-hand how people lie with risk models. But that’s not the only thing I worked on. I also helped build an analytical wealth management product. This software was sold to banks and was used by professional “wealth managers” to help people (usually rich people, but not mega-rich people) plan for retirement.
We had a bunch of bells and whistles in the software to impress the clients – Monte Carlo simulations, fancy optimization tools, and more. But in the end, the banks and their wealth managers put in their own market assumptions when they used it. Specifically, they put in the forecast market growth for stocks, bonds, alternative investing, etc., as well as the assumed volatility of those categories and indeed the entire covariance matrix representing how correlated the market constituents are to each other.
The result is this: no matter how honest I would try to be with my modeling, I had no way of preventing the model from being misused and misleading to the clients. And it was indeed misused: wealth managers put in absolutely ridiculous assumptions of fantastic returns with vanishingly small risk.
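The mechanics are easy to demonstrate. Below is a toy version – not the actual product, and all parameters are invented – of a bare-bones retirement Monte Carlo, showing how “fantastic returns with vanishingly small risk” make running out of money look nearly impossible:

```python
import random

def ruin_probability(mean_return, volatility, years=30,
                     start=1_000_000, withdrawal=60_000,
                     trials=5_000, seed=0):
    """Fraction of simulated retirements that run out of money,
    using i.i.d. normal annual returns (a crude but standard toy model)."""
    rng = random.Random(seed)
    ruined = 0
    for _ in range(trials):
        wealth = start
        for _ in range(years):
            wealth = wealth * (1 + rng.gauss(mean_return, volatility)) - withdrawal
            if wealth <= 0:
                ruined += 1
                break
    return ruined / trials

sober = ruin_probability(0.05, 0.18)  # plausible stock-heavy assumptions
rosy  = ruin_probability(0.09, 0.06)  # the wealth manager's inputs
```

Same software, same client, same Monte Carlo engine – only the assumptions changed, and the reported chance of ruin collapses.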
Example 3: JP Morgan’s Whale Trade
I saved the best for last. JP Morgan’s actions around their $6.2 billion trading loss, the so-called “Whale Loss,” were recently investigated by a Senate subcommittee. This is an excerpt (page 14) from the resulting report, which is well worth reading in full:
While the bank claimed that the whale trade losses were due, in part, to a failure to have the right risk limits in place, the Subcommittee investigation showed that the five risk limits already in effect were all breached for sustained periods of time during the first quarter of 2012. Bank managers knew about the breaches, but allowed them to continue, lifted the limits, or altered the risk measures after being told that the risk results were “too conservative,” not “sensible,” or “garbage.” Previously undisclosed evidence also showed that CIO personnel deliberately tried to lower the CIO’s risk results and, as a result, lower its capital requirements, not by reducing its risky assets, but by manipulating the mathematical models used to calculate its VaR, CRM, and RWA results. Equally disturbing is evidence that the OCC was regularly informed of the risk limit breaches and was notified in advance of the CIO VaR model change projected to drop the CIO’s VaR results by 44%, yet raised no concerns at the time.
I don’t think there could be a better argument explaining why new risk limits and better VaR models won’t help JPM or any other large bank. The manipulation of existing models is what’s really going on.
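I don’t know the details of JPM’s internal model change, but one well-known lever for this kind of manipulation is the lookback window of a historical VaR model. A hedged sketch, with fake P&L data, of how the same portfolio can report a much lower VaR just by shortening the window:

```python
import random

def historical_var(returns, level=0.99):
    """One-day historical VaR: the loss exceeded on (1 - level) of days."""
    losses = sorted(-r for r in returns)
    idx = int(level * len(losses))
    return losses[min(idx, len(losses) - 1)]

rng = random.Random(42)
# Fake P&L history: a volatile year followed by a calm year.
history = [rng.gauss(0, 0.03) for _ in range(250)] + \
          [rng.gauss(0, 0.01) for _ in range(250)]

var_full  = historical_var(history)         # all 500 days
var_short = historical_var(history[-250:])  # "recent data is more relevant"
```

Drop the volatile year from the window and reported VaR falls sharply, with the portfolio unchanged. The whale-era model change was more elaborate than this, but the incentive is identical.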
Just to be clear on the models and modelers as scapegoats, even in the face of the above report, please take a look at minute 1:35:00 of the C-SPAN coverage of former CIO head Ina Drew’s testimony when she’s being grilled by Senator Carl Levin (hat tip Alan Lawhon, who also wrote about this issue here).
Ina Drew firmly shoves the quants under the bus, pretending to be surprised by the failures of the models even though, considering she’d been at JP Morgan for 30 years, she might know just a thing or two about how VaR can be manipulated. Why hasn’t Sarbanes-Oxley been used to put that woman in jail? She’s not even at JP Morgan anymore.
Stick around for a few minutes in the testimony after Levin’s done with Drew, because he’s on a roll and it’s awesome to watch.
There’ve been a couple of articles in the past few days about teacher Value-Added Testing that have enraged me.
If you haven’t been paying attention, the Value-Added Model (VAM) is now being used in a majority of the states (source: the Economist).
But it gives out nearly random numbers, as gleaned from looking at the same teachers with two scores (see this previous post). There’s a 24% correlation between the two numbers, and some people are awesome with respect to one score and complete shit on the other.
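You don’t need the real data to see how a 24% correlation behaves; a toy simulation (invented parameters, not the actual VAM) makes the point:

```python
import random
import statistics

def simulate_vam(n=2000, signal=0.56, seed=1):
    """Two scores per teacher: a shared 'true effect' plus independent
    noise. signal=0.56 gives a correlation near the observed 0.24."""
    rng = random.Random(seed)
    quality = [rng.gauss(0, 1) for _ in range(n)]
    score_a = [signal * q + rng.gauss(0, 1) for q in quality]
    score_b = [signal * q + rng.gauss(0, 1) for q in quality]
    return score_a, score_b

def pearson(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (statistics.pstdev(x) * statistics.pstdev(y))

a, b = simulate_vam()
r = pearson(a, b)  # close to ~0.24

# Teachers in the top quarter on one score and the bottom quarter on the other:
cut_hi = sorted(a)[3 * len(a) // 4]
cut_lo = sorted(b)[len(b) // 4]
flipped = sum(1 for x, y in zip(a, b) if x >= cut_hi and y <= cut_lo)
```

At this correlation a noticeable fraction of teachers land in the top quartile on one score and the bottom quartile on the other – exactly the awesome-on-one, terrible-on-the-other pattern in the real scores.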
One final thing you need to know about the model: nobody really understands how it works. It relies on the error terms of an error-riddled model. It’s opaque, and no teacher can have their score explained to them in plain English.
Now, with that background, let’s look into these articles.
First, there’s this New York Times article from yesterday, entitled “Curious Grade for Teachers: Nearly All Pass”. It describes how teachers are nowadays judged using a (usually) 50/50 combination of classroom observations and VAM scores. This is different from the past, when evaluations were based only on classroom observations.
What they’ve found is that the percentage of teachers found “effective or better” has stayed high in spite of the new system – the numbers are all over the place but typically between 90 and 99 percent of teachers. In other words, the number of teachers that are fingered as truly terrible hasn’t gone up too much. What a fucking disaster, at least according to the NYTimes, which seems to go out of its way to make its readers understand how very much high school teachers suck.
A few things to say about this.
- Given that the VAM is nearly a random number generator, this is good news – it means they are not trusting the VAM scores blindly. Of course, it still doesn’t mean that the right teachers are getting fired, since half of the score is random.
- Another point the article mentions is that failing teachers are leaving before the reports come out. We don’t actually know how many teachers are affected by these scores.
- Anyway, what is the right number of teachers to fire each year, New York Times? And how did you choose that number? Oh wait, you quoted someone from the Brookings Institution: “It would be an unusual profession that at least 5 percent are not deemed ineffective.” Way to explain things so scientifically! It’s refreshing to know exactly how the army of McKinsey alums approaches education reform.
- The overall article gives us the impression that if we were really going to do our job and “be tough on bad teachers,” then we’d weight the Value-Added Model way more. But instead we’re being pussies. Wonder what would happen if we weren’t pussies?
The second article explained just that. It also came from the New York Times (h/t Suresh Naidu), and it was the story of a School Chief in Atlanta who took the VAM scores very, very seriously.
What happened next? The teachers cheated wildly, changing the answers on their students’ tests. There was a big cover-up, lots of nasty political pressure, and a lot of good people feeling really bad, blah blah blah. But maybe we can take a step back and think about why this might have happened. Can we do that, New York Times? Maybe it had to do with the $500,000 in “performance bonuses” that the School Chief got for such awesome scores?
Let’s face it, this cheating scandal, and others like it (which may never come to light), was not hard to predict (as I explain in this post). In fact, as a predictive modeler, I’d argue that this cheating problem is the easiest thing to predict about the VAM, considering how it’s being used as an opaque mathematical weapon.
Guest Post SuperReview Part III of VI: The Occupy Handbook Part I and a little Part II: Where We Are Now
Moving on from Lewis’ cute Bloomberg column reprint, we come to the next essay in the series:
Indefatigable pair Paul Krugman and Robin Wells (KW hereafter) contribute one of the several original essays in the book, but the content ought to be familiar if you read the New York Times, know something about economics or practice finance. Paul Krugman is prolific, and it isn’t hard to be prolific when you have to rewrite essentially the same column every week; question: are there other columnists who have been so consistently right yet have failed to propose anything that the polity would adopt? Political failure notwithstanding, Krugman leaves gems in every paragraph for the reader new to all this. The title “The Widening Gyre” comes from an apocalyptic William Butler Yeats poem. In this case, Krugman and Wells tackle the problem of why the government responded so poorly to the crisis. In their words:
By 2007, America was about as unequal as it had been on the eve of the Great Depression – and sure enough, just after hitting this milestone, we lunged into the worst slump since the Depression. This probably wasn’t a coincidence, although economists are still working on trying to understand the linkages between inequality and vulnerability to economic crisis.
Here, however, we want to focus on a different question: why has the response to the crisis been so inadequate? Before the financial crisis struck, we think it’s fair to say that most economists imagined that even if such a crisis were to happen, there would be a quick and effective policy response [editor's note: see Kautsky et al 2016 for a partial explanation]. In 2003 Robert Lucas, the Nobel laureate and then president of the American Economic Association, urged the profession to turn its attention away from recessions to issues of longer-term growth. Why? Because, he declared, the “central problem of depression-prevention has been solved, for all practical purposes, and has in fact been solved for many decades.”
Famous last words from Professor Lucas. Nevertheless, the curious failure to apply what was once the conventional wisdom on a useful scale intrigues me for two reasons. First, most political scientists suggest that democracy, versus authoritarian system X, leads to better outcomes for two reasons.
1. Distributional – you get a nicer distribution of wealth (possibly more productivity for complicated macro reasons); economics suggests that since people are mostly envious and poor people have rapidly increasing utility in wealth, democracy’s tendency to share the wealth better maximizes some stupid social welfare criterion (typically, Kaldor-Hicks efficiency).
2. Information – democracy is a better information-aggregation system than dictatorship, and an expanded polity makes better decisions beyond the allocation of produced resources. For this to work, the polity must be capable of learning and intelligent, or uninformed voters must vote randomly. While this is the original rigorous justification for democracy (first formalized in the 1800s by French rationalists), almost no one who studies these issues today believes one-person one-vote democracy better aggregates information than all other systems at a national level. “Well Leon,” some knave comments, “we don’t live in a democracy, we live in a Republic with a president…so shouldn’t a small group of representatives be better able to make social-welfare-maximizing decisions?” Short answer: strong no, and US Constitutionalism has some particularly nasty features when it comes to political decision-making.
Second, KW suggest that the presence of extreme wealth inequality acts like a democracy-disabling virus at the national level. According to KW, extreme wealth inequalities perpetuate themselves in a way that undermines both “nice” features of a democracy when it comes to making regulatory and budget decisions.* Thus, to get better economic decision-making from our elected officials, a good intermediate step would be to make our tax system more progressive, or expand Medicare or Social Security, or…well, we have a lot of good options here. Of course, for mathematically minded thinkers, this begs the following question: if we could enact so-called progressive economic policies to cure our political crisis, why haven’t we done so already? What can/must change for us to do so in the future? While I believe the answer to this question is provided by another essay in the book, let’s take a closer look at KW’s explanation of how wealth inequality throws sand into the gears of our polity. They propose four mechanisms; the numbering scheme below is mine:
1. The most likely explanation of the relationship between inequality and polarization is that the increased income and wealth of a small minority has, in effect, bought the allegiance of a major political party…Needless to say, this is not an environment conducive to political action.
2. It seems likely that this persistence [of financial deregulation] despite repeated disasters had a lot to do with rising inequality, with the causation running in both directions. On the one side, the explosive growth of the financial sector was a major source of soaring incomes at the very top of the income distribution. On the other side, the fact that the very rich were the prime beneficiaries of deregulation meant that as this group gained power – simply because of its rising wealth – the push for deregulation intensified. These impacts of inequality on ideology did not end in 2008…[they] left us incapacitated in the face of crisis.
3. Conservatives have always seen [Keynesian economics] as the thin edge of the wedge: concede that the government can play a useful role in fighting slumps, and the next thing you know we’ll be living under socialism.
4. [Krugman paraphrasing Kalecki] Every widening of state activity is looked upon by business with suspicion, but the creation of employment by government spending has a special aspect which makes the opposition particularly intense. Under a laissez-faire system the level of employment depends to a great extent on the so-called state of confidence….This gives capitalists a powerful indirect control over government policy: everything which may shake the state of confidence must be avoided because it would cause an economic crisis.
All of these are true to an extent. Two relate to a particular policy position that conservatives don’t like (countercyclical spending), and their cost will dissipate if the economy improves. And isn’t it the case that most proponents and beneficiaries of financial liberalization are Democrats? (Wall Street mostly supported Obama in ’08 and barely supported Romney in ’12, despite Romney giving the house away.) In any case, while KW aren’t big on solutions, they certainly have a strong grasp of the problem.
Take a Stand: Sit In by Phillip Dray
As the railroad strike of 1877 had led eventually to expanded workers’ rights, so the Greensboro sit-in of February 1, 1960, helped pave the way for passage of the Civil Rights Act of 1964 and the Voting Rights Act of 1965. Both movements remind us that not all successful protests are explicit in their message and purpose; they rely instead on the participants’ intuitive sense of justice. 
I’m not the only author to have taken note of this passage as particularly important, but I am the only author who found the passage significant and did not start ranting about so-called “natural law.” Chronicling the (hitherto unknown-to-me) history of the Great Upheaval, Dray does a great job relating some important moments in left protest history to the OWS history. This is actually an extremely important essay and I haven’t given it the time it deserves. If you read three essays in this book, include this in your list.
Inequality and Intemperate Policy by Raghuram Rajan (no URL, you’ll have to buy the book)
Rajan’s basic ideas are the following: inequality has gotten out of control:
Deepening income inequality has been brought to the forefront of discussion in the United States. The discussion tends to center on the Croesus-like income of John Paulson, the hedge fund manager who made a killing in 2008 betting on a financial collapse and netted over $3 billion, about seventy-five-thousand times the average household income. Yet a more worrying, everyday phenomenon that confronts most Americans is the disparity in income growth rates between a manager at the local supermarket and the factory worker or office assistant. Since the 1970s, the wages of the former, typically workers at the ninetieth percentile of the wage distribution in the United States, have grown much faster than the wages of the latter, the typical median worker.
But American political ideologies typically rule out the most direct responses to inequality (i.e. redistribution). The result is a series of stop-gap measures that do long-run damage to the economy (as defined by sustainable and rising income levels and full employment), but temporarily boost the consumption level of lower classes:
It is not surprising, then, that a policy response to rising inequality in the United States in the 1990s and 2000s – whether carefully planned or chosen as the path of least resistance – was to encourage lending to households, especially but not exclusively low-income ones, with the government push given to housing credit just the most egregious example. The benefit – higher consumption – was immediate, whereas paying the inevitable bill could be postponed into the future. Indeed, consumption inequality did not grow nearly as much as income inequality before the crisis. The difference was bridged by debt. Cynical as it may seem, easy credit has been used as a palliative by successive administrations that have been unable to address the deeper anxieties of the middle class directly. As I argue in my book Fault Lines, “Let them eat credit” could well summarize the mantra of the political establishment in the go-go years before the crisis.
Why should you believe Raghuram Rajan? Because he’s one of the few guys who called the first crisis and tried to warn the Fed.
A solid essay providing a more direct link between income inequality and bad policy than KW do.
The 5 percent’s [consisting of the seven million Americans who, in 1934, were sixty-five and older] protests coalesced as the Townsend movement, launched by a sinewy midwestern farmer’s son and farm laborer turned California physician. Francis Townsend was a World War I veteran who had served in the Army Medical Corps. He had an ambitious and impractical plan for a federal pension program. Although during its heyday in the 1930s the movement failed to win enactment of its [editor's note: insane] program, it did play a critical role in contemporary politics. Before Townsend, America understood the destitution of its older generations only in abstract terms; Townsend’s movement made it tangible. “It is no small achievement to have opened the eyes of even a few million Americans to these facts,” Bruce Bliven, editor of the New Republic, observed. “If the Townsend Plan were to die tomorrow and be as completely forgotten as miniature golf, mah-jongg, or flinch [editor's note: everything old is new again], it would still have left some sedimented flood marks on the national consciousness.” Indeed, the Townsend movement became the catalyst for the New Deal’s signal achievement, the old-age program of Social Security. The history of its rise offers a lesson for the Occupy movement in how to convert grassroots enthusiasm into a potent political force – and a warning about the limitations of even a nationwide movement.
Does the author live up to the promises of this paragraph? Is the whole essay worth reading? Does FDR give in to the people’s demands and pass Social Security?!
Yes to all. Read it.
This is a great essay. I’m going to outsource the review and analysis to:
because it basically sums up my thoughts. You all, go read it.
If you know nothing about Wall Street, the essay is worth reading; otherwise skip it. There are two common ways to write a bad article in financial journalism. First, you can try to explain tiny index price movements via news articles from that day/week/month. “Shares in the S&P moved up on good news in Taiwan today,” that kind of nonsense. While the news and price movements might be worth knowing for their own sake, these articles are usually worthless because no journalist really knows who traded and why (theorists might point out that even if the journalists did know who traded to generate the movement and why, it’s not clear these articles would add value – the theorists are correct).
The other way, the Cassidy! way, is to ask some subgroup of American finance what they think about other subgroups in finance. High-frequency traders think iBankers are dumb and overpaid, but HFT, on the other hand, provides an extremely valuable service – keeping ETFs cheap, providing liquidity, and keeping shares at the right level. iBankers think prop-traders add no value, but that without iBanking M&A services, American manufacturing/farmers/whatever would cease functioning. Low-speed prop-traders think that HFT just extracts cash from dumb money, but prop-traders are the reddest-blooded American capitalists, taking the right risks and bringing knowledge into the markets. Insurance hates hedge funds, hedge funds hate the bulge bracket, the bulge bracket hates the ratings agencies, who hate insurance, and on and on.
You can spit out dozens of articles about these catty and tedious rivalries (invariably claiming that financial sector X, rivals for institutional cash with Y, “adds no value”) and learn nothing about finance. Cassidy writes the article taking the iBankers side and surprises no one (this was originally published as an article in The New Yorker).
Ms. McLean is immensely talented. It was always pretty obvious that the bottom twenty percent, i.e. the vast majority of subprime loan recipients, who are generally poor at planning, were using mortgages to get quick cash rather than to buy houses. Regulators and high finance, after resisting for a good twenty years, gave in for reasons explained in Rajan’s essay.
A legit essay by a future Nobelist in Econ. Read it.
Anthro-hack Appadurai writes:
I first came to this country in 1967. I have been either a crypto-anthropologist or professional anthropologist for most of the intervening years. Still, because I came here with an interest in India and took the path of least resistance in choosing to retain India as my principal ethnographic referent, I have always been reluctant to offer opinions about life in these United States.
His instincts were correct. The essay reads like an old man complaining about how bad the weather is these days. Skip it.
Editor Byrne has amazing powers of persuasion, or a lot of authors have had essays in their desk drawers that they were waiting for an opportunity to publish. In any case, Rogoff and Reinhart (RR hereafter) have summed up a couple hundred studies and two of their books in a single executive summary and given it to whoever buys The Occupy Handbook. Value. RR are Republicans and the essay appears to be written in good faith (unlike some people *cough* Tyler Cowen and Veronique de Rugy *cough*). RR do a great job discovering and presenting stylized facts about financial crises past and present. What to expect next? A couple national defaults and maybe a hyperinflation or two.
Shiller has always been ahead of the curve. In 1981, he wrote a cornerstone paper in behavioral finance at a time when the field was in its embryonic stages. In the early 1990s, he noticed insufficient attention was paid to real estate values, despite their overwhelming importance to personal wealth levels; this led him to create, along with Karl E. Case, the Case-Shiller index – now the S&P/Case-Shiller Home Price Indices. In March 2000**, Shiller published Irrational Exuberance, arguing that U.S. stocks were substantially overvalued and due for a tumble. [Editor's note: what Brandon Adams fails to mention, but what's surely relevant, is that Shiller also called the subprime bubble and re-released Irrational Exuberance in 2005 to sound the alarms a full three years before The Subprime Solution]. In 2008, he published The Subprime Solution, which detailed the origins of the housing crisis and suggested innovative policy responses for dealing with the fallout. These days, one of his primary interests is neuroeconomics, a field that relates economic decision-making to brain function as measured by fMRIs.
Shiller is basically a champ and you should listen to him.
Shiller was disappointed but not surprised when governments bailed out banks in extreme fashion while leaving the contracts between banks and homeowners unchanged. He said, of Hank Paulson, “As Treasury secretary, he presented himself in a very sober and collected way…he did some bailouts that benefited Goldman Sachs, among others. And I can imagine that they were well-meaning, but I don’t know that they were totally well-meaning, because the sense of self-interest is hard to clean out of your mind.”
Shiller understates everything.
Verdict: Read it.
And so, we close our discussion of part I. Moving on to part II:
In Ms. Byrne’s own words:
Part 2, “Where We Are Now,” which covers the present, both in the United States and abroad, opens with a piece by the anthropologist David Graeber. The world of Madison Avenue is far from the beliefs of Graeber, an anarchist, but it’s Graeber who arguably (he says he didn’t do it alone) came up with the phrase “We Are the 99 percent.” As Bloomberg Businessweek pointed out in October 2011, during month two of the Occupy encampments that Graeber helped initiate and three months after the publication of his Debt: The First 5,000 Years, “David Graeber likes to say that he had three goals for the year: promote his book, learn to drive, and launch a worldwide revolution. The first is going well, the second has proven challenging and the third is looking up.” Graeber’s counterpart in Chile can loosely be said to be Camila Vallejo, the college undergraduate, pictured on page 219, who, at twenty-three, brought the country to a standstill. The novelist and playwright Ariel Dorfman writes about her and about his own self-imposed exile from Chile, and his piece is followed by an entirely different, more quantitative treatment of the subject. This part of the book also covers the indignados in Spain, who before Occupy began, “occupied” the public squares of Madrid and other cities – using, as the basis for their claim that the parks could legally be slept in, a thirteenth-century right granted to shepherds who moved, and still move, their flocks annually.
In other words, we’re in “Occupy is the hero we deserve, but not the hero we need” territory here.
*Addendum 1: Some have suggested that it’s not the wealth inequality that ought to be reduced, but the democratic elements of our system. California’s terrible decision-making resulting from its experiments with direct democracy notwithstanding, I would like to stay in the realm of the sane.
**Addendum 2: Yes, Shiller managed to get the book published the week before the crash. Talk about market timing.
This is a review of Part I of The Occupy Handbook. Part I consists of twelve pieces ranging in quality from excellent to awful. But enough from me; in Janet Byrne’s own words:
Part 1, “How We Got Here,” takes a look at events that may be considered precursors of OWS: the stories of a brakeman in 1877 who went up against the railroads; of the four men from an all-black college in North Carolina who staged the first lunch counter sit-in of the 1960s; of the out-of-work doctor whose nationwide, bizarrely personal Townsend Club movement led to the passage of Social Security. We go back to the 1930s and the New Deal and, in Carmen M. Reinhart and Kenneth S. Rogoff‘s “nutshell” version of their book This Time Is Different: Eight Centuries of Financial Folly, even further.
Ms. Byrne did a bang-up job getting one Nobel Prize winner in economics (Paul Krugman), two future Economics Nobel Prize winners (Robert Shiller, Daron Acemoglu) and two maybes (sorry, Raghuram Rajan and Kenneth Rogoff) to contribute excellent essays to this section alone. Powerhouse financial journalists Gillian Tett, Michael Hiltzik, John Cassidy, Bethany McLean and the prolific Michael Lewis all drop important and poignant pieces into this section. Arrogant yet angry anthropologist Arjun Appadurai writes one of the worst essays I’ve ever had the misfortune of reading, and the ubiquitous Brandon Adams makes the first of many mediocre appearances interviewing Robert Shiller. Clocking in at 135 pages, this is the shortest section of the book, yet it varies the most in quality. You can skip Professor Appadurai’s and Cassidy’s essays, but the rest are worth reading.
Advice from the 1 Percent: Lever Up, Drop Out by Michael Lewis
Framed as a strategy memo circulated among one-percenters, Lewis’ satirical piece, written after the clearing of Zuccotti Park, begins with a bang.
The rabble has been driven from the public parks. Our adversaries, now defined by the freaks and criminals among them, have demonstrated only that they have no idea what they are doing. They have failed to identify a single achievable goal.
Indeed, the absurd fixation on holding Zuccotti Park and the refusal to issue demands because doing so “would validate the system” crippled Occupy Wall Street (OWS). So far OWS has had a single but massive success: it shifted the conversation back to the United States’ out-of-control wealth inequality and managed to do so in time for the election, sealing the deal on Romney. In this manner, OWS functioned as a holding action by the 99% in the interests of the 99%.
We have identified two looming threats: the first is the shifting relationship between ambitious young people and money. There’s a reason the Lower 99 currently lack leadership: anyone with the ability to organize large numbers of unsuccessful people has been diverted into Wall Street jobs, mainly in the analyst programs at Morgan Stanley and Goldman Sachs. Those jobs no longer exist, at least not in the quantities sufficient to distract an entire generation from examining the meaning of their lives. Our Wall Street friends, wounded and weakened, can no longer pick up the tab for sucking the idealism out of America’s youth. We on the committee are resigned to all elite universities becoming breeding grounds for insurrection, with the possible exception of Princeton.
Michael Lewis speaks from experience; he is a Princeton alum and a 1 percenter himself. More than that, however, he is also a Wall Street alum, having worked at Salomon Brothers during the 1980s snafu and written about it in the original guide to Wall Street, Liar’s Poker. Perhaps because of his atypicality (and dash of solipsism), he does not have a strong handle on human(s) nature(s). By the time of his next column in Bloomberg, protests had broken out at Princeton.
Ultimately ineffectual, but still better than…
Lewis was right in the end, though more so than anyone sympathetic to the movement might like. OccupyPrinceton now consists of only two bloggers, one of whom has graduated and deleted all his work from an already quiet site, while the other is a senior this year. OccupyHarvard contains a single poorly written essay on the front page. Although OccupyNewHaven outlasted the original Occupation, Occupy Yale no longer exists. Occupy Dartmouth hasn’t been active for over a year, although it has a rather pathetic Twitter feed here. Occupy Cornell, Brown, Caltech, MIT and Columbia don’t exist, but some have active Facebook pages. Occupy Michigan State, Rutgers and NYU appear to have had active branches as recently as eight months ago, but have gone silent since. Functionally, Occupy Berkeley and its equivalents at UC Berkeley predate the Occupy movement and continue, but Occupy Stanford hasn’t been active for over a year. Anecdotally, I recall my friends expressing some skepticism that any cells of the Occupy movement still existed.
As for Lewis’ other points, I’m extremely skeptical about “examined lives” being undermined by Wall Street. As someone who started in math and slowly worked his way into finance, I can safely say that I’ve been excited by many of the computing, economic, and theoretical problems quants face in their day-to-day work, and I’m typical. I, and everyone who has lived long enough, know a handful of geniuses who have thought long and hard about the kinds of lives they want to lead and realized that A. there is no point to life unless you make one and B. making money is as good a point as any. I know one individual who, after working as a professional chemist prior to college, decided to, in his words, “fuck it and be an iBanker.” He’s an associate at DB. At elite schools, my friend’s decision is the rule rather than the exception: roughly half of Harvard will take jobs in finance and consulting (for finance) this year. Another friend, an exception, quit a promising career in operations research to travel the world as a pick-up artist. Could one really say that either the operations researcher or the chemist failed to examine their lives, or that with further examination they would have come up with something more “meaningful”?
One of the social hacks to give the lie to Lewis-style idealism-emerging-from-an-attempt-to-examine-one’s-life is to ask freshpeople at Ivy League schools what they’d like to do when they graduate and observe their choices four years later. The optimal solution for a sociopath just admitted to a top school might be to claim they’d like to do something in the Peace Corps, science or volunteering for the social status, then go on to work in academia, finance, law or tech, or marriage and household formation with someone who works in one of those fields. This path is functionally similar to what many “average” elite college students will do, sociopathic or not. Lewis appears to be sincere in his misunderstanding of human(s) nature(s). In another book he reveals that he was surprised at the reaction to Liar’s Poker – most students who had read the book “treated it as a how-to manual” and cynically asked him for tips on how to land analyst jobs in the bulge bracket. It’s true that there might be some things money can’t buy, but an immensely pleasurable, meaningful life does not seem to be one of them. Today, for the vast majority of humans in the Western world, expectations of sufficient levels of cold hard cash are necessary conditions for happiness.
In short and contra Lewis, little has changed. As of this moment, Occupy has proven so harmless to existing institutions that during her opening address Princeton University’s president Shirley Tilghman called on the freshmen in the class of 2016 to “Occupy” Princeton. No freshpeople have taken up her injunction. (Most?) parts of Occupy’s failure to make a lasting impact on college campuses appear to be structural; Occupy might not have succeeded even with better strategy. As the Ivy League became more and more meritocratic and better at discovering talent, many of the brilliant minds that would have fallen into the 99% and become its most effective advocates have been extracted and reached their so-called career potential, typically defined by income or status level. More meritocratic systems undermine instability by making the most talented individuals part of the class-to-be-overthrown, rather than the overthrowers of that system. In an even somewhat meritocratic system, minor injustices can be tolerated: Asians and poor rural whites are classes where there is obvious evidence of discrimination relative to “merit and the decision to apply” in elite gatekeeper college admissions (and thus, life outcomes generally), and neither group expresses revolutionary sentiment on a system-threatening scale, even as the latter group’s life expectancy has begun to decline from its already low levels. In the contemporary United States it appears that even as people’s expectations of material security evaporate, the mere possibility of wealth bolsters and helps to secure inequities in existing institutions.
Hence our committee’s conclusion: we must be able to quit American society altogether, and they must know it. The modern Greeks offer the example in the world today that is, the committee has determined, best in class. Ordinary Greeks seldom harass their rich, for the simple reason that they have no idea where to find them. To a member of the Greek Lower 99 a Greek Upper One is as good as invisible.
He pays no taxes, lives no place and bears no relationship to his fellow citizens. As the public expects nothing of him, he always meets, and sometimes even exceeds, their expectations. As a result, the chief concern of the ordinary Greek about the rich Greek is that he will cease to pay the occasional visit.
Michael Lewis is a wise man.
I can recall a conversation with one of my professors, an expert on Democratic Kampuchea (American: Khmer Rouge); she explained that for a long time the identity of the oligarchy ruling the country was kept secret from its citizens. She identified this obvious subversion of republican principles (how can you have control over your future when you don’t even know who runs your region?) as a weakness of the regime. Au contraire, I suggested: once you realize your masters are not gods, but merely humans with human characteristics – that they eat, sleep, think, dream, have sex, recreate, poop and die – all their mystique, their claims to superior knowledge divine or earthly, are instantly undermined. De facto segregation has made the upper classes in the nation more secure by allowing them to hide their day-to-day opulence from people who have lost their homes, jobs and medical care because of that opulence. Neuroscience will eventually reveal that being mysterious makes you appear more sexy, socially dominant, and powerful, thus making your claims to power and dominance more secure (Kautsky et al. 2018).*
If the majority of Americans manage to recognize that our two-tiered legal system has created a class whose actual claim to the US’s immense wealth stems, for the most part, from a toxic combination of Congressional pork, regulatory and enforcement agency capture, and inheritance rather than merit, there will be hell to pay. Meanwhile, resentment continues to grow. Even on the extreme right one can now regularly read things like:
Now, I think I’d be downright happy to vote for the first politician to run on a policy of sending killer drones after every single banker who has received a post-2007 bonus from a bank that received bailout money. And I’m a freaking libertarian; imagine how those who support bombing Iraqi children because they hate us for our freedoms are going to react once they finally begin to grasp how badly they’ve been screwed over by the bankers. The irony is that a banker-assassination policy would be entirely constitutional according to the current administration; it is very easy to prove that the bankers are much more serious enemies of the state than al Qaeda. They’ve certainly done considerably more damage.
The rest of part I reviewed tomorrow. Hang in there people.
Addendum 1: If your comment amounts to something like “the Nobel Prize in Economics is actually called the The Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel” and thus “not a real Nobel Prize” you are correct, yet I will still delete your comment and ban your IP.
*Addendum 2: More on this will come when we talk about the Saez-Delong discussion in part III.
There have been lots of comments and confusion, especially in this post, over what people in finance do or do not assume about how the markets work. I wanted to dispel some myths (at the risk of creating more).
First, there’s a big difference between quantitative trading and quantitative risk. And there may be a bunch of other categories that also exist, but I’ve only worked in those two arenas.
Markets are not efficient
In quantitative trading, nobody really thinks that “markets are efficient.” That’s kind of ridiculous, since then what would be the point of trying to make money through trading? We essentially make money because they aren’t. But of course that’s not to say they are entirely inefficient. Some approaches to removing inefficiency, and some markets, are easier than others. There can be entire markets that are so old and well-combed-over that the inefficiencies (that people have thought of) have been more or less removed and so, to make money, you have to be more thoughtful. A better way to say this is that the inefficiencies that are left are smaller than the transaction costs that would be required to remove them.
It’s not clear where “removing inefficiency” ends and where a different kind of trading begins, by the way. In some sense all algorithmic trades that work for any amount of time can be thought of as removing inefficiency, but then it becomes a useless concept.
Also, you can see from the above that traders have a vested interest to introduce new kinds of markets to the system, because new markets have new inefficiencies that can be picked off.
This kind of trading is very specific to a certain kind of time horizon as well. Traders and their algorithms typically want to make money in the average year. An inefficiency with a time horizon of 30 years may well persist, simply because few people are patient enough for it (I should add that we also probably don’t have good enough evidence that such trades would work, considering how quickly the markets change). Indeed the average quant shop is going in the opposite direction, toward high-speed trading, for that very reason: to find the time horizon at which there are still obvious inefficiencies.
A long long time ago, before Black Monday in 1987, people didn’t know how to price options. Then Black-Scholes came out and traders started using the Black-Scholes (BS) formula and it worked pretty well, until Black Monday came along and people suddenly realized the assumptions in BS were ridiculous. Ever since then people have adjusted the BS formula. Everyone.
There are lots of ways to think about how to adjust the formula, but a very common one is through the volatility smile. This allows us to remove the BS assumption of constant volatility (of the underlying stock) and replace it with whatever inferred volatility is actually traded on in the market for that strike price and that maturity. As this commenter mentioned, the BS formula is still used here as a convenient reference to do this calculation. If you extend your consideration to any maturity and any strike price (for the same underlying stock or thingy) then you get a volatility surface by the same reasoning.
Two things to mention. First, you can think of the volatility smile/ surface as adjusting the assumption of constant volatility, but you can also ascribe to it an adjustment of the assumption of a normal distribution of the underlying stock. There’s really no way to extricate those two assumptions, but you can convince yourself of this by a thought experiment: if the volatility stays fixed but the presumed shape of the distribution of the stocks gets fatter-tailed, for example, then option prices (for options that are far from the current price) will change, which will in turn change the implied volatility according to the market (i.e. the smile will deepen). In other words, the smile adjusts for more than one assumption.
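To make the mechanics concrete, here’s a minimal sketch (standard library only, with made-up option parameters) of the round trip the smile depends on: price a European call with the BS formula, then invert the formula by bisection to recover the implied volatility from a price.

```python
import math

def norm_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call(S, K, T, r, sigma):
    # Black-Scholes price of a European call with spot S, strike K,
    # maturity T (years), rate r, and volatility sigma
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)

def implied_vol(price, S, K, T, r, lo=1e-6, hi=5.0, tol=1e-8):
    # The call price is increasing in sigma, so bisection finds the
    # sigma that reproduces the observed market price
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bs_call(S, K, T, r, mid) < price:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

Running `implied_vol` against market prices for every traded strike and maturity of the same underlying, and plotting the results, is exactly what produces the smile and, across maturities, the surface.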
The other thing to mention: although we’ve done a relatively good job adjusting to market reality when pricing an option, when we apply our current risk measures like Value-at-Risk (VaR) to options, we still assume a normal distribution of risk factors (one of the risk factors, if we were pricing options, would be the implied volatility). So in other words, we might have a pretty good view of current prices, but it’s not at all clear we know how to make reasonable scenarios of future pricing shifts.
Ultimately, this assumption of normal distributions of risk factors in calculating VaR is actually pretty important in terms of our view of systemic risks. We do it out of computational convenience, by the way. That and because when we use fatter-tailed assumptions, people don’t like the answer.
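As a toy illustration of why the normality assumption matters (made-up return series, not any real portfolio): fit a normal distribution to a sample that contains a couple of crash days, and the parametric VaR comes out far below the plain empirical quantile of the same data.

```python
from statistics import NormalDist, mean, stdev

def parametric_var(returns, confidence=0.99):
    # Normal-assumption VaR: fit mean and stdev to the sample,
    # then read off the normal quantile at the confidence level
    z = NormalDist().inv_cdf(confidence)
    return -(mean(returns) - z * stdev(returns))

def historical_var(returns, confidence=0.99):
    # Empirical quantile of the observed returns: no distribution assumed
    s = sorted(returns)
    return -s[int((1 - confidence) * len(s))]

# 98 quiet days plus two crash days (hypothetical numbers)
returns = [0.001] * 98 + [-0.20, -0.25]
```

On this sample the empirical 99% VaR is the 20% crash day itself, while the normal fit reports a loss of well under half that: the fat tail is exactly what the normality assumption averages away.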
I wanted to give you the low-down on a data hackathon I participated in this weekend, which was sponsored by the NYU Institute for Public Knowledge on the topic of climate change and social information. We were assigned teams and given a very broad mandate. We had only 24 hours to do the work, so it had to be simple.
Our team consisted of Venky Kannan, Tom Levine, Eric Schles, Aaron Schumacher, Laura Noren, Stephen Fybish, and me.
We decided to think about the effects of super storms on different neighborhoods. In particular, to measure the recovery time of the subway ridership in various neighborhoods using census information. Our project was inspired by this “nofarehikes” map of New York which tries to measure the impact of a fare hike on the different parts of New York. Here’s a copy of our final slides.
Also, it’s not directly related to climate change, but rather rests on the assumption that with climate change comes more frequent extreme weather events, which seems to be an existing myth (please tell me if the evidence is or isn’t there for that myth).
We used three data sets: subway ridership by turnstile, which only exists since May 2010, the census of 2010 (which is kind of out of date but things don’t change that quickly) and daily weather observations from NOAA.
Using the weather map and relying on some formal definitions while making up some others, we came up with a timeline of extreme weather events:
Then we looked at subway daily ridership to see the effect of the storms or the recovery from the storms:
Then we used the census tracts to understand wealth in New York:
And of course we had to know which subway stations were in which census tracts. This isn’t perfect because we didn’t have time to assign “empty” census tracts to some nearby subway station. There are on the order of 2,000 census tracts but only on the order of 800 subway stations. But again, 24 hours isn’t a lot of time, even to build clustering algorithms.
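The tract-to-station assignment we were missing amounts to nearest-neighbor matching on coordinates, along these lines (the station names and coordinates here are hypothetical, and straight-line distance is only a rough proxy for real geography):

```python
import math

def nearest_station(tract_centroid, stations):
    # Assign a census tract (by its centroid) to the closest subway
    # station, using straight-line distance between (lat, lon) pairs
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    return min(stations, key=lambda name: dist(tract_centroid, stations[name]))

# Hypothetical station coordinates, not real MTA data
stations = {"125th St": (40.811, -73.952), "96th St": (40.794, -73.972)}
```

With a dict of all ~800 stations, one pass over the ~2,000 tract centroids would give every tract a station, including the “empty” ones we had to skip.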
Finally, we attempted to put the data together to measure which neighborhoods have longer-than-expected recovery times after extreme weather events. This is our picture:
Interestingly, it looks like the neighborhoods of Manhattan are most impacted by severe weather events, which is not in line with our prior [Update: I don't think we actually computed the impact on a given resident, but rather just the overall change in rate of ridership versus normal. An impact analysis would take into account the relative wealth of the neighborhoods and would probably look very different].
There are tons of caveats, I’ll mention only a few here:
- We didn’t have time to measure the extent to which the recovery time took longer because the subway stopped versus other reasons people might not use the subway. But our data is good enough to do this.
- Our data might have been overwhelmingly biased by Sandy. We’d really like to do this with much longer-term data, but the granular subway ridership data has not been available for long. But the good news is we can do this from now on.
- We didn’t have bus data at the same level, which is a huge part of whether someone can get to work, especially in the outer boroughs. This would have been great and would have given us a clearer picture.
- When someone can’t get to work, do they take a car service? How much does that cost? We’d love to have gotten our hands on the alternative ways people got to work and how that would impact them.
- In general we’d have liked to measure the impact relative to each neighborhood’s median salary.
- We would also have loved to have measured the extent to which each neighborhood consisted of salary versus hourly wage earners to further understand how a loss of transportation would translate into an impact on income.
I just read this paper, written by Björn Brembs and Marcus Munafò and entitled “Deep Impact: Unintended consequences of journal rank”. It was recently posted on the Computer Science arXiv (h/t Jordan Ellenberg).
I’ll give you a rundown on what it says, but first I want to applaud the fact that it was written in the first place. We need more studies like this, which examine the feedback loop of modeling at a societal level. Indeed this should be an emerging scientific or statistical field of study in its own right, considering how many models are being set up and deployed on the general public.
Here’s the abstract:
Much has been said about the increasing bureaucracy in science, stifling innovation, hampering the creativity of researchers and incentivizing misconduct, even outright fraud. Many anecdotes have been recounted, observations described and conclusions drawn about the negative impact of impact assessment on scientists and science. However, few of these accounts have drawn their conclusions from data, and those that have typically relied on a few studies. In this review, we present the most recent and pertinent data on the consequences that our current scholarly communication system has had on various measures of scientific quality (such as utility/citations, methodological soundness, expert ratings and retractions). These data confirm previous suspicions: using journal rank as an assessment tool is bad scientific practice. Moreover, the data lead us to argue that any journal rank (not only the currently-favored Impact Factor) would have this negative impact. Therefore, we suggest that abandoning journals altogether, in favor of a library-based scholarly communication system, will ultimately be necessary. This new system will use modern information technology to vastly improve the filter, sort and discovery function of the current journal system.
The key points in the paper are as follows:
- There’s a growing importance of science and trust in science
- There’s also a growing rate (x20 from 2000 to 2010) of retractions, with scientific misconduct cases growing even faster to become the majority of retractions (to an overall rate of 0.02% of published papers)
- There’s a larger and growing “publication bias” problem – in other words, an increasing unreliability of published findings
- One problem: initial “strong effects” get published in high-ranking journals, but subsequent “weak results” (which are probably more reasonable) are published in low-ranking journals
- The formal “Impact Factor” (IF) metric for rank is highly correlated to “journal rank”, defined below.
- There’s a higher incidence of retraction in high-ranking (measured through “high IF”) journals.
- “A meta-analysis of genetic association studies provides evidence that the extent to which a study over-estimates the likely true effect size is positively correlated with the IF of the journal in which it is published”
- Can the higher retraction rate in high-ranking (measured through “high IF”) journals be explained by higher visibility of those journals? They think not. Journal rank is a bad predictor of future citations, for example. [mathbabe inserts her opinion: this part needs more argument.]
- “…only the most highly selective journals such as Nature and Science come out ahead over unselective preprint repositories such as ArXiv and RePEc”
- Are there other measures of excellence that would correlate with IF? Methodological soundness? Reproducibility? No: “In fact, the level of reproducibility was so low that no relationship between journal rank and reproducibility could be detected.”
- More about Impact Factor: The IF is a metric for the number of citations to articles in a journal (the numerator), normalized by the number of articles in that journal (the denominator). Sounds good! But:
- For a given journal, IF is not calculated but is negotiated – the publisher can (and does) exclude certain articles (but not citations). Even retroactively!
- The IF is also not reproducible – errors are found and left unexplained.
- Finally, IF is likely skewed by the fat-tailedness of citations (certain articles get lots, most get few). Wouldn’t a more robust measure be given by the median?
- Journal rank is a weak to moderate predictor of scientific impact
- Journal rank is a moderate to strong predictor of both intentional and unintentional scientific unreliability
- Journal rank is expensive, delays science and frustrates researchers
- Journal rank as established by IF violates even the most basic scientific standards, but predicts subjective judgments of journal quality
- “IF generates an illusion of exclusivity and prestige based on an assumption that it will predict subsequent impact, which is not supported by empirical data.”
- “Systemic pressures on the author, rather than increased scrutiny on the part of the reader, inflate the unreliability of much scientific research. Without reform of our publication system, the incentives associated with increased pressure to publish in high-ranking journals will continue to encourage scientists to be less cautious in their conclusions (or worse), in an attempt to market their research to the top journals.”
- “It is conceivable that, for the last few decades, research institutions world-wide may have been hiring and promoting scientists who excel at marketing their work to top journals, but who are not necessarily equally good at conducting their research. Conversely, these institutions may have purged excellent scientists from their ranks, whose marketing skills did not meet institutional requirements. If this interpretation of the data is correct, we now have a generation of excellent marketers (possibly, but not necessarily also excellent scientists) as the leading figures of the scientific enterprise, constituting another potentially major contributing factor to the rise in retractions. This generation is now in charge of training the next generation of scientists, with all the foreseeable consequences for the reliability of scientific publications in the future.”
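The fat-tail point in the list above is easy to see numerically. With hypothetical citation counts, one blockbuster paper drags the mean (which is what the IF effectively averages) far from the typical article, while the median barely notices:

```python
from statistics import mean, median

# Citations for ten articles in a hypothetical journal: most get
# a handful, one becomes a blockbuster
citations = [0, 1, 1, 2, 2, 3, 3, 4, 5, 350]

if_style = mean(citations)    # IF-style average, pulled up by the one hit
robust = median(citations)    # what the typical article actually gets
```

Here the mean is 37.1 against a median of 2.5: the IF mostly measures the journal’s single best outcome, not its typical article.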
The authors suggest that we need a new kind of publishing platform. I wonder what they’d think of the Episciences Project.
A couple of nights ago I attended this event at Columbia on the topic of “Rent-Seeking, Instability and Fraud: Challenges for Financial Reform”.
The event was great, albeit depressing – I particularly loved Bill Black‘s concept of control fraud, which I’ll talk more about in a moment, as well as Lynn Turner‘s polite description of the devastation caused by the financial crisis.
To be honest, our conclusion wasn’t a surprise: there is a lack of political will in Congress or elsewhere to fix the problems, even the low-hanging obvious criminal frauds. There aren’t enough actual police to take on the job of dealing with the number of criminals that currently hide in the system (I believe the statistic was that there are about 1,000,000 people in law enforcement in this country, and 2,500 are devoted to white-collar crime), and the people at the top of the regulatory agencies have been carefully chosen to not actually do anything (or let their underlings do anything).
Even so, it was interesting to hear about this stuff through the eyes of a criminologist who has been around the block (Black was the guy who put away a bunch of fraudulent bankers after the S&L crisis) and knows a thing or two about prosecuting crimes. He talked about the concept of control fraud, and how pervasive control fraud is in the current financial system.
Control fraud, as I understood him to describe it, is the process by which a seemingly legitimate institution or process is corrupted by a fraudulent institution to maintain the patina of legitimacy.
Once you say it that way, you recognize it everywhere, and you realize how dirty it is, since outsiders to the system can’t tell what’s going on – hey, didn’t you have overseers? Didn’t they say everything was checking out ok? What the hell happened?
So for example, financial firms like Bank of America used control fraud in the heart of the housing bubble via their ridiculous accounting methods. As one of the speakers mentioned, the accounting firm in charge of vetting BofA’s books issued the same exact accounting description for many years in a row (literally copy and paste) even as BofA was accumulating massive quantities of risky mortgage-backed securities (update: I’ve been told it’s called an “Auditor’s Report” and it has required language. But surely not all the words are required? Otherwise how could it be called a report?). In other words, the accounting firm had been corrupted in order to aid and abet the fraud.
To get an idea of the repetitive nature and near-inevitability of control fraud, read this essay by Black, which is very much along the lines of his presentation on Tuesday. My favorite passage is this, when he addresses how our regulatory system “forgot about” control fraud during the deregulation boom of the 1990s:
On January 17, 1996, OTS’ Notice of Proposed Rulemaking proposed to eliminate its rule requiring effective underwriting on the grounds that such rules were peripheral to bank safety.
“The OTS believes that regulations should be reserved for core safety and soundness requirements. Details on prudent operating practices should be relegated to guidance.
Otherwise, regulated entities can find themselves unable to respond to market innovations because they are trapped in a rigid regulatory framework developed in accordance with conditions prevailing at an earlier time.”
This passage is delusional. Underwriting is the core function of a mortgage lender. Not underwriting mortgage loans is not an “innovation” – it is a “marker” of accounting control fraud. The OTS press release dismissed the agency’s most important and useful rule as an archaic relic of a failed philosophy.
Here’s where I bring mathematics into the mix. My experience in finance, first as a quant at D.E. Shaw, and then as a quantitative risk modeler at Riskmetrics, convinced me that mathematics itself is a vehicle for control fraud, albeit in two totally different ways.
In the context of hedge funds and/or hard-core trading algorithms, here’s how it works. New-fangled complex derivatives, starting with credit default swaps and moving on to CDO’s, MBS’s, and CDO+’s, got fronted as “innovation” by a bunch of economists who didn’t really know how markets work but worked at fancy places and claimed to have mathematical models which proved their point. They pushed for deregulation based on the theory that the derivatives represented “a better way to spread risk.”
Then the Ph.D.’s who were clever enough to understand how to actually price these instruments swooped in and made asstons of money. Those are the hedge funds, which I see as kind of amoral scavengers on the financial system.
At the same time, wanting a piece of the action, academics invented associated useless but impressive mathematical theories which culminated in mathematics classes throughout the country that teach “theory of finance”. These classes, which seemed scientific, and the associated economists described above, formed the “legitimacy” of this particular control fraud: it’s math, you wouldn’t understand it. But don’t you trust math? You do? Then allow us to move on with rocking our particular corner of the financial world, thanks.
I also worked in quantitative risk, which as I see it is a major conduit of mathematical control fraud.
First, we have people putting forward “risk estimates” that have larger error bars than the underlying values. In other words, if we were honest about how much we can actually anticipate price changes in mortgage-backed securities in times of panic, then we’d say something like, “search me! I got nothing.” However, as we know, it’s hard to say “I don’t know,” and it’s even harder to accept that answer when there’s money on the line. And I don’t apologize for caring about “times of panic” because, after all, that’s why we care about risk in the first place. It’s easy to predict risk in quiet times; I don’t give anyone credit for that.
Never mind error bars, though – the truth is, I saw worse than ignorance in my time in risk. What I actually saw was a rubber-stamping of “third party risk assessment” reports. I saw the risk industry for what it is, namely a poor beggar at the feet of its macho big-boys-of-finance clients. It wasn’t just my firm either. I’ve recently heard of clients bullying their third-party risk companies into letting them replace whatever the risk numbers were with their own. And that’s even assuming they care what the risk reports say.
Overall, I’m thinking this time is a bit different, but only in the details, not in the process. We’ve had control fraud for a long, long time, but now we have an added tool in the arsenal in the form of mathematics (and complexity). And I realize it’s not a standard example, because I’m claiming that the institution that perpetrated this particular control fraud wasn’t a specific institution like Bank of America, but rather the entire financial system. So far it’s just an idea I’m playing with – what do you think?
At my new job I’ve been spending my time editing my book with Rachel Schutt (who is joining me at JRL next week! Woohoo!). It’s called Doing Data Science and it’s based on these notes I took when she taught a class on data science at Columbia last semester. Right now I’m working on the alternating least squares chapter, where we learned from Matt Gattis how to build and optimize a recommendation system. A very cool algorithm.
However, to be honest I’ve started to feel very sorry for the one parameter we call λ. It’s also sometimes referred to as “the prior”.
Let me tell you, the world is asking too much from this little guy, and moreover most of the big-data world is too indifferent to its plight. Let me explain.
First, he’s supposed to reflect an actual prior belief – namely, his size is supposed to reflect a mathematical vision of how big we think the coefficients in our solution should be.
In an ideal world, we would think deeply about this question of size before looking at our training data, and think only about the scale of our data (i.e. the input), the scale of the preferences (i.e. the recommendation system output) and the quality and amount of training data we have, and using all of that, we’d figure out our prior belief on the size or at least the scale of our hoped-for solution.
I’m not a statistician, but that’s how I imagine I’d spend my days if I were: thinking through this reasoning carefully, and even writing it down carefully, before I ever start my training. It’s a discipline like any other to carefully state your beliefs beforehand so you know you’re not just saying what the data wants to hear.
λ as convergence insurance
But then there’s the next thing we ask of our parameter λ: namely, we assign him the responsibility of making sure our algorithm converges.
This is because our algorithm isn’t a closed-form solution; rather, we are discovering the coefficients of two separate matrices U and V, fixing one while we tweak the other, then switching. The algorithm stops when, after a full cycle of fixing and tweaking, none of the coefficients have moved by more than some pre-ordained ε.
That this algorithm will actually stop is not obvious, and in fact it isn’t always true.
It is (mostly*) true, however, if our little λ is large enough, which is due to the fact that our above-mentioned imposed belief of size translates into a penalty term, which we minimize along with the actual error term. This little miracle of translation is explained in this post.
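To make the fix-one-tweak-the-other loop concrete, here’s a minimal numpy sketch of regularized alternating least squares. It assumes a fully observed preference matrix R for simplicity (a real recommender handles missing entries), and the names `als`, `lam`, and `eps` are my own illustrative choices, not anyone’s production implementation:

```python
import numpy as np

def als(R, k=2, lam=0.1, eps=1e-4, max_iter=500):
    """Alternating least squares for R ~ U @ V.T with ridge penalty lam.

    R is a (users x items) preference matrix, assumed fully observed here.
    Stops when no coefficient moves by more than eps in a full cycle of
    fixing one matrix and tweaking the other.
    """
    m, n = R.shape
    rng = np.random.default_rng(0)
    U = rng.normal(size=(m, k))
    V = rng.normal(size=(n, k))
    penalty = lam * np.eye(k)  # the prior-on-size term, as a ridge penalty
    for _ in range(max_iter):
        U_old, V_old = U.copy(), V.copy()
        # Fix V, solve the ridge regression for U:
        U = R @ V @ np.linalg.inv(V.T @ V + penalty)
        # Fix U, solve the ridge regression for V:
        V = R.T @ U @ np.linalg.inv(U.T @ U + penalty)
        if max(np.abs(U - U_old).max(), np.abs(V - V_old).max()) < eps:
            break
    return U, V
```

Note how λ does double duty even in this toy version: it encodes the prior on coefficient size, and it keeps the matrix inverses well-defined so the loop can settle down.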
And people say that all the time. When you say, “hey, what if that algorithm doesn’t converge?” they say, “oh, if λ is big enough it always does.”
But that’s kind of like worrying about your teenage daughter getting pregnant so you lock her up in her room all the time. You’ve solved the immediate problem by sacrificing an even bigger goal.
Because let’s face it, if the prior is too big, then we are sacrificing our actual solution for the sake of conveniently small coefficients and convergence. In the asymptotic limit, which I love thinking about, our coefficients all go to zero and we get nothing at all. Our teenage daughter has run away from home with her do-nothing boyfriend.
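You can watch this sacrifice happen numerically. Here’s a small sketch (synthetic data, made-up numbers) showing that as λ grows, the ridge solution w = (XᵀX + λI)⁻¹Xᵀy gets crushed toward zero, taking our actual answer with it:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
# True coefficients we'd like to recover:
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

def ridge(X, y, lam):
    """Closed-form ridge regression: minimizes |Xw - y|^2 + lam * |w|^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in [0.0, 10.0, 1000.0, 1e6]:
    print(lam, np.round(ridge(X, y, lam), 3))
```

At λ = 0 the fit recovers something close to (2, −1, 0.5); by λ = 10⁶ every coefficient is effectively zero. Convergence achieved, daughter gone.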
By the way, there’s a discipline here too, and I’d suggest that if the algorithm doesn’t converge you might also want to consider reducing your number of latent variables rather than increasing your λ, since you could be asking too much from your training data. It just might not be able to distinguish that many important latent characteristics.
λ as tuning parameter
Finally, we have one more job for our little λ; we’re not done with him yet. Actually, for some people this is his only real job, because in practice this is how he’s treated. Namely, we optimize him so that our results look good under whatever metric we decide to care about – probably the mean squared error of preference prediction on a test set (hopefully on a test set!).
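In practice this “only real job” is just a grid search. A minimal sketch, with synthetic data and an illustrative candidate grid of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true + rng.normal(size=200)

# Hold out a test set (hopefully a test set!) for picking lam.
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, w):
    return np.mean((X @ w - y) ** 2)

# Grid search: pick whichever lam minimizes test-set MSE,
# ignoring whatever "prior belief" lam was supposed to encode.
lams = [0.01, 0.1, 1.0, 10.0, 100.0]
best = min(lams, key=lambda lam: mse(X_test, y_test, ridge(X_train, y_train, lam)))
```

The button-push is that last line: one call, no stated beliefs, no discipline required.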
In other words, in reality most of the above nonsense about λ is completely ignored.
This is one example among many where having the ability to push a button that makes something hard seem really easy might be doing more harm than good. In this case the button says “optimize with respect to λ”, but there are other buttons that worry me just as much, and moreover there are lots of buttons being built right now that are even more dangerous and allow the users to be even more big-data-blithe.
I’ve said it before and I’ll say it again: you do need to know about inverting a matrix, and other math too, if you want to be a good data scientist.
* There’s a change-of-basis ambiguity that’s tough to get rid of here, since you only choose the number of latent variables, not their order. This doesn’t change the overall penalty term, so you can minimize that with a large enough λ, but if you’re incredibly unlucky I can imagine you might bounce between different solutions that differ by a change of basis. In this case your steps should get smaller, i.e. the amount you modify your matrices each time you go through the algorithm. This is only a theoretical problem by the way, but I’m a nerd.
I recently gave an interview with Russ Roberts at EconTalk, which was fun and which has generated a lot of interesting feedback for me. I had no idea so many people listened to that podcast. Turns out it’ll eventually add up to something like 50,000, with half of those people listening this week. Cool!
One thing Russ and I talked about is still on my mind. Namely, how many problems are the direct result of people pretending to understand something, or exaggerating the certainty of an uncertain quantity. People just don’t acknowledge errorbars when they should!
What up, people?
Part of the problem exists because when we model something, the model typically just comes out with a single answer, usually a number, and it seems so certain to us, so tangible, even when we know that slightly different starting conditions or inputs to our models would have resulted in a different number.
So for example, an SAT score. We know that, on a different day with a different amount of sleep or a different test, we might score significantly differently. And yet the score is the score, and it’s hugely important and we brand ourselves with it as if it’s some kind of final word.
But another part of this problem is that people are seldom incentivized to admit they don’t know something. Indeed the ones we hear from the most are professional opinion-holders, and they are going to lose their audience and their gigs if they go on air saying, “I’m not sure what’s going to happen with [the economy], we’ve honestly never been in this situation before and our data is just not sufficient to make a prediction that’s worth its weight.”
You can replace “the economy” by anything and the problem still holds.
Who’s going to say that?? Someone who doesn’t mind losing their job is who. Which is too bad, because honest people do say that quite a large portion of the time. So professional opinion-holders are kind of trained to be dishonest in this way.
And so are TED talks, but that’s a vent for another day.
I wish there were a macho way to admit you didn’t know something, so people could understand that admitting uncertainty isn’t equivalent to being wishy-washy.
I mean, sometimes I want to bust out and say, “I don’t know that, and neither do you, motherfucker!” but I’m not sure how well that would go over. Some people get touchy about profanity.
But it’s getting there, and it points to something ironic about this uncertainty-as-wishy-washiness: it is sometimes macho to point out that other people are blowing smoke. In other words, I can be a whistle blower on other people’s illusion of certainty even when I can’t make being uncertain sound cool.
I think that explains, to some extent, why so many people end up criticizing other people for false claims rather than making a stance on uncertainty themselves. The other reason of course is that it’s easier to blow holes in other people’s theories, once stated, than it is to come up with a foolproof theory of one’s own.
Any suggestions for macho approaches to errorbars?
In his recent essay in the Wall Street Journal, Bill Gates proposed to “fix the world’s biggest problems” through “good measurement and a commitment to follow the data.” Sounds great!
Unfortunately it’s not so simple.
Gates describes a positive feedback loop when good data is collected and acted on. It’s hard to argue against this: given perfect data-collection procedures with relevant data, specific models do tend to improve, according to their chosen metrics of success. In fact this is almost tautological.
As I’ll explain, however, rather than focusing on how individual models improve with more data, we need to worry more about which models and which data have been chosen in the first place, why that process is successful when it is, and – most importantly – who gets to decide what data is collected and what models are trained.
Take Gates’s example of Ethiopia’s commitment to health care for its people. Let’s face it, it’s not new information that we should ensure “each home has access to a bed net to protect the family from malaria, a pit toilet, first-aid training and other basic health and safety practices.” What’s new is the political decision to do something about it. In other words, where Gates credits the measurement and data-collection for this, I’d suggest we give credit to the political system that allowed both the data collection and the actual resources to make it happen.
Gates also brings up the campaign to eradicate polio and how measurement has helped so much there as well. Here he sidesteps an enormous amount of politics and debate about how that campaign has been fought and, more importantly, how many scarce resources have been put towards it. But he has framed this fight himself, and has collected the data and defined the success metric, so that’s what he’s focused on.
Then he talks about teacher scoring and how great it would be to do that well. Teachers might not agree, and I’d argue they are correct to be wary about scoring systems, especially if they’ve experienced the random number generator called the Value Added Model. Many of the teacher strikes and failed negotiations are being caused by this system where, again, the people who own the model have the power.
Then he talks about college rankings and suggests we replace the flawed US News & World Report system with his own idea, namely “measures of which colleges were best preparing their graduates for the job market”. Note I’m not arguing for keeping that US News & World Report model, which is embarrassingly flawed and is consistently gamed. But the question is, who gets to choose the replacement?
This is where we get the closest to seeing him admit what’s really going on: that the person who defines the model defines success, and by obscuring this power behind a data collection process and incrementally improved model results, it seems somehow sanitized and objective when it’s not.
Let’s see some more examples of data collection and model design not being objective:
- We see that cars are safer for men than women because the crash-test dummies are men.
- We see that cars are safer for thin people because the crash-test dummies are thin.
- We see drugs are safer and more effective for white people because blacks are underrepresented in clinical trials (which is a whole other story about power and data collection in itself).
- We see that Polaroid film used to only pick up white skin because it was optimized for white people.
- We see that poor people are uninformed by definition of how we take opinion polls (read the fine print).
Bill Gates seems genuinely interested in tackling some big problems in the world, and I wish more people thought long and hard about how they could contribute like that. But the process he describes so lovingly is in fact highly fraught and dangerous.
Don’t be fooled by the mathematical imprimatur: behind every model and every data set is a political process that chose that data and built that model and defined success for that model.
I’m giving a talk at the Joint Mathematics Meeting on Thursday (it’s a 30 minute talk that starts at 11:20am, in Room 2 of the Upper Level of the San Diego Conference Center, I hope you come!).
Thinking about that talk brought something up for me that I think I want to address before the next talk. Namely, at the beginning of the talk I was explaining the title, “How Mathematics is Used Outside of Academia,” and I mentioned that most mathematicians that leave academia end up doing modeling.
I can’t remember the exact exchange, but I referred to myself at some point in this discussion as a mathematician outside of academia, at which point someone in the audience expressed incredulity:
him: Really? Are you still a mathematician? Do you prove theorems?
me: No, I don’t prove theorems any longer, now that I am a modeler… (confused look)
At the moment I didn’t have a good response to this, because he was using a different definition of “mathematician” than I was. For some reason he thought a mathematician must prove theorems.
I don’t think so. I had a conversation about this after my talk with Bob Beals, who was in the audience and who taught many years ago at the math summer program I did last summer. After getting his Ph.D. in math, Bob worked for the spooks, and now he works for RenTech. So he knows a lot about doing math outside academia too, and I liked his perspective on this question.
Namely, he wanted to look at the question through the lens of “grunt work”, which is to say all of the actual work that goes into a “result.”
As a mathematician, of course, you don’t simply sit around all day proving theorems. Actually you spend most of your time working through examples to get a feel for the terrain, and thinking up simple ways to do what seems like hard things, and trying out ideas that fail, and going down paths that are dry. If you’re lucky, then at the end of a long journey like this, you will have a theorem.
The same basic thing happens in modeling. You spend lots of time with the data, getting to know it, and then trying out certain approaches, which sometimes, or often, end up giving you nothing interesting, and half the time you realize you were expecting the wrong thing so you have to change it entirely. In the end you may end up with a model which is useful. If you’re lucky.
There’s a lot of grunt work in both endeavors, and there’s a lot of hard thinking along the way, lots of ways for you to fool yourself that you’ve got something when you haven’t. Perhaps in modeling it’s easier to lie, which is a big difference indeed. But if you’re an honest modeler then I claim the difference in the process of getting an interesting and important result is not that different.
And, I claim, I am still being a mathematician while I’m doing it.
I lied yesterday, as a friend at my Occupy meeting pointed out to me last night.
I made it seem like I look into every model before trusting it, and of course that’s not true. I eat food grown and prepared by other people daily. I go on airplanes and buses all the time, trusting that they will work and that they will be driven safely. I still have my money in a bank, and I also hire an accountant and sign my tax forms without reading them. So I’m a hypocrite, big-time.
There’s another thing I should clear up: I’m not claiming I understand everything about climate research just because I talked to an expert for 2 or 3 hours. I am certainly not an expert, nor am I planning to become one. Even so, I did learn a lot, and the research I undertook was incredibly useful to me.
So, for example, my father is a climate change denier, and I have heard him give a list of scientific facts to argue against climate change. I asked my expert to counter-argue these points, and he did so. I also asked him to explain the underlying model at a high level, which he did.
My conclusion wasn’t that I’ve looked carefully into the model and it’s right, because that’s not possible in such a short time. My conclusion was that this guy is trustworthy and uses logical argument, which he’s happy to share with interested people, and moreover he manages to defend against deniers without being intellectually defensive. In the end, I’m trusting him, an expert.
On the other hand, if I met another person with a totally different conclusion, who also impressed me as intellectually honest and curious, then I’d definitely listen to that guy too, and I’d be willing to change my mind.
So I do imbue models and theories with a limited amount of trust depending on how much sense they makes to me. I think that’s reasonable, and it’s in line with my advocacy of scientific interpreters. Obviously not all scientific interpreters would be telling the same story, but that’s not important – in fact it’s vital that they don’t, because it is a privilege to be allowed to listen to the different sides and be engaged in the debate.
If I sat down with an expert for a whole day, like my friend Jordan suggests, to determine if they were “right” on an issue where there’s argument among experts, then I’d fail, but even understanding what they were arguing about would be worthwhile and educational.
Let me say this another way: experts argue about what they don’t agree on, of course, since it would be silly for them to talk about what they do agree on. But it’s their commonality that we, the laypeople, are missing. And that commonality is often so well understood that we could understand it rather quickly if it was willingly explained to us. That would be a huge step.
So I wasn’t lying after all, if I am allowed to define the “it” that I did get at in the two hours with an expert. When I say I understood it, I didn’t mean everything, I meant a much larger chunk of the approach and method than I’d had before, and enough to evoke (limited) trust.
Something I haven’t addressed, which I need to think about more (please help!), is the question of what subjects require active skepticism. One of my commenters, Paul Stevens, brought this up:
… For me, lay people means John Q Public – public opinion because public opinion can shape policy. In practice, this only matters for a select few issues, such as climate change or science education. There is no impact to a lay person not understanding / believing in the Higgs particle for example.
Stephanie asks three important questions about trusting experts, which I paraphrase here:
- What does it take to look into a model yourself? How deeply must you probe?
- How do you avoid being manipulated when you do so?
- Why should we bother since stuff is so hard and we each have a limited amount of time?
I must confess I find the first two questions really interesting and I want to think about them, but I have a very little patience with the last question.
- I’ve seen too many people (individual modelers) intentionally deflect investigations into models by setting them up as so hard that it’s not worth it (or at least it seems not worth it). They use buzz words and make it seem like there’s a magical layer of their model which makes it too difficult for mere mortals. But my experience (as an arrogant, provocative, and relentless questioner) is that I can always understand a given model if I’m talking to someone who really understands it and actually wants to communicate it.
- It smacks of an excuse rather than a reason. If it’s our responsibility to understand something, then by golly we should do it, even if it’s hard.
- Too many things are left up to people whose intentions are not reasonable using this “too hard” argument, and it gives those people reason to make entire systems seem too difficult to penetrate. For a great example, see the financial system, which is consistently too complicated for regulators to properly regulate.
I’m sure I seem unbelievably cynical here, but that’s where I got by working in finance, where I saw first-hand how manipulative and manipulated mathematical modeling can become. And there’s no reason at all such machinations wouldn’t translate to the world of big data or climate modeling.
Speaking of climate modeling: first, it annoys me that people are using my “distrust the experts” line to cast doubt on climate modelers.
People: I’m not asking you to simply be skeptical, I’m saying you should look into the models yourself! It’s the difference between sitting on a couch and pointing at a football game on TV and complaining about a missed play and getting on the football field yourself and trying to figure out how to throw the ball. The first is entertainment but not valuable to anyone but yourself. You are only adding to the discussion if you invest actual thoughtful work into the matter.
To that end, I invited an expert climate researcher to my house and asked him to explain the climate models to me and my husband, and although I’m not particularly skeptical of climate change research (more on that below when I compare incentives of the two sides), I asked obnoxious, relentless questions about the model until I was satisfied. And now I am satisfied. I am considering writing it up as a post.
As an aside, if climate researchers are annoyed by the skepticism, I can understand that, since football fans are an obnoxious group, but they should not get annoyed by people who want to actually do the work to understand the underlying models.
Another thing about climate research. People keep talking about incentives, and yes I agree wholeheartedly that we should follow the incentives to understand where manipulation might be taking place. But when I followed the incentives with respect to climate modeling, they bring me straight to climate change deniers, not to researchers.
Do we really think these scientists working with their research grants have more at stake than multi-billion dollar international companies who are trying to ignore the effect of their polluting factories on the environment? People, please. The bulk of the incentives are definitely with the business owners. Which is not to say there are no incentives on the other side, since everyone always wants to feel like their research is meaningful, but let’s get real.
I like this idea Stephanie comes up with:
Some sociologists of science suggest that translational “experts”–that is, “experts” who aren’t necessarily producing new information and research, but instead are “expert” enough to communicate stuff to those not trained in the area–can help bridge this divide without requiring everyone to become “experts” themselves. But that can also raise the question of whether these translational experts have hidden agendas in some way. Moreover, one can also raise questions of whether a partial understanding of the model might in some instances be more misleading than not looking into the model at all–examples of that could be the various challenges to evolution based on fairly minor examples that when fully contextualized seem minor but may pop out to someone who is doing a less systematic inquiry.
First, I attempt to make my blog something like a platform for this, and I also do my best to make my agenda not at all hidden so people don’t have to worry about that.
This raises a few issues for me:
- Right now we depend mostly on press to do our translations, but they aren’t typically trained as scientists. Does that make them more prone to being manipulated? I think it does.
- How do we encourage more translational expertise to emerge from actual experts? Currently, in academia, the translation to the general public of one’s research is not at all encouraged or rewarded, and outside academia even less so.
- Like Stephanie, I worry about hidden agendas and partial understandings, but I honestly think they are secondary to getting a robust system of translation started to begin with, which would hopefully in turn engage the general public with the scientific method and current scientific knowledge. In other words, the good outweighs the bad here.
Crossposted on Naked Capitalism
I just finished reading Nate Silver’s newish book, The Signal and the Noise: Why so many predictions fail – but some don’t.
The good news
First off, let me say this: I’m very happy that people are reading a book on modeling in such huge numbers – it’s currently eighth on the New York Times best seller list and it’s been on the list for nine weeks. This means people are starting to really care about modeling, both how it can help us remove biases to clarify reality and how it can institutionalize those same biases and go bad.
As a modeler myself, I am extremely concerned about how models affect the public, so the book’s success is wonderful news. The first step to get people to think critically about something is to get them to think about it at all.
Moreover, the book serves as a soft introduction to some of the issues surrounding modeling. Silver has a knack for explaining things in plain English. While he only goes so far, this is reasonable considering his audience. And he doesn’t dumb the math down.
In particular, Silver does a nice job of explaining Bayes’ Theorem. (If you don’t know what Bayes’ Theorem is, just focus on how Silver uses it in his version of Bayesian modeling: namely, as a way of adjusting your estimate of the probability of an event as you collect more information. You might think infidelity is rare, for example, but after a quick poll of your friends and a quick Google search you might have collected enough information to reexamine and revise your estimates.)
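The infidelity example above is just an application of Bayes’ rule, P(A|B) = P(B|A)·P(A) / P(B). Here’s the arithmetic with made-up illustrative numbers:

```python
# Updating a prior belief as evidence comes in, via Bayes' rule.
# All three probabilities below are invented for illustration.
prior = 0.04                # prior probability your partner is cheating
p_evidence_if_true = 0.5    # chance of finding strange texts, if cheating
p_evidence_if_false = 0.05  # chance of finding them anyway, if not

posterior = (p_evidence_if_true * prior) / (
    p_evidence_if_true * prior + p_evidence_if_false * (1 - prior)
)
# posterior is about 0.29: one piece of evidence has turned a 4%
# suspicion into roughly a 29% one. Collect more evidence, update again.
```

The point is the shape of the update, not the particular numbers: new information moves you away from your prior in proportion to how much better the “cheating” hypothesis explains the evidence.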
The bad news
Having said all that, I have major problems with this book and what it claims to explain. In fact, I’m angry.
It would be reasonable for Silver to tell us about his baseball models, which he does. It would be reasonable for him to tell us about political polling and how he uses weights on different polls to combine them to get a better overall poll. He does this as well. He also interviews a bunch of people who model in other fields, like meteorology and earthquake prediction, which is fine, albeit superficial.
What is not reasonable, however, is for Silver to claim to understand how the financial crisis was a result of a few inaccurate models, and how medical research need only switch from being frequentist to being Bayesian to become more accurate.
Let me give you some concrete examples from his book.
Easy first example: credit rating agencies
The ratings agencies, which famously put AAA ratings on terrible loans, and spoke among themselves as being willing to rate things that were structured by cows, did not accidentally have bad underlying models. The bankers packaging and selling these deals, which amongst themselves they called sacks of shit, did not blithely believe in their safety because of those ratings.
Rather, the entire industry crucially depended on the false models. Indeed they changed the data to conform with the models, which is to say it was an intentional combination of using flawed models and using irrelevant historical data (see points 64-69 here for more (Update: that link is now behind the paywall)).
In baseball, a team can’t create bad or misleading data to game the models of other teams in order to get an edge. But in the financial markets, parties to a model can and do.
In fact, every failed model is actually a success
Silver gives four examples of what he considers to be failed models at the end of his first chapter, all related to economics and finance. But each example is actually a success (for the insiders) if you look at a slightly larger picture and understand the incentives inside the system. Here are the models:
- The housing bubble.
- The credit rating agencies selling AAA ratings on mortgage securities.
- The financial melt-down caused by high leverage in the banking sector.
- The economists’ predictions after the financial crisis of a fast recovery.
Here’s how each of these models worked out rather well for those inside the system:
- Everyone involved in the mortgage industry made a killing. Who’s going to stop the music and tell people to worry about home values? Homeowners and taxpayers made money (on paper at least) in the short term but lost in the long term, but the bankers took home bonuses that they still have.
- As we discussed, this was a system-wide tool for building a money machine.
- The financial melt-down was incidental, but the leverage was intentional. It bumped up the risk and thus, in good times, the bonuses. This is a great example of the modeling feedback loop: nobody cares about the wider consequences if they’re getting bonuses in the meantime.
- Economists are only putatively trying to predict the recovery. Actually they’re trying to affect the recovery. They get paid the big bucks, and they are granted authority and power in part to give consumers confidence, which they presumably hope will lead to a robust economy.
Cause and effect get confused
Silver confuses cause and effect. We didn’t have a financial crisis because of a bad model or a few bad models. We had bad models because of a corrupt and criminally fraudulent financial system.
That’s an important distinction, because we could fix a few bad models with a few good mathematicians, but we can’t fix the entire system so easily. There’s no math band-aid that will cure these boo-boos.
I can’t emphasize this too strongly: this is not just wrong, it’s maliciously wrong. If people believe in the math band-aid, then we won’t fix the problems in the system that so desperately need fixing.
Why does he make this mistake?
Silver has an unswerving assumption, which he repeats several times, that the only goal of a modeler is to produce an accurate model. (Actually, he made an exception for stock analysts.)
This assumption generally holds in his experience: poker, baseball, and polling are all arenas in which one’s incentive is to be as accurate as possible. But he falls prey to some of the very mistakes he warns about in his book, namely over-confidence and over-generalization. He assumes that, since he’s an expert in those arenas, he can generalize to the field of finance, where he is not an expert.
The logical result of this assumption is his definition of failure as something where the underlying mathematical model is inaccurate. But that’s not how most people would define failure, and it is dangerously naive.
Silver refers, both in the Introduction and in Chapter 8, to John Ioannidis’s work, which reveals that most medical research is wrong. Silver explains his point of view in the following way:
I’m glad he mentions incentives here, but again he confuses cause and effect.
As I learned when I attended David Madigan’s lecture on Merck’s representation of Vioxx research to the FDA as well as his recent research on the methods in epidemiology research, the flaws in these medical models will be hard to combat, because they advance the interests of the insiders: competition among academic researchers to publish and get tenure is fierce, and there are enormous financial incentives for pharmaceutical companies.
Everyone in this system benefits from methods that allow one to claim statistically significant results, whether or not that’s valid science, and even though there are lives on the line.
In other words, it’s not that there are bad statistical approaches which lead to vastly over-reported statistically significant results and published papers (which could just as easily happen if the researchers were employing Bayesian techniques, by the way). It’s that there’s massive incentive to claim statistically significant findings, and not much push-back when that’s done erroneously, so the field never self-examines and improves their methodology. The bad models are a consequence of misaligned incentives.
I’m not accusing people in these fields of intentionally putting people’s lives on the line for the sake of their publication records. Most of the people in the field are honestly trying their best. But their intentions are kind of irrelevant.
Silver ignores politics and loves experts
Silver chooses to focus on individuals working in a tight competition and their motives and individual biases, which he understands and explains well. For him, modeling is a man versus wild type thing, working with your wits in a finite universe to win the chess game.
He spends very little time on the question of how people act inside larger systems, where a given modeler might be more interested in keeping their job or getting a big bonus than in making their model as accurate as possible.
In other words, Silver crafts an argument which ignores politics. This is Silver’s blind spot: in the real world politics often trump accuracy, and accurate mathematical models don’t matter as much as he hopes they would.
As an example of politics getting in the way, let’s go back to the culture of the credit rating agency Moody’s. William Harrington, an ex-Moody’s analyst, describes the politics of his work as follows:
In 2004 you could still talk back and stop a deal. That was gone by 2006. It became: work your tail off, and at some point management would say, ‘Time’s up, let’s convene in a committee and we’ll all vote “yes”‘.
To be fair, there have been moments in his past when Silver delves into politics directly, like this post from the beginning of Obama’s first administration, where he starts with this (emphasis mine):
To suggest that Obama or Geithner are tools of Wall Street and are looking out for something other than the country’s best interest is freaking asinine.
and he ends with:
This is neither the time nor the place for mass movements — this is the time for expert opinion. Once the experts (and I’m not one of them) have reached some kind of a consensus about what the best course of action is (and they haven’t yet), then figure out who is impeding that action for political or other disingenuous reasons and tackle them — do whatever you can to remove them from the playing field. But we’re not at that stage yet.
My conclusion: Nate Silver is a man who deeply believes in experts, even when the evidence is not good that they have aligned incentives with the public.
Distrust the experts
Call me “asinine,” but I have less faith in the experts than Nate Silver: I don’t want to trust the very people who got us into this mess, while benefitting from it, to also be in charge of cleaning it up. And, being part of the Occupy movement, I obviously think that this is the time for mass movements.
From my experience working first in finance at the hedge fund D.E. Shaw during the credit crisis and afterwards at the risk firm RiskMetrics, and from my subsequent experience working in the internet advertising space (a wild west of unregulated personal information warehousing and sales), my conclusion is simple: Distrust the experts.
Why? Because you don’t know their incentives, and they can make the models (including Bayesian models) say whatever is politically useful to them. This is a manipulation of the public’s trust of mathematics, but it is the norm rather than the exception. And modelers rarely if ever consider the feedback loop and the ramifications of their predatory models on our culture.
Why do people like Nate Silver so much?
To be crystal clear: my big complaint about Silver is naivete, and to a lesser extent, authority-worship.
I’m not criticizing Silver for not understanding the financial system. Indeed one of the most crucial problems with the current system is its complexity, and as I’ve said before, most people inside finance don’t really understand it. But at the very least he should know that he is not an authority and should not act like one.
I’m also not accusing him of knowingly covering for the financial industry. But covering for it is an unfortunate side-effect of his naivete and presumed authority, and a very unwelcome source of noise at a moment when so much needs to be done.
I’m writing a book myself on modeling. When I began reading Silver’s book I was a bit worried that he’d already said everything I’d wanted to say. Instead, I feel like he’s written a book which has the potential to dangerously mislead people – if it hasn’t already – because of its lack of consideration of the surrounding political landscape.
Silver has gone to great lengths to make his message simple, and positive, and to make people feel smart and smug, especially Obama’s supporters.
He gets well-paid for his political consulting work and speaker appearances at hedge funds like D.E. Shaw and Jane Street, and, in order to maintain this income, it’s critical that he perfect a patina of modeling genius combined with an easily digested message for his financial and political clients.
Silver is selling a story we all want to hear, and a story we all want to be true. Unfortunately for us and for the world, it’s not.
How to push back against the celebrity-ization of data science
The truth is somewhat harder to understand, a lot less palatable, and much more important than Silver’s gloss. But when independent people like myself step up to denounce a given statement or theory, it’s not clear to the public who is the expert and who isn’t. From this vantage point, the happier, shorter message will win every time.
This raises a larger question: how can the public possibly sort through all the noise that celebrity-minded data people like Nate Silver hand to them on a silver platter? Whose job is it to push back against rubbish disguised as authoritative scientific theory?
It’s not a new question, since PR men disguising themselves as scientists have been around for decades. But I’d argue it’s a question that is increasingly urgent considering how much of our lives are becoming modeled. It would be great if substantive data scientists had a way of getting together to defend the subject against sensationalist celebrity-fueled noise.
One hope I nurture is that, with the opening of the various data science institutes such as the one at Columbia which was announced a few months ago, there will be a way to form exactly such a committee. Can we get a little peer review here, people?
There’s an easy test here to determine whether to be worried. If you see someone using a model to make predictions that directly benefit them or lose them money – like a day trader, or a chess player, or someone who literally places a bet on an outcome (unless they place another hidden bet on the opposite outcome) – then you can be sure they are optimizing their model for accuracy as best they can. And in this case Silver’s advice on how to avoid one’s own biases is excellent and useful.
But if you are witnessing someone creating a model which predicts outcomes that are irrelevant to their immediate bottom-line, then you might want to look into the model yourself.
In the final week of Rachel Schutt’s Columbia Data Science course we heard from two groups of students as well as from Rachel herself.
Data Science; class consciousness
The first team of presenters consisted of Yegor, Eurry, and Adam. Many others whose names I didn’t write down contributed to the research, visualization, and writing.
First they showed us the very cool graphic explaining how self-reported skills vary by discipline. The data they used came from the class itself, which did this exercise on the first day:
The star in the middle is the average for the whole class, and each star along the side corresponds to the average (self-reported) skills of people within a specific discipline. The dotted lines on the outside stars show the “average” star, so it’s easier to see how things vary per discipline compared to the average.
Surprises: Business people seem to think they’re really great at everything except communication. Journalists are better at data wrangling than engineers.
We will get back to the accuracy of self-reported skills later.
We were asked, do you see your reflection in your star?
Also, take a look at the different stars. How would you use them to build a data science team? Would you want people who are good at different skills? Is it enough to have all the skills covered? Are there complementary skills? Are the skills additive, or do you need overlapping skills among team members?
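The aggregation behind those stars is simple to sketch in pandas. Here’s a minimal, hypothetical version – the disciplines, skill columns, and survey values below are invented for illustration, not the class’s actual data:

```python
import pandas as pd

# Hypothetical self-reported skill levels (1-10); disciplines and numbers invented
df = pd.DataFrame({
    "discipline": ["stats", "stats", "cs", "cs", "journalism"],
    "data_wrangling": [6, 7, 5, 6, 8],
    "visualization": [4, 5, 6, 7, 5],
    "communication": [5, 6, 4, 5, 9],
})

class_avg = df.drop(columns="discipline").mean()  # the star in the middle
by_discipline = df.groupby("discipline").mean()   # one star per discipline
deviation = by_discipline - class_avg             # what the dotted overlay shows

print(class_avg)
print(deviation.round(2))
```

Each row of `deviation` is one outside star compared against the class-wide star, which is exactly the comparison the dotted lines make visible.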
If all data which had ever been collected were freely available to everyone, would we be better off?
Some ideas were offered:
- all nude photos are included. [Mathbabe interjects: it's possible to not let people take nude pics of you. Just sayin'.]
- so are passwords, credit scores, etc.
- how do we make secure transactions between a person and her bank considering this?
- what does it mean to be “freely available” anyway?
The data of power; the power of data
You see a lot of people posting crap like this on Facebook:
But here’s the thing: the Berner Convention doesn’t exist. People are posting this to their walls because they care about their privacy. People think they can exercise control over their data, but they can’t. Stuff like this gives one a false sense of security.
In Europe the privacy laws are stricter, and you can request data from Irish Facebook and they’re supposed to do it, but it’s still not easy to successfully do.
And it’s not just data that’s being collected about you – it’s data you’re collecting. As scientists we have to be careful about what we create, and take responsibility for our creations.
As Francois Rabelais said,
Wisdom entereth not into a malicious mind, and science without conscience is but the ruin of the soul.
Or as Emily Bell from Columbia said,
Every algorithm is editorial.
We can’t be evil during the day and take it back at hackathons at night. Just as journalists need to be aware that the way they report stories has consequences, so do data scientists. As a data scientist one has impact on people’s lives and how they think.
Here are some takeaways from the course:
- We’ve gained significant powers in this course.
- In the future we may have the opportunity to do more.
- With data power comes data responsibility.
Who does data science empower?
The second presentation was given by Jed and Mike. Again, they had a bunch of people on their team helping out.
Let’s start with a quote:
“Anything which uses science as part of its name isn’t a science: political science, creation science, computer science.”
- Hal Abelson, MIT CS prof
Keeping this in mind, if you could re-label data science, would you? What would you call it?
Some comments from the audience:
- Let’s call it “modellurgy,” the craft of beating mathematical models into shape instead of metal
- Let’s call it “statistics”
Does it really matter what data science is? What should it end up being?
Chris Wiggins from Columbia contends there are two main views of what data science should end up being. The first stems from John Tukey, inventor of the fast Fourier transform and the box plot, and father of exploratory data analysis. Tukey advocated for a style of research he called “data analysis,” emphasizing the primacy of data and therefore computation, which he saw as part of statistics. His descriptions of data analysis are very similar to what people call data science today.
The other perspective comes from Jim Gray, a computer scientist at Microsoft. He saw the scientific ideals of the Enlightenment as expanding and evolving: we’ve gone from the theories of Darwin and Newton to the experimental and computational approaches of Turing. Now we have a new, data-driven paradigm. It’s actually the fourth paradigm of all the sciences, the first three being experimental, theoretical, and computational. See more about this here.
Wait, can data science be both?
Note it’s difficult to stick Computer Science and Data Science on this line.
Statistics is a tool that everyone uses. Data science also could be seen that way, as a tool rather than a science.
Who does data science?
Here’s a graphic showing the make-up of Kaggle competitors. Teams of students collaborated to collect, wrangle, analyze and visualize this data:
The size of the blocks corresponds to how many people in active competitions have an educational background in a given field. We see that almost a quarter of competitors are computer scientists. The shading corresponds to how often they compete; the business/finance people do more competitions on average than the computer science people.
Consider this: the only people doing math competitions are math people. If you think about it, it’s kind of amazing how many different backgrounds are represented above.
We got some cool graphics created by the students who collaborated to get the data, process it, visualize it and so on.
Which universities offer courses on Data Science?
There will be 26 universities in total by 2013 that offer data science courses. The balls are centered at the center of gravity of a given state, and a ball is bigger if that state has more such universities.
Where are data science jobs available?
- We see more professional schools offering data science courses on the west coast.
- It would also be interesting to see this corrected for population size.
- Only two states had no jobs.
- Massachusetts #1 per capita, then Maryland
McKinsey says there will be hundreds of thousands of data science jobs in the next few years. There’s a massive demand in any case. Some of us will be part of that. It’s up to us to make sure what we’re doing is really data science, rather than validating previously held beliefs.
We need to advance human knowledge if we want to take the word “scientist” seriously.
How did this class empower you?
You are one of the first people to take a data science class. There’s something powerful there.
Thank you Rachel!
Last Day of Columbia Data Science Class, What just happened? from Rachel’s perspective
Recall the stated goals of this class were:
- learn about what it’s like to be a data scientist
- be able to do some of what a data scientist does
Hey we did this! Think of all the guest lectures; they taught you a lot of what it’s like to be a data scientist, which was goal 1. Here’s what I wanted you guys to learn before the class started based on what a data scientist does, and you’ve learned a lot of that, which was goal 2:
Mission accomplished! Mission accomplished?
Thought experiment that I gave to myself last Spring
How would you design a data science class?
Comments I made to myself:
- It’s not a well-defined body of knowledge or subject, and there’s no textbook!
- It’s popularized and celebrated in the press and media, but there’s no “authority” to push back
- I’m intellectually disturbed by idea of teaching a course when the body of knowledge is ill-defined
- I didn’t know who would show up, and what their backgrounds and motivations would be
- Could it become redundant with a machine learning class?
I asked questions of myself and of other people. I gathered information, and endured existential angst about data science not being a “real thing.” I needed to give it structure.
Then I started to think about it this way: while data science has the potential to be a deep research area, it’s not there yet. In order to actually design a class, I took a pragmatic approach: recognize that data science exists. After all, there are jobs out there, and I want to help students be qualified for them. So let me teach them what it takes to get those jobs. That’s how I decided to approach it.
In other words, from this perspective, data science is what data scientists do. So it’s back to the list of what data scientists do. I needed to find structure on top of that, so the structure I used as a starting point were the data scientist profiles.
Data scientist profiles
This was a way to think about your strengths and weaknesses, as well as a link between speakers. Note it’s easy to focus on “technical skills,” but the profile can be problematic in being too skills-based, and in having no scale and no notion of expertise. On the other hand, it’s good in that it allows for and captures variability among data scientists.
I assigned weekly guest speakers topics related to their strengths. We held lectures, labs, and (optional) problem sessions. From this you got mad skillz:
- programming in R
- some python
- you learned some best practices about coding
From the perspective of machine learning,
- you know a bunch of algorithms like linear regression, logistic regression, k-nearest neighbors, k-means, naive Bayes, and random forests
- you know what they are, what they’re used for, and how to implement them
- you learned machine learning concepts like training sets, test sets, over-fitting, bias-variance tradeoff, evaluation metrics, feature selection, supervised vs. unsupervised learning
- you learned about recommendation systems
- you’ve entered a Kaggle competition
Importantly, you now know that if there is an algorithm and model that you don’t know, you can (and will) look it up and figure it out. I’m pretty sure you’ve all improved relative to how you started.
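To make the train/test and overfitting ideas from that list concrete, here’s a minimal sketch in scikit-learn on synthetic data (the dataset and all numbers are invented for illustration): a 1-nearest-neighbor model memorizes its training set perfectly but does worse on held-out data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic two-class data: the label depends on the feature sum plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for k in (1, 15):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    results[k] = (accuracy_score(y_train, model.predict(X_train)),  # training accuracy
                  accuracy_score(y_test, model.predict(X_test)))    # held-out accuracy
    print(k, results[k])
# k=1 memorizes the training set (training accuracy 1.0) but generalizes worse
```

The gap between training and test accuracy at k=1 is overfitting in miniature; a smoother model like k=15 usually narrows that gap, trading training fit for generalization.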
You’ve learned some data viz by taking flowing data tutorials.
You’ve learned statistical inference, because we discussed
- observational studies,
- causal inference, and
- experimental design.
- We also learned some maximum likelihood topics, but I’d urge you to take more stats classes.
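As a pocket-sized example of the maximum likelihood idea from that list, here’s a sketch on entirely made-up data: estimating the bias of a coin. For Bernoulli data the likelihood is maximized at the sample mean, and a simple grid search recovers that closed-form answer.

```python
import numpy as np

# Simulated coin flips with (hidden) true bias p = 0.3; everything here is made up
rng = np.random.default_rng(1)
flips = (rng.random(1000) < 0.3).astype(int)

def log_likelihood(p, data):
    # Bernoulli log-likelihood of bias p given 0/1 data
    heads = data.sum()
    tails = len(data) - heads
    return heads * np.log(p) + tails * np.log(1 - p)

# Maximize over a grid; for Bernoulli data the closed-form MLE is the sample mean
grid = np.linspace(0.01, 0.99, 99)
p_hat = grid[np.argmax([log_likelihood(p, flips) for p in grid])]
print(p_hat, flips.mean())
```

The two printed numbers agree to within the grid spacing, which is the point: the abstract “maximize the likelihood” recipe lands on the intuitive estimate.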
In the realm of data engineering,
- we showed you MapReduce and Hadoop
- we worked with 30 separate shards
- we used an API to get data
- we spent time cleaning data
- we’ve processed different kinds of data
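The MapReduce pattern itself fits in a few lines. Here’s a single-machine sketch of the canonical word count, just to show the map / shuffle / reduce stages that Hadoop distributes across machines (the input lines are made up):

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # map: emit one (key, value) pair per word
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # shuffle: group all values by key, as the framework does between stages
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # reduce: combine the values for one key
    return key, sum(values)

lines = ["big data big models", "data beats models"]
pairs = chain.from_iterable(mapper(line) for line in lines)
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'models': 2, 'beats': 1}
```

The real framework’s contribution is running many mappers and reducers in parallel across machines; the logic per stage is exactly this simple.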
As for communication,
- you wrote thoughts in response to blog posts
- you observed how different data scientists communicate or present themselves, and have different styles
- your final project required communicating among each other
As for domain knowledge,
- lots of examples were shown to you: social networks, advertising, finance, pharma, recommender systems, the Dallas Art Museum
I heard people have been asking the following: why didn’t we see more data science coming from non-profits, governments, and universities? Note that the term “data science” was born in the for-profit world. But the truth is I’d also like to see more of that. It’s up to you guys to go get that done!
How do I measure the impact of this class I’ve created? Is it possible to incubate awesome data science teams in the classroom? I might have taken you from point A to point B but you might have gone there anyway without me. There’s no counterfactual!
Can we set this up as a data science problem? Can we use a causal modeling approach? This would require finding students who were more or less like you but didn’t take this class and use propensity score matching. It’s not a very well-defined experiment.
But the goal is important: in industry they say you can’t learn data science in a university, that it has to be on the job. But maybe that’s wrong, and maybe this class has proved that.
What has been the impact on you or to the outside world? I feel we have been contributing to the broader discourse.
Does it matter if there was impact? and does it matter if it can be measured or not? Let me switch gears.
What is data science again?
Data science could be defined as:
- A set of best practices used in tech companies, which is how I chose to design the course
- A space of problems that could be solved with data
- A science of data where you can think of the data itself as units
The bottom two have the potential to be the basis of a rich and deep research discipline, but in many cases, the way the term is currently used is:
- Pure hype
But it doesn’t matter how we define it, as much as that I want for you:
- to be problem solvers
- to be question askers
- to think about your process
- to use data responsibly and make the world better, not worse.
More on being problem solvers: cultivate certain habits of mind
Here’s a possible list of things to strive for, taken from here:
Here’s the thing. Tons of people can implement k-nearest neighbors, and many do it badly. What matters is that you cultivate the above habits and remain open to continuous learning.
In traditional educational settings, we focus on answers. But what we probably should focus on is how a student behaves when they don’t know the answer. We need to cultivate the qualities that help us find the answer.
How would you design a data science class around habits of mind rather than technical skills? How would you quantify it? How would you evaluate? What would students be able to write on their resumes?
Comments from the students:
- You’d need to keep making people do stuff they don’t know how to do, while keeping them excited about it.
- Have people do stuff in their own domains, so we keep up wonderment and awe.
- You’d use case studies across industries to see how things work in different contexts
More on being question-askers
Some suggestions on asking questions of others:
- start with assumption that you’re smart
- don’t assume the person you’re talking to knows more or less. You’re not trying to prove anything.
- be curious like a child, not worried about appearing stupid
- ask for clarification around notation or terminology
- ask for clarification around process: where did this data come from? how will it be used? why is this the right data to use? who is going to do what? how will we work together?
Some questions to ask yourself
- does it have to be this way?
- what is the problem?
- how can I measure this?
- what is the appropriate algorithm?
- how will I evaluate this?
- do I have the skills to do this?
- how can I learn to do this?
- who can I work with? Who can I ask?
- how will it impact the real world?
Data Science Processes
In addition to being problem-solvers and question-askers, I mentioned that I want you to think about process. Here are a couple processes we discussed in this course:
(1) Real World -> Generates Data ->
Collect Data -> Clean, Munge (90% of your time) ->
Exploratory Data Analysis ->
Feature Selection ->
Build Model, Build Algorithm, Visualize ->
Evaluate -> Iterate ->
Impact Real World
(2) Asking questions of yourselves and others ->
Identifying problems that need to be solved ->
Gathering information, Measuring ->
Learning to find structure in unstructured situations ->
Framing Problem ->
Creating Solutions -> Evaluating
Come up with a business that improves the world and makes money and uses data
Comments from the students:
- autonomous self-driving cars you order with a smart phone
- find all the info on people and then show them how to make it private
- social network with no logs and no data retention
10 Important Data Science Ideas
Of all the blog posts I wrote this semester, here’s one I think is important:
Confidence and Uncertainty
Let’s talk about confidence and uncertainty from a couple perspectives.
First, remember that statistical inference is extracting information from data, estimating, modeling, explaining but also quantifying uncertainty. Data Scientists could benefit from understanding this more. Learn more statistics and read Ben’s blog post on the subject.
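One concrete, assumption-light way to quantify uncertainty in an estimate is the bootstrap. Here’s a minimal sketch on made-up data: resample with replacement, recompute the statistic each time, and read a confidence interval off the resampling distribution.

```python
import numpy as np

# Bootstrap confidence interval for a mean, on a made-up skewed sample
rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=200)

# Resample the data with replacement many times and recompute the estimate
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(2000)]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {data.mean():.2f}, 95% CI ~ ({lo:.2f}, {hi:.2f})")
```

Reporting the interval rather than the bare point estimate is the habit being advocated here: an estimate without its uncertainty invites overconfidence.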
Second, we have the Dunning-Kruger Effect.
Have you ever wondered why people don’t say “I don’t know” when they don’t know something? This is partly explained by an unconscious bias called the Dunning-Kruger effect.
Basically, people who are bad at something have no idea that they are bad at it and overestimate their confidence. People who are super good at something underestimate their mastery of it. Actual competence may weaken self-confidence.
Design an app to combat the Dunning-Kruger effect.
Optimizing your life, Career Advice
What are you optimizing for? What do you value?
- money: you need some minimum to live at the standard of living you want, and you might even want a lot
- time with loved ones and friends
- doing good in the world
- personal fulfillment, intellectual fulfillment
- goals you want to reach or achieve
- being famous, respected, acknowledged
- some weighted function of all of the above. What are the weights?
What constraints are you under?
- external factors (factors outside of your control)
- your resources: money, time, obligations
- who you are, your education, strengths & weaknesses
- things you can or cannot change about yourself
There are many possible solutions that optimize what you value and take into account the constraints you’re under.
So what should you do with your life?
Remember that whatever you decide to do is not permanent, so don’t feel too anxious about it. You can always do something else later; people change jobs all the time.
But on the other hand, life is short, so always try to be moving in the right direction (optimizing for what you care about).
If you feel your way of thinking or perspective is somehow different from what those around you are thinking, then embrace and explore that; you might be onto something.
I’m always happy to talk to you about your individual case.
Next Gen Data Scientists
The second blog post I think is important is this “manifesto” that I wrote:
Next-Gen Data Scientists. That’s you! Go out and do awesome things, use data to solve problems, have integrity and humility.
Here’s our class photo!
I just got back from a stimulating trip to Stony Brook to give the math colloquium there. I had a great time thanks to my gracious host Jason Starr (this guy, not this guy), and besides giving my talk (which I will give again in San Diego at the joint meetings next month) I enjoyed two conversations about the field of math which I think could be turned into data science projects. Maybe Ph.D. theses or something.
First, a system for deciding whether a paper on the arXiv is “good.” I will post about that on another day because it’s actually pretty involved and possibly important.
Second is the way people hire in math departments. This conversation will generalize to other departments, some more than others.
So first of all, I want to think about how the hiring process actually works. There are people who look at folders of applicants, say for tenure-track jobs. Since math is a pretty disjointed field, a majority of the folders will only be understood well enough for evaluation purposes by a few people in the department.
So in other words, the department naturally splits into clusters more or less along field lines: there are the number theorists and then there are the algebraic geometers and then there are the low-dimensional topologists, say.
Each group of people reads the folders from the field or fields that they have enough expertise in to understand. Then from among those they choose some they want to go to bat for. It becomes a political battle, where each group tries to convince the other groups that their candidates are more qualified. But of course it’s really hard to know who’s telling the honest truth. There are probably lots of biases in play too, so people could be overstating their cases unconsciously.
Some potential problems with this system:
- if you are applying to a department where nobody is in your field, nobody will read your folder, and nobody will go to bat for you, even if you are really great. An exaggeration but kinda true.
- in order to be convincing that “your guy is the best applicant,” people use things like who the advisor is or which grad school this person went to more than the underlying mathematical content.
- if your department grows over time, this tends to mean that you get bigger clusters rather than more clusters. So if you never had a number theorist, you tend to never get one, even if you get more positions. This is a problem for grad students who want to become number theorists, but that probably isn’t enough to affect the politics of hiring.
So here’s my data science plan: test the above hypotheses. I stated them because I think they are probably true, but it would not be impossible to create the dataset to test them thoroughly and measure the effects.
The easiest and most direct one to test is the third: cluster departments by subject by linking people with their published or arXiv’ed papers. Watch the department change over time and see how the clusters change and grow versus how it might happen randomly. Easy peasy lemon squeezy if you have lots of data. Start collecting it now!
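A toy sketch of that clustering step, assuming you could build per-faculty vectors of subject fractions from arXiv metadata (every number below is invented):

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented faculty-by-subject matrix: each row is one person's fraction of
# papers in each arXiv subject class (columns: math.NT, math.AG, math.GT)
faculty = np.array([
    [0.9, 0.1, 0.0],   # mostly number theory
    [0.8, 0.2, 0.0],
    [0.1, 0.8, 0.1],   # mostly algebraic geometry
    [0.0, 0.9, 0.1],
    [0.1, 0.1, 0.8],   # a lone low-dimensional topologist
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(faculty)
print(km.labels_)  # rerun per hiring year and track how cluster sizes grow
```

Rerunning this on each year’s roster would show whether new hires enlarge existing clusters or create new ones, which is exactly the hypothesis to test.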
The first two are harder but could be related to the project of ranking papers. In other words, you have to define “is really great” to do this. It won’t mean you can say with confidence that X should have gotten a job at University Y, but it would mean you could say that if X’s subject isn’t represented in University Y’s clusters, then X’s chances of getting a job there, all other things being equal, is diminished by Z% on average. Something like that.
There are of course good things about the clustering. For example, it’s not that much fun to be the only person representing a field in your department. I’m not actually passing judgment on this fact, and I’m also not suggesting a way to avoid it (if it should be avoided).
This week’s guest lecturer in Rachel Schutt’s Columbia Data Science class was Claudia Perlich. Claudia has been the Chief Scientist at m6d for 3 years. Before that she was in the data analytics group at the IBM center that developed Watson, the computer that won Jeopardy!, although she didn’t work on that project. Claudia got her Ph.D. in information systems at NYU and now teaches a class to business students on data science, although mostly she addresses how to assess data science work and how to manage data scientists. Claudia also holds a master’s in computer science.
Claudia is a famously successful data mining competition winner. She won the KDD Cup in 2003, 2007, 2008, and 2009, the ILP Challenge in 2005, the INFORMS Challenge in 2008, and the Kaggle HIV competition in 2010.
She’s also been a data mining competition organizer, first for the INFORMS Challenge in 2009 and then for the Heritage Health Prize in 2011. Claudia claims to be retired from competition.
Claudia’s advice to young people: pick your advisor first, then choose the topic. It’s important to have great chemistry with your advisor; don’t underestimate its importance.
Here’s what Claudia historically does with her time:
- predictive modeling
- data mining competitions
- publications in conferences like KDD and journals
- digging around data (her favorite part)
Claudia likes to understand something about the world by looking directly at the data.
Here’s Claudia’s skill set:
- plenty of experience doing data stuff (15 years)
- data intuition (for which one needs to get to the bottom of the data generating process)
- dedication to the evaluation (one needs to cultivate a good sense of smell)
- model intuition (we use models to diagnose data)
Claudia also addressed being a woman. She says it works well in the data science field, where her intuition is useful and is used. She claims her nose is so well developed by now that she can smell it when something is wrong. This is not the same thing as being able to prove something algorithmically. Also, people typically remember her because she’s a woman, even when she doesn’t remember them. It has worked in her favor, she says, and she’s happy to admit it. But then again, she is where she is because she’s good.
Someone in the class asked if papers submitted to journals and/or conferences are blind to gender. Claudia responded that review was, for some time, typically double-blind, but now it’s more likely to be single-blind. And anyway there was a cool analysis that showed you can guess who wrote a paper with 80% accuracy just by knowing the citations. So making things blind doesn’t really help. More recently the names are included, and hopefully this doesn’t make things too biased. Claudia admits to being slightly biased towards institutions – certain institutions prepare better work.
Skills and daily life of a Chief Data Scientist
Claudia’s primary skills are as follows:
- Data manipulation: unix (sed, awk, etc), Perl, SQL
- Modeling: various methods (logistic regression, k-nearest neighbors, etc.)
- Setting things up
She mentions that the methods don’t matter as much as how you’ve set things up and how you’ve translated the problem into something you can solve.
More recently, she’s been told that at work she spends:
- 40% of time as “contributor”: doing stuff directly with data
- 40% of time as “ambassador”: writing stuff, giving talks, mostly external communication to represent m6d, and
- 20% of time in “leadership” of her data group
At IBM it was much more focused in the first category. Even so, she has a flexible schedule at m6d and is treated well.
The goals of the audience
She asked the class, why are you here? Do you want to:
- become a data scientist? (good career choice!)
- work with data scientist?
- work for a data scientist?
- manage a data scientist?
Most people were trying their hands at the first, but we had a few in each category.
She mentioned that it matters because the way she’d talk to people wanting to become a data scientist would be different from the way she’d talk to someone who wants to manage them. Her NYU class is more like how to manage one.
So, for example, you need to be able to evaluate their work. It’s one thing to check a bubble sort algorithm or check whether a SQL server is working, but checking a model which purports to give the probability of people converting is a different kettle of fish.
For example, try to answer this: how much better can that model get if you spend another week on it? Let’s face it, quality control is hard for yourself as a data miner, so it’s definitely hard for other people. There’s no easy answer.
There's an old joke that comes to mind: What's the difference between a scientist and a consultant? The scientist asks, how long will it take to get this right? whereas the consultant asks, how right can I get this in a week?
Insights into data
A student asks, how do you turn a data analysis into insights?
For example, people like decision trees because they're easy to interpret, but I'd ask: why does the tree look the way it does? A slightly different data set would give you a different tree, and you'd reach a different conclusion. This is the illusion of understanding. I tend to be careful about delivering strong insights in that sense.
Data mining competitions
Claudia drew a distinction between different types of data mining competitions.
On the one hand you have the "sterile" kind, where you're given a clean, prepared data matrix, a standard error measure, and where the features are often anonymized. This is a pure machine learning problem.
Examples of this first kind are KDD Cup 2009 and the Netflix Prize. In such competitions, your approach would emphasize algorithms and computation. The winner would probably have heavy machines and huge modeling ensembles.
On the other hand, you have the "real world" kind of data mining competition, where you're handed raw data, which is often in lots of different tables and not easily joined, where you set up the model yourself and come up with task-specific evaluations. This kind of competition simulates real life more.
Examples of this second kind are: KDD cup 2007, 2008, and 2010. If you’re competing in this kind of competition your approach would involve understanding the domain, analyzing the data, and building the model. The winner might be the person who best understands how to tailor the model to the actual question.
Claudia prefers the second kind, because it’s closer to what you do in real life. In particular, the same things go right or go wrong.
How to be a good modeler
Claudia claims that data and domain understanding is the single most important skill you need as a data scientist. At the same time, this can’t really be taught – it can only be cultivated.
A few lessons learned about data mining competitions that Claudia thinks are overlooked in academics:
- Leakage: the contestant's best friend and the organizer's/practitioner's worst nightmare. There's always something wrong with the data, and Claudia has made an art form of figuring out how the people preparing the competition got lazy or sloppy with the data.
- Adapting learning to real-life performance measures beyond standard measures like MSE, error rate, or AUC (for example, profit)
- Feature construction/transformation: real data is rarely flat (i.e. given to you as a beautiful matrix), and good, practical solutions to this problem remain a challenge.
Leakage refers to information in your data that helps you predict the target in a way that isn't fair, because it won't be available when the model is actually used. It's a huge problem in modeling, and not just for competitions. Oftentimes it's an artifact of reversing cause and effect.
Example 1: There was a competition where you needed to predict whether the S&P would go up or go down. The winning entry had an AUC (area under the ROC curve) of 0.999 out of 1.0. Since stock markets are pretty close to random, either someone's very rich or there's something wrong. There's something wrong.
In the good old days you could win competitions this way, by finding the leakage.
Example 2: Amazon case study: big spenders. The target of this competition was to predict which customers would spend a lot of money, using past purchases. The data consisted of transaction data in different categories. But a winning model identified that "Free Shipping = True" was an excellent predictor.
What happened here? The point is that free shipping is an effect of big spending. But it’s not a good way to model big spending, because in particular it doesn’t work for new customers or for the future. Note: timestamps are weak here. The data that included “Free Shipping = True” was simultaneous with the sale, which is a no-no. We need to only use data from beforehand to predict the future.
Example 3: Again an online retailer, this time the target is predicting customers who buy jewelry. The data consists of transactions for different categories. A very successful model simply noted that if sum(revenue) = 0, the customer was very likely to be a jewelry buyer.
What happened here? The people preparing this data removed jewelry purchases, but only included people who bought something in the first place. So people with sum(revenue) = 0 were people who bought only jewelry. The fact that you only got into the dataset if you bought something is weird: in particular, you wouldn't be able to use this model on customers before they finished their purchase. So the model wasn't trained on the right data to be useful. This is a sampling problem, and it's common.
Example 4: This happened at IBM. The target was to predict companies that would be willing to buy "websphere" solutions. The data was transaction data plus crawled potential company websites. The winning model showed that if the term "websphere" appeared on the company's website, then the company was a great candidate for the product.
What happened? You can't crawl the historical web, only today's web. The companies mentioning "websphere" on their sites had already bought it, so the model was predicting the past, not the future.
Example 5: You're trying to study who has breast cancer. The patient ID, which seemed innocent, actually has predictive power. What happened?
In the image shown in class, red means cancerous, green means not, plotted by patient ID. We see three or four distinct buckets of patient identifiers, and which bucket a patient falls into is very predictive. This is probably a consequence of merging multiple databases, some of which came from populations that are more likely to be sick.
A student suggests: for the purposes of the contest they should have renumbered the patients and randomized.
Claudia: would that solve the problem? There could be other things in common as well.
A student remarks: The important issue could be to see the extent to which we can figure out which dataset a given patient came from based on things besides their ID.
Claudia: Think about this: what do we want these models for in the first place? How well can you predict cancer?
Given a new patient, what would you do? If the new patient is in a fifth bin in terms of patient ID, then obviously don’t use the identifier model. But if it’s still in this scheme, then maybe that really is the best approach.
This discussion brings us back to the fundamental problem that we need to know what the purpose of the model is and how is it going to be used in order to decide how to do it and whether it’s working.
During an INFORMS competition on pneumonia predictions in hospital records, where the goal was to predict whether a patient has pneumonia, a logistic regression which included the number of diagnosis codes as a numeric feature (AUC of 0.80) didn't do as well as one which included it as a categorical feature (AUC of 0.90). What's going on?
This had to do with how the person prepared the data for the competition:
The diagnosis code for pneumonia was 486. So the preparer removed that code (replacing it with a "-1") wherever it showed up in the record (rows are different patients, columns are different diagnoses, there are at most 4 diagnoses, and "-1" means there's nothing for that entry).
Moreover, to avoid telltale holes in the data, the preparer shifted the other diagnoses to the left where necessary, so that only "-1"s were on the right.
There are two problems with this:
- If the row has only "-1"s, then you know it started out with only pneumonia, and
- If the row has no "-1"s, you know there's no pneumonia (unless there were actually 5 or more diagnoses, but that's less common).
This was enough information to win the competition.
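The pneumonia leak can be reproduced in miniature. This is a hypothetical reconstruction with made-up codes and rates, not the actual competition data; it just shows how the remove-and-shift preparation turns the padding itself into a predictor:

```python
import random

random.seed(0)

def prepare(codes):
    """Mimic the competition prep: drop code 486, shift left, pad with -1."""
    kept = [c for c in codes if c != 486]
    return kept + [-1] * (4 - len(kept))

patients = []
for _ in range(1000):
    n_codes = random.randint(1, 4)
    codes = random.sample([c for c in range(100, 999) if c != 486], n_codes)
    has_pneumonia = random.random() < 0.3
    if has_pneumonia:
        codes[0] = 486                    # pneumonia appears as one of the codes
    patients.append((prepare(codes), has_pneumonia))

# The leak: a row of all -1's can only come from a patient whose sole
# diagnosis was pneumonia, and a row with no -1's can never be pneumonia.
leaked_positive = [p for row, p in patients if row == [-1, -1, -1, -1]]
print(len(leaked_positive), all(leaked_positive))
```

An all minus-one row can only be produced when the sole diagnosis was removed, i.e. a pneumonia patient, which is exactly the leak described above.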
Note: winning a competition through leakage is easier than building good models. But even if you don't explicitly understand and game the leakage, your model will find and exploit it for you. Either way, leakage is a huge problem.
How to avoid leakage
Claudia’s advice to avoid this kind of problem:
- You need a strict temporal cutoff: remove all information from just prior to the event of interest (e.g. the patient admission) onward.
- There has to be a timestamp on every entry, and you need to keep them.
- Removing columns asks for trouble
- Removing rows can introduce inconsistencies with other tables, also causing trouble
- The best practice is to start from scratch with clean, raw data after careful consideration
- You need to know how the data was created! I only work with data I pulled and prepared myself (or maybe Ori).
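The temporal-cutoff rule above can be sketched in a few lines. The records, field names, and dates here are all invented for illustration:

```python
from datetime import datetime

admission = datetime(2012, 6, 1)   # the event of interest

records = [
    {"ts": datetime(2012, 5, 20), "feature": "er_visit"},
    {"ts": datetime(2012, 5, 31), "feature": "lab_test"},
    {"ts": datetime(2012, 6, 1),  "feature": "pneumonia_dx"},    # simultaneous: leaks
    {"ts": datetime(2012, 6, 3),  "feature": "discharge_note"},  # from the future: leaks
]

def apply_cutoff(rows, event_time):
    """Keep only information from strictly before the event of interest."""
    return [r for r in rows if r["ts"] < event_time]

usable = apply_cutoff(records, admission)
print([r["feature"] for r in usable])  # ['er_visit', 'lab_test']
```

Because every entry carries a timestamp, the cutoff is a one-line filter; without timestamps this cleanup is impossible, which is the point of the second bullet.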
How do I know that my model is any good?
With powerful algorithms searching for patterns, there is a serious danger of overfitting. It's a difficult concept, but the general idea is that "if you look hard enough you'll find something," even if it doesn't generalize beyond the particular training data.
To avoid overfitting, we cross-validate and we cut down on the complexity of the model to begin with. Here's a standard picture (although keep in mind we generally work in high-dimensional space and don't have a pretty picture to look at):
The picture on the left is underfit, in the middle is good, and on the right is overfit.
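The same picture can be produced numerically. This sketch uses polynomial fits to simulated data rather than any model from the lecture: too low a degree underfits, a reasonable degree generalizes, and a very high degree drives training error down while held-out error suffers:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = np.sin(3 * x) + rng.normal(0, 0.2, 200)      # true signal plus noise
x_tr, x_te = x[:100], x[100:]
y_tr, y_te = y[:100], y[100:]

def mse(degree):
    """Fit on the training half, report (train error, held-out error)."""
    coeffs = np.polyfit(x_tr, y_tr, degree)
    tr = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    te = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    return tr, te

for d in (1, 5, 20):
    tr, te = mse(d)
    print(f"degree {d:2d}: train {tr:.3f}  test {te:.3f}")
```

Train error always falls as complexity grows; the held-out column is the one that tells you when you've started fitting noise.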
The model you use matters when it concerns overfitting:
In the above example, unpruned decision trees are the most overfit. This is a well-known problem with unpruned decision trees, which is why people use pruned ones.
Claudia dismisses accuracy as a bad evaluation method. What's wrong with accuracy? It's inappropriate for regression obviously, but even for classification: if the vast majority of binary outcomes are 1, then a stupid model can be accurate but not good (always guess "1"), and a better model might have lower accuracy.
Probabilities matter, not 0's and 1's.
Nobody makes decisions on binary outcomes. I want to know the probability I have breast cancer, I don’t want to be told yes or no. It’s much more information. I care about probabilities.
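A small simulation makes the accuracy point concrete. The data and scores here are made up: with 95% positive outcomes, the always-predict-1 model is highly accurate but carries no information, while a noisy probability model has lower accuracy yet ranks cases far better (AUC computed via the rank-sum formula):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
y = (rng.random(n) < 0.95).astype(int)     # 95% of outcomes are positive
scores = 0.3 * y + rng.random(n)           # an informative but noisy model score

def auc(y_true, y_score):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formula."""
    order = np.argsort(y_score)
    ranks = np.empty(len(y_score))
    ranks[order] = np.arange(1, len(y_score) + 1)
    n_pos, n_neg = y_true.sum(), (1 - y_true).sum()
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

acc_dumb = np.mean(np.ones(n) == y)               # accuracy of "always predict 1"
acc_model = np.mean((scores > 0.65) == (y == 1))  # the model, crudely thresholded
auc_model = auc(y, scores)
print(round(acc_dumb, 3), round(acc_model, 3), round(auc_model, 3))
```

The dumb model wins on accuracy and loses on everything that matters, which is exactly why the lecture moves to ranking and calibration next.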
How to evaluate a probability model
We separately evaluate the ranking and the calibration. To evaluate the ranking, we use the ROC curve and calculate the area under it, which typically ranges from 0.5 to 1.0. This is independent of scaling and calibration. Here's an example of how to draw an ROC curve:
Sometimes to measure rankings, people draw the so-called lift curve:
The key here is that the lift is calculated with respect to a baseline. You draw it at a given point, say 10%, by imagining that the top 10% of people ranked by the model are shown ads, and seeing how many click versus if you showed ads to a random 10% of people. A lift of 3 means it's 3 times better.
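That lift calculation can be sketched on simulated clicks (all numbers here are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
score = rng.random(n)                                        # model score per person
click = (rng.random(n) < 0.02 + 0.18 * score).astype(int)    # better scores click more

def lift_at(frac, y, s):
    """Click rate among the top-scoring fraction, relative to the base rate."""
    k = int(frac * len(y))
    top = np.argsort(-s)[:k]              # the top-scoring fraction of people
    return y[top].mean() / y.mean()       # targeted rate vs. baseline rate

print(round(lift_at(0.10, click, score), 2))
```

By construction, lift at 100% is exactly 1: targeting everyone is the same as the baseline.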
How do you measure calibration? Are the probabilities accurate? If the model says I have a probability of 0.57 of having cancer, how do I know if it's really 0.57? We can't measure this directly for an individual. We can only bucket the predictions and then compare, in aggregate, the predictions in a bucket (say 0.50-0.55) to the actual outcome rate for that bucket.
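Here is that binning procedure as a sketch, using simulated outcomes and deliberately too-extreme predictions (everything here is made up):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
true_p = rng.uniform(0.2, 0.8, n)
outcome = (rng.random(n) < true_p).astype(int)
pred = np.clip(1.7 * (true_p - 0.5) + 0.5, 0, 1)   # predictions pushed to extremes

bins = np.linspace(0, 1, 11)                       # buckets 0.0-0.1, ..., 0.9-1.0
which = np.digitize(pred, bins) - 1
for b in range(10):
    mask = which == b
    if mask.sum() > 50:
        print(f"bucket {bins[b]:.1f}-{bins[b + 1]:.1f}: "
              f"predicted {pred[mask].mean():.2f}, actual {outcome[mask].mean():.2f}")
```

Low buckets come out with actual rates above the predictions and high buckets below them: the model is too extreme in both directions, which is the miscalibration pattern discussed for trees.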
For example, here’s what you get when your model is an unpruned decision tree, where the blue diamonds are buckets:
A good model would show buckets right along the x=y curve, but here we’re seeing that the predictions were much more extreme than the actual probabilities. Why does this pattern happen for decision trees?
Claudia says that this is because trees optimize purity: they seek out pockets that have only positives or negatives. Therefore their predictions are more extreme than reality. This is generally true of decision trees: they do not generally perform well with respect to calibration.
Logistic regression looks better when you test calibration, which is typical. In summary:
- Accuracy is almost never the right evaluation metric.
- Probabilities, not binary outcomes.
- Separate ranking from calibration.
- Ranking you can measure with nice pictures: ROC, lift
- Calibration is measured indirectly through binning.
- Some models are better than others when it comes to calibration.
- Calibration is sensitive to outliers.
- Measure what you want to be good at.
- Have a good baseline.
Choosing an algorithm
This is not a trivial question, and in particular small tests may steer you wrong, because the best algorithm can change as you increase the sample size: decision trees often perform very well, but only if there's enough data.
In general you need to choose your algorithm depending on the size and nature of your dataset, and you need to choose your evaluation method based partly on your data and partly on what you wish to be good at. Sum of squared error is the maximum likelihood loss function if your errors can be assumed to be normal; but if you want to estimate the median, then use absolute errors; and if you want to estimate a quantile, then minimize the weighted absolute error.
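These loss-function claims are easy to verify numerically. In this sketch (with made-up skewed data) we minimize each loss over a grid and recover, respectively, the mean, the median, and the 0.9 quantile:

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.exponential(scale=2.0, size=10001)    # skewed, so mean != median

grid = np.linspace(0, 10, 2001)
sse = [(np.mean((data - c) ** 2), c) for c in grid]      # squared error
sae = [(np.mean(np.abs(data - c)), c) for c in grid]     # absolute error

def pinball(c, q):
    """Weighted absolute ("pinball") error for quantile q."""
    r = data - c
    return np.mean(np.where(r >= 0, q * r, (q - 1) * r))

q90 = [(pinball(c, 0.9), c) for c in grid]

print("SSE minimizer:", min(sse)[1], "vs mean:", round(data.mean(), 3))
print("SAE minimizer:", min(sae)[1], "vs median:", round(np.median(data), 3))
print("pinball(0.9) minimizer:", min(q90)[1],
      "vs 0.9 quantile:", round(np.quantile(data, 0.9), 3))
```

On skewed data like this, the three minimizers land in quite different places, which is why the choice of loss is a modeling decision, not a technicality.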
We worked on predicting the number of ratings a movie will get in the next year, and we assumed a Poisson distribution. In this case our evaluation method doesn't involve minimizing the sum of squared errors, but rather something else we found in the literature specific to the Poisson distribution, which depends on its single parameter $latex \lambda$.
Charity direct mail campaign
Let’s put some of this together.
Say we want to raise money for a charity. If we send a letter to every person on the mailing list, we raise about $9,000. We'd like to save money and only send letters to people who are likely to give; only about 5% of people generally give. How can we do that?
If we use a (somewhat pruned, as is standard) decision tree, we get $0 profit: it never finds a leaf with majority positives.
If we use a neural network, we still make only $7,500, even if we only send a letter when we expect the return to be higher than the cost.
This looks unworkable. But if your model is better, it's not. A person makes two decisions here: first, they decide whether or not to give; then, they decide how much to give. Let's model those two decisions separately, using: $latex E[\text{amount} \mid x] = P(\text{give} \mid x) \cdot E[\text{amount} \mid \text{give}, x]$
Note we need the first model to be well-calibrated, because we really care about the actual number, not just the ranking. So we will try logistic regression for the first part. For the second part, we train only on the examples where a donation occurred.
Altogether this decomposed model makes a profit of $15,000. The decomposition made it easier for the model to pick up the signals. Note that with infinite data everything would have been fine and we wouldn't have needed to decompose, but you work with what you've got.
Moreover, you are multiplying errors above, which could be a problem if you have a reason to believe that those errors are correlated.
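A simulated sketch of the decomposition, with invented numbers and with the true generating functions standing in for the fitted stage-one and stage-two models:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
wealth = rng.random(n)
p_donate = 0.02 + 0.08 * wealth        # stage 1: probability of giving
amount = 20 + 80 * wealth              # stage 2: gift size if they give
cost = 0.80                            # cost of mailing one letter

expected = p_donate * amount - cost    # mail only when this is positive
mail = expected > 0
donated = rng.random(n) < p_donate
realized = np.where(mail & donated, amount, 0).sum() - cost * mail.sum()
mail_all = np.where(donated, amount, 0).sum() - cost * n
print(round(realized), round(mail_all))   # targeted mailing vs. mailing everyone
```

Multiplying the calibrated probability by the conditional amount gives a per-person expected profit, and skipping the people for whom it is negative beats blanket mailing.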
We are not meant to understand data. Data are outside of our sensory systems and there are very few people who have a near-sensory connection to numbers. We are instead meant to understand language.
We are not meant to understand uncertainty: we have all kinds of biases that prevent this from happening, and they are well-documented.
Modeling people in the future is intrinsically harder than figuring out how to label things that have already happened.
Even so we do our best, and this is through careful data generation, careful consideration of what our problem is, making sure we model it with data close to how it will be used, making sure we are optimizing to what we actually desire, and doing our homework in learning which algorithms fit which tasks.
I’m very happy to say I just signed a book contract with my co-author, Rachel Schutt, to publish a book with O’Reilly called Doing Data Science.
For those of you who’ve been reading along for free as I’ve been blogging it, there might not be a huge incentive to buy it, but I can promise you more and better math, more explicit usable formulas, some sample code, and an overall better and more thought-out narrative.
It’s supposed to be published in May with a possible early release coming up at the end of February, in time for the O’Reilly Strata Santa Clara conference, where Rachel will be speaking about it and about other stuff curriculum related. Hopefully people will pick it up in time to teach their data science courses in Fall 2013.
Speaking of Rachel, she’s also been selected to give a TedXWomen talk at Barnard on December 1st, which is super exciting. She’s talking about advocating for the social good using data. Unfortunately the event is invitation-only, otherwise I’d encourage you all to go and hear her words of wisdom. Update: word on the street is that it will be video-taped.
Ori got his Ph.D. in Biostatistics from UC Berkeley after working at a litigation consulting firm. He credits that job with teaching him to understand data through exposure to tons of different data sets: his work involved creating stories out of data so that experts could testify at trials, e.g. asbestos cases. In this way Ori developed his data intuition.
Ori worries that people ignore this necessary data intuition when they shove data into various algorithms. He thinks that when their method converges, they are convinced the results are therefore meaningful, but he’s here today to explain that we should be more thoughtful than that.
It's very important when estimating causal parameters, Ori says, to understand the data-generating distributions, and that involves gaining subject matter knowledge that allows you to understand whether your necessary assumptions are plausible.
Ori says the first step in a data analysis should always be to take a step back and figure out what you want to know, write that down, and then find and use the tools you’ve learned to answer those directly. Later of course you have to decide how close you came to answering your original questions.
Ori asks, how do you know if your data may be used to answer your question of interest? Sometimes people think that because they have data on a subject, they can answer any question about it.
Students had some ideas:
- You need coverage of your parameter space. For example, if you’re studying the relationship between household income and holidays but your data is from poor households, then you can’t extrapolate to rich people. (Ori: but you could ask a different question)
- Causal inference with no timestamps won’t work.
- You have to keep in mind what happened when the data was collected and how that process affected the data itself.
- Make sure you have the base case: compared to what? If you want to know how politicians are affected by lobbyists' money, you need to see how they behave in the presence of money and in the absence of money. People often forget the latter.
- Sometimes you’re trying to measure weekly effects but you only have monthly data. You end up using proxies. Ori: but it’s still good practice to ask the precise question that you want, then come back and see if you’ve answered it at the end. Sometimes you can even do a separate evaluation to see if something is a good proxy.
- Signal to noise ratio is something to worry about too: as you have more data, you can more precisely estimate a parameter. You’d think 10 observations about purchase behavior is not enough, but as you get more and more examples you can answer more difficult questions.
Ori explains confounders with a dating example
Frank has an important decision to make. He’s perusing a dating website and comes upon a very desirable woman – he wants her number. What should he write in his email to her? Should he tell her she is beautiful? How do you answer that with data?
You could have him select a bunch of beautiful women and, for half of them chosen at random, tell them they're beautiful. Being random allows us to assume that the two groups have similar distributions of various features (note that's an assumption).
Our real goal is to understand the future under two alternative realities, the treated and the untreated. When we randomize we are making the assumption that the treated and untreated populations are alike.
OK Cupid looked at this and concluded:
- It could say more about the person who says “beautiful” than the word itself. Maybe they are otherwise ridiculous and overly sappy?
- The recipients of emails containing the word “beautiful” might be special: for example, they might get tons of email, which would make it less likely for Frank to get any response at all.
- For that matter, people may be describing themselves as beautiful.
Ori points out that this fact, that she’s beautiful, affects two separate things:
- whether Frank uses the word “beautiful” or not in his email, and
- the outcome (i.e. whether Frank gets the phone number).
For this reason, the fact that she’s beautiful qualifies as a confounder. The treatment is Frank writing “beautiful” in his email.
Denote by $latex W$ the list of all potential confounders. Note it's an assumption that we've got all of them (and recall how unreasonable this seems to be in epidemiology research).
Denote by $latex A$ the treatment (so, Frank using the word "beautiful" in the email). We usually assume this to be binary (0/1).
Denote by $latex Y$ the binary (0/1) outcome (Frank getting the number).
We are forming the following causal graph:
In a causal graph, each arrow means that the ancestor is a cause of the descendent, where ancestor is the node the arrow is coming out of and the descendent is the node the arrow is going into (see this book for more).
In our example with Frank, the arrow from beauty means that the woman being beautiful is a cause of Frank writing "beautiful" in the message. Both the man writing "beautiful" and the woman being beautiful are direct causes of her probability of responding to the message.
Setting the problem up formally
The building blocks in understanding the above causal graph are:
- Ask question of interest.
- Make causal assumptions (denote these by ).
- Translate question into a formal quantity (denote this by ).
- Estimate quantity (denote this by ).
We need domain knowledge in general to do this. We also have to take a look at the data before setting this up, for example to make sure we can make the
Positivity Assumption. We need treatment (i.e. data) in all strata of the things we adjust for. So if we think gender is a confounder, we need to make sure we have data on women and on men. If we also adjust for age, we need data in all of the resulting bins.
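A quick way to check positivity on a hypothetical dataset (the variables and rates here are invented) is to count treated and untreated individuals in every stratum:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5000
gender = rng.choice(["F", "M"], n)
age_band = rng.choice(["<30", "30-50", ">50"], n)
# Simulated flaw: nobody over 50 ever receives the treatment.
treated = rng.random(n) < np.where(age_band == ">50", 0.0, 0.5)

violations = []
for g in ("F", "M"):
    for a in ("<30", "30-50", ">50"):
        cell = (gender == g) & (age_band == a)
        if treated[cell].sum() == 0 or (~treated[cell]).sum() == 0:
            violations.append((g, a))
print(violations)  # strata where the effect is simply not estimable
```

In the strata that come back as violations, no adjustment method can recover the effect; the data simply don't contain it.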
Asking causal questions
What is the effect of ___ on ___?
This is the natural form of a causal question. Here are some examples:
- What is the effect of advertising on customer behavior?
- What is the effect of beauty on getting a phone number?
- What is the effect of censoring on outcome? (censoring is when people drop out of a study)
- What is the effect of a drug on time until viral failure? And, the general case:
- What is the effect of treatment on outcome?
Look, estimating causal parameters is hard. In fact the effectiveness of advertising is almost always ignored because it's so hard to measure. Typically people choose metrics of success that are easy to estimate but don't measure what they actually want, and everyone makes decisions based on them anyway because it's easier. This results in people being rewarded for finding people online who would have converted anyway.
Accounting for the effect of interventions
Thinking about that, we should be concerned with the effect of interventions. What’s a model that can help us understand that effect?
A common approach is the (randomized) A/B test, which involves the assumption that two populations are equivalent. As long as that assumption is pretty good, which it usually is with enough data, then this is kind of the gold standard.
But A/B tests are not always possible (or they are too expensive to be practical). Often we need to instead estimate the effects in the natural environment, but then the problem is that the people in different groups are actually quite different from each other.
So, for example, you might find you showed ads to more people who are hot for the product anyway; it wouldn’t make sense to test the ad that way without adjustment.
The game is then defined: how do we adjust for this?
The ideal case
Similar to how we did this last week, we pretend for now that we have a “full” data set, which is to say we have god-like powers and we know what happened under treatment as well as what would have happened if we had not treated, as well as vice-versa, for every agent in the test.
Denote this full data set by $latex X = (W, A, Y_1, Y_0)$, where:
- $latex W$ denotes the baseline variables (attributes of the agent) as above,
- $latex A$ denotes the binary treatment as above,
- $latex Y_1$ denotes the binary outcome if treated, and
- $latex Y_0$ denotes the binary outcome if untreated.
As a baseline check: if we observed this full data structure, how would we measure the effect of A on Y? In that case we'd be all-powerful and we would just calculate the difference $latex E[Y_1] - E[Y_0]$.
Note that, since $latex Y_1$ and $latex Y_0$ are binary, each expected value is just the probability of a positive outcome with or without treatment. So in the case of advertising, the above is the change in conversion rate when you show someone an ad. You could also take the ratio of the two quantities, $latex E[Y_1] / E[Y_0]$:
This would be calculating how much more likely someone is to convert if they see an ad.
Note these are outcomes you can really do stuff with. If you know people convert at 30% versus 10% in the presence of an ad, that’s real information. Similarly if they convert 3 times more often.
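With simulated potential outcomes we can play god and compute both quantities directly; every number here is made up:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
w = rng.random(n)                                        # an attribute per person
y1 = (rng.random(n) < 0.10 + 0.20 * w).astype(int)       # outcome if shown the ad
y0 = (rng.random(n) < 0.10).astype(int)                  # outcome if not shown

difference = y1.mean() - y0.mean()   # extra conversions per person
ratio = y1.mean() / y0.mean()        # how many times likelier to convert
print(round(difference, 3), round(ratio, 2))
```

In real data we only ever observe one of the two columns per person, which is exactly the problem the rest of the lecture is about.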
In reality people use silly stuff like log odds ratios, which nobody understands or can interpret meaningfully.
The ideal case with functions
In reality we don't have god-like powers, and we have to make do. We will make a bunch of assumptions. First off, denote by $latex U$ the exogenous variables, i.e. stuff we're ignoring. Assume there are functions $latex f_W$, $latex f_A$, and $latex f_Y$ so that:
- $latex W = f_W(U_W)$, i.e. the attributes are just functions of some exogenous variables,
- $latex A = f_A(W, U_A)$, i.e. the treatment depends in a nice way on some exogenous variables as well as the attributes we know about living in $latex W$, and
- $latex Y = f_Y(A, W, U_Y)$, i.e. the outcome is just a function of the treatment, the attributes, and some exogenous variables.
Note the various $latex U$'s could contain confounders in the above notation. That's gonna change.
But we want to intervene on this causal graph as though it's the intervention we actually want to make, i.e. what's the effect of treatment $latex A$ on outcome $latex Y$?
Let's look at this from the point of view of the joint distribution $latex P(W, A, Y) = P(W) \, P(A \mid W) \, P(Y \mid A, W)$. These terms correspond to the following in our example:
- $latex P(W)$: the probability of a woman being beautiful,
- $latex P(A \mid W)$: the probability that Frank writes an email to her saying that she's beautiful, and
- $latex P(Y \mid A, W)$: the probability that Frank gets her phone number.
What we really care about, though, is the distribution under intervention: $latex P_a(W, Y) = P(W) \, P(Y \mid A = a, W)$,
i.e. the distribution we'd see if we set the treatment to $latex a$. To answer our question, we manipulate the value of $latex a$, first setting it to 1 and doing the calculation, then setting it to 0 and redoing the calculation.
We are making a "Consistency Assumption / SUTVA," which says the observed outcome agrees with the relevant potential outcome: $latex Y = Y_a$ whenever $latex A = a$.
We have also assumed that we have no unmeasured confounders, which can be expressed thus: $latex A \perp Y_a \mid W$.
We are also assuming positivity, which we discussed above.
Down to brass tacks
We only have half the information we need. We need to somehow map the stuff we have to the full data set as defined above. We make use of the following identity: $latex E[Y_a] = E_W\big[ E[Y \mid A = a, W] \big]$.
Recall we want to estimate $latex E[Y_1] - E[Y_0]$, which by the above can be rewritten $latex E_W[E[Y \mid A = 1, W]] - E_W[E[Y \mid A = 0, W]]$.
We’re going to discuss three methods to estimate this quantity, namely:
- MLE-based substitution estimator (MLE),
- Inverse probability estimators (IPTW),
- Double robust estimating equations (A-IPTW)
For the above models, it's useful to think of there being two machines, called $latex g$ and $latex Q$, which generate estimates of the probability of the treatment given the attributes (that's machine $latex g$, estimating $latex P(A|W)$) and the probability of the outcome given the treatment and the attributes (machine $latex Q$, estimating $latex P(Y|A,W)$).
IPTW
In this method, which is also called importance sampling, we weight individuals who were unlikely to be shown an ad more heavily than those who were likely. In other words, we up-sample in order to recover the intervention distribution and get an estimate of the actual effect.
To make sense of this, imagine that you’re doing a survey of people to see how they’ll vote, but you happen to do it at a soccer game where you know there are more young people than elderly people. You might want to up-sample the elderly population to make your estimate.
This method can be unstable if there are really small sub-populations that you’re up-sampling, since you’re essentially multiplying by a reciprocal.
The formula in IPTW looks like this:
$latex \hat{\Psi} = \frac{1}{n} \sum_i \left[ \frac{I(A_i = 1) \, Y_i}{g(1 \mid W_i)} - \frac{I(A_i = 0) \, Y_i}{g(0 \mid W_i)} \right]$
Note the formula depends on the $latex g$ machine, i.e. the machine that estimates the treatment probability from the attributes. The problem is that people get the $latex g$ machine wrong all the time, which makes this method fail.
In words, we take the sum of terms whose numerators are zero unless we have a treated individual with a positive outcome, and we weight them by the inverse of the probability of getting treated, so that each sub-population gets the same representation. We do the same for the untreated and take the difference.
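Here is a sketch of IPTW on simulated data where the true effect is known to be +0.1; for simplicity the g machine is taken to be the true treatment mechanism, a luxury real analyses never have:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200_000
w = rng.random(n)
g1 = 0.2 + 0.6 * w                     # true P(A=1 | W): confounded treatment
a = (rng.random(n) < g1).astype(int)
y = (rng.random(n) < 0.1 + 0.1 * a + 0.2 * w).astype(int)  # true effect is +0.1

naive = y[a == 1].mean() - y[a == 0].mean()   # confounded raw comparison
iptw = np.mean(a * y / g1) - np.mean((1 - a) * y / (1 - g1))
print(round(naive, 3), round(iptw, 3))
```

The raw comparison overstates the effect because high-W people are both more treated and more likely to convert; the inverse weights undo that imbalance.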
MLE-based substitution estimator
This method is based on the $latex Q$ machine, which as you recall estimates the probability of a positive outcome given the attributes and the treatment, i.e. the $latex P(Y|A,W)$ values.
This method is straightforward: shove everyone into the $latex Q$ machine, predict how the outcome would look under both the treatment and non-treatment conditions, and take the difference of the averages.
Note we don’t know anything about the underlying machine $latex Q$. It could be a logistic regression.
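A sketch of the substitution estimator on the same kind of simulated data; instead of a logistic regression, the Q machine here is a crude bin-by-bin empirical estimate, which keeps the example dependency-free:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 200_000
w = rng.random(n)
a = (rng.random(n) < 0.2 + 0.6 * w).astype(int)            # confounded treatment
y = (rng.random(n) < 0.1 + 0.1 * a + 0.2 * w).astype(int)  # true effect is +0.1

bins = np.digitize(w, np.linspace(0, 1, 21)) - 1   # 20 strata of W
q = np.zeros((2, 20))                              # q[t, b] ~ P(Y=1 | A=t, bin b)
for t in (0, 1):
    for b in range(20):
        cell = (a == t) & (bins == b)
        q[t, b] = y[cell].mean()

# Predict everyone under A=1 and under A=0, then average and take the difference.
substitution = np.mean(q[1, bins]) - np.mean(q[0, bins])
print(round(substitution, 3))
```

This is the identity $latex E_W[E[Y \mid A=a, W]]$ made literal: the outer average is over everyone's W, whatever treatment they actually received.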
Get ready to get worried: A-IPTW
What if our machines are broken? That’s when we bring in the big guns: double robust estimators.
They adjust for confounding through the two machines we have on hand, $latex g$ and $latex Q$, and one machine augments the other depending on how well it works: the estimate comes out right if at least one of the two machines is correct. The functional form looks like this:
$latex \hat{\Psi} = \frac{1}{n} \sum_i \left[ \frac{I(A_i = 1)}{g(1 \mid W_i)} \big( Y_i - Q(1, W_i) \big) + Q(1, W_i) \right] - \frac{1}{n} \sum_i \left[ \frac{I(A_i = 0)}{g(0 \mid W_i)} \big( Y_i - Q(0, W_i) \big) + Q(0, W_i) \right]$
Note: you are still screwed if both machines are broken. In some sense with a double robust estimator you’re hedging your bet.
“I’m glad you’re worried because I’m worried too.” – Ori
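The double robustness can be seen in a simulation: below, the Q machine is deliberately broken (it ignores W entirely), yet with a correct g machine the augmented estimator still recovers the true effect of +0.1. All data are simulated:

```python
import numpy as np

rng = np.random.default_rng(10)
n = 200_000
w = rng.random(n)
g1 = 0.2 + 0.6 * w                                  # correct g machine
a = (rng.random(n) < g1).astype(int)
y = (rng.random(n) < 0.1 + 0.1 * a + 0.2 * w).astype(int)

# Deliberately broken Q machine: ignores W, just uses each group's raw mean.
q1 = np.full(n, y[a == 1].mean())
q0 = np.full(n, y[a == 0].mean())

aiptw = (np.mean(a / g1 * (y - q1) + q1)
         - np.mean((1 - a) / (1 - g1) * (y - q0) + q0))
print(round(aiptw, 3))
```

The inverse-weighted residual term cancels the bias of the broken Q; swap in a broken g instead and a correct Q would rescue it the same way. If both are broken, you are, as Ori says, still screwed.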
Simulate and test
I've shown you three distinct methods that estimate effects in observational studies, but they often come up with different answers. We set up huge simulation studies with known functions, i.e. where we know the functional relationships between everything, and then tried to infer those relationships using the above three methods as well as a fourth method called TMLE (targeted maximum likelihood estimation).
As a side note, Ori encourages everyone to simulate data.
We wanted to know, which methods fail with respect to the assumptions? How well do the estimates work?
We started to see that IPTW performs very badly when you're adjusting for very small subpopulations. For example, we found an estimated probability of someone getting sick of 132. That's not between 0 and 1, which is not good. But people use these methods all the time.
Moreover, as things get more complicated with lots of nodes in our causal graph, calculating stuff over long periods of time, populations get sparser and sparser and it has an increasingly bad effect when you’re using IPTW. In certain situations your data is just not going to give you a sufficiently good answer.
Causal analysis in online display advertising
An overview of the process:
- We observe people taking actions (clicks, visits to websites, purchases, etc.).
- We use this observed data to build list of “prospects” (people with a liking for the brand).
- We subsequently observe the same user over the next few days.
- The user visits a site where a display ad spot exists and bid requests are made.
- An auction is held for display spot.
- If the auction is won, we display the ad.
- We observe the user’s actions after displaying the ad.
But here's the problem: we've introduced confounders. If you target people who convert at high rates anyway, it looks like you've done a good job. In other words, we are looking at the treated without looking at the untreated.
We’d like to ask the question, what’s the effect of display advertising on customer conversion?
As a practical concern, people don’t like to spend money on blank ads. So A/B tests are a hard sell.
We performed some what-if analysis predicated on the assumption that the group of users who see the ad is different. Our process was as follows:
- Select prospects that we got a bid request for on day 0
- Observe whether they were treated on day 1. For those treated, set $latex A = 1$; for those not treated, set $latex A = 0$. Collect attributes $latex W$.
- Create an outcome window consisting of the five days following treatment; observe whether the outcome event occurs (a visit to the website whose ad was shown).
- Estimate model parameters using the methods previously described (our three methods plus TMLE).
Here are some results:
Note the results vary depending on the method, and there's no way to know which method is working best. Moreover, this is when we've capped the size of the correction in the IPTW methods. If we don't, we see ridiculous results: