### Archive

Archive for July, 2012

## Statisticians aren’t the problem for data science. The real problem is too many posers

Crossposted on Naked Capitalism

Cosma Shalizi

I recently was hugely flattered by my friend Cosma Shalizi’s articulate argument against my position that data science distinguishes itself from statistics in various ways.

Cosma is a well-read, broadly educated guy, and a role model for what a statistician can be, not that every statistician lives up to his standard. I’ve enjoyed talking to him about data, big data, and working in industry, and I’ve blogged about his blogposts as well.

That’s not to say I agree with absolutely everything Cosma says in his post: in particular, there’s a difference between being a master at visualizations for the statistics audience and being able to put together a PowerPoint presentation for a board meeting, which some data scientists in the internet start-up scene definitely need to do (mostly this is a study in how to dumb stuff down without letting it become vapid, and in reading other people’s minds in advance to see what they find sexy).

And communication skills are a funny thing; in my experience, communicating with an academic or a quant is a different kettle of fish than communicating with the Head of Product. Each audience has its own dialect.

But I totally believe that any statistician who willingly gets a job entitled “Data Scientist” would be able to do these things; it’s a self-selection process after all.

Statistics and Data Science are on the same team

I think that casting statistics as the enemy of data science is a straw man play. The truth is, an earnest, well-trained and careful statistician in a data scientist role would adapt very quickly to it and flourish as well, if he or she could learn to stomach the business-speak and hype (which changes depending on the role, and for certain data science jobs is really not a big part of it, but for others may be).

It would be a petty argument indeed to try to make this into a real fight. As long as academic statisticians are willing to admit they don’t typically spend just as much time (which isn’t to say they never spend as much time) worrying about how long it will take to train a model as they do wondering about the exact conditions under which a paper will be published, and as long as data scientists admit that they mostly just redo linear regression in weirder and weirder ways, then there’s no need for a heated debate at all.

Let’s once and for all shake hands and agree that we’re here together, and it’s cool, and we each have something to learn from the other.

Posers

What I really want to rant about today though is something else, namely posers. There are far too many posers out there in the land of data scientists, and it’s getting to the point where I’m starting to regret throwing my hat into that ring.

Without naming names, I’d like to characterize problematic pseudo-mathematical behavior that I witness often enough that I’m consistently riled up. I’ll put aside hyped-up, bullshit publicity stunts and generalized political maneuvering because I believe that stuff speaks for itself.

My basic mathematical complaint is that it’s not enough to just know how to run a black box algorithm. You actually need to know how and why it works, so that when it doesn’t work, you can adjust. Let me explain this a bit by analogy with respect to the Rubik’s cube, which I taught my beloved math nerd high school students to solve using group theory just last week.

Rubiks

First we solved the “position problem” for the 3-by-3-by-3 cube using 3-cycles, and proved it worked, by exhibiting the group acting on the cube, understanding it as a subgroup of $S_8 \times S_{12},$ and thinking hard about things like the sign of basic actions to prove we’d thought of and resolved everything that could happen. We solved the “orientation problem” similarly, with 3-cycles.
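To make the sign argument a little more concrete, here is a minimal Python sketch (not the classroom material itself). It assumes the corners are labelled 0 to 7 and the edges 0 to 11, which is an arbitrary labelling chosen for illustration, and the helper names `permutation_sign` and `cycle` are hypothetical. It checks that a 3-cycle is an even permutation, while a quarter turn of a face, which 4-cycles four corners and four edges, is odd on each set and therefore even overall.

```python
def permutation_sign(perm):
    """Sign (+1 or -1) of a permutation, where perm[i] is the image of i."""
    seen = [False] * len(perm)
    sign = 1
    for start in range(len(perm)):
        if seen[start]:
            continue
        # Walk the cycle containing `start` and measure its length.
        length = 0
        j = start
        while not seen[j]:
            seen[j] = True
            j = perm[j]
            length += 1
        # A cycle of even length is an odd permutation.
        if length % 2 == 0:
            sign = -sign
    return sign


def cycle(n, elements):
    """Permutation of range(n) that cycles `elements` and fixes the rest."""
    perm = list(range(n))
    for a, b in zip(elements, elements[1:] + elements[:1]):
        perm[a] = b
    return perm


# Assumed labelling: 8 corners (0..7) and 12 edges (0..11).
corner_three_cycle = cycle(8, [0, 1, 2])
corner_quarter_turn = cycle(8, [0, 1, 2, 3])   # corners moved by one face turn
edge_quarter_turn = cycle(12, [0, 1, 2, 3])    # edges moved by the same turn

print(permutation_sign(corner_three_cycle))    # 3-cycle: even
print(permutation_sign(corner_quarter_turn))   # 4-cycle: odd
print(permutation_sign(corner_quarter_turn) * permutation_sign(edge_quarter_turn))
```

Since every face turn has the same sign on corners as on edges, a lone swap of two corners (odd on corners, even on edges) can never be reached by turning faces. That is the kind of fact the sign bookkeeping is there to prove.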

I did this three times, with the three classes, and each time a student would ask me if the algorithm is efficient. No, it’s not efficient, it takes about 4 minutes, and other people can solve it way faster, I’d explain. But the great thing about this algorithm is that it seamlessly generalizes to other problems. Using similar sign arguments and basic 3-cycle moves, you can solve the 7-by-7-by-7 (or any of them actually) and many other shaped Rubik’s-like puzzles as well, which none of the “efficient” algorithms can do.

Something I could have mentioned but didn’t is that the efficient algorithms are memorized by their users and are basically black-box algorithms. I don’t think people understand to any degree why they work. And when they are confronted with a new puzzle, some of those tricks generalize but not all of them, and they need new tricks to deal with centers that get scrambled with “invisible orientations”. And it’s not at all clear they can solve a tetrahedron puzzle, for example, with any success.

Back to data science. It’s a good thing that data algorithms are getting democratized, and I’m all for there being packages in R or Octave that let people run clustering algorithms or steepest descent.

But, contrary to the message sent by much of Andrew Ng’s class on machine learning, you actually do need to understand how to invert a matrix at some point in your life if you want to be a data scientist. And, I’d add, if you’re not smart enough to understand the underlying math, then you’re not smart enough to be a data scientist.

I’m not being a snob. I’m not saying this because I want people to work hard. It’s not a laziness thing, it’s a matter of knowing your shit and being for real. If your model fails, you want to be able to figure out why it failed. The only way to do that is to know how it works to begin with. Even if it worked in a given situation, when you train on slightly different data you might run into something that throws it for a loop, and you’d better be able to figure out what that is. That’s your job.
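Here is one small, hypothetical illustration of the kind of failure I mean, sketched in plain Python. It assumes a least-squares fit done through the normal equations, where you have to invert the matrix $X^T X$; the toy data and the helper names `gram_matrix` and `det2` are made up for this example. When one feature is an exact multiple of another, that matrix has determinant zero and cannot be inverted, no matter which button you press.

```python
def gram_matrix(rows):
    """Return X^T X for a design matrix given as a list of rows."""
    n_cols = len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) for j in range(n_cols)]
        for i in range(n_cols)
    ]


def det2(m):
    """Determinant of a 2-by-2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


# Hypothetical toy data with two features per row.
collinear = [(x, 2 * x) for x in range(1, 6)]    # second feature = 2 * first
independent = [(x, x * x) for x in range(1, 6)]  # second feature = first squared

det_collinear = det2(gram_matrix(collinear))
det_independent = det2(gram_matrix(independent))

print(det_collinear)    # zero: X^T X is singular, so it has no inverse
print(det_independent)  # nonzero: X^T X can be inverted
```

A package may hide this behind a warning, a crash, or wildly unstable coefficients. Someone who knows what is being inverted can recognize the symptom and drop or combine the redundant feature.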

As I see it, there are three problems with the democratization of algorithms:

1. As described already, it lets people who can load data and press a button describe themselves as data scientists.
2. It tempts companies to never hire anyone who actually knows how these things work, because they don’t see the point. This is a mistake, and could have dire consequences, both for the company and for the world, depending on how widely their crappy models get used.
3. Businesses might think they have awesome data scientists when they don’t. That’s not an easy problem to fix from the business side: posers can be fantastically successful exactly because non-data scientists who hire data scientists in business, i.e. business people, don’t know how to test for real understanding.

How do we purge the posers?

We need to come up with a plan to purge the posers: they are annoying, and they are giving data science a bad name.

One thing that will be helpful in this direction is Rachel Schutt’s Data Science class at Columbia next semester, which is going to be a much-needed bullshit free zone. Note there’s been a time change that hasn’t been reflected on the announcement yet, namely it’s going to be once a week, Wednesdays for three hours starting at 6:15pm. I’m looking forward to blogging on the contents of these lectures.

Categories: data science, rant

## Columbia Data Science Institute: it’s gonna happen

So Bloomberg finally got around to announcing the Columbia Data Science Institute is really going to happen. The details as we know them now:

1. It’ll be at the main campus, not Manhattanville.
2. It’ll hire 75 faculty over the next decade (specifically, 30 new faculty by launch in August 2016 and 75 by 2030, so actually more than a decade but who’s counting?).
3. It will contain a New Media Center, a Smart Cities Center, a Health Analytics Center, a Cybersecurity Center, and a Financial Analytics Center.
4. The city is pitching in $15 million whereas Columbia is ponying up $80 million.
5. Columbia Computer Science professor Kathy McKeown will be the Director and Civil Engineering professor Patricia Culligan will be the Institute’s Deputy Director.

## The douche burger, and putting a ruler to the dick.

I have been pretty hardcore and serious for a few weeks, and today I want to lighten it up for a change.

Douchery

For just $666 you can purchase a foie gras-stuffed Kobe patty covered in Gruyere cheese that’s been melted with champagne steam and topped with lobster, truffles, caviar, and a BBQ sauce made with Kopi Luwak coffee beans that have been pooped out by some sort of animal called the Asian palm civet. The whole thing is then served in a gold-leaf wrapper.

Two things I like about this article, first that it’s hilarious and over the top satire, which is always excellent, and second that the world is picking up on my idea of calling people douches when they get really into esoteric stuff. If you don’t believe me, read my previous post My friend the coffee douche. It’s one of my favorites.

Putting a ruler to the dick

Next, speaking of using language in a funny but pointed way, are you with me that “opening the kimono” is an offensive and sexist phrase? Well, how about we replace it with a better, more offensive, and more sexist phrase that’s even more fun to say, namely “putting a ruler to the dick”??

This was my friend Laura Strausfeld’s idea, and I love it. It’s gonna be the buzzword (buzzphrase) of the year, we just know it.

Here’s how it works in context:

guy A: “So do you think you’ll invest in those guys? They seemed really excited about that new technique they’ve developed!”

guy B: “I don’t know. They talked a big game, but until I can put a ruler to the dick I’m not putting my money there.”

Categories: musing

## Does mathematics have a place in higher education?

A recent New York Times Opinion piece (hat tip Wei Ho), Is Algebra Necessary?, argues for the abolishment of algebra as a requirement for college. It was written by Andrew Hacker, an emeritus professor of political science at Queens College, City University of New York. His concluding argument:

I’ve observed a host of high school and college classes, from Michigan to Mississippi, and have been impressed by conscientious teaching and dutiful students.
I’ll grant that with an outpouring of resources, we could reclaim many dropouts and help them get through quadratic equations. But that would misuse teaching talent and student effort. It would be far better to reduce, not expand, the mathematics we ask young people to imbibe. (That said, I do not advocate vocational tracks for students considered, almost always unfairly, as less studious.)

Yes, young people should learn to read and write and do long division, whether they want to or not. But there is no reason to force them to grasp vectorial angles and discontinuous functions. Think of math as a huge boulder we make everyone pull, without assessing what all this pain achieves. So why require it, without alternatives or exceptions? Thus far I haven’t found a compelling answer.

For an interesting contrast, there’s a recent Bloomberg View Piece, How Recession Will Change University Financing, by Gary Shilling (not to be confused with Robert Shiller). From Shilling’s piece:

Most thought that a bachelor’s degree was the ticket to a well-paid job, and that the heavy student loans were worth it and manageable. And many thought that majors such as social science, education, criminal justice or humanities would still get them jobs. They didn’t realize that the jobs that could be obtained with such credentials were the nice-to-have but nonessential positions of the boom years that would disappear when times got tough and businesses slashed costs.

Some of those recent graduates probably didn’t want to do, or were intellectually incapable of doing, the hard work required to major in science and engineering. After all, afternoon labs cut into athletic pursuits and social time. Yet that’s where the jobs are now. Many U.S.-based companies are moving their research-and-development operations offshore because of the lack of scientists and engineers in this country, either native or foreign-born.
For 34- to 49-year-olds, student debt has leaped 40 percent in the past three years, more than for any other age group. Many of those debtors were unemployed and succumbed to for-profit school ads that promised high-paying jobs for graduates. But those jobs seldom materialized, while the student debt remained.

Moreover, many college graduates are ill-prepared for almost any job. A study by the Pew Charitable Trusts examined the abilities of U.S. college graduates in three areas: analyzing news stories, understanding documents and possessing the math proficiency to handle tasks such as balancing a checkbook or tipping in a restaurant.

The first article is written by a professor, so it might not be surprising that, as he sees more and more students coming through, he feels their pain and wants their experience to not be excruciating. The easiest way to do that is to remove the stumbling block requirement of math. He also seems to think of higher education as something everyone is entitled to, which I infer based on how he dismisses vocational training.

The second article is written by a financial analyst, an economist, so we might not be surprised that he strictly sees college as a purely commoditized investment in future income, and wants it to be a good one. The easiest way to do that is to have way fewer students go through college to begin with, since having dumb or bad students get into debt but not learn anything and then not get a job afterwards doesn’t actually make sense.

And where the first author acts like math is only needed for a tiny minority of college students, the second author basically dismisses non-math oriented subjects as frivolous and leading to a life of joblessness and debt.

These are vastly different viewpoints. I’m thinking of inviting them both to dinner to discuss.

By the way, I think that last line, where Hacker wonders what the pain of math-as-huge-boulder achieves, is more or less answered by Shilling.
The goal of having math requirements is to have students be mathematically literate, which is to say know how to do everyday things like balancing checkbooks and reading credit card interest rate agreements. The fact that we aren’t achieving this goal is important, but the goal is pretty clear.

In other words, I think my dinner party would be fruitful as well as entertaining.

If there’s one thing these two agree on, it’s that students are having an awful lot of trouble doing basic math. This makes me wonder a few things.

First, why is algebra such a stumbling block? Is it that the students are really that bad, or is the approach to teaching it bad? I suspect what’s really going on is that the students taking it have mostly not been adequately taught the pre-requisites.

That means we need more remedial college math. I honestly feel like this is the perfect place for online learning. Instead of charging students enormous fees while they get taught high-school courses they should already know, and instead of removing basic literacy requirements altogether, ask them to complete some free online math courses at home or in their public library, to get them ready for college. The great thing about computers is that they can figure out the level of the user, and they never get impatient.

Next, should algebra be replaced by a Reckoning 101 course? Where, instead of manipulating formulas, we teach students to figure out tips and analyze news stories and understand basic statistical statements? I’m sure this has been tried, and I’m sure it’s easy to do badly or to water down entirely. Please tell me what you know. Specifically, are students better at snarky polling questions if they’ve taken these classes than if they’ve taken algebra?

Finally, I’d say this (and I’m stealing this from my friend Kiri, a principal of a high school for girls in math and science): nobody ever brags about not knowing how to read, but people brag all the time about not knowing how to do math.
There’s nothing to be proud of in that, and it’s happening to a large degree because of our culture, not intelligence.

So no, let’s not remove mathematical literacy as a requirement for college graduates, but let’s think about what we can do to make the path reasonable and relevant while staying rigorous.

And yes, there are probably too many students going to college because it’s now a cultural assumption rather than a thought-out decision, and this lands young people in debt up to their eyeballs and jobless, which sucks (here’s something that may help: forcing for-profit institutions to be honest in advertising future jobs promises and high interest debt).

Something just occurred to me. Namely, it’s especially ironic that the most mathematically illiterate and vulnerable students are being asked to sign loan contracts that they, almost by construction, don’t understand. How do we address this? Food for thought and for another post.

## Income distributions and misleading poll questions (#OWS)

Disingenuous, pseudo-quantitative arguments piss me off.

In this recent Bloomberg View article entitled “Making the rich poorer doesn’t enrich the middle class,” Caroline Baum argues that middle class people would rather get more money than take away money from rich people. From the article:

Polling by the Pew Research Center shows that people aren’t interested in taking money from the wealthy. They just want a chance to get rich themselves.

But that’s a misleading question. It seems like a zero sum game when you put it that way, equivalent to something like, “Would you rather gain $100 or have a rich person somewhere lose $100?”.

But if you pose the question differently, and more in line with actual numbers, not to mention contextualized to reality in other ways, then you’d probably get the opposite.

Let’s take a look at wealth distribution from 2007, which I got here:

Let’s just say we’re being extreme and we take away all the wealth of the top 1% and give it to everybody equally (say we even give back some of it to those top 1%). That would mean that the 34.6% of overall wealth held by the top 1% gets flattened out into 100 pots instead of one, which means that each percentile gets about 0.35 percentage points of overall wealth more than it used to have. The middle 20% would grow from 4% of the overall wealth to (4 + 20*0.35)% = 11%. That’s still a lot less than 20%, but the wealth of the middle 20% is still nearly tripled by just this one percent re-distributing.
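For anyone who wants to check the arithmetic, here is a minimal Python sketch of that thought experiment. The two input shares (34.6% for the top 1% and 4% for the middle 20%) are the 2007 figures used in this post; the variable names are mine.

```python
# Shares of overall wealth in percent, as quoted in the post (2007 data).
top_one_percent_share = 34.6
middle_quintile_share = 4.0

# Spread the top 1%'s wealth evenly over all 100 percentiles.
gain_per_percentile = top_one_percent_share / 100

# The middle 20% spans 20 percentiles, so it collects 20 of those pots.
new_middle_share = middle_quintile_share + 20 * gain_per_percentile
growth_factor = new_middle_share / middle_quintile_share

print(f"gain per percentile: {gain_per_percentile:.3f} points")
print(f"new middle-20% share: {new_middle_share:.2f}%")
print(f"growth factor: {growth_factor:.2f}x")
```

Without rounding 0.346 up to 0.35, the middle 20% ends up with about 10.9% of overall wealth, roughly 2.7 times what it started with.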

Said another way, it’s not tit-for-tat at all.

If we asked someone in the middle class which they want more, a 1% increase in their wealth or a top 1%’er to lose 1% of their wealth, then that might be very different. Consider the political influence that 1% represents, at the very least. Consider the fact that 1% of the wealth of a person in the middle 20% is 173 times smaller than 1% of the wealth of someone in the top 1%.
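The factor of 173 comes from comparing wealth per percentile, assuming the same 2007 shares as above (34.6% of overall wealth held by one percentile at the top, 4% held by the twenty percentiles in the middle):

$$\frac{34.6\% / 1}{4\% / 20} = \frac{34.6}{0.2} = 173$$

The same ratio holds for any fixed fraction of each person’s wealth, so it applies to the 1% slices being compared here.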

It’s still not fair, though, because the middle class is so squeezed on necessities like food, housing, education, medical expenses, and child care, that they can’t afford even a 1% loss. What if you took those out?

If you go even further and ask someone in the middle class which they want more, a 1% increase in their discretionary income or a top 1%’er to lose 1% of their discretionary income, then that might be very different still. I haven’t been able to find a similar graphic to work with to see the discretionary income distribution, but rest assured it’s even more unbalanced.

Caroline Baum, would you care to cover those questions on your next poll to the middle class?

Categories: #OWS, news, rant

## Why is LIBOR such a big deal? (#OWS)

The manipulation of LIBOR interest rates by the big, mostly-European banks (but not entirely, see a full list here) was an open secret inside finance in 2008. As in so open that I didn’t think of it as a secret at all.

The fact that that manipulation is now consistently creating huge headlines is interesting to me – it brings up a few issues.

1. People seem surprised this out-and-out manipulation was happening. That says to me that they clearly still don’t understand what the culture of finance is really like. The fact that Bob Diamond of Barclays claims to have felt “physically ill” when he saw the emails of the traders manipulating LIBOR is either an out-and-out lie or the guy is simple-minded, as in stupid. And word on the street is he’s not stupid.
2. People still buy the line that most of the problems from the credit crisis arose from legal but wrong-headed efforts to make money, plus corrupt ratings on mortgage-backed securities. This is incredible to me. Let’s get it clear: the culture of finance is to take advantage of every opportunity to juice your bottom line, even if it’s wrong, even if it’s fraudulent, even if it affects the terms of loans on millions of houses and towns in other countries, and even if only your trading desk is benefiting.
3. The LIBOR manipulation in 2008 was about more than that, namely trying not to look as bad as other banks, to avoid being the next Lehman. It was done in the name of not looking weak and requiring a government bailout. Bob Diamond still doesn’t think they did anything wrong by lying there. It was almost like they were doing something noble.
4. Speaking of towns in other countries, read this article about how LIBOR manipulation has screwed U.S. cities to the ground. I’ve got a lot more to say about municipal debt and how that sleazy system works but it’s waiting for another post.
5. Finally, why did it take so long for the media to pick up on LIBOR manipulation? It tempts me to make a list of the illegal stuff that we all knew about back then and send it around just to make sure.

Categories: #OWS, finance, news

## Is open data a good thing?

As much as I like the idea of data being open and free, it’s not an open and shut case. As it were.

I’m first going to argue against open data with three examples.

The first is a pretty commonly discussed concern of privacy. Simply put, there is no such thing as anonymized data, and people who say there is are either lying or being naive. The amount of information you’d need to remove to really anonymize data is not known to be different from the amount of data you have in the first place. So if you did a good job to anonymize a data set, you’d probably remove all interesting information anyway. Of course, you could think this is only important with respect to individual data.

But my next example comes from land data, specifically Tamil Nadu in Southern India. There’s an interesting Crooked Timber blogpost here (hat tip Suresh Naidu) explaining how “open data” has screwed a local population, the Dalits. Although you could (and I would) argue that the way the data is collected and disseminated, and the fact that the courts go along with this process, is itself politically motivated and disenfranchising, there are some important points made in this post:

Open data undermines the power of those who benefit from “the idiosyncracies and complexities of communities… Local residents [who] understand the complexity of their community due to prolonged exposure.” The Bhoomi land records program is an example of this: it explicitly devalues informal knowledge of particular places and histories, making it legally irrelevant; in the brave new world of open data such knowledge is trumped by the ability to make effective queries of the “open” land records. The valuing of technological facility over idiosyncratic and informal knowledge is baked right in to open data efforts.

The Crooked Timber blog post specifically called out Tim O’Reilly and his “Government as Platform” project as troublesome:

The faith in markets sometimes goes further among open data advocates. It’s not just that open data can create new markets, there is a substantial portion of the push for open data that is explicitly seeking to create new markets as an alternative to providing government services.

It’s interesting to see O’Reilly’s Mike Loukides’s reaction (hat tip Chris Wiggins), entitled the Dark Side of Data, here. From Loukides:

The issue is how data is used. If the wealthy can manipulate legislators to wipe out generations of records and folk knowledge as “inaccurate,” then there’s a problem. A group like DataKind could go in and figure out a way to codify that older generation of knowledge. Then at least, if that isn’t acceptable to the government, it would be clear that the problem lies in political manipulation, not in the data itself. And note that a government could wipe out generations of “inaccurate records” without any requirement that the new records be open. In years past the monied classes would have just taken what they wanted, with the government’s support. The availability of open data gives a plausible pretext, but it’s certainly not a prerequisite (nor should it be blamed) for manipulation by the 0.1%.

[Speaking of DataKind (formerly Data Without Borders), it’s also a problem, as I discovered as a data ambassador working with the NYCLU on Stop, Question and Frisk data, when the government claims to be open but withholds essential data such as crime reports.]

My final example comes from finance. On the one hand I want total transparency of the markets, because it sickens me to think about how nobody knows the actual price of bonds, or the correct interest rate, or the current default assumption of the market, how all of that stuff is being kept secret by Wall Street insiders so they can each skim off their little cut and the dumb money players get constantly screwed.

But on the other hand, if I imagine a world where everything really is transparent, then even in the best of all database situations, that’s just asstons of data which only the very very richest and most technologically savvy high finance types could ever munge through.

So who would benefit? I’d say, for some time, the average dumb money customer would benefit very slightly, by not paying extra fees, but that the edgy techno finance firms would benefit fantastically. Then, I imagine, new ways would be invented for the dumb money customers to lose that small amount of benefit altogether, probably by just inundating them with so much data they can’t absorb it.

In other words, open data is great for the people who have the tools to use it for their benefit, usually to exploit other people and opportunities. It’s not clearly great for people who don’t have those tools.

But before I conclude that data shouldn’t be open, let me strike an optimistic (for me) tone.

The tools for the rest of us are being built right now. I’m not saying that the non-exploiters will ever catch up with the Goldman Sachs and credit card companies, because probably not.

But there will be real tools (there already are things like Python and R, and they’re getting better every day), built out of the open software movement, that will help specific people analyze and understand specific things, and there are platforms like WordPress and Twitter that will allow those things to be broadcast, which will have real impact when the truth gets out. An example is the Crooked Timber blog post above.

So yes, open data is not an unalloyed good. It needs to be a war waged by people with common sense and decency against those who would only use it for profit and exploitation. I can’t think of a better thing to do with my free time.