## Strata: one down, one to go

Yesterday I gave a talk called “Finance vs. Machine Learning” at Strata. It was meant to be a smack-down, but for whatever reason I couldn’t engage people to personify the two disciplines and have a wrestling match on stage. For the record, I offered to be on either side. Either they were afraid to hurt a girl or they were afraid to lose to a girl, you decide.

Unfortunately I never got to the main motivation for this talk, namely the realization I had a while ago that when machine learners talk about “ridge regression” or “Tikhonov regularization” or even “L2 regularization”, it comes down to the same thing quants describe as a very simple Bayesian prior: your coefficients shouldn’t be too large. I talked about this here.
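For anyone who wants to check that equivalence numerically, here is a minimal sketch on synthetic data. It assumes Gaussian noise with known standard deviation `sigma` and an independent zero-mean Gaussian prior with standard deviation `tau` on each coefficient; those names and all the numbers are made up for illustration. Under those assumptions the ridge penalty works out to `lam = sigma**2 / tau**2`.

```python
import numpy as np

# Synthetic data; every number here is an illustrative assumption.
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
sigma = 0.5   # noise standard deviation, assumed known
tau = 1.0     # prior standard deviation of each coefficient
y = X @ beta_true + sigma * rng.normal(size=n)

# Machine learning view: minimise ||y - X b||^2 + lam * ||b||^2.
lam = sigma**2 / tau**2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Quant view: posterior mean (also the posterior mode) of b under
# the prior b ~ N(0, tau^2 I) with noise ~ N(0, sigma^2 I).
posterior_precision = X.T @ X / sigma**2 + np.eye(p) / tau**2
beta_bayes = np.linalg.solve(posterior_precision, X.T @ y / sigma**2)

# The two estimates agree up to floating point error.
print(np.allclose(beta_ridge, beta_bayes))
```

A tighter prior (smaller `tau`) means a larger `lam` and more shrinkage, which is just the "coefficients shouldn’t be too large" belief stated in penalty form.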

What I *did* have time for: “causal modeling” in the finance-y sense (discussion of the finance vs. statistician definitions of causal here), exponential downweighting with a well-chosen decay, storytelling as part of feature selection, and always visualizing everything, in particular the evolution of a statistic rather than a snapshot of it.
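To make the downweighting idea concrete, here is a rough sketch of an exponentially downweighted running mean. It assumes that “causal” in the finance-y sense means each estimate uses only data available up to that moment, and it expresses the decay as a half-life; the function name `ewma_causal` and the `half_life` parameter are hypothetical, not taken from the talk.

```python
import numpy as np

def ewma_causal(x, half_life):
    """Exponentially downweighted running mean.

    The value at time t uses only x[0..t], never anything later.
    An observation half_life steps old gets half the weight of the newest one.
    """
    decay = 0.5 ** (1.0 / half_life)
    out = np.empty(len(x))
    weighted_sum = 0.0   # running sum of decay**age * value
    weight_total = 0.0   # running sum of decay**age
    for t, value in enumerate(x):
        weighted_sum = decay * weighted_sum + value
        weight_total = decay * weight_total + 1.0
        out[t] = weighted_sum / weight_total
    return out

# Made-up series with one spike; the result is a whole time series,
# so you can plot how the statistic evolves instead of one snapshot.
x = np.array([1.0, 2.0, 3.0, 10.0, 3.0, 2.0])
smoothed = ewma_causal(x, half_life=2.0)
print(smoothed)
```

Choosing the half-life is the “well-chosen decay” part: too short and the statistic chases noise, too long and it reacts slowly to real changes.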

They videotaped me but I don’t see it on the Strata website yet. I’ll update this post if it shows up.

This morning, at 9:35, I’ll be in a 10-minute keynote discussion with Julie Steele entitled “You Can’t Learn That in School”, which will be live streamed. It’s about whether data science can and should be taught in academia.

For those of you wondering why I haven’t blogged the Columbia Data Science class like I usually do on Thursdays, these talks are why. I’ll get to it soon, I promise! Last night’s talks by Mark Hansen, data vizzer extraordinaire, and Ian Wong, Inference Scientist from Square, were really awesome.

I’ve often wondered why ridge regression problems aren’t simplified by redefining the explanatory variables to make them more independent. If you are predicting how far a kid can throw a football from height and weight, the coefficients blow up because the explanatory variables are not independent. Weight could be redefined as “excess weight”, the residuals from regressing weight against height. Excess weight and height should be more independent and more useful for modeling and explaining how far kids can throw footballs.

Why doesn’t the “detect singularities, redefine explanatory variables to eliminate them” approach work with these problems?
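The “excess weight” idea is easy to try on synthetic data. The sketch below is only an illustration under made-up assumptions: the heights, weights, throwing distances and the helper `ols` are all invented, and plain least squares is used rather than ridge. What it shows is that residualizing weight on height removes the sample correlation between the two predictors, leaves the fitted values and the weight coefficient unchanged, and moves the part of the effect that weight shares with height onto the height coefficient.

```python
import numpy as np

# Synthetic kids; every number below is a made-up assumption.
rng = np.random.default_rng(1)
n = 500
height = rng.normal(150.0, 10.0, size=n)                       # cm
weight = 0.9 * height - 90.0 + rng.normal(0.0, 3.0, size=n)    # kg, tied to height
distance = 0.2 * height + 0.1 * weight + rng.normal(0.0, 2.0, size=n)

def ols(columns, y):
    """Least squares with an intercept; returns (coefficients, fitted values)."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, X @ coef

# "Excess weight": residuals from regressing weight on height.
_, weight_fit = ols([height], weight)
excess_weight = weight - weight_fit

coef_raw, pred_raw = ols([height, weight], distance)
coef_res, pred_res = ols([height, excess_weight], distance)

print("corr(height, weight):       ", np.corrcoef(height, weight)[0, 1])
print("corr(height, excess_weight):", np.corrcoef(height, excess_weight)[0, 1])
print("coefficients, raw:          ", coef_raw)
print("coefficients, residualized: ", coef_res)
print("same fitted values:         ", np.allclose(pred_raw, pred_res))
```

So in this plain least squares setting the redefinition is a change of coordinates rather than a new model: the predictions are identical and only the meaning of the height coefficient changes. With a ridge penalty the two parameterizations would generally give different fits, because the penalty is applied to different coefficients.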

I suspect one reason relates to explanation rather than prediction. You’ve chosen a fairly simple example. In more complicated ones, sure, you could use a principal components basis or something similar and get good prediction, but your coefficients would be unlikely to be interpretable or to provide the explanation you want from your explanatory (as opposed to predictor) variables. And even for prediction, such a model will only work for items which have the same correlation structure as the training set you used. But if all you want is prediction for items which have the same correlation structure as your training data, then there’s no need to do anything fancy: a straightforward regression does that just fine even with highly correlated predictors.

I’ll agree about the problems with interpretations of principal components and factor analysis.

I’m wondering if the existence of techniques to build complex models makes modelers lazy and sloppy. If so, are we missing insights that would come from working hard building simple interpretable models? Sometimes working hard might mean finding a nice small set of independent explanatory variables as in the football throwing example.

Maybe my BS detector has gone off so often working with social scientists and b-school academics that I’ve just become jaded about complex models.

I never really thought about regularization this way until you mentioned it. As it turns out, Wikipedia agrees with you (paragraph 2). http://en.wikipedia.org/wiki/Regularization_%28mathematics%29