## Machine learners are spoiled for data

I’ve been reading lots of machine learning books lately, and let me say, as a relative outsider coming from finance: machine learners sure are spoiled for data.

It’s like, they’ve built these fancy techniques and machines that take a huge amount of data and try to predict an outcome, and they always seem to start with about 50 possible signals and “learn” the right combination of a bunch of them to be better at predicting. It’s like that saying, “It is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.”

In finance, a quant gets maybe one or two or three time series, hopefully ones that haven’t been widely distributed so they may still have signal. The effect that this new data has on a quant is key: it’s exciting almost to the point of sexually arousing to get new data. That’s right, I said it, data is sexy! We caress the data, we kiss it and go to bed with it every night (well, the in-sample part of it anyway). In the end we have an intimate relationship with each and every time series in our model. In terms of quantity, however, maybe it’s daily (so business days, about 262 per year), for maybe 15 years, so altogether about 4000 data points. Not a lot to work with but we make do.

In particular, given 50 possible signals in a pile of new data, we would first look at each time series by plotting it, to be sure it’s not dirty. We’d plot the (in-sample) returns as a histogram to see what we’re dealing with. We’d regress each signal against the outcome, to see if anything contained signal. We’d draw lagged correlation graphs of each against the outcome. We’d draw cumulative pnl graphs over time using that univariate regression, one potential signal at a time.
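To make that routine concrete, here is a minimal sketch in plain Python of three of those steps for a single candidate signal: the univariate regression, the lagged correlations, and the cumulative pnl. Everything specific in it is an assumption for illustration, not a description of any real desk's process: the data is synthetic, the 3000-point in-sample split is hypothetical, the outcome is assumed to be already aligned so that `signal[t]` is known before `outcome[t]` is realized, and the function names are made up. The plotting steps are left out; in practice you would chart `lag_profile` and `pnl` rather than just compute them.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def ols_fit(x, y):
    """Univariate least-squares fit y ~ alpha + beta * x; returns (alpha, beta)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    beta = sxy / sxx
    return my - beta * mx, beta


def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lagged_correlation(signal, outcome, lag):
    """Correlation of signal[t - lag] with outcome[t], for lag >= 0."""
    if lag == 0:
        return correlation(signal, outcome)
    return correlation(signal[:-lag], outcome[lag:])


def cumulative_pnl(signal, outcome, alpha, beta):
    """Running sum of (predicted outcome * realized outcome), one bet per day."""
    total, path = 0.0, []
    for s, r in zip(signal, outcome):
        total += (alpha + beta * s) * r
        path.append(total)
    return path


# Synthetic stand-in for one candidate signal and the outcome (not real data).
rng = random.Random(0)
n = 4000
signal = [rng.gauss(0, 1) for _ in range(n)]
outcome = [0.1 * s + rng.gauss(0, 1) for s in signal]

# Hypothetical split: explore only the in-sample part, keep the rest untouched.
in_sample = slice(0, 3000)
x, y = signal[in_sample], outcome[in_sample]

alpha, beta = ols_fit(x, y)
lag_profile = {k: lagged_correlation(x, y, k) for k in range(6)}
pnl = cumulative_pnl(x, y, alpha, beta)
```

With 50 signals you would loop this over each one and look at the results signal by signal, which is exactly the slow, hands-on part.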

In other words, we’d explore the data in a careful, loving manner, signal by signal, without taking the data for granted, instead of stuffing the whole kit and caboodle into a lawnmower. It’s more work but it means we have a sense of what’s going into the model.

I’m wondering how powerful it would be to combine the two approaches.

I lol’ed.

I’ve had the same feeling about ML as well. In particular, R-SIG-Finance and similar forums are populated with people who seem to think that “machine learning fairy dust” (not my term but I love it) will yield up economic rents, even though everyone else has access to the same books and programming libraries.

So this is what I think will make a great data scientist: someone who understands what the real mathematical objects underlying the data are (or, said better, a lattice-sequence of increasing assumptions which generate more specific models) AND uses the appropriate ML technique to pick parameters.

From the title of this post I thought you meant “spoiled for data” in a good way. Have you loaded R’s “datasets” package? There is so much quantified information. It could be a great learning resource, if education were completely different. (Let’s say at one of the state math and science schools, kids could do social studies projects in R.)