How do you know when you’ve solved your data problem?

November 6, 2013

I’ve been really impressed by how consistently people have gone to read my post “K-Nearest Neighbors: dangerously simple,” which I wrote back in April. Here’s a timeline of hits on that post:

Stats for "K-Nearest Neighbors: dangerously simple." I've actually gotten more hits recently.

I think the interest in this post is that people like having myths debunked, and are particularly interested in hearing how even the simple things that they thought they understood are possibly wrong, or at least more complicated than they’d been assuming. Either that or it’s just got a real catchy name.

Anyway, since I’m still getting hits on that post, I’m also still getting comments, and just this morning I came across a new comment by someone who calls herself “travelingactuary”. Here it is:

My understanding is that CEOs hate technical details, but do like results. So, they wouldn’t care if you used K-Nearest Neighbors, neural nets, or one that you invented yourself, so long as it actually solved a business problem for them. I guess the problem everyone faces is, if the business problem remains, is it because the analysis was lacking or some other reason? If the business is ‘solved’ is it actually solved or did someone just get lucky? That being so, if the business actually needs the classifier to classify correctly, you better hire someone who knows what they’re doing, rather than hoping the software will do it for you.

Presumably you want to sell something to Monica, and the next n Monicas who show up. If your model finds a whole lot of big spenders who then don’t, your technophobe CEO is still liable to think there’s something wrong.

I think this comment brings up the right question, namely knowing when you’ve solved your data problem, with K-Nearest Neighbors or whichever algorithms you’ve chosen to use. Unfortunately, it’s not that easy.

Here’s the thing: it’s almost never possible to tell if a data problem is truly solved. I mean, it might be a business problem where you go from losing money to making money, and in that sense you could say it’s been “solved.” But in terms of modeling, it’s very rarely a binary thing.

Why do I say that? Because, at least in my experience, it’s rare that you could possibly hope for high accuracy when you model stuff, even if it’s a classification problem. Most of the time you’re trying to achieve something better than random, some kind of edge. Often an edge is enough, but it’s nearly impossible to know if you’ve gotten the biggest edge possible.

For example, say you’re binning people who come to your site into three equally sized groups, as “high spenders,” “medium spenders,” and “low spenders.” So if the model were random, you’d expect a third to be put into each group, and that someone who ends up as a big spender is equally likely to be in any of the three bins.

Next, say you make a model that’s better than random. How would you know that? You can measure that, for example, by comparing it to the random model, or in other words by seeing how much better you do than random. So if someone who ends up being a big spender is three times more likely to have been labeled a big spender than a low spender, and twice as likely to have been labeled a big spender as a medium spender, you know your model is “working.”

You’d use those numbers, 3x and 2x, as a way of measuring the edge your model is giving you. You might care about other related numbers more, like whether pegged low spenders are actually low spenders. It’s up to you to decide what it means that the model is working. But even when you’ve done that carefully, and set up a daily updated monitor, the model itself still might not be optimal, and you might still be losing money.
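To make those numbers concrete, here is a minimal Python sketch of one way to compute that kind of edge. The function name `edge_ratios`, the label strings, and the toy data are all made up for illustration; a purely random model would give ratios near 1.

```python
from collections import Counter

def edge_ratios(predicted, actual, target="high"):
    """Among people who actually turned out to be `target`, compare how often
    the model labeled them `target` versus each other label."""
    # Count the predicted labels, restricted to people whose true label is `target`.
    counts = Counter(p for p, a in zip(predicted, actual) if a == target)
    hits = counts[target]
    # Ratio of correct labels to each kind of wrong label.
    return {label: hits / n for label, n in counts.items()
            if label != target and n > 0}

# Made-up toy data: 11 people who ended up as big spenders,
# and the bin the (hypothetical) model had put each of them in.
predicted = ["high"] * 6 + ["medium"] * 3 + ["low"] * 2
actual = ["high"] * 11

ratios = edge_ratios(predicted, actual)
print(ratios)  # {'medium': 2.0, 'low': 3.0}
```

Calling the same function with `target="low"` would give the related check on whether pegged low spenders really are low spenders.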

In other words, you can be a bad modeler or a good modeler, and either way when you try to solve a specific problem you won’t really know if you did the best possible job you could have, or someone else could have with their different tools and talents.

Even so, there are standards that good modelers should follow. First and most importantly, you should always set up a model monitor to keep track of the quality of the model and see how it fares over time. Because why? Because second, you should always assume that, over time, your model will degrade, even if you are updating it regularly or even automatically. It’s of course good to know how crappy things are getting so you don’t have a false sense of accomplishment.
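What a model monitor looks like depends on the business, but a bare-bones version might be sketched like this in Python. It assumes you already record some daily quality number (called "edge" here), and the function name `degraded_days`, the window length, the tolerance, and the readings are all hypothetical choices for illustration.

```python
from statistics import mean

def degraded_days(daily_edge, baseline, window=3, tolerance=0.75):
    """Return the indices of days on which the trailing `window`-day average
    edge has fallen below `tolerance * baseline`."""
    flagged = []
    for day in range(window - 1, len(daily_edge)):
        recent = mean(daily_edge[day - window + 1 : day + 1])
        if recent < tolerance * baseline:
            flagged.append(day)
    return flagged

# Made-up daily readings for a model that started with a 3x edge and is drifting down.
daily_edge = [3.0, 3.0, 2.9, 2.6, 2.3, 2.1, 2.0, 1.9]

alerts = degraded_days(daily_edge, baseline=3.0)
print(alerts)  # [6, 7]
```

Averaging over a trailing window is just one way to keep a single noisy day from setting off the alarm; the point is only that the degradation gets noticed.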

Keep in mind that just because it’s getting worse doesn’t mean you can easily start over again and do better. But at least you can try, and you will know when it’s worth a try. So, that’s one thing that’s good about admitting your inability to finish anything.

On to the political aspect of this issue. If you work for a CEO who absolutely hates ambiguity – and CEOs are trained to hate ambiguity, as well as trained to never hesitate – and if that CEO wants more than anything to think their data problem has been “solved,” then you might be tempted to argue that you’ve done a phenomenal job just to make her happy. But if you’re honest, you won’t say that, because it ain’t true.

Ironically and for these reasons, some of the most honest data people end up looking like crappy scientists because they never claim to be finished doing their job.

Categories: data science, modeling
  1. November 6, 2013 at 10:48 am

    On the one hand, I’m glad to feel reassured that uncertainty about the goodness of a model is just a normal part of the [honest] modeler’s job. It seems that the Zen Buddhist practice of Shoshin (“beginner’s mind”), and the associated state of “not knowing”, is a valuable perspective for a data scientist.

    On the other hand, it’s depressing to consider that even a “good” model will not last (or at least not remain “good”) … except that it means there will always be plenty of [re]modeling work to be done.

    As for people enjoying the debunking of myths, and realizing that they – or someone else – have been wrong, one of my favorite interviews of all time is Kathryn Schulz, author of “Being Wrong”, interviewing Ira Glass: On Air and On Error: This American Life’s Ira Glass on Being Wrong


  2. mathematrucker
    November 6, 2013 at 3:10 pm

    23-year-old Ryan “Riess the Beast” solved his data problem in optimal fashion last night at the Rio’s Penn & Teller Theatre in Las Vegas when he defeated Jay Farber to become poker’s 2013 WSOP Main Event champion. With 6,352 entrants having each paid $10K to enter, Riess’s 1st place finish earned him a cool $8.4 million. Congrats Ryan!


    • mathematrucker
      November 6, 2013 at 4:38 pm

      I obviously used the term “optimal fashion” in the usual business sense above, defined just in terms of the bottom line. However, poker does serve as an example of a data problem with an optimal solution: optimal play is trivial given the specific 52-card permutation (shuffle) of the deck before each deal.


  3. Zathras
    November 6, 2013 at 4:39 pm

    “How do you know when you’ve solved your data problem?”

    It is instructive to start with a Dilbert answer: you’ve solved the data problem when you can convince your boss that the problem is solved. Anything more is working too hard.

    Okay, now that’s done, the question is: how does the boss know it is solved? It’s an interesting issue when the boss is non-technical and can’t independently verify your work. In this case, just about everything you say will have to be taken on faith. What kind of validation is there? In many cases, almost none. In other cases, once a solution is deployed and has to go “out-of-sample,” validation might be easier. It is really surprising, however, how little actual validation occurs, even if validation is easy to do. It is often the case that managers will be content with any answer and will never check whether the answer is actually correct. By then, the manager has moved on to the next problem.


  4. SamChevre
    November 6, 2013 at 4:40 pm

    I have proposed, only partly in jest, that the actuarial profession’s integrity-of-work and data-quality standards could be replaced by a simple 4-item questionnaire.

    1) Did the person you got your data from make it up?
    2) How can you tell?
    3) Did you make up your results?
    4) Would they be less informative if you had?


  5. November 7, 2013 at 4:41 am

    A few weeks ago, in my home town of Melbourne, Australia, I went to an R meetup where the speaker was one of the winners of Kaggle’s Heritage Health Prize. He had promised to divulge how he did this, and while he gave a few things away, I’m sure he’s still got a lot of secrets.
    Many of the people in the audience were modeling professionals, and after the presentation, the talk turned to people’s experience with clients, who were all business people. Surprising to me at the time, but less surprising by now, quite a few people who market themselves as predictive modelers said the clients who hire them actually want traditional – that is, inferential – statisticians, to build explainable models where every variable is statistically significant (there was some discussion that clients weren’t happy until they heard those words, which gave some sort of Royal approval).
    In these sorts of cases, there can be, at least some of the time, a possibility of finishing, if the question is posed correctly. The last condition is a pretty big “if”, and the trick of it is that clients in business are highly unlikely to be able to pose a correct question. So the double-edged sword of asking the right question is in the hands of the data analyst. The opportunity exists, though, to define the problem in a way that does have a finish line, even if a second finish line is agreed on for later.


  6. November 7, 2013 at 1:46 pm

    Great blog but a quick note: the link for the KNN post at the top is broken: it’s set up as if you want to edit the original post.


  7. elena
    December 27, 2013 at 12:46 pm

    Cathy, sorry for ignorance, could you explain why “your model will degrade, even if you are updating it regularly or even automatically” ?

