Data audits and data strategies

March 15, 2013 Cathy O'Neil, mathbabe

There are lots of start-up companies out there that want to have a data team, because they heard somewhere that they should leverage big data, but they don’t know what it really means, what they can expect from such a team, or how to get started. They also don’t really know how to hire qualified people, or what qualifications to look for.

Finally, they often don’t know what kinds of questions are answerable through data, nor what data they should be collecting to answer those questions. So even if they did manage to hire a data scientist or a data team, those guys might be literally sitting on their hands for six months until they have enough data to start work.

It’s a common situation and could end up being a big waste of time and money. What these companies need is something I like to call a “data audit” followed by a “data strategy”.

Data Audit

First things first. Do you actually need a data team? Is your company a data science company or is it a traditional-style company that happens to collect data? It would be a waste of resources to form a data team you don’t need. There’s no reason every single company needs to consider itself part of the big data revolution just to be cool.

Here’s how you tell. Let’s say that, as of now, you’re using incoming data to monitor and report on what’s happening with the business and to keep tabs on various indicators to make sure things aren’t going to hell. Absolutely every company should do this, but it honestly could be set up by a good data analyst working closely with the end-users, i.e. the business peeps.

What are the high-level goals of using data in the business? In particular, if you could really know how customers or clients were interacting with your product, would you change the product in response to the data? That feedback loop is the hallmark of a true data science engine (versus data analytics).

Here are some extreme examples to give you an idea of what I’m talking about. If you make shoes, then you need data to see how sales are and which shoes are getting sold faster so you can kick up production in certain areas. You need to see how sales are seasonal so you know to stop making quite so many shoes at a certain point in the deep of winter. But that’s about it, and you should be able to make do with data analysis.

If, on the other hand, you are building a recommendation engine, say for music, then you need to constantly refresh and improve your recommendation model. Your model is your product, and you need a data team.

Not all examples are this easy. Sometimes you can use new kinds of data models to improve your product even if it seems somewhat traditional, depending on how much data you are able to collect about how your clients use your product. It all depends on what kinds of questions you are asking and what data you have access to. Of course, you might want to go out and collect data that you hadn’t bothered to collect before, which could bring you from the first category to the second.

Say you decide you really are a data science company, or want to be one. What’s next?

Pose a bunch of questions you think you’ll need to answer and a bunch of data you think should be useful to answer them.

The heart of a data audit is a (preliminary) plan for choosing, collecting, and storing data, as well as figuring out the initial shape of the data pipeline and infrastructure. Do you store data in the cloud? Is it unstructured or do you set up some overnight jobs to put stuff into some type of database? Do you aggregate data and throw some stuff away, or do you keep absolutely everything?

The most important issue above is whether you’re collecting enough data. Truth be told, you could probably throw it all into an unstructured pile on S3 for now and figure out pipelines later. It might not be the best way to do it but if you are short on time and attention, it’s possible, and storage is cheap. But make sure you’re collecting the right stuff!

You’d be surprised how many startups want to ask good questions about their customers to improve their product, and have gone to some trouble to figure out what those questions are, but don’t bother to collect the relevant information. They might do things like count the number of users, or collect a timestamp for whenever a user logs in, but they don’t actually keep track of the interaction. It’s essential that you collect pertinent information if you want to use this data to check things are working or to predict people’s desires or needs.

So if you think customers might all be ditching your site at critical moments, then definitely tag their departure as well as their arrival, and keep track of where they were and what they were doing when they bailed.
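To make the "tag their departure as well as their arrival" idea concrete, here is a minimal sketch in Python of what one interaction record might look like. The field names, the event kinds, and the one-JSON-object-per-line file are all assumptions made up for illustration, not a recommendation of a particular logging system.

```python
import json
import time
from dataclasses import dataclass, asdict, field
from typing import Optional


@dataclass
class Event:
    """One interaction record. Every field name here is hypothetical."""
    session_id: str                 # opaque session token, not a user identity
    kind: str                       # e.g. "arrive", "view", "click", "depart"
    page: str                       # where the visitor was when it happened
    detail: Optional[str] = None    # what they were doing, if known
    ts: float = field(default_factory=time.time)  # seconds since the epoch


def encode_event(event: Event) -> str:
    """Serialize an event as a single JSON line."""
    return json.dumps(asdict(event))


def append_event(path: str, event: Event) -> None:
    """Append the event to a newline-delimited JSON file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(encode_event(event) + "\n")
```

The point of the sketch is only that a departure is recorded with the same care as an arrival, along with the page and the activity, so the question "where were they when they bailed?" can be answered later.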

Note I’m not necessarily being creepy here. You definitely want to know how people interact with your product and your site, and it doesn’t need to be personal information you’re collecting about your users. It could be kept aggregate. You could find out that 45% of people leave your site when you ask them for their phone number, and then you might decide it’s not worth it to do that.
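As a rough sketch of how an aggregate number like that could be computed, the Python below takes a time-ordered stream of (session, step) pairs and reports, for each step, the fraction of sessions that reached it and never went further. The function name, the step names, and the input shape are assumptions for illustration only.

```python
from collections import Counter


def exit_rates(events):
    """Return {step: share of sessions reaching that step whose last step it was}.

    `events` is an iterable of (session_id, step) pairs in time order.
    Only aggregate counts come out; nothing about an individual is kept.
    Note that a final "success" step will show a rate of 1.0, since every
    session that reaches it also ends there.
    """
    reached = Counter()   # number of distinct sessions that saw each step
    last_step = {}        # most recent step seen for each session
    seen = set()          # (session_id, step) pairs already counted
    for session_id, step in events:
        if (session_id, step) not in seen:
            seen.add((session_id, step))
            reached[step] += 1
        last_step[session_id] = step
    exits = Counter(last_step.values())
    return {step: exits[step] / reached[step] for step in reached}
```

For example, if half of the sessions that reach a hypothetical "ask_phone" step end there, `exit_rates` returns 0.5 for that step, which is the kind of figure you would weigh against the value of having the phone number.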

Speaking of creepy, another critical thing to consider during your data audit is privacy controls and encryption methods. Are you saving data legally? Are you protecting it legally? Are you informing your users appropriately about how and what data will be stored? Are you planning to remain consistent with your stated privacy policy? Do you respect people’s “Do Not Track” option?

At the end of a data audit, you might still have only a vague idea of what exactly you can do with your data, but you should have a bunch of possible ideas, as well as guesses at what kind of attributes would contribute to the kind of behavior you’re considering tracking.

Then, after you start collecting high-quality data and figuring out the basic questions you care about, you will probably have to wait a few weeks or months to start training and implementing your models. This is a good time to make sure your data infrastructure is in place and doesn’t have major bugs.

Data Strategy

Ok, now you’ve collected lots of data and you also have a bunch of questions you think may be answerable. It’s time to prioritize your questions and form a plan. For each question on your list, you’ll need to think about the following issues:

Is it a monitor or an algorithm?
Is it short-term, one-time analysis or should you set it up as a dashboard?
How much data will you need to train the model?
What is your expectation of the signal in the data you’re collecting?
How useful will the results of the model be considering the range of signal and the quality of the answer?
Do you need to go find proxy data? Should you start now?
Which algorithms should you consider?
What’s your evaluation method?
Is it scalable?
Can you do a baby version first or does it only make sense to go deep?
Can you do a simpler version of it that’s much cheaper to build?
How long will it probably take to train?
How fast can it update?
Will it be a pain to integrate it into the realtime system?
What are the costs if it doesn’t work?
What are the costs of not trying it? What else could you be doing with that time?
How is the feedback loop expected to work?
What is the impact of this model on the users?
What is the impact of this model on the world at large? This is especially important if you’re creepy. Don’t be creepy.
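
For the questions about an evaluation method and a "baby version", one common starting point (my assumption here, not the only option) is to score a trivial baseline on held-out data before building anything fancy. The Python sketch below uses hypothetical names and a toy "stay"/"leave" label; a real model would need to beat this number to justify its cost.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Build a 'model' that always predicts the most common training label."""
    most_common_label, _count = Counter(train_labels).most_common(1)[0]
    return lambda _features: most_common_label


def accuracy(predict, examples):
    """Fraction of (features, label) pairs that `predict` gets right."""
    correct = sum(1 for features, label in examples if predict(features) == label)
    return correct / len(examples)
```

If the baseline already scores close to what the business needs, or a candidate model barely improves on it, that is useful information about how much signal is in the data.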

Also, you need a team to build your models. How do you hire? Who do you hire? Some of these answers depend on your above plan. If there’s a lot of realtime updating for your models you’ll need more data engineers and fewer pure modelers. If you need excellent-looking results from your work you’ll need more data viz nerds.

You should consider hiring a consultant just to interview for you. It’s really hard to interview for data scientists if nobody is an expert in data science, and you might end up with someone who knows how to sound smart but can’t build anything. Or you could end up with someone who can build anything but has no idea what their choices really mean.

The ultimate goal at the end of a data audit and strategy is to end up with a reasonable expectation of what having a data science team will accomplish, how long it will take, how deep an investment it is, and how to do it.

Categories: data science, modeling

Comments (8)

Leon Kautsky

March 15, 2013 at 11:49 am

“So even if they did manage to hire a data scientist or a data team, those guys might be literally sitting on their hands for six months until they have enough data to start work.”
…
“It would be a waste of resources to form a data team you don’t need. There’s no reason every single company needs to consider itself part of the big data revolution just to be cool.”

Not everything that’s true needs to be said.

- Cathy O'Neil, mathbabe
  
  March 15, 2013 at 11:51 am
  
  So you suggest we data people get jobs we don’t need to sit around playing ping-pong and drinking artisanal beers on the VC dime? I don’t think it will work out in the long run for our industry if we do that.
  
  But never mind me, there are plenty of people already achieving that goal.
  
  - leonkautsky
    
    March 15, 2013 at 1:46 pm
    
    There is also the gym, Coursera, finishing up my Instapaper.com account and all of the wonderful cyberactivism (projects left behind include: GitLaw, Screwed App, and compiling an easy-to-use database of when drugs go off patent) I don’t currently have time to do.
    
    - Cathy O'Neil, mathbabe
      
      March 15, 2013 at 1:57 pm
      
      Didn’t mean to assume that you’d drink beer on the VC dime when you have so many other things to do (on the VC dime).
      
    - leonkautsky
      
      March 15, 2013 at 4:24 pm
      
      🙂 and that’s why I use this name.
      
griznog

March 15, 2013 at 12:20 pm

I am not a data scientist, but I do provide computing support for a growing number of people who would like to become data scientists and start offering courses related to this emerging field. To provide that support I’m starting by looking at providing some resources for a beginners-sized hadoop setup, but beyond hadoop am pretty clueless what types of tools and hardware would provide the most benefit. And admittedly, offering hadoop is a guess based on a reading of the hype leaves in the bottom of my cup. Any chance you could run down a quick list of the technical computing side of being a data scientist? Any pointers to software tools and how you use computing (cloud? local compute farms? storage?…) Gearing it toward what you’d tell your system administrator to set up for you would be especially helpful.

medicalquackblog

March 15, 2013 at 3:07 pm

“What is the impact of this model on the world at large? This is especially important if you’re creepy. Don’t be creepy.”…

Thank you for including that comment in your article. I’m starting to see analytics used out of context or taken from a valuable “trending information format” and whittled down to an individual “scoring” system..and yes it hurts consumers and error goes up…the creepy effect for sure. I just read yesterday that credit information about Bill Gates was hacked..the never ending battle for enough security measures too, which are only as good as the last hack it seems:)
