Archive

Archive for the ‘data science’ Category

What is regulation for?

A couple of days ago I was listening to a recorded webinar on K-12 student data privacy. I found out about it through an education blog I sometimes read called deutsch29, where the blog writer was complaining about “data cheerleaders” on a panel and how important issues are sure to be ignored if everyone on a panel is on the same, pro-data and pro-privatization side.

Well, as it turns out, deutsch29 was almost correct. Most of the panelists were super bland and pro-data collection by private companies. But the first panelist, Joel Reidenberg from Fordham Law School, reported on the state of data sharing in this country, the state of the law, and the gulf between the two.

I will come back to his report in another post, because it’s super fascinating, and in fact I’d love to interview that guy for my book.

One thing I wanted to mention was the high-level discussion that took place in the webinar on what regulation is for. Specifically, the following important question was asked:

Does every parent have to become a data expert in order to protect their children’s data?

The answer was different depending on who answered it, of course, but one answer that resonated with me was that this is exactly what regulation is for: it exists so that parents can rely on it to protect their children’s privacy, just as we expect HIPAA to protect the integrity of our medical data.

I started to like this definition – or attribute, if you will – of regulation, and I wondered how it relates to other kinds of regulation, like in finance, as well as how it would work if you’re arguing with people who hate all regulation.

First of all, I think that the financial industry has figured out how to make things so goddamn complicated that nobody can figure out how to regulate anything well. Moreover, they’ve somehow, at least so far, also been able to insist things need to be this complicated. So even if regulation were meant to allow people to interact with the financial system and at the same time “not be experts,” it’s clearly not wholly working. But what I like about it anyway is the emphasis on this issue of complexity and expertise. It took me a long time to figure out how big a problem that is in finance, but with this definition it goes right to the heart of the issue.

Second, as for the people who argue for de-regulation, I think it helps there too. Most of the time they act like everyone is an omniscient free agent who spends all their time becoming an expert on everything. And if that were true, then it’s possible that regulation wouldn’t be needed (although transparency is key too). The point is that we live in a world where most people have no clue about the issues of data privacy, never mind when it’s being shielded by ridiculous and possibly illegal contracts behind their kids’ public school system.

Finally, in terms of the potential for protecting kids’ data: here the private companies like InBloom and others are way ahead of regulators, but it’s not because of complexity on the issues so much as the fact that regulators haven’t caught up with technology. At least that’s my optimistic feeling about it. I really think this stuff is solvable in the short term, and considering it involves kids, I think it will have bipartisan support. Plus the education benefits of collecting all this data have not been proven at all, nor do they really require such shitty privacy standards even if they do work.

Categories: data science, finance, modeling

I’m writing a book called Weapons of Math Destruction

I’m incredibly excited to announce that I am writing a book called Weapons of Math Destruction for Random House books, with my editor Amanda Cook. There will also be a subtitle which we haven’t decided on yet.

Here’s how this whole thing went down. First I met my amazing book agent Jay Mandel from William Morris through my buddy Jordan Ellenberg. As many of you know, Jordan is also writing a book but it’s much farther along in the process and has already passed the editing phase. Jordan’s book is called How Not To Be Wrong and it’s already available for pre-order on Amazon.

Anyhoo, Jay spent a few months with me telling me how to write a book proposal, and it was actually a pretty substantial undertaking that required more than just an outline. It was like a short treatment of all the chapters, plus two chapters pretty much filled in, including the first, and as you know the first is kind of like an advertisement for the whole rest of the book.

Then, once that proposal was ready, Jay started what he hoped would be a bidding war for the proposal among publishers. He had a whole list of people he talked to from all over the place in the publishing world.

What actually happened, though, was that Amanda Cook from Crown Publishing, which is part of Random House, was the first person who was interested enough to talk to me about it, and then we hit it off really well, and she made a pre-emptive offer for the book, so the full-on bidding war didn’t end up needing to happen. And then just last week she announced the deal in what’s called Publishers Marketplace, which is for people inside publishing to keep abreast of the deals and news. The actual link is here, but it’s behind a paywall, so Amanda got me a screenshot:

[Screenshot of the Publishers Marketplace deal announcement]

If that font is too small, it says something like this:

Harvard math Ph.D., former Wall Street quant, and advisor to the Occupy movement Cathy O’Neil’s WEAPONS OF MATH DESTRUCTION, arguing that mathematical modeling has become a pervasive and destructive force in society—in finance, education, medicine, politics, and the workplace—and showing how current models exacerbate inequality and endanger democracy and how we might rein them in, to Amanda Cook at Crown in a pre-empt by Jay Mandel at William Morris Endeavor (NA).

So as you can tell I’m incredibly excited about the book, and I have tons of ideas about it, but of course I’d love my readers to weigh in on crucial examples of models and industries that you think might get overlooked.

Please, post a comment or send me an email (located on my About page) with your favorite example of a family of models (the Value Added Model for teachers is already in!) or a specific model (the Value-at-Risk model in finance is already in!) that is illustrative of feedback loops, or perverted incentives, or creepy modeling, or some such concept that you imagine I’ll be writing about (or should be!). Thanks so much for your input!

One last thing. I’m aiming to finish the writing part by next Spring, and then the book is actually released about 9 months later. It takes a while. I’m super glad I have had the experience of writing a technical book with O’Reilly as well as the homemade brew Occupy Finance with my Occupy group so I know at least some of the ropes, but even so this is a bit more involved.

 

Categories: data science, modeling, news

Parents fighting back against sharing children’s data with InBloom

There is a movement afoot in New York (and other places) to allow private companies to house and mine tons of information about children and how they learn. It’s being touted as a great way to tailor online learning tools to kids, but it also raises all sorts of potential creepy modeling problems, and one very bad sign is how secretive everything is in terms of privacy issues. Specifically, it’s all being done through school systems and without consulting parents.

In New York it’s being done through InBloom, which I already mentioned here when I talked about big data and surveillance. In that post I related an EducationNewYork report which quoted an official from InBloom as saying that the company “cannot guarantee the security of the information stored … or that the information will not be intercepted when it is being transmitted.”

The issue is super important and timely, and parents have been left out of the loop, with no opt-out option, and are actively fighting back, for example with this petition from MoveOn (h/t George Peacock). And although the InBloomers claim that no data about their kids will ever be sold, that doesn’t mean it won’t be used by third parties for various mining purposes and possibly marketing – say for test prep tools. In fact that’s a major feature of InBloom’s computer and data infrastructure, the ability for third parties to plug into the data. Not cool that this is being done on the downlow.

Who’s behind this? InBloom is funded by the Bill & Melinda Gates foundation and the operating system for inBloom is being developed by the Amplify division (formerly Wireless Generation) of Rupert Murdoch’s News Corp. More about the Murdoch connection here.

Wait, who’s paying for this? Besides the Gates and Murdoch money, New York has spent $50 million in federal grants to set up the partnership with InBloom. And it’s not only New York that is pushing back, according to this Salon article:

InBloom essentially offers off-site digital storage for student data—names, addresses, phone numbers, attendance, test scores, health records—formatted in a way that enables third-party education applications to use it. When inBloom was launched in February, the company announced partnerships with school districts in nine states, and parents were outraged. Fears of a “national database” of student information spread. Critics said that school districts, through inBloom, were giving their children’s confidential data away to companies who sought to profit by proposing a solution to a problem that does not exist. Since then, all but three of those nine states have backed out.

Finally, according to this nydailynews article, Bill de Blasio is coming out on the side of protecting children’s privacy as well. That’s a good sign, let’s hope he sticks with it.

I’m not against using technology to learn, and in fact I think it’s inevitable and possibly very useful. But first we need to have a really good, public discussion about how this data is being shared, controlled, and protected, and that simply hasn’t happened. I’m glad to see parents are aware of this as a problem.

Categories: data science, modeling, news, rant

Computer, do I really want to get married?

There’s a new breed of models out there nowadays that reads your face for subtle expressions of emotions, possibly stuff that normal humans cannot pick up on. You can read more about it here, but suffice it to say it’s a perfect target for computers – something that is free information, that can be trained over many many examples, and then deployed everywhere and anywhere, even without our knowledge since surveillance cameras are so ubiquitous.

Plus, there are new studies that show that, whether you’re aware of it or not, a certain “gut feeling”, which researchers can get at by asking a few questions, will expose whether your marriage is likely to work out.

Let’s put these two together. I don’t think it’s too much of a stretch to imagine that surveillance cameras strategically placed at an altar can now make predictions on the length and strength of a marriage.

Oh goodie!

I guess it brings up the following question: is there some information we are better off not knowing? I don’t think knowing my marriage is likely to be in trouble would help me keep the faith. And every marriage needs a good dose of faith.

I heard a radio show about Huntington’s disease. There’s no cure for it, but there is a simple genetic test to see if you’ve got it, and it usually starts in adulthood so there’s plenty of time for adults to see their parents degenerate and start to worry about themselves.

But here’s the thing, only 5% of people who have a 50% chance of having Huntington’s actually take that test. For them the value of not knowing that information is larger than knowing. Of course knowing you don’t have it is better still, but until that happens the ambiguity is preferable.

Maybe what’s critical is that there’s no cure. I mean, if there were a therapy that could help Huntington’s disease sufferers delay or ameliorate it, I think we’d see far more people taking that genetic marker test.

And similarly, if there were ways to save a marriage that is at risk, we might want to know on the altar what the prognosis is. Right?

I still don’t know. Somehow, when things get that personal and intimate, I’d rather be left alone, even if an algorithm could help me “optimize my love life”. But maybe that’s just me being old-fashioned, and maybe in 100 years people will treat their computers like love oracles.

Categories: data science, modeling, news

PDF Liberation Hackathon: January 17-19

This is a guest post by Marc Joffe, the principal consultant at Public Sector Credit Solutions, an organization that provides data and analysis related to sovereign and municipal securities. Previously, Joffe was a Senior Director at Moody’s Analytics.

As Cathy has argued, open source models can bring much needed transparency to scientific research, finance, education and other fields plagued by biased, self-serving analytics. Models often need large volumes of data, and if the model is to be run on an ongoing basis, regular data updates are required.

Unfortunately, many data sets are not ready to be loaded into your analytical tool of choice; they arrive in an unstructured form and must be organized into a consistent set of rows and columns. This cleaning process can be quite costly. Since open source modeling efforts are usually low dollar operations, the costs of data cleaning may prove to be prohibitive. Hence no open model – distortion and bias continue their reign.

Much data comes to us in the form of PDFs. Say, for example, you want to model student loan securitizations. You will be confronted with a large number of PDF servicing reports that look like this. A corporation or well funded research institution can purchase an expensive, enterprise-level ETL (Extract-Transform-Load) tool to migrate data from the PDFs into a database. But this is not much help to insurgent modelers who want to produce open source work.

Data journalists face a similar challenge. They often need to extract bulk data from PDFs to support their reporting. Examples include IRS Form 990s filed by non-profits and budgets issued by governments at all levels.

The data journalism community has responded to this challenge by developing software to harvest usable information from PDFs. One example is Tabula, a tool written by Knight-Mozilla OpenNews Fellow Manuel Aristarán, which extracts data from PDF tables in a form that can be readily imported into a spreadsheet – if the PDF was “printed” from a computer application. Introduced earlier this year, Tabula continues to evolve thanks to the volunteer efforts of Manuel, with help from OpenNews Fellow Mike Tigas and New York Times interactive developer Jeremy Merrill. Meanwhile, DocHive, a tool whose continuing development is being funded by a Knight Foundation grant, addresses PDFs that were created by scanning paper documents. DocHive is a project of Raleigh Public Record and is led by Charles and Edward Duncan.

These open source tools join a number of commercial offerings such as Able2Extract and ABBYY Fine Reader that extract data from PDFs. A more comprehensive list of open source and commercial resources is available here.
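To make that concrete, here is a minimal sketch of what this kind of extraction looks like from Python using tabula-py, a wrapper around Tabula’s extraction engine (it needs a Java runtime under the hood). The file names below are made up, and this approach only works well on PDFs that were “printed” from an application rather than scanned:

```python
import tabula

# Each table found on the requested pages comes back as a pandas DataFrame.
# "servicing_report.pdf" is a placeholder file name.
tables = tabula.read_pdf("servicing_report.pdf", pages="all", multiple_tables=True)

for i, df in enumerate(tables):
    # Persist each extracted table as CSV so it can be loaded into a model later.
    df.to_csv(f"servicing_report_table_{i}.csv", index=False)
```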

Unfortunately, the free and low cost tools available to modelers, data journalists and transparency advocates have limitations that hinder their ability to handle large scale tasks. If, like me, you want to submit hundreds of PDFs to a software tool, press “Go” and see large volumes of cleanly formatted data, you are out of luck.

It is for this reason that I am working with The Sunlight Foundation and other sponsors to stage the PDF Liberation Hackathon from January 17-19, 2014. We’ll have hack sites at Sunlight’s Washington DC office and at RallyPad in San Francisco. Developers can also join remotely because we will publish a number of clearly specified PDF extraction challenges before the hackathon.

Participants can work on one of the pre-specified challenges or choose their own PDF extraction projects. Ideally, hackathon teams will use (and hopefully improve upon) open source tools to meet the hacking challenges, but they will also be allowed to embed commercial tools into their projects as long as their licensing cost is less than $1000 and an unlimited trial is available.

Prizes of up to $500 will be awarded to winning entries. To receive a prize, a team must publish their source code on a GitHub public repository. To join the hackathon in DC or remotely, please sign up at Eventbrite; to hack with us in SF, please sign up via this Meetup. Please also complete our Google Form survey. Also, if anyone reading this is associated with an organization in New York or Chicago that would like to organize an additional hack space, please contact me.

The PDF Liberation Hackathon is going to be a great opportunity to advance the state of the art when it comes to harvesting data from public documents. I hope you can join us.

Algorithmic Accountability Reporting: On the Investigation of Black Boxes

Tonight I’m going to be on a panel over at Columbia’s Journalism School called Algorithmic Accountability Reporting: On the Investigation of Black Boxes. It’s being organized by Nick Diakopoulos, Tow Fellow and previous guest blogger on mathbabe. You can sign up to come here and it will also be livestreamed.

The other panelists are Scott Klein from ProPublica and Clifford Stein from Columbia. I’m super excited to meet them.

Unlike some panel discussions I’ve been on, where the panelists talk about some topic they choose for a few minutes each and then there are questions, this panel will be centered around a draft of a paper coming from the Tow Center at Columbia. First Nick will present the paper and then the panelists will respond to it. Then there will be Q&A.

I wish I could share it with you but it doesn’t seem publicly available yet. Suffice it to say it has many elements in common with Nick’s guest post on raging against the algorithms, and its overall goal is to understand how investigative journalism should handle a world filled with black box algorithms.

Super interesting stuff, and I’m looking forward to tonight, even if it means I’ll miss the New Day New York rally in Foley Square tonight.

Categories: data science, modeling

“People analytics” embeds old cultural problems in new mathematical models

Today I’d like to discuss a recent article from the Atlantic entitled “They’re watching you at work” (hat tip Deb Gieringer).

In the article they describe what they call “people analytics,” which refers to the new suite of managerial tools meant to help find and evaluate employees of firms. The first generation of this stuff happened in the 1950’s, and relied on stuff like personality tests. It didn’t seem to work very well and people stopped using it.

But maybe this new generation of big data models can be super useful? Maybe they will give us an awesome way of more efficiently throwing away people who won’t work out and keeping those who will?

Here’s an example from the article. Royal Dutch Shell sources ideas for “business disruption” and wants to know which ideas to look into. There’s an app for that, apparently, written by a Silicon Valley start-up called Knack.

Specifically, Knack had a bunch of the ideamakers play a video game, and they presumably also were given training data on which ideas historically worked out. Knack developed a model and was able to give Royal Dutch Shell a template for which ideas to pursue in the future based on the personality of the ideamakers.

From the perspective of Royal Dutch Shell, this represents a huge time savings. But from my perspective it means that whatever process the dudes at Royal Dutch Shell developed for vetting their ideas has now been effectively set in stone, at least for as long as the algorithm is being used.

I’m not saying they won’t save time, they very well might. I’m saying that, whatever their process used to be, it’s now embedded in an algorithm. So if they gave preference to a certain kind of arrogance, maybe because the people in charge of vetting identified with that, then the algorithm has encoded it.

One consequence is that they might very well pass on really excellent ideas that happened to have come from a modest person – no discussion necessary on what kind of people are being invisibly ignored in such a set-up. Another consequence is that they will believe their process is now objective because it’s living inside a mathematical model.

The article compares this to the “blind auditions” for orchestras example, where people are kept behind a curtain so that the listeners don’t give extra consideration to their friends. Famously, the consequence of blind auditions has been way more women in orchestras. But that’s an extremely misleading comparison to the above algorithmic hiring software, and here’s why.

In the blind auditions case, the people measuring the musician’s ability have committed themselves to exactly one clean definition of readiness for being a member of the orchestra, namely the sound of the person playing the instrument. And they accept or deny someone, sight unseen, based solely on that evaluation metric.

Whereas with the idea-vetting process above, the training data consisted of “previous winners,” who presumably had to go through a series of meetings and convince everyone in the meeting that their idea had merit, and that they could manage the team to try it out, and all sorts of other things. Their success relied, in other words, on a community’s support of their idea and their ability to command that support.

In other words, imagine that, instead of listening to someone playing trombone behind a curtain, the evaluation metric were to compare a given musician to other musicians who had already played in a similar orchestra and, just to make it super success-based, had made first seat.

Then you’d have a very different selection criterion, and a very different algorithm. It would be based on all sorts of personality issues, and community bias and buy-in issues. In particular you’d still have way more men.

The fundamental difference here is one of transparency. In the blind auditions case, everyone agrees beforehand to judge on a single transparent and appealing dimension. In the black box algorithms case, you’re not sure what you’re judging things on, but you can see when a candidate comes along that is somehow “like previous winners.”

One of the most frustrating things about this industry of hiring algorithms is how unlikely it is to actively fail. It will save time for its users, since after all computers can efficiently throw away “people who aren’t like people who have succeeded in your culture or process” once they’ve been told what that means.

The most obvious consequence of using this model, for the companies that use it, is that they’ll get more and more people just like the people they already have. And that’s surprisingly unnoticeable for people in such companies.

My conclusion is that these algorithms don’t make things objective, they make things opaque. And they embed our old cultural problems in new mathematical models, giving us a false badge of objectivity.

Categories: data science, modeling, rant

Cool open-source models?

I’m looking to develop my idea of open models, which I motivated here and started to describe here. I wrote the post in March 2012, but the need for such a platform has only become more obvious.

I’m lucky to be working with a super fantastic python guy on this, and the details are under wraps, but let’s just say it’s exciting.

So I’m looking to showcase a few good models to start with, preferably in python, but the critical ingredient is that they’re open source. They don’t have to be great, because the point is to see their flaws and possibly to improve them.

  1. For example, I put in a FOIA request a couple of days ago to get the current teacher value-added model from New York City.
  2. A friend of mine, Marc Joffe, has an open source municipal credit rating model. It’s not in python but I’m hopeful we can work with it anyway.
  3. I’m in search of an open source credit scoring model for individuals. Does anyone know of something like that?
  4. They don’t have to be creepy! How about a Nate Silver – style weather model?
  5. Or something that relies on open government data?
  6. Can we get the Reinhart-Rogoff model?

The idea here is to get the model, not necessarily the data (although even better if it can be attached to data and updated regularly). And once we get a model, we’d build interactives with the model (like this one), or at least the tools to do so, so other people could build them.

At its core, the point of open models is this: you don’t really know what a model does until you can interact with it. You don’t know if a model is robust unless you can fiddle with its parameters and check. And finally, you don’t know if a model is the best possible unless you’ve let people try to improve it.

Twitter and its modeling war

I often talk about the modeling war, and I usually mean the one where the modelers are on one side and the public is on the other. The modelers are working hard trying to convince or trick the public into clicking or buying or consuming or taking out loans or buying insurance, and the public is on the other, barely aware that they’re engaging in anything at all resembling a war.

But there are plenty of other modeling wars that are being fought by two sides which are both sophisticated. To name a couple, Anonymous versus the NSA and Anonymous versus itself.

Here’s another, and it’s kind of bland but pretty simple: Twitter bots versus Twitter.

This war arose from the fact that people care about how many followers someone on Twitter has. It’s a measure of a person’s influence, albeit a crappy one for various reasons (and not just because it’s being gamed).

The high impact of the follower count means it’s in a wannabe celebrity’s best interest to juice their follower numbers, which introduces the idea of fake Twitter accounts to game the model. This is an industry in itself, with an associated arms race of spam filters to get rid of the fakes. The question is, who’s winning this arms race and why?

Twitter has historically made some strides in finding and removing such fake accounts with the help of some modelers who actually bought the services of a spammer and looked carefully at what their money bought them. Recently though, at least according to this WSJ article, it looks like Twitter has spent less energy pursuing the spammers.

This raises the question: why? After all, Twitter has a lot theoretically at stake. Namely, its reputation, because if everyone knows how gamed the system is, they’ll stop trusting it. On the other hand, that argument only really holds if people have something else to use instead as a better proxy of influence.

Even so, considering that Twitter has a bazillion dollars in the bank right now, you’d think they’d spend a few hundred thousand a year to prevent their reputation from being too tarnished. And maybe they’re doing that, but the spammers seem to be happily working away in spite of that.

And judging from my experience on Twitter recently, there are plenty of spammers actively degrading the user experience. That brings up my final point, which is that the lack-of-competition argument at some point gives way to the “I don’t want to be spammed” user experience argument. At some point, if Twitter doesn’t maintain standards, people will just not spend time on Twitter, and its proxy of influence will fall out of favor for that more fundamental reason.

Categories: data science, modeling

Crisis Text Line: Using Data to Help Teens in Crisis

This morning I’m helping out at a datadive event set up by DataKind (apologies to Aunt Pythia lovers).

The idea is that we’re analyzing metadata around a texting hotline for teens in crisis. We’re trying to see if we can use the information we have on these texts (timestamps, character length, topic – which is most often suicide – and outcome reported by both the texter and the counselor) to help the counselors improve their responses.

For example, right now counselors can be in up to 5 conversations at a time – is that too many? Can we figure that out from the data? Is there too much waiting between texts? Other questions are listed here.
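To give a flavor of what that kind of metadata analysis might look like, here’s a rough pandas sketch. The column names (counselor_id, conversation_id, timestamp, outcome_score) and the 10-minute window are hypothetical stand-ins, not the actual Crisis Text Line schema:

```python
import pandas as pd

def concurrency_vs_outcome(texts: pd.DataFrame) -> pd.Series:
    """For each text, estimate how many conversations its counselor was juggling
    at the time, then see how the reported outcome varies with that load.
    Assumes the (hypothetical) timestamp column is already a datetime."""
    texts = texts.sort_values("timestamp").copy()

    def concurrency(group):
        counts = []
        for _, row in group.iterrows():
            # Distinct conversations this counselor touched within +/- 10 minutes.
            window = group[
                (group["timestamp"] >= row["timestamp"] - pd.Timedelta("10min"))
                & (group["timestamp"] <= row["timestamp"] + pd.Timedelta("10min"))
            ]
            counts.append(window["conversation_id"].nunique())
        return pd.Series(counts, index=group.index)

    texts["concurrent"] = texts.groupby("counselor_id", group_keys=False).apply(concurrency)
    # Average outcome reported at each concurrency level (1 through 5 conversations, say).
    return texts.groupby("concurrent")["outcome_score"].mean()
```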

Our “hackpad” is located here, and will hopefully be updated like a wiki with results and visuals from the exploration of our group. It looks like we have a pretty amazing group of nerds over here looking into this (mostly python users!), and I’m hopeful that we will be helping the good people at Crisis Text Line.

There is no “market solution” for ethics

We saw what happened in finance with self-regulation and ethics. Let’s prepare for the exact same thing in big data.

Finance

Remember back in the 1970’s through the 1990’s, when the powers that were decided that we didn’t need to regulate banks because “they” wouldn’t put “their” best interests at risk? Then came the financial crisis, and most recently Alan Greenspan’s admission that he’d got it kinda wrong but not really.

Let’s look at what the “self-regulated market” in derivatives has bestowed upon us. We’ve got a bunch of captured regulators and a huge group of bankers who insist on keeping derivatives opaque so that they can charge clients bigger fees, not to mention that they insist on not having fiduciary duties to their clients, and oh yes, they’d like to continue to bet depositors’ money on those derivatives. They wrote the regulation themselves for that one. And this is after they blew up the world and got saved by the taxpayers.

Given that the banks write the regulations, it’s arguably still kind of a self-regulated market in finance. So we can see how ethics has been and is faring in such a culture.

The answer is, not well. Just in case the last 5 years of news articles weren’t enough to persuade you of this fact, here’s what NY Fed Chief Dudley had to say recently about big banks and the culture of ethics, from this Huffington Post article:

“Collectively, these enhancements to our current regime may not solve another important problem evident within some large financial institutions — the apparent lack of respect for law, regulation and the public trust,” he said.

“There is evidence of deep-seated cultural and ethical failures at many large financial institutions,” he continued. “Whether this is due to size and complexity, bad incentives, or some other issues is difficult to judge, but it is another critical problem that needs to be addressed.”

Given that my beat is now more focused on the big data community and less on finance, mostly since I haven’t worked in finance for almost 2 years, this kind of stuff always makes me wonder how ethics is faring in the big data world, which is, again, largely self-regulated.

Big data

According to this ComputerWorld article, things are pretty good. I mean, there are the occasional snafus – unappreciated sensors or unreasonable zip code gathering examples – but the general idea is that, as long as you have a transparent data privacy policy, you’ll be just fine.

Examples of how awesome “transparency” is in these cases vary from letting people know what cookies are being used (BlueKai), to promising not to share certain information between vendors (Retention Science), to allowing customers a limited view into their profiling by Acxiom, the biggest consumer information warehouse. Here’s what I assume a typical reaction might be to this last one.

Wow! I know a few things Acxiom knows about me, but probably not all! How helpful. I really trust those guys now.

Not a solution

What’s great about letting customers know exactly what you’re doing with their data is that you can then turn around and complain that customers don’t understand or care about privacy policies. In any case, it’s on them to evaluate and argue their specific complaints. Which of course they don’t do, because they can’t possibly do all that work and have a life, and if they really care they just boycott the product altogether. The result in any case is a meaningless, one-sided conversation where the tech company only hears good news.

Oh, and you can also declare that customers are just really confused and don’t even know what they want:

In a recent Infosys global survey, 39% of the respondents said that they consider data mining invasive. And 72% said they don’t feel that the online promotions or emails they receive speak to their personal interests and needs.

Conclusion: people must want us to collect even more of their information so they can get really really awesome ads.

Finally, if you make the point that people shouldn’t be expected to be data mining and privacy experts to use the web, the issue of a “market solution for ethics” is raised.

“The market will provide a mechanism quicker than legislation will,” he says. “There is going to be more and more control of your data, and more clarity on what you’re getting in return. Companies that insist on not being transparent are going to look outdated.”

Back to ethics

What we’ve got here is a repeat problem. The goal of tech companies is to make money off of consumers, just as the goal of banks is to make money off of investors (and taxpayers as a last resort).

Given how much these incentives clash, the experts on the inside have figured out a way of continuing to do their thing, make money, and at the same time, keeping a facade of the consumer’s trust. It’s really well set up for that since there are so many technical terms and fancy math models. Perfect for obfuscation.

If tech companies really did care about the consumer, they’d help set up reasonable guidelines and rules on these issues, which could easily be turned into law. Instead they send lobbyists to water down any and all regulation. They’ve even recently created a new superPAC for big data (h/t Matt Stoller).

And although it’s true that policy makers are totally ignorant of the actual issues here, that might be because of the way big data professionals talk down to them and keep them ignorant. It’s obvious that tech companies are desperate for policy makers to stay out of any actual informed conversation about these issues, never mind the public.

Conclusion

There never has been, nor will there ever be, a market solution for ethics so long as the basic incentives between the public and an industry are so misaligned. The public needs to be represented somehow, and without rules and regulations, and without leverage of any kind, that will not happen.

Categories: data science, finance, modeling

How do you know when you’ve solved your data problem?

I’ve been really impressed by how consistently people have gone to read my post “K-Nearest Neighbors: dangerously simple,” which I wrote back in April. Here’s a timeline of hits on that post:

Stats for “K-Nearest Neighbors: dangerously simple.” I’ve actually gotten more hits recently.

I think the interest in this post is that people like having myths debunked, and are particularly interested in hearing how even the simple things they thought they understood are possibly wrong, or at least more complicated than they’d been assuming. Either that or it’s just got a real catchy name.

Anyway, since I’m still getting hits on that post, I’m also still getting comments, and just this morning I came across a new comment by someone who calls herself “travelingactuary”. Here it is:

My understanding is that CEOs hate technical details, but do like results. So, they wouldn’t care if you used K-Nearest Neighbors, neural nets, or one that you invented yourself, so long as it actually solved a business problem for them. I guess the problem everyone faces is, if the business problem remains, is it because the analysis was lacking or some other reason? If the business is ‘solved’ is it actually solved or did someone just get lucky? That being so, if the business actually needs the classifier to classify correctly, you better hire someone who knows what they’re doing, rather than hoping the software will do it for you.

Presumably you want to sell something to Monica, and the next n Monicas who show up. If your model finds a whole lot of big spenders who then don’t, your technophobe CEO is still liable to think there’s something wrong.

I think this comment brings up the right question, namely knowing when you’ve solved your data problem, with K-Nearest Neighbors or whichever algorithms you’ve chosen to use. Unfortunately, it’s not that easy.

Here’s the thing, it’s almost never possible to tell if a data problem is truly solved. I mean, it might be a business problem where you go from losing money to making money, and in that sense you could say it’s been “solved.” But in terms of modeling, it’s very rarely a binary thing.

Why do I say that? Because, at least in my experience, it’s rare that you could possibly hope for high accuracy when you model stuff, even if it’s a classification problem. Most of the time you’re trying to achieve something better than random, some kind of edge. Often an edge is enough, but it’s nearly impossible to know if you’ve gotten the biggest edge possible.

For example, say you’re binning people who come to your site into three equally sized groups, as “high spenders,” “medium spenders,” and “low spenders.” So if the model were random, you’d expect a third of the people to be put into each group, and someone who ends up as a big spender would be equally likely to be in any of the three bins.

Next, say you make a model that’s better than random. How would you know that? You can measure it, for example, by comparing it to the random model – in other words, by seeing how much better you do than random. So if someone who ends up being a big spender is three times more likely to have been labeled a big spender than a low spender, and twice as likely as a medium spender, you know your model is “working.”

You’d use those numbers, 3x and 2x, as a way of measuring the edge your model is giving you. You might care about other related numbers more, like whether pegged low spenders are actually low spenders. It’s up to you to decide what it means that the model is working. But even when you’ve done that carefully, and set up a daily updated monitor, the model itself still might not be optimal, and you might still be losing money.
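Here’s a quick sketch of how you might compute those edge numbers in practice. The column names and bin labels are made up for illustration:

```python
import pandas as pd

def big_spender_lift(df: pd.DataFrame):
    """Among people who turned out to be big spenders, how much more often did the
    model label them 'high' than 'low' or 'medium'? A random model gives ratios near 1."""
    big = df[df["actual_bin"] == "high"]
    shares = big["predicted_bin"].value_counts(normalize=True)
    high = shares.get("high", 0.0)
    lift_vs_low = high / max(shares.get("low", 0.0), 1e-9)        # the "3x" number
    lift_vs_medium = high / max(shares.get("medium", 0.0), 1e-9)  # the "2x" number
    return lift_vs_low, lift_vs_medium
```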

In other words, you can be a bad modeler or a good modeler, and either way when you try to solve a specific problem you won’t really know if you did the best possible job you could have, or someone else could have with their different tools and talents.

Even so, there are standards that good modelers should follow. First and most importantly, you should always set up a model monitor to keep track of the quality of the model and see how it fares over time.  Because why? Because second, you should always assume that, over time, your model will degrade, even if you are updating it regularly or even automatically. It’s of course good to know how crappy things are getting so you don’t have a false sense of accomplishment.
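As a sketch of what such a monitor might look like, assuming you’ve boiled the model’s quality down to one daily number (the lift ratios above, say), here’s one way to flag degradation. The window and threshold are arbitrary choices, not recommendations:

```python
import pandas as pd

def flag_degradation(metric_by_day: pd.Series, window: int = 30, drop: float = 0.8) -> pd.Series:
    """Return the days on which the model's edge has decayed.

    metric_by_day: a daily series of whatever edge metric you track.
    A day is flagged when the trailing average falls below `drop` times the level
    the model had when it launched.
    """
    baseline = metric_by_day.iloc[:window].mean()    # how good the model looked at launch
    trailing = metric_by_day.rolling(window).mean()  # how good it looks lately
    return trailing[trailing < drop * baseline]
```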

Keep in mind that just because it’s getting worse doesn’t mean you can easily start over again and do better. But at least you can try, and you will know when it’s worth a try. So, that’s one thing that’s good about admitting your inability to finish anything.

On to the political aspect of this issue. If you work for a CEO who absolutely hates ambiguity – and CEOs are trained to hate ambiguity, as well as trained to never hesitate – and if that CEO wants more than anything to think their data problem has been “solved,” then you might be tempted to argue that you’ve done a phenomenal job just to make her happy. But if you’re honest, you won’t say that, because it ain’t true.

Ironically and for these reasons, some of the most honest data people end up looking like crappy scientists because they never claim to be finished doing their job.

Categories: data science, modeling

The private-data-for-services trade fallacy

I had a great time at Harvard Wednesday giving my talk (prezi here) about modeling challenges. The audience was fantastic and truly interdisciplinary, and they pushed back and challenged me in a great way. I’m glad I went and I’m glad Tess Wise invited me.

One issue that came up is something I want to talk about today, because I hear it all the time and it’s really starting to bug me.

Namely, the fallacy that people, especially young people, are “happy to give away their private data in order to get the services they love on the internet”. The actual quote came from the IBM guy on the congressional subcommittee panel on big data, which I blogged about here (point #7), but I’ve started to hear that reasoning more and more often from people who insist on side-stepping the issue of data privacy regulation.

Here’s the thing. It’s not that people don’t click “yes” on those privacy forms. They do click yes, and I acknowledge that. The real problem is that people generally have no clue what it is they’re trading.

In other words, this idea of an omniscient market participant with perfect information making a well-informed trade, which we’ve already seen is not the case in the actual market, is doubly or triply not the case when you think about young people giving away private data for the sake of a phone app.

Just to be clear about what these market participants don’t know, I’ll make a short list:

  • They probably don’t know that their data is aggregated, bought, and sold by Acxiom, which they’ve probably never heard of.
  • They probably don’t know that Facebook and other social media companies sell stuff about them even if their friends don’t see it and even though it’s often “de-identified”. Think about this next time you sign up for a service like “Bang With Friends,” which works through Facebook.
  • They probably don’t know how good algorithms are getting at identifying de-identified information.
  • They probably don’t know how this kind of information is used by companies to profile users who ask for credit or try to get a job.

Conclusion: people are ignorant of what they’re giving away to play Candy Crush Saga[1]. And whatever it is they’re giving away, it’s something way far in the future that they’re not worried about right now. In any case it’s not a fair trade by any means, and we should stop referring to it as such.

What is it instead? I’d say it’s a trick. A trick which plays on our own impulses and short-sightedness and possibly even a kind of addiction to shiny toys in the form of candy. If you give me your future, I’ll give you a shiny toy to play with right now. People who click “yes” are not signaling that they’ve thought deeply about the consequences of giving their data away, and they are certainly not making the definitive political statement that we don’t need privacy regulation.

1. I actually don’t know the data privacy rules for Candy Crush and can’t seem to find them, for example here. Please tell me if you know what they are.

Categories: data science, modeling, rant

Harvard Applied Statistics workshop today

I’m on an Amtrak train to Boston today to give a talk in the Applied Statistics workshop at Harvard, which is run out of the Harvard Institute for Quantitative Social Science. I was kindly invited by Tess Wise, a Ph.D. student in the Department of Government at Harvard who is organizing this workshop.

My title is “Data Skepticism in Industry” but as I wrote the talk (link to my prezi here) it transformed a bit, and now it’s more about the problems not only for data professionals inside industry but for the public as well. So I talk about creepy models and how there are multiple long-term feedback loops having a degrading effect on culture and democracy in the name of short-term profits.

Since we’re on the subject of creepy, my train reading this morning is this book entitled “Murdoch’s Politics,” which talks about how Rupert Murdoch lives by design in the center of all things creepy. 

Categories: data science, modeling

Disorderly Conduct with Alexis and Jesse #OWS

Podcast

So there’s a new podcast called Disorderly Conduct which “explores finance without a permit” and is hosted by Alexis Goldstein, whom I met through her work on Occupy the SEC, and Jesse Myerson, an activist and writer.

I was recently a very brief guest on their “In the Weeds” feature, where I was asked to answer the question, “What is the single best way to rein in the power of Wall Street?” in three minutes. The answers given by:

  1. me,
  2. The Other 98% organizer Nicole Carty (@nacarty),
  3. Salon.com contributing writer David Dayen (@ddayen),
  4. Americans for Financial Reform Policy Director Marcus Stanley (@MarcusMStanley), and
  5. Marxist militant José Martín (@sabokitty)

can be found here or you can download the episode here.

Occupy Finance video series

We’ve been having our Occupy Finance book club meetings every Sunday, and although our group has decided not to record them, a friend of our group and a videographer in her own right, Donatella Barbarella, has started to interview the authors and post them on YouTube. The first few interviews have made their way to the interwebs:

  1. Linda talking about Chapter 1: Financialization and the 99%.
  2. Me talking about Chapter 2: the Bailout
  3. Tamir talking about Chapter 3: How banks work

Doing Data Science now out!

O’Reilly is releasing the book today. I can’t wait to see a hard copy!! And when I say “hard copy,” keep in mind that all of O’Reilly’s books are soft cover.

Categories: #OWS, data science, finance

*Doing Data Science* now available on Kindle!

My book with Rachel Schutt is now available on Kindle. I’ve tested this by buying it myself from amazon.com and looking at it on my computer’s so-called cloud reader.

Here’s the good news. It is actually possible to do this, and it’s satisfying to see!

Here’s the bad news. The Kindle reader doesn’t render LaTeX well, or for that matter many of the various fonts we use for various reasons. The result is a pretty comical display of formatting inconsistency. In particular, whenever a formula comes up it might seem like we’re

screaming about it

and often the quoted passages come in

very very tiny indeed.

I hope it’s readable. If you prefer less comical formatting, the hard copy edition is coming out on October 22nd, next Tuesday.

Next, a word about the book’s ranking. Amazon has this very generous way of funneling down into categories sufficiently so that the ranking of a given book looks really high. So right now I can see this on the book’s page:

but for a while, before yesterday, it took a few more iterations of digging to get to single digits, so it was more like:

But, you know, I’ll take what I can get to be #1! It’s all about metrics!!!

One last thing, which is that the full title is now “Doing Data Science: Straight Talk from the Frontline” and for the record, I wanted the full title to be something more like “Doing Data Science: the no bullshit approach” but for some reason I was overruled. Whatevs.

Categories: data science

Cumulative covariance plots

One thing I do a lot when I work with data is figure out how to visualize my signals, especially with respect to time.

Lots of things change over time – relationships between variables, for example – and it’s often crucial to get deeply acquainted with how exactly that works with your in-sample data.

Say I am trying to predict “y”: so for a data point at time t, we’ll say we try to predict y(t). I’ll take an “x”, a variable that is expected to predict “y”, and I’ll demean both series x and y, hopefully in a causal way, and I will rename them x’ and y’, and then, making sure I’ve ordered everything with respect to time, I’ll plot the cumulative sum of the product x'(t) * y'(t).

In the case that both x'(t) and y'(t) have the same sign – so they’re both bigger than average or they’re both smaller than average – the product is positive, and otherwise it’s negative. So if you plot the cumulative sum, you get an upward trend if things are positively correlated and a downward trend if things are negatively correlated. If you think about it, you are computing the numerator of the correlation function, so it is indeed just an unscaled version of total correlation.

Plus, since you ordered everything by time first, you can see how the relationship between these variables evolved over time.

Also, in the case that you are working with financial models, you can make a simplifying assumption that both x and y are pretty well demeaned already (especially at short time scales) and this gives you the cumulative PnL plot of your model. In other words, it tells you how much money your model is making.
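Here’s a bare-bones version of that plot in pandas. The expanding-mean demeaning is just one rough way to keep things causal, and the series names are placeholders:

```python
import pandas as pd
import matplotlib.pyplot as plt

def cumulative_covariance(x: pd.Series, y: pd.Series) -> pd.Series:
    """Cumulative sum of x'(t) * y'(t), ordered by time.

    x and y are time-indexed series. Each point is demeaned using the expanding
    mean of past observations only (shifted by one), so the demeaning is causal.
    """
    df = pd.DataFrame({"x": x, "y": y}).sort_index().dropna()
    x_dm = df["x"] - df["x"].expanding().mean().shift(1)
    y_dm = df["y"] - df["y"].expanding().mean().shift(1)
    return (x_dm * y_dm).cumsum()

# Example usage:
# cum = cumulative_covariance(x, y)
# cum.plot(title="cumulative x'(t) * y'(t)"); plt.show()
# An upward drift means positive correlation, a downward drift means negative.
```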

So I was doing this exercise of plotting the cumulative covariance with some data the other day, and I got a weird picture. It kind of looked like a “U” plot: it went down dramatically at the beginning, then was pretty flat but trending up, then it went straight up at the end. It ended up not quite as high as it started, which is to say that in terms of straight-up overall correlation, I was calculating something negative but not very large.

But what could account for that U-shape? After some time I realized that the data had been extracted from the database in such a way that, after ordering my data by date, it was hugely biased in the beginning and at the end, in different directions, and that this was unavoidable, and the picture helped me determine exactly which data to exclude from my set.

After getting rid of the biased data at the beginning and the end, I concluded that I had a positive correlation here, even though if I’d trusted the overall “dirty” correlation I would have thought it was negative.

This is good information, and confirmed my belief that it’s always better to visualize data over time than it is to believe one summary statistic like correlation.

Categories: data science, modeling

Data Skeptic post

I wrote a blog post for O’Reilly’s website to accompany my essay, On Being a Data Skeptic. Here’s an excerpt:

I left finance pretty disgusted with the whole thing, and because I needed to make money and because I’m a nerd, I pretty quickly realized I could rebrand myself a “data scientist” and get a pretty cool job, and that’s what I did. Once I started working in the field, though, I was kind of shocked by how positive everyone was about the “big data revolution” and the “power of data science.”

Not to underestimate the power of data––it’s clearly powerful! And big data has the potential to really revolutionize the way we live our lives for the better––or sometimes not. It really depends.

From my perspective, this was, in tenor if not in the details, the same stuff we’d been doing in finance for a couple of decades and that fields like advertising were slow to pick up on. And, also from my perspective, people needed to be way more careful and skeptical of their powers than they currently seem to be. Because whereas in finance we need to worry about models manipulating the market, in data science we need to worry about models manipulating people, which is in fact scarier. Modelers, if anything, have a bigger responsibility now than ever before.

Categories: data science, finance, modeling

Guest post: Rage against the algorithms

This is a guest post by Nick Diakopoulos, a Tow Fellow at the Columbia University Graduate School of Journalism where he is researching the use of data and algorithms in the news. You can find out more about his research and other projects on his website or by following him on Twitter. Crossposted from engenhonetwork with permission from the author.


How can we know the biases of a piece of software? By reverse engineering it, of course.

When was the last time you read an online review about a local business or service on a platform like Yelp? Of course you want to make sure the local plumber you hire is honest, or that even if the date is a dud, at least the restaurant isn’t lousy. A recent survey found that 76 percent of consumers check online reviews before buying, so a lot can hinge on a good or bad review. Such sites have become so important to local businesses that it’s not uncommon for scheming owners to hire shills to boost themselves or put down their rivals.

To protect users from getting duped by fake reviews Yelp employs an algorithmic review reviewer which constantly scans reviews and relegates suspicious ones to a “filtered reviews” page, effectively de-emphasizing them without deleting them entirely. But of course that algorithm is not perfect, and it sometimes de-emphasizes legitimate reviews and leaves actual fakes intact—oops. Some businesses have complained, alleging that the filter can incorrectly remove all of their most positive reviews, leaving them with a lowly one- or two-stars average.

This is just one example of how algorithms are becoming ever more important in society, for everything from search engine personalization, discrimination, defamation, and censorship online, to how teachers are evaluated, how markets work, how political campaigns are run, and even how something like immigration is policed. Algorithms, driven by vast troves of data, are the new power brokers in society, both in the corporate world as well as in government.

They have biases like the rest of us. And they make mistakes. But they’re opaque, hiding their secrets behind layers of complexity. How can we deal with the power that algorithms may exert on us? How can we better understand where they might be wronging us?

Transparency is the vogue response to this problem right now. The big “open data” transparency-in-government push that started in 2009 was largely the result of an executive memo from President Obama. And of course corporations are on board too; Google publishes a biannual transparency report showing how often they remove or disclose information to governments. Transparency is an effective tool for inculcating public trust and is even the way journalists are now trained to deal with the hole where mighty Objectivity once stood.

But transparency knows some bounds. For example, though the Freedom of Information Act facilitates the public’s right to relevant government data, it has no legal teeth for compelling the government to disclose how that data was algorithmically generated or used in publicly relevant decisions (extensions worth considering).

Moreover, corporations have self-imposed limits on how transparent they want to be, since exposing too many details of their proprietary systems may undermine a competitive advantage (trade secrets), or leave the system open to gaming and manipulation. Furthermore, whereas transparency of data can be achieved simply by publishing a spreadsheet or database, transparency of an algorithm can be much more complex, resulting in additional labor costs both in creation as well as consumption of that information—a cognitive overload that keeps all but the most determined at bay. Methods for usable transparency need to be developed so that the relevant aspects of an algorithm can be presented in an understandable way.

Given the challenges to employing transparency as a check on algorithmic power, a new and complementary alternative is emerging. I call it algorithmic accountability reporting. At its core it’s really about reverse engineering—articulating the specifications of a system through a rigorous examination drawing on domain knowledge, observation, and deduction to unearth a model of how that system works.

As interest grows in understanding the broader impacts of algorithms, this kind of accountability reporting is already happening in some newsrooms, as well as in academic circles. At the Wall Street Journal a team of reporters probed e-commerce platforms to identify instances of potential price discrimination in dynamic and personalized online pricing. By polling different websites they were able to spot several, such as Staples.com, that were adjusting prices dynamically based on the location of the person visiting the site. At the Daily Beast, reporter Michael Keller dove into the iPhone spelling correction feature to help surface patterns of censorship and see which words, like “abortion,” the phone wouldn’t correct if they were misspelled. In my own investigation for Slate, I traced the contours of the editorial criteria embedded in search engine autocomplete algorithms. By collecting hundreds of autocompletions for queries relating to sex and violence I was able to ascertain which terms Google and Bing were blocking or censoring, uncovering mistakes in how these algorithms apply their editorial criteria.

All of these stories share a more or less common method. Algorithms are essentially black boxes, exposing an input and output without betraying any of their inner organs. You can’t see what’s going on inside directly, but if you vary the inputs in enough different ways and pay close attention to the outputs, you can start piecing together some likeness for how the algorithm transforms each input into an output. The black box starts to divulge some secrets.
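In code, the skeleton of that probing process is simple. The black box below is just a placeholder callable standing in for whatever system can be queried (a web form, a search engine, a pricing page), and the example query fields are made up:

```python
def probe(black_box, base_query: dict, variations: dict) -> list:
    """Vary one field of the query at a time and record how the output changes.

    black_box: any callable that can be queried (hypothetical here).
    base_query: a typical input, e.g. {"item": "stapler", "zip": "10027"}.
    variations: field name -> list of alternative values to try.
    """
    baseline = black_box(base_query)
    results = []
    for field, values in variations.items():
        for value in values:
            output = black_box(dict(base_query, **{field: value}))
            results.append({"field": field, "value": value,
                            "output": output, "baseline": baseline})
    return results

# e.g. probing a hypothetical price-quote function for location-based differences:
# probe(get_price_quote, {"item": "stapler", "zip": "10027"},
#       {"zip": ["10027", "94103", "60601"]})
```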

Algorithmic accountability is also gaining traction in academia. At Harvard, Latanya Sweeney has looked at how online advertisements can be biased by the racial association of names used as queries. When you search for “black names” as opposed to “white names” ads using the word “arrest” appeared more often for online background check service Instant Checkmate. She thinks the disparity in the use of “arrest” suggests a discriminatory connection between race and crime. Her method, as with all of the other examples above, does point to a weakness though: Is the discrimination caused by Google, by Instant Checkmate, or simply by pre-existing societal biases? We don’t know, and correlation does not equal intention. As much as algorithmic accountability can help us diagnose the existence of a problem, we have to go deeper and do more journalistic-style reporting to understand the motivations or intentions behind an algorithm. We still need to answer the question of why.

And this is why it’s absolutely essential to have computational journalists not just engaging in the reverse engineering of algorithms, but also reporting and digging deeper into the motives and design intentions behind algorithms. Sure, it can be hard to convince companies running such algorithms to open up in detail about how their algorithms work, but interviews can still uncover details about larger goals and objectives built into an algorithm, better contextualizing a reverse-engineering analysis. Transparency is still important here too, as it adds to the information that can be used to characterize the technical system.

Despite the fact that forward thinkers like Larry Lessig have been writing for some time about how code is a lever on behavior, we’re still in the early days of developing methods for holding that code and its influence accountable. “There’s no conventional or obvious approach to it. It’s a lot of testing or trial and error, and it’s hard to teach in any uniform way,” noted Jeremy Singer-Vine, a reporter and programmer who worked on the WSJ price discrimination story. It will always be a messy business with lots of room for creativity, but given the growing power that algorithms wield in society it’s vital to continue to develop, codify, and teach more formalized methods of algorithmic accountability. In the absence of new legal measures, it may just provide a novel way to shed light on such systems, particularly in cases where transparency doesn’t or can’t offer much clarity.

New Essay, On Being a Data Skeptic, now out

It is available here and is based on a related essay written by Susan Webber entitled “Management’s Great Addiction: It’s time we recognized that we just can’t measure everything.” It is being published by O’Reilly as an e-book.

No, I don’t know who that woman is looking skeptical on the cover. I wish they’d asked me for a picture of a skeptical person, I think my 11-year-old son would’ve done a better job.

Categories: data science, modeling, musing