Archive for the ‘data science’ Category

An Interview And A Notebook

Interview on Junk Charts

Yesterday I was featured on Kaiser Fung's Junk Charts blog in an interview where he kindly refers to me as a "Numbersense Pro". Before this week, my strongest connection to Kaiser Fung was through Andrew Gelman's meta-review of my review and Kaiser's review of Nate Silver's book The Signal And The Noise.

IPython Notebook in Data Journalism

Speaking of Nate Silver, Brian Keegan, a quantitative social scientist from Northeastern University, recently built a very cool IPython notebook (hat tip Ben Zaitlen), replete with a blog post in markdown on the need for openness in journalism (also available here), which revisited a fivethirtyeight article originally written by Walt Hickey on the subject of women in film. Keegan's notebook is truly a model of open data journalism, and the underlying analysis is also interesting, so I hope you have time to read it.

Let’s not replace the SAT with a big data approach

The big news about the SAT is that the College Board, which makes the SAT, has admitted there is a problem: widespread test prep and gaming. As I talked about in this post, the SAT mainly serves to sort people by income.

It shouldn’t be a surprise to anyone when a weak proxy gets gamed. Yesterday I discussed this very thing in the context of Google’s PageRank algorithm, and today it’s student learning aptitude. The question is, what do we do next?

Rick Bookstaber wrote an interesting post yesterday (hat tip Marcos Carreira) with an idea to address the SAT problem with the same approach that I’m guessing Google is addressing the PageRank problem, namely by abandoning the poor proxy and getting a deeper, more involved one. Here’s Bookstaber’s suggestion:

You would think that in the emerging world of big data, where Amazon has gone from recommending books to predicting what your next purchase will be, we should be able to find ways to predict how well a student will do in college, and more than that, predict the colleges where he will thrive and reach his potential.  Colleges have a rich database at their disposal: high school transcripts, socio-economic data such as household income and family educational background, recommendations and the extra-curricular activities of every applicant, and data on performance ex post for those who have attended. For many universities, this is a database that encompasses hundreds of thousands of students.

There are differences from one high school to the next, and the sample a college has from any one high school might be sparse, but high schools and school districts can augment the data with further detail, so that the database can extend beyond those who have applied. And the data available to the colleges can be expanded by orders of magnitude if students agree to share their admission data and their college performance on an anonymized basis. There already are common applications forms used by many schools, so as far as admission data goes, this requires little more than adding an agreement in the college applications to share data; the sort of agreement we already make with Facebook or Google.

The end result, achievable in a few years, is a vast database of high school performance, drilling down to the specific high school, coupled with the colleges where each student applied, was accepted and attended, along with subsequent college performance. Of course, the nature of big data is that it is data, so students are still converted into numerical representations.  But these will cover many dimensions, and those dimensions will better reflect what the students actually do. Each college can approach and analyze the data differently to focus on what they care about.  It is the end of the SAT version of standardization. Colleges can still follow up with interviews, campus tours, and reviews of musical performances, articles, videos of sports, and the like.  But they will have a much better filter in place as they do so.

Two things about this. First, I believe this is largely already happening. I’m not an expert on the usage of student data at colleges and universities, but the peek I’ve had into this industry tells me that the analytics are highly advanced (please add related comments and links if you have them!). And they have more to do with admissions and college aid – and possibly future alumni giving – than any definition of academic success. So I think Bookstaber is being a bit naive and idealistic if he thinks colleges will use this information for good. They already have it and they’re not.

Secondly, I want to think a little bit harder about when the “big, deeper data” approach makes sense. I think it does for teachers to some extent, as I talked about yesterday, because after all it’s part of a job to get evaluated. For that matter I expect this kind of thing to be part of most jobs soon (but it will be interesting to see when and where it stops – I’m pretty sure Bloomberg will never evaluate himself quantitatively).

I don't think it makes sense to evaluate children in the same way, though. After all, we're basically talking about pre-consensual surveillance, not to mention the collection and mining of information far beyond the control of the individual child. And we're proposing to mine demographic and behavioral data to predict future success. This is potentially much more invasive than just one crappy SAT test. Childhood is a time we should do our best to protect, not quantify.

Also, the suggestion that this is less threatening because “the data is anonymized” is misleading. Stripping out names in historical data doesn’t change or obscure the difference between coming from a rich high school or a poor one. In the end you will be judged by how “others like you” performed, and in this regime the system gets off the hook but individuals are held accountable. If you think about it, it’s exactly the opposite of the American dream.

I don't want to be naive. I know colleges will do what they can to learn about their students and to choose students to make themselves look good, at least as long as US News & World Report exists. I'd like to make it a bit harder for them to do so.

The endgame for PageRank

First there was Google Search, and then pretty quickly SEOs came into existence.

SEOs are marketing people hired by businesses to bump up the organic rankings for that business in Google Search results. That means they pay people to make their website more attractive and central to Google Search so they don’t have to pay for ads but will get visitors anyway. And since lots of customers come from search results, this is a big deal for those businesses.

Since Google Search was based on a pretty well-known, pretty open algorithm called PageRank which relies on ranking the interestingness of pages by their links, SEOs’ main jobs were to add links and otherwise fiddle with links to and from the websites of their clients. This worked pretty well at the beginning and the businesses got higher rank and they didn’t have to pay for it, except they did have to pay for the SEOs.
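To make the link-gaming concrete, here's a toy sketch of the PageRank idea in Python. The damping factor, iteration count, and the little "link farm" at the end are my own illustrative assumptions, not anything specific to Google's actual system:

```python
import numpy as np

def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank: `links` maps each page to the pages it links to."""
    pages = sorted(links)
    n = len(pages)
    index = {p: i for i, p in enumerate(pages)}
    rank = np.ones(n) / n
    for _ in range(iterations):
        new_rank = np.full(n, (1 - damping) / n)
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[index[page]] / len(outlinks)
                for target in outlinks:
                    new_rank[index[target]] += share
            else:
                # dangling page: spread its rank evenly over everyone
                new_rank += damping * rank[index[page]] / n
        rank = new_rank
    return dict(zip(pages, rank))

# A little link farm pointing at "client" inflates its rank -- the fiddling SEOs get paid for.
web = {"a": ["b"], "b": ["a", "client"], "client": [],
       "farm1": ["client"], "farm2": ["client"], "farm3": ["client"]}
print(pagerank(web))
```

Run it and "client" comes out with the highest rank purely because four pages point at it, which is exactly the sort of thing SEOs were paid to engineer, and the reason raw links stopped meaning "quality".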

But after a while Google caught on to the gaming and adjusted its search algorithm, and SEOs responded by working harder at gaming the system (see more history here). It got more expensive but still kind of worked, and nowadays SEOs are a big business. And the algorithm war is at full throttle, with some claiming that Google Search results are nowadays all a bunch of crappy, low-quality ads.

This is to be expected, of course, when you use a proxy like “link” to indicate something much deeper and more complex like “quality of website”. Since it’s so high stakes, the gaming acts to decouple the proxy entirely from its original meaning. You end up with something that is in fact the complete opposite of what you’d intended. It’s hard to address except by giving up the proxy altogether and going for something much closer to what you care about.

Recently my friend Jordan Ellenberg sent me an article entitled The Future of PageRank: 13 Experts on the Dwindling Value of the Link. It's an insider article, interviewing 13 SEO experts on how they expect Google to respond to the ongoing gaming of the Google Search algorithm.

The experts don’t all agree on the speed at which this will happen, but there seems to be some kind of consensus that Google will stop relying on links as such and will go to user behavior, online and offline, to rank websites.

If correct, this means that we can expect Google to pump all of our email, browsing, and even GPS data to understand our behaviors in minute detail in order to get at a deeper understanding of how we perceive "quality" and how to monetize that. Because, let's face it, it's all about money. Google wants good organic searches so that people won't abandon its search engine altogether, so it can keep selling ads.

So we're talking GPS on your Android, or sensor data, and everything else it can get its hands on through linking up various data sources (which as I read somewhere is why Google+ still exists at all, but I can't seem to find that article on Google).

It’s kind of creepy all told, and yet I do see something good coming out of it. Namely, it’s what I’ve been saying we should be doing to evaluate teachers, instead of using crappy and gameable standardized tests. We should go deeper and try to define what we actually think makes a good teacher, which will require sensors in the classroom to see if kids are paying attention and are participating and such.

Maybe Google and other creepy tech companies can show us the way on this one, although I don't expect them to explain their techniques in detail, since they want to stay a step ahead of SEOs.

Categories: data science, modeling

Julia Angwin’s Dragnet Nation

I recently devoured Julia Angwin‘s new book Dragnet Nation: A Quest for Privacy, Security, and Freedom in a World of Relentless Surveillance. I actually met Julia a few months ago and talked to her briefly about her upcoming book when I visited the ProPublica office downtown, so it was an extra treat to finally get my hands on the book.

First off, let me just say this is an important book, and it provides a crucial and well-described view into the private data behind the models that I get so worried about. After reading this book you have a good idea of the data landscape as well as many of the things that can currently go wrong for you personally with the associated loss of privacy. So for that reason alone I think this book should be widely read. It's informational.

Julia takes us along her journey of trying to stay off the grid, and for me the most fascinating parts are her “data audit” (Chapter 6), where she tries to figure out what data about her is out there and who has it, and the attempts she makes to clean the web of her data and generally speaking “opt out”, which starts in Chapter 7 but extends beyond that when she makes the decision to get off of gmail and LinkedIn. Spoiler alert: her attempts do not succeed.

From the get go Julia is not a perfectionist, which is a relief. She’s a working mother with a web presence, and she doesn’t want to live in paranoid fear of being tracked. Rather, she wants to make the trackers work harder. She doesn’t want to hand herself over to them on a silver platter. That is already very very hard.

In fact, she goes pretty far, and pays for quite a few different esoteric privacy services; along the way she explores questions like how you decide to trust the weird people who offer those services. At some point she finds herself with two phones – including a “burner”, which made me think she was a character in House of Cards – and one of them was wrapped up in tin foil to avoid the GPS tracking. That was a bit far for me.

Early on in the book she compares the tracking of U.S. citizens with state surveillance in East Germany, and she makes the point that the Stasi would have been amazed by all this technology.

Very true, but here’s the thing. The culture of fear was very different then, and although there’s all this data out there, important distinctions need to be made: both what the data is used for and the extent to which people feel threatened by that usage are very different now.

Julia brought these up as well, and quoted sci-fi writer David Brin: the key question is, who has access, and what do they do with it?

Probably the most interesting moment in the book was when she described the so-called "Wiretapper's Ball", a private conference of private companies selling surveillance hardware and software to governments to track their citizens. Like maybe the Ukrainian government used such stuff when they texted warning messages to protesters.

She quoted the Wiretapper's Ball organizer Jerry Lucas as saying "We don't really get into asking, 'Is it in the public's interest?'".

That’s the closest the book got to what I consider the critical question: to what extent is the public’s interest being pursued, if at all, by all of these data trackers and data miners?

And if the answer is “to no extent, by anyone,” what does that mean in the longer term? Julia doesn’t go much into this from an aggregate viewpoint, since her perspective is both individual and current.

At the end of the book, she makes a few interesting remarks. First, it’s just too much work to stay off the grid, and moreover it’s become entirely commoditized. In other words, you have to either be incredibly sophisticated or incredibly rich to get this done, at least right now. My guess is that, in the future, it will be more about the latter category: privacy will be enjoyed only by those people who can afford it.

Julia also mentions near the end that, even though she didn’t want to get super paranoid, she found herself increasingly inside a world based on fear and well on her way to becoming a “data survivalist,” which didn’t sound pleasant. It is not a lot of fun to be the only person caring about the tracking in a world of blithe acceptance.

Julia had some ways of measuring a tracking system, or "dragnet" as she calls it, which seem to me a good place to start:

julia_angwin

It's a good start.

Speaking tonight at NYC Open Data

March 6, 2014

Tonight I’ll be giving a talk at the NYC Open Data Meetup, organized by Vivian Zhang. I’ll be discussing my essay from last year entitled On Being a Data Skeptic, as well as my Doing Data Science book. I believe there are still spots left if you’d like to attend. The details are as follows:

When: Thursday, March 6, 2014, 7:00 PM to 9:00 PM

Where: Enigma HQ, 520 Broadway, 11th Floor, New York, NY (map)

Schedule:

  • 6:15pm: Doors Open for pizza and casual networking
  • 7:00pm: Workshop begins
  • 8:30pm: Audience Q&A
Categories: data science

How much is your data worth?

I heard an NPR report yesterday with Emily Steel, reporter from the Financial Times, about what kind of attributes make you worth more to advertisers. She has developed an ingenious online calculator here, which you should go play with.

As you can see it cares about things like whether you're about to have a kid or are a new parent, as well as whether you've got some disease around which a well-developed predatory-marketing industry has grown up.

For example, you can bump up your worth to $0.27 from the standard $0.0007 if you're obese, and add another $0.10 if you admit to being the type to buy weight-loss products. And of course data warehouses can only get that much money for your data if they know about your weight, which they may or may not if you don't buy weight-loss products.

The calculator doesn't know everything, and you can experiment with how much it does know, but some of the default assumptions are that it knows my age, gender, education level, and ethnicity. Plenty of assumed information to, say, build an unregulated version of a credit score to bypass the Equal Credit Opportunity Act.
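For what it's worth, the arithmetic above amounts to something like this little sketch; the dollar figures are just the ones quoted from the calculator, and the rules are my guess at how it tallies things, not its actual code:

```python
def data_worth(obese=False, buys_weight_loss_products=False):
    """Toy tally in the spirit of the FT calculator (not its actual rules)."""
    worth = 0.0007                      # the "standard" person, in dollars
    if obese:
        worth = 0.27                    # bumped up *to* this figure
    if buys_weight_loss_products:
        worth += 0.10                   # and a bit more on top
    return worth

print(data_worth(obese=True, buys_weight_loss_products=True))   # 0.37
```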

Here’s a price list with more information from the biggest data warehouser of all, Acxiom.

Categories: data science, modeling

What privacy advocates get wrong

There’s a wicked irony when it comes to many privacy advocates.

They are often narrowly focused on their own individual privacy issues, but when it comes down to it they are typically super educated, well-off nerds with few revolutionary thoughts. In other words, the very people obsessing over their privacy are people who are not particularly vulnerable to the predatory attacks of either the NSA or the private companies that make use of private data.

Let me put it this way. If I’m a data scientist working at a predatory credit card firm, seeking to build a segmentation model to target the most likely highly profitable customers – those that ring up balances and pay off minimums every month, sometimes paying late to accrue extra fees – then if I am profiling a user and notice an ad blocker or some other signal of privacy concerns, chances are that becomes a wealth indicator and I leave them alone. The mere presence of privacy concerns signals that this person isn’t worth pursuing with my manipulative scheme.

If you don't believe me, take a look at a recent Slate article written by Cyrus Nemati and entitled Take My Data Please: How I learned to stop worrying and love a less private internet.

In it he describes how he used to be privacy obsessed, for no better reason than that he liked to stick up a middle finger at those who would collect his data. I think that article should have been called something like, Well-educated white guy was a privacy freak until he realized he didn't have to be because he's a well-educated white guy.

He concludes that he really likes how well customized things are to his particular personality, and that shucks, we should all just appreciate the web and stop fretting.

But here’s the thing, the problem isn’t that companies are using his information to screw Cyrus Nemati. The problem is that the most vulnerable people – the very people that should be concerned with privacy but aren’t – are the ones getting tracked, mined, and screwed.

In other words, it’s silly for certain people to be scrupulously careful about their private data if they are the types of people who get great credit card offers and have a stable well-paid job and are generally healthy. I include myself in this group. I do not prevent myself from being tracked, because I’m not at serious risk.

And I’m not saying nothing can go wrong for those people, including me. Things can, especially if they suddenly lose their jobs or they have kids with health problems or something else happens which puts them into a special category. But generally speaking those people with enough time on their hands and education to worry about these things are not the most vulnerable people.

I hereby challenge Cyrus Nemati to seriously consider who should be concerned about their data being collected, and how we as a society are going to address their concerns. Recent legislation in California is a good start for kids, and I’m glad to see the New York Times editors asking for more.

Categories: data science, rant

What is regulation for?

A couple of days ago I was listening to a recorded webinar on K-12 student data privacy. I found out about it through an education blog I sometimes read called deutsch29, where the blog writer was complaining about "data cheerleaders" on a panel and how important issues are sure to be ignored if everyone on a panel is on the same, pro-data and pro-privatization side.

Well as it turns out deutsch29 was almost correct. Most of the panelists were super bland and pro-data collection by private companies. But the first panelist, Joel Reidenberg from Fordham Law School, reported on the state of data sharing in this country, the state of the law, and the gulf between the two.

I will come back to his report in another post, because it’s super fascinating, and in fact I’d love to interview that guy for my book.

One thing I wanted to mention was the high-level discussion that took place in the webinar on what regulation is for. Specifically, the following important question was asked:

Does every parent have to become a data expert in order to protect their children’s data?

The answer was different depending on who answered it, of course, but one answer that resonated with me was that this is exactly what regulation is for: it exists so that parents can rely on it to protect their children's privacy, just as we expect HIPAA to protect the integrity of our medical data.

I started to like this definition – or attribute, if you will – of regulation, and I wondered how it relates to other kinds of regulation, like in finance, as well as how it would work if you’re arguing with people who hate all regulation.

First of all, I think that the financial industry has figured out how to make things so goddamn complicated that nobody can figure out how to regulate anything well. Moreover, they’ve somehow, at least so far, also been able to insist things need to be this complicated. So even if regulation were meant to allow people to interact with the financial system and at the same time “not be experts,” it’s clearly not wholly working. But what I like about it anyway is the emphasis on this issue of complexity and expertise. It took me a long time to figure out how big a problem that is in finance, but with this definition it goes right to the heart of the issue.

Second, as for the people who argue for de-regulation, I think it helps there too. Most of the time they act like everyone is an omniscient free agent who spends all their time becoming expert on everything. And if that were true, then it's possible that regulation wouldn't be needed (although transparency is key too). The point is that we live in a world where most people have no clue about the issues of data privacy, never mind when it's being shielded by ridiculous and possibly illegal contracts behind their kids' public school system.

Finally, in terms of the potential for protecting kids’ data: here the private companies like InBloom and others are way ahead of regulators, but it’s not because of complexity on the issues so much as the fact that regulators haven’t caught up with technology. At least that’s my optimistic feeling about it. I really think this stuff is solvable in the short term, and considering it involves kids, I think it will have bipartisan support. Plus the education benefits of collecting all this data have not been proven at all, nor do they really require such shitty privacy standards even if they do work.

Categories: data science, finance, modeling

I’m writing a book called Weapons of Math Destruction

I'm incredibly excited to announce that I am writing a book called Weapons of Math Destruction for Random House, with my editor Amanda Cook. There will also be a subtitle which we haven't decided on yet.

Here's how this whole thing went down. First I met my amazing book agent Jay Mandel from William Morris through my buddy Jordan Ellenberg. As many of you know, Jordan is also writing a book but it's much farther along in the process and has already passed the editing phase. Jordan's book is called How Not To Be Wrong and it's already available for pre-order on Amazon.

Anyhoo, Jay spent a few months with me telling me how to write a book proposal, and it was a pretty substantial undertaking actually and required more than just an outline. It was like a short treatment of all the chapters but with two chapters pretty much filled in, including the first, and as you know the first is kind of like an advertisement for the whole rest of the book.

Then, once that proposal was ready, Jay started what he hoped would be a bidding war for the proposal among publishers. He had a whole list of people he talked to from all over the place in the publishing world.

What actually happened though was that Amanda Cook from Crown Publishing, which is part of Random House, was the first person who was interested enough to talk to me about it, and then we hit it off really well, and she made a pre-emptive offer for the book, so the full-on bidding war didn't end up needing to happen. And then just last week she announced the deal in what's called Publishers Marketplace, which is for people inside publishing to keep abreast of the deals and news. The actual link is here, but it's behind a pay wall, so Amanda got me a screen shot:

screenshot_pubmarketplace

If that font is too small, it says something like this:

Harvard math Ph.D., former Wall Street quant, and advisor to the Occupy movement Cathy O’Neil’s WEAPONS OF MATH DESTRUCTION, arguing that mathematical modeling has become a pervasive and destructive force in society—in finance, education, medicine, politics, and the workplace—and showing how current models exacerbate inequality and endanger democracy and how we might rein them in, to Amanda Cook at Crown in a pre-empt by Jay Mandel at William Morris Endeavor (NA).

So as you can tell I’m incredibly excited about the book, and I have tons of ideas about it, but of course I’d love my readers to weigh in on crucial examples of models and industries that you think might get overlooked.

Please, post a comment or send me an email (located on my About page) with your favorite example of a family of models (Value Added Model for teachers is already in!) or a specific model (Value-at-Risk model in finance is already in!) that is illustrative of feedback loops, or perverted incentives, or creepy modeling, or some such concept that you imagine I'll be writing about (or should be!). Thanks so much for your input!

One last thing. I’m aiming to finish the writing part by next Spring, and then the book is actually released about 9 months later. It takes a while. I’m super glad I have had the experience of writing a technical book with O’Reilly as well as the homemade brew Occupy Finance with my Occupy group so I know at least some of the ropes, but even so this is a bit more involved.

 

Categories: data science, modeling, news

Parents fighting back against sharing children’s data with InBloom

There is a movement afoot in New York (and other places) to allow private companies to house and mine tons of information about children and how they learn. It’s being touted as a great way to tailor online learning tools to kids, but it also raises all sorts of potential creepy modeling problems, and one very bad sign is how secretive everything is in terms of privacy issues. Specifically, it’s all being done through school systems and without consulting parents.

In New York it’s being done through InBloom, which I already mentioned here when I talked about big data and surveillance. In that post I related an EducationNewYork report which quoted an official from InBloom as saying that the company “cannot guarantee the security of the information stored … or that the information will not be intercepted when it is being transmitted.”

The issue is super important and timely, and parents have been left out of the loop, with no opt-out option, and are actively fighting back, for example with this petition from MoveOn (h/t George Peacock). And although the InBloomers claim that no data about their kids will ever be sold, that doesn’t mean it won’t be used by third parties for various mining purposes and possibly marketing – say for test prep tools. In fact that’s a major feature of InBloom’s computer and data infrastructure, the ability for third parties to plug into the data. Not cool that this is being done on the downlow.

Who's behind this? InBloom is funded by the Bill & Melinda Gates Foundation and the operating system for inBloom is being developed by the Amplify division (formerly Wireless Generation) of Rupert Murdoch's News Corp. More about the Murdoch connection here.

Wait, who's paying for this? Besides the Gates Foundation and Murdoch, New York has spent $50 million in federal grants to set up the partnership with InBloom. And it's not only New York that is pushing back, according to this Salon article:

InBloom essentially offers off-site digital storage for student data—names, addresses, phone numbers, attendance, test scores, health records—formatted in a way that enables third-party education applications to use it. When inBloom was launched in February, the company announced partnerships with school districts in nine states, and parents were outraged. Fears of a “national database” of student information spread. Critics said that school districts, through inBloom, were giving their children’s confidential data away to companies who sought to profit by proposing a solution to a problem that does not exist. Since then, all but three of those nine states have backed out.

Finally, according to this nydailynews article, Bill de Blasio is coming out on the side of protecting children’s privacy as well. That’s a good sign, let’s hope he sticks with it.

I’m not against using technology to learn, and in fact I think it’s inevitable and possibly very useful. But first we need to have a really good, public discussion about how this data is being shared, controlled, and protected, and that simply hasn’t happened. I’m glad to see parents are aware of this as a problem.

Categories: data science, modeling, news, rant

Computer, do I really want to get married?

There’s a new breed of models out there nowadays that reads your face for subtle expressions of emotions, possibly stuff that normal humans cannot pick up on. You can read more about it here, but suffice it to say it’s a perfect target for computers – something that is free information, that can be trained over many many examples, and then deployed everywhere and anywhere, even without our knowledge since surveillance cameras are so ubiquitous.

Plus, there are new studies that show that, whether you’re aware of it or not, a certain “gut feeling”, which researchers can get at by asking a few questions, will expose whether your marriage is likely to work out.

Let’s put these two together. I don’t think it’s too much of a stretch to imagine that surveillance cameras strategically placed at an altar can now make predictions on the length and strength of a marriage.

Oh goodie!

I guess it brings up the following question: is there some information we are better off not knowing? I don’t think knowing my marriage is likely to be in trouble would help me keep the faith. And every marriage needs a good dose of faith.

I heard a radio show about Huntington’s disease. There’s no cure for it, but there is a simple genetic test to see if you’ve got it, and it usually starts in adulthood so there’s plenty of time for adults to see their parents degenerate and start to worry about themselves.

But here’s the thing, only 5% of people who have a 50% chance of having Huntington’s actually take that test. For them the value of not knowing that information is larger than knowing. Of course knowing you don’t have it is better still, but until that happens the ambiguity is preferable.

Maybe what’s critical is that there’s no cure. I mean, if there was therapy that would help Huntington’s disease sufferers delay it or ameliorate it, I think we’d see far more people taking that genetic marker test.

And similarly, if there were ways to save a marriage that is at risk, we might want to know at the altar what the prognosis is. Right?

I still don’t know. Somehow, when things get that personal and intimate, I’d rather be left alone, even if an algorithm could help me “optimize my love life”. But maybe that’s just me being old-fashioned, and maybe in 100 years people will treat their computers like love oracles.

Categories: data science, modeling, news

PDF Liberation Hackathon: January 17-19

This is a guest post by Marc Joffe, the principal consultant at Public Sector Credit Solutions, an organization that provides data and analysis related to sovereign and municipal securities. Previously, Joffe was a Senior Director at Moody’s Analytics.

As Cathy has argued, open source models can bring much needed transparency to scientific research, finance, education and other fields plagued by biased, self-serving analytics. Models often need large volumes of data, and if the model is to be run on an ongoing basis, regular data updates are required.

Unfortunately, many data sets are not ready to be loaded into your analytical tool of choice; they arrive in an unstructured form and must be organized into a consistent set of rows and columns. This cleaning process can be quite costly. Since open source modeling efforts are usually low dollar operations, the costs of data cleaning may prove to be prohibitive. Hence no open model – distortion and bias continue their reign.

Much data comes to us in the form of PDFs. Say, for example, you want to model student loan securitizations. You will be confronted with a large number of PDF servicing reports that look like this. A corporation or well funded research institution can purchase an expensive, enterprise-level ETL (Extract-Transform-Load) tool to migrate data from the PDFs into a database. But this is not much help to insurgent modelers who want to produce open source work.

Data journalists face a similar challenge. They often need to extract bulk data from PDFs to support their reporting. Examples include IRS Form 990s filed by non-profits and budgets issued by governments at all levels.

The data journalism community has responded to this challenge by developing software to harvest usable information from PDFs. One example is Tabula, a tool written by Knight-Mozilla OpenNews Fellow Manuel Aristarán, which extracts data from PDF tables in a form that can be readily imported to a spreadsheet – if the PDF was "printed" from a computer application. Introduced earlier this year, Tabula continues to evolve thanks to the volunteer efforts of Manuel, with help from OpenNews Fellow Mike Tigas and New York Times interactive developer Jeremy Merrill. Meanwhile, DocHive, a tool whose continuing development is being funded by a Knight Foundation grant, addresses PDFs that were created by scanning paper documents. DocHive is a project of Raleigh Public Record and is led by Charles and Edward Duncan.

These open source tools join a number of commercial offerings such as Able2Extract and ABBYY Fine Reader that extract data from PDFs. A more comprehensive list of open source and commercial resources is available here.
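If you just want a feel for what the extraction step looks like in code, here's a minimal sketch using the pdfplumber Python library (which is not one of the tools named above); the file paths and output layout are made up, and this approach only works on PDFs that were "printed" from an application rather than scanned:

```python
# Sketch: pull every table out of a folder of PDFs into one flat CSV.
import csv
import glob
import pdfplumber

with open("servicing_reports.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for path in glob.glob("reports/*.pdf"):
        with pdfplumber.open(path) as pdf:
            for page in pdf.pages:
                for table in page.extract_tables():
                    for row in table:
                        writer.writerow([path] + row)
```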

Unfortunately, the free and low cost tools available to modelers, data journalists and transparency advocates have limitations that hinder their ability to handle large scale tasks. If, like me, you want to submit hundreds of PDFs to a software tool, press “Go” and see large volumes of cleanly formatted data, you are out of luck.

It is for this reason that I am working with The Sunlight Foundation and other sponsors to stage the PDF Liberation Hackathon from January 17-19, 2014. We’ll have hack sites at Sunlight’s Washington DC office and at RallyPad in San Francisco. Developers can also join remotely because we will publish a number of clearly specified PDF extraction challenges before the hackathon.

Participants can work on one of the pre-specified challenges or choose their own PDF extraction projects. Ideally, hackathon teams will use (and hopefully improve upon) open source tools to meet the hacking challenges, but they will also be allowed to embed commercial tools into their projects as long as their licensing cost is less than $1000 and an unlimited trial is available.

Prizes of up to $500 will be awarded to winning entries. To receive a prize, a team must publish their source code on a GitHub public repository. To join the hackathon in DC or remotely, please sign up at Eventbrite; to hack with us in SF, please sign up via this Meetup. Please also complete our Google Form survey. Also, if anyone reading this is associated with an organization in New York or Chicago that would like to organize an additional hack space, please contact me.

The PDF Liberation Hackathon is going to be a great opportunity to advance the state of the art when it comes to harvesting data from public documents. I hope you can join us.

Algorithmic Accountability Reporting: On the Investigation of Black Boxes

Tonight I’m going to be on a panel over at Columbia’s Journalism School called Algorithmic Accountability Reporting: On the Investigation of Black Boxes. It’s being organized by Nick Diakopoulos, Tow Fellow and previous guest blogger on mathbabe. You can sign up to come here and it will also be livestreamed.

The other panelists are Scott Klein from ProPublica and Clifford Stein from Columbia. I’m super excited to meet them.

Unlike some panel discussions I’ve been on, where the panelists talk about some topic they choose for a few minutes each and then there are questions, this panel will be centered around a draft of a paper coming from the Tow Center at Columbia. First Nick will present the paper and then the panelists will respond to it. Then there will be Q&A.

I wish I could share it with you but it doesn’t seem publicly available yet. Suffice it to say it has many elements in common with Nick’s guest post on raging against the algorithms, and its overall goal is to understand how investigative journalism should handle a world filled with black box algorithms.

Super interesting stuff, and I’m looking forward to tonight, even if it means I’ll miss the New Day New York rally in Foley Square tonight.

Categories: data science, modeling

“People analytics” embeds old cultural problems in new mathematical models

Today I'd like to discuss a recent article from the Atlantic entitled "They're watching you at work" (hat tip Deb Gieringer).

In the article they describe what they call “people analytics,” which refers to the new suite of managerial tools meant to help find and evaluate employees of firms. The first generation of this stuff happened in the 1950′s, and relied on stuff like personality tests. It didn’t seem to work very well and people stopped using it.

But maybe this new generation of big data models can be super useful? Maybe they will give us an awesome way of more efficiently throwing away people who won't work out and keeping those who will?

Here’s an example from the article. Royal Dutch Shell sources ideas for “business disruption” and wants to know which ideas to look into. There’s an app for that, apparently, written by a Silicon Valley start-up called Knack.

Specifically, Knack had a bunch of the ideamakers play a video game, and they presumably also were given training data on which ideas historically worked out. Knack developed a model and was able to give Royal Dutch Shell a template for which ideas to pursue in the future based on the personality of the ideamakers.

From the perspective of Royal Dutch Shell, this represents a huge timesaving. But from my perspective it means that whatever process the dudes at Royal Dutch Shell developed for vetting their ideas has now been effectively set in stone, at least for as long as the algorithm is being used.

I’m not saying they won’t save time, they very well might. I’m saying that, whatever their process used to be, it’s now embedded in an algorithm. So if they gave preference to a certain kind of arrogance, maybe because the people in charge of vetting identified with that, then the algorithm has encoded it.
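To see how a vetting process gets frozen into code, here's a toy sketch of the kind of pipeline being described; the gameplay features, labels, and model choice are all invented for illustration and have nothing to do with Knack's actual methods:

```python
# Toy sketch: train a classifier on gameplay-derived features, with labels taken
# from which ideas the old vetting process approved. The model can only learn to
# reproduce that process -- every name and number here is invented.
from sklearn.ensemble import RandomForestClassifier
import numpy as np

rng = np.random.default_rng(0)
n = 500
# invented gameplay features: risk-taking, speed, persistence
X = rng.normal(size=(n, 3))
# invented historical labels: the old committee happened to favor fast risk-takers
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n) > 1).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# New ideamakers are now scored against yesterday's preferences.
print(model.predict_proba(rng.normal(size=(5, 3)))[:, 1])
```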

One consequence is that they might very well pass on really excellent ideas that happened to have come from a modest person – no discussion necessary on what kind of people are being invisibly ignored in such a set-up. Another consequence is that they will believe their process is now objective because it's living inside a mathematical model.

The article compares this to the “blind auditions” for orchestras example, where people are kept behind a curtain so that the listeners don’t give extra consideration to their friends. Famously, the consequence of blind auditions has been way more women in orchestras. But that’s an extremely misleading comparison to the above algorithmic hiring software, and here’s why.

In the blind auditions case, the people measuring the musician’s ability have committed themselves to exactly one clean definition of readiness for being a member of the orchestra, namely the sound of the person playing the instrument. And they accept or deny someone, sight unseen, based solely on that evaluation metric.

Whereas with the idea-vetting process above, the training data consisted of "previous winners" who presumably had to go through a series of meetings and convince everyone in the meeting that their idea had merit, and that they could manage the team to try it out, and all sorts of other things. Their success relied, in other words, on a community's support of their idea and their ability to command that support.

In other words, imagine that, instead of listening to someone playing trombone behind a curtain, their evaluation metric was to compare a given musician to other musicians that had already played in a similar orchestra and, just to make it super success-based, had made first seat.

Then you'd have a very different selection criterion, and a very different algorithm. It would be based on all sorts of personality issues, and community bias and buy-in issues. In particular you'd still have way more men.

The fundamental difference here is one of transparency. In the blind auditions case, everyone agrees beforehand to judge on a single transparent and appealing dimension. In the black box algorithms case, you’re not sure what you’re judging things on, but you can see when a candidate comes along that is somehow “like previous winners.”

One of the most frustrating things about this industry of hiring algorithms is how unlikely it is to actively fail. It will save time for its users, since after all computers can efficiently throw away “people who aren’t like people who have succeeded in your culture or process” once they’ve been told what that means.

The most obvious consequence of using this model, for the companies that use it, is that they’ll get more and more people just like the people they already have. And that’s surprisingly unnoticeable for people in such companies.

My conclusion is that these algorithms don't make things objective, they make things opaque. And they embed our old cultural problems in new mathematical models, giving us a false badge of objectivity.

Categories: data science, modeling, rant

Cool open-source models?

I’m looking to develop my idea of open models, which I motivated here and started to describe here. I wrote the post in March 2012, but the need for such a platform has only become more obvious.

I’m lucky to be working with a super fantastic python guy on this, and the details are under wraps, but let’s just say it’s exciting.

So I’m looking to showcase a few good models to start with, preferably in python, but the critical ingredient is that they’re open source. They don’t have to be great, because the point is to see their flaws and possible to improve them.

  1. For example, I put in a FOIA request a couple of days ago to get the current teacher value-added model from New York City.
  2. A friend of mine, Marc Joffe, has an open source municipal credit rating model. It’s not in python but I’m hopeful we can work with it anyway.
  3. I’m in search of an open source credit scoring model for individuals. Does anyone know of something like that?
  4. They don’t have to be creepy! How about a Nate Silver-style weather model?
  5. Or something that relies on open government data?
  6. Can we get the Reinhart-Rogoff model?

The idea here is to get the model, not necessarily the data (although even better if it can be attached to data and updated regularly). And once we get a model, we’d build interactives with the model (like this one), or at least the tools to do so, so other people could build them.

At its core, the point of open models is this: you don't really know what a model does until you can interact with it. You don't know if a model is robust unless you can fiddle with its parameters and check. And finally, you don't know if a model is the best possible unless you've let people try to improve it.
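Here's a tiny sketch of what "fiddling with the parameters" could look like once a model is open; the stand-in model and the perturbation sizes are purely illustrative:

```python
# Perturb each input of an open model and watch how much the output moves --
# a crude robustness check. The toy model below is a stand-in, not any
# particular model mentioned here.
def toy_model(params):
    growth, debt_ratio = params["growth"], params["debt_ratio"]
    return growth - 0.02 * max(debt_ratio - 0.9, 0)

base = {"growth": 0.03, "debt_ratio": 1.0}
print("baseline:", toy_model(base))
for name in base:
    for bump in (-0.1, 0.1):
        tweaked = dict(base, **{name: base[name] * (1 + bump)})
        print(f"{name} {bump:+.0%} -> {toy_model(tweaked):.4f}")
```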

Twitter and its modeling war

I often talk about the modeling war, and I usually mean the one where the modelers are on one side and the public is on the other. The modelers are working hard trying to convince or trick the public into clicking or buying or consuming or taking out loans or buying insurance, and the public is on the other, barely aware that they’re engaging in anything at all resembling a war.

But there are plenty of other modeling wars that are being fought by two sides which are both sophisticated. To name a couple, Anonymous versus the NSA and Anonymous versus itself.

Here’s another, and it’s kind of bland but pretty simple: Twitter bots versus Twitter.

This war arose from the fact that people care about how many followers someone on Twitter has. It’s a measure of a person’s influence, albeit a crappy one for various reasons (and not just because it’s being gamed).

The high impact of the follower count means it’s in a wannabe celebrity’s best interest to juice their follower numbers, which introduces the idea of fake twitter accounts to game the model. This is an industry in itself, and an associated arms race of spam filters to get rid of them. The question is, who’s winning this arms race and why?

Twitter has historically made some strides in finding and removing such fake accounts with the help of some modelers who actually bought the services of a spammer and looked carefully at what their money bought them. Recently though, at least according to this WSJ article, it looks like Twitter has spent less energy pursuing the spammers.

It raises the question: why? After all, Twitter has a lot theoretically at stake. Namely, its reputation, because if everyone knows how gamed the system is, they'll stop trusting it. On the other hand, that argument only really holds if people have something else to use instead as a better proxy of influence.

Even so, considering that Twitter has a bazillion dollars in the bank right now, you’d think they’d spend a few hundred thousand a year to prevent their reputation from being too tarnished. And maybe they’re doing that, but the spammers seem to be happily working away in spite of that.

And judging from my experience on Twitter recently, there are plenty of spammers actively degrading the user experience. That brings up my final point, which is that the lack-of-competition argument at some point gives way to the "I don't want to be spammed" user experience argument. At some point, if Twitter doesn't maintain standards, people will just not spend time on Twitter, and its proxy of influence will fall out of favor for that more fundamental reason.

Categories: data science, modeling

Crisis Text Line: Using Data to Help Teens in Crisis

This morning I’m helping out at a datadive event set up by DataKind (apologies to Aunt Pythia lovers).

The idea is that we’re analyzing metadata around a texting hotline for teens in crisis. We’re trying to see if we can use the information we have on these texts (timestamps, character length, topic – which is most often suicide – and outcome reported by both the texter and the counselor) to help the counselors improve their responses.

For example, right now counselors can be in up to 5 conversations at a time – is that too many? Can we figure that out from the data? Is there too much waiting between texts? Other questions are listed here.
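As a sketch of how we might poke at those questions with the metadata, something like the following; the column names and file are invented stand-ins for the real (and private) data:

```python
# Sketch: does counselor concurrency or response lag relate to reported outcomes?
import pandas as pd

texts = pd.read_csv("text_metadata.csv", parse_dates=["timestamp"])
texts = texts.sort_values(["conversation_id", "timestamp"])
# response lag between consecutive texts within each conversation
texts["lag_sec"] = texts.groupby("conversation_id")["timestamp"].diff().dt.total_seconds()

convos = texts.groupby("conversation_id").agg(
    counselor=("counselor_id", "first"),
    concurrent=("concurrent_conversations", "max"),
    median_lag=("lag_sec", "median"),
    outcome=("outcome_score", "last"),
)
# Does the outcome degrade as a counselor juggles more conversations at once?
print(convos.groupby("concurrent")[["median_lag", "outcome"]].mean())
```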

Our “hackpad” is located here, and will hopefully be updated like a wiki with results and visuals from the exploration of our group. It looks like we have a pretty amazing group of nerds over here looking into this (mostly python users!), and I’m hopeful that we will be helping the good people at Crisis Text Line.

There is no “market solution” for ethics

We saw what happened in finance with self-regulation and ethics. Let’s prepare for the exact same thing in big data.

Finance

Remember back in the 1970s through the 1990s, when the powers that were decided that we didn't need to regulate banks because "they" wouldn't put "their" best interests at risk? And then came the financial crisis, and most recently Alan Greenspan's admission that he'd gotten it kinda wrong but not really.

Let’s look at what the “self-regulated market” in derivatives has bestowed upon us. We’ve got a bunch of captured regulators and a huge group of bankers who insist on keeping derivatives opaque so that they can charge clients bigger fees, not to mention that they insist on not having fiduciary duties to their clients, and oh yes, they’d like to continue to bet depositors’ money on those derivatives. They wrote the regulation themselves for that one. And this is after they blew up the world and got saved by the taxpayers.

Given that the banks write the regulations, it’s arguably still kind of a self-regulated market in finance. So we can see how ethics has been and is faring in such a culture.

The answer is, not well. Just in case the last 5 years of news articles weren't enough to persuade you of this fact, here's what NY Fed Chief Dudley had to say recently about big banks and the culture of ethics, from this Huffington Post article:

“Collectively, these enhancements to our current regime may not solve another important problem evident within some large financial institutions — the apparent lack of respect for law, regulation and the public trust,” he said.

“There is evidence of deep-seated cultural and ethical failures at many large financial institutions,” he continued. “Whether this is due to size and complexity, bad incentives, or some other issues is difficult to judge, but it is another critical problem that needs to be addressed.”

Given that my beat is now more focused on the big data community and less on finance, mostly since I haven’t worked in finance for almost 2 years, this kind of stuff always makes me wonder how ethics is faring in the big data world, which is, again, largely self-regulated.

Big data

According to this ComputerWorld article, things are pretty good. I mean, there are the occasional snafus – unappreciated sensors or unreasonable zip code gathering examples – but the general idea is that, as long as you have a transparent data privacy policy, you’ll be just fine.

Examples of how awesome “transparency” is in these cases vary from letting people know what cookies are being used (BlueKai), to promising not to share certain information between vendors (Retention Science), to allowing customers a limited view into their profiling by Acxiom, the biggest consumer information warehouse. Here’s what I assume a typical reaction might be to this last one.

Wow! I know a few things Acxiom knows about me, but probably not all! How helpful. I really trust those guys now.

Not a solution

What’s great about letting customers know exactly what you’re doing with their data is that you can then turn around and complain that customers don’t understand or care about privacy policies. In any case, it’s on them to evaluate and argue their specific complaints. Which of course they don’t do, because they can’t possibly do all that work and have a life, and if they really care they just boycott the product altogether. The result in any case is a meaningless, one-sided conversation where the tech company only hears good news.

Oh, and you can also declare that customers are just really confused and don’t even know what they want:

In a recent Infosys global survey, 39% of the respondents said that they consider data mining invasive. And 72% said they don’t feel that the online promotions or emails they receive speak to their personal interests and needs.

Conclusion: people must want us to collect even more of their information so they can get really really awesome ads.

Finally, if you make the point that people shouldn’t be expected to be data mining and privacy experts to use the web, the issue of a “market solution for ethics” is raised.

“The market will provide a mechanism quicker than legislation will,” he says. “There is going to be more and more control of your data, and more clarity on what you’re getting in return. Companies that insist on not being transparent are going to look outdated.”

Back to ethics

What we’ve got here is a repeat problem. The goal of tech companies is to make money off of consumers, just as the goal of banks is to make money off of investors (and taxpayers as a last resort).

Given how much these incentives clash, the experts on the inside have figured out a way of continuing to do their thing, make money, and at the same time, keeping a facade of the consumer’s trust. It’s really well set up for that since there are so many technical terms and fancy math models. Perfect for obfuscation.

If tech companies really did care about the consumer, they’d help set up reasonable guidelines and rules on these issues, which could easily be turned into law. Instead they send lobbyists to water down any and all regulation. They’ve even recently created a new superPAC for big data (h/t Matt Stoller).

And although it’s true that policy makers are totally ignorant of the actual issues here, that might be because of the way big data professionals talk down to them and keep them ignorant. It’s obvious that tech companies are desperate for policy makers to stay out of any actual informed conversation about these issues, never mind the public.

Conclusion

There never has been, nor will there ever be, a market solution for ethics so long as the basic incentives between the public and an industry are so misaligned. The public needs to be represented somehow, and without rules and regulations, and without leverage of any kind, that will not happen.

Categories: data science, finance, modeling

How do you know when you’ve solved your data problem?

I've been really impressed by how consistently people have gone to read my post "K-Nearest Neighbors: dangerously simple," which I wrote back in April. Here's a timeline of hits on that post:

Stats for "K-Nearest Neighbors: dangerously simple." I've actually gotten more hits recently.

I think the interest in this post is that people like having myths debunked, and are particularly interested in hearing how even the simple things that they thought they understood are possibly wrong, or at least more complicated than they'd been assuming. Either that or it's just got a real catchy name.

Anyway, since I’m still getting hits on that post, I’m also still getting comments, and just this morning I came across a new comment by someone who calls herself “travelingactuary”. Here it is:

My understanding is that CEOs hate technical details, but do like results. So, they wouldn’t care if you used K-Nearest Neighbors, neural nets, or one that you invented yourself, so long as it actually solved a business problem for them. I guess the problem everyone faces is, if the business problem remains, is it because the analysis was lacking or some other reason? If the business is ‘solved’ is it actually solved or did someone just get lucky? That being so, if the business actually needs the classifier to classify correctly, you better hire someone who knows what they’re doing, rather than hoping the software will do it for you.

Presumably you want to sell something to Monica, and the next n Monicas who show up. If your model finds a whole lot of big spenders who then don’t, your technophobe CEO is still liable to think there’s something wrong.

I think this comment brings up the right question, namely knowing when you’ve solved your data problem, with K-Nearest Neighbors or whichever algorithms you’ve chosen to use. Unfortunately, it’s not that easy.

Here’s the thing, it’s almost never possible to tell if a data problem is truly solved. I mean, it might be a business problem where you go from losing money to making money, and in that sense you could say it’s been “solved.” But in terms of modeling, it’s very rarely a binary thing.

Why do I say that? Because, at least in my experience, it’s rare that you could possibly hope for high accuracy when you model stuff, even if it’s a classification problem. Most of the time you’re trying to achieve something better than random, some kind of edge. Often an edge is enough, but it’s nearly impossible to know if you’ve gotten the biggest edge possible.

For example, say you're binning people who come to your site into three equally sized groups, as "high spenders," "medium spenders," and "low spenders." So if the model were random, you'd expect a third to be put into each group, and that someone who ends up as a big spender is equally likely to be in any of the three bins.

Next, say you make a model that's better than random. How would you know that? You can measure that, for example, by comparing it to the random model, or in other words by seeing how much better you do than random. So if someone who ends up being a big spender is three times more likely to have been labeled a big spender than a low spender and twice as likely as a medium spender, you know your model is "working."

You’d use those numbers, 3x and 2x, as a way of measuring the edge your model is giving you. You might care about other related numbers more, like whether pegged low spenders are actually low spenders. It’s up to you to decide what it means that the model is working. But even when you’ve done that carefully, and set up a daily updated monitor, the model itself still might not be optimal, and you might still be losing money.
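Here's a toy sketch of measuring that edge against the random one-in-three baseline; the spending labels and predictions are made up for illustration:

```python
# Among true high spenders, how much more often did the model label them in each
# bin than a random one-in-three assignment would have?
from collections import Counter

def lift(actual, predicted, target="high"):
    labels = [p for a, p in zip(actual, predicted) if a == target]
    counts = Counter(labels)
    baseline = len(labels) / 3          # random model: a third in each bin
    return {bin_: counts[bin_] / baseline for bin_ in ("high", "medium", "low")}

actual    = ["high", "high", "low", "medium", "high", "low", "high", "medium", "high", "low"]
predicted = ["high", "high", "low", "high", "medium", "low", "high", "medium", "high", "low"]
print(lift(actual, predicted))   # e.g. {'high': 2.4, 'medium': 0.6, 'low': 0.0}
```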

In other words, you can be a bad modeler or a good modeler, and either way when you try to solve a specific problem you won’t really know if you did the best possible job you could have, or someone else could have with their different tools and talents.

Even so, there are standards that good modelers should follow. First and most importantly, you should always set up a model monitor to keep track of the quality of the model and see how it fares over time. Why? Because, second, you should always assume that, over time, your model will degrade, even if you are updating it regularly or even automatically. It's of course good to know how crappy things are getting so you don't have a false sense of accomplishment.
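A model monitor doesn't have to be fancy. Here's a minimal sketch of the idea; the metric, the threshold, and the daily numbers are all assumptions for illustration:

```python
# Recompute the model's edge on fresh data each day and flag degradation
# relative to the trailing average.
def monitor(history, todays_edge, warn_drop=0.25):
    if history:
        trailing = sum(history) / len(history)
        if todays_edge < trailing * (1 - warn_drop):
            print(f"WARNING: edge {todays_edge:.2f} vs trailing {trailing:.2f} -- model degrading")
    history.append(todays_edge)

edges = []
for daily_edge in [3.0, 2.9, 2.8, 2.1, 1.8]:   # made-up daily lift numbers
    monitor(edges, daily_edge)
```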

Keep in mind that just because it's getting worse doesn't mean you can easily start over again and do better. But at least you can try, and you will know when it's worth a try. So, that's one thing that's good about admitting your inability to finish anything.

On to the political aspect of this issue. If you work for a CEO who absolutely hates ambiguity – and CEOs are trained to hate ambiguity, as well as trained to never hesitate – and if that CEO wants more than anything to think their data problem has been "solved," then you might be tempted to argue that you've done a phenomenal job just to make her happy. But if you're honest, you won't say that, because it ain't true.

Ironically and for these reasons, some of the most honest data people end up looking like crappy scientists because they never claim to be finished doing their job.

Categories: data science, modeling

The private-data-for-services trade fallacy

I had a great time at Harvard Wednesday giving my talk (prezi here) about modeling challenges. The audience was fantastic and truly interdisciplinary, and they pushed back and challenged me in a great way. I’m glad I went and I’m glad Tess Wise invited me.

One issue that came up is something I want to talk about today, because I hear it all the time and it’s really starting to bug me.

Namely, the fallacy that people, especially young people, are “happy to give away their private data in order to get the services they love on the internet”. The actual quote came from the IBM guy on the congressional subcommittee panel on big data, which I blogged about here (point #7), but I’ve started to hear that reasoning more and more often from people who insist on side-stepping the issue of data privacy regulation.

Here’s the thing. It’s not that people don’t click “yes” on those privacy forms. They do click yes, and I acknowledge that. The real problem is that people generally have no clue what it is they’re trading.

In other words, this idea of an omniscient market participant with perfect information making a well-informed trade, which we've already seen is not the case in the actual market, is doubly or triply not the case when you think about young people giving away private data for the sake of a phone app.

Just to be clear about what these market participants don’t know, I’ll make a short list:

  • They probably don’t know that their data is aggregated, bought, and sold by Acxiom, which they’ve probably never heard of.
  • They probably don’t know that Facebook and other social media companies sell stuff about them even if their friends don’t see it and even though it’s often “de-identified”. Think about this next time you sign up for a service like “Bang With Friends,” which works through Facebook.
  • They probably don’t know how good algorithms are getting at identifying de-identified information (see the sketch after this list).
  • They probably don’t know how this kind of information is used by companies to profile users who ask for credit or try to get a job.
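On that third point, here's a toy sketch of why "de-identified" is weaker than it sounds: you just count how many records remain unique on a handful of quasi-identifiers. The column names and file are invented stand-ins:

```python
# How many "anonymous" records are unique on zip code, birth date, and gender alone?
import pandas as pd

people = pd.read_csv("deidentified_records.csv")   # no names, but...
quasi = ["zip_code", "birth_date", "gender"]
group_sizes = people.groupby(quasi).size()
unique_share = (group_sizes == 1).sum() / len(people)
print(f"{unique_share:.0%} of records are unique on {quasi} alone")
```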

Conclusion: people are ignorant of what they’re giving away to play Candy Crush Saga[1]. And whatever it is they’re giving away, it’s something way far in the future that they’re not worried about right now. In any case it’s not a fair trade by any means, and we should stop referring to it as such.

What is it instead? I’d say it’s a trick. A trick which plays on our own impulses and short-sightedness and possibly even a kind of addiction to shiny toys in the form of candy. If you give me your future, I’ll give you a shiny toy to play with right now. People who click “yes” are not signaling that they’ve thought deeply about the consequences of giving their data away, and they are certainly not making the definitive political statement that we don’t need privacy regulation.

1. I actually don’t know the data privacy rules for Candy Crush and can’t seem to find them, for example here. Please tell me if you know what they are.

Categories: data science, modeling, rant