I’m psyched to see Suresh Naidu tonight in the first Data Skeptics Meetup. He’s talking about Political Uses and Abuses of Data and his abstract is this:
While a lot has been made of the use of technology for election campaigns, little discussion has focused on other political uses of data. From targeting dissidents and tax-evaders to organizing protests, the same datasets and analytics that let data scientists do prediction of consumer and voter behavior can also be used to forecast political opponents, mobilize likely leaders, solve collective problems and generally push people around. In this discussion, Suresh will put this in a 1000 year government data-collection perspective, and talk about how data science might be getting used in authoritarian countries, both by regimes and their opponents.
Given the recent articles highlighting this kind of stuff, I’m sure the topic will provoke a lively discussion – my favorite kind!
Unfortunately the Meetup is full but I’d love you guys to give suggestions for more speakers and/or more topics.
This is a guest post by Rachel Law, a conceptual artist, designer and programmer living in Brooklyn, New York. She recently graduated from the Parsons MFA Design & Technology program. Her practice is centered around social myths and how technology facilitates the creation of new communities. Currently she is writing a book with McKenzie Wark called W.A.N.T., about new ways of analyzing networks and debunking ‘mapping’.
Let’s start with a timely question. How would you like to be able to change how you are identified by online networks? We’ll talk more about how you’re currently identified below, but for now just imagine having control over that process for once – how would that feel? Vortex is something I’ve invented that will try to make that happen.
Namely, Vortex is a data management game that allows players to swap cookies, change IPs and disguise their locations. Through play, individuals experience how their browser changes in real time when different cookies are equipped. Vortex is a proof of concept that illustrates how network collisions in gameplay expose contours of a network determined by consumer behavior.
What happens when users are allowed to swap cookies?
Cookies, placed by marketers to track behavioral patterns, are stored on our personal devices, from mobile phones to laptops to tablets, as a symbolic and data-driven signifier of who we are. In other words, to the eyes of the database, the cookies are us. They are our identities, controlling the way we use, browse and experience the web. Depending on cookie type, they might follow us across multiple websites, save entire histories of how we navigate and look at things, and pass this information to companies while still living inside our devices.
If we have the ability to swap cookies, the debate on privacy shifts from relying on corporations to follow regulations to empowering users by giving them the opportunity to manage how they want to be perceived by the network.
What are cookies?
The corporate technological ability to track customers and piece together entire personal histories is a recent development. While there are several ways of doing so, the most common and prevalent method is the HTTP cookie. Invented in 1994 by a computer programmer, Lou Montulli, HTTP cookies were originally created for the shopping cart system, as a way for the computer to store the current state of a session – i.e. how many items were in the cart – without overloading the company’s server. These session histories were saved on each user’s computer or individual device, where companies accessed and updated consumer history constantly as a form of ‘internet history’. Information such as where you clicked, how you clicked, what you clicked first, and your general purchasing history and preferences was all saved in your browsing history and accessed by companies through cookies.
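The mechanism Montulli invented is easy to sketch. Here is a minimal illustration (not his original code) using Python’s standard library, showing a server storing cart state in a cookie and reading it back on the next request:

```python
from http.cookies import SimpleCookie

# Server response: store the cart state on the user's machine,
# not on the company's server.
response = SimpleCookie()
response["cart_items"] = "3"
response["cart_items"]["path"] = "/"
set_cookie_header = response.output(header="Set-Cookie:")

# On the next request, the browser sends the cookie back and the
# server reconstructs the session state from it.
request = SimpleCookie()
request.load("cart_items=3")
print(request["cart_items"].value)  # prints "3"
```

The key point is in the first comment: the state lives on the user’s device, which is exactly why companies ended up with a foothold inside personal computers.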
Cookies were originally deployed to the general public without their knowledge, until the Financial Times published an article on February 12th, 1996 about how they were made and used on websites. The revelation led to a public outcry over privacy, especially since data was being gathered without the knowledge or consent of users. In addition, corporations had access to information stored on personal computers, since the cookie sessions were stored on your computer and not on their servers.
At the center of the debate was the issue of third-party cookies, also known as “persistent” or “tracking” cookies. When you browse a webpage, some components on the page may be hosted on a different server and domain than the page itself. These external objects can set cookies of their own, and advertising and media-mining corporations use them to track users across multiple sites, building up knowledge of each user’s browsing patterns to create more specific and targeted advertising.
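The cross-site trick can be sketched in a few lines. In this toy simulation (all domains and identifiers invented), a tracker embedded on two different sites logs visits against one cookie, linking browsing that neither site could connect on its own:

```python
# Toy model of third-party tracking: an ad server embedded on many
# sites reads back the same cookie everywhere, linking the visits.
tracker_log = {}

def visit(page_domain, tracker_cookie_id):
    """The page embeds content from the tracker's domain, which logs
    the visit against the cookie it previously set on this browser."""
    tracker_log.setdefault(tracker_cookie_id, []).append(page_domain)

visit("news.example", "user-7f3a")
visit("shoes.example", "user-7f3a")

# The tracker now has a cross-site profile keyed to one cookie,
# even though neither site shared any data with the other.
print(tracker_log["user-7f3a"])  # prints ['news.example', 'shoes.example']
```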
In 2012, the Wall Street Journal ran an article on how Mac users were being targeted by travel site Orbitz with offers that were 13% more expensive than those shown to PC users. The New York Times followed it up with a similar article in November 2012 about how such data was collected and re-sold to advertisers, who would analyze users’ buying habits to create micro-categories in which each personal experience was tailored to maximize potential profits.
What does that mean for us?
The internet of today is no longer the carefree internet of the 90s, with its ‘internet democracy’ and utopian ‘cyberspace’. Media mining exploits invasive technologies such as IP tracking, geolocation and cookies to create advertisements targeted to specific individuals. Browsing is now determined by your consumer profile: what you see, hear and receive in your feeds is tailored from your friends’ lists, emails, online purchases, etc. The ‘Internet’ does not exist. Instead, there are many overlapping filter bubbles which selectively curate us into data objects to be consumed and purchased by advertisers.
This information, though nominally anonymous, is built up over time and used to track and trace an individual’s history – sometimes spanning an entire lifetime. Who you are and your real name are irrelevant at the overall scale of collected data, which depersonalizes and dehumanizes you into nothing but a list of numbers on a spreadsheet.
The superstore Target provides a useful case study in data profiling through its use of statisticians on its marketing teams. In 2002, Target realized that when a couple is expecting a child, the way they shop and purchase products changes, but it needed a tool to see and take advantage of the pattern. So it asked mathematicians to come up with algorithms that would identify the behavioral patterns of a newly expectant mother and push direct marketing materials her way. In a public relations fiasco, Target sent maternity and infant care advertisements to a household, inadvertently revealing that the teenage daughter was pregnant before she had told her parents.
This build-up of information creates a ‘database of ruin’: marketers and advertisers end up knowing more about your life and predictive patterns than any other single entity. These databases can predict whether you’re expecting, when you’ve moved, or what stage of life and income level you’re at – information over which you have no control: you cannot say where it goes, who reads it or how it is used. More importantly, these databases have collected enough information to know secrets such as a family history of illness, criminal or drug records, or other private information that could cause real harm to the person behind the data point if released – without ever needing to know his or her name.
This leaves us with two terrifying possibilities:
- Corporate databases with information about you, your family and friends that you have zero control over, including sensitive information such as health, criminal/drug records etc. that are bought and re-sold to other companies for profit maximization.
- New forms of discrimination where your buying/consumer habits determine which level of internet you can access, or what kind of internet you can experience. This discrimination is so insidious because it happens on a user account level which you cannot see unless you have access to other people’s accounts.
Here’s a visual describing this process:
What can Vortex do, and where can I download a copy?
As Vortex lives in the browser, it can manage both pseudo-identities (invented) and ‘real’ identities shared with you by other users. These identity profiles are created by mining websites for cookies, swapping them with friends, and arranging and re-arranging them to create new experiences. By swapping identities, you are essentially ‘disguised’ as someone else – the network or website will not be able to recognize you. The idea is that being completely anonymous is difficult, but being someone else and hiding behind misinformation is easy.
This does not mean a death knell for online shopping or e-commerce. For instance, if a user decides to go shoe-shopping for summer, he or she could equip the browser with the cookies most associated with shopping, shoes and summer. Targeted advertising becomes a targeted choice for both advertisers and users. Advertisers will not have to worry about mis-targeting inappropriate advertisements (e.g. showing tampon advertisements to a boyfriend who happened to borrow his girlfriend’s laptop), and at the same time users can choose what kind of advertisements they want to see. (Summer is coming: maybe it’s time to load up all those cookies linked to shoes and summer and beaches and see what websites have to offer; or disable cookies completely if you hate summer apparel.)
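Conceptually, ‘equipping’ a profile just means choosing which jar of cookies the browser presents. A toy sketch of the idea (the profile names and cookie values are my inventions, not Vortex’s actual data model):

```python
# Toy model of Vortex-style cookie profiles: a site sees whichever
# jar of cookies is currently "equipped" in the browser.
profiles = {
    "summer_shopper": {"interest": "shoes_summer", "ad_segment": "beachwear"},
    "blank": {},
}

def equip(profile_name):
    """Return the cookies a website would see for this identity."""
    return dict(profiles[profile_name])

print(equip("summer_shopper"))  # the site sees the shopping identity
print(equip("blank"))           # the site sees no tracking history at all
```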
Currently the game is a working prototype/demo. The code is licensed under Creative Commons and will be available on GitHub by the end of summer. I am trying to get funding to make it free, safe and easy to use, but right now I’m broke from grad school, and a proper back-end still needs to be built so that accounts can be created safely and cannot be intercepted. If you have any questions about technical specs, or an interest in collaborating to make it happen – I’m particularly looking for people versed in Python/MongoDB – please email me: Rachel@milkred.net.
This is a guest post by Marc Joffe, the principal consultant at Public Sector Credit Solutions, an organization that provides data and analysis related to sovereign and municipal securities. Previously, Joffe was a Senior Director at Moody’s Analytics for more than a decade.
Note to readers: for a bit of background on the SEC Credit Ratings Roundtable and the Franken Amendment see this recent mathbabe post.
I just returned from Washington after participating in the SEC’s Credit Ratings Roundtable. The experience was very educational, and I wanted to share what I’ve learned with readers interested in financial industry reform.
First and foremost, I learned that the Franken Amendment is dead. While I am not a proponent of this idea – under which the SEC would have set up a ratings agency assignment authority – I do welcome its intentions and mourn its passing. Thus, I want to take some time to explain why I think this idea is dead, and what financial reformers need to do differently if they want to see serious reforms enacted.
The Franken Amendment, as revised by the Dodd Frank conference committee, tasked the SEC with investigating the possibility of setting up a ratings assignment authority and then executing its decision. Within the SEC, the responsibility for Franken Amendment activities fell upon the Office of Credit Ratings (OCR), a relatively new office created under the Dodd-Frank Act.
OCR circulated a request for comments – posting the request on its web site and in the federal register – a typical SEC procedure. The majority of serious comments OCR received came from NRSROs and others with a vested interest in perpetuating the status quo or some close approximation thereof. Few comments came from proponents of the Franken Amendment, and some of those that did were inarticulate (e.g., a note from Joe Sixpack of Anywhere, USA saying that rating agencies are terrible and we just gotta do something about them).
OCR summarized the comments in its December 2012 study of the Franken Amendment. Progressives appear to have been shocked that OCR’s work product was not an originally-conceived comprehensive blueprint for a re-imagined credit rating business. Such an expectation is unreasonable. SEC regulators sit in Washington and New York; not Silicon Valley. There is little upside and plenty of political downside to taking major risks. Regulators are also heavily influenced by the folks they regulate, since these are the people they talk to on a day-to-day basis.
Political theorists Charles Lindblom and Aaron Wildavsky developed a theory that explains the SEC’s policymaking process quite well: it is called incrementalism. Rather than implement brand new ideas, policymakers prefer to make marginal changes by building upon and revising existing concepts.
While I can understand why Progressives think the SEC should “get off its ass” and really fix the financial industry, their critique is not based in the real world. The SEC is what it is. It will remain under budget pressure for the foreseeable future because campaign donors want to restrict its activities. Staff will always be influenced by financial industry players, and out-of-the-box thinking will be limited by the prevailing incentives.
Proponents of the Franken Amendment and other Progressive reforms have to work within this system to get their reforms enacted. How? The answer is simple: when a request for comment arises they need to stuff the ballot box with varying and well informed letters supporting reform. The letters need to place proposed reforms within the context of the existing system, and respond to anticipated objections from status quo players. If 20 Progressive academics and Occupy-leaning financial industry veterans had submitted thoughtful, reality-based letters advocating the Franken Amendment, I believe the outcome would have been very different. (I should note that Occupy the SEC has produced a number of comment letters, but they did not comment on the Franken Amendment and I believe they generally send a single letter).
While the Franken Amendment may be dead, I am cautiously optimistic about the lifecycle of my own baby: open source credit rating models. I’ll start by explaining how I ended up on the panel and then conclude by discussing what I think my appearance achieved.
The concept of open source credit rating models is extremely obscure. I suspect that no more than a few hundred people worldwide understand this idea, and fewer than a dozen have any serious investment in it. Your humble author and one person on his payroll are probably the world’s only two people who actually dedicated more than 100 hours to this concept in 2012.
That said, I do want to acknowledge that the idea of open source credit rating models is not original to me – although I was not aware of other advocacy before I embraced it. Two Bay Area technologists started FreeRisk, a company devoted to open source risk models, in 2009. They folded the company without releasing a product and went on to more successful pursuits. FreeRisk left a “paper” trail for me to find including an article on the P2P Foundation’s wiki. FreeRisk’s founders also collaborated with Cate Long, a staunch advocate of financial markets transparency, to create riski.us – a financial regulation wiki.
In 2011, Cathy O’Neil (a.k.a. Mathbabe), an influential Progressive blogger with a quantitative finance background, ran a post about the idea of open source credit ratings, generating several positive comments. Cathy also runs the Alternative Banking group, an affiliate of Occupy Wall Street that attracts a number of financially literate activists.
I stumbled across Cathy’s blog while Googling “open source credit ratings”, sent her an email, had a positive phone conversation and got an invitation to address her group. Cathy then blogged about my open source credit rating work. This too was picked up on the P2P Foundation wiki, leading ultimately to a Skype call with the leader of the P2P Foundation, Michel Bauwens. Since then, Michel – a popularizer of progressive, collaborative concepts – has offered a number of suggestions about organizations to contact and made a number of introductions.
Most of my outreach attempts on behalf of this idea – either made directly or through an introduction – are ignored or greeted with terse rejections. I am not a proven thought leader, am not affiliated with a major research university and lack a resume that includes any position of high repute or authority. Consequently, I am only a half-step removed from the many “crackpots” that send around their unsolicited ideas to all and sundry.
Thus, it is surprising that I was given the chance to address the SEC Roundtable on May 14. The fact that I was able to get an invitation speaks well of the SEC’s process and is thus worth recounting. In October 2012, SEC Commissioner Dan Gallagher spoke at the Stanford Rock Center for Corporate Governance. He mentioned that the SEC was struggling with the task of implementing Dodd Frank Section 939A, which calls for the replacement of credit ratings in federal regulations, such as those that govern asset selection by money market funds.
After his talk, I pitched him the idea of open source credit ratings as an alternative creditworthiness standard that would satisfy the intentions of 939A. He suggested that I write to Tom Butler, head of the Office of Credit Ratings (OCR), and copy him. This led to a number of phone calls and ultimately a presentation to OCR staff in New York in January. Staff members who joined the meeting were engaged and asked good questions. I connected my proposal to an earlier SEC draft regulation which would have required structured finance issuers to publish cashflow waterfall models in Python – a popular open source language.
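The waterfall logic itself is simple to express. Here is a hypothetical two-tranche sketch in Python (all figures and names invented), just to show what the core of such a published model might look like:

```python
def run_waterfall(cash, tranches):
    """Distribute available cash to tranches in order of seniority.
    Each tranche is a (name, amount_due) pair; returns the payments made."""
    payments = {}
    for name, due in tranches:
        paid = min(cash, due)
        payments[name] = paid
        cash -= paid
    return payments

# Senior bondholders are paid in full before juniors see a cent.
tranches = [("senior", 80.0), ("junior", 50.0)]
print(run_waterfall(100.0, tranches))  # prints {'senior': 80.0, 'junior': 20.0}
```

If models like this were open source, anyone could re-run the waterfall against the deal documents instead of trusting an opaque rating.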
I walked away from the meeting with the perception that, while they did not want to reinvent the industry, OCR staff were sincerely interested in new ideas that might create incremental improvements. That meeting led to my inclusion in the third panel of the Credit Ratings Roundtable.
For me, the panel discussion itself was mostly positive. Between the opening statement, questions and discussion, I probably had about 8 minutes to express my views. I put across all the points I hoped to make and even received a positive comment from one of the other panelists. On the downside, only one commissioner attended my panel – whereas all five had been present at the beginning of the day when Al Franken, Jules Kroll, Doug Peterson and other luminaries held the stage.
The roundtable generated less media attention than I expected, but I got an above average share of the limited coverage relative to the day’s other 25 panelists. The highlight was a mention in the Wall Street Journal in its pre-roundtable coverage.
Perhaps the fact that I addressed the SEC will make it easier for me to place op-eds and get speaking engagements to promote the open source ratings concept. Only time will tell. Ultimately, someone with a bigger reputation than mine will need to advocate this concept before it can progress to the next level.
Also, the idea is now part of the published record of SEC deliberations. The odds of it getting into a proposed regulation remain long in the near future, but these odds are much shorter than they were prior to the roundtable.
Political scientist John Kingdon coined the term “policy entrepreneurs” to describe people who look for and exploit opportunities to inject new ideas into the policy discussion. I like to think of myself as a policy entrepreneur, although I have a long way to go before I become a successful one. If you have read this far and also have strongly held beliefs about how the financial system should improve, I suggest you apply the concepts of incrementalism and policy entrepreneurship to your own activism.
This is a guest post by Adam Obeng, a Ph.D. candidate in the Sociology Department at Columbia University. His work encompasses computational social science, social network analysis and sociological theory (basically anything which constitutes an excuse to sit in front of a terminal for inadvisably long periods of time). This post is Copyright Adam Obeng 2013 and licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Crossposted on adamobeng.com.
Eben Moglen’s delivery leaves you in no doubt as to the sincerity of this sentiment. Stripy-tied, be-hatted and pocket-squared, he took to the stage at last week’s IDSE Seminar Series event without slides, but with engaging — one might say, prosecutorial — delivery. Lest anyone doubt his neckbeard credentials, he let slip that he had participated in the development of almost certainly the first networked email system in the United States, as well as mentioning his current work for the Freedom Box Foundation and the Software Freedom Law Center.
A superorganism called humankind
The content was no less captivating than the delivery: we were invited to consider the world where every human consciousness is connected by an artificial extra-skeletal nervous system, linking everyone into a new superorganism. What we refer to as data science is the nascent study of flows of neural data in that network. And having access to the data will entirely transform what the social sciences can explain: we will finally have a predictive understanding of human behaviour, based not on introspection but empirical science. It will do for the social sciences what Newton did for physics.
The reason the science of the nervous system – “this wonderful terrible art” – is optimised to study human behaviour is that consumption and entertainment are a large part of economic activity. The subjects of the network don’t own it. In a society which is more about consumption than production, the technology of economic power will be that which affects consumption. Indeed, what we produce becomes information about consumption, which is itself used to drive consumption. Moglen is matter-of-fact: this will happen, and is happening.
And it’s also ineluctable that this science will be used to extend the reach of political authority, and it has the capacity to regiment human behaviour completely. It’s not entirely deterministic that it should happen at a particular place and time, but extrapolation from history suggests that somewhere, that’s how it’s going to be used, that’s how it’s going to come out, because it can. Whatever is possible to engineer will eventually be done. And once it’s happened somewhere, it will happen elsewhere. Unlike the components of other super-organisms, humans possess consciousness. Indeed, it is the relationship between sociality and consciousness that we call the human condition. The advent of the human species-being threatens that balance.
The Oppenheimer moment
Moglen’s vision of the future is, as he describes it, both familiar and strange. But his main point is, as he puts it, very modest: unless you are sure that this future is absolutely 0% possible, you should engage in the discussion of its ethics.
First, when the network is wrapped around every human brain, privacy will be nothing more than a relic of the human past. He believes that privacy is critical to creativity and freedom, but really the assumption that privacy – the ability to make decisions independent of the machines – should be preserved is axiomatic.
What is crucial about privacy is that it is not personal, or even bilateral; it is ecological: how others behave determines the meaning of the actions I take. As such, dealing with privacy requires an ecological ethics. It is irrelevant whether you consent to be delivered poisonous drinking water; we don’t regulate such resources by allowing individuals to decide how unsafe they can afford their drinking water to be. Similarly, whether you opt in or opt out of being tracked online is irrelevant.
The existing questions of ethics that science has had to deal with – how to handle human subjects – are of no use here: informed consent is only sufficient when the risks that investigating a human subject produces apply only to that individual.
These ethical questions are for citizens, but perhaps even more so for those in the business of making products from personal information. Whatever goes on to be produced from your data will be trivially traced back to you. Whatever finished product you are used to make, you do not disappear from it. What’s more, the scientists are beholden to the very few secretive holders of data.
Consider, says Moglen, the question of whether punishment deters crime: there will be increasing amounts of data about it, but we’re not even going to ask – because no advertising sale depends on it. Consider also the prospect of machines training humans, which is already beginning to happen. The Coursera business model is set to do to the global labour market what Google did to the global advertising market: auctioning off the good learners, found via their learning patterns, to employers. Granted, defeating ignorance on a global scale is within grasp. But there are still ethical questions here, and evil is ethics undealt with.
One of the criticisms often levelled at techno-utopians is that the enabling power of technology can very easily be stymied by human factors, politics, the constants of our species, which cannot be overwritten by mere scientific progress. Moglen could perhaps be called a techno-dystopian, but he has recognised that while the technology is coming, inevitably, how it will affect us depends on how we decide to use it.
But these decisions cannot just be made at the individual level, Moglen pointed out, we’ve changed everything except the way people think. I can’t say that I wholeheartedly agree with either Moglen’s assumptions or his conclusions, but he is obviously asking important questions, and he has shown the form in which they need to be asked.
Another doubt: as a social scientist, I’m also not convinced that having all these data available will make all human behaviour predictable. We’ve catalogued a billion stars, the Large Hadron Collider has produced a hundred thousand million million bytes of data, and yet we’re still trying to find new specific solutions to the three-body problem. I don’t think that just having more data is enough. I’m not convinced, but I don’t think it’s 0% possible.
I’ve discussed the broken business model that is the credit rating agency system in this country on a few occasions. It directly contributed to the opacity and fraud in the MBS market and to the ensuing financial crisis, for example. And in this post and then this one, I suggest that someone should start an open source version of credit rating agencies. Here’s my explanation:
The system of credit ratings undermines the trust of even the most fervently pro-business entrepreneur out there. The models are knowingly gamed by both sides, and the system is clearly both corrupt and important. It’s also a bipartisan issue: Republicans and Democrats alike should want transparency when it comes to modeling downgrades – at the very least so they can argue against the results in a factual way. There’s no reason I can see why there shouldn’t be broad support for a rule to force the ratings agencies to make their models publicly available. In other words, this isn’t a political game that would score points for one side or the other.
Well, it wasn’t long before Marc Joffe, who had started an open source credit rating agency, contacted me and came to my Occupy group to explain his plan, which I blogged about here. That was almost a year ago.
Today the SEC is going to have something they’re calling a Credit Ratings Roundtable. This is in response to an amendment that Senator Al Franken put on Dodd-Frank which requires the SEC to examine the credit rating industry. From their webpage description of the event:
The roundtable will consist of three panels:
- The first panel will discuss the potential creation of a credit rating assignment system for asset-backed securities.
- The second panel will discuss the effectiveness of the SEC’s current system to encourage unsolicited ratings of asset-backed securities.
- The third panel will discuss other alternatives to the current issuer-pay business model in which the issuer selects and pays the firm it wants to provide credit ratings for its securities.
Marc is going to be one of something like 9 people in the third panel. He wrote this op-ed piece about his goal for the panel, a key excerpt being the following:
Section 939A of the Dodd-Frank Act requires regulatory agencies to replace references to NRSRO ratings in their regulations with alternative standards of credit-worthiness. I suggest that the output of a certified, open source credit model be included in regulations as a standard of credit-worthiness.
Just to be clear: the current problem is not only widespread gaming, but also a near monopoly by the “big three” credit rating agencies, and for whatever reason that monopoly status has been incredibly well protected by the SEC. They don’t grant “NRSRO” status to a credit rating agency unless it can produce something like 10 letters from clients vouching that it has provided them credit ratings for at least 3 years. You can see why this is a hard business to break into.
The Roundtable was covered yesterday in the Wall Street Journal as well: Ratings Firms Steer Clear of an Overhaul – an unfortunate title if you are trying to be optimistic about the event today. From the WSJ article:
Mr. Franken’s amendment requires the SEC to create a board that would assign a rating firm to evaluate structured-finance deals or come up with another option to eliminate conflicts.
While lawsuits filed against S&P in February by the U.S. government and more than a dozen states refocused unflattering attention on the bond-rating industry, efforts to upend its reliance on issuers have languished, partly because of a lack of consensus on what to do.
I’m just kind of amazed that, given how dirty and obviously broken this industry is, we can’t do better than this. SEC, please start doing your job. How could allowing an open-source credit rating agency hurt our country? How could it make things worse?
Yesterday I wrote this short post about my concerns about the emerging field of e-discovery. As usual the comments were amazing and informative. By the end of the day yesterday I realized I needed to make a much more nuanced point here.
Namely, I see a tacit choice being made, probably by judges or court-appointed “experts”, on how machine learning is used in discovery, and I think that the field could get better or worse. I think we need to urgently discuss this matter, before we wander into a crazy place.
And to be sure, the current discovery process is fraught with opacity and human judgment, so complaining about those features being present in a machine learning version of discovery is unreasonable – the question is whether it’s better or worse than the current system.
Making it worse: private code, opacity
The way I see it, if we allow private companies to build black box machines that we can’t peer into, nor keep track of as they change versions, then we’ll never know why a given set of documents was deemed “relevant” in a given case. We can’t, for example, check to see if the code was modified to be more friendly to a given side.
Setting aside the (perfectly healthy) competition among companies for this new revenue source, the resulting feedback loop will likely be a negative one, whereby private companies use the cheapest version of the software they can get away with while achieving the best results (for their clients) that they can argue for.
Making it better: open source code, reproducibility
What we should be striving for is to use only open source software, saved in a repository so we can document exactly what happened with a given corpus and a given version of the tools. There will still be an industry around cleaning the data, feeding in the documents, training the algorithm (while documenting how that works), and interpreting the results. Data scientists will still get paid.
In other words, instead of asking for interpretability, which is a huge ask considering the massive scale of the work being done, we should, at the very least, be able to ask for reproducibility of the e-discovery, as well as transparency in the code itself.
Why reproducibility? Then we can go back in time, or rather scholars can, and test how things might have changed if a different version of the code were used, for example. This could create a feedback loop crucial to improve the code itself over time, and to improve best practices for using that code.
Today I want to bring up a few observations and concerns I have about the emergence of a new field in machine learning called e-discovery. It’s the algorithmic version of discovery, so I’ll start there.
Discovery is part of the process in a lawsuit where relevant documents are selected, pored over, and then handed to the other side. Nowadays, of course, there are more and more documents, almost all electronic, typically including lots of e-mails.
If you’re talking about a big lawsuit, there could be literally millions of documents to wade through, which makes human review incredibly expensive and time-consuming. Enter the algorithm.
With advances in Natural Language Processing (NLP), a machine algorithm can sort emails or documents by topic (after getting the documents into machine-readable form, cleaning, and deduping) and can in general do a pretty good job of figuring out whether a given email is “relevant” to the case.
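To make that concrete, here’s a toy sketch of a relevance classifier of the kind described above. This is my illustration, not kCura’s actual method: the documents and labels are invented, and a real e-discovery system would be vastly more sophisticated.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hand-labeled training set: 1 = relevant to the case, 0 = not.
train_docs = [
    "meeting notes on the proposed merger terms",
    "pricing discussion ahead of the beer distribution deal",
    "lunch order for the office party",
    "IT ticket: please reset my password",
]
train_labels = [1, 1, 0, 0]

# Vectorize the text and fit a simple classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_docs, train_labels)

# Score new documents by predicted probability of relevance.
new_docs = ["draft agreement on merger pricing", "weekend plans"]
scores = model.predict_proba(new_docs)[:, 1]
for doc, score in sorted(zip(new_docs, scores), key=lambda x: -x[1]):
    print(f"{score:.2f}  {doc}")
```

In real e-discovery the “manual review was repeated” step from the WSJ quote corresponds to iteratively adding human labels and refitting, until the lawyers trust the predictions.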
And this is already happening – the Wall Street Journal recently reported that the Justice Department allowed e-discovery for a case involving the merger of two beer companies. From the article:
With the blessing of the Justice Department’s antitrust division, the lawyers loaded the documents into a program and manually reviewed a batch to train the software to recognize relevant documents. The manual review was repeated until the Justice Department and Constellation were satisfied that the program could accurately predict relevance in the rest of the documents. Lawyers for Constellation and Crown Imports used software developed by kCura Corp., which lists the Justice Department as a client.
In the end, Constellation and Crown Imports turned over hundreds of thousands of documents to antitrust investigators.
Here are some of my questions/concerns:
- These algorithms are typically not open source – companies like kCura make good money doing these jobs.
- That means that they could be wrong, possibly in subtle ways.
- Or maybe not so subtle ways: maybe they’ve been trained to find documents that are both “relevant” and “positive” for a given side.
- In any case, the laws of this country will increasingly depend on a black box algorithm that is not accessible to the average citizen.
- Is that in the public’s interest?
- Is that even constitutional?
When I first met Chris Wiggins of Columbia and hackNY back in 2011, he immediately introduced me to about a hundred other people, which made it obvious that his introductions were highly stereotyped. I thought he was some kind of robot, especially when I started getting emails from his phone which all had the same (long) phrases in them, like “I’m away from my keyboard right now, but when I get back to my desk I’ll calendar prune and send you some free times.”
Finally I was like “what the hell, are you sending me whole auto-generated emails”? To which he replied “of course.”
Feeling cheated, I called him to tell him he has an addiction to shell scripting. Here’s a brief interview, rewritten to make me sound smarter and cooler than I am.
CO: Ok, let’s start with these iphone shortcuts. Sometimes the whole email from you reads like a bunch of shortcuts.
CW: Yup, lots of times.
CO: What the hell? Don’t you want to personalize things for me at least a little?
CW: I do! But I also want to catch the subway.
CO: Ugh. How many shortcuts do you have on that thing?
CW: Well.. (pause)..38.
CO: Ok now I’m officially worried about you. What’s the longest one?
CW: Probably this one I wrote for Sandy: If I write “sandy” it unpacks to
“Sorry for delay and brevity in reply. Sandy knocked out my phone, power, water, and internet so I won’t be replying as quickly as usual. Please do not hesitate to email me again if I don’t reply soon.”
CO: You officially have a problem. What’s the shortest one?
CW: Well, when I type “plu” it becomes “+1”
CO: Ok, let me apply the math for you: your shortcut is longer than your longcut.
CW: I know but not if you include switching from letters to numbers on the iphone, which is annoying.
CO: How did you first become addicted to shortcuts?
CW: I got introduced to UNIX in the 80s and, in my frame of reference at the time, the closest I had come to meeting a wizard was the university’s sysadmin. I was constantly breaking things by chomping cpu with undead processes or removing my $HOME or something, and he had to come in and fix things. I learned a lot over his shoulder. In the summer before I started college, my dream was to be a university sysadmin. He had to explain to me patiently that I shouldn’t spend college in a computercave.
CO: Good advice, but now that you’re a grownup you can do that.
CW: Exactly. Anyway, everytime he would fix whatever incredible mess I had made he would sign off with some different flair and walk out, like he was dropping the mic and walking off stage. He never signed out “logout” it was always “die” or “leave” or “ciao” (I didn’t know that word at the time). So of course by the time he got back to his desk one day there was an email from me asking how to do this and he replied:
CO: That seems like kind of a mean thing to do to you at such a young age.
CW: It’s true. UNIX alias was clearly the gateway drug that led me to writing shell scripts for everything.
CO: How many aliases do you have now?
CW: According to `alias | wc -l`, I have 1137. So far.
CO: So you’ve spent countless hours making aliases to save time.
CW: Yes! And shell scripts!
CO: Ok let’s talk about this script for introducing me to people. As you know I don’t like getting treated like a small cog. I’m kind of a big deal.
CW: Yes, you’ve mentioned that.
CO: So how does it work?
CW: I have separate biography files for everyone, and a file called nfile.asc that has first name, lastname@tag, and email address. Then I can introduce people via
% ii oneil@mathbabe schutt
It strips out the @mathbabe part (so I can keep track of multiple people named oneil) from the actual email, reads in and reformats the biographies, grepping out the commented lines, and writes an email I can pipe to mutt. The whole thing can be done in a few seconds.
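(For the curious, here’s my guess at what a script like that might look like. This is a sketch based on Chris’s description, not his actual code, and the file layout — `<name>.bio` files plus `nfile.asc` — is hypothetical.)

```shell
#!/bin/sh
# ii: introduce two people by email, in the spirit described above.
# Usage: ii oneil@mathbabe schutt
# Assumes biographies live in <name>.bio and nfile.asc holds lines of
# "firstname name@tag email". Hypothetical layout, not Chris's actual one.

ii() {
  a=${1%%@*}   # strip the @tag: oneil@mathbabe -> oneil
  b=${2%%@*}

  # Look up each email address in nfile.asc by the tagged name field.
  email_a=$(awk -v n="$1" '$2 == n {print $3}' nfile.asc)
  email_b=$(awk -v n="$2" '$2 == n {print $3}' nfile.asc)

  # Read in the biographies, grepping out commented lines.
  bio_a=$(grep -v '^#' "$a.bio")
  bio_b=$(grep -v '^#' "$b.bio")

  # Assemble the introduction; the real script pipes this to mutt, e.g.
  #   ... | mutt -s "introduction: $a <> $b" "$email_a" "$email_b"
  printf '%s, meet %s.\n\n%s\n\n%s\n' "$a" "$b" "$bio_a" "$bio_b"
}
```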
CO: Ok that does sound pretty good. How many shell scripts do you have?
CW: Hundreds. A few of them are in my public mise-en-place repository, which I should update more. I’m not sure which of them I really use all the time, but it’s pretty rare I type an actual legal UNIX command at the command line. That said I try never to leave the command line. Students are always teaching me fancypants tricks for their browsers or some new app, but I spend a lot of time at the command line getting and munging data, and for that, sed, awk, and grep are here to stay.
CO: That’s kinda sad and yet… so true. Ok here’s the only question I really wanted to ask though: will you promise me you’ll never send me any more auto-generated emails?
This is a guest post by Julia Evans. Julia is a data scientist & programmer who lives in Montréal. She spends her free time these days playing with data and running events for women who program or want to — she just started a Montréal chapter of pyladies to teach programming, and co-organize a monthly meetup called Montréal All-Girl Hack Night for women who are developers.
I asked mathbabe a question a few weeks ago saying that I’d recently started a data science job without having too much experience with statistics, and she asked me to write something about how I got the job. Needless to say I’m pretty honoured to be a guest blogger here :) Hopefully this will help someone!
Last March I decided that I wanted a job playing with data, since I’d been playing with datasets in my spare time for a while and I really liked it. I had a BSc in pure math, an MSc in theoretical computer science and about 6 months of work experience as a programmer developing websites. I’d taken one machine learning class and zero statistics classes.
In October, I left my web development job with some savings and no immediate plans to find a new job. I was thinking about doing freelance web development. Two weeks later, someone posted a job posting to my department mailing list looking for a “Junior Data Scientist”. I wrote back and said basically “I have a really strong math background and am a pretty good programmer”. This email included, embarrassingly, the sentence “I am amazing at math”. They said they’d like to interview me.
The interview was a lunch meeting. I found out that the company (Via Science) was opening a new office in my city, and was looking for people to be the first employees at the new office. They work with clients to make predictions based on their data.
My interviewer (now my manager) asked me about my role at my previous job (a little bit of everything — programming, system administration, etc.), my math background (lots of pure math, but no stats), and my experience with machine learning (one class, and drawing some graphs for fun). I was asked how I’d approach a digit recognition problem and I said “well, I’d see what people do to solve problems like that, and I’d try that”.
I also talked about some data visualizations I’d worked on for fun. They were looking for someone who could take on new datasets and be independent and proactive about creating models, figuring out the most useful thing to model, and getting more information from clients.
I got a call back about a week after the lunch interview saying that they’d like to hire me. We talked a bit more about the work culture, starting dates, and salary, and then I accepted the offer.
So far I’ve been working here for about four months. I work with a machine learning system developed inside the company (there’s a paper about it here). I’ve spent most of my time working on code to interface with this system and make it easier for us to get results out of it quickly. I alternate between working on this system (using Java) and using Python (with the fabulous IPython Notebook) to quickly draw graphs and make models with scikit-learn to compare our results.
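The “quickly make models with scikit-learn to compare results” workflow looks something like this minimal sketch (my illustration, on a built-in dataset — not Via Science’s actual pipeline or data):

```python
# Fit a couple of baseline models on the same data and compare
# cross-validated scores, the quick-comparison loop described above.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

for name, model in [
    ("linear regression", LinearRegression()),
    ("random forest", RandomForestRegressor(n_estimators=50, random_state=0)),
]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
```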
I like that I have real-world data (sometimes, lots of it!) where there’s not always a clear question or direction to go in. I get to spend time figuring out the relevant features of the data or what kinds of things we should be trying to model. I’m beginning to understand what people say about data-wrangling taking up most of their time. I’m learning some statistics, and we have a weekly Friday seminar series where we take turns talking about something we’ve learned in the last few weeks or introducing a piece of math that we want to use.
Overall I’m really happy to have a job where I get data and have to figure out what direction to take it in, and I’m learning a lot.
MathBabe recently wrote an article critical of the elitist nature of Ted Talks, which you can read here. Fortunately for her, and for the hoi polloi everywhere clamoring for populist science edutainment, there is an alternative: Nerd Nite. Once a month, in cities all over the globe, nerds herd into a local bar and turn it into a low-brow forum for innovative science ideas. Think Ted Talks on tequila.
Each month, three speakers present talks for 20-30 minutes, followed by questions and answers from the invariably sold-out audience. The monthly forum gives professional and amateur scientists an opportunity to explain their fairly abstruse specialties accessibly to a lay audience – a valuable skill. Since the emphasis is on science entertainment, it also gives the speakers a chance to present their ideas in a more engaging way: in iambic pentameter, in drag with a tuba, in three-part harmony, or via interpretive dance – an invaluable skill. The resulting atmosphere is informal, delightfully debauched, and refreshingly pro-science.
Slaking our thirst for both science education and mojitos, Nerd Nite started small but quickly went viral. Nerd Nites are now being held in 50 cities, from San Francisco to Kansas City and Auckland to Liberia. You can find the full listing of cities here; if you don’t see one near you, start one!
Last Wednesday night I was twitterpated to be one of three guest nerds sharing the stage at San Francisco’s Nerd Nite. I put the chic back into geek with a biology talk entitled “Genital Plugs, Projectile Penises, and Gay Butterflies: A Naturalist Explains the Birds and the Bees.”
A video recording of the presentation will be available online soon, but in the meantime, here’s a tantalizing clip from the talk, in which Isabella Rossellini explains the mating habits of the bee. Warning: this is scientifically sexy.
I shared the stage with Chris Anderson, who gave a fascinating talk on how the DIY community is building drones out of legos and open-source software. These DIY drones fly below government regulation and can be used for non-military applications, something we hear far too little of in the daily war digest that passes for news. The other speaker was Mark Rosin of the UK-based Guerrilla Science project. This clever organization reaches out to audiences at non-science venues, such as music concerts, and conducts entertaining presentations that teach core science ideas. As part of his presentation Mark used 250 inflated balloons and a bass amp to demonstrate the physics concept of resonance.
If your curiosity has been piqued and you’d like to check out an upcoming Nerd Nite, consider attending the upcoming Nerdtacular, the first Nerd Nite Global Festival, to be held this August 16-18th in Brooklyn, New York.
The global Nerdtacular: Now that’s an idea worth spreading.
This Friday, I’ll be participating at HackPrinceton.
My team will be training an EEG to recognize yes and no thoughts for particular electromechanical devices and creating a general human brain interface (HBI) architecture.
We’ll be working on allowing you to turn on your phone and navigate various menus with your mind!
There’s lots of cool swag and prizes – the best being jobs at Google and Microsoft. Everyone on the team has experience in the field,* but of course the more the merrier and you’re welcome no matter what you bring (or don’t bring!) to the table.
If you’re interested, email email@example.com ASAP!
*So far we’ve got a math Ph.D., a mech engineer, and some CS/Operations Research guys, and while my field is finance, I picked up some neuro/machine learning along the way. If you have nothing to do for the next three days and want to learn something specifically for this competition, I recommend checking out my personal favorites: neurofocus.com, frontiernerds.com or neurogadget.com.
I wanted to give you the low-down on a data hackathon I participated in this weekend, which was sponsored by the NYU Institute for Public Knowledge on the topic of climate change and social information. We were assigned teams and given a very broad mandate. We had only 24 hours to do the work, so it had to be simple.
Our team consisted of Venky Kannan, Tom Levine, Eric Schles, Aaron Schumacher, Laura Noren, Stephen Fybish, and me.
We decided to think about the effects of super storms on different neighborhoods. In particular, to measure the recovery time of the subway ridership in various neighborhoods using census information. Our project was inspired by this “nofarehikes” map of New York which tries to measure the impact of a fare hike on the different parts of New York. Here’s a copy of our final slides.
Also, it’s not directly related to climate change, but rather rests on the assumption that with climate change comes more frequent extreme weather events, which seems to be an existing myth (please tell me if the evidence is or isn’t there for that myth).
We used three data sets: subway ridership by turnstile, which only exists since May 2010, the census of 2010 (which is kind of out of date but things don’t change that quickly) and daily weather observations from NOAA.
Using the weather map and relying on some formal definitions while making up some others, we came up with a timeline of extreme weather events:
Then we looked at subway daily ridership to see the effect of the storms or the recovery from the storms:
Then we used the census tracts to understand wealth in New York:
And of course we had to know which subway stations were in which census tracts. This isn’t perfect because we didn’t have time to assign “empty” census tracts to some nearby subway station. There are on the order of 2,000 census tracts but only on the order of 800 subway stations. But again, 24 hours isn’t a lot of time, even to build clustering algorithms.
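The matching step we didn’t have time to finish would look something like this toy sketch: assign each census tract to its nearest subway station. The coordinates here are made up, and a real version would use the actual MTA and census geographies (and a proper geodesic distance rather than this crude Euclidean stand-in):

```python
import math

# Invented station and tract centroid coordinates (lat, lon).
stations = {"Wall St": (40.707, -74.011), "125 St": (40.804, -73.937)}
tracts = {"tract_A": (40.71, -74.00), "tract_B": (40.80, -73.94)}

def nearest_station(lat, lon):
    # Euclidean distance on raw lat/lon is a rough stand-in at city scale.
    return min(stations, key=lambda s: math.hypot(stations[s][0] - lat,
                                                  stations[s][1] - lon))

# Map every tract to its closest station.
assignment = {t: nearest_station(*coords) for t, coords in tracts.items()}
print(assignment)
```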
Finally, we attempted to put the data together to measure which neighborhoods have longer-than-expected recovery times after extreme weather events. This is our picture:
Interestingly, it looks like the neighborhoods of Manhattan are most impacted by severe weather events, which is not in line with our prior [Update: I don't think we actually computed the impact on a given resident, but rather just the overall change in rate of ridership versus normal. An impact analysis would take into account the relative wealth of the neighborhoods and would probably look very different].
There are tons of caveats, I’ll mention only a few here:
- We didn’t have time to measure the extent to which the recovery time took longer because the subway stopped versus other reasons people might not use the subway. But our data is good enough to do this.
- Our data might have been overwhelmingly biased by Sandy. We’d really like to do this with much longer-term data, but the granular subway ridership data has not been available for long. But the good news is we can do this from now on.
- We didn’t have bus data at the same level, which is a huge part of whether someone can get to work, especially in the outer boroughs. This would have been great and would have given us a clearer picture.
- When someone can’t get to work, do they take a car service? How much does that cost? We’d love to have gotten our hands on the alternative ways people got to work and how that would impact them.
- In general we’d have liked to measure the impact relative to each neighborhood’s median salary.
- We would also have loved to have measured the extent to which each neighborhood consisted of salary versus hourly wage earners to further understand how a loss of transportation would translate into an impact on income.
I just read this paper, written by Björn Brembs and Marcus Munafò and entitled “Deep Impact: Unintended consequences of journal rank”. It was recently posted on the Computer Science arXiv (h/t Jordan Ellenberg).
I’ll give you a rundown on what it says, but first I want to applaud the fact that it was written in the first place. We need more studies like this, which examine the feedback loop of modeling at a societal level. Indeed this should be an emerging scientific or statistical field of study in its own right, considering how many models are being set up and deployed on the general public.
Here’s the abstract:
Much has been said about the increasing bureaucracy in science, stifling innovation, hampering the creativity of researchers and incentivizing misconduct, even outright fraud. Many anecdotes have been recounted, observations described and conclusions drawn about the negative impact of impact assessment on scientists and science. However, few of these accounts have drawn their conclusions from data, and those that have typically relied on a few studies. In this review, we present the most recent and pertinent data on the consequences that our current scholarly communication system has had on various measures of scientific quality (such as utility/citations, methodological soundness, expert ratings and retractions). These data confirm previous suspicions: using journal rank as an assessment tool is bad scientific practice. Moreover, the data lead us to argue that any journal rank (not only the currently-favored Impact Factor) would have this negative impact. Therefore, we suggest that abandoning journals altogether, in favor of a library-based scholarly communication system, will ultimately be necessary. This new system will use modern information technology to vastly improve the filter, sort and discovery function of the current journal system.
The key points in the paper are as follows:
- There’s a growing importance of science and trust in science
- There’s also a growing rate of retractions (a 20-fold increase from 2000 to 2010), with scientific misconduct cases growing even faster to become the majority of retractions (though still an overall rate of only 0.02% of published papers)
- There’s a larger and growing “publication bias” problem – in other words, an increasing unreliability of published findings
- One problem: initial “strong effects” get published in high-ranking journals, but subsequent “weak results” (which are probably more reasonable) are published in low-ranking journals
- The formal “Impact Factor” (IF) metric for rank is highly correlated to “journal rank”, defined below.
- There’s a higher incidence of retraction in high-ranking (measured through “high IF”) journals.
- “A meta-analysis of genetic association studies provides evidence that the extent to which a study over-estimates the likely true effect size is positively correlated with the IF of the journal in which it is published”
- Can the higher retraction rate in high-ranking journals be explained by their higher visibility? They think not: journal rank is a bad predictor of future citations, for example. [mathbabe inserts her opinion: this part needs more argument.]
- “…only the most highly selective journals such as Nature and Science come out ahead over unselective preprint repositories such as ArXiv and RePEc”
- Are there other measures of excellence that would correlate with IF? Methodological soundness? Reproducibility? No: “In fact, the level of reproducibility was so low that no relationship between journal rank and reproducibility could be detected.”
- More about Impact Factor: The IF is a metric for the number of citations to articles in a journal (the numerator), normalized by the number of articles in that journal (the denominator). Sounds good! But:
- For a given journal, IF is not calculated but is negotiated – the publisher can (and does) exclude certain articles (but not citations). Even retroactively!
- The IF is also not reproducible – errors are found and left unexplained.
- Finally, IF is likely skewed by the fat-tailedness of citations (certain articles get lots, most get few). Wouldn’t a more robust measure be given by the median?
- Journal rank is a weak to moderate predictor of scientific impact
- Journal rank is a moderate to strong predictor of both intentional and unintentional scientific unreliability
- Journal rank is expensive, delays science and frustrates researchers
- Journal rank as established by IF violates even the most basic scientific standards, but predicts subjective judgments of journal quality
- “IF generates an illusion of exclusivity and prestige based on an assumption that it will predict subsequent impact, which is not supported by empirical data.”
- “Systemic pressures on the author, rather than increased scrutiny on the part of the reader, inflate the unreliability of much scientific research. Without reform of our publication system, the incentives associated with increased pressure to publish in high-ranking journals will continue to encourage scientists to be less cautious in their conclusions (or worse), in an attempt to market their research to the top journals.”
- “It is conceivable that, for the last few decades, research institutions world-wide may have been hiring and promoting scientists who excel at marketing their work to top journals, but who are not necessarily equally good at conducting their research. Conversely, these institutions may have purged excellent scientists from their ranks, whose marketing skills did not meet institutional requirements. If this interpretation of the data is correct, we now have a generation of excellent marketers (possibly, but not necessarily also excellent scientists) as the leading figures of the scientific enterprise, constituting another potentially major contributing factor to the rise in retractions. This generation is now in charge of training the next generation of scientists, with all the foreseeable consequences for the reliability of scientific publications in the future.”
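On the median suggestion: the IF is a mean, and means are fragile under fat tails. A quick illustration with a made-up citation distribution, where most papers get a handful of citations and one gets a huge number:

```python
# Invented citation counts for ten papers in a journal: mostly small,
# with one blockbuster.
citations = [0, 1, 1, 2, 2, 3, 3, 4, 5, 500]

mean = sum(citations) / len(citations)

ordered = sorted(citations)
n = len(ordered)
median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2  # even-length case

print(mean)    # → 52.1: the single blockbuster dominates
print(median)  # → 2.5: what a "typical" paper actually gets
```

The mean-based IF says a typical article in this journal gets about 52 citations; the median says about 2.5, which is far closer to most authors’ experience.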
The authors suggest that we need a new kind of publishing platform. I wonder what they’d think of the Episciences Project.
The past: Money in politics
First things first, I went to the Bicoastal Datafest a few weekends ago and haven’t reported back. Mostly that’s because I got sick and didn’t go on the second day, but luckily other people did, like Kathy Kiely from the Sunlight Foundation, who wrote up this description of the event and the winning teams’ projects.
And hey, it turns out that my new company shares an office with Harmony Institute, whose data scientist Burton DeWilde was on the team that won “Best in Show” for their orchestral version of the federal government’s budget.
Another writeup of the event comes by way of Michael Lawson, who worked on the team that set up an accounting fraud detection system through Benford’s Law. I might be getting a guest blog post about this project through another one of its team members soon.
And we got some good progress on our DataKind/ Sunlight Foundation money-in-politics project as well, thanks to DataKind intern Pete Darche and math nerds Kevin Wilson and Johan de Jong.
The future one week from now: Occupy
It’s a combination of an Occupy event and a datafest, so obviously I am going to try to go. The theme is general – data for the 99% – but there’s a discussion on this listserv as to the various topics people might want to focus on (Aaron Swartz and Occupy Sandy are coming up for example). I’m looking forward to reporting back (or reporting other people’s report-backs if my kids don’t let me go).
The future two weeks from now: Climate change
Finally, there’s this datathon, which doesn’t look open to registration, but which I’ll be participating in through my work. Its stated goal is “to explore how social and meteorological data can be combined to enhance social science research on climate change and cities.” The datathon will run Saturday March 9th – Sunday March 10th, 2013, starting noon Saturday, with final presentations at noon Sunday. I’ll try to report back on that as well.
I wanted to share with you guys a project I’ve been involved with started by John Spens of Thoughtworks regarding data collection and open analysis around guns and gun-related violence. John lives in Connecticut and has friends who were directly affected by the massacre in Newtown. Here is John’s description of the project:
I initiated the Sandy Hook Project in response to this need for information. The purpose of this project is to produce rigorous and transparent analyses of data pertaining to gun-related violence. My goal is to move beyond the rhetoric and emotion and produce (hopefully) objective insight into the relationship between guns and violence in the US. I realize that objectivity will be challenging, which is why I want to share the methods and the data openly so others can validate or refute my findings as well as contribute their own.
I’ve put the project on GitHub. (https://github.com/john-m-spens/SandyHookProject). While it’s not designed as a data repository, I think the ubiquitous nature of GitHub and the control enforced through the code repository model will support effective collaboration.
John has written a handful of posts about statistics and guns, including A Brief Analysis of Firearm-related Homicide Rates and Investigating Statistics Regarding Right to Carry Laws.
In addition to looking into the statistics that exist, John wants to address the conversation itself. As he said in his most recent post:
What limited data and analysis that exists is often misrepresented and abused, and is given much less attention than anecdotal evidence. It is relatively simple to produce a handful of cases that support either side in this debate. What we really need is to understand the true impact of guns on our society. Push back by the NRA that any such research would be “political opinion masquerading as medical science” is unacceptable. We can only make intelligent decisions when we have the fullest picture possible.
John is looking for nerd collaborators who can help him with data collection and open analysis. He’s also hoping to have a weekend datafest to work on this project in March, so stay tuned if you want to be part of that!
I’ve been talking a lot recently, with various people and on this blog, about data and model privacy. It seems like individuals, who should have the right to protect their data, don’t seem to, but huge private companies, with enormous powers over the public, do.
Another example: models working on behalf of the public, like Fed stress tests and other regulatory models, seem essentially publicly known, which is useful indeed to the financial insiders, the very people who are experts on gaming systems.
Google search has a deeply felt power over the public, and arguably needs to be understood for the consistent threat it poses to people’s online environment. It’s a scary thought experiment to imagine what could be done with it, and after all why should we blindly trust a corporation to have our best intentions in mind? Maybe it’s time to call for the Google search model to be open source.
But what would that look like? At first blush we might imagine forcing them to actually open up their source code. But at this point that code must be absolutely enormous, unreadable, and written specifically for their uniquely massive machine set-up. In other words, totally overwhelming and useless (as my friend Suresh might say, the singularity has already happened and this is what it looks like (update: Suresh credits Cosma)).
Once you consider how few people would actually be able to make sense of the underlying code base, you quickly realize that opening it up would do little to protect the public. Instead, we’d want to make the code accessible in some way.
But I claim that’s exactly what Google does, by allowing everyone to search using the model from anywhere. In other words, it’s on us, the public, to run experiments to understand what the underlying model actually does. We have the tools, let’s get going.
If we think there’s inherent racism in google searches, then we should run experiments like Nathan Newman recently did, examining the different ads that pop up when someone writes an email about buying a car, for example, with different names and in different zip codes. We should organize to change our zip codes, our personas (which would mean deliberately creating personas and gmail logins, etc.), and our search terms, and see how the Google search results change as our inputs change.
After all, I don’t know what’s in the code base but I’m pretty sure there’s no sub-routine that’s called “add_racism_to_search”; instead, it’s a complicated Rube-Goldberg machine that should be judged by its outputs, in a statistical way, rather than expected to prescriptively describe how it treats things on a case-by-case basis.
Another thing: I don’t think there are bad intentions on the part of the modelers, but that doesn’t mean there aren’t bad consequences – the model is too complicated for anyone to anticipate exactly how it acts unless they perform experiments to test it. In the meantime, until people understand that, we need to distinguish between the intentions and the results. So, for example, in the update to Nathan Newman’s experiments with Google mail, Google responded with this:
This post relies on flawed methodology to draw a wildly inaccurate conclusion. If Mr. Newman had contacted us before publishing it, we would have told him the facts: we do not select ads based on sensitive information, including ethnic inferences from names.
And then Newman added this:
Now, I’m happy to hear Google doesn’t “select ads” on this basis, but Google’s words seem chosen to allow a lot of wiggle room (as such Google statements usually seem to). Do they mean that Google algorithms do not use the ethnicity of names in ad selection or are they making the broader claim that they bar advertisers from serving up different ads to people with different names?
My point is that it doesn’t matter what Google says it does or doesn’t do, if statistically speaking the ads change depending on ethnicity. It’s a moot argument what they claim to do if what actually happens, the actual output of their Rube-Goldberg machine, is racist.
And I’m not saying Google’s models are definitively racist, by the way, since Newman’s efforts were small, the efforts of one man, and there were not thousands and thousands of tests but only a few. But his approach to understanding the model was certainly correct, and it’s a cause that technologists and activists should take up and expand on.
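And the statistics needn’t be fancy to scale his approach up. With enough repeated trials, even a simple two-proportion z-test tells you whether an ad shows up at genuinely different rates for two persona groups. Here’s a minimal sketch with invented counts:

```python
import math

def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Z statistic for whether two groups see an ad at different rates."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    p_pool = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Invented counts: how often a given ad appeared for each persona
# group over many repeated, otherwise-identical searches.
z = two_proportion_z(120, 1000, 80, 1000)
print(round(z, 2))  # a |z| well above 2 points to a real difference
```

Note this judges only the outputs, which is all we can observe and, per the argument above, all we need.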
Mathematically speaking, it’s already as open-source as we need it to be to understand it, although in a dual sense from the one people are used to thinking about. Actually, it defines the gold standard of open source: instead of getting a bunch of gobbledygook that we can’t process and that depends on enormously large data that changes over time, we get real-time access to the newest version that even a child can use.
I only wish that other public-facing models had such access. Let’s create a large-scale project like SETI to understand the Google search model.
I’m learning a bit of R in my current stint at ThoughtWorks. Coming from python, I was happy to see most of the plotting functions are very similar, as well as many of the vector-level data handling functions. Besides the fact that lists start at 1 instead of 0, things were looking pretty familiar.
But then I came across something that totally changed my mind. In R they have these data frames, which are like massive Excel spreadsheets: very structured matrices with named columns and rows, on which you can perform parallelized operations.
One thing I noticed right away about these rigid data structures is that they make handling missing data very easy. So if you have a huge data frame where a few rows are missing a few data points, then one command, na.omit, gets rid of your problem. Sometimes you don’t even need that, you can just perform your operation on your NA’s, and you just get back more NA’s where appropriate.
This ease-of-use for crappy data is good and bad: good because it’s convenient, bad because you never feel the pain of missing data. When I use python, I rely on dictionaries of dictionaries (of dictionaries) to store my data, and I have to make specific plans for missing data, which means it’s a pain but also that I have to face up to bad data directly.
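To make the contrast concrete, here’s a minimal sketch (with made-up records) of the explicit filtering you end up hand-writing in python, which is roughly what R’s na.omit does for you in a single call:

```python
# Records as a dictionary of dictionaries, the way I'd store them
# in python; "b" is missing its weight (all values made up).
records = {
    "a": {"height": 60, "weight": 120},
    "b": {"height": 65},
    "c": {"height": 70, "weight": 150},
}

required = ("height", "weight")

# The hand-rolled analog of R's na.omit: keep only complete rows
complete = {name: row for name, row in records.items()
            if all(field in row for field in required)}

print(sorted(complete))  # ['a', 'c']
```

Writing that filter yourself is mildly painful, but it forces you to decide, explicitly, what to do about the incomplete rows.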
But that’s not why I think R is somewhat like SQL. It’s really because of how bad “for” loops are in R.
So I was trying to add a new column to my rather large (~65,000 row) dataframe. Adding a column is very easy indeed, if the value in the new column is a simple function of the values in the current columns, because of the way you can parallelize operations. So if the new value is the square of the first column value plus the second column value, it can do it on the whole columns all at once and it’s super fast.
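In python terms, that whole-column arithmetic looks something like this, with toy columns and a list comprehension standing in for R’s vectorized column operations:

```python
# Toy columns; in R this arithmetic runs on whole columns at once,
# and in python a comprehension (or numpy) plays the same role.
col1 = [1, 2, 3, 4]
col2 = [10, 20, 30, 40]

# new column = square of the first column plus the second column
new_col = [a ** 2 + b for a, b in zip(col1, col2)]

print(new_col)  # [11, 24, 39, 56]
```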
In my case, though, the new value required a look-up in the table itself, which may or may not work, and then required a decision depending on whether it worked. For the life of me I couldn’t figure out how to do it using iterated “apply” or “lapply” functions in the existing dataframe. Of course it’s easy to do using a “for” loop, but that is excruciatingly slow.
Finally I realized I needed to think like a SQL programmer, and build a new dataframe consisting of the look-up row, if it existed, along with a unique identifier in common with the row I started with. Then I merged the two dataframes, which is like a SQL join, using that unique identifier as the key. This workaround would never be needed in python with a dataset of this size, because dictionaries are unstructured and look-ups are fast.
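The build-a-lookup-then-join move can be sketched in python terms with made-up data, where the shared id plays the role of the merge key:

```python
# Sketch of the build-a-lookup-then-join move in python terms.
# All data is made up; "id" plays the role of the merge key.
main = [
    {"id": 1, "value": 10},
    {"id": 2, "value": 20},
    {"id": 3, "value": 30},
]
lookup = {1: "found", 3: "found"}  # the look-up may or may not hit

# For each row, pull in the looked-up field, marking misses explicitly
merged = [dict(row, status=lookup.get(row["id"], "missing"))
          for row in main]

print([r["status"] for r in merged])  # ['found', 'missing', 'found']
```

In R the same shape is one merge() call between the two dataframes, with the id as the `by` column; the python version just makes the join machinery visible.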
Easy peasy lemon squeezy, once you understand it, but it made me realize that the approach of learning a new language by translating each word really doesn’t work. You need to think like a Parisian to really speak French.
In yesterday’s New York Times Science section, there was an article called “Life in the Red” (hat tip Becky Jaffe) about people’s behavior when they are in debt, summed up by this:
The usual explanations for reckless borrowing focus on people’s character, or social norms that promote free spending and instant gratification. But recent research has shown that scarcity by itself is enough to cause this kind of financial self-sabotage.
“When we put people in situations of scarcity in experiments, they get into poverty traps,” said Eldar Shafir, a professor of psychology and public affairs at Princeton. “They borrow at high interest rates that hurt them, in ways they knew to avoid when there was less scarcity.”
The psychological burden of debt not only saps intellectual resources, it also reinforces the reckless behavior, and quickly, Dr. Shafir and other experts said. Millions of Americans have been keeping the lights on through hard times with borrowed money, running a kind of shell game to keep bill collectors away.
So what we’ve got here is a feedback loop of poverty, which certainly jibes with my observations of friends and acquaintances who are in debt.
I’m guessing the experiments described in the article are not as bad as real life, however.
I say that because I’ve been talking on this blog as well as in my recent math talks about a separate feedback loop involving models, namely the feedback loop whereby people who are judged poor by the model are offered increasingly bad terms on their loans. I call it the death spiral of modeling.
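To make that loop concrete, here’s a toy simulation of the death spiral; every number in it is invented, but the structure – worse score, worse rate, more debt, worse score – is the point:

```python
# Toy simulation of the modeling death spiral: a lower score means
# a higher interest rate, which grows the debt, which drags the
# score down further. Every parameter here is invented.
score = 600    # hypothetical credit score
debt = 1000.0  # hypothetical debt in dollars

for year in range(5):
    rate = 0.05 + max(0, 700 - score) * 0.001  # worse score, worse rate
    debt *= 1 + rate                           # debt compounds at that rate
    score -= int(debt / 200)                   # more debt, worse score

print(score, round(debt, 2))  # the loop only ratchets downward
```

Even with these gentle made-up parameters the debt roughly doubles in five years while the score keeps sliding, and nothing in the loop ever pushes back the other way.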
If you think about how these two effects work together – the array of offers gets worse as your vulnerability to bad deals increases – then you start to understand what half of our country is actually living through on a day-to-day basis.
As an aside, I have an enormous amount of empathy for people experiencing this poverty trap. I don’t think it’s a moral issue to be in debt: nobody wants to be poor, and nobody plans it that way.
This opinion article (hat tip Laura Strausfeld), also in yesterday’s New York Times, makes the important point that listening to a bunch of rich, judgmental people like David Bach, Dave Ramsey, and Suze Orman telling us it’s our fault we haven’t finished saving for retirement isn’t actually useful, and suggests we each choose one money issue to take charge of and sort out.
So my empathetic nerd take on poverty traps is this: how can we quantitatively measure this phenomenon, or more precisely these phenomena, since we’ve identified at least two feedback loops?
One reason it’s hard is that it would be difficult to run natural experiments where some people are subjected to the toxic environment but other people aren’t – it’s the “people who aren’t” category that’s the hard part, of course.
For the vulnerability to bad terms, the article describes the level of harassment that people receive from bill collectors as a factor in how they react, which doesn’t surprise anyone who’s ever dealt with a bill collector. Are there certain people who don’t get harassed for whatever reason, and do they fall prey to bad deals at a different rate? Are there local laws in some places prohibiting certain harassment? Can we go to another country where the bill collectors are reined in and see how people in debt behave there?
Also, in terms of availability of loans, it might be relatively easy to start out with people who live in states with payday loans versus people who don’t, and see how much faster the poverty spiral overtakes people with worse options. Of course, as crappy loans get more and more available online, this proximity study will become moot.
It’s also going to be tricky to tease out the two effects from each other. One is a question of supply and the other is a question of demand, and as we know those two are related.
I’m not answering these questions today, it’s a long-term project that I need your help on, so please comment below with ideas. Maybe if we have a few good ideas and if we find some data we can plan a data hackathon.
I had a great time giving my “Weapons of Math Destruction” talk in San Diego, and the audience was fantastic and thoughtful.
One question that someone asked was whether the US News & World Report college ranking model should be forced to be open sourced – wouldn’t that just cause colleges to game the model?
First of all, colleges are already widely gaming the model and have been for some time. And that gaming is a distraction and has been heading colleges in directions away from good instruction, which is a shame.
And if you suggest that they change the model all the time to prevent this, then it’s your mental model of their model that needs adjustment. They might be tinkering at the edges, but overall it’s quite clear what goes into the model: namely, graduation rates, SAT scores, the number of Ph.D.’s on staff, and so on. The exact percentages change over time, but not by much.
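Structurally, a ranking like this is widely understood to be a weighted sum of a handful of metrics. Here’s a toy version, with factors and weights invented for illustration (not US News’s actual numbers), which also shows why it’s so gameable: improve any single input and the score moves predictably:

```python
# Toy weighted-sum ranking of the kind described above; the factors
# and weights are invented for illustration, not US News's numbers.
weights = {"graduation_rate": 0.40, "sat_score": 0.35, "phd_faculty": 0.25}

def rank_score(college):
    # inputs are assumed to be pre-normalized to [0, 1]
    return sum(weights[k] * college[k] for k in weights)

some_college = {"graduation_rate": 0.9, "sat_score": 0.8, "phd_faculty": 0.7}
print(round(rank_score(some_college), 3))
```

A college that wants a better rank doesn’t need to become a better school; it only needs to push on whichever input is cheapest to move.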
The impact that this model has had on education and how universities apportion resources has been profound. Academic papers have been written on the law school version of this story.
Moreover, the tactics that US News & World Report uses to enforce its dominance of the market amount to bullying, as you can learn from the president of Reed College, which refuses to be involved.
Back to the question. Just as I realize that opening up all data is not reasonable or desirable, first because there are serious privacy issues and second because certain groups have natural advantages in exploiting openly shared resources, it is also true that opening up all models is similarly problematic.
However, certain data should surely be open: for example, the laws of our country, that we are all responsible to know, should be freely available to us (something that Aaron Swartz understood and worked towards). How can we be held responsible for laws we can’t read?
Similarly, public-facing models, such as credit scoring models and teacher value-added models, should absolutely be open and accessible to the public. If I’m being judged and measured and held accountable by some model in my daily life as a citizen, that has real impact on how my future will unfold, then I should know how that process works.
And if you complain about the potential gaming of those public-facing models, I’d answer: if they are gameable then they shouldn’t be used, considering the impact they have on so many people’s lives. Because a gameable model is a weak model, with proxies that fail.
Another way to say this is we should want someone to “game” the credit score model if it means they pay their bills on time every month (I wrote about this here).
Back to the US News & World Report model. Is it public facing? I’m no lawyer but I think a case can be made that it is, and that the public’s trust in this model makes it a very important model indeed. Evidence can be gathered by measuring the extent to which colleges game the model, which they only do because the public cares so much about the rankings.
Even so, what difference would that make, to open it up?
In an ideal world, where the public is somewhat savvy about what models can and cannot do, opening up the US News & World Report college ranking model would result in people losing faith in it. They’d realize that it’s no more valuable than an opinion from a highly vocal uncle who is obsessed with certain metrics and blind to the individual eccentricities and curriculums that may be a perfect match for a non-conformist student. It’s only one opinion among many, and not to be religiously believed.
But this isn’t an ideal world, and we have a lot of work to do to get people to understand models as opinions in this sense, and to get people to stop trusting them just because they’re mathematically presented.
I just signed up for an upcoming datafest called “Big Data, Big Money, and You” which will be co-hosted at Columbia University and Stanford University on February 2nd and 3rd.
The idea is to use data from:
- National Institute on Money in State Politics,
- Open States,
- Pew Research Center,
- The Center for Responsive Politics,
- State Integrity, and
- The Sunlight Foundation
and open-source tools such as R, python, and various APIs to model and explore various issues at the intersection of money and politics. Among those listed are things like: “look for correlation between the subject of bills introduced to state legislatures to big companies within those districts and campaign donations” and “comparing contributions pre and post redistricting”.
As usual, a weekend-long datafest is just the beginning of a good data exploration: if you’re interested in this, think of this as an introduction to the ideas and the people involved; it’s just as much about networking with like-minded people as it is about finding an answer in two days.
So sign up, come on by, and get ready to roll up your sleeves and have a great time for that weekend, but also make sure you get people’s email addresses so you can keep in touch as things continue to develop down the road.