Today I want to bring up a few observations and concerns I have about the emergence of a new field in machine learning called e-discovery. It’s the algorithmic version of discovery, so I’ll start there.
Discovery is part of the process in a lawsuit where relevant documents are selected, pored over, and then handed to the other side. Nowadays, of course, there are more and more documents, almost all electronic, typically including lots of e-mails.
If you’re talking about a big lawsuit, there could be literally millions of documents to wade through, and having humans do that is incredibly slow and expensive. Enter the algorithm.
With advances in Natural Language Processing (NLP), a machine algorithm can sort emails or documents by topic (after getting the documents into machine-readable form, cleaning, and deduping) and can in general do a pretty good job of figuring out whether a given email is “relevant” to the case.
And this is already happening – the Wall Street Journal recently reported that the Justice Department allowed e-discovery for a case involving the merger of two beer companies. From the article:
With the blessing of the Justice Department’s antitrust division, the lawyers loaded the documents into a program and manually reviewed a batch to train the software to recognize relevant documents. The manual review was repeated until the Justice Department and Constellation were satisfied that the program could accurately predict relevance in the rest of the documents. Lawyers for Constellation and Crown Imports used software developed by kCura Corp., which lists the Justice Department as a client.
In the end, Constellation and Crown Imports turned over hundreds of thousands of documents to antitrust investigators.
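The “label a batch, train, repeat” loop from the article can be caricatured in a few lines of Python. This is a toy sketch with invented emails and a bare-bones word-count score; real e-discovery software is vastly more sophisticated, and the function names and data here are mine, not kCura’s:

```python
from collections import Counter

def train_relevance(labeled_docs):
    """Tally word frequencies in hand-labeled relevant vs. irrelevant docs."""
    counts = {True: Counter(), False: Counter()}
    for text, relevant in labeled_docs:
        counts[relevant].update(text.lower().split())
    return counts

def score(counts, text):
    """Crude relevance score: +1 for words more common in relevant
    training docs, -1 for words more common in irrelevant ones."""
    s = 0
    for word in text.lower().split():
        if counts[True][word] > counts[False][word]:
            s += 1
        elif counts[False][word] > counts[True][word]:
            s -= 1
    return s

# Invented training batch: the manually reviewed documents.
batch = [
    ("merger pricing discussion with distributor", True),
    ("quarterly pricing strategy for the merger", True),
    ("office holiday party rsvp", False),
    ("lunch menu for the party", False),
]
model = train_relevance(batch)
relevant_score = score(model, "pricing plans after the merger")  # positive: looks relevant
party_score = score(model, "party invitation")                   # negative: looks irrelevant
```

The point of the sketch is only that the machine ends up encoding whatever the human labelers taught it, which is exactly why the concerns below matter.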
Here are some of my questions/concerns:
- These algorithms are typically not open source – companies like kCura make good money doing these jobs.
- That means that they could be wrong, possibly in subtle ways.
- Or maybe not so subtle ways: maybe they’ve been trained to find documents that are both “relevant” and “positive” for a given side.
- In any case, the laws of this country will increasingly depend on a black-box algorithm that is not accessible to the average citizen.
- Is that in the public’s interest?
- Is that even constitutional?
When I first met Chris Wiggins of Columbia and hackNY back in 2011, he immediately introduced me to about a hundred other people, which made it obvious that his introductions were highly stereotyped. I thought he was some kind of robot, especially when I started getting emails from his phone which all had the same (long) phrases in them, like “I’m away from my keyboard right now, but when I get back to my desk I’ll calendar prune and send you some free times.”
Finally I was like “what the hell, are you sending me whole auto-generated emails?” To which he replied “of course.”
Feeling cheated, I called him to tell him he has an addiction to shell scripting. Here’s a brief interview, rewritten to make me sound smarter and cooler than I am.
CO: Ok, let’s start with these iPhone shortcuts. Sometimes the whole email from you reads like a bunch of shortcuts.
CW: Yup, lots of times.
CO: What the hell? Don’t you want to personalize things for me at least a little?
CW: I do! But I also want to catch the subway.
CO: Ugh. How many shortcuts do you have on that thing?
CW: Well… (pause)… 38.
CO: Ok now I’m officially worried about you. What’s the longest one?
CW: Probably this one I wrote for Sandy: If I write “sandy” it unpacks to
“Sorry for delay and brevity in reply. Sandy knocked out my phone, power, water, and internet so I won’t be replying as quickly as usual. Please do not hesitate to email me again if I don’t reply soon.”
CO: You officially have a problem. What’s the shortest one?
CW: Well, when I type “plu” it becomes “+1”
CO: Ok, let me apply the math for you: your shortcut is longer than your longcut.
CW: I know, but not if you include switching from letters to numbers on the iPhone, which is annoying.
CO: How did you first become addicted to shortcuts?
CW: I got introduced to UNIX in the 80s and, in my frame of reference at the time, the closest I had come to meeting a wizard was the university’s sysadmin. I was constantly breaking things by chomping CPU with undead processes or removing my $HOME or something, and he had to come in and fix things. I learned a lot over his shoulder. In the summer before I started college, my dream was to be a university sysadmin. He had to explain to me patiently that I shouldn’t spend college in a computercave.
CO: Good advice, but now that you’re a grownup you can do that.
CW: Exactly. Anyway, every time he would fix whatever incredible mess I had made, he would sign off with some different flair and walk out, like he was dropping the mic and walking off stage. He never signed out with “logout”; it was always “die” or “leave” or “ciao” (I didn’t know that word at the time). So of course by the time he got back to his desk one day there was an email from me asking how to do this and he replied:
CO: That seems like kind of a mean thing to do to you at such a young age.
CW: It’s true. UNIX alias was clearly the gateway drug that led me to writing shell scripts for everything.
CO: How many aliases do you have now?
CW: According to “alias | wc -l”, I have 1137. So far.
CO: So you’ve spent countless hours making aliases to save time.
CW: Yes! And shell scripts!
CO: Ok let’s talk about this script for introducing me to people. As you know I don’t like getting treated like a small cog. I’m kind of a big deal.
CW: Yes, you’ve mentioned that.
CO: So how does it work?
CW: I have separate biography files for everyone, and a file called nfile.asc that has first name, lastname@tag, and email address. Then I can introduce people via
% ii oneil@mathbabe schutt
It strips out the @mathbabe part (so I can keep track of multiple people named oneil) from the actual email, reads in and reformats the biographies, grepping out the commented lines, and writes an email I can pipe to mutt. The whole thing can be done in a few seconds.
CO: Ok that does sound pretty good. How many shell scripts do you have?
CW: Hundreds. A few of them are in my public mise-en-place repository, which I should update more. I’m not sure which of them I really use all the time, but it’s pretty rare I type an actual legal UNIX command at the command line. That said I try never to leave the command line. Students are always teaching me fancypants tricks for their browsers or some new app, but I spend a lot of time at the command line getting and munging data, and for that, sed, awk, and grep are here to stay.
CO: That’s kinda sad and yet… so true. Ok here’s the only question I really wanted to ask though: will you promise me you’ll never send me any more auto-generated emails?
This is a guest post by Julia Evans. Julia is a data scientist & programmer who lives in Montréal. She spends her free time these days playing with data and running events for women who program or want to — she just started a Montréal chapter of pyladies to teach programming, and co-organizes a monthly meetup called Montréal All-Girl Hack Night for women who are developers.
I asked mathbabe a question a few weeks ago saying that I’d recently started a data science job without having too much experience with statistics, and she asked me to write something about how I got the job. Needless to say I’m pretty honoured to be a guest blogger here. Hopefully this will help someone!
Last March I decided that I wanted a job playing with data, since I’d been playing with datasets in my spare time for a while and I really liked it. I had a BSc in pure math, an MSc in theoretical computer science and about 6 months of work experience as a programmer developing websites. I’d taken one machine learning class and zero statistics classes.
In October, I left my web development job with some savings and no immediate plans to find a new job. I was thinking about doing freelance web development. Two weeks later, someone posted a job posting to my department mailing list looking for a “Junior Data Scientist”. I wrote back and said basically “I have a really strong math background and am a pretty good programmer”. This email included, embarrassingly, the sentence “I am amazing at math”. They said they’d like to interview me.
The interview was a lunch meeting. I found out that the company (Via Science) was opening a new office in my city, and was looking for people to be the first employees at the new office. They work with clients to make predictions based on their data.
My interviewer (now my manager) asked me about my role at my previous job (a little bit of everything — programming, system administration, etc.), my math background (lots of pure math, but no stats), and my experience with machine learning (one class, and drawing some graphs for fun). I was asked how I’d approach a digit recognition problem and I said “well, I’d see what people do to solve problems like that, and I’d try that”.
I also talked about some data visualizations I’d worked on for fun. They were looking for someone who could take on new datasets and be independent and proactive about creating models, figuring out what the most useful thing to model is, and getting more information from clients.
I got a call back about a week after the lunch interview saying that they’d like to hire me. We talked a bit more about the work culture, starting dates, and salary, and then I accepted the offer.
So far I’ve been working here for about four months. I work with a machine learning system developed inside the company (there’s a paper about it here). I’ve spent most of my time working on code to interface with this system and make it easier for us to get results out of it quickly. I alternate between working on this system (using Java) and using Python (with the fabulous IPython Notebook) to quickly draw graphs and make models with scikit-learn to compare our results.
I like that I have real-world data (sometimes, lots of it!) where there’s not always a clear question or direction to go in. I get to spend time figuring out the relevant features of the data or what kinds of things we should be trying to model. I’m beginning to understand what people say about data-wrangling taking up most of their time. I’m learning some statistics, and we have a weekly Friday seminar series where we take turns talking about something we’ve learned in the last few weeks or introducing a piece of math that we want to use.
Overall I’m really happy to have a job where I get data and have to figure out what direction to take it in, and I’m learning a lot.
MathBabe recently wrote an article critical of the elitist nature of Ted Talks, which you can read here. Fortunately for her, and for the hoi polloi everywhere clamoring for populist science edutainment, there is an alternative: Nerd Nite. Once a month, in cities all over the globe, nerds herd into a local bar and turn it into a low-brow forum for innovative science ideas. Think Ted Talks on tequila.
Each month, three speakers present talks for 20-30 minutes, followed by questions and answers from the invariably sold-out audience. The monthly forum gives professional and amateur scientists an opportunity to explain their fairly abstruse specialties accessibly to a lay audience – a valuable skill. Since the emphasis is on science entertainment, it also gives the speakers a chance to present their ideas in a more engaging way: in iambic pentameter, in drag with a tuba, in three-part harmony, or via interpretive dance – an invaluable skill. The resulting atmosphere is informal, delightfully debauched, and refreshingly pro-science.
Slaking our thirst for both science education and mojitos, Nerd Nite started small but quickly went viral. Nerd Nites are now being held in 50 cities, from San Francisco to Kansas City and Auckland to Liberia. You can find the full listing of cities here; if you don’t see one near you, start one!
Last Wednesday night I was twitterpated to be one of three guest nerds sharing the stage at San Francisco’s Nerd Nite. I put the chic back into geek with a biology talk entitled “Genital Plugs, Projectile Penises, and Gay Butterflies: A Naturalist Explains the Birds and the Bees.”
A video recording of the presentation will be available online soon, but in the meantime, here’s a tantalizing clip from the talk, in which Isabella Rossellini explains the mating habits of the bee. Warning: this is scientifically sexy.
I shared the stage with Chris Anderson, who gave a fascinating talk on how the DIY community is building drones out of legos and open-source software. These DIY drones fly below government regulation and can be used for non-military applications, something we hear far too little of in the daily war digest that passes for news. The other speaker was Mark Rosin of the UK-based Guerrilla Science project. This clever organization reaches out to audiences at non-science venues, such as music concerts, and conducts entertaining presentations that teach core science ideas. As part of his presentation Mark used 250 inflated balloons and a bass amp to demonstrate the physics concept of resonance.
If your curiosity has been piqued and you’d like to check out Nerd Nite for yourself, consider attending the Nerdtacular, the first Nerd Nite Global Festival, to be held this August 16-18th in Brooklyn, New York.
The global Nerdtacular: Now that’s an idea worth spreading.
This Friday, I’ll be participating at HackPrinceton.
My team will be training an EEG to recognize yes and no thoughts for particular electromechanical devices and creating a general human brain interface (HBI) architecture.
We’ll be working on allowing you to turn on your phone and navigate various menus with your mind!
There’s lots of cool swag and prizes – the best being jobs at Google and Microsoft. Everyone on the team has experience in the field,* but of course the more the merrier and you’re welcome no matter what you bring (or don’t bring!) to the table.
If you’re interested, email firstname.lastname@example.org ASAP!
*So far we’ve got a math Ph.D., a mech engineer, and some CS/Operations Research guys, and while my field is finance, I picked up some neuro/machine learning along the way. If you have nothing to do for the next three days and want to learn something specifically for this competition, I recommend checking out my personal favorites: neurofocus.com, frontiernerds.com or neurogadget.com.
I wanted to give you the low-down on a data hackathon I participated in this weekend, which was sponsored by the NYU Institute for Public Knowledge on the topic of climate change and social information. We were assigned teams and given a very broad mandate. We had only 24 hours to do the work, so it had to be simple.
Our team consisted of Venky Kannan, Tom Levine, Eric Schles, Aaron Schumacher, Laura Noren, Stephen Fybish, and me.
We decided to think about the effects of superstorms on different neighborhoods. In particular, we wanted to measure the recovery time of subway ridership in various neighborhoods, using census information to characterize them. Our project was inspired by this “nofarehikes” map of New York, which tries to measure the impact of a fare hike on the different parts of New York. Here’s a copy of our final slides.
Also, it’s not directly related to climate change, but rather rests on the assumption that with climate change comes more frequent extreme weather events, which seems to be an existing myth (please tell me if the evidence is or isn’t there for that myth).
We used three data sets: subway ridership by turnstile, which only exists since May 2010, the census of 2010 (which is kind of out of date but things don’t change that quickly) and daily weather observations from NOAA.
Using the weather map and relying on some formal definitions while making up some others, we came up with a timeline of extreme weather events:
Then we looked at subway daily ridership to see the effect of the storms or the recovery from the storms:
Then we used the census tracts to understand wealth in New York:
And of course we had to know which subway stations were in which census tracts. This isn’t perfect because we didn’t have time to assign “empty” census tracts to some nearby subway station. There are on the order of 2,000 census tracts but only on the order of 800 subway stations. But again, 24 hours isn’t a lot of time, even to build clustering algorithms.
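The missing step — assigning each census tract to its nearest subway station — is conceptually simple. Here’s a pure-Python sketch with invented coordinates; a real version would use proper geographic distance and the full station list:

```python
from math import hypot

def nearest_station(tract_centroid, stations):
    """Assign a census tract to the closest subway station.
    Planar distance is a fine approximation at NYC scale for a sketch."""
    x, y = tract_centroid
    return min(stations, key=lambda s: hypot(s[1][0] - x, s[1][1] - y))[0]

# Invented data: (name, (lon, lat)) for stations, centroids for tracts.
stations = [("14th St", (-73.996, 40.737)), ("125th St", (-73.945, 40.805))]
tracts = {"tract_A": (-73.99, 40.74), "tract_B": (-73.95, 40.80)}
assignment = {t: nearest_station(c, stations) for t, c in tracts.items()}
```

With roughly 2,000 tracts and 800 stations, even this brute-force loop would finish in well under a second, so the constraint really was person-hours, not compute.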
Finally, we attempted to put the data together to measure which neighborhoods have longer-than-expected recovery times after extreme weather events. This is our picture:
Interestingly, it looks like the neighborhoods of Manhattan are most impacted by severe weather events, which is not in line with our prior [Update: I don't think we actually computed the impact on a given resident, but rather just the overall change in rate of ridership versus normal. An impact analysis would take into account the relative wealth of the neighborhoods and would probably look very different].
There are tons of caveats, I’ll mention only a few here:
- We didn’t have time to measure the extent to which the recovery time took longer because the subway stopped versus other reasons people might not use the subway. But our data is good enough to do this.
- Our data might have been overwhelmingly biased by Sandy. We’d really like to do this with much longer-term data, but the granular subway ridership data has not been available for long. But the good news is we can do this from now on.
- We didn’t have bus data at the same level, which is a huge part of whether someone can get to work, especially in the outer boroughs. This would have been great and would have given us a clearer picture.
- When someone can’t get to work, do they take a car service? How much does that cost? We’d love to have gotten our hands on the alternative ways people got to work and how that would impact them.
- In general we’d have liked to measure the impact relative to each neighborhood’s median salary.
- We would also have loved to have measured the extent to which each neighborhood consisted of salary versus hourly wage earners to further understand how a loss of transportation would translate into an impact on income.
I just read this paper, written by Björn Brembs and Marcus Munafò and entitled “Deep Impact: Unintended consequences of journal rank”. It was recently posted on the Computer Science arXiv (h/t Jordan Ellenberg).
I’ll give you a rundown on what it says, but first I want to applaud the fact that it was written in the first place. We need more studies like this, which examine the feedback loop of modeling at a societal level. Indeed this should be an emerging scientific or statistical field of study in its own right, considering how many models are being set up and deployed on the general public.
Here’s the abstract:
Much has been said about the increasing bureaucracy in science, stifling innovation, hampering the creativity of researchers and incentivizing misconduct, even outright fraud. Many anecdotes have been recounted, observations described and conclusions drawn about the negative impact of impact assessment on scientists and science. However, few of these accounts have drawn their conclusions from data, and those that have typically relied on a few studies. In this review, we present the most recent and pertinent data on the consequences that our current scholarly communication system has had on various measures of scientific quality (such as utility/citations, methodological soundness, expert ratings and retractions). These data confirm previous suspicions: using journal rank as an assessment tool is bad scientific practice. Moreover, the data lead us to argue that any journal rank (not only the currently-favored Impact Factor) would have this negative impact. Therefore, we suggest that abandoning journals altogether, in favor of a library-based scholarly communication system, will ultimately be necessary. This new system will use modern information technology to vastly improve the filter, sort and discovery function of the current journal system.
The key points in the paper are as follows:
- There’s a growing importance of science and trust in science
- There’s also a growing rate (a twentyfold increase from 2000 to 2010) of retractions, with scientific misconduct cases growing even faster to become the majority of retractions (to an overall rate of 0.02% of published papers)
- There’s a larger and growing “publication bias” problem – in other words, an increasing unreliability of published findings
- One problem: initial “strong effects” get published in high-ranking journals, but subsequent “weak results” (which are probably more reasonable) are published in low-ranking journals
- The formal “Impact Factor” (IF) metric for rank is highly correlated to “journal rank”, defined below.
- There’s a higher incidence of retraction in high-ranking (measured through “high IF”) journals.
- “A meta-analysis of genetic association studies provides evidence that the extent to which a study over-estimates the likely true effect size is positively correlated with the IF of the journal in which it is published”
- Can the higher incidence of retractions in high-ranking journals be explained by the higher visibility of those journals? They think not. Journal rank is a bad predictor of future citations, for example. [mathbabe inserts her opinion: this part needs more argument.]
- “…only the most highly selective journals such as Nature and Science come out ahead over unselective preprint repositories such as ArXiv and RePEc”
- Are there other measures of excellence that would correlate with IF? Methodological soundness? Reproducibility? No: “In fact, the level of reproducibility was so low that no relationship between journal rank and reproducibility could be detected.”
- More about Impact Factor: The IF is a metric for the number of citations to articles in a journal (the numerator), normalized by the number of articles in that journal (the denominator). Sounds good! But:
- For a given journal, IF is not calculated but is negotiated – the publisher can (and does) exclude certain articles (but not citations). Even retroactively!
- The IF is also not reproducible – errors are found and left unexplained.
- Finally, IF is likely skewed by the fat-tailedness of citations (certain articles get lots, most get few). Wouldn’t a more robust measure be given by the median?
- Journal rank is a weak to moderate predictor of scientific impact
- Journal rank is a moderate to strong predictor of both intentional and unintentional scientific unreliability
- Journal rank is expensive, delays science and frustrates researchers
- Journal rank as established by IF violates even the most basic scientific standards, but predicts subjective judgments of journal quality
- “IF generates an illusion of exclusivity and prestige based on an assumption that it will predict subsequent impact, which is not supported by empirical data.”
- “Systemic pressures on the author, rather than increased scrutiny on the part of the reader, inflate the unreliability of much scientific research. Without reform of our publication system, the incentives associated with increased pressure to publish in high-ranking journals will continue to encourage scientists to be less cautious in their conclusions (or worse), in an attempt to market their research to the top journals.”
- “It is conceivable that, for the last few decades, research institutions world-wide may have been hiring and promoting scientists who excel at marketing their work to top journals, but who are not necessarily equally good at conducting their research. Conversely, these institutions may have purged excellent scientists from their ranks, whose marketing skills did not meet institutional requirements. If this interpretation of the data is correct, we now have a generation of excellent marketers (possibly, but not necessarily also excellent scientists) as the leading figures of the scientific enterprise, constituting another potentially major contributing factor to the rise in retractions. This generation is now in charge of training the next generation of scientists, with all the foreseeable consequences for the reliability of scientific publications in the future.”
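To see why the fat-tailed-citations point matters, here’s a tiny Python sketch with invented citation counts: the mean, which is essentially what the IF is built on, gets dragged way up by one blockbuster article, while the median describes what a typical article in the journal actually gets.

```python
from statistics import mean, median

# Invented citation counts for one journal's articles in a window:
# most articles get a handful of citations, one is a blockbuster.
citations = [0, 1, 1, 2, 2, 3, 3, 4, 5, 250]

impact_factor_style = mean(citations)   # pulled way up by the one big hit
median_citations = median(citations)    # what a typical article gets
```

Here the mean is more than ten times the median, so ranking journals by an IF-style average mostly rewards landing the occasional huge hit, not publishing reliably useful papers.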
The authors suggest that we need a new kind of publishing platform. I wonder what they’d think of the Episciences Project.
The past: Money in politics
First things first: I went to the Bicoastal Datafest a few weekends ago and haven’t reported back. Mostly that’s because I got sick and didn’t go on the second day, but luckily other people did, like Kathy Kiely from the Sunlight Foundation, who wrote up this description of the event and the winning teams’ projects.
And hey, it turns out that my new company shares an office with Harmony Institute, whose data scientist Burton DeWilde was on the team that won “Best in Show” for their orchestral version of the federal government’s budget.
Another writeup of the event comes by way of Michael Lawson, who worked on the team that set up an accounting fraud detection system through Benford’s Law. I might be getting a guest blog post about this project through another one of its team members soon.
And we got some good progress on our DataKind/ Sunlight Foundation money-in-politics project as well, thanks to DataKind intern Pete Darche and math nerds Kevin Wilson and Johan de Jong.
The future one week from now: Occupy
It’s a combination of an Occupy event and a datafest, so obviously I am going to try to go. The theme is general – data for the 99% – but there’s a discussion on this listserv as to the various topics people might want to focus on (Aaron Swartz and Occupy Sandy are coming up for example). I’m looking forward to reporting back (or reporting other people’s report-backs if my kids don’t let me go).
The future two weeks from now: Climate change
Finally, there’s this datathon, which doesn’t look open to registration, but which I’ll be participating in through my work. Its stated goal is “to explore how social and meteorological data can be combined to enhance social science research on climate change and cities.” The datathon will run Saturday March 9th – Sunday March 10th, 2013, starting noon Saturday, with final presentations at noon Sunday. I’ll try to report back on that as well.
I wanted to share with you guys a project I’ve been involved with started by John Spens of ThoughtWorks regarding data collection and open analysis around guns and gun-related violence. John lives in Connecticut and has friends who were directly affected by the massacre in Newtown. Here is John’s description of the project:
I initiated the Sandy Hook Project in response to this need for information. The purpose of this project is to produce rigorous and transparent analyses of data pertaining to gun-related violence. My goal is to move beyond the rhetoric and emotion and produce (hopefully) objective insight into the relationship between guns and violence in the US. I realize that objectivity will be challenging, which is why I want to share the methods and the data openly so others can validate or refute my findings as well as contribute their own.
I’ve put the project on GitHub. (https://github.com/john-m-spens/SandyHookProject). While it’s not designed as a data repository, I think the ubiquitous nature of GitHub and the control enforced through the code repository model will support effective collaboration.
John has written a handful of posts about statistics and guns, including A Brief Analysis of Firearm-related Homicide Rates and Investigating Statistics Regarding Right to Carry Laws.
In addition to looking into the statistics that exist, John wants to address the conversation itself. As he said in his most recent post:
What limited data and analysis that exists is often misrepresented and abused, and is given much less attention than anecdotal evidence. It is relatively simple to produce a handful of cases that support either side in this debate. What we really need is to understand the true impact of guns on our society. Push-back by the NRA claiming that any such research would be “political opinion masquerading as medical science” is unacceptable. We can only make intelligent decisions when we have the fullest picture possible.
John is looking for nerd collaborators who can help him with data collection and open analysis. He’s also hoping to have a weekend datafest to work on this project in March, so stay tuned if you want to be part of that!
I’ve been talking a lot recently, with various people and on this blog, about data and model privacy. It seems like individuals, who should have the right to protect their data, don’t seem to, but huge private companies, with enormous powers over the public, do.
Another example: models working on behalf of the public, like Fed stress tests and other regulatory models, seem essentially publicly known, which is useful indeed to the financial insiders, the very people who are experts on gaming systems.
Google search has a deeply felt power over the public, and arguably needs to be understood for the consistent threat it poses to people’s online environment. It’s a scary thought experiment to imagine what could be done with it, and after all why should we blindly trust a corporation to have our best intentions in mind? Maybe it’s time to call for the Google search model to be open source.
But what would that look like? At first blush we might imagine forcing them to actually open up their source code. But at this point that code must be absolutely enormous, unreadable, and written specifically for their uniquely massive machine set-up. In other words, totally overwhelming and useless (as my friend Suresh might say, the singularity has already happened and this is what it looks like (update: Suresh credits Cosma)).
When you consider how few people would actually be able to make sense of the underlying code base, you quickly realize that opening it up would be meaningless for the task of protecting the public. Instead, we’d want to make the code accessible in some way.
But I claim that’s exactly what Google does, by allowing everyone to search using the model from anywhere. In other words, it’s on us, the public, to run experiments to understand what the underlying model actually does. We have the tools, let’s get going.
If we think there’s inherent racism in Google searches, then we should run experiments like Nathan Newman recently did, examining the different ads that pop up when someone writes an email about buying a car, for example, with different names and in different zip codes. We should organize to change our zip codes, our personas (which would mean deliberately creating personas and gmail logins, etc.), and our search terms, and see how the Google search results change as our inputs change.
After all, I don’t know what’s in the code base but I’m pretty sure there’s no sub-routine that’s called “add_racism_to_search”; instead, it’s a complicated Rube-Goldberg machine that should be judged by its outputs, in a statistical way, rather than expected to prescriptively describe how it treats things on a case-by-case basis.
Another thing: I don’t think there are bad intentions on the part of the modelers, but that doesn’t mean there aren’t bad consequences – the model is too complicated for anyone to anticipate exactly how it acts unless they perform experiments to test it. In the meantime, we need to distinguish between the intentions and the results. So, for example, in the update to Nathan Newman’s experiments with Google mail, Google responded with this:
This post relies on flawed methodology to draw a wildly inaccurate conclusion. If Mr. Newman had contacted us before publishing it, we would have told him the facts: we do not select ads based on sensitive information, including ethnic inferences from names.
And then Newman added this:
Now, I’m happy to hear Google doesn’t “select ads” on this basis, but Google’s words seem chosen to allow a lot of wiggle room (as such Google statements usually seem to). Do they mean that Google algorithms do not use the ethnicity of names in ad selection or are they making the broader claim that they bar advertisers from serving up different ads to people with different names?
My point is that it doesn’t matter what Google says it does or doesn’t do, if statistically speaking the ads change depending on ethnicity. It’s a moot argument what they claim to do if what actually happens, the actual output of their Rube-Goldberg machine, is racist.
And I’m not saying Google’s models are definitively racist, by the way, since Newman’s efforts were small, the efforts of one man, and there were not thousands and thousands of tests but only a few. But his approach to understanding the model was certainly correct, and it’s a cause that technologists and activists should take up and expand on.
Mathematically speaking, it’s already as open-source as we need it to be to understand it, although in a dual sense to the one people are used to thinking about. Actually, it defines the gold standard of open-source: instead of getting a bunch of gobbledy-gook that we can’t process and that depends on enormously large data that changes over time, we get real-time access to the newest version that even a child can use.
I only wish that other public-facing models had such access. Let’s create a large-scale project like SETI to understand the Google search model.
I’m learning a bit of R in my current stint at ThoughtWorks. Coming from python, I was happy to see most of the plotting functions are very similar, as well as many of the vector-level data handling functions. Besides the fact that lists start at 1 instead of 0, things were looking pretty familiar.
But then I came across something that totally changed my mind. In R they have these data frames, which are like massive Excel spreadsheets: very structured matrices with named columns and rows, on which you can perform parallelized operations.
One thing I noticed right away about these rigid data structures is that they make handling missing data very easy. So if you have a huge data frame where a few rows are missing a few data points, then one command, na.omit, gets rid of your problem. Sometimes you don’t even need that, you can just perform your operation on your NA’s, and you just get back more NA’s where appropriate.
This ease-of-use for crappy data is good and bad: good because it’s convenient, bad because you never feel the pain of missing data. When I use python, I rely on dictionaries of dictionaries (of dictionaries) to store my data, and I have to make specific plans for missing data, which means it’s a pain but also that I have to face up to bad data directly.
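The contrast described above can be sketched in plain Python. This is a toy example of what dropping incomplete rows looks like when you have to do it by hand with dicts, the way R’s na.omit does it for you in one call:

```python
# A small table as a dict of row-dicts; None plays the role of R's NA.
rows = {
    "a": {"x": 1.0, "y": 2.0},
    "b": {"x": None, "y": 3.0},   # missing a data point
    "c": {"x": 4.0, "y": 5.0},
}

# Hand-rolled equivalent of R's na.omit: drop any row with a missing field.
complete = {key: row for key, row in rows.items()
            if all(value is not None for value in row.values())}

print(sorted(complete))  # rows "a" and "c" survive; "b" is dropped
```

Writing that filter yourself, every time, is exactly the “pain” that forces you to face up to bad data directly.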
But that’s not why I think R is somewhat like SQL. It’s really because of how bad “for” loops are in R.
So I was trying to add a new column to my rather large (~65,000 row) dataframe. Adding a column is very easy indeed, if the value in the new column is a simple function of the values in the current columns, because of the way you can parallelize operations. So if the new value is the square of the first column value plus the second column value, it can do it on the whole columns all at once and it’s super fast.
In my case, though, the new value required a look-up in the table itself, which may or may not work, and then required a decision depending on whether it worked. For the life of me I couldn’t figure out how to do it using iterated “apply” or “lapply” functions in the existing dataframe. Of course it’s easy to do using a “for” loop, but that is excruciatingly slow.
Finally I realized I needed to think like a SQL programmer, and build a new dataframe which consisted of the look-up row, if it existed, along with a unique identifier in common with the row I start with. Then I merged the two dataframes, which is like a SQL join, using that unique identifier as the pivot. This would never happen in python with a dataset of this size, because dictionaries are very unstructured and fast.
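For the curious, here’s a rough Python analogue of that maneuver, with invented toy data: build an index on the shared identifier, then join, with missing look-ups getting None (the analogue of the NAs that R’s merge produces for unmatched rows).

```python
# Two "dataframes" as lists of row-dicts sharing a unique identifier.
main = [
    {"id": 1, "value": 10},
    {"id": 2, "value": 20},
    {"id": 3, "value": 30},
]
lookup = [
    {"id": 1, "extra": "found"},
    {"id": 3, "extra": "also found"},
]

# Index the look-up table on the join key, then do the SQL-style left join:
# rows with no match get None where R's merge would put NA.
index = {row["id"]: row["extra"] for row in lookup}
merged = [dict(row, extra=index.get(row["id"])) for row in main]

print(merged)
```

In Python the dict index makes each look-up fast on its own, which is why the whole-table merge trick never feels necessary there.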
Easy peasy lemon squeezy, once you understand it, but it made me realize that the approach to learning a new language by translating each word really doesn’t work. You need to think like a Parisian to really speak French.
In yesterday’s New York Times Science section, there was an article called “Life in the Red” (hat tip Becky Jaffe) about people’s behavior when they are in debt, summed up by this:
The usual explanations for reckless borrowing focus on people’s character, or social norms that promote free spending and instant gratification. But recent research has shown that scarcity by itself is enough to cause this kind of financial self-sabotage.
“When we put people in situations of scarcity in experiments, they get into poverty traps,” said Eldar Shafir, a professor of psychology and public affairs at Princeton. “They borrow at high interest rates that hurt them, in ways they knew to avoid when there was less scarcity.”
The psychological burden of debt not only saps intellectual resources, it also reinforces the reckless behavior, and quickly, Dr. Shafir and other experts said. Millions of Americans have been keeping the lights on through hard times with borrowed money, running a kind of shell game to keep bill collectors away.
So what we’ve got here is a feedback loop of poverty, which certainly jibes with my observations of friends and acquaintances I’ve seen who are in debt.
I’m guessing the experiments described in the article are not as bad as real life, however.
I say that because I’ve been talking on this blog as well as in my recent math talks about a separate feedback loop involving models, namely the feedback loop whereby people who are judged poor by the model are offered increasingly bad terms on their loans. I call it the death spiral of modeling.
If you think about how these two effects work together – the array of offers gets worse as your vulnerability to bad deals increases – then you start to understand what half of our country is actually living through on a day-to-day basis.
As an aside, I have an enormous amount of empathy for people experiencing this poverty trap. I don’t think it’s a moral issue to be in debt: nobody wants to be poor, and nobody plans it that way.
This opinion article (hat tip Laura Strausfeld), also in yesterday’s New York Times, makes the important point that listening to a bunch of rich, judgmental people like David Bach, Dave Ramsey, and Suze Orman telling us it’s our fault we haven’t finished saving for retirement isn’t actually useful, and suggests that we each choose one money issue to take charge of and sort out.
So my empathetic nerd take on poverty traps is this: how can we quantitatively measure this phenomenon, or more precisely these phenomena, since we’ve identified at least two feedback loops?
One reason it’s hard is that it’s difficult to find natural experiments where some people are subjected to the toxic environment but other people aren’t – it’s the “people who aren’t” category that’s the hard part, of course.
For the vulnerability to bad terms, the article describes the level of harassment that people receive from bill collectors as a factor in how they react, which doesn’t surprise anyone who’s ever dealt with a bill collector. Are there certain people who don’t get harassed for whatever reason, and do they fall prey to bad deals at a different rate? Are there local laws in some places prohibiting certain harassment? Can we go to another country where the bill collectors are reined in and see how people in debt behave there?
Also, in terms of availability of loans, it might be relatively easy to start out with people who live in states with payday loans versus people who don’t, and see how much faster the poverty spiral overtakes people with worse options. Of course, as crappy loans get more and more available online, this proximity study will become moot.
It’s also going to be tricky to tease out the two effects from each other. One is a question of supply and the other is a question of demand, and as we know those two are related.
I’m not answering these questions today, it’s a long-term project that I need your help on, so please comment below with ideas. Maybe if we have a few good ideas and if we find some data we can plan a data hackathon.
I had a great time giving my “Weapons of Math Destruction” talk in San Diego, and the audience was fantastic and thoughtful.
One question that someone asked was whether the US News & World Reports college ranking model should be forced to be open sourced – wouldn’t that just cause colleges to game the model?
First of all, colleges are already widely gaming the model and have been for some time. And that gaming is a distraction and has been heading colleges in directions away from good instruction, which is a shame.
And if you suggest that they change the model all the time to prevent this, then you’ve got an internal model of this model that needs adjustment. They might be tinkering at the edges but overall it’s quite clear what’s going into the model: namely, graduation rates, SAT scores, number of Ph.D’s on staff, and so on. The exact percentages change over time but not by much.
The impact that this model has had on education and how universities apportion resources has been profound. Academic papers have been written on the law school version of this story.
Moreover, the tactics that US News & World Reports uses to enforce their dominance of the market are bullying, as you can learn from the President of Reed College, which refuses to be involved.
Back to the question. Just as I realize that opening up all data is not reasonable or desirable, because first of all there are serious privacy issues but second of all certain groups have natural advantages to openly shared resources, it is also true that opening up all models is similarly problematic.
However, certain data should surely be open: for example, the laws of our country, that we are all responsible to know, should be freely available to us (something that Aaron Swartz understood and worked towards). How can we be held responsible for laws we can’t read?
Similarly, public-facing models, such as credit scoring models and teacher value-added models, should absolutely be open and accessible to the public. If I’m being judged and measured and held accountable by some model in my daily life as a citizen, that has real impact on how my future will unfold, then I should know how that process works.
And if you complain about the potential gaming of those public-facing models, I’d answer: if they are gameable then they shouldn’t be used, considering the impact they have on so many people’s lives. Because a gameable model is a weak model, with proxies that fail.
Another way to say this is we should want someone to “game” the credit score model if it means they pay their bills on time every month (I wrote about this here).
Back to the US News & World Report model. Is it public facing? I’m no lawyer but I think a case can be made that it is, and that the public’s trust in this model makes it a very important model indeed. Evidence can be gathered by measuring the extent to which colleges game the model, which they only do because the public cares so much about the rankings.
Even so, what difference would that make, to open it up?
In an ideal world, where the public is somewhat savvy about what models can and cannot do, opening up the US News & World Reports college ranking model would result in people losing faith in it. They’d realize that it’s no more valuable than an opinion from a highly vocal uncle of theirs who is obsessed with certain metrics and blind to individual eccentricities and curriculums that may be a perfect match for a non-conformist student. It’s only one opinion among many, and not to be religiously believed.
But this isn’t an ideal world, and we have a lot of work to do to get people to understand models as opinions in this sense, and to get people to stop trusting them just because they’re mathematically presented.
I just signed up for an upcoming datafest called “Big Data, Big Money, and You” which will be co-hosted at Columbia University and Stanford University on February 2nd and 3rd.
The idea is to use data from:
- National Institute on Money in State Politics,
- Open States,
- Pew Research Center,
- The Center for Responsive Politics,
- State Integrity, and
- The Sunlight Foundation
and open source tools such as R, python, and various APIs to model and explore various issues in the intersection of money and politics. Among those listed are things like: “look for correlation between the subject of bills introduced to state legislatures to big companies within those districts and campaign donations” and “comparing contributions pre and post redistricting”.
As usual, a weekend-long datafest is just the beginning of a good data exploration: if you’re interested in this, think of this as an introduction to the ideas and the people involved; it’s just as much about networking with like-minded people as it is about finding an answer in two days.
So sign up, come on by, and get ready to roll up your sleeves and have a great time for that weekend, but also make sure you get people’s email addresses so you can keep in touch as things continue to develop down the road.
This is a guest post. Crossposted at aluation.
I’m a bit late to this conversation, but I was reminded by Cathy’s post over the weekend on open data – which most certainly is not a panacea – of my own experience a couple of years ago with a group that is trying hard to do the right thing with open data.
The UN funded a new initiative in 2009 called Global Pulse, with a mandate to explore ways of using Big Data for the rapid identification of emerging crises as well as for crafting more effective development policy in general. Their working hypothesis at its most simple is that the digital traces individuals leave in their electronic life – whether through purchases, mobile phone activity, social media or other sources – can reveal emergent patterns that can help target policy responses. The group’s website is worth a visit for anyone interested in non-commercial applications of data science – they are absolutely the good guys here, doing the kind of work that embodies the social welfare promise of Big Data.
With that said, I think some observations about their experience in developing their research projects may shed some light on one of Cathy’s two main points from her post:
- How “open” is open data when there are significant differences in both the ability to access the data, and more important, in the ability to analyze it?
- How can we build in appropriate safeguards rather than just focusing on the benefits and doing general hand-waving about the risks?
I’ll focus on Cathy’s first question here since the second gets into areas beyond my pay grade.
The Global Pulse approach to both sourcing and data analytics has been to rely heavily on partnerships with academia and the private sector. To Cathy’s point above, this is true of both closed data projects (such as those that rely on mobile phone data) as well as open data projects (those that rely on blog posts, news sites and other sources). To take one example, the group partnered with two firms in Cambridge to build a real-time indicator of bread prices in Latin America. The data in this case was open, while the web-scraping analytics (generally using grocery-store website prices) were developed and controlled by the vendors. As someone who is very interested in food prices, I found their work fascinating. But I also found it unsettling that the only way to make sense of this open data – to turn it into information, in other words – was through the good will of a private company.
The same pattern of open data and closed analytics characterized another project, which tracked Twitter in Indonesia for signals of social distress around food, fuel prices, health and other issues. The project used publicly available Twitter data, so it was open to that extent, though the sheer volume of data and the analytical challenges of teasing meaningful patterns out of it called for a powerful engine. As we all know, web-based consumer analytics are far ahead of the rest of the world in terms of this kind of work. And that was precisely where Global Pulse rationally turned – to a company that has generally focused on analyzing social media on behalf of advertisers.
Does this make them evil? Of course not – as I said above, Global Pulse are the good guys here. My point is not about the nature of their work but about its fragility.
The group’s Director framed their approach this way in a recent blog post:
We are asking companies to consider a new kind of CSR – call it “data philanthropy.” Join us in our efforts by making anonymized data sets available for analysis, by underwriting technology and research projects, or by funding our ongoing efforts in Pulse Labs. The same technologies, tools and analysis that power companies’ efforts to refine the products they sell, could also help make sure their customers are continuing to improve their social and economic wellbeing. We are asking governments to support our efforts because data analytics can help the United Nations become more agile in understanding the needs of and supporting the most vulnerable populations around the globe, which in turn boosts the global economy, benefiting people everywhere.
What happens when corporate donors are no longer willing to be data philanthropists? And a question for Cathy – how can we ensure that these new Data Science programs like the one at Columbia don’t end up just feeding people into consumer analytics firms, in the same way that math and econ programs ended up feeding people into Wall Street jobs?
I don’t have any answers here, and would be skeptical of anyone who claimed to. But the answers to these questions will likely define a lot of the gap between the promise of open data and whatever it ends up becoming.
I’ve talked a lot recently about how there’s an information war currently being waged on consumers by companies that troll the internet and collect personal data, search histories, and other “attributes” into data warehouses, where it all gets sold to the highest bidders.
It’s natural to want to balance out this information asymmetry somehow. One such approach is open data, defined in Wikipedia as the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control.
I’m going to need more than one blog post to think this through, but I wanted to make two points this morning.
The first is my issue with the phrase “freely available to everyone to use”. What does that mean? Having worked in futures trading, where we put trading machines and algorithms in close proximity with exchanges for large fees so we can get to the market data a few nanoseconds before anyone else, it’s clear to me that availability and access to data is an incredibly complicated issue.
And it’s not just about speed. You can have hugely important, rich, and large data sets sitting in a lump on a publicly available website like Wikipedia, and if you don’t have fancy parsing tools and algorithms you’re not going to be able to make use of them.
When important data goes public, the edge goes to the most sophisticated data engineer, not the general public. The Goldman Sachs’s of the world will always know how to make use of “freely available to everyone” data before the average guy.
Which brings me to my second point about open data. It’s general wisdom that we should hope for the best but prepare for the worst. My feeling is that as we move towards open data we are doing plenty of the hoping part but not enough of the preparing part.
If there’s one thing I learned working in finance, it’s not to be naive about how information will be used. You’ve got to learn to think like an asshole to really see what to worry about. It’s a skill which I don’t regret having.
So, if you’re giving me information on where public schools need help, I’m going to imagine using that information to cut off credit for people who live nearby. If you tell me where environmental complaints are being served, I’m going to draw a map and see where they aren’t being served so I can take my questionable business practices there.
I’m not saying proponents of open data aren’t well-meaning, they often seem to be. And I’m not saying that the bad outweighs the good, because I’m not sure. But it’s something we should figure out how to measure, and in this information war it’s something we should keep a careful eye on.
The stacks project is awesome: it explains the theory of stacks thoroughly, assuming only that you have a basic knowledge of algebra and a shitload of time to read. It’s about three thousand pages long (update: it’s exactly 3,452 pages), and it has a bunch of contributors besides Johan. I’m on the list most likely because I helped him develop the tag system, which allows permanent references to theorems and lemmas even within an evolving latex manuscript.
Speaking of latex, that’s what I wanted to mention today.
Recently a guy named Pieter Belmans has been helping Johan out with development for the site: spiffing it up and making it look more professional. The most recent thing he did was to render the latex into human-readable form using the XyJax package, which is an “almost xy-pic compatible package for MathJax“. I think they are understating the case; it looks great to me:
I find myself every other day in a conversation with people about the massive online open course (MOOC) movement.
People often want to complain about the quality of this education substitute. They say that students won’t get the one-on-one interaction between the professor and student that is required to really learn. They complain that we won’t know if someone really knows something if they only took a MOOC or two.
First of all, this isn’t going away, nor should it: it’s many people’s only opportunity to learn this stuff. It’s not like MIT has plans to open 4,000 campuses across the world. It’s really awesome that rural villagers (with internet access) all over the world can nevertheless take MIT classes through edX.
Second, if we’re going to put this new kind of education under the microscope, let’s put the current system under the microscope too. Many of the people fretting about the quality of MOOC education are themselves products of super elite universities, and probably don’t know what the average student’s experience actually is. Turns out not everyone gets a whole lot of attention from their professors.
Even at elite institutions, there are plenty of masters programs which are treated as money machines for the university and where the quality and attention of the teaching is a secondary concern. If certain students decide to forgo the thousands of dollars and learn the stuff just as well online, then that would be a good thing (for them at least).
Some things I think are inevitable:
- Educational institutions will increasingly need to show they add value beyond free MOOC experiences. This will be an enormous market force for all but the most elite universities.
- Instead of seeing where you went to school, potential employers will directly test knowledge of candidates. This will mean weird things like you never actually have to learn a foreign language or study Shakespeare to get a job, but it will be good for the democratization of education in general.
- Professors will become increasingly scarce as the role of the professor is decreased.
- One-on-one time with masters of a subject will become increasingly rare and expensive. Only truly elite students will have the mythological education experience.
In the final week of Rachel Schutt’s Columbia Data Science course we heard from two groups of students as well as from Rachel herself.
Data Science; class consciousness
The first team of presenters consisted of Yegor, Eurry, and Adam. Many others whose names I didn’t write down contributed to the research, visualization, and writing.
First they showed us the very cool graphic explaining how self-reported skills vary by discipline. The data they used came from the class itself, which did this exercise on the first day:
so the star in the middle is the average for the whole class, and each star along the side corresponds to the average (self-reported) skills of people within a specific discipline. The dotted lines on the outside stars show the “average” star, so it’s easier to see how things vary per discipline compared to the average.
Surprises: Business people seem to think they’re really great at everything except communication. Journalists are better at data wrangling than engineers.
We will get back to the accuracy of self-reported skills later.
We were asked, do you see your reflection in your star?
Also, take a look at the different stars. How would you use them to build a data science team? Would you want people who are good at different skills? Is it enough to have all the skills covered? Are there complementary skills? Are the skills additive, or do you need overlapping skills among team members?
If all data which had ever been collected were freely available to everyone, would we be better off?
Some ideas were offered:
- all nude photos are included. [Mathbabe interjects: it's possible to not let people take nude pics of you. Just sayin'.]
- so are passwords, credit scores, etc.
- how do we make secure transactions between a person and her bank considering this?
- what does it mean to be “freely available” anyway?
The data of power; the power of data
You see a lot of people posting crap like this on Facebook:
But here’s the thing: the Berner Convention doesn’t exist. People are posting this to their walls because they care about their privacy. People think they can exercise control over their data but they can’t. Stuff like this gives one a false sense of security.
In Europe the privacy laws are stricter, and you can request data from Irish Facebook and they’re supposed to do it, but it’s still not easy to successfully do.
And it’s not just data that’s being collected about you – it’s data you’re collecting. As scientists we have to be careful about what we create, and take responsibility for our creations.
As Francois Rabelais said,
Wisdom entereth not into a malicious mind, and science without conscience is but the ruin of the soul.
Or as Emily Bell from Columbia said,
Every algorithm is editorial.
We can’t be evil during the day and take it back at hackathons at night. Just as journalists need to be aware that the way they report stories has consequences, so do data scientists. As a data scientist one has impact on people’s lives and how they think.
Here are some takeaways from the course:
- We’ve gained significant powers in this course.
- In the future we may have the opportunity to do more.
- With data power comes data responsibility.
Who does data science empower?
The second presentation was given by Jed and Mike. Again, they had a bunch of people on their team helping out.
Let’s start with a quote:
“Anything which uses science as part of its name isn’t a science: political science, creation science, computer science.”
- Hal Abelson, MIT CS prof
Keeping this in mind, if you could re-label data science, would you? What would you call it?
Some comments from the audience:
- Let’s call it “modellurgy,” the craft of beating mathematical models into shape instead of metal
- Let’s call it “statistics”
Does it really matter what data science is? What should it end up being?
Chris Wiggins from Columbia contends there are two main views of what data science should end up being. The first stems from John Tukey, inventor of the fast Fourier transform and the box plot, and father of exploratory data analysis. Tukey advocated for a style of research he called “data analysis”, emphasizing the primacy of data and therefore computation, which he saw as part of doing statistics. His descriptions of data analysis are very similar to what people call data science today.
The other perspective comes from Jim Gray, Computer Scientist from Microsoft. He saw the scientific ideals of the enlightenment age as expanding and evolving. We’ve gone from the theories of Darwin and Newton to the experimental and computational approaches of Turing. Now we have a new science, a data-driven paradigm. It’s actually the fourth paradigm of all the sciences, the first three being experimental, theoretical, and computational. See more about this here.
Wait, can data science be both?
Note it’s difficult to stick Computer Science and Data Science on this line.
Statistics is a tool that everyone uses. Data science also could be seen that way, as a tool rather than a science.
Who does data science?
Here’s a graphic showing the make-up of Kaggle competitors. Teams of students collaborated to collect, wrangle, analyze and visualize this data:
The size of the blocks corresponds to how many people in active competitions have an education background in a given field. We see that almost a quarter of competitors are computer scientists. The shading corresponds to how often they compete. So we see the business/finance people do more competitions on average than the computer science people.
Consider this: the only people doing math competitions are math people. If you think about it, it’s kind of amazing how many different backgrounds are represented above.
We got some cool graphics created by the students who collaborated to get the data, process it, visualize it and so on.
Which universities offer courses on Data Science?
By 2013 there will be 26 universities in total offering data science courses. The balls are centered at the center of gravity of a given state, and the balls are bigger if there are more in that state.
Where are data science jobs available?
- We see more professional schools offering data science courses on the west coast.
- It would also be interesting to see this corrected for population size.
- Only two states had no jobs.
- Massachusetts #1 per capita, then Maryland
McKinsey says there will be hundreds of thousands of data science jobs in the next few years. There’s a massive demand in any case. Some of us will be part of that. It’s up to us to make sure what we’re doing is really data science, rather than validating previously held beliefs.
We need to advance human knowledge if we want to take the word “scientist” seriously.
How did this class empower you?
You are one of the first people to take a data science class. There’s something powerful there.
Thank you Rachel!
Last Day of Columbia Data Science Class, What just happened? from Rachel’s perspective
Recall the stated goals of this class were:
- learn about what it’s like to be a data scientist
- be able to do some of what a data scientist does
Hey we did this! Think of all the guest lectures; they taught you a lot of what it’s like to be a data scientist, which was goal 1. Here’s what I wanted you guys to learn before the class started based on what a data scientist does, and you’ve learned a lot of that, which was goal 2:
Mission accomplished! Mission accomplished?
Thought experiment that I gave to myself last Spring
How would you design a data science class?
Comments I made to myself:
- It’s not a well-defined body of knowledge: no set subject, no textbook!
- It’s popularized and celebrated in the press and media, but there’s no “authority” to push back
- I’m intellectually disturbed by the idea of teaching a course when the body of knowledge is ill-defined
- I didn’t know who would show up, and what their backgrounds and motivations would be
- Could it become redundant with a machine learning class?
I asked questions of myself and of other people. I gathered information, and endured existential angst about data science not being a “real thing.” I needed to give it structure.
Then I started to think about it this way: while I recognize that data science has the potential to be a deep research area, it’s not there yet, and in order to actually design a class, let’s take a pragmatic approach: Recognize that data science exists. After all, there are jobs out there. I want to help students to be qualified for them. So let me teach them what it takes to get those jobs. That’s how I decided to approach it.
In other words, from this perspective, data science is what data scientists do. So it’s back to the list of what data scientists do. I needed to find structure on top of that, so the structure I used as a starting point were the data scientist profiles.
Data scientist profiles
This was a way to think about your strengths and weaknesses, as well as a link between speakers. Note it’s easy to focus on “technical skills,” but it can also be problematic in being too skills-based, as well as being problematic because it has no scale, and no notion of expertise. On the other hand it’s good in that it allows for and captures variability among data scientists.
I assigned each weekly guest speaker a topic related to their strengths. We held lectures, labs, and (optional) problem sessions. From this you got mad skillz:
- programming in R
- some python
- you learned some best practices about coding
From the perspective of machine learning,
- you know a bunch of algorithms: linear regression, logistic regression, k-nearest neighbors, k-means, naive Bayes, and random forests
- you know what they are, what they’re used for, and how to implement them
- you learned machine learning concepts like training sets, test sets, over-fitting, bias-variance tradeoff, evaluation metrics, feature selection, supervised vs. unsupervised learning
- you learned about recommendation systems
- you’ve entered a Kaggle competition
Importantly, you now know that if there is an algorithm and model that you don’t know, you can (and will) look it up and figure it out. I’m pretty sure you’ve all improved relative to how you started.
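Since k-nearest neighbors came up repeatedly, here's a minimal from-scratch sketch. This isn't the implementation we used in lab; `knn_predict` and the toy training set are invented for illustration.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest training points.
    `train` is a list of (point, label) pairs; points are tuples of floats."""
    by_distance = sorted(train, key=lambda pl: math.dist(pl[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters.
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(knn_predict(train, (0.5, 0.5)))  # → a
print(knn_predict(train, (5.5, 5.5)))  # → b
```

The point of writing it out by hand: the algorithm is a few lines; the hard parts in practice are choosing k, the distance metric, and the features.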
You’ve learned some data viz by working through the FlowingData tutorials.
You’ve learned statistical inference, because we discussed
- observational studies,
- causal inference, and
- experimental design.
- We also learned some maximum likelihood topics, but I’d urge you to take more stats classes.
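As a small reminder of the maximum likelihood topics, here's a sketch that recovers the Bernoulli MLE numerically. The data and the grid search are made up for illustration; in practice you'd use the closed form, which is just the sample mean.

```python
import math

def bernoulli_log_likelihood(p, data):
    # Sum of log P(x | p) over observations x in {0, 1}.
    return sum(math.log(p if x == 1 else 1 - p) for x in data)

data = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]  # 7 successes out of 10

# Grid search over candidate values of p; the likelihood p^7 (1-p)^3
# peaks at the sample mean, 0.7.
candidates = [i / 100 for i in range(1, 100)]
p_hat = max(candidates, key=lambda p: bernoulli_log_likelihood(p, data))
print(p_hat)  # → 0.7
```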
In the realm of data engineering,
- we showed you MapReduce and Hadoop
- we worked with 30 separate shards
- we used an API to get data
- we spent time cleaning data
- we’ve processed different kinds of data
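The map-reduce pattern can be mimicked in a few lines of plain Python. This word-count sketch is illustrative only; it elides everything Hadoop actually provides at scale (distribution across machines, fault tolerance, the shuffle over the network).

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Emit (word, 1) for every word in the line.
    return [(word, 1) for word in line.lower().split()]

def reducer(word, counts):
    # Sum all the counts shuffled to this word's key.
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Map, shuffle (group by key), reduce -- the shape Hadoop automates.
shuffled = defaultdict(list)
for word, count in chain.from_iterable(mapper(line) for line in lines):
    shuffled[word].append(count)
word_counts = dict(reducer(w, c) for w, c in shuffled.items())
print(word_counts["the"])  # → 3
```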
As for communication,
- you wrote thoughts in response to blog posts
- you observed how different data scientists communicate or present themselves, and have different styles
- your final project required communicating with each other
As for domain knowledge,
- lots of examples were shown to you: social networks, advertising, finance, pharma, recommender systems, the Dallas Art Museum
People have been asking: why didn’t we see more data science coming from nonprofits, governments, and universities? Note that the term “data science” was born in for-profits. But the truth is, I’d also like to see more of that. It’s up to you guys to go get that done!
How do I measure the impact of this class I’ve created? Is it possible to incubate awesome data science teams in the classroom? I might have taken you from point A to point B, but you might have gotten there anyway without me. There’s no counterfactual!
Can we set this up as a data science problem? Can we use a causal modeling approach? This would require finding students who were more or less like you but didn’t take this class and use propensity score matching. It’s not a very well-defined experiment.
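Here's a toy sketch of what the matching step might look like, assuming propensity scores (each student's estimated probability of taking the class, given their covariates) have already been estimated. All numbers are invented, and real matching would also need a propensity model, calipers, and balance checks.

```python
# Hypothetical (propensity, outcome) pairs for students who took the
# class (treated) and students who didn't (controls).
treated = [(0.8, 72), (0.6, 65), (0.4, 70)]
controls = [(0.78, 68), (0.55, 60), (0.45, 66), (0.2, 50)]

# Match each treated unit to the control with the closest propensity
# score, then estimate the average treatment effect on the treated (ATT).
effects = []
for p_t, y_t in treated:
    p_c, y_c = min(controls, key=lambda pc: abs(pc[0] - p_t))
    effects.append(y_t - y_c)
att = sum(effects) / len(effects)
print(round(att, 2))  # → 4.33
```

The idea is that comparing treated units only to similar-propensity controls approximates the missing counterfactual, under strong assumptions (no unmeasured confounders).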
But the goal is important: in industry they say you can’t learn data science in a university, that it has to be on the job. But maybe that’s wrong, and maybe this class has proved that.
What has been the impact on you or to the outside world? I feel we have been contributing to the broader discourse.
Does it matter if there was impact? and does it matter if it can be measured or not? Let me switch gears.
What is data science again?
Data science could be defined as:
- A set of best practices used in tech companies, which is how I chose to design the course
- A space of problems that could be solved with data
- A science of data where you can think of the data itself as units
The bottom two have the potential to be the basis of a rich and deep research discipline, but in many cases, the way the term is currently used is:
- Pure hype
But how we define it matters less than what I want for you:
- to be problem solvers
- to be question askers
- to think about your process
- to use data responsibly and make the world better, not worse.
More on being problem solvers: cultivate certain habits of mind
Here’s a possible list of things to strive for, taken from here:
Here’s the thing: tons of people can implement k-nearest neighbors, and many do it badly. What matters is that you cultivate the above habits and remain open to continuous learning.
In traditional educational settings, we focus on answers. But what we should probably focus on is how a student behaves when they don’t know the answer. We need to cultivate the qualities that help us find answers.
How would you design a data science class around habits of mind rather than technical skills? How would you quantify it? How would you evaluate? What would students be able to write on their resumes?
Comments from the students:
- You’d need to keep making people do stuff they don’t know how to do, while keeping them excited about it.
- Have people do stuff in their own domains so we keep up wonderment and awe.
- You’d use case studies across industries to see how things work in different contexts
More on being question-askers
Some suggestions on asking questions of others:
- start with the assumption that you’re smart
- don’t assume the person you’re talking to knows more or less than you do. You’re not trying to prove anything.
- be curious like a child, without worrying about appearing stupid
- ask for clarification around notation or terminology
- ask for clarification around process: where did this data come from? how will it be used? why is this the right data to use? who is going to do what? how will we work together?
Some questions to ask yourself
- does it have to be this way?
- what is the problem?
- how can I measure this?
- what is the appropriate algorithm?
- how will I evaluate this?
- do I have the skills to do this?
- how can I learn to do this?
- who can I work with? Who can I ask?
- how will it impact the real world?
Data Science Processes
In addition to being problem-solvers and question-askers, I mentioned that I want you to think about process. Here are a couple processes we discussed in this course:
(1) Real World –> Generates Data –>
–> Collect Data –> Clean, Munge (90% of your time)
–> Exploratory Data Analysis –>
–> Feature Selection –>
–> Build Model, Build Algorithm, Visualize
–> Evaluate –> Iterate –>
–> Impact Real World
(2) Asking questions of yourselves and others –>
Identifying problems that need to be solved –>
Gathering information, Measuring –>
Learning to find structure in unstructured situations –>
Framing Problem –>
Creating Solutions –> Evaluating
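Process (1) can be sketched end to end in a few lines. Everything below, from the fake data to the slope-through-the-origin "model," is a stand-in for illustration, not a recommended pipeline.

```python
import random

# A minimal sketch of process (1): collect -> clean -> split -> model
# -> evaluate -> iterate.
random.seed(0)

# "Collected" raw data: y ≈ 2x plus noise, with one corrupted record.
raw = [(x, 2 * x + random.gauss(0, 1)) for x in range(50)] + [(10, None)]

# Clean/munge: drop records with missing outcomes.
data = [(x, y) for x, y in raw if y is not None]

# Split into training and test sets.
random.shuffle(data)
train, test = data[:40], data[40:]

# "Model": least-squares slope through the origin, fit on training data.
slope = sum(x * y for x, y in train) / sum(x * x for x, y in train)

# Evaluate: mean squared error on the held-out test set. Then iterate!
mse = sum((y - slope * x) ** 2 for x, y in test) / len(test)
print(round(slope, 2), round(mse, 2))
```

Even in a toy like this, most of the structure is about the process (cleaning, splitting, evaluating) rather than the model itself, which echoes the "90% of your time" point above.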
Come up with a business that improves the world, makes money, and uses data.
Comments from the students:
- autonomous self-driving cars you order with a smart phone
- find all the info on people and then show them how to make it private
- social network with no logs and no data retention
10 Important Data Science Ideas
Of all the blog posts I wrote this semester, here’s one I think is important:
Confidence and Uncertainty
Let’s talk about confidence and uncertainty from a couple perspectives.
First, remember that statistical inference means extracting information from data: estimating, modeling, and explaining, but also quantifying uncertainty. Data scientists could benefit from understanding this more deeply. Learn more statistics, and read Ben’s blog post on the subject.
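As a small example of quantifying uncertainty, here's a 95% confidence interval for a mean, using the normal approximation. The data are made up for illustration.

```python
import math
import statistics

# Don't just report the estimate; report the uncertainty around it.
data = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.1, 4.0]
mean = statistics.mean(data)
se = statistics.stdev(data) / math.sqrt(len(data))  # standard error of the mean
ci = (mean - 1.96 * se, mean + 1.96 * se)
print(round(mean, 3), tuple(round(x, 3) for x in ci))
```

For a sample this small, a t-based interval would be more appropriate than the 1.96 normal quantile; the point is simply that the interval, not the point estimate alone, is the honest answer.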
Second, we have the Dunning-Kruger Effect.
Have you ever wondered why people don’t say “I don’t know” when they don’t know something? This is partly explained by an unconscious bias called the Dunning-Kruger effect.
Basically, people who are bad at something have no idea that they are bad at it, and overestimate their competence. People who are very good at something underestimate their mastery of it. Actual competence may weaken self-confidence.
Design an app to combat the Dunning-Kruger effect.
Optimizing your life, Career Advice
What are you optimizing for? What do you value?
- money: you need some minimum to live at the standard of living you want, and you might even want a lot
- time with loved ones and friends
- doing good in the world
- personal fulfillment, intellectual fulfillment
- goals you want to reach or achieve
- being famous, respected, acknowledged
- some weighted function of all of the above. What are the weights?
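The "weighted function" idea can be made concrete with a toy objective. Every weight, option, and rating below is hypothetical; the exercise is figuring out your own.

```python
# Hypothetical weights over value dimensions (they sum to 1 here, but
# they don't have to).
weights = {"money": 0.3, "time_with_loved_ones": 0.3,
           "doing_good": 0.2, "fulfillment": 0.2}

def score(option):
    # `option` maps each value dimension to a 0-10 rating.
    return sum(weights[k] * option[k] for k in weights)

startup = {"money": 8, "time_with_loved_ones": 3,
           "doing_good": 5, "fulfillment": 9}
nonprofit = {"money": 4, "time_with_loved_ones": 7,
             "doing_good": 9, "fulfillment": 7}
print(round(score(startup), 1), round(score(nonprofit), 1))  # → 6.1 6.5
```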
What constraints are you under?
- external factors (factors outside of your control)
- your resources: money, time, obligations
- who you are, your education, strengths & weaknesses
- things you can or cannot change about yourself
There are many possible solutions that optimize what you value and take into account the constraints you’re under.
So what should you do with your life?
Remember that whatever you decide to do is not permanent, so don’t feel too anxious about it. You can always do something else later; people change jobs all the time.
But on the other hand, life is short, so always try to be moving in the right direction (optimizing for what you care about).
If you feel your way of thinking or perspective is somehow different from what those around you are thinking, then embrace and explore that; you might be onto something.
I’m always happy to talk to you about your individual case.
Next Gen Data Scientists
The second blog post I think is important is this “manifesto” that I wrote:
Next-Gen Data Scientists. That’s you! Go out and do awesome things, use data to solve problems, have integrity and humility.
Here’s our class photo!
Last night my 7th-grade son, who is working on a school project about the mathematician Diophantus, walked into the living room with a mopey expression.
He described how Diophantus worked on a series of mathematical texts called Arithmetica, in which he described the solutions to what we now describe as diophantine equations, but which are defined as polynomial equations with strictly integer coefficients, and where the solutions we care about are also restricted to be integers. I care a lot about this stuff because it’s what I studied when I was an academic mathematician, and I still consider this field absolutely beautiful.
What my son was upset about, though, was that of the 13 original books in Arithmetica, only 6 have survived. He described this as “a way of losing progress.” I concur: Diophantus was brilliant, and there may be things we still haven’t recovered from that text.
But it also struck me that my son would be right to worry about this idea of losing progress even today.
We now have things online and often backed up, so you’d think we might never need to worry about this happening again. Moreover, there’s something called the arXiv, where mathematicians and physicists put all, or nearly all, of their papers before they’re published in journals (and many of the papers never make it to journals, but that’s another issue).
But the value lies in the collection, not in any single paper: it’s not all that valuable to have one unreviewed, unpublished math paper in your possession, but it’s very valuable indeed to have all the math papers written in the past 10 years.
If we lost access to that collection, we as a community would lose progress in a huge way.
Note: I’m not accusing the people who run arXiv of anything weird. I’m sure they’re very cool, and I appreciate their work in keeping up the arXiv. I just want to acknowledge how much power they have, and how strange it is for an entire field to entrust that power to people they don’t know and didn’t elect in a popular vote.
As I understand it (and I could be wrong, please tell me if I am), the arXiv doesn’t allow crawlers to make backups of the documents. I think this is a mistake, since it increases the public’s reliance on this one resource. It’s not robust, in the same way it wouldn’t be robust if the U.S. depended entirely on a country with unclear motives for its food supply.
Let’s not lose Arithmetica again.