Columbia Data Science course, week 8: Data visualization, broadening the definition of data science, Square, fraud detection
This week in Rachel Schutt’s Columbia Data Science course we had two excellent guest speakers.
The first speaker of the night was Mark Hansen, who recently came from UCLA via the New York Times to Columbia with a joint appointment in journalism and statistics. He is a renowned data visualization expert and also an energetic and generous speaker. We were lucky to have him on a night where he’d been drinking an XXL latte from Starbucks to highlight his natural effervescence.
Mark started by telling us a bit about Gabriel Tarde (1843-1904).
Tarde was a sociologist who believed that the social sciences had the capacity to produce vastly more data than the physical sciences. His reasoning was as follows.
The physical sciences observe from a distance: they typically model or incorporate models which talk about an aggregate in some way – for example, biology talks about the aggregate of our cells. What Tarde pointed out was that this is a deficiency, basically a lack of information. We should instead be tracking every atom.
This is where Tarde points out that in the social realm we can do this, where cells are replaced by people. We can collect a huge amount of information about those individuals.
But wait, are we not missing the forest for the trees when we do this? Bruno Latour weighs in on his take of Tarde as follows:
“But the ‘whole’ is now nothing more than a provisional visualization which can be modified and reversed at will, by moving back to the individual components, and then looking for yet other tools to regroup the same elements into alternative assemblages.”
In 1903, Tarde even foresees the emergence of Facebook, although he refers to a “daily press”:
“At some point, every social event is going to be reported or observed.”
Mark then laid down the theme of his lecture using a 2009 quote of Bruno Latour:
“Change the instruments and you will change the entire social theory that goes with them.”
Kind of like that famous physics cat, I guess, Mark (and Tarde) want us to newly consider
- the way the structure of society changes as we observe it, and
- ways of thinking about the relationship of the individual to the aggregate.
Mark’s Thought Experiment:
As data become more personal, as we collect more data about “individuals”, what new methods or tools do we need to express the fundamental relationship between ourselves and our communities, our communities and our country, our country and the world? Could we ever be satisfied with poll results or presidential approval ratings when we can see the complete trajectory of public opinions, individuated and interacting?
What is data science?
Mark threw up this quote from our own John Tukey:
“The best thing about being a statistician is that you get to play in everyone’s backyard”
But let’s think about that again – is it so great? Is it even reasonable? In some sense, to think of us as playing in other people’s yards, with their toys, is to draw a line between “traditional data fields” and “everything else”.
It’s maybe even implying that all our magic comes from the traditional data fields (math, stats, CS), and we’re some kind of super humans because we’re uber-nerds. That’s a convenient way to look at it from the perspective of our egos, of course, but it’s perhaps too narrow and arrogant.
And it begs the question, what is “traditional” and what is “everything else” anyway?
Mark claims that everything else should include:
- social science,
- physical science,
- geography,
- architecture,
- education,
- information science,
- digital humanities,
- journalism,
- design,
- media art
There’s more to our practice than being technologists, and we need to realize that technology itself emerges out of the natural needs of a discipline. For example, GIS emerges from geographers and text data mining emerges from digital humanities.
In other words, it’s not math people ruling the world, it’s domain practices being informed by techniques growing organically from those fields. When data hits their practice, each practice is learning differently; their concerns are unique to that practice.
Responsible data science integrates those lessons, and it’s not a purely mathematical integration. It could be a way of describing events, for example. Specifically, it’s not necessarily a quantifiable thing.
Bottom-line: it’s possible that the language of data science has something to do with social science just as it has something to do with math.
Processing
Mark then told us a bit about his profile (“expansionist”) and about the language Processing, in answer to a question about what is different when a designer takes up data or starts to code.
He explained it by way of another thought experiment: what is the use case for a language for artists? Students came up with a bunch of ideas:
- being able to specify shapes,
- faithful rendering of what visual thing you had in mind,
- being able to sketch,
- 3-d,
- animation,
- interactivity,
- Mark added publishing – artists must be able to share and publish their end results.
It’s Java-based, with a simple “publish” button, etc. The language is adapted to the practice of artists. He mentioned that teaching designers to code meant, for him, stepping back and talking about iteration, if statements, etc., in other words stuff that seemed obvious to him but is not obvious to someone who is an artist. He needed to unpack his assumptions, which is what’s fun about teaching the uninitiated.
He next moved on to close versus distant reading of texts. He mentioned Franco Moretti from Stanford.
Franco thinks about “distant reading”, which means trying to get a sense of what someone’s talking about without reading line by line. This leads to PCA-esque thinking, a kind of dimension reduction of novels.
In other words, another cool example of how data science should integrate the way the experts in various fields figure it out. We don’t just go into their backyards and play; maybe instead we go in and watch them play, and formalize and inform their process with our bells and whistles. In this way they can teach us new games, games that actually expand our fundamental conceptions of data and the approaches we need to analyze them.
Mark’s favorite viz projects
1) Nuage Vert, Helen Evans & Heiko Hansen: a projection onto a power plant’s steam cloud. The size of the green projection corresponds to the amount of energy the city is using. Helsinki and Paris.
2) One Tree, Natalie Jeremijenko: The artist cloned trees and planted the genetically identical seeds in several areas. Displays among other things the environmental conditions in each area where they are planted.
3) Dusty Relief, New Territories: here the building collects pollution around it, displayed as dust.
4) Project Reveal, New York Times R&D lab: this is a kind of magic mirror which wirelessly connects using facial recognition technology and gives you information about yourself. As you stand at the mirror in the morning you get that “come-to-jesus moment” according to Mark.
5) Million Dollar Blocks, Spatial Information Design Lab (SIDL): there are crime stats overlaid on Google Maps, which are typically painful to look at. The SIDL is headed by Laura Kurgan, and in this piece she flipped the statistics. She went into the prison population data and, for every incarcerated person, looked at their home address, measuring per home how much money the state was spending to keep the people who lived there in prison. She discovered that for some single blocks, the state was spending more than $1,000,000 to keep their residents in prison.
Moral of the above: just because you can put something on the map, doesn’t mean you should. Doesn’t mean there’s a new story. Sometimes you need to dig deeper and flip it over to get a new story.
New York Times lobby: Moveable Type
Mark walked us through a project he did with Ben Rubin for the NYTimes on commission (and he later went to the NYTimes on sabbatical). It’s in the lobby of their midtown headquarters at 8th and 42nd.
It consists of 560 text displays, two walls with 280 on each, and the idea is they cycle through various “scenes” which each have a theme and an underlying data science model.
For example, in one there are waves upon waves of digital ticker-tape like scenes which leave behind clusters of text, and where each cluster represents a different story from the paper. The text for a given story highlights phrases which make a given story different from others in some information-theory sense.
In another scene the numbers coming out of stories are highlighted, so you might see on a given box “18 gorillas”. In a third scene, crossword puzzles play themselves with sounds of pencil and paper.
The display boxes themselves are retro, with embedded linux processors running python, and a sound card on each box, which makes clicky sounds or wavy sounds or typing sounds depending on what scene is playing.
The data taken in is text from NY Times articles, blogs, and search engine activity. Every sentence is parsed using Stanford NLP techniques, which diagrams sentences.
Altogether there are about 15 “scenes” so far, and it’s code so one can keep adding to it. Here’s an interview with them about the exhibit:
Project Cascade: Lives on a Screen
Mark next told us about Cascade, which was joint work with Jer Thorp, data artist-in-residence at the New York Times. Cascade came about from thinking about how people share New York Times links on Twitter. It was done in partnership with bitly.
The idea was to collect enough data so that we could see someone browse, encode the link in bitly, tweet that encoded link, see other people click on that tweet and see bitly decode the link, and then see those new people browse the New York Times. It’s a visualization of that entire process, much as Tarde suggested we should do.
There were of course data decisions to be made: a loose matching of tweets and clicks through time, for example. If 17 different tweets have the same url they don’t know which one you clicked on, so they guess (the guess actually seemed to involve probabilistic matching on time stamps so it’s an educated guess). They used the Twitter map of who follows who. If someone you follow tweets about something before you do then it counts as a retweet. It covers any nytimes.com link.
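As a rough illustration of that kind of loose, timestamp-based matching, here is a toy sketch (made-up column names and data, not the actual Cascade pipeline) that attributes each click to the most recent earlier tweet of the same link:

```python
import pandas as pd

# Hypothetical data: tweets and bitly clicks sharing the same shortened nytimes.com URL.
tweets = pd.DataFrame({
    "url": ["bit.ly/abc", "bit.ly/abc", "bit.ly/xyz"],
    "tweeter": ["alice", "bob", "carol"],
    "tweet_time": pd.to_datetime(["2012-10-01 09:00", "2012-10-01 09:30", "2012-10-01 10:00"]),
}).sort_values("tweet_time")

clicks = pd.DataFrame({
    "url": ["bit.ly/abc", "bit.ly/abc", "bit.ly/xyz"],
    "click_time": pd.to_datetime(["2012-10-01 09:05", "2012-10-01 09:45", "2012-10-01 10:20"]),
}).sort_values("click_time")

# Attribute each click to the most recent earlier tweet of the same URL,
# a crude stand-in for the probabilistic timestamp matching described above.
matched = pd.merge_asof(
    clicks, tweets, left_on="click_time", right_on="tweet_time",
    by="url", direction="backward",
)
print(matched[["url", "click_time", "tweeter"]])
```

The real system presumably handled ambiguity more carefully, but the core idea is the same: an educated guess based on time ordering.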
Here’s a NYTimes R&D video about Project Cascade:
Note: this was done 2 years ago, and Twitter has gotten a lot bigger since then.
Cronkite Plaza
Next Mark told us about something he was working on which just opened 1.5 months ago with Jer and Ben. It’s also news related, but this is projecting on the outside of a building rather than in the lobby; specifically, the communications building at UT Austin, in Cronkite Plaza.
The majority of the projected text is sourced from Cronkite’s broadcasts, but the piece also draws on local closed-captioned news sources. One scene of the project extracts the questions asked during local news, things like “How did she react?” or “What type of dog would you get?”. The project uses 6 projectors.
Goals of these exhibits
They are meant to be graceful and artistic, but should also teach something. At the same time we don’t want to be overly didactic. The aim is to live in between art and information. It’s a funny place: increasingly we see a flattening effect when tools are digitized and made available, so that statisticians can code like a designer (we can make things that look like design) and similarly designers can make something that looks like data.
What data can we get? Be a good investigator: a small polite voice which asks for data usually gets it.
eBay transactions and books
Again working jointly with Jer Thorp, Mark investigated a day’s worth of eBay’s transactions that went through Paypal and, for whatever reason, two years of book sales. How do you visualize this? Take a look at the yummy underlying data:
Here’s how they did it (it’s ingenious). They started with the text of Death of a Salesman by Arthur Miller. They used a mechanical turk mechanism to locate objects in the text that you can buy on eBay.
When an object is found, it gets moved to a special bin labeled “chair” or “flute” or “table.” Once it has collected a few buy-able objects, it then takes those objects and sees where they were all for sale in the day’s worth of transactions, and looks at details on outliers and such. After examining the sales, the code will find a zipcode in some quiet place like Montana.
Then it flips over to the book sales data, looks at all the books bought or sold in that zip code, picks a book (which is also on Project Gutenberg), and begins to read that book and collect “buyable” objects from that. And it keeps going. Here’s a video:
Public Theater Shakespeare Machine
The last thing Mark showed us is joint work with Rubin and Thorp, installed in the lobby of the Public Theater. The piece itself is an oval structure with 37 bladed LED displays, set above the bar.
There’s one blade for each of Shakespeare’s plays. Longer plays are in the long end of the oval, Hamlet you see when you come in.
The data input is the text of each play. Each scene does something different – for example, it might collect noun phrases that have something to do with body from each play, so the “Hamlet” blade will only show a body phrase from Hamlet. In another scene, various kinds of combinations or linguistic constructs are mined:
- “high and mighty” “good and gracious” etc.
- “devilish-holy” “heart-sore” “ill-favored” “sea-tossed” “light-winged” “crest-fallen” “hard-favoured” etc.
Note here that the digital humanities, through the MONK Project, offered intense xml descriptions of the plays. Every single word is given hooha and there’s something on the order of 150 different parts of speech.
As Mark said, it’s Shakespeare so it stays awesome no matter what you do, but here we see we’re successively considering words as symbols, or as thematic, or as parts of speech. It’s all data.
Ian Wong from Square
Next, Ian Wong, an “Inference Scientist” at Square who dropped out of an Electrical Engineering Ph.D. program at Stanford, talked to us about Data Science in Risk.
He conveniently started with his takeaways:
- Machine learning is not equivalent to R scripts. ML is founded in math, expressed in code, and assembled into software. You need to be an engineer and learn to write readable, reusable code: your code will be reread more times by other people than by you, so learn to write it so that others can read it.
- Data visualization is not equivalent to producing a nice plot. Rather, think about visualizations as pervasive and part of the environment of a good company.
- Together, they augment human intelligence. We have limited cognitive abilities as human beings, but if we can learn from data, we create an exoskeleton, an augmented understanding of our world through data.
Square
Square was founded in 2009. There were 40 employees in 2010, and there are 400 now. The mission of the company is to make commerce easy. Right now transactions are needlessly complicated. It takes too much to understand and to do, even to know where to start for a vendor. For that matter, it’s too complicated for buyers as well. The question we set out to ask is, how do we make transactions simple and easy?
We send out a white piece of plastic, which we refer to as the iconic square. It’s something you can plug into your phone or iPad. It’s simple and familiar, and it makes it easy to use and to sell.
It’s even possible to buy things hands-free using the square. A buyer can open a tab on their phone so that they can pay by saying their name. Then the merchant taps the buyer’s name on their screen. This makes sense if you are a frequent visitor to a certain store, like a coffee shop.
Our goal is to make it easy for sellers to sign up for Square and accept payments. Of course, it’s also possible that somebody may sign up and try to abuse the service. We are therefore very careful at Square to avoid losing money on sellers with fraudulent intentions or bad business models.
The Risk Challenge
At Square we need to balance the following goals:
- to provide a frictionless and delightful experience for buyers and sellers,
- to fuel rapid growth, and in particular to avoid inhibiting growth through asking for too much information of new sellers, which adds needless barriers to joining, and
- to maintain low financial loss.
Today we’ll just focus on the third goal through detection of suspicious activity. We do this by investing in machine learning and viz. We’ll first discuss the machine learning aspects.
Part 1: Detecting suspicious activity using machine learning
First of all, what’s suspicious? Examples from the class included:
- lots of micro transactions occurring,
- signs of money laundering,
- high frequency or inconsistent frequency of transactions.
Example: Say Rachel has a food truck, but then for whatever reason starts to have $1000 transactions (mathbabe can’t help but insert that Rachel might be a food douche which would explain everything).
On the one hand, if we let the money go through and it turns out to be unauthorized, Square is liable. Technically the fraudster (so in this case Rachel) would be liable, but our experience is that fraudsters are usually insolvent, so the loss ends up on Square.
On the other hand, the customer service is bad if we stop payment on what turn out to be real payments. After all, what if she’s innocent and we deny the charges? She will probably hate us, may even sully our reputation, and in any case we’ve lost her trust after that.
This example crystallizes the important challenges we face: false positives erode customer trust, false negatives make us lose money.
And since Square processes millions of dollars worth of sales per day, we need to do this systematically and automatically. We need to assess the risk level of every event and entity in our system.
So what do we do?
First of all, we take a look at our data. We’ve got three types:
- payment data, where the fields are transaction_id, seller_id, buyer_id, amount, success (0 or 1), timestamp,
- seller data, where the fields are seller_id, sign_up_date, business_name, business_type, business_location,
- settlement data, where the fields are settlement_id, state, timestamp.
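To make those three record types concrete, here’s a minimal sketch of what such records might look like. The field names come from the lists above; the types and example values are my guesses, not Square’s actual schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Payment:
    transaction_id: str
    seller_id: str
    buyer_id: str
    amount: float        # transaction amount in dollars
    success: int         # 0 or 1
    timestamp: datetime

@dataclass
class Seller:
    seller_id: str
    sign_up_date: datetime
    business_name: str
    business_type: str
    business_location: str

@dataclass
class Settlement:
    settlement_id: str
    state: str           # e.g. "pending", "cleared", "frozen" (illustrative values)
    timestamp: datetime
```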
Important fact: we settle to our customers the next day so we don’t have to make our decision within microseconds. We have a few hours. We’d like to do it quickly of course, but in certain cases we have time for a phone call to check on things.
So here’s the process: given a bunch (as in hundreds or thousands) of payment events, we throw each through the risk engine, and then send some iffy-looking ones on to a “manual review”. An ops team will then review the cases on an individual basis. Specifically, anything that looks rejectable gets sent to ops, which makes phone calls to double-check, unless it’s super outrageously obviously fraud.
Also, to be clear, there are actually two kinds of fraud to worry about, seller-side fraud and buyer-side fraud. For the purpose of this discussion, we’ll focus on the former.
So now it’s a question of how we set up the risk engine. Note that we can think of the risk engine as putting things in bins, and those bins each have labels. So we can call this a labeling problem.
But that kind of makes it sound like unsupervised learning, like a clustering problem, and although it shares some properties with that, it’s certainly not that simple – we don’t reject a payment and then merely stand pat with that label, because as we discussed we send it on to an ops team to assess it independently. So in actuality we have a pretty complicated set of labels, including for example:
- initially rejected but ok,
- initially rejected and bad,
- initially accepted but on further consideration might have been bad,
- initially accepted and things seem ok,
- initially accepted and later found to be bad, …
So in other words we have ourselves a semi-supervised learning problem, straddling the worlds of supervised and unsupervised learning. We first check our old labels, and modify them, and then use them to help cluster new events using salient properties and attributes common to historical events whose labels we trust. We are constantly modifying our labels even in retrospect for this reason.
We estimate performance using precision and recall. Note there are very few positive examples so accuracy is not a good metric of success, since the “everything looks good” model is dumb but has good accuracy.
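Here’s a toy illustration (made-up numbers, nothing to do with Square’s actual rates) of why accuracy flatters the do-nothing model while precision and recall expose it:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 1,000 payments, 10 of them fraudulent (label 1).
y_true = [1] * 10 + [0] * 990

# The dumb "everything looks good" model never flags anything...
y_dumb = [0] * 1000
# ...while a useful model catches 8 of the 10 frauds at the cost of 20 false alarms.
y_model = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 970

print(accuracy_score(y_true, y_dumb))                    # 0.99 -- looks great, is useless
print(precision_score(y_true, y_dumb, zero_division=0))  # 0.0
print(recall_score(y_true, y_dumb))                      # 0.0
print(precision_score(y_true, y_model))                  # 8 / 28, about 0.29
print(recall_score(y_true, y_model))                     # 8 / 10 = 0.8
```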
Labels are what Ian considered to be the “neglected half of the data” (recall T = {(x_i, y_i)}). In undergrad statistics education and in data mining competitions, the availability of labels is often taken for granted. In reality, labels are tough to define and capture. Labels are really important. It’s not just the objective function; it is the objective.
As is probably familiar to people, we have a problem with sparsity of features. This is exacerbated by class imbalance (i.e., there are few positive samples). We also don’t know the same information for all of our sellers, especially when we have new sellers. But if we are too conservative we start off on the wrong foot with new customers.
Also, we might have a data point, say zipcode, for every seller, but knowing the zipcode alone doesn’t give us enough information because so few sellers share zipcodes. In this case we want to do some clever binning of the zipcodes, which is something like a sub-model within our model.
Finally, and this is typical for predictive algorithms, we need to tweak our algorithm to optimize it: we need to consider whether features interact linearly or non-linearly, and to account for class imbalance. We also have to be aware of adversarial behavior. An example of adversarial behavior in e-commerce is new-buyer fraud, where a given person sets up 10 new accounts with slightly different spellings of their name and address.
Since models degrade over time, as people learn to game them, we need to continually retrain models. The keys to building performance models are as follows:
- it’s not a black box. You can’t build a good model by assuming that the algorithm will take care of everything. For instance, I need to know why I am misclassifying certain people, so I’ll need to roll up my sleeves and dig into my model.
- We need to perform rapid iterations of testing, with experiments like you’d do in a science lab. If you’re not sure whether to try A or B, then try both.
- When you hear someone say, “So which models or packages do you use?” then you’ve got someone who doesn’t get it. Models and/or packages are not magic potion.
Mathbabe cannot resist paraphrasing Ian here as saying “It’s not about the package, it’s about what you do with it.” But what Ian really thinks it’s about, at least for code, is:
- readability
- reusability
- correctness
- structure
- hygiene
So, if you’re coding a random forest algorithm and you’ve hardcoded the number of trees: you’re an idiot. Put a friggin parameter there so people can reuse it. Make it tweakable. And write the tests, for pity’s sake; clean code and clarity of thought go together.
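In that spirit, here’s a toy contrast (not Square’s code) of a reusable, parameterized training function plus the kind of tiny test Ian is talking about, using scikit-learn’s random forest for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def train_forest(X, y, n_trees=100, max_features="sqrt", random_state=0):
    """Train a random forest; the knobs are parameters, not magic numbers."""
    model = RandomForestClassifier(
        n_estimators=n_trees, max_features=max_features, random_state=random_state
    )
    return model.fit(X, y)

def test_train_forest_respects_n_trees():
    # A small sanity check: the parameter actually controls the number of trees.
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    model = train_forest(X, y, n_trees=25)
    assert len(model.estimators_) == 25

test_train_forest_respects_n_trees()
```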
At Square we try to maintain reusability and readability — we structure our code in different folders with distinct, reusable components that provide semantics around the different parts of building a machine learning model: model, signal, error, experiment.
We only write scripts in the experiments folder where we either tie together components from model, signal and error or we conduct exploratory data analysis. It’s more than just a script, it’s a way of thinking, a philosophy of approach.
What does such a discipline give you? Every time you run an experiment your should incrementally increase your knowledge. This discipline helps you make sure you don’t do the same work again. Without it you can’t even figure out the things you or someone else has already attempted.
For more on what every project directory should contain, see Project Template, written by John Myles White.
We had a brief discussion of how reading other people’s code is a huge problem, especially when we don’t even know what clean code looks like. Ian stayed firm on his claim that “if you don’t write production code then you’re not productive.”
In this light, Ian suggests exploring and actively reading R code repositories on GitHub. He says to try writing your own R package after reading this. Also, he says that developing an aesthetic sense for code is analogous to acquiring the taste for beautiful proofs; it’s done through rigorous practice and feedback from peers and mentors. The problem is, he says, that statistics instructors in schools usually do not give feedback on code quality, nor are they qualified to.
For extra credit, Ian suggests the reader contrast the implementations of the caret package (poor code) with scikit-learn (clean code).
Important things Ian skipped
- how is a model “productionized”?
- how are features computed in real-time to support these models?
- how do we make sure “what we see is what we get”, meaning the features we build in a training environment will be the ones we see in real-time. Turns out this is a pretty big problem.
- how do you test a risk engine?
Next Ian talked to us about how Square uses visualization.
Data Viz at Square
Ian talked to us about a bunch of different ways the Inference Team at Square uses visualizations to monitor the transactions going on at any given time. He mentioned that these monitors aren’t necessarily trying to predict fraud per se, but rather provide a way of keeping an eye on things, looking for trends and patterns over time, and serve as the kind of “data exoskeleton” he mentioned at the beginning. People at Square believe in ambient analytics, which means passively ingesting data constantly so you develop a visceral feel for it.
After all, it is only by becoming very familiar with our data that we even know what kinds of patterns are unusual or deserve their own model. To go further into the philosophy of this approach, he said two things:
“What gets measured gets managed,” and “You can’t improve what you don’t measure.”
He described a workflow tool to review users, which shows features of the seller, including the history of sales and geographical information, reviews, contact info, and more. Think mission control.
In addition to the raw transactions, there are risk metrics that Ian keeps a close eye on. So for example he monitors the “clear rates” and “freeze rates” per day, as well as how many events needed to be reviewed. Using his fancy viz system he can get down to which analysts froze the most today and how long each account took to review, and what attributes indicate a long review process.
In general people at Square are big believers in visualizing business metrics (sign-ups, activations, active users, etc.) in dashboards; they think it leads to more accountability and better improvement of models as they degrade. They run a kind of constant EKG of their business through ambient analytics.
Ian ended with his data scientist profile. He thinks it should be on a logarithmic scale, since it doesn’t take very long to be okay at something (good enough to get by) but it takes lots of time to get from good to great. He believes that productivity should also be measured in log-scale, and his argument is that leading software contributors crank out packages at a much higher rate than other people.
Ian’s advice to aspiring data scientists
- play with real data
- build a good foundation in school
- get an internship
- be literate, not just in statistics
- stay curious
Ian’s thought experiment
Suppose you know about every single transaction in the world as it occurs. How would you use that data?
Columbia Data Science course, week 7: Hunch.com, recommendation engines, SVD, alternating least squares, convexity, filter bubbles
Last night in Rachel Schutt’s Columbia Data Science course we had Matt Gattis come and talk to us about recommendation engines. Matt graduated from MIT in CS, worked at SiteAdvisor, and co-founded Hunch as its CTO, which recently got acquired by eBay. Here’s what Matt had to say about his company:
Hunch
Hunch is a website that gives you recommendations of any kind. When we started out it worked like this: we’d ask you a bunch of questions (people seem to love answering questions), and then you could ask the engine questions like, what cell phone should I buy? or, where should I go on a trip? and it would give you advice. We use machine learning to learn and to give you better and better advice.
Later we expanded into more of an API where we crawled the web for data rather than asking people direct questions. We can also be used by third parties to personalize content for a given site, a nice business proposition which led eBay to acquire us. My role there was doing the R&D for the underlying recommendation engine.
Matt has been building code since he was a kid, so he considers software engineering to be his strong suit. Hunch is a cross-domain experience so he doesn’t consider himself a domain expert in any focused way, except for recommendation systems themselves.
The best quote Matt gave us yesterday was this: “Forming a data team is kind of like planning a heist.” He meant that you need people with all sorts of skills, and that one person probably can’t do everything by herself. Think Ocean’s Eleven but sexier.
A real-world recommendation engine
You have users, and you have items to recommend. Each user and each item has a node to represent it. Generally users like certain items. We represent this as a bipartite graph. The edges are “preferences”. They could have weights: they could be positive, negative, or on a continuous scale (or discontinuous but many-valued like a star system). The implications of this choice can be heavy but we won’t get too into them today.
So you have all this training data in the form of preferences. Now you wanna predict other preferences. You can also have metadata on users (i.e. know they are male or female, etc.) or on items (a product for women).
For example, imagine users came to your website. You may know each user’s gender, age, whether they’re liberal or conservative, and their preferences for up to 3 items.
We represent a given user as a vector of features, sometimes including only their meta data, sometimes including only their preferences (which would lead to a sparse vector since you don’t know all their opinions) and sometimes including both, depending on what you’re doing with the vector.
Nearest Neighbor Algorithm?
Let’s review nearest neighbor algorithm (discussed here): if we want to predict whether a user A likes something, we just look at the user B closest to user A who has an opinion and we assume A’s opinion is the same as B’s.
To implement this you need a definition of a metric so you can measure distance. One example: Jaccard distance, i.e. the number of preferences two users have in common divided by the total number of things. Other examples: cosine similarity or Euclidean distance. Note: you might get a different answer depending on which metric you choose.
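Here’s a small sketch of those metrics on toy data, to make the definitions concrete:

```python
import numpy as np

# Jaccard on sets of liked items.
likes_a = {"item1", "item2", "item5"}
likes_b = {"item2", "item5", "item7", "item9"}
jaccard_similarity = len(likes_a & likes_b) / len(likes_a | likes_b)   # 2 / 5
jaccard_distance = 1 - jaccard_similarity

# Cosine similarity and Euclidean distance on preference vectors
# (+1 = like, -1 = dislike, 0 = unknown).
a = np.array([1, 1, 0, 0, 1])
b = np.array([0, 1, 0, -1, 1])
cosine_similarity = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean_distance = np.linalg.norm(a - b)

print(jaccard_distance, cosine_similarity, euclidean_distance)
```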
What are some problems using nearest neighbors?
- There are too many dimensions, so the closest neighbors are too far away from each other. There are tons of features, moreover, that are highly correlated with each other. For example, you might imagine that as you get older you become more conservative. But then counting both age and politics would mean you’re double counting a single feature in some sense. This would lead to bad performance, because you’re using redundant information. So we need to build in an understanding of the correlation and project onto smaller dimensional space.
- Some features are more informative than others. Weighting features may therefore be helpful: maybe your age has nothing to do with your preference for item 1. Again you’d probably use something like covariances to choose your weights.
- If your vector (or matrix, if you put together the vectors) is too sparse, or you have lots of missing data, then most things are unknown and the Jaccard distance means nothing because there’s no overlap.
- There’s measurement (reporting) error: people may lie.
- There’s a calculation cost – computational complexity.
- Euclidean distance also has a scaling problem: age differences will swamp the other differences if preferences are reported as 0 (for don’t like) or 1 (for like). Essentially this means raw Euclidean distance doesn’t account for the different scales of the features, so you need to normalize them first.
- Also, old and young people might think one thing but middle-aged people something else. We seem to be assuming a linear relationship, but it may not exist.
- User preferences may also change over time, which falls outside the model. For example, at eBay, a user might be buying a printer, which makes them want ink, but only for a short time.
- Overfitting is also a problem. The one guy is closest, but it could be noise. How do you adjust for that? One idea is to use k-nearest neighbor, with say k=5.
- It’s also expensive to update the model as you add more data.
Matt says the biggest issues are overfitting and the “too many dimensions” problem. He’ll explain how he deals with them.
Going beyond nearest neighbor: machine learning/classification
In its most basic form, we can build a separate linear model for each item. Denote by $p_{i,j}$ user $i$’s preference for item $j$ (or attribute $j$, if $j$ is a metadata attribute). Say we want to model a given user’s preference for a given item using only the 3 metadata properties of that user (age, gender, politics), which we assume are numeric. Then we can look for the best choice of coefficients $\beta = (\beta_0, \beta_1, \beta_2, \beta_3)$ in

$$p_{i,j} = \beta_0 + \beta_1 \cdot \mathrm{age}_i + \beta_2 \cdot \mathrm{gender}_i + \beta_3 \cdot \mathrm{politics}_i + \epsilon.$$
Remember, this model only works for one item. We need to build as many models as we have items. We know how to solve the above per item by linear algebra. Indeed one of the drawbacks is that we’re not using other items’ information at all to create the model for a given item.
This solves the “weighting of the features” problem we discussed above, but overfitting is still a problem, and it comes in the form of having huge coefficients when we don’t have enough data (i.e. not enough opinions on given items). We have a Bayesian prior that these weights shouldn’t be too far out of whack, and we can implement this by adding a penalty term for really large coefficients.
This ends up being equivalent to adding a prior matrix to the covariance matrix. How do you choose lambda? Experimentally: use some data as your training set, evaluate how well you did using particular values of lambda, and adjust.
Important technical note: You can’t use this penalty term for large coefficients and assume the “weighting of the features” problem is still solved, because in fact you’re implicitly penalizing some coefficients more than others. The easiest way to get around this is to normalize your variables before entering them into the model, similar to how we did it in this earlier class.
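A minimal sketch of that recipe (normalize, penalize large coefficients, choose lambda on held-out data), using scikit-learn’s ridge regression and made-up metadata features rather than whatever Hunch actually ran:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))   # toy user metadata: age, gender, politics (already numeric)
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.5, size=500)  # preference for one item

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)   # normalize so the penalty treats features evenly

best = None
for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=lam).fit(scaler.transform(X_train), y_train)
    score = model.score(scaler.transform(X_test), y_test)  # held-out performance
    if best is None or score > best[1]:
        best = (lam, score)
print("best lambda:", best[0])
```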
The dimensionality problem
We still need to deal with this very large problem. We typically use both Principal Component Analysis (PCA) and Singular Value Decomposition (SVD).
To understand how this works, let’s talk about how we reduce dimensions and create “latent features” internally every day. For example, we invent concepts like “coolness”: I can’t directly measure how cool someone is, the way I could weigh them or something, but different people exhibit patterns of behavior which we internally map onto a single dimension we label “coolness”.
We let the machines do the work of figuring out what the important “latent features” are. We expect them to explain the variance in the answers to the various questions. The goal is to build a model which has a representation in a lower dimensional subspace which gathers “taste information” to generate recommendations.
SVD
Given an $m \times n$ matrix $X$ of rank $k$, decompose it into three matrices:

$$X = U S V^{T},$$

where $U$ is $m \times k$, $S$ is $k \times k$, and $V$ is $n \times k$; here $m$ is the number of users, $n$ is the number of items, and $k$ is the rank of $X$.

The rows of $U$ correspond to users, whereas $V$ has a row for each item. The square matrix $S$ is diagonal, and each entry is a singular value, which measures the importance of each dimension. If we put them in decreasing order, which we do, then the dimensions are ordered by importance from highest to lowest. Every matrix has such a decomposition.

Important properties:

- The columns of $U$ and $V$ are orthogonal to each other.
- So we can order the columns by singular values.
- We can take a lower-rank approximation of $X$ by throwing away the smaller singular values in $S$. In this way the number of latent features we keep can be much smaller than either $m$ or $n$, and this is what we mean by compression.
- There is an important interpretation to the values in the matrices $U$ and $V$. For example, we can see, by using SVD, that “the most important latent feature” is often something like seeing if you’re a man or a woman.
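Here’s a tiny numerical illustration of the decomposition and a low-rank approximation, on a made-up user-by-item preference matrix:

```python
import numpy as np

# 4 users x 3 items, entries are preferences (toy data).
X = np.array([[5, 4, 0],
              [4, 5, 1],
              [1, 0, 5],
              [0, 1, 4]], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U @ diag(s) @ Vt
print(s)                                           # singular values, largest first

# Rank-2 approximation: keep only the two most important latent features.
d = 2
X_approx = U[:, :d] @ np.diag(s[:d]) @ Vt[:d, :]
print(np.round(X_approx, 2))
```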
[Question: did you use domain expertise to choose questions at Hunch? Answer: we tried to make them as fun as possible. Then, of course, we saw things needing to be asked which would be extremely informative, so we added those. In fact we found that we could ask merely 20 questions and then predict the rest of them with 80% accuracy. They were questions that you might imagine and some that surprised us, like competitive people v. uncompetitive people, introverted v. extroverted, thinking v. perceiving, etc., not unlike MBTI.]
More details on our encoding:
- Most of the time the questions are binary (yes/no).
- We create a separate variable for every variable.
- Comparison questions may be better at granular understanding, and get to revealed preferences, but we don’t use them.
Note if we have a rank-$k$ matrix $X$ and we use the SVD above, we can take the approximation using only the $d$ largest singular values of the middle matrix $S$; in other words, we take the top $d$ most important latent features, and the corresponding columns of $U$ and $V$, and we get back something very close to $X$.
Note that the problem of sparsity or missing data is not fixed by the above SVD approach, nor is the computational complexity problem; SVD is expensive.
PCA
Now we’re still looking for $U$ and $V$ as above, but we don’t have $S$ anymore, so we just model $X \approx U \cdot V^{T}$, and we have a more general optimization problem. Specifically, we want to minimize

$$\sum_{(i,j) \in P} (p_{i,j} - u_i \cdot v_j)^2.$$

Let me explain. We denote by $u_i$ the row of $U$ corresponding to user $i$, and similarly we denote by $v_j$ the row of $V$ corresponding to item $j$. Items can include metadata information (so the age vectors of all the users will be a row in $V$).

Then the dot product $u_i \cdot v_j$ is taken to mean the predicted value of user $i$’s preference for item $j$, and we compare that to the actual preference $p_{i,j}$. The set $P$ is just the set of all actual known preferences or metadata attribution values.

So, we want to find the best choices of $U$ and $V$ which overall minimize the squared differences between prediction and observation on everything we actually know, and the idea is that if it’s really good on stuff we know, it will also be good on stuff we’re guessing.

Now we have a parameter, namely the number $d$ of latent features we want to use. The matrix $U$ will have a row for each user and a column for each latent feature, and the matrix $V$ will have a row for each item and a column for each latent feature.

How do we choose $d$? It’s typically about 100, since that’s more than 20 (we already know we have a pretty good grasp on someone if we ask them 20 questions) and it’s as much as we care to add before it becomes computationally too much work. Note the resulting latent features will be uncorrelated, since they are solving an efficiency problem (that’s an intuition, not a proof).

But how do we actually find $U$ and $V$?
Alternating Least Squares
This optimization doesn’t have a nice closed-form solution like ordinary least squares with one set of coefficients. Instead, we use an iterative algorithm, as with gradient descent. As long as your problem is convex you’ll converge okay (i.e. you won’t find yourself stuck at a local but not global minimum), and we will force our problem to be convex using regularization.
Algorithm:
- Pick a random $V$.
- Optimize $U$ while $V$ is fixed.
- Optimize $V$ while $U$ is fixed.
- Keep doing the above two steps until you’re not changing very much at all.
Example: Fix $V$ and update $U$.

The way we do this optimization is user by user. So for user $i$ we want to find

$$u_i = \arg\min_{u} \sum_{j \in P_i} (p_{i,j} - u \cdot v_j)^2,$$

where $V$ is fixed and $P_i$ is the set of items for which we know user $i$’s preferences. In other words, we just care about this user for now.

But wait a minute, this is the same as linear least squares, and has a closed-form solution! In other words, set

$$u_i = (V_{*,i}^{T} V_{*,i})^{-1} V_{*,i}^{T} p_{*,i},$$

where $V_{*,i}$ is the subset of $V$ restricted to the items for which we have preferences coming from user $i$, and $p_{*,i}$ is the vector of those preferences. Taking the inverse is easy since the matrix is $d \times d$, which is small. And there aren’t that many preferences per user, so solving this many times is really not that hard. Overall we’ve got a do-able update for $U$.
When you fix $U$ and optimize $V$, it’s analogous: you only ever have to consider the users who rated that item, which may be a pretty large set, but you’re still only ever inverting a $d \times d$ matrix.
Another cool thing: since each user is only dependent on their own preferences, we can parallelize this update of $U$ or $V$. We can run it on as many different machines as we want to make it fast.
There are lots of different versions of this. Sometimes you need to extend it to make it work in your particular case.
Note: as stated this is not actually convex, but similar to the regularization we did for least squares, we can add a penalty for large entries in $U$ and $V$, depending on some parameter $\lambda$, which again translates to the same thing, i.e. adding a diagonal matrix to the covariance matrix when you solve least squares. This makes the problem convex if $\lambda$ is big enough.
You can add new users, new data, keep optimizing U and V. You can choose which users you think need more updating. Or if they have enough ratings, you can decide not to update the rest of them.
As with any machine learning model, you should perform cross-validation for this model – leave out a bit and see how you did. This is a way of testing overfitting problems.
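For concreteness, here’s a minimal sketch of regularized alternating least squares on a small, fully observed toy matrix. A real system would sum only over observed preferences and parallelize the per-user and per-item solves, as described above:

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[5, 4, 0, 1],        # toy preference matrix: 3 users x 4 items
              [4, 5, 1, 0],
              [0, 1, 5, 4]], dtype=float)
n_users, n_items = P.shape
d, lam = 2, 0.1                     # number of latent features, regularization strength

U = rng.normal(size=(n_users, d))
V = rng.normal(size=(n_items, d))

for _ in range(50):
    # Fix V, solve a small ridge problem per user; then fix U, solve per item.
    for i in range(n_users):
        U[i] = np.linalg.solve(V.T @ V + lam * np.eye(d), V.T @ P[i])
    for j in range(n_items):
        V[j] = np.linalg.solve(U.T @ U + lam * np.eye(d), U.T @ P[:, j])

print(np.round(U @ V.T, 2))         # predicted preferences, should be close to P
```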
Thought experiment – filter bubbles
What are the implications of using error minimization to predict preferences? How does presentation of recommendations affect the feedback collected?
For example, can we end up in local maxima with rich-get-richer effects? In other words, does showing certain items at the beginning “give them an unfair advantage” over other things? And so do certain things just get popular or not based on luck?
How do we correct for this?
The investigative mathematical journalist
I’ve been out of academic math a few years now, but I still really enjoy talking to mathematicians. They are generally nice and nerdy and utterly earnest about their field and the questions in their field and why they’re interesting.
In fact, I enjoy these conversations more now than when I was an academic mathematician myself. Partly this is because, as a professional, I was embarrassed to ask people stupid questions, because I thought I should already know the answers. I wouldn’t have asked someone to explain motives and the Hodge Conjecture in simple language because honestly, I’m pretty sure I’d gone to about 4 lectures as a graduate student explaining all of this and if I could just remember the answer I would feel smarter.
But nowadays, having left and nearly forgotten that kind of exquisite anxiety that comes out of trying to appear superhuman, I have no problem at all asking someone to clarify something. And if they give me an answer that refers to yet more words I don’t know, I’ll ask them to either rephrase or explain those words.
In other words, I’m becoming something of an investigative mathematical journalist. And I really enjoy it. I think I could do this for a living, or at least as a large project.
What I have in mind is the following: I go around the country (I’ll start here in New York) and interview people about their field. I ask them to explain the “big questions” and what awesomeness would come from actually having answers. Why is their field interesting? How does it connect to other fields? What is the end goal? How would achieving it inform other fields?
Then I’d write them up like columns. So one column might be “Hodge Theory” and it would explain the main problem, the partial results, and the connections to other theories and fields, or another column might be “motives” and it would explain the underlying reason for inventing yet another technology and how it makes things easier to think about.
Obviously I could write a whole book on a given subject, but I wouldn’t. My audience would be, primarily, other mathematicians, but I’d write it to be readable by people who have degrees in other quantitative fields like physics or statistics.
Even more obviously, every time I chose a field and a representative to interview, and every time I chose to stop there, I’d be making in some sense a political choice, which would inevitably piss someone off, because I realize people are very sensitive to this. This is presuming anybody ever reads my surveys in the first place, which is a big if.
Even so, I think it would be a contribution to mathematics. I actually think a pretty serious problem with academic math is that people from disparate fields really have no idea what each other is doing. I’m generalizing, of course, and colloquia do tend to address this, when they are well done and available. But for the most part, let’s face it, people are essentially only rewarded for writing stuff that is incredibly “insider” for their field, stuff that only a few other experts can understand. Surveys of topics, when they’re written, are generally not considered “research” but more like a public service.
And by the way, this is really different from the history of mathematics, in that I have never really cared about who did what, and I still don’t (although I’m not against naming a few people in my columns). The real goal here is to end up with a more or less accurate map of the active research areas in mathematics and how they are related. So an enormous network, with various directed edges of different types. In fact, writing this down makes me want to build my map as I go, an annotated visualization to pair with the columns.
Also, it obviously doesn’t have to be me doing all this: I’m happy to make it an open-source project with a few guidelines and version control. But I do want to kick it off because I think it’s a neat idea.
A few questions about my mathematical journalism plan.
- Who’s going to pay me to do this?
- Where should I publish it?
If the answers are “nobody” and “on mathbabe.org” then I’m afraid it won’t happen, at least by me. Any ideas?
One more thing. This idea could just as well be done for another field altogether, like physics or biology. Are there models of people doing something like that in those fields that you know about? Or is there someone actually already doing this in math?
Columbia Data Science course, week 6: Kaggle, crowd-sourcing, decision trees, random forests, social networks, and experimental design
Yesterday we had two guest lecturers, who took up approximately half the time each. First we welcomed William Cukierski from Kaggle, a data science competition platform.
Will went to Cornell for a B.A. in physics and to Rutgers to get his Ph.D. in biomedical engineering. He focused on cancer research, studying pathology images. While working on writing his dissertation, he got more and more involved in Kaggle competitions, finishing very near the top in multiple competitions, and now works for Kaggle. Here’s what Will had to say.
Crowd-sourcing in Kaggle
What is a data scientist? Some say it’s someone who is better at stats than an engineer and better at engineering than a statistician. But one could argue it’s actually someone who is worse at stats than a statistician and worse at engineering than an engineer. Being a data scientist is when you learn more and more about more and more until you know nothing about everything.
Kaggle uses prizes to induce the public to do stuff. This is not a new idea:
- the British government in 1714 couldn’t measure longitude, and put out a prize worth about $6 million in today’s dollars to get help. John Harrison, an unknown cabinetmaker, figured out how to make a clock accurate enough to solve the problem.
- In the U.S. in 2002 FOX issued a prize for the next pop solo artist, which resulted in American Idol.
- There’s also the X Prize Foundation; $10 million was offered for the Ansari X Prize, and more than $100 million was spent by the competing teams trying to win it. So it’s not always an efficient process (but it’s efficient for the people offering the prize if it gets solved!)
There are two kinds of crowdsourcing models. First, we have the distributed crowdsourcing model, like Wikipedia, which asks for relatively easy but large amounts of contributions. Then there are the singular, focused, difficult problems that Kaggle, DARPA, InnoCentive and other companies specialize in.
Some of the problems with some crowdsourcing projects include:
- they don’t always evaluate your submission objectively. Instead they have a subjective measure, so they might just decide your design is bad or something. This leads to a high barrier to entry, since people don’t trust the evaluation criterion.
- Also, one doesn’t get recognition until after they’ve won or ranked highly. This leads to high sunk costs for the participants.
- Also, bad competitions often conflate participants with mechanical turks: in other words, they assume you’re stupid. This doesn’t lead anywhere good.
- Also, the competitions sometimes don’t chunk the work into bite size pieces, which means it’s too big to do or too small to be interesting.
A good competition has a do-able, interesting question, with an evaluation metric which is transparent and entirely objective. The problem is given, the data set is given, and the metric of success is given. Moreover, prizes are established up front.
The participants are encouraged to submit their models up to twice a day during the competitions, which last on the order of a few days. This encourages a “leapfrogging” between competitors, where one ekes out a 5% advantage, giving others an incentive to work harder. It also establishes a band of accuracy around a problem which you generally don’t have; in other words, given no other information, you don’t know whether your 75%-accurate model is the best possible.
The test set y’s are hidden, but the x’s are given, so you just use your model to get your predicted y’s for the test set and upload them into the Kaggle machine to see your evaluation score. This way you don’t share your actual code with Kaggle unless you win the prize (and Kaggle doesn’t have to worry about which version of python you’re running).
Note this leapfrogging effect is good and bad. It encourages people to squeeze out better performing models but it also tends to make models much more complicated as they get better. One reason you don’t want competitions lasting too long is that, after a while, the only way to inch up performance is to make things ridiculously complicated. For example, the original Netflix Prize lasted two years and the final winning model was too complicated for them to actually put into production.
The hole that Kaggle is filling is the following: there’s a mismatch between those who need analysis and those with skills. Even though companies desperately need analysis, they tend to hoard data; this is the biggest obstacle for success.
They have had good results so far. Allstate, with a good actuarial team, challenged their data science competitors to improve their actuarial model, which, given attributes of drivers, approximates the probability of a car crash. The 202 competitors improved Allstate’s internal model by 271%.
There were other examples, including one where the prize was $1,000 and it benefited the company $100,000.
A student then asked, is that fair? There are actually two questions embedded in that one. First, is it fair to the data scientists working at the companies that engage with Kaggle? Some of them might lose their job, for example. Second, is it fair to get people to basically work for free and ultimately benefit a for-profit company? Does it result in data scientists losing their fair market price?
Of course Kaggle charges a fee for hosting competitions, but is it enough?
[Mathbabe interjects her view: personally, I suspect this is a model which seems like an arbitrage opportunity for companies but only while the data scientists of the world haven’t realized their value and have extra time on their hands. As soon as they price their skills better they’ll stop working for free, unless it’s for a cause they actually believe in.]
Facebook is hiring data scientists, and they hosted a Kaggle competition where the prize was an interview. There were 422 competitors.
[Mathbabe can’t help but insert her view: it’s a bit too convenient for Facebook to have interviewees for data science positions in such a posture of gratitude for the mere interview. This distracts them from asking hard questions about what the data policies are and the underlying ethics of the company.]
There’s a final project for the class, namely an essay grading contest. The students will need to build it, train it, and test it, just like any other Kaggle competition. Group work is encouraged.
Thought Experiment: What are the ethical implications of a robo-grader?
Some of the students’ thoughts:
- It depends on how much you care about your grade.
- Actual human graders aren’t fair anyway.
- Is this the wrong question? The goal of a test is not to write a good essay but rather to do well in a standardized test. The real profit center for standardized testing is, after all, to sell books to tell you how to take the tests. It’s a screening, you follow the instructions, and you get a grade depending on how well you follow instructions.
- There are really two questions: 1) Is it wise to move from the human to the machine version of the same thing, for any given thing? and 2) Are machines making things more structured, and is this inhibiting creativity? One thing is for sure, robo-grading prevents me from being compared to someone more creative.
- People want things to be standardized. It gives us a consistency that we like. People don’t want artistic cars, for example.
- Will: We used machine learning to research cancer, where the stakes are much higher. In fact this whole field of data science has to be thinking about these ethical considerations sooner or later, and I think it’s sooner. In the case of doctors, you could give the same doctor the same slide two months apart and get different diagnoses. We aren’t consistent ourselves, but we think we are. Let’s keep that in mind when we talk about the “fairness” of using machine learning algorithms in tricky situations.
Introduction to Feature Selection
“Feature extraction and selection are the most important but underrated step of machine learning. Better features are better than better algorithms.” – Will
“We don’t have better algorithms, we just have more data” –Peter Norvig
Will claims that Norvig really wanted to say we have better features.
We are getting bigger and bigger data sets, but that’s not always helpful. The danger is if the number of features is larger than the number of samples or if we have a sparsity problem.
We improve our feature selection process to try to improve performance of predictions. A criticism of feature selection is that it’s no better than data dredging. If we just take whatever answer we get that correlates with our target, that’s not good.
There’s a well-known bias-variance tradeoff: a model is “high bias” if it’s too simple (the features aren’t encoding enough information). In this case lots more data doesn’t improve your model. On the other hand, if your model is too complicated, then “high variance” leads to overfitting. In this case you want to reduce the number of features you are using.
We will take some material from a famous paper by Isabelle Guyon published in 2003 entitled “An Introduction to Variable and Feature Selection”.
There are three categories of feature selection methods: filters, wrappers, and embedded methods. Filters order variables (i.e. possible features) with respect to some ranking (e.g. correlation with target). This is sometimes good on a first pass over the space of features. Filters take account of the predictive power of individual features, and estimate mutual information or what have you. However, the problem with filters is that you get correlated features. In other words, the filter doesn’t care about redundancy.
This isn’t always bad and it isn’t always good. On the one hand, two redundant features can be more powerful when they are both used, and on the other hand something that appears useless alone could actually help when combined with another possibly useless-looking feature.
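A minimal sketch of a filter method: rank candidate features by absolute correlation with the target and keep the top few. This is toy data, not from the paper; note how the two correlated features both rank highly, because the filter has no notion of their redundancy:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
age = rng.normal(40, 10, n)
politics = age * 0.05 + rng.normal(0, 1, n)     # deliberately correlated with age
noise = rng.normal(0, 1, n)
y = (age > 45).astype(int)                      # target driven mostly by age

X = pd.DataFrame({"age": age, "politics": politics, "noise": noise})
ranking = X.corrwith(pd.Series(y)).abs().sort_values(ascending=False)
print(ranking)                   # age and politics both rank high; noise ranks low
top_features = ranking.index[:2].tolist()
print(top_features)
```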
Wrapper feature selection tries to find subsets of features that will do the trick. However, as anyone who has studied the binomial coefficients knows, the number of possible size-$k$ subsets of $n$ things, namely $\binom{n}{k}$, grows exponentially. So there’s a nasty opportunity for overfitting by doing this. Most subset methods are capturing some flavor of minimum-redundancy-maximum-relevance. So, for example, we could have a greedy algorithm which starts with the best feature, takes a few more highly ranked ones, removes the worst, and so on. This is a hybrid approach with a filter method.
We don’t have to retrain models at each step of such an approach, because there are fancy ways to see how the objective function changes as we change the subset of features we are trying out. These are called “finite differences” and rely essentially on Taylor series expansions of the objective function.
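To make the greedy, filter-plus-wrapper idea above concrete, here’s a minimal sketch in python. Everything in it – the correlation filter, the held-out $R^2$ score, the stopping rule – is a stand-in I chose for illustration, not anything from the lecture, and it assumes the columns of X are already centered.

```python
import numpy as np

def greedy_forward_selection(X, y, max_features=5):
    """Toy greedy hybrid: rank features by |correlation| with the target (a filter),
    then add them one at a time, keeping each only if it improves R^2 on a held-out
    split (a wrapper). Purely illustrative."""
    n, d = X.shape
    split = n // 2
    X_tr, X_va, y_tr, y_va = X[:split], X[split:], y[:split], y[split:]

    def heldout_r2(cols):
        # fit ordinary least squares on the chosen columns, score on the held-out half
        coef, *_ = np.linalg.lstsq(X_tr[:, cols], y_tr, rcond=None)
        resid = y_va - X_va[:, cols] @ coef
        return 1 - resid.var() / y_va.var()

    # filter step: rank candidate features by correlation with the target
    corrs = [np.corrcoef(X_tr[:, j], y_tr)[0, 1] for j in range(d)]
    ranking = np.argsort(-np.abs(corrs))

    chosen, best = [], 0.0
    for j in ranking[:max_features * 3]:        # only consider the top-ranked candidates
        score = heldout_r2(chosen + [int(j)])
        if score > best:                        # wrapper step: keep it only if it helps
            chosen.append(int(j))
            best = score
        if len(chosen) == max_features:
            break
    return chosen, best
```

In practice you’d use proper cross-validation and whatever scoring metric your problem actually calls for.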
One last word: if you have domain expertise on hand, don’t go into the machine learning rabbit hole of feature selection unless you’ve tapped into your expert completely!
Decision Trees
We’ve all used decision trees. They’re easy to understand and easy to use. How do we construct one? Choosing a feature to pick at each step is like playing 20 questions: we take whatever the most informative thing is first. For the sake of this discussion, assume we break compound questions into multiple binary questions, so the answer is “+” or “-“.
To quantify “what is the most informative feature”, we first define the entropy of a (discrete) random variable $X$ to mean

$H(X) = -\sum_x p(x) \log_2 p(x).$

Note when $p(x) = 0$ we define the term $p(x)\log_2 p(x)$ to vanish. This is consistent with the fact that $\lim_{p \to 0^+} p \log_2 p = 0.$

In the binary case this becomes $H(X) = -p\log_2 p - (1-p)\log_2(1-p)$: in particular, if either option has probability zero, the entropy is 0, and it is maximized at $p = 0.5$, which we can easily compute with a bit of calculus.
Using this definition, we define the information gain of a given feature $X$ for the class label $Y$ as the entropy we lose if we know the value of that feature: $IG(X) = H(Y) - H(Y \mid X).$
To make a decision tree, then, we want to maximize information gain, and make a split on that. We keep going until all the points at the end are in the same class or we end up with no features left. In this case we take the majority vote. Optionally we prune the tree to avoid overfitting.
This is an example of an embedded feature selection algorithm. We don’t need to use a filter here because the “information gain” method is doing our feature selection for us.
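To make the entropy and information gain definitions concrete, here’s a tiny sketch (mine, not from the lecture) that computes both for a categorical feature:

```python
import numpy as np

def entropy(labels):
    """Entropy in bits of a discrete label array; probabilities come from
    observed counts, so no zero-probability terms ever appear."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature):
    """Entropy of the labels minus the feature-conditional entropy:
    how much knowing this categorical feature tells us about the label."""
    gain = entropy(labels)
    for value in np.unique(feature):
        mask = feature == value
        gain -= mask.mean() * entropy(labels[mask])
    return gain

# toy example: a feature that perfectly separates the labels has maximal gain
y = np.array([1, 1, 0, 0])
x = np.array(["a", "a", "b", "b"])
print(information_gain(y, x))   # 1.0 bit
```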
How do you handle continuous variables?
In the case of continuous variables, you need to choose a threshold so that the variable can be thought of as binary. So you could partition a user’s spend into “less than $5” and “at least $5” and you’d be back in the binary variable case. It takes some extra work to compute the information gain here because it depends on the threshold as well as the feature.
Random Forests
Random forests are cool. They incorporate “bagging” (bootstrap aggregating) and trees to make stuff better. Plus they’re easy to use: you just need to specify the number of trees you want in your forest, as well as the number of features to randomly select at each node.
A bootstrap sample is a sample with replacement, which we usually take to be 80% of the actual data, but of course can be adjusted depending on how much data we have.
To construct a random forest, we construct a bunch of decision trees (we decide how many). For each tree, we take a bootstrap sample of our data, and for each node we randomly select (a second point of bootstrapping actually) a few features, say 5 out of the 100 total features. Then we use our entropy-information-gain engine to decide which among those features we will split our tree on, and we keep doing this, choosing a different set of five features for each node of our tree.
Note we could decide beforehand how deep the tree should get, but we typically don’t prune the trees, since a great feature of random forests is that they incorporate idiosyncratic noise.
Here’s what a decision tree looks like for surviving on the Titanic.
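If you just want to try a random forest rather than build one, scikit-learn has an implementation with exactly the two knobs mentioned above; the data set and parameter values below are placeholders, not anything from the lecture:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# toy data standing in for a real feature matrix
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# the two knobs: number of trees, and number of features
# randomly considered at each split
forest = RandomForestClassifier(n_estimators=200, max_features=5, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```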
David Huffaker, Google: Hybrid Approach to Social Research
David is one of Rachel’s collaborators at Google. Their collaboration started with complementary skill sets, and an explosion of goodness ensued when they were put together to work on Google+ with a bunch of other people, especially engineers. David brings a social scientist’s perspective to the analysis of social networks. He’s strong in quantitative methods for understanding and analyzing online social behavior, and he got a Ph.D. in Media, Technology, and Society from Northwestern.
Google does a good job of putting people together. They blur the lines between research and development. The researchers are embedded on product teams. The work is iterative, and the engineers on the team strive to have near-production code from day 1 of a project. They leverage cloud infrastructure to deploy experiments to their mass user base and to rapidly deploy a prototype at scale.
Note that, considering the scale of Google’s user base, redesigning as they scale up is not a viable option. They instead run experiments with smaller groups of users.
David suggested that we, as data scientists, consider how to move into an experimental design so as to move to a causal claim between variables rather than a descriptive relationship. In other words, to move from the descriptive to the predictive.
As an example, he talked about the genesis of the “circle of friends” feature of Google+. They know people want to selectively share; they’ll send pictures to their family, whereas they’d probably be more likely to send inside jokes to their friends. They came up with the idea of circles, but it wasn’t clear if people would use them. How do they answer the question: will they use circles to organize their social network? It’s important to know what motivates them when they decide to share.
They took a mixed-method approach, so they used multiple methods to triangulate on findings and insights. Given a random sample of 100,000 users, they set out to determine the popular names and categories of names given to circles. They identified 168 active users who filled out surveys and they had longer interviews with 12.
They found that the majority of users were engaging in selective sharing, that most people used circles, that the circle names were most often work-related or school-related, and that circles had elements of a strong link (“epic bros”) or a weak link (“acquaintances from PTA”).
They asked the survey participants why they share content. The answers primarily came in three categories: first, the desire to share about oneself – personal experiences, opinions, etc. Second, discourse: people wanna participate in a conversation. Third, evangelism: people wanna spread information.
Next they asked participants why they choose their audiences. Again, three categories: first, privacy – many people were public or private by default. Second, relevance – they wanted to share only with those who may be interested, and they don’t wanna pollute other people’s data stream. Third, distribution – some people just want to maximize their potential audience.
The takeaway from this study was this: people do enjoy selectively sharing content, depending on context, and the audience. So we have to think about designing features for the product around content, context, and audience.
Network Analysis
We can use large data sets to look at connections between actors as a graph. For Google+, the users are the nodes and the (directed) edges are “in the same circle”.
Other examples of networks:
- nodes are users in Second Life; interactions between users are possible in three different ways, corresponding to three different kinds of edges
- nodes are websites, edges are links
- nodes are theorems, directed edges are dependencies
After you define and draw a network, you can hopefully learn stuff by looking at it or analyzing it.
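As a toy sketch of what “defining a network” looks like in code, here’s a made-up circle graph using the networkx library; the users and the statistics pulled out at the end are arbitrary choices of mine:

```python
import networkx as nx

# toy directed graph: an edge u -> v means "u put v in one of their circles"
G = nx.DiGraph()
G.add_edges_from([
    ("bob", "sarah"), ("bob", "alice"),
    ("sarah", "bob"), ("alice", "sarah"),
])

# once the network is defined, standard graph statistics are easy to read off
print(G.out_degree("bob"))   # how many people bob has circled
print(nx.reciprocity(G))     # fraction of edges that go both ways
```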
Social at Google
As you may have noticed, “social” is a layer across all of Google. Search now incorporates this layer: if you search for something, you might see that your friend “+1”-ed it. This is called a social annotation. It turns out that people care more about an annotation when it comes from someone with domain expertise than from someone they’re very close to. So you might care more about the opinion of a wine expert at work than the opinion of your mom when it comes to purchasing wine.
Note that sounds obvious but if you started the other way around, asking who you’d trust, you might start with your mom. In other words, “close ties,” even if you can determine those, are not the best feature to rank annotations. But that begs the question, what is? Typically in a situation like this we use click-through rate, or how long it takes to click.
In general we need to always keep in mind a quantitative metric of success. This defines success for us, so we have to be careful.
Privacy
Human-facing technology has thorny privacy issues, which makes stuff hard. We took a survey of how people felt uneasy about content. We asked: how does it affect your engagement? What is the nature of your privacy concerns?
Turns out there’s a strong correlation between privacy concern and low engagement, which isn’t surprising. It’s also related to how well you understand what information is being shared, and to the question of, when you post something, where does it go and how much control do you have over it? When you are confronted with a huge pile of complicated settings, you tend to start feeling passive.
Again, we took a survey and found broad categories of concern as follows:
identity theft
- financial loss
digital world
- access to personal data
- really private stuff I searched on
- unwanted spam
- provocative photo (oh shit my boss saw that)
- unwanted solicitation
- unwanted ad targeting
physical world
- offline threats
- harm to my family
- stalkers
- employment risks
- hassle
What is the best way to decrease concern and increase understanding and control?
Possibilities:
- Write and post a manifesto of your data policy (tried that, nobody likes to read manifestos)
- Educate users on our policies a la the Netflix feature “because you liked this, we think you might like this”
- Get rid of all stored data after a year
Rephrase: how do we design settings to make things easier for people? How do you make it transparent?
- make a picture or graph of where data is going.
- give people a privacy switchboard
- give people access to quick settings
- make the settings you show them categorized by things you don’t have a choice about vs. things you do
- make reasonable default settings so people don’t have to worry about it.
David left us with these words of wisdom: as you move forward and have access to big data, you really should complement them with qualitative approaches. Use mixed methods to come to a better understanding of what’s going on. Qualitative surveys can really help.
Columbia Data Science course, week 5: GetGlue, time series, financial modeling, advanced regression, and ethics
I was happy to be giving this week’s lecture in Rachel Schutt’s Columbia Data Science course, where I discussed time series, financial modeling, and ethics. I blogged previous classes here.
The first few minutes of class were for a case study with GetGlue, a New York-based start-up that won the mashable breakthrough start-up of the year in 2011 and is backed by some of the VCs that also fund big names like Tumblr, etsy, foursquare, etc. GetGlue is part of the social TV space. Lead Scientist, Kyle Teague, came to tell the class a little bit about GetGlue, and some of what he worked on there. He also came to announce that GetGlue was giving the class access to a fairly large data set of user check-ins to tv shows and movies. Kyle’s background is in electrical engineering, he placed in the 2011 KDD cup (which we learned about last week from Brian), and he started programming when he was a kid.
GetGlue’s goal is to address the problem of content discovery, primarily within the movie and TV space. The usual model for finding out what’s on TV is the 1950’s TV Guide schedule, and that’s still how we’re supposed to find things to watch. There are thousands of channels and it’s getting increasingly difficult to find out what’s good on. GetGlue wants to change this model by giving people personalized TV recommendations and personalized guides. There are other ways GetGlue uses data science, but for the most part we focused on how the recommendation system works. Users “check in” to TV shows, which means they can tell people they’re watching a show. This creates a time-stamped data point. They can also do other actions such as liking or commenting on the show. So this is a triplet {user, action, object}, where the object is a TV show or movie. This induces a bipartite graph: a bipartite graph or network contains two types of nodes, here users and TV shows, and edges exist between users and TV shows, but not between users and users or between TV shows and TV shows. So Bob and Mad Men are connected because Bob likes Mad Men, and Sarah is connected to Mad Men and Lost because Sarah liked Mad Men and Lost. But Bob and Sarah aren’t connected, nor are Mad Men and Lost. A lot can be learned from this graph alone.
But GetGlue finds ways to create edges between users and between objects (TV shows or movies). Users can follow each other or be friends on GetGlue, and GetGlue can also learn that two people are similar [do they do this?]. GetGlue also hires human evaluators to make connections, or directional edges, between objects. So True Blood and Buffy the Vampire Slayer might be similar for some reason, and so the humans create an edge in the graph between them. There were nuances around the edge being directional: they may draw an arrow pointing from Buffy to True Blood but not vice versa, for example, so their notion of “similar” or “close” captures both content and popularity. (That’s a made-up example.) Pandora does something like this too.
Another important aspect is time. The user checked in or liked a show at a specific time, so the triplet extends to a 4-tuple with a timestamp: {user, action, object, timestamp}. This is essentially the data set the class has access to, although it’s slightly more complicated and messy than that. Their first assignment with this data will be to explore it: try to characterize and understand it, gain intuition around it, and visualize what they find.
Students in the class asked him questions around topics of the value of formal education in becoming a data scientist (do you need one? Kyle’s time spent doing signal processing in research labs was valuable, but so was his time spent coding for fun as a kid), what would be messy about a data set, why would the data set be messy (often bugs in the code), how would they know? (their QA and values that don’t make sense), what language does he use to prototype algorithms (python), how does he know his algorithm is good.
Then it was my turn. I started out with my data scientist profile:
As you can see, I feel like I have the most weakness in CS. Although I can use python pretty proficiently – in particular I can scrape and parse data, prototype models, and use matplotlib to draw pretty pictures – I am no java map-reducer and I bow down to those people who are. I am also completely untrained in data visualization but I know enough to get by and give presentations that people understand.
Thought Experiment
I asked the students the following question:
What do you lose when you think of your training set as a big pile of data and ignore the timestamps?
They had some pretty insightful comments. One thing they mentioned off the bat is that you won’t know cause and effect if you don’t have any sense of time. Of course that’s true but it’s not quite what I meant, so I amended the question to allow you to collect relative time differentials, so “time since user last logged in” or “time since last click” or “time since last insulin injection”, but not absolute timestamps.
What I was getting at, and what they came up with, was that when you ignore the passage of time through your data, you ignore trends altogether, as well as seasonality. So for the insulin example, you might note that 15 minutes after your insulin injection your blood sugar goes down consistently, but you might not notice an overall trend of your rising blood sugar over the past few months if your dataset for the past few months has no absolute timestamp on it.
This idea, of keeping track of trends and seasonalities, is very important in financial data, and essential to keep track of if you want to make money, considering how small the signals are.
How to avoid overfitting when you model with time series
After discussing seasonality and trends in the various financial markets, we started talking about how to avoid overfitting your model.
Specifically, I started out with having a strict concept of in-sample (IS) and out-of-sample (OOS) data. Note the OOS data is not meant as testing data – all of that happens inside the IS data. It’s meant to be the data you use after finalizing your model, so that you have some idea how the model will perform in production.
Next, I discussed the concept of causal modeling. Namely, we should never use information in the future to predict something now. Similarly, when we have a set of training data, we don’t know the “best fit coefficients” for that training data until after the last timestamp on all the data. As we move forward in time from the first timestamp to the last, we expect to get different sets of coefficients as more events happen.
One consequence of this is that, instead of getting one set of coefficients, we actually get an evolution of each coefficient. This is helpful because it gives us a sense of how stable those coefficients are. In particular, if one coefficient has changed sign 10 times over the training set, then we expect a good estimate for it is zero, not the so-called “best fit” at the end of the data.
One last word on causal modeling and IS/OOS. It is consistent with production code. Namely, you are always acting, in the training and in the OOS simulation, as if you’re running your model in production and you’re seeing how it performs. Of course you fit your model in sample, so you expect it to perform better there than in production.
Another way to say this is that, once you have a model in production, you will have to make decisions about the future based only on what you know now (so it’s causal) and you will want to update your model whenever you gather new data. So your coefficients of your model are living organisms that continuously evolve.
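Here’s a minimal sketch of that causal, evolving-coefficient picture: at each date we refit using only data available up to that date, and then look at how stable the estimate is. The data is simulated and the univariate fit is just for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 0.3 * x + rng.normal(scale=2.0, size=n)   # weak signal, lots of noise

# causal fitting: at each time t, use only observations 0..t-1
coef_path = []
for t in range(50, n):
    X_past, y_past = x[:t], y[:t]
    beta = (X_past @ y_past) / (X_past @ X_past)   # univariate least squares, no intercept
    coef_path.append(beta)

# if the estimated coefficient flips sign a lot, treat it as basically zero
coef_path = np.array(coef_path)
print("sign changes:", int(np.sum(np.diff(np.sign(coef_path)) != 0)))
```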
Submodels of Models
We often “prepare” the data before putting it into a model. Typically the way we prepare it has to do with the mean or the variance of the data, or sometimes the log (and then the mean or the variance of that transformed data).
But to be consistent with the causal nature of our modeling, we need to make sure our running estimates of mean and variance are also causal. Once we have causal estimates of our mean $\mu_y$ and variance $\sigma_y^2$, we can normalize the next data point with these estimates, just like we do to get from a gaussian distribution to the standard gaussian distribution:

$y \mapsto \frac{y - \mu_y}{\sigma_y}.$
Of course we may have other things to keep track of as well to prepare our data, and we might run other submodels of our model. For example, we may choose to consider only the “new” part of something, which is equivalent to trying to predict something like $y_t - y_{t-1}$ instead of $y_t$. Or we may train a submodel to figure out what part of $x$ predicts $y$, so a submodel which is a univariate regression or something.
There are lots of choices here, but the point is it’s all causal, so you have to be careful when you train your overall model how to introduce your next data point and make sure the steps are all in order of time, and that you’re never ever cheating and looking ahead in time at data that hasn’t happened yet.
Financial time series
In finance we consider returns, say daily. And it’s not percent returns, actually it’s log returns: if $p_t$ denotes the close on day $t$, then the log return that day is defined as

$r_t = \log\left(\frac{p_t}{p_{t-1}}\right).$
See more about this here.
So if you start with S&P closing levels:
Then you get the following log returns:
What’s that mess? It’s crazy volatility caused by the financial crisis. We sometimes (not always) want to account for that volatility by normalizing with respect to it (described above). Once we do that we get something like this:
Which is clearly better behaved. Note this process is discussed in this post.
We could also normalize with respect to the mean, but we typically assume the mean of daily returns is 0, so as to not bias our models on short term trends.
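Putting the last few steps together, here’s a small pandas sketch of computing log returns and normalizing them by a causal (exponentially downweighted, shifted-by-one) volatility estimate. The price series is simulated, and the 0.97 decay is just the example value used above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=1000))))

# log returns: r_t = log(p_t / p_{t-1})
returns = np.log(prices / prices.shift(1)).dropna()

# causal volatility estimate: exponentially downweighted squared returns,
# shifted by one so that today's estimate only uses past returns
vol = np.sqrt((returns ** 2).ewm(alpha=1 - 0.97).mean().shift(1))

# normalized returns (mean assumed to be zero, as discussed above)
normalized = (returns / vol).dropna()
print(normalized.describe())
```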
Financial Modeling
One thing we need to understand about financial modeling is that there’s a feedback loop. If you find a way to make money, it eventually goes away- sometimes people refer to this as the fact that the “market learns over time”.
One way to see this is that, in the end, your model comes down to knowing some price is going to go up in the future, so you buy it before it goes up, you wait, and then you sell it at a profit. But if you think about it, your buying it has actually changed the process, and decreased the signal you were anticipating. That’s how the market learns – it’s a combination of a bunch of algorithms anticipating things and making them go away.
The consequence of this learning over time is that the existing signals are very weak. We are happy with a 3% correlation for models that have a horizon of 1 day (a “horizon” for your model is how long you expect your prediction to be good). This means not much signal, and lots of noise! In particular, lots of the machine learning “metrics of success” for models, such as measurements of precision or accuracy, are not very relevant in this context.
So instead of measuring accuracy, we generally draw a picture to assess models, namely of the (cumulative) PnL of the model. This generalizes to any model as well- you plot the cumulative sum of the product of demeaned forecast and demeaned realized. In other words, you see if your model consistently does better than the “stupidest” model of assuming everything is average.
If you plot this and you drift up and to the right, you’re good. If it’s too jaggedy, that means your model is taking big bets and isn’t stable.
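Here’s a bare-bones version of that picture in python; the “model” is a made-up signal with roughly 3% correlation to the realized values, just to show the shape of the calculation:

```python
import numpy as np

def cumulative_pnl_curve(forecast, realized):
    """Cumulative sum of (demeaned forecast) * (demeaned realized):
    drifting up and to the right means the model beats 'everything is average'."""
    f = forecast - forecast.mean()
    r = realized - realized.mean()
    return np.cumsum(f * r)

# toy example with a weak signal
rng = np.random.default_rng(1)
signal = rng.normal(size=2000)
realized = 0.03 * signal + rng.normal(size=2000)   # roughly a 3% correlation regime
curve = cumulative_pnl_curve(signal, realized)
print(curve[-1])   # in practice you'd plot the whole curve
```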
Why regression?
From above we know the signal is weak. If you imagine there’s some complicated underlying relationship between your information and the thing you’re trying to predict, get over knowing what that is – there’s too much noise to find it. Instead, think of the function as possibly complicated, but continuous, and imagine you’ve written it out as a Taylor Series. Then you can’t possibly expect to get your hands on anything but the linear terms.
Don’t think about using logistic regression, either, because you’d need to be ignoring size, which matters in finance- it matters if a stock went up 2% instead of 0.01%. But logistic regression forces you to have an on/off switch, which would be possible but would lose a lot of information. Considering the fact that we are always in a low-information environment, this is a bad idea.
Note that although I’m claiming you probably want to use linear regression in a noisy environment, the terms themselves don’t have to be linear in the information you have. You can always take products of various terms as the x’s in your regression, but you’re still fitting a linear model in non-linear terms.
Advanced regression
The first thing I need to explain is the exponential downweighting of old data, which I already used in a graph above, where I normalized returns by volatility with a decay of 0.97. How do I do this?
Working from this post again, the formula is essentially a weighted version of the usual one, where I weight recent data more than older data, and where the weight of older data is a power of a parameter $s$ called the decay. The exponent is the number of time intervals since that data was new. Putting that together, the exponentially downweighted estimate of the variance of the returns is:

$\hat{\sigma}_t^2 = \frac{\sum_{i \geq 0} s^i \, r_{t-i}^2}{\sum_{i \geq 0} s^i}.$

We are actually dividing by the sum of the weights, but the weights are powers of the number $s$, so it’s a geometric sum and the sum is given by

$\sum_{i \geq 0} s^i = \frac{1}{1-s}.$

One cool consequence of this formula is that it’s easy to update: if we have a new return $r_{t+1}$ to add to the series, then it’s not hard to show we just want

$\hat{\sigma}_{t+1}^2 = s\,\hat{\sigma}_t^2 + (1-s)\, r_{t+1}^2.$

In fact this is the general rule for updating exponentially downweighted estimates, and it’s one reason we like them so much – you only need to keep in memory your last estimate and the number $s$.
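As a sanity check on the update rule, here’s a tiny sketch comparing the explicit weighted sum to the recursive update, with made-up returns and the decay s = 0.97 from the plot above:

```python
import numpy as np

s = 0.97                       # decay
rng = np.random.default_rng(2)
returns = rng.normal(scale=0.01, size=250)

# explicit form: weighted average of squared returns, newest weighted most
def ewma_var_explicit(r, s):
    weights = s ** np.arange(len(r))[::-1]     # oldest data gets the highest power of s
    return np.sum(weights * r**2) / np.sum(weights)

# recursive form: new estimate = s * old estimate + (1 - s) * newest squared return
est = returns[0] ** 2
for r in returns[1:]:
    est = s * est + (1 - s) * r**2

# close but not identical: the finite sum of weights differs slightly from 1/(1-s)
print(ewma_var_explicit(returns, s), est)
```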
How do you choose your decay length? This is an art instead of a science, and depends on the domain you’re in. Think about how many days (or time periods) it takes to weight a data point at half of a new data point, and compare that to how fast the market forgets stuff.
This downweighting of old data is an example of inserting a prior into your model, where here the prior is “new data is more important than old data”. What are other kinds of priors you can have?
Priors
Priors can be thought of as opinions like the above. Besides “new data is more important than old data,” we may decide our prior is “coefficients vary smoothly.” This is relevant when we decide, say, to use a bunch of old values of some time series to help predict the next one, giving us a model like:

$F_{t+1} = \beta_1 F_t + \beta_2 F_{t-1} + \epsilon,$

which is just the example where we take the last two values of the time series $F$ to predict the next one. But we could use more than two values, of course.
[Aside: in order to decide how many values to use, you might want to draw an autocorrelation plot for your data.]
The way you’d place the prior about the relationship between coefficients (in this case consecutive lagged data points) is by adding a matrix to your covariance matrix when you perform linear regression. See more about this here.
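Here’s one hedged sketch of how such a prior could be implemented: add a penalty matrix built from first differences of the coefficients to $X^T X$ before solving, so that wildly different consecutive lag coefficients get punished. The penalty strength and the difference-matrix construction are my choices for illustration, not the specific recipe from the lecture:

```python
import numpy as np

def smooth_lag_regression(X, y, lam=10.0):
    """Ridge-like fit where the penalty is on differences between consecutive
    coefficients (a 'coefficients vary smoothly' prior) rather than on their size."""
    n, k = X.shape
    D = np.diff(np.eye(k), axis=0)     # (k-1) x k first-difference matrix
    penalty = lam * D.T @ D            # the matrix added to the covariance term
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

# build a lag matrix from a toy series F and fit the smoothed model
rng = np.random.default_rng(3)
F = rng.normal(size=300)
lags = 10
X = np.column_stack([F[i:len(F) - lags + i] for i in range(lags)])
y = F[lags:]
print(smooth_lag_regression(X, y))
```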
Ethics
I then talked about modeling and ethics. My goal is to get this next-gen group of data scientists sensitized to the fact that they are not just nerds sitting in the corner but have increasingly important ethical questions to consider while they work.
People tend to overfit their models. It’s human nature to want your baby to be awesome. They also underestimate the bad news and blame other people for bad news, because nothing their baby has done or is capable of is bad, unless someone else made them do it. Keep these things in mind.
I then described what I call the deathspiral of modeling, a term I coined in this post on creepy model watching.
I counseled the students to
- try to maintain skepticism about their models and how their models might get used,
- shoot holes in their own ideas,
- accept challenges and devise tests as scientists rather than defending their models using words – if someone thinks they can do better, then let them try, and agree on an evaluation method beforehand,
- In general, try to consider the consequences of their models.
I then showed them Emanuel Derman’s Hippocratic Oath of Modeling, which was made for financial modeling but fits perfectly into this framework. I discussed the politics of working in industry, namely that even if they are skeptical of their model there’s always the chance that it will be used the wrong way in spite of the modeler’s warnings. So the Hippocratic Oath is, unfortunately, insufficient in reality (but it’s a good start!).
Finally, there are ways to do good: I mentioned stuff like DataKind. There are also ways to be transparent: I mentioned Open Models, which is so far just an idea, but Victoria Stodden is working on RunMyCode, which is similar and very awesome.
What is a model?
I’ve been thinking a lot recently about mathematical models and how to explain them to people who aren’t mathematicians or statisticians. I consider this increasingly important as more and more models are controlling our lives, such as:
- employment models, which help large employers screen through applications,
- political ad models, which allow political groups to personalize their ads,
- credit scoring models, which allow consumer product companies and loan companies to screen applicants, and,
- if you’re a teacher, the Value-Added Model.
- See more models here and here.
It’s a big job, to explain these, because the truth is they are complicated – sometimes overly so, sometimes by construction.
The truth is, though, you don’t really need to be a mathematician to know what a model is, because everyone uses internal models all the time to make decisions.
For example, you intuitively model everyone’s appetite when you cook a meal for your family. You know that one person loves chicken (but hates hamburgers), while someone else will only eat the pasta (with extra cheese). You even take into account that people’s appetites vary from day to day, so you can’t be totally precise in preparing something – there’s a standard error involved.
To explain modeling at this level, then, you just need to imagine that you’ve built a machine that knows all the facts that you do and knows how to assemble them together to make a meal that will approximately feed your family. If you think about it, you’ll realize that you know a shit ton of information about the likes and dislikes of all of your family members, because you have so many memories of them grabbing seconds of the asparagus or avoiding the string beans.
In other words, it would be actually incredibly hard to give a machine enough information about all the food preferences for all your family members, and yourself, along with the constraints of having not too much junky food, but making sure everyone had something they liked, etc. etc.
So what would you do instead? You’d probably give the machine broad categories of likes and dislikes: this one likes meat, this one likes bread and pasta, this one always drinks lots of milk and puts nutella on everything in sight. You’d dumb it down for the sake of time, in other words. The end product, the meal, may not be perfect but it’s better than no guidance at all.
That’s getting closer to what real-world modeling for people is like. And the conclusion is right too- you aren’t expecting your model to do a perfect job, because you only have a broad outline of the true underlying facts of the situation.
Plus, when you’re modeling people, you have to a priori choose the questions to ask, which will probably come in the form of “does he/she like meat?” instead of “does he/she put nutella on everything in sight?”; in other words, the important but idiosyncratic rules won’t even be seen by a generic one-size-fits-everything model.
Finally, those generic models are hugely scaled- sometimes there’s really only one out there, being used everywhere, and its flaws are compounded that many times over because of its reach.
So, say you’ve got a CV with a spelling error. You’re trying to get a job, but the software that screens for applicants automatically rejects you because of this spelling error. Moreover, the same screening model is used everywhere, and you therefore don’t get any interviews because of this one spelling error, in spite of the fact that you’re otherwise qualified.
I’m not saying this would happen – I don’t know how those models actually work, although I do expect points against you for spelling errors. My point is there’s some real danger in using such models on a very large scale that we know are simplified versions of reality.
One last thing. The model fails in the example above, because the qualified person doesn’t get a job. But it fails invisibly; nobody knows exactly how it failed or even that it failed. Moreover, it only really fails for the applicant who doesn’t get any interviews. For the employer, as long as some qualified applicants survive the model, they don’t see failure at all.
Columbia Data Science course, week 4: K-means, Classifiers, Logistic Regression, Evaluation
This week our guest lecturer for the Columbia Data Science class was Brian Dalessandro. Brian works at Media6Degrees as a VP of Data Science, and he’s super active in the research community. He’s also served as co-chair of the KDD competition.
Before Brian started, Rachel threw us a couple of delicious data science tidbits.
The Process of Data Science
First we have the Real World. Inside the Real World we have:
- Users using Google+
- People competing in the Olympics
- Spammers sending email
From this we draw raw data, e.g. logs, all the olympics records, or Enron employee emails. We want to process this to make it clean for analysis. We use pipelines of data munging, joining, scraping, wrangling or whatever you want to call it and we use tools such as:
- python
- shell scripts
- R
- SQL
We eventually get the data down to a nice format, say something with columns:
name event year gender event time
Note: this is where you typically start in a standard statistics class. But it’s not where we typically start in the real world.
Once you have this clean data set, you should be doing some kind of exploratory data analysis (EDA); if you don’t really know what I’m talking about then look at Rachel’s recent blog post on the subject. You may realize that it isn’t actually clean.
Next, you decide to apply some algorithm you learned somewhere:
- k-nearest neighbor
- regression
- Naive Bayes
- (something else),
depending on the type of problem you’re trying to solve:
- classification
- prediction
- description
You then:
- interpret
- visualize
- report
- communicate
At the end you have a “data product”, e.g. a spam classifier.
K-means
So far we’ve only seen supervised learning. K-means is the first unsupervised learning technique we’ll look into. Say you have data at the user level:
- G+ data
- survey data
- medical data
- SAT scores
Assume each row of your data set corresponds to a person, say each row corresponds to information about the user as follows:
age gender income Geo=state household size
Your goal is to segment them, otherwise known as stratify, or group, or cluster. Why? For example:
- you might want to give different users different experiences. Marketing often does this.
- you might have a model that works better for specific groups
- hierarchical modeling in statistics does something like this.
One possibility is to choose the groups yourself. Bucket users using homemade thresholds. Like by age, 20-24, 25-30, etc. or by income. In fact, say you did this, by age, gender, state, income, marital status. You may have 10 age buckets, 2 gender buckets, and so on, which would result in 10x2x50x10x3 = 30,000 possible bins, which is big.
You can picture a five dimensional space with buckets along each axis, and each user would then live in one of those 30,000 five-dimensional cells. You wouldn’t want 30,000 marketing campaigns so you’d have to bin the bins somewhat.
Wait, what if you want to use an algorithm instead where you could decide on the number of bins? K-means is a “clustering algorithm”, and k is the number of groups. You pick k, a hyper parameter.
2-d version
Say you have users with #clicks, #impressions (or age and income – anything with just two numerical parameters). Then k-means looks for clusters on the 2-d plane. Here’s a stolen and simplistic picture that illustrates what this might look like:
The general algorithm is just the same picture but generalized to d dimensions, where d is the number of features for each data point.
Here’s the actual algorithm:
- randomly pick K centroids
- assign data to closest centroid.
- move the centroids to the average location of the users assigned to it
- repeat until the assignments don’t change
It’s up to you to interpret if there’s a natural way to describe these groups.
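Here’s a bare-bones implementation of the algorithm above, on made-up 2-d data (think #clicks and #impressions per user); it doesn’t handle edge cases like empty clusters:

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Bare-bones k-means: pick k random centroids, assign each point to the
    closest one, move each centroid to the mean of its points, repeat."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move centroids to the average location of their assigned points
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# toy 2-d example with three obvious groups
rng = np.random.default_rng(1)
users = np.vstack([rng.normal(loc, 1.0, size=(100, 2)) for loc in ([0, 0], [5, 5], [0, 5])])
labels, centroids = kmeans(users, k=3)
print(centroids)
```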
This is unsupervised learning and it has issues:
- choosing an optimal k is also a problem, although a common rule of thumb is $k \approx \sqrt{n/2}$, where n is the number of data points.
- convergence issues – the solution can fail to exist (the configurations can fall into a loop) or can be “wrong”
- but it’s also fast
- interpretability can be a problem – sometimes the answer isn’t useful
- in spite of this, there are broad applications in marketing, computer vision (partition an image), or as a starting point for other models.
One common tool we use a lot in our systems is logistic regression.
Thought Experiment
Brian now asked us the following:
How would data science differ if we had a “grand unified theory of everything”?
He gave us some thoughts:
- Would we even need data science?
- Theory offers us a symbolic explanation of how the world works.
- What’s the difference between physics and data science?
- Is it just accuracy? After all, Newton wasn’t completely precise, but was pretty close.
If you think of the sciences as a continuum, where physics is all the way on the right, and as you go left, you get more chaotic, then where is economics on this spectrum? Marketing? Finance? As we go left, we’re adding randomness (and as a clever student points out, salary as well).
Bottom line: if we could model this data science stuff the way we know how to model physics, we’d know exactly when people will click on what ad. The real world isn’t understood nearly that well, nor do we expect it to be in the future.
Does “data science” deserve the word “science” in its name? Here’s why maybe the answer is yes.
We always have more than one model, and our models are always changing.
The art in data science is this: translating the problem into the language of data science
The science in data science is this: given raw data, constraints, and a problem statement, you have an infinite set of models to choose from, which you will use to maximize performance on some evaluation metric that you will have to specify. Every design choice you make can be formulated as a hypothesis, which you then validate or refute with rigorous testing and experimentation.
Never underestimate the power of creativity: usually people have vision but no method. As the data scientist, you have to turn it into a model within the operational constraints. You need to optimize a metric that you get to define. Moreover, you do this with a scientific method, in the following sense.
Namely, you hold onto your existing best performer, and once you have a new idea to prototype, then you set up an experiment wherein the two best models compete. You therefore have a continuous scientific experiment, and in that sense you can justify it as a science.
Classifiers
Given
- data
- a problem, and
- constraints,
we need to determine:
- a classifier,
- an optimization method,
- a loss function,
- features, and
- an evaluation metric.
Today we will focus on the process of choosing a classifier.
Classification involves mapping your data points into a finite set of labels or the probability of a given label or labels. Examples of when you’d want to use classification:
- will someone click on this ad?
- what number is this?
- what is this news article about?
- is this spam?
- is this pill good for headaches?
From now on we’ll talk about binary classification only (0 or 1).
Examples of classification algorithms:
- decision tree
- random forests
- naive bayes
- k-nearest neighbors
- logistic regression
- support vector machines
- neural networks
Which one should we use?
One possibility is to try them all, and choose the best performer. This is fine if you have no constraints or if you ignore constraints. But usually constraints are a big deal – you might have tons of data or not much time or both.
If I need to update 500 models a day, I do need to care about runtime; these end up being bidding decisions. Some algorithms are slow – k-nearest neighbors, for example. Linear models, by contrast, are very fast.
One under-appreciated constraint of a data scientist is this: your own understanding of the algorithm.
Ask yourself carefully, do you understand it for real? Really? Admit it if you don’t. You don’t have to be a master of every algorithm to be a good data scientist. The truth is, getting the “best-fit” of an algorithm often requires intimate knowledge of said algorithm. Sometimes you need to tweak an algorithm to make it fit your data. A common mistake for people not completely familiar with an algorithm is to overfit.
Another common constraint: interpretability. You often need to be able to interpret your model, for the sake of the business for example. Decision trees are very easy to interpret. Random forests, on the other hand, are not, even though they’re built out of decision trees – they can take exponentially longer to explain in full. If you don’t have 15 years to spend understanding a result, you may be willing to give up some accuracy in order to have something easier to understand.
Note that credit cards have to be able to explain their models by law so decision trees make more sense than random forests.
How about scalability? In general, there are three things you have to keep in mind when considering scalability:
- learning time: how much time does it take to train the model?
- scoring time: how much time does it take to give a new user a score once the model is in production?
- model storage: how much memory does the production model use up?
Here’s a useful paper to look at when comparing models: “An Empirical Comparison of Supervised Learning Algorithms”, from which we learn:
- Simpler models are more interpretable but aren’t as good performers.
- The question of which algorithm works best is problem dependent
- It’s also constraint dependent
At M6D, we need to match clients (advertising companies) to individual users. We have logged the sites they have visited on the internet. Different sites collect this information for us. We don’t look at the contents of the page – we take the url and hash it into some random string and then we have, say, the following data about a user we call “u”:
u = <xyz, 123, sdqwe, 13ms>
This means u visited 4 sites whose urls hashed to the above strings. Recall last week we learned about a spam classifier where the features are words. We aren’t looking at the meaning of the words, so they might as well be arbitrary strings.
At the end of the day we build a giant matrix whose columns correspond to sites and whose rows correspond to users, and there’s a “1” if that user went to that site.
To make this a classifier, we also need to associate the behavior “clicked on a shoe ad”. So, a label.
Once we’ve labeled as above, this looks just like spam classification. We can now rely on well-established methods developed for spam classification – reduction to a previously solved problem.
Logistic Regression
We have three core problems as data scientists at M6D:
- feature engineering,
- user level conversion prediction,
- bidding.
We will focus on the second. We use logistic regression– it’s highly scalable and works great for binary outcomes.
What if you wanted to do something else? You could simply find a threshold so that above it you predict 1 and below it you predict 0. Or you could use a linear model like linear regression, but then you’d need to cut off anything below 0 or above 1.
What’s better: fit a function that is bounded inside [0,1]. For example, the logit function

$P(t) = \frac{1}{1 + e^{-t}}.$

We want to estimate

$P(Y = 1 \mid X = x) = \frac{1}{1 + e^{-(\alpha + \beta^T x)}}.$

To make this a linear model in the outcomes, we take the log of the odds ratio:

$\log\left(\frac{P(Y = 1 \mid x)}{1 - P(Y = 1 \mid x)}\right) = \alpha + \beta^T x.$

The parameter $\alpha$ keeps the shape of the logit curve but shifts it back and forth. To interpret $\alpha$ further, consider what we call the base rate, the unconditional probability of “1” (so, in the case of ads, the base rate would correspond to the click-through rate, i.e. the overall tendency for people to click on ads; this is typically on the order of 1%).

If you had no information except the base rate, the average prediction would be just that. In a logistic regression, $\alpha$ defines the base rate. Specifically, the base rate is approximately equal to

$\frac{1}{1 + e^{-\alpha}}.$

The slope $\beta$ defines the slope of the logit function. Note in general it’s a vector which is as long as the number of features we are using for each data point.
Our immediate modeling goal is to use our training data to find the best choices for $\alpha$ and $\beta$.

We will use maximum likelihood estimation, via convex optimization, to achieve this; we can’t just use derivatives and vector calculus like we did with linear regression, because the likelihood is a complicated function of our data.

The likelihood function is defined by

$L(\alpha, \beta \mid X_1, Y_1, \ldots, X_n, Y_n) = \prod_{i=1}^n P(Y = Y_i \mid X = X_i),$

where we are assuming the data points are independent and where $p_i = P(Y = 1 \mid X = X_i) = \frac{1}{1 + e^{-(\alpha + \beta^T X_i)}}.$

We then search for the parameters that maximize this, having observed our data:

$(\hat{\alpha}, \hat{\beta}) = \arg\max_{\alpha, \beta} L(\alpha, \beta \mid X_1, Y_1, \ldots, X_n, Y_n).$

The probability of a single observation is

$p_i^{Y_i}(1 - p_i)^{1 - Y_i},$

where $p_i$ is the modeled probability of a “1” for the binary outcome $Y_i$. Taking the product of all of these, we get our likelihood function, which we want to maximize.
Similar to last week, we now take the log and get something convex, so it has to have a global maximum. Finally, we use numerical techniques to find it, which essentially follow the gradient, like Newton’s method from calculus. Computer programs can do this pretty well. These algorithms depend on a step size, which we will need to adjust as we get closer to the global max or min – there’s an art to this piece of numerical optimization as well. Each step of the algorithm looks something like

$\theta_{n+1} = \theta_n + \gamma_n \nabla \ell(\theta_n),$

where $\gamma_n$ is the step size, and where remember we are actually optimizing our parameters $\alpha$ and $\beta$ to maximize the (log) likelihood function, so the $\theta$ you see above is really the vector of parameters $(\alpha, \beta)$ and the function $\ell$ corresponds to our log likelihood.
“Flavors” of Logistic Regression for convex optimization.
The Newton’s method we described above is also called Iteratively Reweighted Least Squares. It uses the curvature of the log likelihood to choose an appropriate step direction. The actual calculation involves the Hessian matrix, and in particular requires inverting it, which is a $k \times k$ matrix where $k$ is the number of features. This is bad when there are lots of features, as in 10,000 or something. Typically we don’t have that many features, but it’s not impossible.
Another possible method to maximize our likelihood or log likelihood is called Stochastic Gradient Descent. It approximates gradient using a single observation at a time. The algorithm updates the current best-fit parameters each time it sees a new data point. The good news is that there’s no big matrix inversion, and it works well with huge data and with sparse features; it’s a big deal in Mahout and Vowpal Wabbit. The bad news is it’s not such a great optimizer and it’s very dependent on step size.
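As a sketch of the stochastic gradient idea – update the parameters after every single observation, with a fixed step size – here’s a toy python version; this is an illustration, not M6D’s production code:

```python
import numpy as np

def logistic_sgd(X, y, step=0.1, epochs=5, seed=0):
    """Stochastic gradient ascent on the logistic regression log-likelihood:
    update (alpha, beta) after every single observation. Fixed step size here;
    in practice you'd decay it."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    alpha, beta = 0.0, np.zeros(k)
    for _ in range(epochs):
        for i in rng.permutation(n):
            p = 1.0 / (1.0 + np.exp(-(alpha + X[i] @ beta)))
            error = y[i] - p               # per-observation gradient of the log-likelihood
            alpha += step * error
            beta += step * error * X[i]
    return alpha, beta

# toy data: 2 features, base rate controlled by the intercept
rng = np.random.default_rng(4)
X = rng.normal(size=(5000, 2))
true_p = 1.0 / (1.0 + np.exp(-(-2.0 + X @ np.array([1.0, -0.5]))))
y = (rng.random(5000) < true_p).astype(float)
print(logistic_sgd(X, y))
```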
Evaluation
We generally use different evaluation metrics for different kind of models.
First, for ranking models, where we just want to know a relative rank rather than an absolute score, we’d look to something like lift curves or the area under the ROC curve (AUC), both discussed below.
Second, for classification models, we’d look at the following metrics:
- lift: how much more people are buying or clicking because of a model
- accuracy: how often the correct outcome is being predicted
- f-score
- precision
- recall
Finally, for density estimation, where we need to know an actual probability rather than a relative score, we’d look to error measures on the predicted probabilities, such as mean squared error or mean absolute error (see below).
In general it’s hard to compare lift curves but you can compare AUC (area under the receiver operator curve) – they are “base rate invariant.” In other words if you bring the click-through rate from 1% to 2%, that’s 100% lift but if you bring it from 4% to 7% that’s less lift but more effect. AUC does a better job in such a situation when you want to compare.
Density estimation tests tell you how well you are fitting the conditional probability. In advertising, this may arise if you have a situation where each ad impression costs $c and each conversion brings in $q. You will want to target every impression that has a positive expected value, i.e. whenever

$P(\text{conversion} \mid X = x) \cdot q > c.$

But to do this you need to make sure the probability estimate on the left is accurate, which in this case means something like the mean squared error of the estimator is small. Note a model can give you good rankings but bad probability estimates.
Similarly, features that rank highly on AUC don’t necessarily rank well with respect to mean absolute error. So feature selection, as well as your evaluation method, is completely context-driven.
Evaluating professor evaluations
I recently read this New York Times “Room for Debate” on professor evaluations. There were some reasonably good points made, with people talking about the trend that students generally give better grades to attractive professors and easy grading professors, and that they were generally more interested in the short-term than in the long-term in this sense.
For these reasons, it was stipulated, it would be better and more informative to have anonymous evaluations, or have students come back after some time to give evaluations, or interesting ideas like that.
Then there was a crazy crazy man named Jeff Sandefer, co-founder and master teacher at the Acton School of Business in Austin, Texas. He likes to call his students “customers” and here’s how he deals with evaluations:
Acton, the business school that I co-founded, is designed and is led exclusively by successful chief executives. We focus intently on customer feedback. Every week our students rank each course and professor, and the results are made public for all to see. We separate the emotional venting from constructive criticism in the evaluations, and make frequent changes in the program in real time.
We also tie teacher bonuses to the student evaluations and each professor signs an individual learning covenant with each student. We have eliminated grade inflation by using a forced curve for student grades, and students receive their grades before evaluating professors. Not only do we not offer tenure, but our lowest rated teachers are not invited to return.
First of all, I’m not crazy about the idea of weekly rankings and public shaming going on here. And how do you separate emotional venting from constructive criticism anyway? Isn’t the customer always right? Overall the experience of the teachers doesn’t sound good – if I have a choice as a teacher, I teach elsewhere, unless the pay and the students are stellar.
On the other hand, I think it’s interesting that they have a curve for student grades. This does prevent the extra good evaluations coming straight from grade inflation (I’ve seen it, it does happen).
Here’s one thing I didn’t see discussed: the students themselves, and how much they want to be in the class. When I taught first-semester calculus at Barnard twice in consecutive semesters, my experience was vastly different in the two classes.
The first time I taught, in the Fall, my students were mostly straight out of high school, bright eyed and bushy tailed, and were happy to be there, and I still keep in touch with some of them. It was a great class, and we all loved each other by the end of it. I got crazy good reviews.
By contrast, the second time I taught the class, which was the next semester, my students were annoyed, bored, and whiny. I had too many students in the class, partly because my reviews were so good. So the class was different on that score, but I don’t think that mattered so much to my teaching.
My theory, which was backed up by all the experienced Profs in the math department, was that I had the students who were avoiding calculus for some reason. And when I thought about it, they weren’t straight out of high school, they were all over the map. They generally were there only because they needed some kind of calculus to fulfill a requirement for their major.
Unsurprisingly, I got mediocre reviews, with some really pretty nasty ones. The nastiest ones, I noticed, all had some giveaway that they had a bad attitude- something like, “Cathy never explains anything clearly, and I hate calculus.” My conclusion is that I get great evaluations from students who want to learn calculus and nasty evaluations from students who resent me asking them to really learn calculus.
What should we do about prof evaluations?
The problem with using evaluations to measure professor effectiveness is that you might be a prof that only has ever taught calculus in the Spring, and then you’d be wrongfully punished. That’s where we are now, and people know it, so instead of using them they just mostly ignore them. Of course, the problem with not ever using these evaluations is that they might actually contain good information that you could use to get better at teaching.
We have a lot of data collected on teacher evaluations, so I figure we should be analyzing it to see if there really is a useful signal or not. And we should use domain expertise from experienced professors to see if there are any other effects besides the “Fall/Spring attitude towards math” effect to keep in mind.
It’s obviously idiosyncratic depending on field and even which class it is, i.e. Calc II versus Calc III. If there even is a signal after you extract the various effects and the “attractiveness” effect, I expect it to be very noisy and so I’d hate to see someone’s entire career depend on evaluations, unless there was something really outrageous going on.
In any case it would be fun to do that analysis.
Columbia Data Science course, week 3: Naive Bayes, Laplace Smoothing, and scraping data off the web
In the third week of the Columbia Data Science course, our guest lecturer was Jake Hofman. Jake is at Microsoft Research after recently leaving Yahoo! Research. He got a Ph.D. in physics at Columbia and taught a fantastic course on modeling last semester at Columbia.
After introducing himself, Jake drew up his “data science profile;” turns out he is an expert on a category that he created called “data wrangling.” He confesses that he doesn’t know if he spends so much time on it because he’s good at it or because he’s bad at it.
Thought Experiment: Learning by Example
Jake had us look at a bunch of text. What is it? After some time we describe each row as the subject and first line of an email in Jake’s inbox. We notice the bottom half of the rows of text looks like spam.
Now Jake asks us, how did you figure this out? Can you write code to automate the spam filter that your brain is?
Some ideas the students came up with:
- Any email is spam if it contains Viagra references. Jake: this will work if they don’t modify the word.
- Maybe something about the length of the subject?
- Exclamation points may point to spam. Jake: can’t just do that since “Yahoo!” would count.
- Jake: keep in mind spammers are smart. As soon as you make a move, they game your model. It would be great if we could get them to solve important problems.
- Should we use a probabilistic model? Jake: yes, that’s where we’re going.
- Should we use k-nearest neighbors? Should we use regression? Recall we learned about these techniques last week. Jake: neither. We’ll use Naive Bayes, which is somehow between the two.
Why is linear regression not going to work?
Say you make a feature for each lower case word that you see in any email and then we used R’s “lm function:”
lm(spam ~ word1 + word2 + …)
Wait, that’s too many variables compared to observations! We have on the order of 10,000 emails and on the order of 100,000 words. This will definitely overfit. Technically, this corresponds to the fact that the matrix $X^T X$ in the equation for linear regression is not invertible. Moreover, we maybe can’t even store it because it’s so huge.
Maybe you could limit to top 10,000 words? Even so, that’s too many variables vs. observations to feel good about it.
Another thing to consider is that the target is 0 or 1 (0 if not spam, 1 if spam), whereas linear regression would give you a continuous number rather than a 0 or a 1. Of course you could choose a critical value so that above it we call the prediction “1” and below it “0”. Next week we’ll do even better when we explore logistic regression, which is set up to model a binary response like this.
How about k-nearest neighbors?
To use k-nearest neighbors we would still need to choose features, probably corresponding to words, and you’d likely define the value of those features to be 0 or 1 depending on whether the word is present or not. This leads to a weird notion of “nearness”.
Again, with 10,000 emails and 100,000 words, we’ll encounter a problem: this time it’s not a singular matrix, but rather that the space we’d be working in has too many dimensions. This means, for example, that it requires lots of computational work to even compute distances, but even that’s not the real problem.
The real problem is even more basic: even your nearest neighbors are really far away. This is called “the curse of dimensionality“, and it makes k-nearest neighbors a poor algorithm here.
Question: what if sharing a bunch of words doesn’t mean sentences are near each other in the sense of language? I can imagine two sentences with the same words but very different meanings.
Jake: it’s not as bad as it sounds like it might be – I’ll give you references at the end that partly explain why.
Aside: digit recognition
In this case k-nearest neighbors works well and moreover you can write it in a few lines of R.
Take your underlying representation apart pixel by pixel, say in a 16 x 16 grid of pixels, and measure how bright each pixel is. Unwrap the 16×16 grid and put it into a 256-dimensional space, which has a natural archimedean metric. Now apply the k-nearest neighbors algorithm.

Some notes:
- If you vary the number of neighbors, it changes the shape of the boundary and you can tune k to prevent overfitting.
- You can get 97% accuracy with a sufficiently large data set.
- Result can be viewed in a “confusion matrix“.
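Here's the promised sketch in R, assuming two hypothetical objects: images, a list of 16×16 matrices of pixel brightness, and labels, the vector of true digits.

library(class)   # provides knn()

# Unwrap each 16x16 grid into a 256-dimensional vector, one row per image.
# `images` and `labels` are hypothetical placeholders.
features <- t(sapply(images, as.vector))

train_idx <- sample(nrow(features), 0.9 * nrow(features))
pred <- knn(train = features[train_idx, ],
            test  = features[-train_idx, ],
            cl    = labels[train_idx],
            k     = 5)

table(predicted = pred, actual = labels[-train_idx])   # the "confusion matrix"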
Naive Bayes
Question: You're testing for a rare disease, where 1% of the population is infected. You have a highly sensitive and specific test:
- 99% of sick patients test positive
- 99% of healthy patients test negative
Given that a patient tests positive, what is the probability that the patient is actually sick?
Answer: Imagine you have 100×100 = 10,000 people. 100 are sick and 9,900 are healthy. 99 of the sick people test positive, and 99 of the healthy people also test positive (1% of 9,900). So if you test positive, you're equally likely to be healthy or sick, and the answer is 50%.
Let’s do it again using fancy notation so we’ll feel smart:
Recall

p(y | x) p(x) = p(y, x) = p(x | y) p(y)

and solve for p(y | x):

p(y | x) = p(x | y) p(y) / p(x)

The denominator p(x) can be thought of as a "normalization constant;" we will often be able to avoid explicitly calculating it. When we apply the above to our situation, we get:

p(sick | test +) = p(test + | sick) p(sick) / p(test +) = (0.99)(0.01) / ((0.99)(0.01) + (0.01)(0.99)) = 0.5
This is called “Bayes’ Rule“. How do we use Bayes’ Rule to create a good spam filter? Think about it this way: if the word “Viagra” appears, this adds to the probability that the email is spam.
To see how this will work we will first focus on just one word at a time, which we generically call "word". Then we have:

p(spam | word) = p(word | spam) p(spam) / p(word)

The right-hand side of the above is computable using enough pre-labeled data. If we refer to non-spam as "ham", we only need p(word | spam), p(word | ham), p(spam), and p(ham).
This is essentially a counting exercise.
Example: go online and download Enron emails. Awesome. We are building a spam filter on that – really this means we’re building a new spam filter on top of the spam filter that existed for the employees of Enron.
Jake has a quick and dirty bash script which runs this. It downloads and unzips the file and creates a folder; each text file inside is an email, with spam and ham in separate folders.
Jake uses “wc” to count the number of messages for one former Enron employee, for example. He sees 1500 spam, and 3672 ham. Using grep, he counts the number of instances of “meeting”:
grep -il meeting enron1/spam/*.txt | wc -l
This gives 153, so we estimate p(meeting | spam) = 153/1500 ≈ 0.1. It's one of the handful of counts we need in order to compute p(spam | meeting).
Note we don’t need a fancy programming environment to get this done.
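To turn those counts into the probability we want, here's a minimal R sketch of Bayes' Rule; the 1500, 3672, and 153 come from the counts above, while the count of ham messages containing "meeting" is a hypothetical placeholder.

# Counts from the Enron folders (the ham "meeting" count is hypothetical).
n_spam         <- 1500
n_ham          <- 3672
n_spam_meeting <- 153    # spam messages containing "meeting", from the grep above
n_ham_meeting  <- 800    # hypothetical placeholder

p_word_given_spam <- n_spam_meeting / n_spam
p_word_given_ham  <- n_ham_meeting / n_ham
p_spam            <- n_spam / (n_spam + n_ham)

# Bayes' Rule: p(spam | word) = p(word | spam) p(spam) / p(word)
p_word <- p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)
p_word_given_spam * p_spam / p_word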
Next, we try:
- “money”: 80% chance of being spam.
- “viagra”: 100% chance.
- “enron”: 0% chance.
This illustrates overfitting: we are getting overconfident because of our biased data. It's possible, in other words, to write a non-spam email containing the word "viagra", as well as a spam email containing the word "enron."
Next, do it for all the words. Each document can be represented by a binary vector whose jth entry is 1 or 0 depending on whether the jth word appears. Note this is a huge-ass vector; in practice we'd probably represent it by just the indices of the words that actually do show up.
Here's the model we use to estimate the probability that we'd see a given word vector given that we know it's spam (or that it's ham). We denote the document vector by x and its various entries by x_j, where the j's correspond to all the indices of x, in other words to all the words. For now we denote "is spam" by c:

p(x | c) = ∏_j θ_j^{x_j} (1 - θ_j)^{1 - x_j}
The θ_j here is the probability that an individual word is present in a spam email (there's a separate set of θ's for ham, and we can compute them separately, and in parallel, for every word). Note we are modeling the words independently, and we don't count how many times they are present. That's why this is called "Naive".
Let's take the log of both sides:

log p(x | c) = ∑_j x_j log( θ_j / (1 - θ_j) ) + ∑_j log(1 - θ_j)

[It's good to take the log because multiplying together tiny numbers can give us numerical problems.]

The term log( θ_j / (1 - θ_j) ) doesn't depend on a given document, just on the word (and the class), so let's rename it w_j. Same with the term ∑_j log(1 - θ_j), which we can rename w_0. The real weights that vary by document are the x_j's.

We can now use Bayes' Rule to get an estimate of p(c | x), which is what we actually want. We can also get away with not computing all the terms if we only care whether it's more likely to be spam or to be ham: only the terms that differ between the two classes need to be computed, since the denominator p(x) is the same for both and drops out.
Wait, this ends up looking like a linear regression in the x_j's! But instead of computing the coefficients by inverting a huge matrix, the weights come out of the Naive Bayes algorithm.
This algorithm works pretty well, and it's "cheap to train" if you have a pre-labeled data set to train on: given a ton of emails, you just count the words in the spam and non-spam emails, and if you get more training data you can easily increment your counts. In practice there's a global model, which you then personalize to individuals. Moreover, there are lots of hard-coded, cheap rules applied before an email ever gets put into a fancy and slow model.
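Here is a minimal sketch of that counting-and-scoring recipe in R. The objects spam_docs and ham_docs are hypothetical (each element is assumed to be a character vector of the lower-cased words in one message), and the crude clamping of the estimated probabilities is a stand-in for the Laplace smoothing discussed below.

# spam_docs and ham_docs are hypothetical lists; each element is a character
# vector of the (lower-cased) words appearing in one message.
vocab <- unique(unlist(c(spam_docs, ham_docs)))

# For each word in the vocabulary, the fraction of messages containing it.
doc_freq <- function(docs) {
  present <- sapply(docs, function(words) vocab %in% words)   # |vocab| x n matrix
  rowMeans(present)
}

# Clamp away exact 0s and 1s so the logs below stay finite;
# Laplace smoothing (below) is the principled version of this.
clamp <- function(p, eps = 1e-3) pmin(pmax(p, eps), 1 - eps)
theta_spam <- clamp(doc_freq(spam_docs))
theta_ham  <- clamp(doc_freq(ham_docs))
p_spam     <- length(spam_docs) / (length(spam_docs) + length(ham_docs))

# Log-score of a message under one class: Bernoulli log-likelihood plus log prior.
score <- function(words, theta, prior) {
  x <- as.numeric(vocab %in% words)
  sum(x * log(theta) + (1 - x) * log(1 - theta)) + log(prior)
}

classify <- function(words) {
  if (score(words, theta_spam, p_spam) > score(words, theta_ham, 1 - p_spam)) "spam" else "ham"
}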
Here are some references:
- “Idiot’s Bayes – not so stupid after all?” – the whole paper is about why it doesn’t suck, which is related to redundancies in language.
- “Naive Bayes at Forty: The Independence Assumption in Information Retrieval”
- “Spam Filtering with Naive Bayes – Which Naive Bayes?“
Laplace Smoothing
Laplace Smoothing refers to the idea of replacing our straight-up estimate of the probability of seeing a given word in a spam email, namely n_j / n (where n_j is the number of spam emails containing word j and n is the total number of spam emails), with something a bit fancier:

θ_j = (n_j + α) / (n + α + β)

We might fix α = β = 1, for example, to prevent the possibility of getting 0 or 1 for a probability. Does this seem totally ad hoc? Well, if we want to get fancy, we can see this as equivalent to having a prior and performing a maximum a posteriori estimate rather than a maximum likelihood estimate, as follows.
If we denote by θ_ML the maximum likelihood estimate, then we have:

θ_ML = argmax_θ p(D | θ)

In other words, we are asking the question: for what value of θ were the data D most probable? If we assume independent trials, then we want to maximize

θ^{n_j} (1 - θ)^{n - n_j}

If you take the derivative and set it to zero, you get

θ_ML = n_j / n

In other words, just what we had before. Now let's add a prior. Denote by θ_MAP the maximum a posteriori estimate:

θ_MAP = argmax_θ p(θ | D)
This similarly asks the question, given the data I saw, which parameter is the most likely?
Use Bayes' rule to get p(θ | D) ∝ p(D | θ) p(θ). This looks similar to what we maximized above, except for the extra factor p(θ), which is the "prior". If I assume p(θ) is of the form θ^α (1 - θ)^β, then maximizing gives

θ_MAP = (n_j + α) / (n + α + β)

in other words, the Laplace-smoothed version above.

Sometimes α and β are called "pseudocounts". They're fancy but also simple. It's up to the data scientist to set the values of these hyperparameters, and it gives us two knobs to tune. By contrast, k-nearest neighbors has one knob, namely k.
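As a quick illustration in R, here's the raw estimate next to the smoothed one, with α = β = 1 and counts that are hypothetical except for the 153 and 1500 from the Enron example.

# Raw vs. Laplace-smoothed estimates of theta = p(word present | spam).
alpha <- 1; beta <- 1
n   <- 1500                                                  # total spam messages
n_j <- c(never_seen = 0, meeting = 153, always_seen = 1500)  # hypothetical except "meeting"

raw      <- n_j / n                              # hits exactly 0 and 1 at the extremes
smoothed <- (n_j + alpha) / (n + alpha + beta)   # the Laplace-smoothed estimate
rbind(raw, smoothed)                             # smoothed values stay strictly inside (0, 1)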
Note: In the last 5 years people have started using stochastic gradient methods to avoid the non-invertible (overfitting) matrix problem. Switching to logistic regression fit with stochastic gradient methods helped a lot, and it can account for correlations between words. Even so, Naive Bayes is impressively good considering how simple it is.
Scraping the web: API’s
For the sake of this discussion, an API (application programming interface) is something websites provide to developers so they can download data from the website easily and in standard format. Usually the developer has to register and receive a “key”, which is something like a password. For example, the New York Times has an API here. Note that some websites limit what data you have access to through their API’s or how often you can ask for data without paying for it.
When you go this route, you often get back weird formats, sometimes in JSON, but there’s no standardization to this standardization, i.e. different websites give you different “standard” formats.
One way to get beyond this is to use Yahoo’s YQL language which allows you to go to the Yahoo! Developer Network and write SQL-like queries that interact with many of the common API’s on the common sites like this:
select * from flickr.photos.search where text="Cat" and api_key="lksdjflskjdfsldkfj" limit 10
The output is standard and I only have to parse this in python once.
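If you'd rather stay in R than python, here's a minimal sketch using the jsonlite package; the endpoint URL, query parameters, and key are hypothetical placeholders, not any real API.

library(jsonlite)   # fromJSON() parses JSON from a string, file, or URL

api_key <- "lksdjflskjdfsldkfj"                  # placeholder, as above
url <- paste0("https://api.example.com/photos/search",
              "?text=Cat&limit=10&api_key=", api_key)

result <- fromJSON(url)      # nested lists / data frames
str(result, max.level = 2)   # inspect the structure once, then write your parser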
What if you want data when there’s no API available?
Note: always check the terms and services of the website before scraping.
In this case you might want to use something like the Firebug extension for Firefox: with it you can "inspect the element" on any webpage and grab the field you want inside the html. In fact it gives you access to the full html document, so you can interact with it and edit it. In this way you can see the html as a map of the page, with Firebug as a kind of tour guide.
After locating the stuff you want inside the html, you can use curl, wget, grep, awk, perl, etc., to write a quick and dirty shell script to grab what you want, especially for a one-off grab. If you want to be more systematic you can also do this using python or R.
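Here's what the quick-and-dirty version might look like in base R for a hypothetical page; the URL and the regular expression are placeholders you'd adapt after inspecting the html with Firebug.

# Grab a page and pull out whatever matches a pattern you found with Firebug.
page <- readLines("https://www.example.com/listings", warn = FALSE)   # hypothetical URL

# Say the prices live in <span class="price">...</span> tags:
hits   <- grep('class="price"', page, value = TRUE)
prices <- gsub('.*class="price">([^<]*)</span>.*', "\\1", hits)
head(prices)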
Other parsing tools you might want to look into:
- lynx and lynx -dump: good if you pine for the 1970's. Oh wait, 1992. Whatever.
- Beautiful Soup: robust but kind of slow
- Mechanize (or here): super cool as well, but it doesn't parse javascript.
Postscript: Image Classification
How do you determine if an image is a landscape or a headshot?
You either need to get someone to label these things, which is a lot of work, or you can grab lots of pictures from flickr and ask for photos that have already been tagged.
Represent each image with a binned RGB – (red green blue) intensity histogram. In other words, for each pixel, for each of red, green, and blue, which are the basic colors in pixels, you measure the intensity, which is a number between 0 and 255.
Then draw three histograms, one for each basic color, showing us how many pixels had which intensity. It’s better to do a binned histogram, so have counts of # pixels of intensity 0 – 51, etc. – in the end, for each picture, you have 15 numbers, corresponding to 3 colors and 5 bins per color. We are assuming every picture has the same number of pixels here.
Finally, use k-nearest neighbors to decide how much “blue” makes a landscape versus a headshot. We can tune the hyperparameters, which in this case are # of bins as well as k.
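Here's a hedged sketch of that feature construction in R, assuming the jpeg package for reading images and two hypothetical objects: files (paths to color jpegs, say tagged flickr photos) and labels ("landscape" or "headshot"). Each image becomes the 15-number vector described above, and class::knn does the classification.

library(jpeg)    # readJPEG() returns a height x width x 3 array with values in [0, 1]
library(class)   # knn()

# One image -> 15 numbers: 5 intensity bins for each of the 3 color channels.
# (As in the text, we assume every picture has the same number of pixels.)
rgb_histogram <- function(path, bins = 5) {
  img <- readJPEG(path) * 255
  breaks <- seq(0, 255, length.out = bins + 1)
  unlist(lapply(1:3, function(channel)
    hist(img[, , channel], breaks = breaks, plot = FALSE)$counts))
}

# `files` and `labels` are hypothetical.
features <- t(sapply(files, rgb_histogram))
train    <- sample(length(files), 0.9 * length(files))
pred     <- knn(features[train, ], features[-train, ], labels[train], k = 3)
mean(pred != labels[-train])    # misclassification rate on the held-out 10%

The two knobs mentioned above, the number of bins and k, are exactly the things you'd tune here.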
Why are the Chicago public school teachers on strike?
The issues of pay and testing
My friend and fellow HCSSiM 2012 staff member P.J. Karafiol explains some important issues in a Chicago Sun Times column entitled “Hard facts behind union, board dispute.”
P.J. is a Chicago public school math teacher, he has two kids in the CPS system, and he’s a graduate from that system. So I think he is qualified to speak on the issues.
He first explains that CPS teachers are paid less than those in the suburbs. This means, among other things, that it's hard to keep good teachers. Next, he explains that, although it is difficult to argue against merit pay, the value-added models, which Rahm Emanuel wants to account for half of a teacher's evaluation, are deeply flawed.
He then points out that, even if you trust the models, the number of teachers the model purports to identify as bad is so high that taking action on that result by firing them all would cause a huge problem – there’s a certain natural rate of finding and hiring good replacement teachers in the best of times, and these are not the best of times.
He concludes with this:
Teachers in Chicago are paid well initially, but face rising financial incentives to move to the suburbs as they gain experience and proficiency. No currently-existing “value added” evaluation system yields consistent, fair, educationally sound results. And firing bad teachers won’t magically create better ones to take their jobs.
To make progress on these issues, we have to figure out a way to make teaching in the city economically viable over the long-term; to evaluate teachers in a way that is consistent and reasonable, and that makes good sense educationally; and to help struggling teachers improve their practice. Because at base, we all want the same thing: classes full of students eager to be learning from their excellent, passionate teachers.
Test anxiety
Ultimately this crappy model, and the power that it yields, creates a culture of test anxiety for teachers and principals as well as for students. As Eric Zorn (grandson of mathematician Max Zorn) writes in the Chicago Tribune (h/t P.J. Karafiol):
The question: But why are so many presumptively good teachers also afraid? Why has the role of testing in teacher evaluations been a major sticking point in the public schools strike in Chicago?
The short answer: Because student test scores provide unreliable and erratic measurements of teacher quality. Because studies show that from subject to subject and from year to year, the same teacher can look alternately like a golden apple and a rotting fig.
Zorn quotes extensively from Math for America President John Ewing’s article in Notices of the American Mathematical Society:
Analyses of (value-added model) results have led researchers to doubt whether the methodology can accurately identify more and less effective teachers. (Value-added model) estimates have proven to be unstable across statistical models, years and classes that teachers teach.
One study found that across five large urban districts, among teachers who were ranked in the top 20 percent of effectiveness in the first year, fewer than a third were in that top group the next year, and another third moved all the way down to the bottom 40 percent.
Another found that teachers’ effectiveness ratings in one year could only predict from 4 percent to 16 percent of the variation in such ratings in the following year.
The politics behind the test
I agree that the value-added model (VAM) is deeply flawed; I’ve blogged about it multiple times, for example here.
The way I see it, VAM is a prime example of the way that mathematics is used as a weapon against normal people – in this case, teachers, principals, and schools. If you don’t see my logic, ask yourself this:
Why would an overly-complex, unproved, and very crappy model be so protected by politicians?
There’s really one reason, namely it serves a political function, not a mathematical one. And that political function is to maintain control over the union via a magical box that nobody completely understands (including the politicians, but it serves their purposes in spite of this) and therefore nobody can argue against.
This might seem ridiculous when you have examples like this one from the Washington Post (h/t Chris Wiggins), in which a devoted and beloved math teacher named Ashley received a ludicrously low VAM score.
I really like the article: it was written by Sean C. Feeney, Ashley’s principal at The Wheatley School in New York State and president of the Nassau County High School Principals’ Association. Feeney really tries to understand how the model works and how it uses data.
Feeney uncovers the crucial facts that, on the one hand nobody understands how VAM works at all, and that, on the other, the real reason it’s being used is for the political games being played behind the scenes (emphasis mine):
Officials at our State Education Department have certainly spent countless hours putting together guides explaining the scores. These documents describe what they call an objective teacher evaluation process that is based on student test scores, takes into account students’ prior performance, and arrives at a score that is able to measure teacher effectiveness. Along the way, the guides are careful to walk the reader through their explanations of Student Growth Percentiles (SGPs) and a teacher’s Mean Growth Percentile (MGP), impressing the reader with discussions and charts of confidence ranges and the need to be transparent about the data. It all seems so thoughtful and convincing! After all, how could such numbers fail to paint an accurate picture of a teacher’s effectiveness?
(One of the more audacious claims of this document is that the development of this evaluative model is the result of the collaborative efforts of the Regents Task Force on Teacher and Principal Effectiveness. Those of us who know people who served on this committee are well aware that the recommendations of the committee were either rejected or ignored by State Education officials.)
Feeney wasn’t supposed to do this. He wasn’t supposed to assume he was smart enough to understand the math behind the model. He wasn’t supposed to realize that these so-called “guides to explain the scores” actually represent the smoke being blown into the eyes of educators for the purposes of dismembering what’s left of the power of teachers’ unions in this country.
If he were better behaved, he would have bowed to the authority of the inscrutable, i.e. mathematics, and assumed that his prize math teacher must have had flaws that he, as her principal, just hadn't seen before.
Weapons of Math Destruction
Politicians have created a WMD (Weapon of Math Destruction) in VAM; it's the equivalent of owning an uzi factory when you're fighting a war against people with pointy sticks.
It’s not the only WMD out there, but it’s a pretty powerful one, and it’s doing outrageous damage to our educational system.
If you don’t know what I mean by WMD, let me help out: one way to spot a WMD is to look at the name versus the underlying model and take note of discrepancies. VAM is a great example of this:
- The name “Value-Added Model” makes us think we might learn how much a teacher brings to the class above and beyond, say, rote memorization.
- In fact, if you look carefully you will see that the model is measuring exactly that rote baseline, namely teaching to the test, but with error bars so enormous that the noise almost completely obliterates any "teaching to the test" signal.
Nobody wants crappy teachers in the system, but vilifying well-meaning and hard-working professionals and subjecting them to random but high-stakes testing is not the solution, it’s pure old-fashioned scapegoating.
The political goal of the national VAM movement is clear: take control of education and make sure teachers know their place as the servants of the system, with no job security and no respect.
Columbia data science course, week 2: RealDirect, linear regression, k-nearest neighbors
Data Science Blog
Today we started with discussing Rachel’s new blog, which is awesome and people should check it out for her words of data science wisdom. The topics she’s riffed on so far include: Why I proposed the course, EDA (exploratory data analysis), Analysis of the data science profiles from last week, and Defining data science as a research discipline.
She wants students and auditors to feel comfortable in contributing to blog discussion, that’s why they’re there. She particularly wants people to understand the importance of getting a feel for the data and the questions before ever worrying about how to present a shiny polished model to others. To illustrate this she threw up some heavy quotes:
“Long before worrying about how to convince others, you first have to understand what’s happening yourself” – Andrew Gelman
“Agreed” – Rachel Schutt
Thought experiment: how would you simulate chaos?
We split into groups and discussed this for a few minutes, then got back into a discussion. Here are some ideas from students:
- A Lorenzian water wheel would do the trick, if you know what that is.
- Question: is chaos the same as randomness?
- Many physical systems can exhibit inherent chaos; there are even examples involving finite-state machines.
- The teaching technique of "simulating chaos to teach order" gives us a real-world simulation of a disaster area.
- In this class we want to see how students would handle a chaotic situation. Most data problems start out with a certain amount of dirty data, ill-defined questions, and urgency. Can we teach a method of creating order from chaos?
- See also “Creating order from chaos in a startup“.
Talking to Doug Perlson, CEO of RealDirect
We got into teams of 4 or 5 to assemble our questions for Doug, the CEO of RealDirect. The students have been assigned as homework the task of suggesting a data strategy for this new company, due next week.
He came in, gave us his background in real-estate law and startups and online advertising, and told us about his desire to use all the data he now knew about to improve the way people sell and buy houses.
First they built an interface for sellers, giving them useful data-driven tips on how to sell their house and using interaction data to give real-time recommendations on what to do next. Doug made the remark that normally people sell their homes about once in 7 years, so they're not pros. The goal of RealDirect is not just to make individual sellers better at selling but also to make the pros better at their jobs.
He pointed out that brokers are "free agents": they operate by themselves, they guard their data, and the really good ones have lots of experience, which is to say they have more data. But very few brokers actually have enough experience to do it well.
The idea is to hire a team of licensed real-estate agents to act as data experts. They learn how to use information-collecting tools so RealDirect can gather data, in addition to publicly available information (for example, co-op sales data, which has only recently become available).
One problem with publicly available data is that it’s old news – there’s a 3 month lag. RealDirect is working on real-time feeds on stuff like:
- when people start searching,
- what’s the initial offer,
- the time between offer and close, and
- how people search online.
Ultimately good information helps both the buyer and the seller.
RealDirect makes money in two ways. First, a subscription, $395 a month, that gives sellers access to its tools. Second, it lets you use its agents at a reduced commission (2% of the sale instead of the usual 2.5% or 3%). The data-driven nature of the business allows it to take a smaller commission because it's more optimized, and therefore it gets more volume.
Doug mentioned that there’s a law in New York that you can’t show all the current housing listings unless it’s behind a registration wall, which is why RealDirect requires registration. This is an obstacle for buyers but he thinks serious buyers are willing to do it. He also doesn’t consider places that don’t require registration, like Zillow, to be true competitors because they’re just showing listings and not providing real service. He points out that you also need to register to use Pinterest.
Doug mentioned that RealDirect is made up of licensed brokers who belong to various established realtor associations, but even so they have had their share of hate mail from realtors who don't appreciate their approach to cutting commission costs. In this sense the traditional brokerage world is somewhat of a guild.
On the other hand, he thinks that if a realtor refused to show houses because they are being sold on RealDirect, the buyers would see the listings elsewhere and complain. So the traditional brokers have little choice but to deal with them. In other words, the listings themselves are sufficiently transparent that the traditional brokers can't get away with keeping their buyers away from these houses.
RealDirect doesn't take seasonality into consideration at present; it takes the position that a seller is trying to sell today. Doug talked about various issues a buyer would care about – nearby parks, subway, and schools, as well as the comparison of prices per square foot of apartments sold in the same building or block. These are the key kinds of data for buyers, to be sure.
In terms of how the site works, it sounds like somewhat of a social network for buyers and sellers. There are statuses for each person on the site: active, offer made, offer rejected, showing, in contract, etc. Based on your status, different opportunities are suggested.
Suggestions for Doug?
Linear Regression
Example 1. You have points on the plane:
(x, y) = (1, 2), (2, 4), (3, 6), (4, 8).
The relationship is clearly y = 2x. You can do it in your head. Specifically, you’ve figured out:
- There’s a linear pattern.
- The coefficient 2
- So far it seems deterministic
Example 2. You again have points on the plane, but now assume x is the input, and y is output.
(x, y) = (1, 2.1), (2, 3.7), (3, 5.8), (4, 7.9)
Now you notice that more or less y ~ 2x but it’s not a perfect fit. There’s some variation, it’s no longer deterministic.
Example 3.
(x, y) = (2, 1), (6, 7), (2.3, 6), (7.4, 8), (8, 2), (1.2, 2).
Here your brain can’t figure it out, and there’s no obvious linear relationship. But what if it’s your job to find a relationship anyway?
First assume (for now) there actually is a relationship and that it's linear. It's the best you can do to start out, i.e. assume

y = β_0 + β_1 x + ε

and now find the best choices for β_0 and β_1. Note we include the ε term because it's not a perfect relationship: this term is the "noise," the stuff that isn't accounted for by the relationship. It's also called the error.
Before we find the general formula, let's generalize to three predictor variables x_1, x_2, x_3, and we will again try to explain y knowing these values. If we wanted to draw it we'd be working in 4-dimensional space, trying to plot points. As above, assuming a linear relationship means looking for a solution to:

y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_3 + ε

Writing this with matrix notation (with a column of 1's in X for the intercept) we get:

y = X β + ε
How do we calculate β? Define the "residual sum of squares", denoted RSS(β), to be

RSS(β) = ∑_i (y_i - x_i β)^2

where i ranges over the various data points and x_i is the ith row of X. RSS is called a loss function. There are many other versions of it, but this is one of the most basic, partly because it gives us a pretty nice measure of closeness of fit.

To minimize RSS(β), we differentiate it with respect to β, set the result equal to zero, and solve for β. We end up with the estimate

β̂ = (X^T X)^{-1} X^T y

To use this, we go back to our linear form and plug in the values of β̂ to get a predicted ŷ.
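To convince yourself the closed-form formula is really what lm() computes, here's a minimal sketch in R on made-up data (the coefficients and noise level are arbitrary).

# Simulate a small data set with three explanatory variables and known betas.
set.seed(42)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + 0.25 * x3 + rnorm(n)

X        <- cbind(1, x1, x2, x3)                # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y    # beta-hat = (X^T X)^{-1} X^T y

cbind(closed_form = beta_hat, lm = coef(lm(y ~ x1 + x2 + x3)))   # the columns agree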
But wait, why did we assume a linear relationship? Sometimes maybe it’s a polynomial relationship.
You need to justify why you’re assuming what you want. Answering that kind of question is a key part of being a data scientist and why we need to learn these things carefully.
All this is like one line of R code, where you've got a column of y's and a column of x's:
model <- lm(y ~ x)
Or if you’re going with the polynomial form we’d have:
model <- lm(y ~ x + I(x^2) + I(x^3))

(Note the I() wrappers: inside an R formula, x^2 on its own doesn't mean "x squared", so you have to protect the arithmetic with I().)
Why do we do regression? Mostly for two reasons:
- If we want to predict one variable knowing the others.
- If we want to explain or understand the relationship between two things.
K-nearest neighbors
Say you have the age, income, and credit rating for a bunch of people and you want to use the age and income to guess at the credit rating. Moreover, say we’ve divided credit ratings into “high” and “low”.
We can plot people as points on the plane and label people with an “x” if they have low credit ratings.
What if a new guy comes in? What’s his likely credit rating label? Let’s use k-nearest neighbors. To do so, you need to answer two questions:
- How many neighbors are you gonna look at? k=3 for example.
- What is a neighbor? We need a concept of distance.
For the sake of our problem, we can use Euclidean distance on the plane, as long as the relative scalings of the variables are approximately correct. Then the algorithm is simple: take the "average" rating of the people around the new guy, where average means majority vote in this case. So if his 3 nearest neighbors are 2 high credit rating people and 1 low credit rating person, he would be designated high.
Note we could also do something somewhat more subtle, namely assign "high" the value 1 and "low" the value 0 and take the actual average, which in this case would be 0.667. This would indicate a kind of uncertainty. It depends on what you want from your algorithm. In machine learning algorithms we don't typically have the concept of confidence levels; we care more about the accuracy of the prediction. But of course it's up to us.
Generally speaking we have a training phase, during which we create a model and “train it,” and then we have a testing phase where we use new data to test how good the model is.
For k-nearest neighbors, the training phase is stupid: it’s just reading in your data. In testing, you pretend you don’t know the true label and see how good you are at guessing using the above algorithm. This means you save some clean data from the overall data for the testing phase. Usually you want to save randomly selected data, at least 10%.
In R: read in the package “class”, and use the function knn().
You perform the algorithm as follows:
knn(train, test, cl, k=3)
For each row of the test set, the k nearest (in Euclidean distance) training set vectors are found, and the classification label is decided by majority vote; the output is the predicted label for each row of the test set.
How do you evaluate if the model did a good job?
This isn’t easy or universal – you may decide you want to penalize certain kinds of misclassification more than others. For example, false positives may be way worse than false negatives.
To start out stupidly, you might want to simply minimize the misclassification rate:
(# incorrect labels) / (# total labels)
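Putting the pieces together, here's a minimal end-to-end sketch in R of the credit-rating example on simulated data (the data-generating rule is made up purely for illustration): hold out roughly 10% for testing, classify with knn(), and report the misclassification rate.

library(class)

# Simulated people: age, income (in thousands), and a high/low credit label.
set.seed(7)
n      <- 500
age    <- runif(n, 20, 70)
income <- runif(n, 20, 200)
label  <- factor(ifelse(income + 0.5 * age + rnorm(n, sd = 20) > 130, "high", "low"))

# Scale the features so Euclidean distance treats age and income comparably.
features <- scale(cbind(age, income))

test <- sample(n, 0.1 * n)        # hold out ~10% of the data for testing
pred <- knn(train = features[-test, ],
            test  = features[test, ],
            cl    = label[-test],
            k     = 3)

mean(pred != label[test])         # (# incorrect labels) / (# total labels)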
How do you choose k?
This is also hard. Part of homework next week will address this.
When do you use linear regression vs. k-nearest neighbor?
Thinking about what happens with outliers helps you realize how hard this question is. Sometimes it comes down to a question of what the decision-maker decides they want to believe.
Note definitions of “closeness” vary depending on the context: closeness in social networks could be defined as the number of overlapping friends.
Both linear regression and k-nearest neighbors are examples of “supervised learning”, where you’ve observed both x and y, and you want to know the function that brings x to y.
How is math used outside academia?
Help me out, beloved readers. Brainstorm with me.
I’m giving two talks this semester on how math is used outside academia, for math audiences. One is going to be at the AGNES conference and another will be a math colloquium at Stonybrook.
I want to give actual examples, with fully defined models, where I can explain the data, the purported goal, the underlying assumptions, the actual outputs, the political context, and the reach of each model.
The cool thing about these talks is I don’t need to dumb down the math at all, obviously, so I can be quite detailed in certain respects, but I don’t want to assume my audience knows the context at all, especially the politics of the situation.
So far I have examples from finance, internet advertising, and educational testing. Please tell me if you have some more great examples, I want this talk to be awesome.
The ultimate goal of this project is probably an up-to-date essay, modeled after this one, which you should read. Published in the Notices of the AMS in January 2003, author Mary Poovey explains how mathematical models are used and abused in finance and accounting, how Enron booked future profits as current earnings and how they manipulated the energy market. From the essay:
Thus far the role that mathematics has played in these financial instruments has been as much inspirational as practical: people tend to believe that numbers embody objectivity even when they do not see (or understand) the calculations by which particular numbers are generated. In my final example, mathematical principles are still invisible to the vast majority of investors, but mathematical equations become the prime movers of value. The belief that makes it possible for mathematics to generate value is not simply that numbers are objective but that the market actually obeys mathematical rules. The instruments that embody this belief are futures options or, in their most arcane form, derivatives.
Slightly further on she explains:
In 1973 two economists produced a set of equations, the Black-Scholes equations, that provided the first strictly quantitative instrument for calculating the prices of options in which the determining variable is the volatility of the underlying asset. These equations enabled analysts to standardize the pricing of derivatives in exclusively quantitative terms. From this point it was no longer necessary for traders to evaluate individual stocks by predicting the probable rates of profit, estimating public demand for a particular commodity, or subjectively getting a feel for the market. Instead, a futures trader could engage in trades driven purely by mathematical equations and selected by a software program.
She ends with a bunch of great questions. Mind you, this was in 2003, before the credit crisis:
But what if markets are too complex for mathematical models? What if irrational and completely unprecedented events do occur, and when they do—as we know they do—what if they affect markets in ways that no mathematical model can predict? What if the regularity that all mathematical models assume effaces social and cultural variables that are not subject to mathematical analysis? Or what if the mathematical models traders use to price futures actually influence the future in ways the models cannot predict and the analysts cannot govern? Perhaps these are the only questions that can challenge the financial axis of power, which otherwise threatens to remake everything, including value, over in the image of its own abstractions. Perhaps these are the kinds of questions that mathematicians and humanists, working together, should ask and try to answer.
Columbia data science course, week 1: what is data science?
I’m attending Rachel Schutt’s Columbia University Data Science course on Wednesdays this semester and I’m planning to blog the class. Here’s what happened yesterday at the first meeting.
Syllabus
Rachel started by going through the syllabus. Here were her main points:
- The prerequisites for this class are: linear algebra, basic statistics, and some programming.
- The goals of this class are: to learn what data scientists do, and to learn to do some of those things.
- Rachel will teach for a couple weeks, then we will have guest lectures.
- The profiles of those speakers vary considerably, as do their backgrounds. Yet they are all data scientists.
- We will be resourceful with readings: part of being a data scientist is realizing lots of stuff isn’t written down yet.
- There will be 6-10 homework assignments, due every two weeks or so.
- The final project will be an internal Kaggle competition. This will be a team project.
- There will also be an in-class final.
- We’ll use R and python, mostly R. The support will be mainly for R. Download RStudio.
- If you’re only interested in learning hadoop and working with huge data, take Bill Howe’s Coursera course. We will get to big data, but not til the last part of the course.
The current landscape of data science
So, what is data science? Is data science new? Is it real? What is it?
This is an ongoing discussion, but Michael Driscoll’s answer is pretty good:
Data science, as it’s practiced, is a blend of Red-Bull-fueled hacking and espresso-inspired statistics.
But data science is not merely hacking, because when hackers finish debugging their Bash one-liners and Pig scripts, few care about non-Euclidean distance metrics.
And data science is not merely statistics, because when statisticians finish theorizing the perfect model, few could read a ^A delimited file into R if their job depended on it.
Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools & materials, coupled with a theoretical understanding of what’s possible.
Driscoll also refers to Drew Conway’s Venn diagram of data science from 2010:

We also may want to look at Nathan Yau’s “sexy skills of data geeks” from his “Rise of the Data Scientist” in 2009:
- Statistics – traditional analysis you’re used to thinking about
- Data Munging – parsing, scraping, and formatting data
- Visualization – graphs, tools, etc.
But wait, is data science a bag of tricks? Or is it just the logical extension of other fields like statistics and machine learning?
For one argument, see Cosma Shalizi’s posts here and here and my posts here and here, which constitute an ongoing discussion of the difference between a statistician and a data scientist.
Also see ASA President Nancy Geller’s 2011 Amstat News article, “Don’t shun the ‘S’ word,” where she defends statistics.
One thing’s for sure, in data science, nobody hands you a clean data set, and nobody tells you what method to use. Moreover, the development of the field is happening in industry, not academia.
In 2011, DJ Patil described how he and Jeff Hammerbacher, in 2008, coined the term data scientist. However, in 2001, William Cleveland wrote a paper about data science (see Nathan Yau’s post on it here).
So data science existed before data scientists? Is this semantics, or does it make sense?
It begs the question, can you define data science by what data scientists do? Who gets to define the field, anyway? There’s lots of buzz and hype – does the media get to define it, or should we rely on the practitioners, the self-appointed data scientists? Or is there some actual authority? Let’s leave these as open questions for now.
Columbia just decided to start an Institute for Data Sciences and Engineering with Bloomberg's help. The only question is why there's a picture of a chemist on the announcement. There were 465 job openings in New York for data scientists last time we checked. That's a lot. So even if data science isn't a real field, it has real jobs.
Note that most of the job descriptions ask data scientists to be experts in computer science, statistics, communication, data visualization, and to have deep domain expertise. Nobody is an expert in everything, which is why it makes more sense to create teams of people who have different profiles and different expertise and who together, as a team, can cover all those things.
Here are other players in the ecosystem:
- O’Reilly and their Strata Conference
- DataKind
- Meetup groups
- VC firms like Union Square Ventures are pouring big money into data science startups
- Kaggle hosts data science competitions
- Chris Wiggins, professor of applied math at Columbia, has been instrumental in connecting techy undergrads with New York start-ups through his summer internship program HackNY.
Note: Wikipedia didn't have an entry on data science until 2012. This is a new term if not a new subject.
How do you start a Data Science project?
Say you’re working with some website with an online product. You want to track and analyse user behavior. Here’s a way of thinking about it:
- The user interacts with product.
- The product has a front end and a back end.
- The user starts taking actions: clicks, etc.
- Those actions get logged.
- The logs include timestamps; they capture all the key user activity around the product.
- The logs then get processed in pipelines: that’s where data munging, joining, and mapreducing occur.
- These pipelines generate nice, clean, massive data sets.
- These data sets are typically keyed by user, or song (like if you work at a place like Pandora), or however you want to see your data.
- These data sets then get analyzed, modeled, etc.
- They ultimately give us new ways of understanding user behavior.
- This new understanding gets embedded back into the product itself.
- We’ve created a circular process of changing the user interaction with the product by starting with examining the user interaction with the product. This differentiates the job of the data scientist from the traditional data analyst role, which might analyze users for likelihood of purchase but probably wouldn’t change the product itself but rather retarget advertising or something to more likely buyers.
- The data scientist also reports to the CEO or head of product what she’s seeing with respect to the user, what’s happening with the user experience, what are the patterns she’s seeing. This is where communication and reporting skills, as well as data viz skills and old-time story telling skills come in. The data scientist builds the narrative around the product.
- Sometimes you have to scrape the web, to get auxiliary info, because either the relevant data isn’t being logged or it isn’t actually being generated by the users.
Profile yourself
Rachel then handed out index cards and asked everyone to profile themselves (on a relative rather than absolute scale) with respect to their skill levels in the following domains:
- software engineering,
- math,
- stats,
- machine learning,
- domain expertise,
- communication and presentation skills, and
- data viz
We taped the index cards up and got to see how everyone else thought of themselves. There was quite a bit of variation, which is cool – lots of people in the class are coming from social science.
And again, a data science team works best when different skills (profiles) are represented in different people, since nobody is good at everything. It makes me think that it might be easier to define a “data science team” than to define a data scientist.
Thought experiment: can we use data science to define data science?
We broke into small groups to think about this question. Then we had a discussion. Some ideas:
- Yes: google search data science and perform a text mining model
- But wait, that would depend on you being a usagist rather than a prescriptionist with respect to language. Do we let the masses define data science (where “the masses” refers to whatever google’s search engine finds)? Or do we refer to an authority such as the Oxford English Dictionary?
- Actually the OED probably doesn’t have an entry yet and we don’t have time to wait for it. Let’s agree that there’s a spectrum, and one authority doesn’t feel right and “the masses” doesn’t either.
- How about we look at practitioners of data science and see how they describe what they do (maybe in a word cloud for starters), and then see how people who claim to be other things, like statisticians or physicists or economists, describe what they do. Then we could try a clustering algorithm or some other model and see whether, when it takes "the stuff I do" as input, it gives a good prediction of what field I'm in.
Just for comparison, check out what Harlan Harris recently did inside the field of data science: he took a survey and used clustering to define subfields of data science, which gave rise to this picture:

It was a really exciting first week, I’m looking forward to more!
Update on organic food
So I’m back from some town in North Ontario (please watch this video to get an idea). I spent four days on a tiny little island on Lake Huron with my family and some wonderful friends, swimming, boating, picnicking, and reading the Omnivore’s Dilemma by Michael Pollan whenever I could.
It was a really beautiful place but really far away, especially since my husband jumped gleefully into the water from a high rock with his glasses on so I had to drive all the way back without help. But what I wanted to mention to you is that, happily, I managed to finish the whole book – a victory considering the distractions.
I was told to read the book by a bunch of people who read my previous post on organic food and why I don’t totally get it: see the post here and be sure to read the comments.
One thing I have to give Pollan, he has written a book that lots of people read. I took notes on his approach and style because I want to write a book myself. And it’s not that I read statistics on the book sales – I know people read the book because, even though I hadn’t, lots of facts and passages were eerily familiar to me, which means people I know have quoted the book to me. That’s serious!
In other words, there’s been feedback from this book to the culture and how we think about organic food vs. industrial farming. I can’t very well argue that I already knew most of the stuff in the book, even though I did, because I probably only know it because he wrote the book on it and it’s become part of our cultural understanding.
In terms of the content, first I'll complain, then I'll compliment.
Complaint #1: the guy is a major food snob (one might even say douche). He spends like four months putting together a single “hunting and gathering” meal with the help of his friends the Chez Panisse chefs. It’s kind of like a “lives of the rich and famous” episode in that section of the book, which is to say voyeuristic, painfully smug, and self-absorbed. It’s hard to find this guy wise when he’s being so precious.
Complaint #2: a related issue, which is that he never does the math on whether a given lifestyle is actually accessible for the average person. He mentions that the locally grown food is more expensive, but he also suggests that poor people now spend less of their income on food than they used to, implying that maybe they have extra cash on hand to buy local free-range chickens, not to mention that they’d need the time and a car and gas to drive to the local farms to buy this stuff (which somehow doesn’t seem to figure into his carbon footprint calculation of that lifestyle). I don’t think there’s all that much extra time and money on people’s hands these days, considering how many people are now living on food stamps (I will grant that he wrote this book before the credit crisis so he didn’t anticipate that).
Complaint #3: he doesn’t actually give a suggestion for what to do about this to the average person. In the end this book creates a way for well-to-do people to feel smug about their food choices but doesn’t forge a path otherwise, besides a vague idea that not eating processed food would be good. I know I’m asking a lot, but specific and achievable suggestions would have been nice. Here’s where my readers can say I missed something – please comment!
Compliment #1: he really educates the reader on how much the government farm subsidies distort the market, especially for corn, and how the real winners are the huge businesses like ConAgra and Monsanto, not the farmers themselves.
Compliment #2: he also explains the nastiness of processed food and large-scale cow, pig, and chicken farms. Yuck.
Compliment #3: My favorite part is that he describes the underlying model of the food industry as overly simplistic. He points out that, by focusing only on chemicals like nitrogen and carbon in the soil, we have ignored all sorts of other things that are also important to a thriving ecosystem. So, he explains, simply adding nitrogen to the soil in the form of fertilizer doesn't actually solve the problem of growing things quickly. Well, it does do that, but it introduces other problems, like pollution.
This is a general problem with models: they almost by definition simplify the world, but if they are successful, they get hugely scaled, and then the things they ignore, and the problems that arise from that ignorance, are amplified. There’s a feedback loop filled with increasingly devastating externalities. In the case of farming, the externalities take the form of pollution, unsustainable use of petrochemicals, sick cows and chickens, and nasty food-like items made from corn by-products.
Another example is teacher value-added models: the model is bad, it is becoming massively scaled, and the externalities are potentially disastrous (teaching to the test, the best teachers leaving the system, enormous amount of time and money spent on the test industry, etc.).
But that begs the question, what should we do about it? Should we well-to-do people object to the existence of the model and send our kids to the private schools where the teachers aren’t subject to that model? Or should we acknowledge it exists, it isn’t going away, and it needs to be improved?
It’s a similar question for the food system and the farming model: do we save ourselves and our family, because we can, or do we confront the industry and force them to improve their models?
I say we do both! Let’s not ignore our obligation to agitate for better farming practices for the enormous industry that already exists and isn’t going away. I don’t think the appropriate way to behave is to hole up with your immediate family and make sure your kids are eating wholesome food. That’s too small and insular! It’s important to think of ways to fight back against the system itself if we believe it’s corrupt and is ruining our environment.
For me that means being part of Occupy, joining movements and organization fighting against lobbyist power (here’s one that fights against BigFood lobbyists), and broadly educating people about statistics and mathematical modeling so that modeling flaws and externalities are understood, discussed, and minimized.
Looterism
My friend Nik recently sent me a PandoDaily article written by Francisco Dao entitled Looterism: The Cancerous Ethos That Is Gutting America.
He defines looterism as the “deification of pure greed” and says:
The danger of looterism, of focusing only on maximizing self interest above the importance of creating value, is that it incentivizes the extraction of wealth without regard to the creation or replenishment of the value building mechanism.
I like the term, I think I’ll use it. And it made me think of this recent Bloomberg article about private equity and hedge funds getting into the public schools space. From the article:
Indeed, investors of all stripes are beginning to sense big profit potential in public education.
The K-12 market is tantalizingly huge: The U.S. spends more than $500 billion a year to educate kids from ages five through 18. The entire education sector, including college and mid-career training, represents nearly 9 percent of U.S. gross domestic product, more than the energy or technology sectors.
Traditionally, public education has been a tough market for private firms to break into — fraught with politics, tangled in bureaucracy and fragmented into tens of thousands of individual schools and school districts from coast to coast.
Now investors are signaling optimism that a golden moment has arrived. They’re pouring private equity and venture capital into scores of companies that aim to profit by taking over broad swaths of public education.
The conference last week at the University Club, billed as a how-to on “private equity investing in for-profit education companies,” drew a full house of about 100.
[I think I know why that golden moment arrived, by the way. The obsession with test scores, a direct result of No Child Left Behind, is both pseudo-quantitative (by which I mean it is quantitative but only measures certain things and entirely misses other critical things) and has broken the backs of unions. Hedge funds and PE firms love quantitative things, and they don't really care whether the numbers are meaningful as long as they can meaningfully profit.]
Their immediate goal is out-sourcing: they want to create the Blackwater (now Academi) of education, but with cute names like Schoology and DreamBox.
Lest you worry that their focus will be on the wrong things, they point out that if you make kids drill math through DreamBox “heavily” for 16 weeks, they score 2.3 points higher in a standardized test, although they didn’t say if that was out of 800 or 20. Never mind that “heavily” also isn’t defined, but it seems safe to say from context that it’s at least 2 hours a day. So if you do that for 16 weeks, those 2.3 points better be pretty meaningful.
So either the private equity guys and hedge funders have the whole child in mind here, or it’s maybe looterism. I’m thinking looterism.
Columbia Data Science Institute: it’s gonna happen
So Bloomberg finally got around to announcing the Columbia Data Science Institute is really going to happen. The details as we know them now:
- It’ll be at the main campus, not Manhattanville.
- It’ll hire 75 faculty over the next decade (specifically, 30 new faculty by launch in August 2016 and 75 by 2030, so actually more than a decade but who’s counting?).
- It will contain a New Media Center, a Smart Cities Center, a Health Analytics Center, a Cybersecurity Center, and a Financial Analytics Center.
- The city is pitching in $15 million whereas Columbia is ponying up $80 million.
- Columbia Computer Science professor Kathy McKeown will be the Director and Civil Engineering professor Patricia Culligan will be the Institute’s Deputy Director.
Does mathematics have a place in higher education?
A recent New York Times Opinion piece (hat tip Wei Ho), Is Algebra Necessary?, argues for the abolishment of algebra as a requirement for college. It was written by Andrew Hacker, an emeritus professor of political science at Queens College, City University of New York. His concluding argument:
I’ve observed a host of high school and college classes, from Michigan to Mississippi, and have been impressed by conscientious teaching and dutiful students. I’ll grant that with an outpouring of resources, we could reclaim many dropouts and help them get through quadratic equations. But that would misuse teaching talent and student effort. It would be far better to reduce, not expand, the mathematics we ask young people to imbibe. (That said, I do not advocate vocational tracks for students considered, almost always unfairly, as less studious.)
Yes, young people should learn to read and write and do long division, whether they want to or not. But there is no reason to force them to grasp vectorial angles and discontinuous functions. Think of math as a huge boulder we make everyone pull, without assessing what all this pain achieves. So why require it, without alternatives or exceptions? Thus far I haven’t found a compelling answer.
For an interesting contrast, there’s a recent Bloomberg View Piece, How Recession Will Change University Financing, by Gary Shilling (not to be confused with Robert Shiller). From Shilling’s piece:
Most thought that a bachelor’s degree was the ticket to a well-paid job, and that the heavy student loans were worth it and manageable. And many thought that majors such as social science, education, criminal justice or humanities would still get them jobs. They didn’t realize that the jobs that could be obtained with such credentials were the nice-to-have but nonessential positions of the boom years that would disappear when times got tough and businesses slashed costs.
Some of those recent graduates probably didn’t want to do, or were intellectually incapable of doing, the hard work required to major in science and engineering. After all, afternoon labs cut into athletic pursuits and social time. Yet that’s where the jobs are now. Many U.S.-based companies are moving their research-and-development operations offshore because of the lack of scientists and engineers in this country, either native or foreign-born.
For 34- to 49-year-olds, student debt has leaped 40 percent in the past three years, more than for any other age group. Many of those debtors were unemployed and succumbed to for-profit school ads that promised high-paying jobs for graduates. But those jobs seldom materialized, while the student debt remained.
Moreover, many college graduates are ill-prepared for almost any job. A study by the Pew Charitable Trusts examined the abilities of U.S. college graduates in three areas: analyzing news stories, understanding documents and possessing the math proficiency to handle tasks such as balancing a checkbook or tipping in a restaurant.
The first article is written by a professor, so it might not be surprising that, as he sees more and more students coming through, he feels their pain and wants their experience to not be excruciating. The easiest way to do that is to remove the stumbling block requirement of math. He also seems to think of higher education as something everyone is entitled to, which I infer based on how he dismisses vocational training.
The second article is written by a financial analyst, an economist, so we might not be surprised that he strictly sees college as a purely commoditized investment in future income, and wants it to be a good one. The easiest way to do that is to have way fewer students go through college to begin with, since having dumb or bad students get into debt but not learn anything and then not get a job afterwards doesn’t actually make sense.
And where the first author acts like math is only needed for a tiny minority of college students, the second author basically dismisses non-math oriented subjects as frivolous and leading to a life of joblessness and debt. These are vastly different viewpoints. I’m thinking of inviting them both to dinner to discuss.
By the way, I think that last line, where Hacker wonders what the pain of math-as-huge-boulder achieves, is more or less answered by Shilling. The goal of having math requirements is to have students be mathematically literate, which is to say know how to do everyday things like balancing checkbooks and reading credit card interest rate agreements. The fact that we aren’t achieving this goal is important, but the goal is pretty clear. In other words, I think my dinner party would be fruitful as well as entertaining.
If there’s one thing these two agree on, it’s that students are having an awful lot of trouble doing basic math. This makes me wonder a few things.
First, why is algebra such a stumbling block? Is it that the students are really that bad, or is the approach to teaching it bad? I suspect what’s really going on is that the students taking it have mostly not been adequately taught the pre-requisites. That means we need more remedial college math.
I honestly feel like this is the perfect place for online learning. Instead of charging students enormous fees while they get taught high-school courses they should already know, and instead of removing basic literacy requirements altogether, ask them to complete some free online math courses at home or in their public library, to get them ready for college. The great thing about computers is that they can figure out the level of the user, and they never get impatient.
Next, should algebra be replaced by a Reckoning 101 course? Where, instead of manipulating formulas, we teach students to figure out tips and analyze news stories and understand basic statistical statements? I’m sure this has been tried, and I’m sure it’s easy to do badly or to water down entirely. Please tell me what you know. Specifically, are students better at snarky polling questions if they’ve taken these classes than if they’ve taken algebra?
Finally, I’d say this (and I’m stealing this from my friend Kiri, a principal of a high school for girls in math and science): nobody ever brags about not knowing how to read, but people brag all the time about not knowing how to do math. There’s nothing to be proud of in that, and it’s happening to a large degree because of our culture, not intelligence.
So no, let’s not remove mathematical literacy as a requirement for college graduates, but let’s think about what we can do to make the path reasonable and relevant while staying rigorous. And yes, there are probably too many students going to college because it’s now a cultural assumption rather than a thought-out decision, and this lands young people in debt up to their eyeballs and jobless, which sucks (here’s something that may help: forcing for-profit institutions to be honest in advertising about future job promises and high-interest debt).
Something just occurred to me. Namely, it’s especially ironic that the most mathematically illiterate and vulnerable students are being asked to sign loan contracts that they, almost by construction, don’t understand. How do we address this? Food for thought and for another post.
Exploit me some more please
I’m back home from HCSSiM (see my lecture notes here). Yesterday I took the bus from Amherst to New York and slept all the way, then got home and took a nap, and then after putting my 3-year-old to bed crashed on the couch until just now when I woke up. That’s about 13 hours of sleep in the past 20, and I’m planning to take a nap after writing this. That’s just a wee indication of how sleep deprived I became at math camp.
Add to that the fact that my bed there was hard plastic, that I completely lost touch with the memory of enjoying good food (taste buds? what are those?), and that I was pitifully underpaid, and you might think I’m glad to be home.
And I am, because I missed my family, but I’m already working feverishly to convince them to let me go again next year, and come with me next time if that would be better. Because I’m so in love with those kids and with those junior staff and with Kelly and with Hampshire College and with the whole entire program.
Just so you get an idea of what there is to love, check out one of the students talking about his plan for his yellow pig day shirt which I luckily captured on my phone:
And here’s a job description which always makes me laugh (and cry) (and I only worked there the first half):
When people haven’t experienced HCSSiM, we worry about being able to explain adequately the unusual commitment required by the exploitative position. The workday is long and challenging; it is also exciting and rewarding. A senior faculty and 2 junior faculty members actively participate in the morning classes (8:30 – 2:30, M-S) and evening problem sessions (7:30 – 10:30, M-F) of each of the c. 14-student Workshops (7/2 – 7/20 = days); they prepare (and duplicate) daily problem sets; they proofread notes and program journal articles, and they write evaluations; they offer constructive criticism; they attend the afternoon Prime Time Theorems (a 51-minute math-club type talk, over half given by visitors) and give 1 or 2. The staffing and most of those teaching opportunities (chores) apply to the 2nd half of the program when students take a Maxi-course (8:30-11, M-S, and 7:30 – 10:30, M-F, 7/23 – 8/10). During the 2nd half of the program, students also take, consecutively, 2 Mini-courses, which meet from 11:17 until 12:30 for 7 days and which have no attached problem sessions; many minis are created or co-created by junior staff. Except for Kelly and Susan (who are on call) the staff live in in the dorm (Enfield this year), join students for meals and recreational activities, provide transportation and counseling and supervision for students, and help to get the program to sleep around 11:17. There is virtually no hope of getting any research done or of maintaining an outside social life. In spite of (with some because of) the preceding, the job is exhilarating as well as exhausting; we have repeaters, and there are a lot of good math teachers out there who credit HCSSiM with teaching them to teach.
HCSSiM Workshop day 17
This is a continuation of this, where I take notes on my workshop at HCSSiM.
Magic Squares
First, Elizabeth Campolongo talked about magic squares. She started by exhibiting a bunch, including these classic ones, which I found here:

Then we noted that flipping or rotating a magic square gives you another one, and also that the trivial magic square with all 1’s should also count as a crappy kind of magic square.
Then, for 3-by-3 magic squares, Elizabeth showed that knowing 3 entries (say the lowest row) gives you everything. This is in part because if you add up all four “lines” going through the center, you get the center entry four times and everything else once, or in other words everything once plus the center three extra times. But you can also add up the three horizontal lines to get everything exactly once. The first sum is 4C, if C is the sum of any line, and the second sum is 3C, so subtracting we see that C equals three times the center, i.e. the center is C/3.
This means if you have the bottom row, you can also infer C and thus the center. Then once you have the bottom and the center you can infer the top, and then you can infer the two sides.
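Here’s a minimal sketch of that reconstruction in Python (the function name and the sample bottom row are my own, not from Elizabeth’s talk): set C to the sum of the given bottom row, put C/3 in the center, and fill in the remaining entries one line at a time.

```python
from fractions import Fraction

def complete_magic_square(bottom):
    """Rebuild a 3x3 magic square from its bottom row (g, h, i)."""
    g, h, i = (Fraction(entry) for entry in bottom)
    C = g + h + i          # every line must sum to C
    e = C / 3              # center entry, by the 4C - 3C argument above
    b = C - h - e          # top middle: completes the middle column
    a = C - e - i          # top left: completes the main diagonal
    c = C - a - b          # top right: completes the top row
    d = C - a - g          # middle left: completes the left column
    f = C - c - i          # middle right: completes the right column
    return [[a, b, c], [d, e, f], [g, h, i]]

# Feeding in the bottom row of the classic Lo Shu square recovers the whole thing.
assert complete_magic_square([4, 9, 2]) == [[8, 1, 6], [3, 5, 7], [4, 9, 2]]
```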
After convincing them of this, Elizabeth explained that the set of 3-by-3 magic squares is actually a vector space over the rational numbers, and since we’d exhibited 3 linearly independent ones and since we’d proved that any magic square has at most 3 degrees of freedom, we actually had a basis.
Finally, she showed them one of the “all prime magic squares”:
The one with the “17” in the corner of course. She exhibited this as a sum of the basis we had written down. It was very cool.
The game of Set
My man Max Levit then whipped out some Set cards and showed people how to play (most of them already knew). After assigning numbers mod 3 to each of the 4 dimensions, he noted that taking any two cards, you’d be able to define the third card uniquely so that all three form a set. Moreover, that third card is just the negative of the sum of the first two, or in other words the sum of all three cards in a set is the zero vector $(0, 0, 0, 0)$ mod 3.
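Here’s that rule in a few lines of Python (the 4-tuples-mod-3 encoding is the one Max described; the function names are mine):

```python
from itertools import product

# A card is a 4-tuple with entries in {0, 1, 2}: one coordinate each for
# number, shading, color, and shape. The full deck has 3^4 = 81 cards.
DECK = list(product(range(3), repeat=4))
assert len(DECK) == 81

def is_set(a, b, c):
    """Three cards form a set iff their coordinatewise sum is 0 mod 3."""
    return all((x + y + z) % 3 == 0 for x, y, z in zip(a, b, c))

def third_card(a, b):
    """The unique card completing a set: the negative of the sum of a and b."""
    return tuple((-(x + y)) % 3 for x, y in zip(a, b))

a, b = (0, 1, 2, 0), (1, 1, 0, 2)
assert is_set(a, b, third_card(a, b))
```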
Next Max started talking about how many cards you can lay down without any three of them forming a set. He started in the case where you have only two dimensions (but you’re still working mod 3). A short pigeonhole argument shows there can be at most 4, and he exhibited 4 that work.
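To back up the two-dimensional claim, here’s a short brute-force check (my own, not from the lecture): among the 9 two-attribute cards, every 5-card collection contains a set, while some 4-card collections contain none.

```python
from itertools import product, combinations

cards_2d = list(product(range(3), repeat=2))   # the 9 cards with two attributes

def has_set(cards):
    return any(all((x + y + z) % 3 == 0 for x, y, z in zip(a, b, c))
               for a, b, c in combinations(cards, 3))

# Every 5-card collection contains a set...
assert all(has_set(hand) for hand in combinations(cards_2d, 5))
# ...but the four "corners" of a 2x2 block, for example, contain none.
assert not has_set([(0, 0), (0, 1), (1, 0), (1, 1)])
```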
He moved on to 3 and 4 dimensions and showed you could lay down 9 in 3 and 20 in 4 dimensions without forming a set (picture from here), which one of our students eagerly demonstrated with actual cards:

Finally, Max talked about creating magic squares with sets, tying together his awesome lecture with Elizabeth’s. A magic square of sets is also generated by 3 non-collinear cards (three cards that don’t themselves form a set): place them in any three cells of the square that don’t lie on a common line, and the rest of the entries are forced:
Probability functions on lattices
Josh Vekhter then talked about probability distributions as functions from posets of “event spaces” to the real line. So if you roll a 6-sided die, for example, you can think of the event “it’s a 3, 4, or 6” as being above the event “it’s a 3”. He talked about a lattice as a poset where there’s always an “or” and an “and”, so there’s always a common ancestor and child for any two elements.
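Here’s a tiny sketch of that setup in Python (my own encoding, not Josh’s): events for a single roll are subsets of {1, …, 6} ordered by inclusion, “or” is union (the join), “and” is intersection (the meet), and the usual probability function respects the ordering.

```python
from fractions import Fraction
from itertools import combinations

OUTCOMES = frozenset(range(1, 7))   # one roll of a 6-sided die

def events():
    """All events: the power set of the outcomes, partially ordered by inclusion."""
    return [frozenset(c) for r in range(len(OUTCOMES) + 1)
            for c in combinations(OUTCOMES, r)]

def join(a, b):   # "or"
    return a | b

def meet(a, b):   # "and"
    return a & b

def prob(event):
    """The uniform probability function on this lattice of events."""
    return Fraction(len(event), len(OUTCOMES))

low, high = frozenset({3}), frozenset({3, 4, 6})
assert low <= high                             # "it's a 3" sits below "it's a 3, 4, or 6"
assert all(prob(a) <= prob(b)                  # and probability respects the ordering
           for a in events() for b in events() if a <= b)
```

Of course, this classical picture is exactly the one where everything distributes nicely; the interesting part of the story is where that breaks down.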
Then he talked about the question of whether that probability function distributes the way it “should” with respect to “and” and “or”, and explained how it doesn’t in the case of the two-slit experiment.
He related this failure of the distributive law to the convexity of the space of probability distributions (keeping in mind that we actually have a whole space of possible probability functions on a given lattice, can we find “pure” probability distributions, which only take the values 0 and 1, and which form a kind of basis for the entire set?).
This is not my expertise and hopefully Josh will fill in some details in the coming days.
King Chicken
I took over here at the end and discussed some beautiful problems related to flocks of chickens and pecking orders, which can be found in this paper. It was particularly poignant for me to talk about this because my first exposure to these problems was my entrance exam to get into this math program in 1987, when I was 14.
Notetaking algorithm
Finally, I proved that the notetaking algorithm we started the workshop with 3 weeks before actually always works. I did this by first remarking that, as long as it really was a function from the people to the numbers, i.e. it never gets stuck in an infinite loop, then it’s a bijection for sure because it has an inverse.
To show you don’t get stuck in an infinite loop, consider it as a directed graph instead of an undirected graph, in which case you put down arrows on all the columns (and all the segments of columns) from the names to the numbers, since you always go down when you’re on a column.
For the pipes between columns, you can actually squint really hard and realize there are two directed arrows there, one from the top left to the lower right and the other from the top right to the lower left. You’ve now replaced the “T” connections with directed arrows, and if you do this to every pipe you’ve removed all of the “T” connections altogether. It now looks like a bowl of yummy spaghetti.
But that means there is no chance of falling into an infinite loop, since that would require a vertex where strands come together, and we’ve just gotten rid of all of those. Note that after doing this you do get lots of spaghetti circles falling out of your diagram.
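I can’t reproduce the diagram here, but here’s a minimal sketch in Python under the assumption that the notetaking picture is the usual “ladder”: names sit at the tops of vertical columns, numbers at the bottoms, horizontal pipes join adjacent columns, and you trace each name downward, crossing every pipe you hit. Tracing all the columns always lands the names on distinct numbers, i.e., gives a bijection.

```python
import random

def descend(start, pipes):
    """Trace one column down the ladder. Each pipe is a (height, left_column)
    pair joining left_column and left_column + 1; pipes are visited top to bottom."""
    col = start
    for _height, left in sorted(pipes):
        if col == left:
            col = left + 1    # cross the pipe to the right
        elif col == left + 1:
            col = left        # cross the pipe to the left
    return col

def ladder_map(n_columns, pipes):
    return [descend(c, pipes) for c in range(n_columns)]

# Random ladders always produce a permutation of the columns (a bijection).
for _ in range(100):
    pipes = [(random.random(), random.randrange(4)) for _ in range(10)]
    assert sorted(ladder_map(5, pipes)) == list(range(5))
```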
HCSSiM Workshop day 16
This is a continuation of this, where I take notes on my workshop at HCSSiM.
Two days ago Benji Fisher came to my workshop to talk about group laws on rational points of weird things in the plane. Here are his notes.
Degenerate Elliptic Curves in the plane
Conics in the plane
Pick $t$. Consider the line $L_t$ given by $y = t(x+1)$. Where does $L_t$ intersect the $y$-axis? Where does it intersect the unit circle, $x^2 + y^2 = 1$?

Substitute into the equation for the circle to get $x^2 + t^2(x+1)^2 = 1$, i.e. $(1+t^2)x^2 + 2t^2 x + (t^2 - 1) = 0$.

After you do it the hard way, notice that you already know one root: $x = -1$. The sum of the roots is $-\frac{2t^2}{1+t^2}$ and their product is $\frac{t^2-1}{1+t^2}$. Either way, you get $x = \frac{1-t^2}{1+t^2}$. From $y = t(x+1)$ you get $y = \frac{2t}{1+t^2}$.

Do not forget that if you are given $x$ and $y$, then $t = \frac{y}{x+1}$.

This gives you a 1-1 correspondence between the points of the circle (conic) and the points of the $y$-axis (including $\infty$). The formula for Pythagorean triples also falls out. So do the formulae for the tangent-of-the-half-angle substitution, which is useful when integrating rational functions of $\sin\theta$ and $\cos\theta$: set $\sin\theta = \frac{2t}{1+t^2}$ and $\cos\theta = \frac{1-t^2}{1+t^2}$.
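As a sanity check on that correspondence (a sketch of my own, not part of Benji’s notes), here it is in Python; clearing denominators at a rational slope $t = p/q$ spits out Pythagorean triples.

```python
from fractions import Fraction
from math import gcd

def circle_point(t):
    """Where the line y = t(x + 1) meets the unit circle, other than (-1, 0)."""
    t = Fraction(t)
    return (1 - t**2) / (1 + t**2), (2 * t) / (1 + t**2)

def slope_back(x, y):
    """Recover t = y / (x + 1) from a point on the circle other than (-1, 0)."""
    return y / (x + 1)

def pythagorean_triple(p, q):
    """Clear denominators at t = p/q (with 0 < p < q) to get a Pythagorean triple."""
    a, b, c = q*q - p*p, 2*p*q, p*p + q*q
    g = gcd(gcd(a, b), c)
    return a // g, b // g, c // g

x, y = circle_point(Fraction(1, 2))
assert x**2 + y**2 == 1 and slope_back(x, y) == Fraction(1, 2)
assert pythagorean_triple(1, 2) == (3, 4, 5)
```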
There are several ways you can generalize this. You could project a sphere onto a plane. I want to consider replacing the circle with a cubic curve. The problem is that the expected number of intersections between a line and a cubic is 3, so you get a 2-to-1 correspondence in general. That is interesting, too, but for now I want to consider the cases where the curve has a double point and I choose lines through that point. Such lines should intersect the cubic in one other point, giving a 1-1 correspondence between the curve (minus the singular point) and a line (minus a small number of points).
cubic with a cusp
Let $C$ be the curve defined by $y^2 = x^3$, which has a sharp corner at the origin. This type of singularity is called a cusp. Let $L_t$ denote the line through the origin and the point $(1, t)$, which has slope $t$.
- Sketch the curve $C$. Does Mathematica do a good job of this?
- The line $L_t$ meets the curve $C$ at the origin and in one other point, $P_t$. Find formulae for $x$ and $y$ in terms of $t$ and for $t$ in terms of $x$ and $y$.
- You almost get a digestion (bijection) between $C$ and the line $x = 1$. What points do you have to omit from each in order to get a digestion?
- Three points $P_{t_1}$, $P_{t_2}$, and $P_{t_3}$ on $C$ are collinear. What condition does this impose on the corresponding values of $t$?
The calculations are easier than for the circle: $x = t^2$, $y = t^3$, and $t = y/x$. You have to remove the point $(0,0)$ from the curve and the point $t = 0$ from the line. Well, you do not have to remove them, but the formula for $t$ in terms of $x$ and $y$ is fishy if you do not. The point at infinity (the third point of intersection between the curve and the $y$-axis) corresponds to itself.
The condition for collinearity is
$$\begin{vmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{vmatrix} = 0.$$
Plug in the expressions in terms of the $t$’s, chug away, and you should get
$$\frac{1}{t_1} + \frac{1}{t_2} + \frac{1}{t_3} = 0.$$
If we let $u = 1/t$, then $u$ is the natural coordinate on the line $y = 1$, and the collinearity condition becomes simply $u_1 + u_2 + u_3 = 0$. (Maybe I should use that line to start with instead of $x = 1$.)
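A quick numerical check of that condition (again my own sketch, not part of the notes): pick $t_1$ and $t_2$, solve $\frac{1}{t_1} + \frac{1}{t_2} + \frac{1}{t_3} = 0$ for $t_3$, and confirm that the three corresponding points on $y^2 = x^3$ really are collinear.

```python
from fractions import Fraction

def cusp_point(t):
    """The point P_t = (t^2, t^3) on the cuspidal cubic y^2 = x^3."""
    return t * t, t * t * t

def collinear(p, q, r):
    """True iff the vectors q - p and r - p are parallel (zero cross product)."""
    (x1, y1), (x2, y2), (x3, y3) = p, q, r
    return (x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1) == 0

t1, t2 = Fraction(2), Fraction(3)
t3 = -1 / (1 / t1 + 1 / t2)          # forces 1/t1 + 1/t2 + 1/t3 = 0
assert collinear(cusp_point(t1), cusp_point(t2), cusp_point(t3))
```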
cubic with a node
This problem deals with the curve $C'$ defined by $y^2 = x^3 + x^2$, which intersects itself at the origin. This type of singularity is called a node. Let $L_t$ denote the line through the origin and the point $(1, t)$, which has slope $t$.
- Sketch the curve $C'$. Does Mathematica do a good job of this?
- The line $L_t$ meets the curve $C'$ at the origin and in one other point, $P_t$. Find formulae for $x$ and $y$ in terms of $t$ and for $t$ in terms of $x$ and $y$.
- You almost get a digestion (bijection) between $C'$ and the line $x = 1$. What points do you have to omit from each in order to get a digestion?
- Three points $P_{t_1}$, $P_{t_2}$, and $P_{t_3}$ on $C'$ are collinear. What condition does this impose on the corresponding values of $t$?
Once again, $t = y/x$. The usual method gives $x = t^2 - 1$ and $y = t^3 - t$. In order to get a 1-1 correspondence, you need to delete the singular point $(0,0)$ from the curve and the points $t = 1$ and $t = -1$ from the line.
The lines through the origin with slope $\pm 1$ are tangent to the curve. If you plug away, you should find that the condition for collinearity is:
$$t_1 t_2 + t_1 t_3 + t_2 t_3 + 1 = 0.$$
Remember our curve $C'$ (not to be Maximum Confused with $C$)? It’s the one whose equation is $y^2 = x^3 + x^2$. The condition for 3 points to be collinear on $C'$ is:
$$t_1 t_2 + t_1 t_3 + t_2 t_3 + 1 = 0.$$
Claim: in terms of $u = \frac{t-1}{t+1}$, the condition is
$$u_1 u_2 u_3 = 1.$$
(Hint: In terms of $u$, $t = \frac{1+u}{1-u}$.)

If you start with the known equation and replace the $t$’s with $u$’s, it takes some work to get down to the condition $u_1 u_2 u_3 = 1$. If you start with the LHS of the desired equation, there is a shortcut:
$$u_1 u_2 u_3 = \frac{(t_1 - 1)(t_2 - 1)(t_3 - 1)}{(t_1 + 1)(t_2 + 1)(t_3 + 1)}.$$
But then we have
$$(t_1 + 1)(t_2 + 1)(t_3 + 1) - (t_1 - 1)(t_2 - 1)(t_3 - 1) = 2\,(t_1 t_2 + t_1 t_3 + t_2 t_3 + 1) = 0,$$
so the numerator and the denominator agree and the product is 1.
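And the same kind of numerical check for the node (mine, not Benji’s): pick $t_1$ and $t_2$, solve $u_1 u_2 u_3 = 1$ for $t_3$, and confirm the three points on $y^2 = x^3 + x^2$ are collinear.

```python
from fractions import Fraction

def node_point(t):
    """The point P_t = (t^2 - 1, t^3 - t) on the nodal cubic y^2 = x^3 + x^2."""
    return t * t - 1, t * t * t - t

def collinear(p, q, r):
    (x1, y1), (x2, y2), (x3, y3) = p, q, r
    return (x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1) == 0

def u(t):            # the multiplicative coordinate u = (t - 1) / (t + 1)
    return (t - 1) / (t + 1)

def t_from_u(v):     # its inverse, t = (1 + u) / (1 - u)
    return (1 + v) / (1 - v)

t1, t2 = Fraction(2), Fraction(5)
t3 = t_from_u(1 / (u(t1) * u(t2)))   # forces u(t1) * u(t2) * u(t3) = 1
assert collinear(node_point(t1), node_point(t2), node_point(t3))
```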
Note that the change-of-variable formulae are fractional linear transformations. Geometrically, $t$ is the natural coordinate on the line $x = 1$, and $u$ is the natural coordinate on a second line.
To get from one line to the other, just draw a line through the origin.
One interpretation of our results for the curves $C$, $C'$, is that it gives us a way to add points on the line (with the additive coordinate $u = 1/t$, via the cusp curve $C$) and multiply points on the line (with the multiplicative coordinate $u = \frac{t-1}{t+1}$, via the node curve $C'$) using just a straightedge, provided that we are allowed to draw lines through the point at infinity.
In other words, we are allowed to draw vertical lines. I will continue to refer to this as “straightedge only.” I forgot to mention: you need to have the curve provided as well. Explicitly, this is the rule. Given two points on the line, find the corresponding points on the curve. Draw a line through them. The third point of intersection with the curve will be the (additive or multiplicative) inverse of the desired sum. Draw a vertical line through this point: the second point of intersection (or the third, if you count the point at infinity as being the second) will be the desired sum/product.
More precisely, it is the point on the curve corresponding to the desired sum/product, so you have to draw one more line.
Another interpretation is that we get a troupe (or karafiol) structure on the points of the curve, excluding the singular point but including the point at infinity. The point at infinity is the identity element of the troupe (group). The construction is exactly the same as in the previous paragraph, except you leave off the parts about starting and ending on the line.
smooth cubic
Similarly, we get a troupe (group) structure on any smooth cubic curve. For example, consider the curve $E$ defined by
$$y^2 = x^3 + ax + b.$$
Start with two points on the curve, say $P_1 = (x_1, y_1)$ and $P_2 = (x_2, y_2)$. The equation for the line through these two points is
$$\frac{y - y_1}{x - x_1} = \frac{y_2 - y_1}{x_2 - x_1}.$$
Solving for $y$ gives the equation $y = \lambda x + \nu$, where
$$\lambda = \frac{y_2 - y_1}{x_2 - x_1} \quad\text{and}\quad \nu = y_1 - \lambda x_1.$$
Plug this into the equation defining $E$ and you get, after a little algebra,
$$x^3 - \lambda^2 x^2 + (a - 2\lambda\nu)\,x + (b - \nu^2) = 0.$$
Luckily, we know how to solve the general cubic. (Just kidding! Coming back to what we did with the circle, we observe that $x_1$ and $x_2$ are two of the roots, so we can use either of the relations $x_1 + x_2 + x_3 = \lambda^2$ or $x_1 x_2 x_3 = \nu^2 - b$.)
The result is:
$$x_3 = \lambda^2 - x_1 - x_2 = \frac{(x_1 x_2 + a)(x_1 + x_2) + 2b - 2 y_1 y_2}{(x_2 - x_1)^2},$$
where the final form comes from squaring $\lambda$ and using the relations $y_1^2 = x_1^3 + a x_1 + b$ and $y_2^2 = x_2^3 + a x_2 + b$.
At this point, a little patience (or a little computer-aided algebra) gives the corresponding $y$-coordinate, $y_3 = \lambda x_3 + \nu$. Do not forget the final step: $(x_3, y_3)$ is the third point of intersection, but to get the sum we need to draw a vertical line, or reflect in the $x$-axis:
$$P_1 + P_2 = (x_3, -y_3).$$
Now, I give up. With a lot of machinery, I could explain why the group law is associative. (The identity, as I think I said above, is the point at infinity. Commutativity and inverses are clear.) What I can do with a different machine (Mathematica or some other computer-algebra system) is verify the associative law. I could also do it by hand, given a week, but I do not think I would learn anything from that exercise.
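Since the notes end by suggesting a computer check, here’s a minimal sketch in Python (exact rational arithmetic instead of Mathematica; the curve $y^2 = x^3 + 17$ and the sample points are my choices, not Benji’s): it implements the chord-and-tangent law described above, with None standing for the point at infinity, and spot-checks associativity for one triple.

```python
from fractions import Fraction

# The curve y^2 = x^3 + a*x + b; y^2 = x^3 + 17 has several easy integral points.
a, b = Fraction(0), Fraction(17)
O = None                                      # the point at infinity (the identity)

def add(P, Q):
    """Chord-and-tangent addition on the curve."""
    if P is O:
        return Q
    if Q is O:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and y1 == -y2:
        return O                              # P + (-P) = O
    if P == Q:
        lam = (3 * x1 * x1 + a) / (2 * y1)    # tangent slope (doubling)
    else:
        lam = (y2 - y1) / (x2 - x1)           # chord slope
    x3 = lam * lam - x1 - x2                  # x of the third intersection point
    y3 = lam * (x1 - x3) - y1                 # then reflect in the x-axis
    return (x3, y3)

def on_curve(P):
    return P is O or P[1] ** 2 == P[0] ** 3 + a * P[0] + b

P, Q, R = (Fraction(-2), Fraction(3)), (Fraction(-1), Fraction(4)), (Fraction(2), Fraction(5))
assert all(on_curve(X) for X in (P, Q, R))
assert add(P, O) == P and add(P, (P[0], -P[1])) is O
assert add(add(P, Q), R) == add(P, add(Q, R))   # associativity, for this triple at least
```

For this one triple the two ways of associating agree exactly, which is nowhere near a proof, but it’s the kind of verification the notes have in mind.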