### Search Results

Keyword: ‘bayesian prior’

## The overburdened prior

At my new job I’ve been spending my time editing my book with Rachel Schutt (who is joining me at JRL next week! Woohoo!). It’s called Doing Data Science and it’s based on these notes I took when she taught a class on data science at Columbia last semester. Right now I’m working on the alternating least squares chapter, where we learned from Matt Gattis how to build and optimize a recommendation system. A very cool algorithm.

However, to be honest I’ve started to feel very sorry for the one parameter we call $\lambda.$ It’s also sometimes referred to as “the prior”.

Let me tell you, the world is asking too much from this little guy, and moreover most of the big-data world is too indifferent to its plight. Let me explain.

#### $\lambda$ as belief

First, he’s supposed to reflect an actual prior belief – namely, his size is supposed to reflect a mathematical vision of how big we think the coefficients in our solution should be.

In an ideal world, we would think deeply about this question of size before looking at our training data, and think only about the scale of our data (i.e. the input), the scale of the preferences (i.e. the recommendation system output) and the quality and amount of training data we have, and using all of that, we’d figure out our prior belief on the size or at least the scale of our hoped-for solution.

I’m not a statistician, but that’s how I imagine I’d spend my days if I were: thinking through this reasoning carefully, and even writing it down carefully, before I ever start my training. It’s a discipline like any other to carefully state your beliefs beforehand so you know you’re not just saying what the data wants to hear.

#### $\lambda$ as convergence insurance

But then there’s the next thing we ask of our parameter $\lambda,$ namely we assign him the responsibility to make sure our algorithm converges.

That’s because our algorithm doesn’t have a closed-form solution; instead we are discovering the coefficients of two separate matrices $U$ and $V$, fixing one while we tweak the other, then switching. The algorithm stops when, after a full cycle of fixing and tweaking, none of the coefficients has moved by more than some pre-ordained $\epsilon.$

The fact that this algorithm will in fact stop is not obvious, and in fact it isn’t always true.

It is (mostly*) true, however, if our little $\lambda$ is large enough, which is due to the fact that our above-mentioned imposed belief of size translates into a penalty term, which we minimize along with the actual error term. This little miracle of translation is explained in this post.
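The fixing-and-tweaking loop, penalty term included, can be sketched as follows. This is a minimal NumPy version under simplifying assumptions: the rating matrix `R` is treated as fully observed (a real recommender would only sum over observed entries), and the function name, seed, and defaults are all illustrative.

```python
import numpy as np

def als(R, k=5, lam=0.1, eps=1e-4, max_iters=100):
    """Alternating least squares for R ~ U @ V.T with an L2 penalty lam."""
    n_users, n_items = R.shape
    rng = np.random.default_rng(0)
    U = rng.normal(size=(n_users, k))
    V = rng.normal(size=(n_items, k))
    for _ in range(max_iters):
        U_old, V_old = U.copy(), V.copy()
        # fix V, solve a ridge regression for the rows of U
        A = V.T @ V + lam * np.eye(k)
        U = np.linalg.solve(A, V.T @ R.T).T
        # fix U, solve a ridge regression for the rows of V
        B = U.T @ U + lam * np.eye(k)
        V = np.linalg.solve(B, U.T @ R).T
        # stop once no coefficient has moved by more than eps in a full cycle
        if max(np.abs(U - U_old).max(), np.abs(V - V_old).max()) < eps:
            break
    return U, V
```

Note that $\lambda$ shows up as the `lam * np.eye(k)` term in each sub-solve: that’s the penalty term doing double duty as convergence insurance.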

And people say that all the time: when you ask, “hey, what if the algorithm doesn’t converge?” they say, “oh, if $\lambda$ is big enough it always does.”

But that’s kind of like worrying about your teenage daughter getting pregnant so you lock her up in her room all the time. You’ve solved the immediate problem by sacrificing an even bigger goal.

Because let’s face it, if the prior $\lambda$ is too big, then we are sacrificing our actual solution for the sake of conveniently small coefficients and convergence. In the asymptotic limit, which I love thinking about, our coefficients all go to zero and we get nothing at all. Our teenage daughter has run away from home with her do-nothing boyfriend.

By the way, there’s a discipline here too, and I’d suggest that if the algorithm doesn’t converge you might also want to consider reducing your number of latent variables rather than increasing your $\lambda$ since you could be asking too much from your training data. It just might not be able to distinguish that many important latent characteristics.

#### $\lambda$ as tuning parameter

Finally, we have one more job for our little $\lambda$; we’re not done with him yet. Actually, for some people this is his only real job, because in practice this is how he’s treated. Namely, we optimize him so that our results look good under whatever metric we decide to care about, which is probably the mean squared error of preference prediction on a test set (hopefully on a test set!).

In other words, in reality most of the above nonsense about $\lambda$ is completely ignored.

This is one example among many where having the ability to push a button that makes something hard seem really easy might be doing more harm than good. In this case the button says “optimize with respect to $\lambda$”, but there are other buttons that worry me just as much, and moreover there are lots of buttons being built right now that are even more dangerous and allow the users to be even more big-data-blithe.

I’ve said it before and I’ll say it again: you do need to know about inverting a matrix, and other math too, if you want to be a good data scientist.

* There’s a change-of-basis ambiguity that’s tough to get rid of here, since you only choose the number of latent variables, not their order. This doesn’t change the overall penalty term, so you can minimize that with large enough $\lambda,$ but if you’re incredibly unlucky I can imagine you might bounce between different solutions that differ by a base change. In this case your steps should get smaller, i.e. the amount you modify your matrix each time you go through the algorithm. This is only a theoretical problem by the way but I’m a nerd.

## An easy way to think about priors on linear regression

Every time you add a prior to your multivariate linear regression it’s equivalent to changing the function you’re trying to minimize. It sometimes makes it easier to understand what’s going on when you think about it this way, and it only requires a bit of vector calculus. Of course it’s not the most sophisticated way of thinking of priors, which also have various bayesian interpretations with respect to the assumed distribution of the signals etc., but it’s handy to have more than one way to look at things.

#### Plain old vanilla linear regression

Let’s first start with your standard linear regression, where you don’t have a prior. Then you’re trying to find a “best-fit” vector of coefficients $\beta$ for the linear equation $y = x \beta$. For linear regression, we know the solution will minimize the sum of the squares of the error terms, namely

$\sum_i (y_i - x_i \beta)^2$.

Here the various $i$‘s refer to the different data points.

How do we find the minimum of that? First rewrite it in vector form, where we have a big column vector of all the different $y_i$‘s and we just call it $y,$ and similarly we have a matrix for the $x_i$‘s and we call it $x.$ Then we are aiming to minimize $(y- x \beta)^\tau (y-x \beta).$

Now we appeal to an old calculus idea, namely that we can find the minimum of an upward-sloping function by locating where its derivative is zero.

Moreover, the derivative of $v^\tau v$ is just $dv^\tau v + v^\tau dv,$ or in other words $2 \cdot dv^\tau v.$ In our case $v = y - x\beta,$ and since we’re taking the derivative with respect to $\beta$ (so $x$ and $y$ are constants), we have $dv = -x,$ giving a derivative of $-2 \cdot x^\tau (y - x\beta).$ Setting that equal to zero, we can ignore the factor of $-2$ and we get $x^\tau x \beta = x^\tau y,$ or in other words the familiar formula:

$\beta = (x^\tau x)^{-1} x^\tau y$.
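That formula is easy to sanity-check numerically. Here’s a quick sketch with made-up data (the variable names and the use of `np.linalg.solve`, which avoids explicitly inverting $x^\tau x$, are my choices, not the post’s):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))           # made-up signal matrix
true_beta = np.array([2.0, -1.0, 0.5])  # made-up coefficients
y = x @ true_beta                       # noiseless response

# beta = (x^T x)^{-1} x^T y, computed via a linear solve
beta = np.linalg.solve(x.T @ x, x.T @ y)
```

With noiseless data the recovered `beta` matches `true_beta` up to floating-point error.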

#### Adding a prior on the variance, or penalizing large coefficients

There are various ways people go about adding a diagonal prior – and various ways people explain why they’re doing it. For the sake of simplicity I’ll use one “tuning parameter” for this prior, called $\lambda$ (but I could let there be a list of different $\lambda_j$‘s if I wanted) and I’ll focus on how we’re adding a “penalty term” for large coefficients.

In other words, we can think of trying to minimize the following more complicated sum:

$\frac{\sum_i (y_i - x_i \beta)^2}{N} + \sum_j \lambda^2 \beta_j^2$.

Here the $i$‘s refer to different data points (and $N$ is the number of data points) but the $j$‘s refer to the different $\beta$ coefficients, so the number of signals in the regression, which is typically way smaller.

When we minimize this, we are simultaneously trying to find a “good fit” in the sense of a linear regression, and trying to find that good fit with small coefficients, since the sum on the right grows larger as the coefficients get bigger. The extent to which we care more about the first goal or the second is just a question about how large $\lambda^2$ is compared to the variances of the signals $x_i.$ This is why $\lambda$ is sometimes called a tuning parameter. We normalize the left term by $N$ so the solution is robust to adding more data.

How do we minimize that guy? Same idea, where we rewrite it in vector form first:

$(y - x \beta)^\tau (y-x\beta)/N + (\lambda I \beta)^\tau (\lambda I \beta)$

Again, we set the derivative to zero and ignore the factor of 2 to get:

$- x^\tau (y - x \beta)/N + (\lambda I)^\tau (\lambda I \beta) = 0.$

Since $I$ is symmetric, we can simplify to $x^\tau x \beta/N + \lambda^2 I \beta = x^\tau y/N,$ or:

$\beta = (x^\tau x/N + \lambda^2 I)^{-1} x^\tau y/N,$

which of course can be rewritten as

$\beta = (x^\tau x + N \cdot \lambda^2 I)^{-1} x^\tau y.$
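As a sketch, that last formula in code (the function name and the test data are mine, not the post’s):

```python
import numpy as np

def ridge(x, y, lam):
    """Minimize sum((y_i - x_i beta)^2)/N + lam^2 * sum(beta_j^2),
    i.e. beta = (x^T x + N * lam^2 I)^{-1} x^T y."""
    N, p = x.shape
    return np.linalg.solve(x.T @ x + N * lam**2 * np.eye(p), x.T @ y)
```

Setting $\lambda = 0$ recovers plain linear regression, and cranking $\lambda$ up shrinks every coefficient toward zero, which is exactly the asymptotic failure mode discussed above.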

#### If you have a prior on the actual values of the coefficients of $\beta$

Next I want to talk about a slightly fancier version of the same idea, namely when you have some idea of what you think the coefficients of $\beta$ should actually be, maybe because you have some old data or some other study or whatever. Say your prior is that $\beta$ should be something like the vector $r,$ and so you want to penalize not the distance to zero (i.e. the sheer size of the coefficients of $\beta$) but rather the distance to the vector $r.$ Then we want to minimize:

$\frac{\sum_i (y_i - x_i \beta)^2}{N} + \sum_j \lambda^2 (\beta_j - r_j)^2$.

We vectorize as

$(y - x \beta)^\tau (y-x\beta)/N + (\lambda I (\beta - r))^\tau (\lambda I (\beta - r))$

Again, we set the derivative to zero and ignore the factor of 2 to get:

$- x^\tau (y - x \beta)/N + \lambda^2 I (\beta - r) = 0,$

so we can conclude:

$\beta = (x^\tau x/N + \lambda^2 I)^{-1} (x^\tau y/N + \lambda^2 r),$

which can be rewritten as

$\beta = (x^\tau x + N \cdot \lambda^2 I)^{-1} (x^\tau y + N \cdot \lambda^2 r).$
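And the same sketch with a prior vector $r$ (again, names and data are illustrative):

```python
import numpy as np

def ridge_with_prior(x, y, lam, r):
    """Minimize sum((y_i - x_i beta)^2)/N + lam^2 * sum((beta_j - r_j)^2),
    i.e. beta = (x^T x + N lam^2 I)^{-1} (x^T y + N lam^2 r)."""
    N, p = x.shape
    A = x.T @ x + N * lam**2 * np.eye(p)
    return np.linalg.solve(A, x.T @ y + N * lam**2 * r)
```

As $\lambda \to 0$ this is plain regression; as $\lambda \to \infty$ the data gets ignored and $\beta \to r.$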

## Bayesian regressions (part 2)

In my first post about Bayesian regressions, I mentioned that you can enforce a prior about the size of the coefficients by fiddling with the diagonal elements of the prior covariance matrix. I want to go back to that since it’s a key point.

Recall the covariance matrix represents the covariance of the coefficients, so those diagonal elements correspond to the variance of the coefficients themselves, which is a natural proxy for their size.

For example, you may just want to make sure the coefficients don’t get too big, or in other words there’s a penalty for large coefficients. Actually there’s a name for having just this prior: it’s called L2 regularization. You just set the prior to be $P = \lambda I$, where $I$ is the identity matrix and $\lambda$ is a tuning parameter; you can set the strength of the prior by turning $\lambda$ “up to eleven”.

You’re going to end up adding this prior to the actual sample covariance matrix as measured by the data, so don’t worry about the prior matrix being invertible (but definitely do make sure it’s symmetrical).

$X^{\tau} X \mapsto X^{\tau}X + P$

Moreover, you can have many different priors, corresponding to different parts of the covariance matrix, and you can add them all up together to get a final prior.

$X^{\tau} X \mapsto X^{\tau} X + \sum_i P_i$

From my first post, I had two priors, both on the coefficients of lagged values of some time series. First, I expect the signal to die out logarithmically or something as we go back in time, so I expect the size of the coefficients to die down as a power of some parameter. In other words, I’ll actually have two parameters: one for the decrease on each lag and one overall tuning parameter. My prior matrix will be diagonal and the $i$th entry will be of the form $\lambda \gamma^i$ for some $\gamma$ and for a tuning parameter $\lambda.$

My second prior was that the entries should vary smoothly, which I claimed was enforceable by fiddling with the super and sub diagonals of the covariance matrix. This is because those entries describe the covariance between adjacent coefficients (and all of my coefficients in this simple example correspond to lagged values of some time series).

In other words, ignoring the variances of each variable (since we already have a handle on the variance from our first prior), we are setting a prior on the correlation between adjacent terms. We expect the correlation to be pretty high (and we can estimate it with historical data). I’ll work out exactly what that second prior is in a later post, but in the end we have two priors, both with tuning parameters, which we may be able to combine into one tuning parameter, which again determines the strength of the overall prior after adding the two up.
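As a sketch, here is what those two prior matrices might look like for five lags. The decay rate, the adjacent-lag correlation, and the tuning strength are all illustrative numbers I made up, not values from the post:

```python
import numpy as np

n_lags = 5
lam, gamma = 1.0, 0.5  # overall tuning strength and per-lag decay (made up)
rho = 0.8              # assumed correlation between adjacent lag coefficients (made up)

# first prior: diagonal entries die down as a power of gamma
P1 = np.diag([lam * gamma**i for i in range(n_lags)])

# second prior: smoothness, enforced on the sub- and super-diagonals
off = np.ones(n_lags - 1)
P2 = rho * (np.diag(off, k=1) + np.diag(off, k=-1))

# combined prior, to be added to the sample covariance matrix X^T X
P = P1 + P2
```

The combined $P$ stays symmetric, which is the one property the post insists on before adding it to $X^{\tau} X$.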

Because we are tamping down the size of the coefficients, as well as linking them through a high correlation assumption, the net effect is that we are decreasing the number of effective coefficients, and the regression has less work to do. Of course this all depends on how strong the prior is; we could make the prior so weak that it has no effect, or we could make it so strong that the data doesn’t affect the result at all!

In my next post I will talk about combining priors with exponential downweighting.

## Bayesian regressions (part 1)

I’ve decided to talk about how to set up a linear regression with Bayesian priors because it’s super effective and not as hard as it sounds. Since I’m not a trained statistician, and certainly not a trained Bayesian, I’ll be coming at it from a completely unorthodox point of view. For a more typical “correct” way to look at it see for example this book (which has its own webpage).

The goal of today’s post is to abstractly discuss “bayesian priors” and illustrate their use with an example. In later posts, though, I promise to actually write and share python code illustrating bayesian regression.

The way I plan to be unorthodox is that I’m completely ignoring distributional discussions. My perspective is, I have some time series (the $x_i$‘s) and I want to predict some other time series (the $y$) with them, and let’s see if using a regression will help me- if it doesn’t then I’ll look for some other tool. But what I don’t want to do is spend all day deciding whether things are in fact student-t distributed or normal or something else. I’d like to just think of this as a machine that will be judged on its outputs. Feel free to comment if this is palpably the wrong approach or dangerous in any way.

A “bayesian prior” can be thought of as equivalent to data you’ve already seen before starting on your dataset. Since we think of the signals (the $x_i$‘s) and response ($y$) as already known, we are looking for the most likely coefficients $\beta_i$ that would explain it all. So the form a bayesian prior takes is: some information on what those $\beta_i$‘s look like.

The information you need to know about the $\beta_i$‘s is two-fold. First you need to know their values and second you need to have a covariance matrix to describe their statistical relationship to each other. When I was working as a quant, we almost always had strong convictions about the latter but not the former, although in the literature I’ve been reading lately I see more examples where the values (really the mean values) for the $\beta_i$‘s are chosen but with an “uninformative covariance assumption”.

Let me illustrate with an example. Suppose you are working on the simplest possible model: you are taking a single time series and seeing how earlier values of $x$ predict the next value of $x$. So in a given update of your regression, $y= x_t$ and each $x_i$ is of the form $x_{t-a}$ for some $a>0.$

What is your prior for this? Turns out you already have one (two actually) if you work in finance. Namely, you expect the signal of the most recent data to be stronger than whatever signal is coming from older data (after you decide how many past signals to use by first looking at a lagged correlation plot). This is just a way of saying that the sizes of the coefficients should go down as you go further back in time. You can make a prior for that by working on the diagonal of the covariance matrix.

Moreover, you expect the signals to vary continuously- you (probably) don’t expect the third-from recent variable $x_{t-3}$ to have a positive signal but the second-from recent variable $x_{t-2}$ to have a negative signal (especially if your lagged autocorrelation plot looks like this). This prior is expressed as a dampening of the (symmetrical) covariance matrix along the subdiagonal and superdiagonal.

In my next post I’ll talk about how to combine exponential down-weighting of old data, which is sacrosanct in finance, with bayesian priors. Turns out it’s pretty interesting and you do it differently depending on circumstances. By the way, I haven’t found any references for this particular topic so please comment if you know of any.

## Women on a board of directors: let’s use Bayesian inference

I wanted to show how to perform a “women on the board of directors” analysis using Bayesian inference. What this means is that we need to form a “prior” on what we think the distribution of the answer could be, and then we update our prior with the data available.  In this case we simplify the question we are trying to answer: given that we see a board with 3 women and 7 men (so 10 total), what is the fraction of women available for the board of directors in the general population? The reason we may want to answer this question is that then we can compare the answer to other available answers, derived other ways (say by looking at the makeup of upper level management) and see if there’s a bias.

In order to illustrate Bayesian techniques, I’ve simplified it further to be a discrete question. So I’ve pretended that there are only 11 answers you could possibly have, namely that the fraction of available women (in the population of people qualified to be put on the board of directors) is 0%, 10%, 20%, …, 90%, or 100%.

Moreover, I’ve put the least judgmental prior on the situation, namely that there is an equal chance for any of these 11 possibilities.  Thus the prior distribution is uniform:

*Figure: We have absolutely no idea what the fraction of qualified women is.*

The next step is to update our prior with the available data. In this case we have the data point that there is a board with 3 women and 7 men. We are sure that there are some women and some men available, so the updated probability of there being 0% women or 100% women should both be zero (and we will see that this is true). Moreover, we would expect to see that the most likely fraction will be 30%, and we will see that too. What Bayesian inference gives us, though, is the relative probabilities of the other possibilities, based on the likelihood that each of them is true given the data. So for example, if we assume for the moment that 70% of the qualified people are women, what is the likelihood that the board ends up being 3 women and 7 men? We can compute that as (0.70)^3*(0.30)^7. We multiply that by 1/11, the probability that 70% is the right answer (according to our prior), to get the “unscaled posterior distribution”, or the likelihoods of each possibility. Here’s a graph of these numbers when I do it for all 11 possibilities:

*Figure: We learn the relative likelihoods of the outcome “3 out of 10” given the various ratios of women.*

In order to make this a probability distribution we need to make sure the total adds up to 1, so we scale to get the actual posterior distribution:

*Figure: We scale these to add up to 1.*

What we observe is, for example, that it’s about twice as likely for 50% of women to be qualified as it is for 10% of women to be qualified, even though those answers are equally distant from the best guess of 30%.  This kind of “confidence of error” is what Bayesian inference is good for.  Also, keep in mind that if we had had a more informed prior the above graph would look different; for example we could use the above graph as a prior for the next time we come across a board of directors.  In fact that’s exactly how this kind of inference is used: iteratively, as we travel forward through time collecting data.  We typically want to start out with a prior that is pretty mild (like the uniform distribution above) so that we aren’t skewing the end results too much, and let the data speak for itself.  In fact priors are typically of the form, “things should vary smoothly”; more on what that could possibly mean in a later post.
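That “about twice as likely” claim is easy to check directly from the likelihoods (with a uniform prior, the 1/11 factors cancel out of the ratio):

```python
# likelihood of a 3-women, 7-men board under a hypothesized fraction p of qualified women
def likelihood(p):
    return p**3 * (1 - p)**7

ratio = likelihood(0.5) / likelihood(0.1)  # roughly 2
```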

#### Here’s the python code I wrote to make these graphs:

#!/usr/bin/env python
from matplotlib.pylab import *
from numpy import *

# plot prior distribution:
figure()
bar(arange(0, 1.1, 0.1), array([1.0/11]*11), width=0.1, label="prior probability distribution")
xticks(arange(0, 1.1, 0.1) + 0.05, [str(x) for x in arange(0, 1.1, 0.1)])
xlim(0, 1.1)
legend()
show()

# compute likelihoods for each of the 11 possible ratios of women:
likelihoods = []
for x in arange(0, 1.1, 0.1):
    likelihoods.append(x**3 * (1 - x)**7)

# unscaled posterior = prior times likelihood:
unscaled = array([1.0/11]*11) * array(likelihoods)

# plot unscaled posterior distribution:
figure()
bar(arange(0, 1.1, 0.1), unscaled, width=0.1, label="unscaled posterior probability distribution")
xticks(arange(0, 1.1, 0.1) + 0.05, [str(x) for x in arange(0, 1.1, 0.1)])
xlim(0, 1.1)
legend()
show()

# plot scaled posterior distribution (scaled to sum to 1):
figure()
bar(arange(0, 1.1, 0.1), unscaled / sum(unscaled), width=0.1, label="scaled posterior probability distribution")
xticks(arange(0, 1.1, 0.1) + 0.05, [str(x) for x in arange(0, 1.1, 0.1)])
xlim(0, 1.1)
legend()
show()

#### Here’s the R code that Daniel Krasner wrote for these graphs:

barplot(rep(1/11, 11), width = .1, col = "blue", main = "prior probability distribution")
likelihoods = c()
for (x in seq(0, 1.0, by = .1))
  likelihoods = c(likelihoods, (x^3) * ((1 - x)^7))
barplot(likelihoods, width = .1, col = "blue", main = "unscaled posterior probability distribution")
# with a uniform prior the 1/11 factors cancel, so scaling the likelihoods gives the posterior:
barplot(likelihoods / sum(likelihoods), width = .1, col = "blue", main = "scaled posterior probability distribution")

## Learning to eat again

So, I’m learning to eat again. Like a newborn child perhaps, but worse, since I have all sorts of memories of how much I can eat and what I like to eat that are misleading. A Bayesian prior that I can’t easily shake.

Pescatarian

For example, once I was cleared to eat ground meat, I made myself a pot of beef chili, which is something I’ve always loved to eat. I knew I could only eat a bit of it at a time, but I figured that was fine, since I’d share it with other people. But the truth is, I couldn’t eat it at all. I tried one tiny bowl of it and it felt like a million tons in my stomach.

That’s been the way it is for me, with pretty much all meat, including chicken. I can’t seem to eat meat and feel good afterwards.

By contrast, I can eat fish. To be more precise, sashimi. I’ve really enjoyed salmon sashimi. And tofu. I’ve been pretty much addicted to tofu. Anything Thai, and the lighter the sauce the better.

Traveling

Traveling while learning to eat sucks. I went away with my kids for a few days to West Springfield, MA. Talk about a food desert. The best restaurant we went to was Bertucci’s, followed by IHOP, followed by Friendly’s. Not an exaggeration. And since I’m not eating pasta or doughy bread, Bertucci’s was tough. And since I don’t want to eat sweet things, IHOP was basically impossible. And since I don’t digest fried things, Friendly’s was awful.

Out of desperation, I google searched “good healthy food near me” and it came up with two results: Dunkin Donuts and a martini bar.

Basically I lived off of the cheese I brought with me for the trip. I now kind of understand why rich people pay so much to vacation in fancy places with healthy food. I would have paid good money for avocado toast.

As a side note, I’ve never been more aware of how most of America eats. The food available in places like this is unhealthy, addictive, and omnipresent. Not to mention very, very cheap. Which is to say, there is a systemic problem we will have to face sooner or later when it comes to health.

Throwing the Rulebook Out the Window

I think I mentioned before that the instructions I’ve received from the surgeon’s office – specifically, from the nutritionist – have been hard to follow, in part because they’re extra strict to make allowances for the fact that they assume practically everyone cheats. That’s not a theory, I asked. And since I’m actually trying to be compliant, that makes it kind of ridiculous.

For example, the instructions tell you to eat meat with mayonnaise so that it will go down easier. But they also tell you not to ever eat something with more than 25% fat in a meal. That’s hard to do, so the conclusion is to mix up your meat with diet mayonnaise to force it down.

I mean, yuck. Who wants to force down chunks of chicken or beef with diet mayonnaise? I’d rather never eat meat in the first place.

More generally, though, I once again think the entire causal relationship has been misunderstood.

It’s easy enough to do as a nutritionist: if you notice that people who eat high fat foods don’t lose as much weight as people who eat lean foods, it’s natural to tell everyone to eat lean foods. But that doesn’t mean such advice will be heeded or will work.

My perspective is that I’ve thrown the dice on this surgery, and it has changed my hormones, and my stomach biome, and my tastes will change, and I might end up being one of those people who both desire and consume lean foods. And if I’m lucky, and I end up wanting to eat lean foods, this surgery will have been a success. But I cannot make it a success with sheer force of will.

You see, I also like my cheese, and sometimes my entire “meal” (still the size of a snack) consists of eating cheese, and I’m sure it’s more than 25% fat, but I’m not planning to replace it with diet cheese. Instead, I’m happy to report, other meals all I want is fruit, or salad, and they’ll have to balance stuff out.

Long story short, I’m ending up relatively noncompliant, after all. The only things I’m being super careful about are my vitamin patches and my protein intake, which seem important.

I don’t know how this will all end up, but I do know that I value delicious and satisfying food, and I’d rather be listening to my body and eating good food than ignoring my body and eating plastic.


My, my, my. It’s been a while. Aunt Pythia plum forgot about her duties last Saturday, what with all the math nerds and such in San Antonio.

Many apologies! But don’t think Aunt Pythia didn’t miss you, because nothing could be less true: Aunt Pythia positively pined for you this last week. It was excruciating and slightly adorable. Trust me on that one.

Before I begin, Aunt Pythia wants to share her latest knitting pattern with you, since it’s butt cold here in the East and was even freezing cold in Texas, so we all need cowls. Yes we do, and here’s the one I’m making (along with the hat!):

Mine is burgundy and black. And I’ve heard from good sources that this doesn’t actually look like Klimt’s art at all, even though it’s called a “Klimt cowl.” Artistic license.

Isn’t that just darling? And warm? Aunt Pythia knew you’d agree.

OK, onto the day’s delightful task. I am feeling more than usually oracle-esque today, tell me if you agree in the comments below. And in any case,

ask Aunt Pythia a question at the bottom of the page!

By the way, if you don’t know what the hell Aunt Pythia is talking about, go here for past advice columns and here for an explanation of the name Pythia.

——

Dear Aunt Pythia,

What has happened to the Occupy movement? In the media that I read, it is totally disappeared. I was thinking that you were still involved, at least in Finance. Right now, it seems like the current administration is owned by Wall Street bankers. That can’t be a good situation. Is there a mathematical angle to this?

Missing Person

Dear Missing Person,

The Alt Banking group still meets every week on Sunday afternoons. We often have super interesting guest speakers and we’ve been writing pieces for the Huffington Post. We also continue to get positive feedback about our book and our cards. Feel free to come to the meetings! And even if you can’t come, you can get on the mailing list by emailing that request to alt.banking.OWS@gmail.com.

In terms of the Obama administration, yes, it’s owned by Wall Street, and to be honest I didn’t think it could get worse but we’ll see if I’m wrong starting now. I hear the Republican congress has even worse plans for watering down Dodd Frank than have already been exposed.

Jesse Eisenger’s recent column was right on, in my opinion. If Obama wants to redeem himself and leave a less-than-shameful legacy, he needs to act big right now. Also, keep an eye on Bernie Sanders from now on, as well as Liz Warren.

Aunt Pythia

——

Dear Aunt Pythia,

I live in San Francisco, but I work on international human rights not in the tech industry. Naturally, a handful of my friends work at Google or at start-ups – things that fall under the umbrella of “tech.”

I had dinner with them tonight and I walked away feeling very agitated. Whenever I hang out with them, I always walk away with the sense that they think they’re smarter than me. I can’t figure out if this is my projection onto them or they really give off this attitude.

The night was going fine but then we talked about the Google shuttles fiasco. We had a friend visiting from out of town who was curious why people were protesting the buses. I told her that some people felt that it was reducing access to local transport, since they used government bus stops. All three of my tech friends, two of whom work at Google, scrambled to tell me 1. that I have a skewed perception, I’m blowing things out of proportion, and that I don’t have an accurate assessment of the situation 2. that they really haven’t caused a decrease in access to services and 3. that now that Google has an official contract with the MTA, everything is fine and resolved.

My response to 1 was that I was merely explaining, in one sentence, why there were protests to someone who is unfamiliar with the situation. I wasn’t trying to capture all the nuances in one sentence. My response to 2 was that I actually met a group of people from a disability advocacy group that had to stage a protest because the shuttles were blocking access to the municipal buses. It was causing situations like making blind people or people in wheelchairs go around a Google shuttle to get on a bus in the middle of a street. I never got to respond to point 3.

I know that the situation with Google and other tech industries is nuanced, but the lack of scrutiny and the immediate scramble for defending a large player like Google seems so ridiculous to me. I’m not a Google fangirl or any sort of product fangirl, so I don’t understand this mentality. When I gave the example of the disabled people lacking access to the city buses, one of the Google employees stated that it must have been some individual case of a badly trained bus driver. My response was that it happened enough that they had to protest, and that they’re going to hold Google responsible not the individual bus driver. He said they were wrong for doing that. I think he’s wrong for thinking that!

I guess my questions are this: Are my tech friends assholes? Is the future of America doomed if privileged people are so threatened by simple conversations like this? And how do I engage with people like that without feeling like I’m being talked down to/talked as if I’m not smart enough to understand?

Don’t Understand My Brethren That Emphasize Constant Hurrahs In Electronics/Tech Seriously

Dear DUMBTECHIES,

First of all, awesome sign off.

Second of all, this is not about you being dumb. This is about them being defensive. Defensiveness leads to terrible reasoning abilities, so the only way for defensive people to win arguments, since they can’t do it with their logic, is to do it with a bullying attitude. In other words, they aggressively describe their stupid reasoning, and then act like you must be an idiot if you don’t see what they are saying as obvious. But it’s all a front because they know they have nothing to stand on. If they weren’t defensive, they would treat you like an intelligent person and ask you what you think.

Important Life Lesson: 99 times out of 100, if you are in a conversation where the person talking to you is making you feel dumb, then it’s about them, not you. It means they feel dumb about something and they are compensating. If you can, turn it around on them immediately, even if it’s as simple as saying, “you’re acting like my points are dumb, but I don’t think they are, I’m just trying to have a conversation. Is there something about this topic that makes you uncomfortable?”

So, why the defensiveness? Here’s the thing, Google employees work for Google, and it’s kind of a cult, like many companies are, and they feel lucky to be there and want other people to think they’re lucky too, so they defend things even when those things don’t make sense.

I actually don’t think they are any weirder in this regard than people who work in other industries, defending things like the wisdom of financial engineering or the wisdom of promoting fossil fuel. People are pretty good at defending their own interests. These guys just happen to be working at a very recognizable place.

In terms of approaching the topic, if you ever choose to discuss this again, I would suggest talking about what would happen if the Google buses ceased to exist – how would Googlers get to work? To what extent would that interfere with municipal buses? Certainly traffic would increase, for example. And since everyone has the right to go to work, you are working from a super reasonable starting position, namely thinking through the pluses and minuses of the Google bus system. Admit there are pluses and maybe the other side can start to admit there are minuses.

Or you could just hang out with other folks.

Good luck!

Aunt Pythia

p.s. I could be wrong, they could just really think they’re smarter than you. Cults also have a way of encouraging that kind of thing. But if they really think so, they might admit it. Ask them if they think they are smarter than “non-Googlers” and see what they say.

——

Dear Aunt Pythia,

What are your predictions for kinky sex in the UK now that they banned all fun porn?

Curious Sub

Dear Curious,

What? Seriously? Oh wait, yes. Among the outlawed activities is “facesitting,” which makes little sense to me, given that “unlike smothering, in facesitting the bottom partner is not deprived of air.” What’s next, banning doggy style?

Also, female ejaculation is now banned. What? This is one of the few indications in porn that the woman is alive, and now we’re banning it. That makes sense.

OK, well, it’s dumb. And stupid as well, since the internet will provide horny people from the UK with plenty of facesitting and female ejaculation opportunities if they so desire. Basically it’s a loss of market share. I’m tempted to add “and nothing else” but when market share gets moved to places further into the shadows, things get less consensual and more coerced, and that’s never good.

Auntie P

——

Aunt Pythia,

Fivethirtyeight recently published the article “Economists Aren’t As Nonpartisan As We Think”. What really interested me in this piece was the author’s chart that demonstrated that on average, political bias has crept into the numerical results of economic research.

In the footnotes they explained a bit more: “Specifically, we ran a regression of numerical results, which were standardized within fields, on predicted ideology while controlling for field. Among the models we ran, the R squared ranged from 0.07 to 0.14.”

I did a little searching and found that R squared values can be misleading. Either way, this single result with an R squared value of 0.07 – 0.14 seems a bit weak-sauce if you are trying to support such a broad claim as “economists are partisan”.

So, my questions for you are: what does the chart in the Fivethirtyeight article mean? What is the meaning of the R squared value in this research? Is this a robust claim?

Many Thanks,
Mr. Should be studying for finals

Dear Mr. Should,

I’m gonna have to go Bayesian on your ass and mention that the title of the piece should have been, Economists Aren’t As Partisan As We Wish They Were, But We Knew That Already. Anyone who has ever read or spoken to economists would already suspect this.

Which is to say, I have a bayesian prior that this result is true, and their R squared value is enough to add fuel to my fire.

It’s not just economists, though. It’s everyone! See above w.r.t. Googlers, for example.

Here’s another thing getting in the way of me critiquing this paper: one of the authors, Suresh Naidu, is a good friend of mine.

In general, though, even when I already think something’s true, and when my friends are involved, I try to remember that data analysis is, at best, an evidence-gathering activity, not a proof. After it’s done a bunch of different ways and remains robust to various important choices, I start believing it more and more. For example, global warming is real.

Aunt Pythia

——

Well, you’ve wasted yet another Saturday morning with Aunt Pythia! I hope you’re satisfied! If you could, please ask me a question. And don’t forget to make an amazing sign-off, they make me very very happy.

Categories: Aunt Pythia

## Strata: one down, one to go

Yesterday I gave a talk called “Finance vs. Machine Learning” at Strata. It was meant to be a smack-down, but for whatever reason I couldn’t engage people to personify the two disciplines and have a wrestling match on stage. For the record, I offered to be on either side. Either they were afraid to hurt a girl or they were afraid to lose to a girl, you decide.

Unfortunately I didn’t actually get to the main motivation for the genesis of this talk, namely the realization I had a while ago that when machine learners talk about “ridge regression” or “Tikhonov regularization” or even “L2 regularization” it comes down to the same thing that quants call a very simple bayesian prior that your coefficients shouldn’t be too large. I talked about this here.

What I did have time for: I talked about “causal modeling” in the finance-y sense (discussion of finance vs. statistician definition of causal here), exponential downweighting with a well-chosen decay, storytelling as part of feature selection, and always choosing to visualize everything, and always visualizing the evolution of a statistic rather than a snapshot statistic.

They videotaped me but I don’t see it on the strata website yet. I’ll update if that happens.

This morning, at 9:35, I’ll be in a keynote discussion with Julie Steele for 10 minutes entitled “You Can’t Learn That in School”, which will be live streamed. It’s about whether data science can and should be taught in academia.

For those of you wondering why I haven’t blogged the Columbia Data Science class like I usually do Thursday, these talks are why. I’ll get to it soon, I promise! Last night’s talks by Mark Hansen, data vizzer extraordinaire and Ian Wong, Inference Scientist from Square, were really awesome.

## Columbia Data Science course, week 7: Hunch.com, recommendation engines, SVD, alternating least squares, convexity, filter bubbles

Last night in Rachel Schutt’s Columbia Data Science course we had Matt Gattis come and talk to us about recommendation engines. Matt graduated from MIT in CS, worked at SiteAdvisor, and co-founded Hunch as its CTO, which recently got acquired by eBay. Here’s what Matt had to say about his company:

Hunch

Hunch is a website that gives you recommendations of any kind. When we started out it worked like this: we’d ask you a bunch of questions (people seem to love answering questions), and then you could ask the engine questions like, what cell phone should I buy? or, where should I go on a trip? and it would give you advice. We use machine learning to learn and to give you better and better advice.

Later we expanded into more of an API where we crawled the web for data rather than asking people direct questions. We can also be used by third parties to personalize content for a given site, a nice business proposition which led eBay to acquire us. My role there was doing the R&D for the underlying recommendation engine.

Matt has been building code since he was a kid, so he considers software engineering to be his strong suit. Hunch is a cross-domain experience so he doesn’t consider himself a domain expert in any focused way, except for recommendation systems themselves.

The best quote Matt gave us yesterday was this: “Forming a data team is kind of like planning a heist.” He meant that you need people with all sorts of skills, and that one person probably can’t do everything by herself. Think Ocean’s Eleven but sexier.

A real-world recommendation engine

You have users, and you have items to recommend. Each user and each item has a node to represent it. Generally users like certain items. We represent this as a bipartite graph. The edges are “preferences”. They could have weights: they could be positive, negative, or on a continuous scale (or discontinuous but many-valued like a star system). The implications of this choice can be heavy but we won’t get too into them today.

So you have all this training data in the form of preferences. Now you wanna predict other preferences. You can also have metadata on users (i.e. know they are male or female, etc.) or on items (a product for women).

For example, imagine users came to your website. You may know each user’s gender, age, whether they’re liberal or conservative, and their preferences for up to 3 items.

We represent a given user as a vector of features, sometimes including only their metadata, sometimes including only their preferences (which would lead to a sparse vector, since you don’t know all their opinions), and sometimes including both, depending on what you’re doing with the vector.

Nearest Neighbor Algorithm?

Let’s review the nearest neighbor algorithm (discussed here): if we want to predict whether user A likes something, we just look at the user B closest to user A who has an opinion, and we assume A’s opinion is the same as B’s.

To implement this you need a definition of a metric so you can measure distance. One example: Jaccard distance, i.e. one minus the number of preferences they have in common divided by the total number of things they’ve rated between them. Other examples: cosine similarity or euclidean distance. Note: you might get a different answer depending on which metric you choose.
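For concreteness, here’s a minimal sketch of Jaccard distance between two users’ sets of liked items (the function name and toy data are my own, not from the class):

```python
def jaccard_distance(prefs_a, prefs_b):
    """Jaccard distance between two sets of liked items:
    one minus (size of intersection / size of union)."""
    union = prefs_a | prefs_b
    if not union:
        return 0.0  # no information at all: call them identical
    return 1.0 - len(prefs_a & prefs_b) / len(union)

# Two users who share 2 of the 4 distinct items between them:
a = {"item1", "item2", "item3"}
b = {"item2", "item3", "item4"}
print(jaccard_distance(a, b))  # 1 - 2/4 = 0.5
```

Note how the sparsity problem from the list below shows up immediately: if the two sets barely overlap, the union dominates and every pair of users looks maximally far apart.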

What are some problems using nearest neighbors?

• There are too many dimensions, so the closest neighbors are too far away from each other. There are tons of features, moreover, that are highly correlated with each other. For example, you might imagine that as you get older you become more conservative. But then counting both age and politics would mean you’re double counting a single feature in some sense. This would lead to bad performance, because you’re using redundant information. So we need to build in an understanding of the correlation and project onto a smaller-dimensional space.
• Some features are more informative than others. Weighting features may therefore be helpful: maybe your age has nothing to do with your preference for item 1. Again you’d probably use something like covariances to choose your weights.
• If your vector (or matrix, if you put together the vectors) is too sparse, or you have lots of missing data, then most things are unknown and the Jaccard distance means nothing because there’s no overlap.
• There’s measurement (reporting) error: people may lie.
• There’s a calculation cost – computational complexity.
• Euclidean distance also has a scaling problem: age differences will outweigh all other differences if preferences are reported as 0 (for don’t like) or 1 (for like). Essentially this means that raw euclidean distance doesn’t account for the different scales of the features.
• Also, old and young people might think one thing but middle-aged people something else. We seem to be assuming a linear relationship, but it may not exist.
• User preferences may also change over time, which falls outside the model. For example, at eBay, a user might be buying a printer, which makes them want ink, but only for a short time.
• Overfitting is also a problem. The one guy is closest, but it could be noise. How do you adjust for that? One idea is to use k-nearest neighbor, with say k=5.
• It’s also expensive to update the model as you add more data.

Matt says the biggest issues are overfitting and the “too many dimensions” problem. He’ll explain how he deals with them.

Going beyond nearest neighbor: machine learning/classification

In its most basic form, we can model each item separately using a linear regression. Denote by $f_{i, j}$ user $i$‘s preference for item $j$ (or attribute, if item $j$ is a metadata item). Say we want to model a given user’s preference for a given item using only the 3 metadata properties of that user, which we assume are numeric. Then we can look for the best choice of $\beta_k$ as follows:

$p_i = \beta_1 f_{i, 1} + \beta_2 f_{i, 2} + \beta_3 f_{i, 3} + \epsilon.$

Remember, this model only works for one item. We need to build as many models as we have items. We know how to solve the above per item by linear algebra. Indeed one of the drawbacks is that we’re not using other items’ information at all to create the model for a given item.

This solves the “weighting of the features” problem we discussed above, but overfitting is still a problem, and it comes in the form of having huge coefficients when we don’t have enough data (i.e. not enough opinions on given items). We have a bayesian prior that these weights shouldn’t be too far out of whack, and we can implement this by adding a penalty term for really large coefficients.

This ends up being equivalent to adding a prior matrix to the covariance matrix. How do you choose $\lambda$? Experimentally: use some data as your training set, evaluate how well you did using particular values of $\lambda$, and adjust.

Important technical note: You can’t use this penalty term for large coefficients and assume the “weighting of the features” problem is still solved, because in fact you’re implicitly penalizing some coefficients more than others. The easiest way to get around this is to normalize your variables before entering them into the model, similar to how we did it in this earlier class.
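To make the penalty-term idea concrete, here’s a hedged sketch of ridge regression on normalized features, with $\lambda$ chosen experimentally on held-out data as described above (the toy data and the candidate $\lambda$ values are made up for illustration):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: adding lam * I to the
    covariance matrix X^T X penalizes large coefficients."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Normalize the features first, so the penalty treats them all equally.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Choose lambda experimentally: evaluate candidates on held-out data.
X_train, y_train = X[:80], y[:80]
X_test, y_test = X[80:], y[80:]
for lam in [0.01, 1.0, 100.0]:
    beta = ridge_fit(X_train, y_train, lam)
    err = np.mean((X_test @ beta - y_test) ** 2)
    print(lam, err)
```

The larger $\lambda$ is, the harder the coefficients get pulled toward zero, which is exactly the prior belief that they shouldn’t be “too far out of whack.”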

The dimensionality problem

We still need to deal with this very large problem. We typically use both Principal Component Analysis (PCA) and Singular Value Decomposition (SVD).

To understand how this works, let’s talk about how we reduce dimensions and create “latent features” internally every day. For example, we invent concepts like “coolness” – but I can’t directly measure how cool someone is, like I could weigh them or something. Different people exhibit patterns of behavior which we internally map to our one dimension of “coolness”.

We let the machines do the work of figuring out what the important “latent features” are. We expect them to explain the variance in the answers to the various questions. The goal is to build a model which has a representation in a lower dimensional subspace which gathers “taste information” to generate recommendations.

SVD

Given a matrix $X,$ decompose it into three matrices:

$X = U S V^{\tau}.$

Here $X$ is $m \times n, U$ is $m \times k, S$ is $k\times k,$ and $V$ is $k\times n,$ where $m$ is the number of users, $n$ is the number of items, and $k$ is the rank of $X.$

The rows of $U$ correspond to users, whereas $V$ has a row for each item. The square matrix $S$ is diagonal, and each entry is a singular value, which measures the importance of the corresponding dimension. If we put them in decreasing order, which we do, then the dimensions are ordered by importance from highest to lowest. Every matrix has such a decomposition.

Important properties:

• The columns of $U$ and $V$ are orthogonal to each other.
• So we can order the columns by singular values.
• We can take a lower-rank approximation of $X$ by throwing away part of $S.$ In this way we might have $k$ much smaller than either $n$ or $m$, and this is what we mean by compression.
• There is an important interpretation to the values in the matrices $U$ and $V.$ For example, we can see, by using SVD, that “the most important latent feature” is often something like seeing if you’re a man or a woman.
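A quick illustration of the decomposition and the lower-rank approximation, using numpy’s SVD on a made-up preferences matrix (the numbers are invented, not Hunch data):

```python
import numpy as np

# A tiny preferences matrix: 4 users by 3 items. The first two users
# have similar tastes, and so do the last two.
X = np.array([[5., 4., 1.],
              [4., 5., 1.],
              [1., 1., 5.],
              [1., 2., 4.]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
# The singular values come back in decreasing order of importance.
assert np.all(np.diff(s) <= 0)

# Lower-rank approximation: keep only the 2 most important latent
# features and the corresponding rows/columns of U and V.
k = 2
X_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(X_approx, 1))  # very close to the original X
```

Keeping the full set of singular values reconstructs $X$ exactly; throwing away the small ones loses almost nothing, which is the compression.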

[Question: did you use domain expertise to choose questions at Hunch? Answer: we tried to make them as fun as possible. Then, of course, we saw things needing to be asked which would be extremely informative, so we added those. In fact we found that we could ask merely 20 questions and then predict the rest of them with 80% accuracy. They were questions that you might imagine and some that surprised us, like competitive people v. uncompetitive people, introverted v. extroverted, thinking v. perceiving, etc., not unlike MBTI.]

More details on our encoding:

• Most of the time the questions are binary (yes/no).
• We create a separate variable for every question.
• Comparison questions may be better at granular understanding, and get to revealed preferences, but we don’t use them.

Note if we have a rank $k$ matrix $X$ and we use the SVD above, we can take the approximation with only $k-3$ rows of the middle matrix $S,$ so in other words we take the top $k-3$ most important latent features, and the corresponding rows of $U$ and $V,$ and we get back something very close to $X.$

Note that the problem of sparsity or missing data is not fixed by the above SVD approach, nor is the computational complexity problem; SVD is expensive.

PCA

Now we’re still looking for $U$ and $V$ as above, but we don’t have $S$ anymore, so $X = U \cdot V^{\tau},$ and we have a more general optimization problem. Specifically, we want to minimize:

$\sum_{(i, j) \in P} (p_{i, j} - u_i \cdot v_j)^2.$

Let me explain. We denote by $u_i$ the row of $U$ corresponding to user $i,$ and similarly we denote by $v_j$ the row of $V$ corresponding to item $j.$ Items can include meta-data information (so the age vectors of all the users will be a row in $V$).

Then the dot product $u_i \cdot v_j$ is taken to mean the predicted value of user $i$‘s preference for item $j,$ and we compare that to the actual preference $p_{i, j}$. The set $P$ is just the set of all actual known preferences or meta-data attribution values.

So, we want to find the best choices of $U$ and $V$ which overall minimize the squared differences between prediction and observation on everything we actually know, and the idea is that if it’s really good on stuff we know, it will also be good on stuff we’re guessing.

Now we have a parameter, namely the number $D,$ which is how many latent features we want to use. The matrix $U$ will have a row for each user and a column for each latent feature, and the matrix $V$ will have a row for each item and a column for each latent feature.

How do we choose $D?$ It’s typically about 100, since it’s more than 20 (we already know we had a pretty good grasp on someone if we ask them 20 questions) and it’s as much as we care to add before it becomes computationally too much work. Note the resulting latent features will tend to be uncorrelated, since correlated features would be an inefficient use of the $D$ dimensions (that’s an intuition, not a proof).

But how do we actually find $U$ and $V?$

Alternating Least Squares

This optimization doesn’t have a nice closed-form solution like ordinary least squares with one set of coefficients. Instead, we use an iterative algorithm as with gradient descent. As long as your problem is convex you’ll converge ok (i.e. you won’t find yourself at a local but not global minimum), and we will force our problem to be convex using regularization.

Algorithm:

• Pick a random $V$
• Optimize $U$ while $V$ is fixed
• Optimize $V$ while $U$ is fixed
• Keep doing the above two steps until you’re not changing very much at all.

Example: Fix $V$ and update $U.$

The way we do this optimization is user by user. So for user $i,$ we want to find

$argmin_{u_i} \sum_{j \in P_i} (p_{i, j} - u_i \cdot v_j)^2,$

where $v_j$ is fixed. In other words, we just care about this user for now.

But wait a minute, this is the same as linear least squares, and has a closed form solution! In other words, set:

$u_i = (V_{*, i}^{\tau} V_{*, i})^{-1} V_{*, i}^{\tau} P_{*, i},$

where $V_{*, i}$ is the subset of $V$ for which we have preferences coming from user $i.$ Taking the inverse is easy since it’s $D \times D,$ which is small. And there aren’t that many preferences per user, so solving this many times is really not that hard. Overall we’ve got a do-able update for $U.$

When you fix $U$ and optimize $V,$ it’s analogous: you only ever have to consider the users who rated that item, which may be a pretty large set, but you’re still only inverting a $D \times D$ matrix.

Another cool thing: since each user’s update depends only on that user’s own preferences, we can parallelize this update of $U$ or $V.$ We can run it on as many different machines as we want to make it fast.

There are lots of different versions of this. Sometimes you need to extend it to make it work in your particular case.

Note: as stated this is not actually convex, but similar to the regularization we did for least squares, we can add a penalty for large entries in $U$ and $V,$ depending on some parameter $\lambda,$ which again translates to the same thing, i.e. adding a diagonal matrix to the covariance matrix, when you solve least squares. This makes the problem convex if $\lambda$ is big enough.
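Putting the pieces together, here’s a toy sketch of the regularized alternating least squares loop, with the closed-form per-user and per-item updates from above (the function name, toy matrix, and parameter defaults are all my own choices for illustration, not Matt’s code):

```python
import numpy as np

def als(P, mask, D=2, lam=0.1, iters=100, seed=0):
    """Alternating least squares sketch: factor the preference matrix P
    (mask marks known entries) as U @ V.T. The lam * I term is the
    regularization penalty on large entries of U and V, which also keeps
    each small least-squares solve well-posed."""
    rng = np.random.default_rng(seed)
    m, n = P.shape
    U = rng.normal(size=(m, D))
    V = rng.normal(size=(n, D))
    for _ in range(iters):
        for i in range(m):      # fix V, closed-form solve for each user row
            known = mask[i]
            U[i] = np.linalg.solve(V[known].T @ V[known] + lam * np.eye(D),
                                   V[known].T @ P[i, known])
        for j in range(n):      # fix U, closed-form solve for each item row
            known = mask[:, j]
            V[j] = np.linalg.solve(U[known].T @ U[known] + lam * np.eye(D),
                                   U[known].T @ P[known, j])
    return U, V

# Toy example: a rank-one preference pattern with one entry held out.
P = np.array([[1., 0., 2.],
              [2., 0., 4.],
              [3., 0., 6.]])
mask = np.ones_like(P, dtype=bool)
mask[0, 2] = False  # pretend we don't know this preference
U, V = als(P, mask, D=2, lam=0.01)
print(np.round(U @ V.T, 2))  # predictions, including the held-out entry
```

Each inner solve only inverts a $D \times D$ matrix, which is why the per-user and per-item updates stay cheap and parallelizable.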

You can add new users and new data, and keep optimizing $U$ and $V.$ You can choose which users you think need more updating. Or, if certain users already have enough ratings, you can decide not to update them.

As with any machine learning model, you should perform cross-validation for this model – leave out a bit and see how you did. This is a way of testing overfitting problems.

Thought experiment – filter bubbles

What are the implications of using error minimization to predict preferences? How does presentation of recommendations affect the feedback collected?

For example, can we end up in local maxima with rich-get-richer effects? In other words, does showing certain items at the beginning “give them an unfair advantage” over other things? And so do certain things just get popular or not based on luck?

How do we correct for this?

## Data Science needs more pedagogy

Yesterday Flowing Data posted an article about the history of data science (h/t Chris Wiggins). Turns out the field and the name were around at least as early as 2001, and statistician William Cleveland was all about planning it. He broke the field down into parts thus:

• Multidisciplinary Investigation (25%) — collaboration with subject areas
• Models and Methods for Data (20%) — more traditional applied statistics
• Computing with Data (15%) — hardware, software, and algorithms
• Pedagogy (15%) — how to teach the subject
• Tool Evaluation (5%) — keeping track of new tech
• Theory (20%) — the math behind the data

First of all this is a great list, and super prescient for the time. In fact it’s an even better description of data science than what’s actually happening.

The post mentions that we probably don’t see that much theory, but I’ve certainly seen my share of theory when I go to Meetups and such. Most of the time the theory is launched into straight away and I’m on my phone googling terms for half of the talk.

The post also mentions we don’t see much pedagogy, and here I strongly concur. By “pedagogy” I’m not talking about just teaching other people what you did or how you came up with a model, but rather how you thought about modeling and why you made the decisions you did, what the context was for those decisions and what the other options were (that you thought of). It’s more of a philosophy of modeling.

It’s not hard to pinpoint why we don’t get much in the way of philosophy. The field is teeming with super nerds who are focused on the very cool model they wrote and the very nerdy open source package they used, combined with some weird insight they gained as a physics Ph.D. student somewhere. It’s hard enough to sort out their terminology, never mind expecting a coherent explanation with broad context, explained vocabulary, and confessed pitfalls. The good news is that some of them are super smart and they share specific ideas and sometimes even code (yum).

In other words, most data scientists (who make cool models) think and talk at the level of 0.02 feet, whereas pedagogy is something you actually need to step back to see. I’m not saying that no attempt is ever made at this, but my experiences have been pretty bad. Even a simple, thoughtful comparison of how different fields (bayesian statisticians, machine learners, or finance quants) go about doing the same thing (like cleaning data, or removing outliers, or choosing a bayesian prior strength) would be useful, and would lead to insights like: why do these fields do it this way whereas those fields do it that way? Is it because of the nature of the problems they are trying to solve?

A good pedagogical foundation for data science will allow us to not go down the same dead end roads as each other, not introduce the same biases in multiple models, and will make the entire field more efficient and better at communicating. If you know of a good reference for something like this, please tell me.

## Updating your big data model

When you are modeling for the sake of real-time decision-making you have to keep updating your model with new data, ideally in an automated fashion. Things change quickly in the stock market or the internet, and you don’t want to be making decisions based on last month’s trends.

One of the technical hurdles you need to overcome is the sheer size of the dataset you are using to first train and then update your model. Even after aggregating your model with MapReduce or what have you, you can end up with hundreds of millions of lines of data just from the past day or so, and you’d like to use it all if you can.

The problem is, of course, that over time the accumulation of all that data is just too unwieldy, and your python or Matlab or R script, combined with your machine, can’t handle it all, even with a 64 bit setup.

Luckily with exponential downweighting, you can update iteratively; this means you can take your new aggregated data (say a day’s worth), update the model, and then throw it away altogether. You don’t need to save the data anywhere, and you shouldn’t.

As an example, say you are running a multivariate linear regression. I will ignore bayesian priors (or, what is an example of the same thing in a different language, regularization terms) for now. Then in order to have an updated coefficient vector $\beta$, you need to update your “covariance matrix” $X^{\tau} X$ and the other term (which must have a good name but I don’t know it) $X^{\tau} y$ and simply compute

$\beta = (X^{\tau} X)^{-1} X^{\tau} y.$

So the problem simplifies to, how can we update $X^{\tau} X$ and $X^{\tau} y$?
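In code, the point is that you only ever need those two aggregates, not the raw rows of $X$ (the toy data below is invented for illustration):

```python
import numpy as np

# Hypothetical toy data: 500 observations, 3 signals.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.01, size=500)

# Keep only the two running aggregates; the raw data can be discarded.
XtX = X.T @ X   # the "covariance matrix"
Xty = X.T @ y   # the other term
beta = np.linalg.solve(XtX, Xty)
print(np.round(beta, 2))  # recovers something close to true_beta
```

Once you have $X^{\tau}X$ and $X^{\tau}y$, the coefficient vector is one small linear solve, no matter how many rows went into the aggregates.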

As I described before in this post for example, you can use exponential downweighting. Whereas before I was expounding on how useful this method is for helping you care about new data more than old data, today my emphasis is on the other convenience, which is that you can throw away old data after updating your objects of interest.

So in particular, we will follow the general rule in updating an object $T$ that it’s just some part old, some part new:

$T(t+1) = \lambda T(t) + (1-\lambda) T(t, t+1),$

where by $T(t)$ I mean the estimate of the thing $T$ at time $t,$ and by $T(t, t+a)$ I mean the estimate of the thing $T$ given just the data between time $t$ and time $t+a.$

The speed at which I forget data is determined by my choice of $\lambda,$ and should be determined by the market this model is being used in. For example, currency trading is fast-paced, and long-term bonds not as much. How long does it take the market to forget news or to acclimate to new news? The same kind of consideration should be used in modeling the internet. How quickly do users change their behaviors? This could depend on the season as well: things change quickly right after the Christmas shopping season is done, compared to the lazy summer months.

Specifically, I want to give an example of this update rule for the covariance matrix $X^{\tau}X,$ which really isn’t a true covariance matrix because I’m not scaling it correctly, but I’ll ignore that because it doesn’t matter for this discussion.

Namely, I claim that after updating $X^{\tau}X$ with the above exponential downweighting rule, I have the covariance matrix of data that was itself exponentially downweighted. This is totally trivial but also kind of important: it means that we are not creating some kind of new animal when we add up covariance matrices this way.

Just to be really dumb, start with a univariate regression example, where we have a single signal $x$ and a single response $y$. Say we get our first signal $x_1$ and our first response $y_1.$ Our first estimate for the covariance matrix is $x_1^2.$

Now we get a new piece of data $(x_2, y_2)$, and we want to downweight the old stuff, so we multiply $x_1$ and $y_1$ by some number $\mu.$ Then our signal vector looks like $[\mu x_1, x_2]$ and the new estimate for the covariance matrix is

$M(2) = \mu^2 x_1^2 + x_2^2 = \mu^2 M(1) + M(1, 2),$

where by $M(t)$ I mean the estimate of the covariance matrix at time $t$ as above. Up to scaling this is the exact form from above, where $\lambda = \frac{\mu^2}{1+\mu^2}.$

Things to convince yourself of:

1. This works when we move from $n$ pieces of data to $n+1$ pieces of data.
2. This works when we move from a univariate regression to a multivariate regression and we’re actually talking about square matrices.
3. Same goes for the $X^{\tau} y$ term in the same exact way (except it ends up being a column matrix rather than a square matrix).
4. We don’t really have to worry about scaling; this uses the fact that everything in sight is quadratic in $\mu$, the downweighting scalar, and the final product we care about is $\beta =(X^{\tau}X)^{-1} X^{\tau}y,$ where, if we did decide to care about scalars, we would multiply $X^{\tau} y$ by the appropriate scalar but then end up dividing by that same scalar when we find the inverse of $X^{\tau} X.$
5. We don’t have to update one data point at a time. We can instead compute the ‘new part’ of the covariance matrix and the other thingy for a whole day’s worth of data, downweight our old estimate of the covariance matrix and other thingy, and then get a new version for both.
6. We can also incorporate bayesian priors into the updating mechanism, although you have to decide whether the prior itself needs to be downweighted or not; this depends on whether the prior is coming from a fading prior belief (like, oh I think the answer is something like this because all the studies that have been done say something kind of like that, but I’d be convinced otherwise if the new model tells me otherwise) or if it’s a belief that won’t be swayed (like, I think newer data is more important, so if I use lagged values of the quarterly earnings of these companies then the more recent earnings are more important and I will penalize the largeness of their coefficients less).

End result: we can cut our data up into bite-size chunks our computer can handle, compute our updates, and chuck the data. If we want to maintain some history we can just store the “new parts” of the matrix and column vector per day. Then if we later decide our downweighting was too aggressive or not aggressive enough, we can replay the summation. This is much more efficient for storage than holding on to the whole data set, because it depends only on the number of signals in the model (typically under 200) rather than the number of data points going into the model. So for each day you store a 200-by-200 matrix and a 200-by-1 column vector.
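Here is a minimal sketch of the whole scheme in Python with NumPy; the function names are mine, not from any library. We keep only the running estimates of $X^{\tau}X$ and $X^{\tau}y$, downweight them by $\mu^2$ per period, fold in each chunk, and solve at the end; the optional `prior` is a ridge-style term added only at solve time, so it is never itself downweighted:

```python
import numpy as np

def update_stats(M, v, X_chunk, y_chunk, mu=0.99):
    """Downweight the old accumulated statistics by mu^2 and
    fold in a new chunk's contributions to X^T X and X^T y."""
    M = mu**2 * M + X_chunk.T @ X_chunk
    v = mu**2 * v + X_chunk.T @ y_chunk
    return M, v

def solve_beta(M, v, prior=0.0):
    """beta = (X^T X + prior * I)^{-1} X^T y; with prior=0 this is
    plain least squares on the (downweighted) data."""
    k = M.shape[0]
    return np.linalg.solve(M + prior * np.eye(k), v)
```

As a sanity check, with `mu=1.0` and `prior=0.0` the chunked updates reproduce ordinary least squares on the full data set exactly, and per day you store only the k-by-k matrix `M` and the length-k vector `v`.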

## NYCLU: Stop Question and Frisk data

As I mentioned yesterday, I’m the data wrangler for the Data Without Borders datadive this weekend. There are three N.G.O.’s participating: NYCLU (mine), MIX, and UN Global Pulse. The organizations all pitched their data and their questions last night to the crowd of nerds, and this morning we are meeting bright and early (8am) to start crunching.

I’m particularly psyched to be working with NYCLU on Stop and Frisk data. The women I met from NYCLU last night had spent time at Occupy Wall Street the previous day giving out water and information to the protesters. How cool!

The data is available here. It’s zipped in .por format, which is to say it was collected and used in SPSS, a language that’s not open source. I wanted to get it into csv format for the data miners this morning, but I have been having trouble. Sometimes R can handle .por files, but at least my install of R is having trouble with the years 2006-2009. Then we tried installing PSPP, an open source version of SPSS, and it seemed to be able to import the .por files and then export as csv, in the sense that it didn’t throw any errors, but when we looked we saw major flaws. Finally we found a program called StatTransfer, which seems to work (you can download a trial version for free), but unless you pay $179 for the package it doesn’t actually transfer all of the lines of the file for you. If anyone knows how to help, please make a comment; I’ll be checking my comments. Of course there could easily be someone at the datadive with SPSS on their computer, which would solve everything, but on the other hand it could also be a major pain and we could waste lots of precious analyzing time on formatting issues. I may just buckle down and pay the $179, but I’d prefer to find an open source solution.

UPDATE (9:00am): Someone has SPSS! We’re totally getting that data into csv format. Next step: set up Dropbox account to share it.

UPDATE (9:21am): Have met about 5 or 6 adorable nerds who are eager to work on this sexy data set. YES!

UPDATE (10:02am): People are starting to work in small groups. One guy is working on turning the x- and y-coordinates into latitude and longitude so we can use mapping tools more easily. These guys are awesome.

UPDATE (11:37am): Now have a mapping team of 4. Really interesting conversations going on about statistically rigorous techniques for analyzing human rights abuses. Looking for publicly available data on crime rates, no luck so far… also looking for police officer ID’s in the data set, but those seem to be missing. Looking also to extend some basic statistics to all of the data set, aggregated by months rather than years, so we can plot trends. See it all take place on our wiki!

UPDATE (12:24pm): Oh my god, we have a map. We have officer ID’s (maybe). We have awesome discussions around what bayesian priors are reasonable. This is awesome! Lunch soon, where we will discuss our morning, plan for the afternoon, and regroup. Exciting!

UPDATE (2:18pm): Nice. We just had lunch, and I managed to get a sound bite about every current project, and it’s just amazing how many different things are being tried. Awesome. Will update soon.

UPDATE (7:10pm): Holy shit I’ve been inside crunching data all day while the world explodes around me.

## The basics of quantitative modeling

One exciting goal I have for this blog is to articulate the basic methods of quantitative modeling, followed by, hopefully, collaborative real-time examples of how this craft works out in practice. Today I just want to outline the techniques; in later posts I will go into more detail on one or more of them.

• Data cleaning: bad data (corrupt) vs. outliers (actual data which have unusual values)
• In sample/ out of sample data
• Predictive variables: choosing and preparing which ones and how many
• Exponential down-weighting of “old” data
• Remaining causal: predictive vs. descriptive modeling
• Regressions: linear and multivariate with exponentially down-weighted data
• Bayesian priors and how to implement them
• Open source tools
• When do you have enough data?
• When do you have statistically significant results?
• Visualizing everything
• General philosophy of avoiding fitting your model to the data
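To make one bullet concrete, here is a tiny sketch (hypothetical helper names) of exponential down-weighting: the weight on an observation $k$ periods old is $\mu^k$, and the Kish effective sample size $(\sum w)^2 / \sum w^2$ tells you how much data you “really” have left after down-weighting:

```python
def exp_weights(n, mu=0.97):
    """Weight for an observation k periods old is mu**k (newest first)."""
    return [mu**k for k in range(n)]

def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / (sum w^2)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)
```

With `mu=1` every point counts fully and the effective sample size equals n; as `mu` shrinks, the effective sample size flattens out no matter how much history you keep, which is one way to reason about whether you have enough data.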

For those of you reading this who know a thing or two about being a quant, please do tell me if I’ve missed something.

I can’t wait!

## Thoughts on the future of math education

This is a guest post by Kevin H. Wilson, a data scientist who usually resides in Brooklyn, but is currently in Chicago as a Mentor at the Data Science for Social Good Fellowship. In past lives he’s gotten a Ph.D. in math, worked as a data scientist at Knewton for several years, and continues to oversee programming classes in rural schools for Microsoft’s TEALS program. This note comes from that latter work and associated policy work.

Programming is a Tool and Should be Taught as Such

A very popular trend nowadays is to demand that computer science be taught in schools. Indeed, President Obama has started an initiative to bring computer science to all schools in America. Before that, Mayor de Blasio of New York demanded computer science in all of the City’s schools by 2025. The Hour of Code is basically a mandatory volunteer activity in many tech firms. And a search for high school hackathons or Capture the Flag on Google reveals huge interest in this topic.

These initiatives seem to miss the broader point about computers: they have fundamentally transformed the way that we interact with the world and school should reflect that. As structured now, high school computer science initiatives tend to build programming courses. These courses tend to focus either on the “cool” things you can do with coding, like building games, or the rigorous implementation details of complicated languages, as in AP Computer Science A. Even the better courses, such as AP Computer Science Principles, often constrain the skills learned to a single classroom.

Programming, however, is simply a tool which solves other problems. Specifically, programming is a tool that allows data to be manipulated. Some of that data is static data, like a business’s annual accounts payable and receivable; and some of that data is dynamic streams, like a user’s interaction with a game she’s playing on the computer. Programming’s genius is to abstract these to the same basic paradigm, a paradigm that has made it possible for Google and Facebook and Uber and Blizzard and countless other companies to improve[1] our lives using what, by historical standards, are extremely cheap and accessible devices.

Tools, however, should be taught like tools. To properly teach a tool, it must be used in context and reinforced horizontally (across the school day in multiple subjects) and vertically (across the years as students become more comfortable with more complicated tools). These imperatives have found purchase before, often in the form of encouraging medium- or long-form writing in all subjects,[2] or in the use of (some) math in all science-based courses.[3]  But our balkanized curricula often lead to a perception among students (and, as students become adults, among the general populace) that knowledge is a bunch of tiny islands whose inhabitants are called the “Good at Math” or the “Good at English.”

I believe that computers and their ability to easily manipulate data offer a chance to truly redefine the mathematics curriculum, to make it more horizontal, and to refocus the tools we teach on what is actually useful and stimulating. Statistics, not calculus, should be the pinnacle achievement of high school, not relegated to box-and-whisker plots and an AP course which is accepted by relatively few universities. Algebra, the math of manipulating symbols, should be taught alongside programming. Calculus, a course which I have heard multiple people describe as “easy but for the Algebra,” should be relegated to a unit in Statistics. Trigonometric identities and conics should go away. And earlier math should focus on how and why a student arrives at an answer, and why her procedure always works, not just the answer itself.

The First Bias: Historically Computation was Hard

Why then, if this is such a good idea, hasn’t it happened already? Well, in some limited cases, it has. The Common Core math curriculum has brought statistical modeling to the forefront and clarified the balance between learning facts by rote and understanding why procedures always work. There are beautiful curricula like Bootstrap which place Algebra and Computer Science side-by-side. AP History courses have made understanding primary sources and data important to getting an A, and some teachers have gone so far as to incorporate Excel usage into their classrooms.

But there are extremely long-lived biases preventing more radical transformation. Most interesting to me is that historically statistical analysis was hard. Brahe spent an entire lifetime collecting measurements of the solar system, and Kepler spent another lifetime turning those into general rules.[4]  And the annals of science are littered with famous insights made possible by introducing nicer notation. For instance, Mendeleev, inventor of the Periodic Table, is considered one of the greatest scientists in history simply because he realized that data on atoms was periodic and there was a compact and insightful way to lay out a bunch of numbers that people already had access to!

Programming allows its user to take means or do t-tests or bootstrap or graph or integrate numerically in an instant. These bread-and-butter techniques, as central to statistics as long division is to arithmetic, involved days and days of human computation when the curriculum was last revised. Imagine the process in the 1930s for even finding the median of 500 numbers, a task whose first step is to meticulously sort those 500 numbers. Imagine sorting 10 decks of cards into one big deck of cards. And imagine that as a necessary step to understanding. Such a requirement is a fantastic way to miss the point the first few times, and, since sorting 500 numbers doesn’t get any faster the 20th time you’ve done it, it is a severe impediment to providing reinforcement opportunities.

The Second Bias: Measuring Computational Ability is Easy

This leads to a second bias, which is toward the easily measurable. Statistics, like programming, is really a tool that allows its user to answer questions about the world around them. But the world is complex, and there shall never be a procedure as ordered as those in the traditional high school mathematics curriculum[5] that allows the user to easily capture “the truth.” If there were, then those of us called “Data Scientists” would be out of a job!

This bias toward the easily measurable doesn’t just exist in schools. For instance, Kaggle is a platform for “data science contests.” Basically, teams compete to “best model” some phenomenon present in real data sets using whatever statistical techniques their hearts desire. Typically, in the end, teams submit some code, and Kaggle runs the code on some data the submitter couldn’t see ahead of time and computes a score. The highest score wins.

Any professional data scientist or statistician will tell you this is the easy part. Once you’ve got a nice CSV filled with data, it’s usually pretty clear which battery of models you would probably run on it. Indeed, there’s now a sort of “meta-Kaggle” competition in which academics build algorithms that automatically write solutions to Kaggle problems! These typically do pretty well.

The hard part about statistics and data science is what comes before you even start to model the data. How was it generated? What assumptions does that imply about your data? What does it look like? Does the data look like it reflects those assumptions?[6] And so forth.

And what do you want to do with this data and what does this imply about what metric of success you should impose? If you’re Google or Facebook, you want to sell more ads, so likely you want more profit as your ultimate metric. If you’re the Chicago Department of Public Health and you’re trying to stop children from contracting lead poisoning, then likely your ultimate metric of success is fewer children with lead poisoning. But these are long term metrics, and so how do you translate them into objectives that you can actually train against?

These questions are the hard ones, and proficiency in answering them is much harder to measure than filling in a few bubbles on a standardized test. They must include writing, long form, explaining choices made and why those choices led where they did.[7] Of course, this sort of mathematical and statistical long form writing isn’t what we typically think of as writing in schools. Instead we imagine portfolios of fictional stories or persuasive essays. This writing would be filled with tables and math and charts and executive summaries, but its ultimate goal, persuading the reader to accept its conclusions, is a completely familiar one.

To assess these skills, we must teach teachers how to teach a new form of writing, and we must grade it. Of course, long form writing takes much more time to grade than multiple choice answers, and so we must find new ways to grade this writing.

The Third Bias: Learning Happens only in the Classroom

This brings us to a third bias which prevents the curriculum from changing: the troubling view that the classroom is the sole responsibility of the teacher. This view leads to many bad behaviors, I think, but most relevant here is simply the fact that teachers and teachers alone must grade literally everything that students produce in the class. But what if some of the grading could be outsourced, or perhaps “insourced”? What if students could grade each other’s work?[8] What if teachers from other schools could grade students’ work? What if parents could grade students’ work? What if parents could grade the work of students who aren’t their children? What if members of the community at large could grade students’ work? What if somebody from the next state over or the next country over could grade students’ work?

This idea is not new. Science fairs are often graded by a “distinguished panel of (non-)experts” and AP tests which involve essays are graded in big hotel ballrooms by college faculty and high school teachers. Students critiquing each other’s work is often an integral part of creative writing classes, if not English classes in middle and high schools. In some places, they’re even letting community members grade some projects and classes.

Moreover, computers, in their capacity to move data around at will, can facilitate this process greatly. Among other things, I work with TEALS, a program out of Microsoft which helps start programming classes in schools. In particular, I help coordinate and train volunteers who live in big cities to teach programming classes for students in far-flung areas of the country. They rely on systems such as Canvas, Edmodo, and Google Classroom to interact with students on a daily basis, to collect and assess homework, and to plan classes with teachers.

The Fourth Bias: Teachers Must be Trained

TEALS was built, indeed, to overcome the final bias I’ll mention preventing change: teachers know how to teach the current curriculum and teacher training programs are geared toward preparing teachers to teach this curriculum. There are extremely few opportunities for teachers to learn to teach new classes or even for teachers to learn new techniques! Teachers rarely observe, much less critique, other teachers, and the current teacher promotion system typically involves jumping to administration.

This is ludicrous. Every single classroom is a hotbed of experimentation. Each child is slightly different, and every area of the United States has its own norms that affect the way students learn and cooperate. Yet teachers are given very little time to reflect on their teaching, to observe each other, or to, heaven forbid, write about their work in local, regional, or national journals and conferences. It is not at all implausible to imagine a teacher promotion system which includes an academic (as in “the academy”) component.

But all this is to say that teachers, for all their professionalism and hard work, are given very few opportunities to learn and teach new subjects. And education schools, bound to churn out teachers who can tick off various certification requirements and pass particular exams, find it hard to train teachers in rarely-taught subjects. And if a teacher coming to teach a single new and interesting course is so hard, imagine how hard it would be for them to learn an entirely new curriculum or for an education school to begin to support it!

This is certainly not a theoretical concern. Common Core has gotten so much negative press in part because of an extremely botched rollout plan.[9] Teachers were not trained in it, new textbooks and other materials to support it were not ready, and the tests meant to evaluate progress in the standards were, like all new measurement devices, faulty. And this for a set of standards that, while radical in many respects, still had the same shape as what we have been teaching for a century.

On the Other Side of the Fence: Community-Based Projects

What then would lie on the other side of change if roadblocks like these could be removed? Let’s start at what I think would be the best possible end goal: a project that high school seniors would complete before graduation that would serve as the culmination of their years of study. From there we can work backwards.

What I imagine is a project which explicitly uses all the tools students have learned over their years of high school to advocate for change in their communities. This could take many forms depending on the focus the student wants to take. For instance, students focused on writing could write op-eds detailing the history of something that troubles their community and advocating for realistic change. Or perhaps, if journalism is not their cup of tea, they could write a piece of fiction which has at its heart some spiritual conflict recognizable to those in their community.

What most interests me, though, is the sort of work that computers and statistics could open up. Imagine a project in which students identified a potential problem in their community, collected and analyzed data about that problem, and then presented that report to someone who could potentially make changes to the community. Perhaps their data could come from public records, or perhaps their data could come from interviews with community members, or from some other physical collection mechanism they devise.

Imagine a world where students build hardware and place it around their community to measure the effects of pollutants or the weather or traffic. Imagine students analyzing which intersections in their town see the most deaths. Imagine students looking at their community’s finances and finding corruption with tools like Benford’s law.
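As a sketch of the kind of exercise that last idea suggests (the function names are illustrative, not from any curriculum): Benford’s law predicts that the leading digit $d$ of many naturally occurring numbers appears with frequency $\log_{10}(1 + 1/d)$, and a student can verify this on, say, the powers of 2 before turning the same comparison loose on a municipal ledger:

```python
import math
from collections import Counter

def leading_digit(n):
    """First digit of a positive integer."""
    return int(str(n)[0])

def benford_expected(d):
    """Benford's predicted frequency for leading digit d."""
    return math.log10(1 + 1 / d)

# The leading digits of 2^1 .. 2^999 famously follow Benford's law.
counts = Counter(leading_digit(2 ** k) for k in range(1, 1000))
observed = {d: counts[d] / 999 for d in range(1, 10)}
```

On real finance data the same digit-by-digit comparison flags ledgers whose entries deviate suspiciously from the predicted frequencies, which is exactly the forensic use students could put it to.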

Or for those who do not come up with an original idea, imagine continuing a long running project, like the school newspaper, but instead the school’s annual weather report, analyzing how the data has changed over time.

All of these projects require a broad range of skills which high schoolers should be proficient in. They require long to medium term planning, they require a reasonable amount of statistical knowledge, they require the ability to manipulate data, they require an understanding of historical trends, and they require the ability to write a piece of persuasive writing that distills and interprets large numbers of facts.

Moreover, such projects have the potential to impact their communities in profound ways. Places like the coal towns of Appalachia are desperately attempting to make their towns more amenable to investment, both in terms of dollars from outside capitalists and in terms of years of life from their progeny. From time to time I have the opportunity to ask kids in Eastern Kentucky whether they planned to stay in their hometowns after their high school graduation, and I have yet to receive a single “yes.”[10] Towns who rally around training their students to change their own thinking, I believe, will receive huge dividends.

Of course, we can daydream about these projects’ effects, but what sorts of curriculum would actually support them? I won’t pretend to remake the entire K-12 curriculum here, nor do I think such a reimagining at all practical, so let’s focus on the high school mathematics curriculum.

What Curriculum is Necessary to Support these Projects?

1. Programming and Algebra merge

First, we must teach programming. There is no hope for doing data manipulation if you don’t understand programming to some extent. The question is when and how. I believe that algebra and introductory programming are extremely synergistic subjects. I would not go so far as to say they are interchangeable, but they are both essentially arithmetic abstracted. Algebra focuses a bit more on the individual puzzle, and programming focuses a bit more on realizing the general answer, but beyond this, they fundamentally amount to the realization that when symbols stand in for data, we may begin to see the forest and not the trees.

And just how might these two things be interwoven? Well, we have some examples of what might work. The Common Core, for example, emphasizes “computational thinking” in its mathematics curricula for all grade levels, which essentially means encouraging students to learn how to turn their solutions to specific problems into solutions for more general problems.[11] As such we’re seeing a large number of new teaching materials reflect this mandate. Perhaps my favorite of these is Bootstrap, which I would highly recommend checking out.[12]

2. Geometry is replaced by Discrete Math and Data Structures

Programming, though, is only a means and not an end, so how will we employ it? Next in the traditional curriculum we find geometry. Geometry is officially the study of space and shapes, but traditionally in America it is the place where we teach students formal logic. We drill them on the equivalence of a statement and its contrapositive, we practice the masochistic yoga of two-column proofs, and we tease them with questions such as “is a quadrilateral whose opposite sides are congruent a parallelogram?”

But there isn’t anything particularly special about the SSS and AA rules when it comes to constructing logical puzzles. These sorts of puzzles are simply meant to teach their players how to produce strings of implications from collections of facts. For instance, Lewis Carroll famously constructed nonsensical logic puzzles for his tutees which entertained while abstracting the actual logical process from the distracting influences of reality.

While I find these sorts of logical puzzles entertaining, I don’t think they’re nearly as useful to students as deriving the facts they will prove themselves. Imagine instead a course in discrete math and data structures. In this course, students would still be asked to construct proofs, but the investigation of the facts would involve programming numerous examples and extrapolating the most likely answer from those examples.

Students would come much more prepared to answer questions in discrete math having essentially become familiar with induction and recursion in their programming classes. Students could also empirically discover that sorting a random list with merge sort takes quasilinear time, and then they could go forth and prove it!

Many of these types of empirical studies would also be the beginning of a statistical education. Plotting times for sorting lists of the same size would introduce the concepts of “typical” and “worst” cases, as well as the idea of “deviance”, which are at the very center of statistical conundra.
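A sketch of that empirical exercise (illustrative code, not a prescribed lesson): implement merge sort, time it on random lists of doubling size, and watch the times grow slightly faster than linearly, which is the quasilinear behavior students could then go prove:

```python
import random
import time

def merge_sort(a):
    """Recursive merge sort; returns a new sorted list."""
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

# Time the sort on random lists of doubling size.
for n in (1000, 2000, 4000, 8000):
    xs = [random.random() for _ in range(n)]
    start = time.perf_counter()
    merge_sort(xs)
    print(n, round(time.perf_counter() - start, 4))
```

Repeating each timing many times and plotting the spread is precisely where the vocabulary of “typical case,” “worst case,” and “deviance” enters.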

3. Algebra II begone! Enter Statistics based on Open Data Sets

This then would lead to the next course, a replacement for the traditional and terrible Algebra II. This course, which includes some subset of solutions to systems of (in)equalities, conic sections, trigonometry, and whatever else the state decided to cram in,[13] is generally a useless exercise, where there really is no good answer to the ever-present question of, “Why do we need to know this?”

Thus, I would propose to replace this course wholesale with a course on statistics, expanding on the statistical foundation laid by our data structures course. Since students would now have experience in programming and data structures, we can go much, much further than a traditional statistics course. We would still teach about means and medians and z-tests and t-tests, but we can also teach about the extraordinarily powerful permutation test. Here students can really come to understand the hard lessons about what exactly randomness and noise are and why these tests are necessary.
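A sketch of what students could build themselves (hypothetical names): a two-sample permutation test on the difference of means, which shuffles the pooled data many times and asks how often a random split produces a difference at least as extreme as the observed one:

```python
import random

def permutation_test(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in means.
    Returns the fraction of shuffles at least as extreme as observed."""
    rng = random.Random(seed)
    observed = sum(a) / len(a) - sum(b) / len(b)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= abs(observed):
            extreme += 1
    return extreme / n_perm
```

Because it is the student’s own shuffle loop rather than a table lookup, the p-value stops being magic: it is literally a count of how often chance alone does as well as the data did.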

Moreover, in traditional statistics courses like AP Statistics, students are usually taught various rules of thumb about sample sizes being sufficiently large and are asked to apply these rules to various fictional situations. But there are a huge number of massive data sets available nowadays, which they could not manipulate without their programming experience. The focus should move away from memorized rules of thumb for small samples to the actual analysis portion and the implications of their explorations for society.[14]

Projects in this course would be multipage reports about exploring their data sets. They would include executive summaries, charts, historical analysis, and policy recommendations. This is a hugely important form of writing which is often not a part of the high school curriculum at all.

4. Machine Learning subsumes Calculus; Calculus becomes a one-month unit

Finally, the capstone class, for the most advanced students, would move away from Calculus and instead into Machine Learning. The typical way this course is taught in colleges nowadays is as an overview of various mathematical and statistical techniques from across the subject, though perhaps the two major themes are linear algebra, especially eigenvectors, and Bayesian statistics, especially the idea of priors, likelihoods, and posteriors. Along the way students would pick up all the Calculus they’ll likely need as they learn about optimizing functions.
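To give a flavor of the Bayesian half, here is the standard Beta-Binomial textbook example (my function names, not any course’s): with a Beta prior on a coin’s bias and Binomial data, the posterior is again a Beta whose parameters just add the observed counts:

```python
def beta_binomial_update(alpha, beta, heads, tails):
    """Beta(alpha, beta) prior + Binomial data -> Beta posterior.
    Conjugacy makes the update pure bookkeeping."""
    return alpha + heads, beta + tails

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)
```

Starting from the flat prior Beta(1, 1) and observing 7 heads and 3 tails gives a Beta(8, 4) posterior with mean 2/3: the prior pulls the raw frequency 0.7 gently toward 1/2, and a stronger prior would pull harder.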

Indeed, such a course is already being taught at some of the more elite schools in the country, and I have no doubt that anybody who could climb their way to an AP Calculus course, if taught a curriculum like the one outlined above, would be able to approach a machine learning course.

Of course, as mentioned above, the real capstone of this course of study would be the capstone project. The three previous classes contain all that is necessary to be able to approach such a project, though many other classes that students might take could be brought to bear in spectacular ways. History courses could help students put what they learn into the context of the past; biology courses might yield fruitful areas of study, e.g., around pollution; journalism courses might lead to an interest in public records.

And all throughout, the community would be involved. Perhaps they would serve as mentors for these capstone projects, or perhaps they would help grade some of the more specialized projects during the junior year. Or even better, maybe the final exam for the introductory programming course would involve teaching an Hour of Code to community volunteers. And of course the capstone project would focus around the community itself.

Why would this be better? Lessening the linearity of mathematics

One immediate pushback I’ve gotten when I tell people this story is to ask why I think kids will perform better with this curriculum than the one we have now. Isn’t this one even harder? To which my answer is: yes, but it is both more interesting to students and their communities, and it begins to solve the problem of mathematics’ notoriously linear structure. To understand tenth grade math requires understanding ninth grade math, which requires understanding eighth grade math, and so forth. Moreover, there are very few places where students who somehow fell behind have a place to catch up. This wall persists even into adulthood, with many parents dreading the day they have to tell their kids, “Oh, honey, I never understood that stuff.”

This mathematical linearity is quite different from traditional humanities curricula. In these curricula, the true emphasis is on practicing the skills of history or the styles of writing or the understanding of culture. And while History has themes and English has great turns of phrase that should be memorized, missing a few for any particular reason does not preclude the student from jumping back into the subject next time around. That great writers spent their youth ignoring their teachers or otherwise avoiding traditionally educational activities speaks to the flexibility of these subjects to welcome “late bloomers.”

And while the proposed math curriculum does not completely refactor out prerequisites, it does begin to weaken them. This, I think, is a good thing for getting more students on board. The focus shifts from performing specific tasks (like manipulating one trigonometric expression into another) to being able to constantly improve a set of skills, specifically, looking out into the world, identifying a problem, collecting data on that problem, and using that data to help determine means to address that problem.

These skills, identifying problems and supporting the analysis of those problems with facts, are of paramount importance. Indeed, the Common Core State Standards for English and Language Arts bring up this point as early as the Seventh Grade.[15] But as data become easier to gather and process, “facts” shall come more and more to mean monstrous collections of data. And being able to discern which “facts” are plausible from these collections of data becomes more and more important.

What next?

There are many obstacles to this dream, even without the status quo biases that I discussed at the beginning. Even the simple job of building materials, much less the community and teacher infrastructure, to support this change is massive and will take years. And though the Common Core standards are reasonable, the move to extreme standardization of the schools does preclude curricular experimentation on the part of individual schools and teachers.

Where next? Immediately, the first order of business is to understand if such a high school curriculum could be built without changing the middle and elementary school curriculum too much, since changing four years worth of curriculum is already extremely disruptive.

Assuming that is the case, then there are several possibilities. One is to take the route of the Bootstrap curriculum and explicitly teach specific skills required by the current curriculum while supplementing them with computer science concepts. This runs into the problem that the school day is already pretty full, especially for high-achieving kids, and adding in new “requirements” would burden them.

Another route is to build a charter or charter-like school around such a curriculum, forsaking the traditional standardized tests. This has the problem of being risky: if the curricular idea is terrible, then these kids will be disadvantaged relative to their peers.

Whichever way is chosen, the process will be long. It will involve the hours of many people, not just those writing curriculum but also members of the community, who by design will be involved in the week-to-week running of the courses, and it will involve the training of many educators in a relatively new type of math curriculum.

Footnotes

1. Some would quibble with the word “improve.” If, dear reader, you are such a person, I implore you to replace this with “radically transform.”
2. Well, except often in math, where even though mathematicians have been writing long form proofs for years, students are often stuck with the terrible two-column variety.
3. Though, traditionally the “vertical” reinforcement of math has gone off the deep end into the various properties of conic sections and the opaque relationships between trigonometric functions without the aid of complex numbers. Common Core actually does a fair bit to help on this front.
4. Though maybe he faked it.
5. Long division, taking determinants, solving polynomials, taking formulaic derivatives all spring to mind, though there are many more.
6. A piece of advice to aspiring data scientists: If you are applying for a job and they ask you to do a written test ahead of time, there should be at least one plot in your writeup. Unless your solution is brilliant, you aren’t getting hired if there’s not at least one plot.
7. To what I think is its tremendous credit, this sort of writing is integral to the PARCC tests developed for Common Core-aligned curricula in some states. I have not had the chance to review the competing test, called Smarter Balanced, but I would expect it to be similar.
8. There are actually many teachers who use peer grading, and also quite a bit of research on its effects, some good, some bad. The point here is that we should be open to using novel methods of grading, and especially interested in exploring how computers can facilitate these novel methods.
9. What I do not talk about here but which is also an essential problem with any change to the curriculum is that parents play a huge role in their children’s education, and so any change to the curriculum that involves reeducating teachers must also, to some degree, involve reeducating parents. Since this piece is about high school, by which time many parents have already “given up” on helping their students with homework because they are not “Good at Math” (a fact I do not have hard numbers for, but I have commonly experienced among my students), I’m leaving this massive issue out of the main text.
10. Of course, take this with a grain of salt. I tend to only get to ask this question of kids in computer science classes.
11. These solutions often take the form of “algorithms,” which are central to computer science, and thus the name “computational thinking.”
12. Perhaps my favorite aspect of the Bootstrap curriculum is that they emphasize professional development, a woefully underappreciated aspect of improving the curriculum.
13. There is no universal definition of Algebra II as far as I know. However, the Common Core has gone a long way to standardizing a definition. The PARCC Model Content Frameworks may be useful for the interested.
14. This is not to say that warnings about small samples shouldn’t be ingrained into students as well, but here large data sets can help as well. For instance, a simple exercise for the whole class could involve giving every student a randomly sampled set of 20 rows from a very large data set and asking them to run some sort of analysis. In the end, each student would come to vastly different conclusions, and thus, come to learn that sample size matters.
15. See CCSS.ELA-LITERACY.RI.7.9, which states, “Analyze how two or more authors writing about the same topic shape their presentations of key information by emphasizing different evidence or advancing different interpretations of facts.”
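The classroom exercise described in footnote 14 is easy to sketch in code. The following is a minimal, hypothetical Python version: we stand in a synthetic data set for the "very large data set," hand each of 30 simulated students a random 20-row sample, and compare the spread of their conclusions (here, sample means) to the truth. All names and numbers are illustrative assumptions, not part of the original text.

```python
import random
import statistics

random.seed(0)  # make the simulated classroom reproducible

# A stand-in for the "very large data set": 100,000 synthetic values.
population = [random.gauss(50, 15) for _ in range(100_000)]

# Each "student" analyzes a random sample of just 20 rows.
class_size = 30
student_means = [
    statistics.mean(random.sample(population, 20))
    for _ in range(class_size)
]

# The students' answers disagree wildly compared to the full-data answer,
# which is the intended lesson: sample size matters.
print(f"Full-data mean:        {statistics.mean(population):.1f}")
print(f"Spread of class means: {min(student_means):.1f} to {max(student_means):.1f}")
```

With only 20 rows each, the students' estimates scatter noticeably around the true mean; rerunning with larger samples shrinks the spread, which makes the sample-size point concrete.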

Well hello there, cutie, and welcome. Aunt Pythia loves you today, even more than usual!

For some reason she can’t pinpoint, but probably has to do with a general feeling of happiness and fulfillment, Aunt Pythia is even more excited than usual to be here and to toss off unreasonably smug and affectionate opinions and advice. Buckle up and get ready for the kisses and the muffins.

The kisses are harder to picture but they are even more delicious.

Everyone set? OK, fabulous, let's get going. Oh, and by the way: when you get to the bottom of the column, please please think of something to ask Aunt Pythia!

I am almost out of questions!!!

By the way, if you don’t know what the hell Aunt Pythia is talking about, go here for past advice columns and here for an explanation of the name Pythia.

——

Dear Aunt Pythia,

How should one deal with sexism and harassment at conferences?

As a white heterosexual male mathematician, I don’t experience much bias against me in my professional life, but I’ve seen (and heard of) a lot of bad stuff happening against anyone not conforming to this norm, which I think is not only bad for the people who experience this, but also bad for mathematics as a whole, for various reasons.

At a recent specialized conference, one of the participants (a grad student) was very obviously sexually interested in one of the other grad students (one of only two female participants; my field has some serious problems in this regard), who was clearly not interested (and married).

I didn't know these people before the conference, and beyond saying to the harassing person that she was married and that he shouldn't annoy her (which of course had no impact), I didn't do anything. I would have liked to somehow help the harassed party feel welcome, and to communicate that, besides that one jerk, people were interested in her mathematical ideas, but I didn't know how to do that without making it seem inappropriate. So instead I kept silent, which feels bad. Is there anything I could do next time I'm in this type of situation, besides trying not to be a jerk?

Dr. Nonheroic Observer

Dear Dr. NO,

I gotta say, I love your question, but it’s kind of spare on details. What did the guy do? How much did it annoy the married party? It really matters, and my advice to you depends on those facts.

When I think about it, though, I don’t see why the fact that she’s married matters. Speaking as a 17-year married person (as of today!), married people like to flirt sometimes, so it’s not as if it’s intrinsically harassing for someone to express interest in a married person, or for that matter a single person.

But as soon as someone responds with a “not interested” signal, it is of course the responsibility of the interested party to tone it down.

Let me go into three scenarios here, and tell you what I think your response should be in each.

First, the guy likes her. You said it was obvious he was interested and it was also obvious she wasn’t. Depending on how that played out, it could be totally fine and not at all your responsibility to do anything. So, if he was like, hey would you like to go on a walk? and then she said, no thanks I’m going to get some work done and that was that, then whatevs. Again, not holding anything against someone for interest per se.

Now on to the second scenario, which seems more likely, since you mentioned that he annoyed her in spite of your advice to him. So that means he followed her around a lot and generally speaking glommed on her, which probably means he obstructed her normal interaction with other mathematicians at the conference. This is a big problem, because conferences are when the “mathematical socializing” happens, which very often results in collaboration and papers. The fact that men glom onto women prevents that, and might be a reason women don’t join your field.

Your responsibility, beyond telling the guy to lay off, which you did, is to first of all talk math with her explicitly, so she gets some mathematical socializing done. Also be proactive in introducing her to other people who are good math socializers.

Beyond that, I think you need to tell the guy to stop a second time. Ask the guy to think about why she came to the conference, and what she wants and doesn’t want out of the experience. In other words, make him try to think about her perspective rather than his own dick’s perspective. Who knows, it might help, he might just be super nerdy and not actually an asshat.

If that doesn't work, and he is in fact an asshat, I suggest you go to her and ask her if he is bothering her. Pretending not to notice isn't helping her, and she probably has nobody to appeal to and could use an ally. If she says yes, then with her permission, go back to the guy and tell him he is officially bothering her. I'd guess that would actually work.

Third scenario is when even that doesn’t work, in which case I would go to the organizer of the conference and suggest that the harasser be asked to leave the conference.

I’d be super interested to hear your thoughts, and in particular what you think would happen if you had actually gone to the organizers. Of course, if you were one of the organizers yourself, I’d say you should have threatened the guy with expulsion earlier on.

Write back and tell me more details and tell me whether this advice was helpful!

Aunt Pythia

——

Dear Aunt Pythia,

Why EW? What is wrong with "He went on way too many dates too quickly"? What makes you the judge of what constitutes too many, when you yourself admit that you "have taken myself out of the sex game altogether – or at least the traditional sex game"? Your opinion on the traditional sex game (which is exactly what this guy is playing) is clearly biased. He is a modern empowered man who is exploring his options before settling down. What you wrote is nothing different from "slut-shaming," just with the gender reversed. I hope you will exercise greater sensitivity in future posts.

NY_NUMBERS

Dear NY_NUMBERS,

I think you must be referring to my response to Huh in this past column in reference to the alleged math genius who “hacked” OK Cupid. And I think you misunderstand me.

I am all for slutty behavior. In fact I am super sex positive. If the guy were just trying to get lots of great sex with lots of amazing women, then more power to him. I’d tell him about Tinder and I’d even direct him to critiquemydickpic for useful and amusing advice.

But actually he was having one or two dates per day looking for love. What?! That’s way too much emotional drainage. How can anyone remain emotionally receptive if they can’t even remember people’s names? I’d be much much happier for him, and I wouldn’t be judgmental, if he had been bringing home a different woman every night for mind-blowing sex. Youth!!

So, if you want to complain about my "ew", then I think you'd need to say that, if someone can fuck anything that moves, they should also be able to love anything that moves. I'm not sure there's a name for this, but maybe "love-shaming"?

In any case, I stand by my “ew”: I don’t think loving one or two people per day is possible. And the woman he ended up with found him, which was different and broke his cycle, kind of proving my point.

Aunt Pythia

——

Dear Aunt Pythia,

I'm a statistician with four or so years of work experience, currently in the last half year or so of an applied Bayesian stats PhD. I have seen the rise of R and of statistics as a hot, talked-about subject. And for some reason, I am getting nervous about all the new cool kids who play around on Kaggle; that they will take ALL THE JOBS, and that there will be no space for slightly less cool, more classically trained statisticians such as myself. After all, all we're doing is a bit of glm running, or a cluster analysis, or some plotting. A monkey could learn that in three months. Sometimes I wish everyone would stay away and let me have all the datasets for myself.

Am I being unreasonably nervous about the future?

Have Stats Want to Analyze

Dear HSWA,

First, I wanna say, I had high hopes for your sign off until I wrote it out. Then I was like, wtf?! I even googled it but all I came back with was the Hampton Shaler Water Authority. And I am pretty sure that’s not what you meant. And keeping the “t” in didn’t help.

Second, I've got really good advice for you. Next time you're in an interview, or even just on a bus with someone sitting next to you who allows you to talk, mention that Kaggle competitions are a shitty bar for actual data scientists, because most of the work of a data scientist is figuring out what the actual question is, and of course how to measure success.

Those things are baked into each Kaggle competition, so hiring people who are good at Kaggle competitions is like hiring a chef who has been supplied with a menu, a bunch of recipes, and all the ingredients to run your new restaurant. Bad idea, because coming up with those things is the job of the chef, if he's actually good. In other words, it's not actually all that impressive to be able to follow directions; you need to be creative and thoughtful.

Make sure you say that to your interviewer, and then follow it up with a story where you worked on a problem and solved it but then realized you’d answered the wrong question and so you asked the right question and then solved that one too.

I’m not nervous for you, thoughtful statisticians are in high demand. Plus you love data, so yeah you’re good.

Good luck!

Aunt Pythia

——

Dear Aunt Pythia,

I’ve been working as faculty in a new department this year and I have repeatedly had the feeling that the support staff is not treating me the way they would if I were 50 and male instead of young and female (although with the rank of professor).

It’s small things like roundly scolding me for using a coffee mug from the wrong cupboard, or hinting I should make sure the kitchen cleaning is easy for staff (I’m not messy!), or the conference support staff ceasing to help with basic support on a conference (and complaining about me to other people), or wanting me to walk some mail to another building.

I realize this is all small potatoes. But I have started to feel like by just taking it passively (e.g. smiling and nodding) I might be saving myself time and anger now but I’m helping to perpetuate the system. I rigorously avoid confrontation and I think I’m typically regarded as a very friendly and helpful team player by my peers. (How could I prove bias anyway, and would confrontation help?). But I’m not sure I can spend my whole life putting up with small potatoes along with the bigger potatoes I encounter from time to time.

Spud Farmer Considering Pesticides

Dear SFCP,

First of all, again, disappointed your sign-off didn’t spell anything. But will let it pass.

Second of all, my guess is that they are sexist. I have a prior on this because I’ve encountered so much sexism in this exact way.

Third of all, I’m also guessing they are administrative people in academia, which means they are also just barely able and/or willing to do their jobs. Again, experience, and since I am administration now in academia, I am allowed to call it. Some people are great, most people are not.

Fourth, I don’t know why you are “rigorously avoiding confrontation” here. The very first thing you should do is choose your tiny battles wisely and create small but useful confrontation. Examples:

• Someone asks you to mail a letter. You say, “oh who usually mails letters? I will be sure to bring it to them.”
• Someone doesn’t want to do their part in helping with basic support on conferences. You say, “Oh that’s not your job? I am so sorry. Who should I be asking for help on this?”
• Someone scolds you for using the wrong coffee cup or some such nonsense. You say, “I am new here and I don’t know the rules but I will be sure to remember this one! I am one of those people with a strong work ethic, and it’s great to see how people around here pull together and make things happen.” You know, be aspirational.

Fifth, if it comes to it, get a faculty ally to explain which staff are bitter and why, which of them are just plain nuts, and which ones do everyone else's jobs. Useful information. Make sure it's an ally! Complaining about this stuff to the wrong person could give you a reputation as a complainer.

Sixth, do not let this stuff build up inside you! Make it an amusing part of your day to see how people wiggle out of their responsibilities and blame other people for their mistakes. And keep in mind that the faculty are probably the biggest and best examples of such behavior.

Love,

Auntie P

——

Please submit your well-specified, fun-loving, cleverly-abbreviated question to Aunt Pythia!
