## Aunt Pythia’s advice

You’ve stumbled upon yet another week’s worth of worthy questions that will be awkwardly sidestepped by mathbabe’s alter ego Aunt Pythia.

By the way, if you don’t know what you’re in for, go here for past advice columns and here for an explanation of the name Pythia. Most importantly,

**Please submit your question at the bottom of this column!**

**I’ve officially run out of questions so this is for real.**

**Please come up with something before I do.**

——

*Dear Aunt Pythia,*

*I just moved to NYC from a small university town, and I’m finding it much harder to meet nerd girls. Most of the nerd hangout spots that I’ve found are male dominated, and I meet mostly artists at the bars and coffee shops. Do you have any suggestions beyond trolling the nearest physics department?*

*Nice, Easygoing Roamer Drawn Swiftly Around Real, Engaging Hackers On Town*

Dear NERDSAREHOT,

Let me suggest you enroll in Meetup yesterday and sign yourself up for all the nerd meetups you can find. There are plenty of cute nerd girls who go to those, and it’s a perfect situation for you to ask someone to have a beer afterwards. Also consider getting involved in weekend hackathons, which attract lots of nerd girls as well.

By the way, these events are still male dominated, but that’s a *good* thing. Nerd girls should have their pick. It’s one of the many advantages of being a nerd girl and it ain’t going away.

Aunt Pythia

——

*Dear Aunt Pythia,*

*I recently got a job as a data scientist, and I’m feeling like my stats skills are woefully inadequate. I have a master’s in pure math and I work as a programmer, but I’ve never taken a statistics class. What books would you recommend I read to get up to speed on statistics? I’m looking for something with examples that’s applicable to my work (not too much definition/theorem/proof), but that isn’t scared of the math.*

*Regretting Spurning Statistics*

Dear RSS,

Congratulations! Can you write back and tell everyone how you got the job? Guest post?

Honestly I learned stats (the stuff I know anyway) by reading Wikipedia extensively. It’s surprisingly good. Also, the book I’m writing with Rachel Schutt will contain some good explanations of how stats is used in data science, thanks of course to Rachel, not me. She’s working on the causality chapter right now.

In general my advice to you is, draw lots of pictures, including a histogram as well as a time-value scatter plot of every data set you use, and every data set you generate as well. You’d be surprised by how quickly you learn the statistics that is relevant to your dataset when you’re intimately familiar with its properties.
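If you want to try this before you have picked a plotting library, here is a minimal sketch in plain Python. It covers only the histogram half of the advice, prints it as text, and runs on made-up data; the names `bin_counts` and `print_histogram` are mine, not from any library.

```python
import random
from collections import Counter


def bin_counts(values, bin_width):
    """Return {bin_start: count}, with each value assigned to the bin it falls in."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def print_histogram(values, bin_width, scale=1):
    """Print one row per non-empty bin; each '#' stands for `scale` observations."""
    for start, count in bin_counts(values, bin_width).items():
        print(f"{start:8.1f} | {'#' * (count // scale)}")


# Made-up data standing in for a real column of your dataset.
random.seed(0)
data = [random.gauss(100, 15) for _ in range(500)]
print_histogram(data, bin_width=10, scale=5)
```

Once you are past the first look, a real plotting tool will serve you better, especially for the time-value scatter plot, which text output handles badly.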

Good luck!

Aunt Pythia

——

*Dear Aunt Pythia,*

*I have been reading up on regression to the mean originally as described by Galton. He notes that the sons’ height data had reduced variance versus the height data of the preceding fathers’ generation. If this is so, wouldn’t the grandsons’ generation have even more reduced variance in height compared with the 2nd generations’ height…and so on down the generation lineage. Therefore wouldn’t the variance in succeeding generations get narrower and narrower and approach some limit? Where am I going wrong with this, or am I misunderstanding something?*

*MeanIQ*

Dear MeanIQ,

Thanks for bringing this to my attention. It’s clearly an important historical part of linear regression, and I’d never heard of it.

You’re absolutely right to think that Galton was wrong. Galton’s working theory was that two people have children by *averaging their characteristics*, which is just not how genetics works (as we now know). Not only would what you say be true, that after a few generations everyone would be the exact same height, but we’d also see that, if you went *backwards* in time, there’d be people of arbitrary height, tall and short.

As for why he saw larger variance in older generations, my best guess is that he had a selection bias. Maybe the decreasing variance he observed was due to environmental factors such as the quality and size of the local food supply, where the “current” generation were localized (and so more consistent) but the “older” generation had come from various other places where they were either better fed or less well fed, which would lead to an increased variance.

There’s another totally different interpretation for the phrase “regression to the mean” which is also confusing though. Namely, the idea that if your first measurement of something is extreme, then your second measurement will tend to be less so. The problem with this is that you have to have a notion of “extreme” in the first place. And if you do, then it’s kind of obvious (and also kind of dumb).
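To see this second sense in action, here is a rough simulation. It assumes each measurement is a fixed true value plus independent noise. The specifics (20,000 subjects, unit variances, and “extreme” meaning the top 5% on the first measurement) are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(1)
N = 20000

# Assumed model: each subject has a fixed true value, measured twice with fresh noise.
true_value = [random.gauss(0, 1) for _ in range(N)]
first = [t + random.gauss(0, 1) for t in true_value]
second = [t + random.gauss(0, 1) for t in true_value]

# "Extreme" here means the top 5% on the first measurement (an arbitrary cutoff).
cutoff = sorted(first)[int(0.95 * N)]
extreme = [i for i in range(N) if first[i] >= cutoff]

mean_first = statistics.mean(first[i] for i in extreme)
mean_second = statistics.mean(second[i] for i in extreme)
print(f"extreme group, first measurement:  {mean_first:.2f}")
print(f"extreme group, second measurement: {mean_second:.2f}")
```

The group picked for being extreme the first time is still above average the second time, just less so, because part of what made it extreme was noise that does not repeat.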

Aunt Pythia

——

*Dear Aunt Pythia,*

*Is the Mathbabe religious?*

*I really like the new mathbabe logo/marque. The typeface is totally flapper and I really like those bulbous upside down B’s, and the offsetting of the bottom text in order to give the text texture. But when I look at the symbolology of the whole logo/marque I can’t help but wonder if the Mathbabe is religious. The T looks like a deproportioned Greek cross, and the alpha above it suggests that there should be an omega below it somewhere. So clearly the new logo/marque has some Christian symbolology, and my eyes keep looking for more. Maybe the A’s are three sided figures that represent the Trinity, and the M represents a firmament that has fallen, and therefore symbolologizes our fallen state.*

*Anyway, it’s cool if you are religious, as lots of great mathematicians were devout people, and some were even priests, like Bayes. And if you’re not that’s cool too. I see you describe sex both profanely and sacredly, so I know you are a spiritual person. And it’s cool if you don’t want to answer either. I respect that religion is a personal matter. Just saw your new logo/marque and was wondering.*

*Semi-semiotic*

Dear Semi-semiotic,

Honestly I have so little religious background that I am not even sure if you’re kidding (but the “symbolologizes” kind of gives you away).

For the record, my parents were atheists who made fun of me when I told them I believed in God in first grade (I think I learned about the *idea* of God from a babysitter). One of their favorite stories of my childhood is when my first grade teacher, a devout Catholic, called up my parents in alarm over my essay which said “I believe in God but please don’t tell my parents” and my mom was like, “Har har that’s a good one, thanks” and hung up on her. Not that my mom is a rude person, she isn’t.

Two more points: First, I plan to refer to myself in third person from now on as “The Mathbabe”, and second, when did I ever refer to sex *sacredly*? That’s bullshit. Blasphemy even.

Aunt Pythia

——

Please please please submit questions, thanks! I’m desperate!

Did you ever explain your logo? Why the out-of-place alpha? Why the prominent “t”? I can see how it could be taken as a cross. Are you “a math babe”? Are you highlighting “at”?

Alpha female.

“There’s another totally different interpretation for the phrase “regression to the mean” which is also confusing though. Namely, the idea that if your first measurement of something is extreme, then your second measurement will tend to be less so. The problem with this is that you have to have a notion of “extreme” in the first place. And if you do, then it’s kind of obvious (and also kind of dumb).”

This is what Galton meant by regression to the mean, unless I badly misunderstand him. And it really wasn’t obvious at the time! Heck, it’s not obvious to everybody now.

I’m not aware that Galton ever thought the variance in height decreased from one generation to the next. On the contrary, I think he believed it to be constant, which I think is roughly correct? And he understood that the fact that the variance stays constant means you have to have regression, writing:

“the distribution of faculties in a population cannot possibly remain constant, if, on the average, the children resemble their parents. If they did so, the giants (in any mental or physical particular) would become more gigantic, and the dwarfs more dwarfish, in each successive generation. The counteracting tendency is what I called “regression.””

Wait, what?? Now I’m thoroughly confused. Can you read this and then we can talk again?

That wiki article is not so clearly written, but it doesn’t say Galton thought the variance of height in the population decreased from one generation to the next (which is good because I don’t think Galton thought that, nor do I think it’s true!)

I agree that it depends on how you read it. If he took only the “outliers” and saw that the next generation seemed less outlier-ish, then it’s conceivable that he didn’t think that. I guess I’d have to read the original work of Galton to know.

I have (not the whole books, but a lot of the material relevant to regression) and that’s what he did.

Galton was working with parent-offspring pairs. If the parent pair is an outlier, say, 1.5 sd from the mean, the offspring would tend to be less extreme, say, 1 sd from the mean. Galton thought this had to do with generation skipping, but he had the wrong model of inheritance and did not understand the role of the environment in quantitative traits. Quantitative traits such as height or weight in humans invariably depend on multiple loci, whose extremes will only be present in homozygotes at all loci, and will also be under (often strong) environmental influence. Since the offspring will not experience an identical environment and is unlikely to remain homozygous at all the relevant loci, it will be less extreme in the trait than the parents were.
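A quick simulation makes the 1.5 sd to 1 sd example concrete. This is only a sketch under assumptions of my own: heights are in sd units, each offspring is the parental value times a slope of 2/3 plus independent noise, and the noise is sized so that the population variance stays the same from one generation to the next. It is not Galton’s data, and it is not a model of inheritance.

```python
import random
import statistics

random.seed(2)
R = 2 / 3      # assumed regression slope, chosen so that 1.5 sd maps to about 1 sd
N = 50000

# Parental values in sd units from the population mean.
parents = [random.gauss(0, 1) for _ in range(N)]

# Noise variance 1 - R**2 keeps the offspring variance equal to the parental variance.
noise_sd = (1 - R ** 2) ** 0.5
children = [R * p + random.gauss(0, noise_sd) for p in parents]

# Offspring of parents who sit roughly 1.5 sd above the mean.
selected = [c for p, c in zip(parents, children) if 1.4 <= p <= 1.6]

sd_parents = statistics.pstdev(parents)
sd_children = statistics.pstdev(children)
mean_selected = statistics.mean(selected)
print(f"sd of parents:  {sd_parents:.2f}")
print(f"sd of children: {sd_children:.2f}")
print(f"mean offspring of ~1.5 sd parents: {mean_selected:.2f}")
```

Under these assumptions the spread of the whole population does not shrink, yet the offspring of extreme parents sit closer to the mean on average. Both things hold at once, which is the point of the Galton quote above.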

As someone in the pre-job situation of Regretting Spurning Statistics, I would love to see a guest post by him/her.

Hi RSS, as an engineer who forsook it for cops and robbers, I have the following response for you. Here’s the beginning of an article I wrote on data visualization for the Handbook of Quantitative Criminology:

“First, a confession. I taught statistics for 30 years, and for most of that time, I stuck pretty close to the topics covered in standard social science and statistics textbooks, explaining correlation, regression, statistical significance, t-tests, ANOVA, etc. In other words, I ended up teaching the same-old, same-old statistical methods (sorry, students), primarily inferential statistics, without stopping to consider how well they filled the students’ needs. Of course, students need to know these methods, since they have to be able to interpret the findings of papers written by researchers who learned, and were applying the same-old, same-old methods. But they also need to know what assumptions are implicit in these methods; many are based on random sampling, which is often not the case, and on linearity, normality, independence, and other idealizations that are rarely found in real data – which does not stop researchers from applying them.”

Next section, which I should have included:

“For the most part, these methods were developed early in the last century, when collecting data was an expensive proposition. For this reason, to reduce the cost of data collection, many of the methods were predicated on taking random samples. Moreover, analyzing data could take hours or days, even with small datasets.

Neither of these conditions still holds. Rather than a trickle of data, we are now confronted with a fire hose of data. Of course, this does not mean that there are no longer problems of data quality; much of the [police-collected] data is entered by humans, not automatically recorded, and dates, names, and other entries are not always entered correctly.”

I’m also mostly self-taught at statistics. As far as a practical book goes, I like Gelman/Hill. Fairly elementary, lots of examples, good for developing intuition: http://www.stat.columbia.edu/~gelman/arm/

For something covering more ground, I’m also a fan of: http://www-stat.stanford.edu/~tibs/ElemStatLearn/

For nerd dating in the city, I personally highly recommend internet dating in general, and OKCupid specifically.

As a male nerd (who was, at the time, in decent physical shape), I had a reasonably easy time finding dates and eventually my fiancée. You do need to be able to handle rejection and not get too upset when someone doesn’t respond, or falls off the planet after the third exchange. At my peak of online dating, I had 5 dates with 4 different women in a week.

I found it much easier to get to know women over the comforts of asynchronous textual communication. In person, I can be aloof and awkward, but on the internet, you’d never know ;).

“It’s one of the many advantages of being a nerd girl”

Should this say “It’s one of the many advantages of being a (straight) nerd girl”? :)

Fair.

The stats classes I took were more theory and not really relevant for practical work experience. So to keep learning I read a lot of stats/data viz blogs (Simply Statistics; Stats Chat; Statistical Modeling, Causal Inference, and Social Science; The Why Axis; Citizen Statistician; etc.) and Coursera has free classes. I am in the last week of Data Analysis which has good overviews of statistical concepts and teaches you how to do it in R. The background in stats is good even if you don’t want or need to learn R. You could download the videos since it is the last week or just wait and enroll in the next class. There might be other relevant classes too, or check on Udacity. Looking forward to the Schutt/mathbabe book. If anyone else has some good blog recommendations, please pass them on!

Try “Statistical Inference” by Casella and Berger. I used it for my Introduction to Mathematical Statistics class and I think it’s one of the best out there. It *is* theory, but I have found that for me, the best way to learn statistics was to learn the theory, and then learn how to apply it from the “masters”. I use statistics extensively, but not in big-data.