
Let’s not replace the SAT with a big data approach

March 19, 2014

The big news about the SAT is that the College Board, which makes the SAT, has admitted there is a problem: widespread test prep and gaming. As I talked about in this post, the SAT mainly serves to sort people by income.

It shouldn’t be a surprise to anyone when a weak proxy gets gamed. Yesterday I discussed this very thing in the context of Google’s PageRank algorithm, and today it’s student learning aptitude. The question is, what do we do next?

Rick Bookstaber wrote an interesting post yesterday (hat tip Marcos Carreira) with an idea to address the SAT problem the same way I’m guessing Google is addressing the PageRank problem: abandon the poor proxy in favor of a deeper, more involved one. Here’s Bookstaber’s suggestion:

You would think that in the emerging world of big data, where Amazon has gone from recommending books to predicting what your next purchase will be, we should be able to find ways to predict how well a student will do in college, and more than that, predict the colleges where he will thrive and reach his potential.  Colleges have a rich database at their disposal: high school transcripts, socio-economic data such as household income and family educational background, recommendations and the extra-curricular activities of every applicant, and data on performance ex post for those who have attended. For many universities, this is a database that encompasses hundreds of thousands of students.

There are differences from one high school to the next, and the sample a college has from any one high school might be sparse, but high schools and school districts can augment the data with further detail, so that the database can extend beyond those who have applied. And the data available to the colleges can be expanded by orders of magnitude if students agree to share their admission data and their college performance on an anonymized basis. There already are common applications forms used by many schools, so as far as admission data goes, this requires little more than adding an agreement in the college applications to share data; the sort of agreement we already make with Facebook or Google.

The end result, achievable in a few years, is a vast database of high school performance, drilling down to the specific high school, coupled with the colleges where each student applied, was accepted and attended, along with subsequent college performance. Of course, the nature of big data is that it is data, so students are still converted into numerical representations.  But these will cover many dimensions, and those dimensions will better reflect what the students actually do. Each college can approach and analyze the data differently to focus on what they care about.  It is the end of the SAT version of standardization. Colleges can still follow up with interviews, campus tours, and reviews of musical performances, articles, videos of sports, and the like.  But they will have a much better filter in place as they do so.

Two things about this. First, I believe this is largely already happening. I’m not an expert on the usage of student data at colleges and universities, but the peek I’ve had into this industry tells me that the analytics are highly advanced (please add related comments and links if you have them!). And they have more to do with admissions and college aid – and possibly future alumni giving – than with any definition of academic success. So I think Bookstaber is being a bit naive and idealistic if he thinks colleges will use this information for good. They already have this information, and they’re not using it for good.

Second, I want to think a little harder about when the “big, deeper data” approach makes sense. I think it does for teachers, to some extent, as I talked about yesterday, because getting evaluated is, after all, part of having a job. For that matter, I expect this kind of thing to become part of most jobs soon (though it will be interesting to see when and where it stops – I’m pretty sure Bloomberg will never evaluate himself quantitatively).

I don’t think it makes sense to evaluate children this way, though. After all, we’re basically talking about pre-consensual surveillance, not to mention the collection and mining of information far beyond the control of the individual child. And we’re proposing to mine demographic and behavioral data to predict future success. This is potentially much more invasive than one crappy SAT test. Childhood is a time we should do our best to protect, not quantify.

Also, the suggestion that this is less threatening because “the data is anonymized” is misleading. Stripping names out of historical data doesn’t change or obscure the difference between coming from a rich high school and a poor one. In the end you will be judged by how “others like you” performed, and in this regime the system gets off the hook while individuals are held accountable. If you think about it, that’s exactly the opposite of the American dream.
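
To make this concrete, here is a minimal sketch in Python. The records and the predict helper are invented for illustration; the point is only the mechanism.

    # Hypothetical historical records: names stripped, but the high school
    # attended (a strong income proxy) is still a feature.
    history = [
        {"high_school": "Rich Prep",   "gpa": 3.4, "college_gpa": 3.5},
        {"high_school": "Rich Prep",   "gpa": 3.4, "college_gpa": 3.6},
        {"high_school": "Poor Public", "gpa": 3.4, "college_gpa": 2.9},
        {"high_school": "Poor Public", "gpa": 3.4, "college_gpa": 3.0},
    ]

    def predict(high_school, gpa):
        """Predict college GPA as the average outcome of 'others like you'."""
        similar = [r["college_gpa"] for r in history
                   if r["high_school"] == high_school and r["gpa"] == gpa]
        return sum(similar) / len(similar)

    # Two applicants with identical records except for where they grew up:
    print(predict("Rich Prep", 3.4))    # 3.55
    print(predict("Poor Public", 3.4))  # 2.95

No names appear anywhere, and yet the two applicants get very different predictions purely because of the group they belong to.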

I don’t want to be naive. I know colleges will do what they can to learn about their students and to choose students who make them look good, at least as long as US News & World Report exists. I’d just like to make it a bit harder for them to do so.

  1. March 19, 2014 at 10:03 am

    It’s so important to me that you’re writing on this subject! (Sorry about the generic sound to that, but of course you know me and it’s true.)


  2. Zathras
    March 19, 2014 at 10:07 am

    “Colleges have a rich database at their disposal: high school transcripts, socio-economic data such as household income and family educational background, recommendations and the extra-curricular activities of every applicant, and data on performance ex post for those who have attended.”

    No. No. No.

    What colleges have is the potential for a rich database. What do they have instead? Paper copies or PDFs of transcripts, which look completely different from school to school. Paper copies of recommendations, which are mostly worthless anyway. It would be a HUGE effort to turn this into mineable data.


    • Guest2
      March 19, 2014 at 10:56 pm

      You are wrong about e-transcripts: it is already happening, probably driven along by just the kinds of pressures Cathy describes. Hence, transcripts are being standardized to conform.

      https://www.ohiohighered.org/transfer/ohio-hs-etranscripts


      • Zathras
        March 20, 2014 at 7:42 am

        I have to wonder how far back one can get an e-transcript. This year’s transcripts aren’t going to help with this work at all. To data-mine for student success, you need e-transcripts from at least five years ago. The objective is to find predictors of graduation rates. The most recent graduation data will be for people who graduated college in 2013 (soon there will be 2014 data). People graduating college in 2013 will typically have graduated high school in 2008 or 2009. So you need e-transcripts from 2008 and 2009 at a minimum, when e-transcripts did not exist, and even then you only get one cohort of college outcomes to look at, which is probably not enough.


    • Guest2
      March 25, 2014 at 9:28 am

      http://www.insidehighered.com/news/2014/01/21/colleges-move-digital-transcripts-managed-outside-firms

      Big Corporations are watching, and they already know where they want to take this.


  3. March 19, 2014 at 10:11 am

    When social engineering goals conflict with data science approaches, social engineering seems to win. Colleges have moved away from predicting an individual’s chance to succeed, in favor of social engineering that favors group characteristics over individual merit. The battle against the SAT and other quantifiable metrics is part of the social engineering mission. It was also the original impetus for grade inflation at Harvard.

    “In 2005, for example, Asians who were admitted to the University of Michigan scored a median 1400 out of 1600 on the SAT. The median score was 50 points lower for whites, 140 points lower for Hispanics and 240 points lower for blacks, according to a study done by the Center for Equal Opportunity. That same year ‘black and Hispanic male applicants from Michigan with no alumni/ae ties to UM but with a 1240 SAT and 3.2 GPA had a nine in ten chance of admissions (92 percent and 88 percent, respectively); Asians and whites with the exact same background and credentials, on the other hand, had only about a one in ten chance (10 percent and 14 percent, respectively).’ “


    • cat
      March 19, 2014 at 3:20 pm

      “That same year ‘black and Hispanic male applicants from Michigan with no alumni/ae ties to UM but with a 1240 SAT and 3.2 GPA had a nine in ten chance of admissions (92 percent and 88 percent, respectively); Asians and whites with the exact same background and credentials, on the other hand, had only about a one in ten chance (10 percent and 14 percent, respectively).’ “

      Is this a “whites are the oppressed minority” rant? Because that’s what it sounds like.

      If you compare the freshman class to the state population, you’ll see that black and Hispanic students are still under-represented at UM relative to the population of Michigan. That means more white students apply to UM than UM has freshman slots for. Are you really saying the entire UM freshman class should be white because more whites applied?


      • March 19, 2014 at 3:33 pm

        Why should the color of one’s skin be a criterion for whether one is admitted to a competitive college? Should it matter that 40% of Berkeley undergrads are Chinese? 60%? 80%? Why? The best-qualified applicants with a DIVERSITY of IDEAS should be admitted. If that’s 90% Blacks (as in some professional sports), so be it.


    • Guest2
      March 19, 2014 at 11:00 pm

      “The battle against the SAT and other quantifiable metrics is part of the social engineering mission.”

      Ironic, then, isn’t it? The SAT was invented by progressive social engineers.
      http://www.pbs.org/wgbh/pages/frontline/shows/sats/where/history.html
      http://www.pbs.org/wgbh/pages/frontline/shows/sats/where/


  4. lindapbrown2013
    March 19, 2014 at 11:22 am

    I guess we can add “late bloomer” to the list of totally outdated concepts. If you haven’t sold an app to Apple, Google or Mr. Zuckerberg by the time you’re sixteen, who needs you?


  5. March 19, 2014 at 8:40 pm

    1. Using “big data” approaches seems to me the only sensible approach to college admission. A main motivation for adopting the current approach (“individualized admission,” reducing the weight of the SAT and competitive entrance exams and emphasizing “geographic distribution” and “extracurricular engagement”) in the 1920s was racial discrimination, and it is used today with a racial goal in mind (just not the same one), though today’s proponents are far less honest about it (because, formally, racial balance goals are not permitted to universities). Algorithms are not inherently more biased than admissions officers, and they are much easier to police because they can’t lie. You can measure any biases an algorithm has by feeding it data and observing the output (see the first sketch after this list), which isn’t a practical way of measuring the biases of human admissions officers.
    2. The key point (made today by Ed Felten here) is that the problem is transparency, not the algorithms themselves. What we should insist on is not that universities avoid algorithms, but that admission criteria be public and objective.
    Since universities don’t really compete on their admission mechanism, but rather on the education delivered, I see no reason to keep the admission algorithms secret. For example, admission to my alma mater, the Hebrew University of Jerusalem, depends only on a kind of high school GPA and on an SAT-like exam (plus some threshold requirements like proficiency in English). One can be admitted based on either of the two scores alone, or based on a weighted average of the two (50%-50% for most academic programs, 30%-70% for some); a rough sketch of such a rule appears after this list. Each year the university publishes last year’s admission cutoffs for each department for the benefit of prospective students, recently also through a web calculator where you can input your grades and get a prediction of your admission chances. I should add that there are a number of “affirmative action” spots, which are assigned by additional (but again public) criteria related to socio-economic status, parental education and the like. There is no room for admissions officers to game this system; you can argue with the criteria, and you can claim that they are biased in various ways (say, because richer students can get more tutoring for the SAT), but the criteria themselves are out in the open.
    3. A key preliminary question is what admissions should optimize for. I think they should optimize for the odds of success in college. Some believe in optimizing for other things (say, for students from the “right” backgrounds); others believe in optimizing for expected future donations to the university, or for expected contributions to the revenue generated by competitive university athletics. In any case, this should be made explicit, so that the admission mechanism can be judged for fidelity to its design goals.
    A corollary is that I think universities should also publish the correlation between the admission scores they use and the outcome measures they care about. If academic performance is a goal, then they should publish the correlation between academic performance and admission scores (binned data, of course, not just the numerical coefficient). If “diversity” is a goal, then the universities should publish their way of measuring “diversity” and how it corresponds to the admissions information used.
    4. Given my personal preference here, I see no problem with saying “past students with similar files performed in this way.” I’m not sure why you think that the performance of “others like you” is a bad predictor of your own performance. More importantly: if you have better predictors of student performance than the past performance of similar students, it would be great to hear about them. I hope everyone is sad that students from certain schools tend to do better at university than students from other schools; but I don’t see why universities should admit students who are less likely to succeed in order to serve hidden non-academic motivations. If success in university correlates with the school you come from, then we need to fix the schools, not rig university admissions. It was wrong to adopt the current system to exclude Jews for the benefit of WASPs, and it is no less wrong to use it today for other racial motivations, especially when officially there can be no direct racial motivations.
    5. Once the data is public, we can have a genuine discussion. I would readily believe research showing that certain admission criteria (say, the SAT and high-school extracurricular activities) tend to underestimate the capabilities of students from some social classes and races compared to others. But unless we are honest about what criteria we are actually using, and what admission goals we are optimizing for, it is pointless to have a derivative discussion about whether the admission system is achieving those goals.
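
    On point 1, here is the promised sketch of what “feeding it data and seeing the output” could look like. The admit model and all of its numbers are hypothetical stand-ins; the point is that a black-box audit is just a loop over synthetic applicants.

        def admit(applicant):
            """A stand-in for a university's opaque admission algorithm."""
            bonus = 5 if applicant["school"] == "Rich Prep" else 0
            return 25 * applicant["gpa"] + 0.05 * applicant["exam"] + bonus >= 125

        def acceptance_rate(pool):
            return sum(admit(a) for a in pool) / len(pool)

        # Synthetic applicants with identical academics, differing only in school:
        profiles = [{"gpa": g, "exam": e} for g in (3.0, 3.5, 4.0)
                                          for e in (600, 700, 800)]
        pool_a = [dict(p, school="Rich Prep") for p in profiles]
        pool_b = [dict(p, school="Poor Public") for p in profiles]

        print(acceptance_rate(pool_a))  # ~0.56
        print(acceptance_rate(pool_b))  # ~0.44: same records, measurably lower odds

    Running the same controlled experiment on a committee of human admissions officers is essentially impossible, which is the sense in which algorithms are easier to police.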
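
    And on point 2, here is the rough sketch of the published-cutoff rule I described above. The weights, the common 0-100 scale, and the cutoffs are invented for illustration; they are not the Hebrew University’s actual numbers.

        # Published last-year cutoffs per department (invented numbers):
        CUTOFFS = {"mathematics": 88, "history": 79}

        def admission_score(gpa_score, exam_score, exam_weight=0.5):
            """Best of three public routes: GPA alone, exam alone, or a
            weighted average (exam_weight would be 0.5 or 0.7 by program)."""
            weighted = (1 - exam_weight) * gpa_score + exam_weight * exam_score
            return max(gpa_score, exam_score, weighted)

        def admitted(department, gpa_score, exam_score, exam_weight=0.5):
            return admission_score(gpa_score, exam_score, exam_weight) >= CUTOFFS[department]

        print(admitted("history", 82, 75))      # True: the GPA route clears 79
        print(admitted("mathematics", 82, 75))  # False on every route

    Anyone can recompute their own chances from the published numbers, which is the whole point.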


    • March 19, 2014 at 8:53 pm

      TL;DR version:

      Suppose that after seeing many students over several years, you conclude that students with GPA 80% from school A do about as well as students with GPA 75% from school B. Two students show up with GPA 80%, one from each school. Do you take the one from school A or school B?

      Should the university care why the difference is there?
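
      In code, the calibration I have in mind is just a lookup into past outcomes. The numbers below are invented to match the example:

          # Average college GPA of past students, bucketed by high school
          # and high-school GPA (invented numbers):
          past_performance = {
              ("A", 80): 3.0,
              ("A", 75): 2.7,
              ("B", 80): 3.3,
              ("B", 75): 3.0,  # school B's 75% looks like school A's 80%
          }

          def predicted(school, gpa):
              return past_performance[(school, gpa)]

          print(predicted("A", 80))  # 3.0
          print(predicted("B", 80))  # 3.3: the school B applicant is the better bet

      The data says to take the student from school B, whatever the reason for the gap.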


  6. Guest2
    March 20, 2014 at 8:36 am

    “Using “big data” approaches seems to me the only sensible approach to college admission.”

    What about from high school and beyond? See the discussion about Student Unit Record System (SURS) above.


  7. Matt
    March 20, 2014 at 10:45 am

    I find Bookstaber’s comments scary, because of how this data-based future seems to be embraced with no reservations.

    Of course maximizing “future alumni giving” would trigger an alarm on many people’s radar, but I think maximizing “academic success” is also dangerous – after all, it would be maximized by a highly homogeneous freshman class, biased toward whatever particular background has been calculated to be the most likely to succeed. Is that good? Only if “academic success” is really all you wanted. Probably it isn’t.

    Big data is a powerful tool, it is coming whether we like it or not, and it will indeed make college admissions (among everything else) much more effective. The real question, which is still under human control and subject to forces both good and evil, is: much more effective at what?

