Experimentation in education – still a long way to go

September 5, 2013

Yesterday’s New York Times ran a piece by Gina Kolata on randomized experiments in education. Namely, education researchers have started to run randomized experiments the way medical researchers run clinical trials. Here’s what’s going on:

… a little-known office in the Education Department is starting to get some real data, using a method that has transformed medicine: the randomized clinical trial, in which groups of subjects are randomly assigned to get either an experimental therapy, the standard therapy, a placebo or nothing.

They have preliminary results:

The findings could be transformative, researchers say. For example, one conclusion from the new research is that the choice of instructional materials — textbooks, curriculum guides, homework, quizzes — can affect achievement as profoundly as teachers themselves; a poor choice of materials is at least as bad as a terrible teacher, and a good choice can help offset a bad teacher’s deficiencies.

So far, the office — the Institute of Education Sciences — has supported 175 randomized studies. Some have already concluded; among the findings are that one popular math textbook was demonstrably superior to three competitors, and that a highly touted computer-aided math-instruction program had no effect on how much students learned.

Other studies are under way. Cognitive psychology researchers, for instance, are assessing an experimental math curriculum in Tampa, Fla.

If you go to any of the above links, you’ll see that the metric of success is consistently defined as a standardized test score. That’s the only gauge of improvement. So any “progress” that’s made is by definition measured by such a test.

In other words, if we optimize to this system, we will optimize for textbooks which raise standardized test scores. If it doesn’t improve kids’ test scores, it might as well not be in the book. In fact it will probably “waste time” with respect to raising scores, so there will effectively be a penalty for, say, fun puzzles, or understanding why things are true, or learning to write.

Now, if scores are all we cared about, this could and should be considered progress. Certainly Gina Kolata, the NYTimes journalist, didn’t mention that we might not care only about this – she recorded it as unfettered good, as she was expected to by the Education Department, no doubt. But, as a data scientist who gets paid to think about the feedback loops and side effects of choices like “metrics of success,” I have a problem with it.

I don’t have a thing against randomized trials – using them is a good idea, and will maybe even quiet some noise around all the different curriculums, online and in person. I do think, though, that we need to have more ways of evaluating an educational experience than a test score.

After all, if I take a pill once a day to prevent a disease, then what I care about is whether I get the disease, not which pill I took or what color it was. Medicine is a very outcome-focused discipline in a way that education is not. Of course, there are exceptions, say when the treatment has strong and negative side-effects, and the overall effect is net negative. Kind of like when the teacher raises his or her kids’ scores but also causes them to lose interest in learning.

If we go the way of the randomized trial, why not give the students some self-assessments and review capabilities of their text and their teacher (which is not to say teacher evaluations give clean data, because we know from experience they don’t)? Why not ask the students how they liked the book and how much they care about learning? Why not track the students’ attitudes, self-assessment, and goals for a subject for a few years, since we know longer-term effects are sometimes more important than immediate test score changes?

In other words, I’m calling for collecting more and better data beyond one-dimensional test scores. If you think about it, teenagers get treated better by their cell phone companies or Netflix than by their schools.

I know what you’re thinking – that students are all lazy and would all complain about anyone or anything that gave them extra work. My experience is that kids actually aren’t like this, know the difference between rote work and real learning, and love the learning part.

Another complaint I hear coming – long-term studies take too long and are too expensive. But ultimately these things do matter in the long term, and as we’ve seen in medicine, skimping on experiments often leads to bigger and more expensive problems. Plus, we’re not going to improve education overnight.

And by the way, if and/or when we do this, we need to implement strict privacy policies for the students’ answers – you don’t want a 7-year-old’s attitude about math held against him or her when he or she applies to college.


  1. September 5, 2013 at 9:45 am

    The problem that I see with a lot of the education studies is the short time window they are looking at. What is needed is a study like the Nurses’ Health Study that tracks people over twenty years or more. Then we can see if one type of education or another is better at producing inventors, engineers, politicians, criminals or bankers (oh, I already said criminals :)).

    Now the Nurses’ Health Study doesn’t do much for identifying a root cause, but it is good at long-term trends and larger trends because of the population size.


  2. September 5, 2013 at 10:27 am

    There’s a tacit assumption in these studies that the best solution is a “one-size fits all” approach to education. A “best” pedagogy or set of materials. They use “average” or “median” scores to summarize different approaches.

    But what if one set of kids thrive with pedagogy A but not B and another set thrives with B but not A? It doesn’t appear as though these studies are designed for this. I’d like to see more research on how to determine good ways to motivate learning for each kid and use that to select individualized materials and teaching methods.

    Mass customization in education – an idea worthy of research.
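
To make the scenario in this comment concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the two pedagogies, the two groups of students, and the score gains are invented numbers, not results from any of the studies discussed. It only shows how two approaches can have the same average gain while helping opposite sets of kids.

```python
from statistics import mean

# Hypothetical score gains (in test points). All values are invented for
# illustration; they do not come from any real study.
gains = {
    ("A", "group 1"): [8, 10, 12],
    ("A", "group 2"): [-2, 0, 2],
    ("B", "group 1"): [-2, 0, 2],
    ("B", "group 2"): [8, 10, 12],
}


def overall_mean(pedagogy):
    """Average gain for one pedagogy, pooling every student together."""
    pooled = [
        gain
        for (p, _group), values in gains.items()
        if p == pedagogy
        for gain in values
    ]
    return mean(pooled)


def subgroup_means(pedagogy):
    """Average gain for one pedagogy, reported separately for each group."""
    return {
        group: mean(values)
        for (p, group), values in gains.items()
        if p == pedagogy
    }


for pedagogy in ("A", "B"):
    print(pedagogy, "overall:", overall_mean(pedagogy),
          "by group:", subgroup_means(pedagogy))
```

In this made-up data the pooled averages for A and B are identical, so a study that reports only the average would call the two approaches equivalent, even though each one helps one group and does nothing for the other.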


    • September 5, 2013 at 10:33 am

      Great point. The model could also benefit from including attributes of students going in (besides last year’s standardized test score).


      • Michael L.
        September 5, 2013 at 4:41 pm

        What you and Greg are talking about (I think) are moderator or interaction effects. I don’t know if the studies you referred to looked at these but this could be done with sub-group analyses. One could look at the effect of method A versus the control group method for blacks only, girls only, kids in social class X only, or whatever one would like as long as the data were available. But statistician Andrew Vickers in his book What is a P-Value Anyway? discusses some of the vexing statistical issues that arise with such sub-group analyses.


    • Zathras
      September 5, 2013 at 11:28 am

      “Mass customization”: I am putting together a proposal to study this exact issue! I am going to be looking at students in a college setting, based on relative performance in high school, which courses they did well in, etc., to see whether there is a pattern that predicts which educational modalities are a relatively good fit for them. Which students will perform well in online courses vs. interactive courses vs. regular lectures (yes, there are students that perform better with the dreaded “chalk and talk” modality!)? Much more on this later.


      • September 5, 2013 at 11:31 am

        interesting! guest post?


        • Zathras
          September 5, 2013 at 1:59 pm

          Maybe in a few months. I’m still trying to put together something coherent on the subject. 🙂


    • sheenyglass
      September 5, 2013 at 12:53 pm

      “They use “average” or “median” scores to summarize different approaches

      “But what if one set of kids thrive with pedagogy A but not B and another set thrives with B but not A?”

      I’m a bit of a statistics naif, but this also makes me wonder about the way in which evaluating a pedagogical tool as being the best because it increases average scores will either ignore or overweight the extremes. Are those doing this research able to distinguish between tools that provide relatively consistent increases across all students and tools that provide relatively large increases in the scores of those who already perform well without doing much for those at the bottom (or vice versa)?

      It seems to me that, while there is a place for tools that do all of those things (why not provide students with the tools that work best for them), this could introduce a way for commercial educational tool developers to game the system by identifying and targeting demographics which are easy to improve while ignoring demographics that are not.


  3. September 5, 2013 at 10:45 am

    I think it’s important to note that many clinical trials conducted to assess the efficacy of drugs aggregate patient cohorts based on criteria that may unwittingly ignore or conflate important differences in individual medical conditions.

    There’s a growing movement toward personalized medicine that may help overcome the overgeneralization bias in such experiments.

    On a related note, I was inspired by a recent report on personalized education at American Radio Works by Emily Hanford and Stephen Smith.


  4. September 5, 2013 at 3:06 pm

    For all the recent hype of “holistic medicine” and such, we should remember that the tremendous success of Western medicine largely rests on a very reductionistic approach. Most of the randomized studies in medicine compare very specific interventions, and judge the outcome by common medical goals. For example, cancer studies measure the 5-year survival rate. Breastfeeding studies measure weight gain and the prevalence of childhood illnesses.

    So I see two unrelated problems with the situation you describe. First is that comparing entire curricula as opposed to only varying small parts means that we learn very little from the study. In particular, we don’t know what about the new curriculum makes it better. This is an actual deficiency with the study.

    Second is the choice of the standardized test score as the goal we optimize for. Here I think the problem is with the educational system, not the experimenters. If the teachers want to optimize the test scores, then they need studies telling them how (just like doctors wanting to optimize for 5-year survival rate). If we don’t like this, we should send our kids to schools that measure student achievement in other ways.


  5. Denise B
    September 9, 2013 at 10:47 pm

    Leaving aside all the questions about how you score the results, are we to think that it’s a brand new idea to test pedagogical methods in a rigorous way? I suppose that would explain why we run experiments on whole school districts and get surprised by the outcomes. Well, what the hell have 1000+ schools of education been doing for the past 100 years? There are plenty of doctorates in education. What do these people do, if not serious research into teaching and learning?


  6. September 13, 2013 at 12:39 pm

    Cathy, thanks for the thoughtful post! You put your finger on one of the deep issues with designing an education: there is little agreement on what success means, and the things that can be cheaply and quickly measured with the most validity are things like multiple choice answers after technical performance on small tasks. That doesn’t cover a lot of what success might mean. It’s a bit like that old joke about someone who only looks for missing keys under the lamppost because the light is there.

