
Glucose Prediction Model: absorption curves and dirty data

July 5, 2011

In an earlier post I started visualizing some blood glucose data using Python, and in a follow-up post my friend Daniel Krasner kindly rewrote my initial plots in R.

I am attempting to show how to follow the modeling techniques I discussed here in order to try to predict blood glucose levels. Although I listed a bunch of steps, I'm not going to follow them in exactly the order I wrote them, even though I tried to list them in more or less the order in which we should at least consider them.

For example, it says first to clean the data. However, until you decide a bit about what your model will be attempting to do, you don’t even know what dirty data really means or how to clean it. On the other hand, you don’t want to wait too long to figure something out about cleaning data. It’s kind of a craft rather than a science. I’m hoping that by explaining the steps the craft will become apparent. I’ll talk more about cleaning the data below.

Next, I suggested you choose in-sample and out-of-sample data sets. In this case I will use all of my data for my in-sample data since I happen to know it’s from last year (actually last spring) so I can always ask my friend to send me more recent data when my model is ready for testing. In general it’s a good idea to use at most two thirds of your data as in-sample; otherwise your out-of-sample test is not sufficiently meaningful (assuming you don’t have that much data, which always seems to be the case).
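To make the split concrete, here's a minimal sketch in Python (the readings are made up). The point is that with time series you split chronologically, never randomly, so the out-of-sample test remains a genuine forecast:

```python
# Hypothetical glucose readings, already sorted by time.
readings = [110, 115, 122, 130, 128, 121, 117, 112, 108]

# Use at most two thirds of the data in-sample; hold out the rest.
cutoff = (2 * len(readings)) // 3
in_sample = readings[:cutoff]
out_of_sample = readings[cutoff:]
print(len(in_sample), len(out_of_sample))  # 6 3
```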

Next, I want to choose my predictive variables. First, we should try to see how much mileage we can get out of predicting future blood glucose levels from past glucose levels. Keeping in mind that the previous post had us working with log levels instead of actual glucose levels, since the distribution of log levels is closer to normal, we will actually be trying to predict future log glucose levels (log levels) from past log levels.

One good stare at the data will tell us that more than one past data point will probably be needed, since we see pretty consistent moves upwards and downwards. In other words, there is autocorrelation in the log levels, which is to be expected, so we will also want to look at the derivative of the log levels in the near past to predict future log levels. The derivative can be computed by taking the difference between the most recent log level and the one before it.
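Here's a quick sketch of what those predictors look like in Python, with made-up readings: the log levels themselves, plus the causal "derivative" computed as the difference of the two most recent log levels:

```python
import math

# Made-up glucose readings at 5-minute intervals.
glucose = [110, 115, 122, 130, 128]
log_levels = [math.log(g) for g in glucose]

# The slope at time t uses only data available at time t (causal).
d_log = [log_levels[t] - log_levels[t - 1] for t in range(1, len(log_levels))]

# Candidate predictors for the next log level:
last_level = log_levels[-1]
last_slope = d_log[-1]
```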

Once we have the best model we can get from past log levels alone, we will want to add other reasonable signals. The most obvious candidates are the insulin intakes and the carb intakes. These are presented as integer values with certain timestamps. Focusing on insulin for now, if we know when the insulin is taken and how much, we should be able to model how much insulin has been absorbed into the bloodstream at any given time, provided we know what the insulin absorption curve looks like.

This leads to the question: what does the insulin absorption (rate) curve look like? I've heard that it's pretty much bell-shaped, with a maximum at 1.5 hours from the time of intake, so it looks more or less like a normal distribution's probability density function. It remains to guess what the maximum height should be, but it very likely depends linearly on the amount of insulin taken. We also need to guess at the standard deviation, although we have a pretty good head start knowing the 1.5-hour clue.
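As a placeholder, here's what that guess looks like in Python: a normal density centered at 1.5 hours, scaled linearly by the dose. The 1.5-hour peak is the clue mentioned above; the 0.75-hour standard deviation is purely an assumption we'd fit to data later:

```python
import math

def insulin_absorption_rate(t_hours, dose_units, peak_hr=1.5, sd_hr=0.75):
    """Guessed bell-shaped absorption rate: a normal density centered
    at peak_hr after intake, scaled linearly by the dose."""
    z = (t_hours - peak_hr) / sd_hr
    return dose_units * math.exp(-0.5 * z * z) / (sd_hr * math.sqrt(2 * math.pi))
```

Since this is a probability density scaled by the dose, the rate integrates over time to (roughly) the total amount injected, which is the sanity check we'd want.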

Next, the carb intakes will be similar to the insulin intakes but trickier, since there is more than one type of carb and different types get absorbed at different rates, although they are all absorbed by the bloodstream in a vaguely similar, bell-curve-shaped way. We will have to be pretty careful when adding the carb intake model, since the overall model will probably depend dramatically on our choices.
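Here's a sketch of the same idea for carbs, with a per-type peak time; the "fast" and "slow" categories and their peak times are placeholders I made up, not measured values:

```python
import math

# Hypothetical peak absorption times (hours) by carb type.
CARB_PEAK_HR = {"fast": 0.5, "slow": 2.0}

def carb_absorption_rate(t_hours, grams, carb_type):
    peak = CARB_PEAK_HR[carb_type]
    sd = peak / 2  # crude guess: the curve's width scales with its peak time
    z = (t_hours - peak) / sd
    return grams * math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))
```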

I’m getting ahead of myself, which is actually kind of good, because we want to make sure our hopeful path is somewhat clear and not too congested with unknowns. But let’s get back to the first step of modeling, which is just using past log glucose levels to predict the next glucose level (we will later try to expand the horizon of the model to predict glucose levels an hour from now).

Looking back at the data, we see gaps and we sometimes see crazy values. Moreover, we see crazy values more often near the gaps. This is probably due to the monitor crapping out near the end of its life and also near the beginning. Actually, the weird values at the beginning are easy to take care of: since we are going to work causally, we will know there has been a gap and the data has just restarted, so we will know to ignore the values for a while (we will determine how long shortly) until we can trust the numbers. But it's much trickier to deal with crazy values near the end of the monitor's life, since, working causally, we won't be able to look into the future and see that the monitor will die soon. This is a pretty serious dirty data problem, and the regression we plan to run may be overly affected by the crazy crapping-out-monitor values if we don't figure out how to weed them out.
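Here's a sketch of the causal cleaning rule for the easy (restart) case: after any gap in the feed, including the very start, distrust readings during a warm-up window. The 30-minute warm-up is a placeholder we'd have to tune:

```python
GAP_MIN = 15     # readings arrive every 5 minutes, so > 15 min counts as a gap
WARMUP_MIN = 30  # distrust period after the feed (re)starts; to be tuned

def trusted_flags(timestamps_min):
    """Causally flag each reading as trusted (True) or in warm-up (False)."""
    flags, restart_at = [], None
    for i, t in enumerate(timestamps_min):
        if i == 0 or t - timestamps_min[i - 1] > GAP_MIN:
            restart_at = t  # the feed just (re)started here
        flags.append(t - restart_at >= WARMUP_MIN)
    return flags

# Two runs of readings separated by a long gap:
ts = list(range(0, 45, 5)) + list(range(120, 165, 5))
print(trusted_flags(ts))  # readings right after each (re)start come out False
```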

There are two things that may help. First, the monitor also has a data feed which is trying to measure the health of the monitor itself. If this monitor monitor is good, it may be exactly what we need to decide, “uh-oh the monitor is dying, stop trusting the data.” The second possible saving grace is that my friend also measured his blood glucose levels manually and inputted those numbers into the machine, which means we have a way to check the two sets of numbers against each other. Unfortunately he didn’t do this every five minutes (well actually that’s a good thing for him), and in particular during the night there were long gaps of time when we don’t have any manual measurements.
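Here's one way the cross-check could look in code; the tolerance window and the disagreement ratio are made-up thresholds that would need tuning:

```python
def flag_disagreements(monitor, manual, window_min=10, max_ratio=1.3):
    """monitor, manual: lists of (minutes, glucose). Returns indices of
    monitor readings contradicted by a nearby manual measurement."""
    bad = []
    for i, (t, g) in enumerate(monitor):
        nearby = [gm for tm, gm in manual if abs(tm - t) <= window_min]
        if nearby and max(g, nearby[0]) / min(g, nearby[0]) > max_ratio:
            bad.append(i)
    return bad

monitor = [(0, 100), (5, 102), (10, 300)]   # the third reading looks crazy
manual = [(9, 105)]                         # fingerstick near minute 9
print(flag_disagreements(monitor, manual))  # [2]
```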

A final thought on modeling. We've mentioned three sources of signals: past blood glucose levels, insulin absorption forecasts, and carbohydrate absorption forecasts. There are a couple of other variables that are known to affect blood glucose levels, namely the time of day and the amount of exercise the person is doing. We won't have access to exercise data, but we do have access to timestamps. So it's possible we can incorporate that data into the model as well, once we have some idea of how glucose is affected by the time of day.
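If we do use the timestamps, one standard trick (my suggestion, not something from the data) is to encode the time of day cyclically, so that 23:55 and 00:05 come out close together rather than at opposite ends of a number line:

```python
import math

def time_of_day_features(minutes_since_midnight):
    """Encode the clock as a point on the unit circle, so times just
    before and just after midnight map to nearby feature values."""
    angle = 2 * math.pi * minutes_since_midnight / (24 * 60)
    return math.sin(angle), math.cos(angle)
```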

  1. Ed
    July 6, 2011 at 8:11 am

    As an added complication regarding manually inputted blood glucose levels, the continuous monitoring device in question likely measures interstitial glucose level and not blood glucose directly. With the manual inputs, the device uses an algorithm of some sort to determine blood glucose levels. At the simplest level, there’s a time delay between blood and interstitial levels. When the blood glucose levels change rapidly, continuous monitoring devices often crap out.


    • July 6, 2011 at 8:31 am


      Great to know, thanks so much! Do you have any idea how much of a delay there is between the blood and interstitial levels? Does the delay length vary depending on things?



  2. July 8, 2011 at 3:46 pm


    People claim there is a 15 minute delay in BG vs interstitial; 5-10 of this is biological and 5 is because the sensor only gives output every 5 minutes. I don’t think the correlation is anywhere near that regular though, as it seems to depend on other factors, not just rate of change, but hydration levels, etc.

    Anyways, regarding the bad data at the end, I have ideas for cleaning that data which some preliminary testing shows work. I can tell you in more detail, but I don't have time to type up a description at the moment.



  3. Stephen Tashiro
    November 9, 2012 at 1:55 pm

    Were any predictive models for glucose concentration developed in the early days of treating diabetes? – before the days when patients could monitor their glucose levels at home?

    One of my cats has diabetes, which motivates a casual interest in glucose curves. The treatment for the cat (at my vet) is to develop an insulin dose (1 shot per 12 hours) based on keeping the cat at the clinic for a few days and then to administer the dose at home without any blood glucose measurements. My impression is that this resembles the old-time way of treating humans.


  4. Lane
    November 28, 2013 at 1:48 pm

    Great observations, Cathy. You rock. Question for you: it is straightforward to model blood glucose as an AR(2) time series driven by IID WGN, if one first takes the log of blood glucose, i.e. y(t) = log(bg(t))-average(log(bg(t))), then fits an AR(2) to y(t). The blood glucose predictions are then simply the exponential transformation of the y hats (don’t forget to multiply by the median blood glucose).

    Here’s my question: if I now want to bring in two exogenous inputs – carbs and insulin – and both of these have well behaved linear transfer functions to blood glucose, how do I do it? Life would be fine if blood glucose was normally distributed, but it isn’t. I can’t figure out how to build an ARMAX model while still preserving the lognormal shape of the distribution. I believe I can come up with much better predictions for blood glucose by incorporating the carb and insulin inputs into the AR(2) model.

