Step 0: Installing python and visualizing data
A friend of mine has type 1 diabetes, and lots of data (glucose levels every five minutes) from his monitor. We’ve talked on and off about how to model future glucose levels (say, one hour ahead) using information on the current level, insulin intake, and carb intake. He was kind enough to allow me to work on this project on this blog. It’s an exciting and potentially really useful project, and it will be great to use as an example for each step of the modeling process.
To be clear: I don’t know if I will be able to successfully model glucose levels (or, even better, make suggestions for how much insulin or carbs to take in order to keep glucose within a reasonable range), but it’s exciting and totally worth a try. I’m counting on you to give me suggestions if I’m being dumb and missing something!
I decided to use python to do my modeling, and I went to this awesomely useful page and followed the instructions to install python and matplotlib on my oldish MacBook. It worked perfectly (thanks, nerd who wrote that page!).
The data file, which contains 3 months of data, is a csv (comma separated values) file, with the first line giving the names of the fields in the lines below it:
Index,Date,Time,Timestamp,New Device Time,BG Reading (mg/dL),Linked BG Meter ID,Temp Basal Amount (U/h),Temp Basal Type,Temp Basal Duration (hh:mm:ss),Bolus Type,Bolus Volume Selected (U),Bolus Volume Delivered (U),Programmed Bolus Duration (hh:mm:ss),Prime Type,Prime Volume Delivered (U),Suspend,Rewind,BWZ Estimate (U),BWZ Target High BG (mg/dL),BWZ Target Low BG (mg/dL),BWZ Carb Ratio (grams),BWZ Insulin Sensitivity (mg/dL),BWZ Carb Input (grams),BWZ BG Input (mg/dL),BWZ Correction Estimate (U),BWZ Food Estimate (U),BWZ Active Insulin (U),Alarm,Sensor Calibration BG (mg/dL),Sensor Glucose (mg/dL),ISIG Value,Daily Insulin Total (U),Raw-Type,Raw-Values,Raw-ID,Raw-Upload ID,Raw-Seq Num,Raw-Device Type
1,12/15/10,00:00:00,12/15/10 00:00:00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,28.4,ResultDailyTotal,"AMOUNT=28.4, CONCENTRATION=null",5472682886,50184670,236,Paradigm 522
2,12/15/10,00:04:00,12/15/10 00:04:00,,,,,,,,,,,,,,,,,,,,,,,,,,,120,16.54,,GlucoseSensorData,"AMOUNT=120, ISIG=16.54, VCNTR=null, BACKFILL_INDICATOR=null",5472689886,50184670,4240,Paradigm 522
3,12/15/10,00:09:00,12/15/10 00:09:00,,,,,,,,,,,,,,,,,,,,,,,,,,,116,16.21,,GlucoseSensorData,"AMOUNT=116, ISIG=16.21, VCNTR=null, BACKFILL_INDICATOR=null",5472689885,50184670,4239,Paradigm 522
I made a new directory below my home directory for this file and for the python scripts to live in, and I started up python from the command line inside that directory. Then I opened emacs (could have been TextEdit or any other editor you like) to write a simple script to see my data.
A really easy way of importing this kind of file into python is to use a DictReader from the csv module. DictReader expects a file formatted exactly as this one is: it reads the field names from the first line and then hands you each following line as a dictionary keyed by those names. I wrote this simple script to take a look at the values in the “Sensor Glucose” field (note there are sometimes gaps and I had to decide what to do in that case):
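A minimal sketch of such a script, not necessarily the one that produced the plot, might look like the following. It assumes Python 3 and treats an empty “Sensor Glucose (mg/dL)” field as a gap, recorded as -1. The function names and the file name glucose.csv are placeholders of mine.

```python
import csv

GLUCOSE_FIELD = "Sensor Glucose (mg/dL)"
GAP_VALUE = -1.0  # placeholder plotted wherever a reading is missing


def read_glucose(lines, gap_value=GAP_VALUE):
    """Return one float per CSV row; rows without a sensor reading become gap_value."""
    values = []
    for row in csv.DictReader(lines):
        raw = (row.get(GLUCOSE_FIELD) or "").strip()
        values.append(float(raw) if raw else gap_value)
    return values


def plot_glucose(values):
    """Plot each reading as a dot, in file order."""
    # Imported here so that reading the file works even without matplotlib installed.
    import matplotlib.pyplot as plt

    plt.plot(values, ".", markersize=2)
    plt.xlabel("reading number")
    plt.ylabel(GLUCOSE_FIELD)
    plt.show()


# Usage (the file name is a placeholder):
# with open("glucose.csv", newline="") as f:
#     plot_glucose(read_glucose(f))
```

Keeping a placeholder value for the gaps preserves the spacing of the readings along the horizontal axis, at the cost of putting fake values in the list.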
And this is the picture that popped out:
I don’t know how easy it is to see this, but there are lots of gaps (when there’s a gap I plotted a dot at -1, and the line at -1 looks pretty thick). Moreover, it’s clear this data is being kept in a pretty tight range (probably good news for my friend). Another thing you might notice is that the data looks more likely to be in the lower half of the range than in the upper half. To get at this we will draw a histogram of the data, but this time we will *not* fill in gaps with a bunch of fake “-1”s since that would throw off the histogram. Here are the lines I added in the code:
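As a rough sketch of that change (again assuming Python 3, and again not necessarily the exact lines used here), the gaps can simply be skipped while reading, and the remaining values passed to matplotlib's hist. The list name skip_gaps_datalist matches the one used later in this post; the function names and the bin count of 50 are arbitrary choices of mine.

```python
import csv

GLUCOSE_FIELD = "Sensor Glucose (mg/dL)"


def read_glucose_skip_gaps(lines):
    """Return the sensor readings as floats, leaving out rows with no reading."""
    skip_gaps_datalist = []
    for row in csv.DictReader(lines):
        raw = (row.get(GLUCOSE_FIELD) or "").strip()
        if raw:  # an empty field is a gap, so it is skipped rather than faked
            skip_gaps_datalist.append(float(raw))
    return skip_gaps_datalist


def plot_histogram(values, bins=50):
    """Draw a histogram of the readings; the bin count is an arbitrary choice."""
    # Imported here so that reading the file works even without matplotlib installed.
    import matplotlib.pyplot as plt

    plt.hist(values, bins=bins)
    plt.xlabel(GLUCOSE_FIELD)
    plt.ylabel("number of readings")
    plt.show()


# Usage (the file name is a placeholder):
# with open("glucose.csv", newline="") as f:
#     plot_histogram(read_glucose_skip_gaps(f))
```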
And this is the histogram that resulted:
This is a pretty skewed distribution with a long right tail. Since we know the data is always positive (it’s measuring the presence of something in the bloodstream), and since the distribution is skewed, this makes me consider using the log values instead of the actual values. This is because, as a rule of thumb, it’s better to use variables that are more or less normally distributed. To picture this I replace one line in my code (log is the natural logarithm, for example math.log):
skip_gaps_datalist.append(log(float(row["Sensor Glucose (mg/dL)"])))
And this is the new histogram:
This is definitely more normal.
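One way to put a number on “more normal” is to compare the sample skewness before and after taking logs, since a symmetric distribution has skewness zero. The sketch below is only an illustration of that check: the skewness helper is my own, and the values are made up to have a long right tail, not taken from the data file.

```python
import math


def skewness(values):
    """Sample skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up, right-skewed numbers for illustration only (not readings from the file).
example_values = [50.0, 60.0, 70.0, 80.0, 100.0, 150.0, 300.0]
example_logs = [math.log(v) for v in example_values]

raw_skew = skewness(example_values)
log_skew = skewness(example_logs)
print("skewness of raw values:", round(raw_skew, 2))
print("skewness of log values:", round(log_skew, 2))
```

On the real data, the same function could be applied to skip_gaps_datalist with and without the log.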
Next time we will talk more about cleaning this data and what other data we will use for the model.