Back when I was studying at Texas A&M, one of my senior projects was a classification engine built on heartbeat data pulled from a Fitbit. The hypothesis was that I could build an algorithm to ‘guess’ what I was doing at any given time, based solely on my personal heartbeat. If you take a second to think about that, you’ll probably realize that’s kinda crazy… after building a model, a computer could guess what I was doing at any given time JUST off my heartbeat. If I were to add the time of day, location, and historical data, I could get even more fidelity in the guesses.
Fitbit, like many companies, collects a lot of data on its users, and this data is really powerful… Many people do not realize how much can be inferred from a relatively small amount of personal information. I don’t want to take a stance on security or how best to protect our personal data; rather, I want to showcase the importance of even the smallest and simplest piece of personal information. For my senior project, I achieved about 70% accuracy across 5 categories with just a month’s worth of work and a single source of data.
The idea is pretty simple, albeit a lot of work. Put simply, the goal of this project is to take a series of heartbeats over time and guess what activity the person was doing during that time; this is known as a classification engine.
The first step to creating this classification engine was to create a large dataset of ‘accurate’ or ‘known’ values to train a model against. For this example, I wore my Fitbit every day for a month and recorded EVERYTHING I did. If I walked to class, I wrote it down; if I drove home, I recorded it… every meal, shower, and bathroom break was recorded with a date & time. At the end of each day, I pulled down the last 24 hours of heart rate data and saved the data points with timestamps for later. This took many hours to do, but good training data is essential when building a classifier: “crap in, crap out”. This data is what we use to train our model.
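Something like the following captures that tagging step; this is a simplified Python sketch with made-up sample entries, not our exact code:

from datetime import datetime

# Activity log transcribed from the daily journal: (start, end, label).
# These sample entries are made up for illustration.
activity_log = [
    (datetime(2016, 4, 12, 0, 0),  datetime(2016, 4, 12, 7, 30), "Asleep"),
    (datetime(2016, 4, 12, 8, 30), datetime(2016, 4, 12, 9, 45), "Class"),
    (datetime(2016, 4, 12, 17, 0), datetime(2016, 4, 12, 18, 0), "Physical"),
]

def tag_point(timestamp, log):
    # Return the logged activity covering this heart rate point, or None.
    for start, end, label in log:
        if start <= timestamp < end:
            return label
    return None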
We pull heart rate data from the Fitbit API using the ‘Heart Rate Intraday Time Series’ endpoint. This endpoint is whitelist-only, so we had to ask permission for BeatWise to use it and answer some questions about what the data would be used for. The endpoint outputs a JSON object filled with heart rate points, taken every few seconds (or every minute, depending on the requested detail level) while the user was wearing their Fitbit device. Example output below:
"activities-heart-intraday": {
    "dataset": [
        {
            "time": "00:00:00",
            "value": 64
        },
        {
            "time": "00:00:10",
            "value": 63
        },
        {
            "time": "00:00:20",
            "value": 64
        }
    ],
    "datasetInterval": 1,
    "datasetType": "second"
}
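For reference, pulling a day of intraday data looks roughly like this in Python; the date and token are placeholders, and the URL format follows Fitbit’s published Web API documentation:

import requests

ACCESS_TOKEN = "YOUR_OAUTH2_TOKEN"  # placeholder; the app must be whitelisted for intraday access

# 'Heart Rate Intraday Time Series': one day of data at 1-second detail.
url = "https://api.fitbit.com/1/user/-/activities/heart/date/2016-04-12/1d/1sec.json"
resp = requests.get(url, headers={"Authorization": "Bearer " + ACCESS_TOKEN})
resp.raise_for_status()

points = resp.json()["activities-heart-intraday"]["dataset"]
print(points[0])  # e.g. {'time': '00:00:00', 'value': 64}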
After tagging the points, we looked at how much training data there was for each unique tag. We took the top 5 most frequent tags and used them as the basis for the rest of our program: Asleep, Class, Relaxing, Driving and Physical. The training process is done entirely offline, and the output is an array of ‘centers’: one representative description of the heartbeat wave per activity.
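Reading a ‘center’ as the mean feature vector of an activity’s training windows (an assumption on my part, using the 5 scores described further below), computing them looks like:

import numpy as np

def compute_centers(feature_vectors, labels):
    # Average the description vectors belonging to each activity tag.
    X = np.asarray(feature_vectors, dtype=float)
    labels = np.asarray(labels)
    return {tag: X[labels == tag].mean(axis=0) for tag in np.unique(labels)}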
Since the input is a series of data points of unknown length, we don’t know whether to run our classification engine a single time against the entire wave, or cut the wave in half and run it twice… or cut it into 32 parts and run each part individually… Although you could ‘average’ the activities done over an entire day, it’s much more useful to know what someone was doing throughout the day. The problem is, with too many splits there isn’t enough data to accurately guess the activity, but with too few splits you lose resolution.
In order to properly guess what you were doing at a given time, we have to know what period of time to look at! This is not a trivial problem… The team spent a fair amount of time figuring out the best way to take an entire day, split it up into smaller chunks, and compare those chunks to the training data. At first, we were just taking really small chunks, comparing, and moving on; however, as the chunks got really small (fewer than 5 points in a chunk) the comparison was very inaccurate. So rather than starting with really small chunks, we started by comparing the entire day to our centers, then split the day in half, reran the comparisons on the two halves, and repeated over and over. We only stop if the window gets below some really small size, OR if the next iteration’s comparison score is worse than the parent’s score. This is complicated to imagine, but it creates a tree that can be used to find the nodes with the highest comparison scores, as shown in our summary below.
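A rough sketch of that recursive splitting, assuming a classify_window helper that returns the best-matching activity and a comparison score where higher means closer; the minimum-size threshold is illustrative:

MIN_POINTS = 5  # chunks smaller than this compared very inaccurately

def split_classify(window, classify_window):
    # classify_window(window) -> (label, score); higher score = closer match.
    label, score = classify_window(window)
    if len(window) >= 2 * MIN_POINTS:
        mid = len(window) // 2
        left, right = window[:mid], window[mid:]
        # Only descend if at least one half matches better than the whole.
        if max(classify_window(left)[1], classify_window(right)[1]) > score:
            return (split_classify(left, classify_window)
                    + split_classify(right, classify_window))
    # Leaf of the tree: this window's best guess.
    return [(window, label, score)]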
At its core, a series of heartbeats, or a heartbeat wave, is just that: a wave… Unlike temperature, where comparing two points is easy (the percent higher or lower, or how close each is to some max or min), waves are not as easy to compare. We found the easiest way to compare two heartbeat waves was to calculate variables that describe the wave, converting the series of points into an array of scores that represents it. We currently take 5 key scores: Average, Variance, Rate of Inclination, # of Points Outside 1 Std. Dev., and # of Seconds from Noon. The scoring function is one of the key components of the algorithm: the more ways we can represent the signal, the more ways the signals can be differentiated from each other, and the higher the accuracy. In the future we would love to add a Fast Fourier Transform score (this is computationally heavy), geolocation, and many others!
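Here’s a sketch of those 5 scores in Python; the exact formulas (e.g., taking rate of inclination as the slope of a least-squares line, and measuring seconds from noon at the window’s midpoint) are assumptions, not our verbatim implementation:

import numpy as np

def describe_wave(bpm, seconds):
    # bpm: heart rate values; seconds: time of day (in seconds) of each point.
    bpm = np.asarray(bpm, dtype=float)
    seconds = np.asarray(seconds, dtype=float)
    average = bpm.mean()
    variance = bpm.var()
    slope = np.polyfit(seconds, bpm, 1)[0]                    # rate of inclination
    outside = int(np.sum(np.abs(bpm - average) > bpm.std()))  # points outside 1 std dev
    from_noon = abs(seconds.mean() - 12 * 3600)               # seconds from noon
    return np.array([average, variance, slope, outside, from_noon])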
Finally, we have a way to compare two heartbeat waves: by comparing their descriptions. Having converted each wave into a vector space model, we need a way to compare these vectors. The simple options are cosine similarity or Euclidean distance. We started with cosine, but found weighted Euclidean produced better results.
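Putting it together, nearest-center classification with a weighted Euclidean distance might look like this; the weights are placeholders, since the tuned values aren’t recorded here:

import numpy as np

WEIGHTS = np.array([1.0, 1.0, 1.0, 1.0, 0.5])  # placeholder per-score weights

def weighted_euclidean(a, b, w=WEIGHTS):
    return np.sqrt(np.sum(w * (a - b) ** 2))

def classify_description(description, centers):
    # Pick the activity whose center is closest; negate distance so higher = closer.
    distances = {tag: weighted_euclidean(description, c) for tag, c in centers.items()}
    best = min(distances, key=distances.get)
    return best, -distances[best]

Composing describe_wave with classify_description gives the classify_window helper assumed in the splitting sketch above.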
Once the algorithm classified a point, we assigned it that classification, and each classification was represented by a color. Using D3.js, we plotted these points over a day to give a nice visual representation of the algorithm’s output.
The data shown was computed by our algorithm on my own heartbeat. As you can see, it shows I only got ~4 hours of sleep, which was accurate (finals), and that I spent a lot of my day walking to and from classes. The points match what I actually did that day with about 70% accuracy.
This was just one example of what can be done with classification. The data could be anything! Netflix uses a similar approach to recommend movies you might like to watch. Remember, a little bit of information about someone can go a long way!
TLDR: BeatWise is a data visualization platform that uses machine learning to ‘guess’ what a user was doing during the day, based only on a user’s continuous heartbeat from a Fitbit device. It was fun, difficult, and cool!