Applying the HMM to the sleep dataset¶
Back to the dataset we'll be working with for this tutorial: a Drosophila movement dataset. It is real, raw data from the Gilestro lab, where we track and record the movement of fruit flies using machine vision. The tracking can discern small movements in each fly several times per second, giving us a multitude of variables to work with.
We'll be using the Pandas package to import and store our data in the notebooks. Pandas is a widely used tool in data science; it is built on top of Numpy, which we briefly used previously. At the core of Pandas is the DataFrame, a table format you will all be familiar with from spreadsheets. Pandas provides many tools to manipulate the data before you feed it into any analysis or machine learning tool.
As with Numpy, everything used here will be explained as we use it, but if you'd like to read more about how to use Pandas, there is a quick tutorial on their website -> here
# first we need to import pandas and numpy
# like numpy, it is often imported under a shorthand alias
import pandas as pd
import numpy as np
# this pandas setting is to suppress warnings in a later function
pd.set_option("future.no_silent_downcasting", True)
Pandas can read many different formats; see here for a detailed list of all the file types it can read and save to. One of the most common in biology is the spreadsheet, or csv file. The training data for this tutorial is a zipped csv file called 'training_data.zip', saved in the data folder.
Copy the path of that file (see below for the exact file structure) and save it as a string in the variable 'path'. Then load it into the notebook using the function pd.read_csv().
.
|
├── data
| ├── example_hmm.pkl
| ├── training_data_metadata.csv
| └── training_data.zip <----
├── docs
├── notebooks
| ├── 1_Understanding_HMMs.ipynb
| ├── 2a_Cleaning_your_data.ipynb
| └── ...
└── src
# If you're using Windows you'll need to put an r in front of the string, as Python otherwise treats backslashes as escape characters
# e.g. r'C:\Users\USER\Documents\HMM_tutorial\data\training_data.zip'
path = ''
# it's common practice to name dataframes df
df = pd.read_csv(path)
The Data Structure¶
df
|  | id | t | x | y | w | h | max_velocity | mean_velocity | moving | micro | walk |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2016-04-04_17-39-22_033aee\|01 | 31140 | 0.269116 | 0.069594 | 0.038829 | 0.020012 | 75.662162 | 25.713480 | True | False | True |
| 1 | 2016-04-04_17-39-22_033aee\|01 | 31170 | 0.606590 | 0.068019 | 0.048224 | 0.020609 | 27.471271 | 9.145901 | True | False | True |
| 2 | 2016-04-04_17-39-22_033aee\|01 | 31200 | 0.398307 | 0.070464 | 0.049073 | 0.020628 | 19.718721 | 5.478951 | True | False | True |
| 3 | 2016-04-04_17-39-22_033aee\|01 | 31230 | 0.469571 | 0.066383 | 0.046558 | 0.020423 | 20.224544 | 7.475374 | True | False | True |
| 4 | 2016-04-04_17-39-22_033aee\|01 | 31260 | 0.260085 | 0.073667 | 0.047548 | 0.020133 | 34.824007 | 6.163203 | True | False | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2453010 | 2016-09-27_10-56-35_053c6b\|15 | 86250 | 0.776577 | 0.064865 | 0.034109 | 0.022879 | 0.799611 | 0.673182 | False | False | False |
| 2453011 | 2016-09-27_10-56-35_053c6b\|15 | 86280 | 0.776577 | 0.064537 | 0.033866 | 0.022686 | 0.774246 | 0.659115 | False | False | False |
| 2453012 | 2016-09-27_10-56-35_053c6b\|15 | 86310 | 0.776577 | 0.064823 | 0.035156 | 0.021957 | 0.779612 | 0.679327 | False | False | False |
| 2453013 | 2016-09-27_10-56-35_053c6b\|15 | 86340 | 0.776577 | 0.064693 | 0.035478 | 0.022051 | 0.772465 | 0.678201 | False | False | False |
| 2453014 | 2016-09-27_10-56-35_053c6b\|15 | 86370 | 0.753328 | 0.064167 | 0.033769 | 0.022784 | 186.156732 | 13.214985 | True | False | True |
2453015 rows × 11 columns
The first column is 'id', which contains a unique ID per fly and will allow us to filter and apply methods to just one fly at a time. The next most important variable is 't', or time. As we are working with time series data, we must ensure it is structured properly, i.e., in sequential order and at regular intervals (we'll cover the latter later on). The rest are various variables per timestamp; for this tutorial, we'll only be interested in 'moving', 'micro', and 'walk'.
Checking for missing data¶
Most real datasets will not be perfectly populated, with tracking dropping out over the course of an experiment. In a dataframe or an array where data is missing for some columns but not all, the missing data will be represented by a NaN value, which lets methods and functions know there is no data rather than a zero value. However, analysis packages will often throw an error if you feed them NaN values, so it's good practice to check for them first and either remove them or replace them with an approximation.
# Let's filter our dataframe for NaN values
# With pandas you can filter the dataframe by its columns
# To filter or slice the dataframe, put square brackets after the dataframe and place the filtering condition inside them
# For finding NaN values we have to call a method; for regular filtering you just use ==, <, > and so on
df[df['x'].isnull()]
# To break down what's happening we can just call what's inside the brackets; you can see that it is an array (or Series in pandas terms) with True or False per row.
# This array then dictates which rows get returned from the whole dataframe, i.e. only the ones that fulfil the condition and are True
df['x'].isnull()
# However, we are not just looking at one column.
# Luckily with pandas you can filter by multiple columns; all you need to do is put each filter condition in round brackets and separate them with an & ("and") or | ("or") logical operator
# By using OR here we get all the rows where 'x' or 'y' is NaN
df[(df['x'].isnull()) | (df['y'].isnull())]
# Now we want to remove those rows containing NaN values as they aren't providing any information
# This time we'll want to use the & operator, as we only want rows where both 'x' and 'y' are not NaN
# When filtering for NaNs above we were selecting for them; adding the ~ operator inverts the filter, so rows where NaN is True now become False
# If taking a slice of a dataframe it's good practice to make it a copy, otherwise pandas will throw up warnings later
df_filtered = df[~(df['x'].isnull()) & ~ (df['y'].isnull())].copy(deep = True)
# the new DataFrame now won't have any rows where 'x' or 'y' has a NaN value
df_filtered
Task:¶
As stated before, for this tutorial we will be focusing on the variables 'moving', 'micro', and 'walk'. Now that you know how to filter out NaN values, apply this to these columns only.
# To complete
df =
Extra Task:¶
If you're new to Pandas (or just want some practice), have a play around with other types of filtering (such as df[df['mean_velocity'] > 5]). It's a quick and easy way to filter your data, and if you're doing the same thing repeatedly, you can create a function to do it instantly.
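As a rough sketch of that idea (the function name and default threshold here are just illustrative, not part of the tutorial's pipeline):

# a small helper function so we can repeat the same filter with a single call
# the name and the default threshold are just examples
def filter_by_velocity(data, threshold = 5):
    return data[data['mean_velocity'] > threshold].copy(deep = True)

fast_df = filter_by_velocity(df_filtered)
fast_df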
Rather than filtering out the NaN values, you can replace them with something else. We know that tracking can drop out when the flies are still for a long time, so we could reasonably replace all of the NaNs for 'moving', 'micro', and 'walk' with False. This can be done with the .fillna method; see here for how to do it -> fillna.
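A minimal sketch of that alternative (kept in a separate variable here so it doesn't interfere with the filtering approach above):

# replace the missing values in the behaviour columns with False rather than dropping the rows
df_filled = df.copy(deep = True)
df_filled[['moving', 'micro', 'walk']] = df_filled[['moving', 'micro', 'walk']].fillna(False)
df_filled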
Binning the data to a larger time step¶
It's important with hidden Markov models that any time series dataset is complete, with no skips due to missing data, as the model assumes the array you feed it has a constant time step. One way to reduce gaps is to increase the time step; currently, the dataset has a row for every 30 seconds, but we know from filtering out the NaN values that we no longer have all of them. So we will increase the time step to 60 seconds: as long as at least one of the two 30-second rows in each 60-second window is present, we'll have a perfectly populated dataset.
Additionally, doing so will decrease the size of the data, meaning the model will train more quickly. It's always worth trying the model with a few different timesteps to see how this affects the output, and then you can pick the one you think is the most representative and quickest to train.
# we'll go through it step by step, before wrapping it all in a function
# First we'll need the function floor which rounds down numbers
from math import floor
# first we'll create a new column with the new time step
# lambda functions are an easy way to apply a function to each row of a specific column
# We divide the time by the new time step and round down, then multiply back by the time step, which rounds each time down to the nearest multiple of 60
df['bin_t'] = df['t'].map(lambda t: 60 * floor(t / 60)) # the t represents each row's value for the column 't'
df
You should see in the column 'bin_t' that neighbouring rows now share a time step. Next we'll want to pivot, or group by, this column so that all rows with the same timestamp are collected together.
# The pandas groupby method does this, all you need to do is call the method with the column you want to pivot by in the brackets
# Then you can tell it what aggregating function you want to call on the columns of interest
df_grouped = df.groupby('bin_t').agg(**{
'x' : ('x', 'mean'), # before the brackets is the name of the new column, we'll keep it the same
    'y' : ('y', 'mean') # within the brackets is the column you want to use and the function to apply to it. 'mean', 'median', 'max', etc. are built in, but you can also use your own functions
})
df_grouped
Some of you may have noticed that doing it this way aggregates the whole dataset and loses the per-fly information. To keep this information, we can call groupby with two levels: the first ('id') is the higher level that the data is grouped by first, and the second ('bin_t') is the level the aggregating functions are then applied within.
# We do exactly the same, but instead of just 'bin_t' we have a list with 'id' first
# Calling it this way on a lot of rows can take a few minutes or more depending on your computer, so don't worry if it takes a while
df_grouped = df.groupby(['id', 'bin_t']).agg(**{
'x' : ('x', 'mean'),
'y' : ('y', 'mean')
})
# We need to reset the index as it will have both 'id' and 'bin_t' as the index
df_grouped.reset_index(inplace = True)
# We'll also rename the column 'bin_t' back to 't' for clarity
df_grouped.rename(columns = {'bin_t' : 't'}, inplace = True)
df_grouped
Task:¶
The same as before: recreate the steps above, but for the columns 'moving', 'micro', and 'walk'. Instead of mean, use max, as we care about the most dominant behaviour in each time window; it will also keep the results as either True or False, which are discrete categories.
# To complete
df =
Filling in the gaps¶
Another method to fill in the gaps in the data is interpolation. This is where you estimate a value at a given time point from the rest of the dataset. If you have just a few points missing, the interpolation results can be quite accurate. Here we'll run through the steps to interpolate your data.
# First we can check if we have all the data points for each fly
# We'll use this method to check
def check_num_points(data, timestep=60):
array_time = max(data['t']) - min(data['t'])
if (array_time / timestep) + 1 == len(data):
return True
else:
return False
# We need to call the function on each fly individually
# To do this you call a groupby with 'id' as the argument
df_check = df.groupby('id').apply(check_num_points, include_groups=False) # set include_groups to false when you don't want the grouping column in the analysis
# This gives us a pandas series of True and False for each fly
df_check
# We can count all the True and Falses with the method .value_counts()
df_check.value_counts()
We can see that nearly 50% of our flies are missing some points, so it's best we move ahead with interpolation.
Extra Task:¶
Rather than just returning True or False, you can create a function that returns the percentage of the required points that are actually present. You can then use this to filter out the flies that have less than 75% of their points (one possible sketch is shown after the code space below).
# Code space for extra task
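One possible sketch for this extra task (the function name and the 75% cut-off are just the ones suggested above):

# return the fraction of the expected time points that are actually present for one fly
def percent_points(data, timestep = 60):
    expected = ((max(data['t']) - min(data['t'])) / timestep) + 1
    return len(data) / expected

# keep only the flies that have at least 75% of their expected points
pct = df.groupby('id').apply(percent_points, include_groups=False)
keep_ids = pct[pct >= 0.75].index.tolist()
df_75 = df[df['id'].isin(keep_ids)]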
Like when checking for points, we'll need to create a function that we can call to apply the interpolation per fly, as we want it to use only each fly's data. But we'll walk through the steps before creating it. As the data is discrete, we'll be using forward-fill interpolation, which propagates the last valid observation forward to the next missing one; see here for more information.
If we were working with continuous data, we could use linear interpolation, which we'll briefly demonstrate at the end with np.interp, a one-dimensional linear interpolator; see here for the documentation from numpy.
# for now we'll work with a sub-sample of the main DataFrame so we can check things are working before creating the function
# Task:
# With the 'id' of '2016-04-04_17-38-06_019aee|20', create a sub DataFrame with just this data
small_df =
# Now we'll want to create a time series that contains all the points we want
# For this we can use np.arange which creates an array from a given start point to an end point, with regular steps
# You need to add your time step onto the end point, as np.arange otherwise stops one step below it
ts_seq = np.arange(min(small_df['t']), max(small_df['t']) + 60, 60)
# You can see it's an array that increases by 60 at each step by checking the difference between points
np.all(np.diff(ts_seq) == 60)
# Next we'll need to merge this back with the data to create new rows with NaN values which we'll replace
# To do this we make a pandas Series, which is like a single-column dataframe, named 't'
# With both the small dataframe and the series containing 't' we can merge the two together using this column as the key
ts_seq = pd.Series(ts_seq, name = 't')
small_df = small_df.merge(ts_seq, on = 't', how = 'right') # The merge is done to the right as we want the final result to be the length of the new sequence
# Checking for NaN values we can see the new time points are all there
small_df[small_df['moving'].isnull()]
# Now all we need to do is call ffill, which forward fills the NaN values
small_df.ffill(inplace=True)
# The NaNs have been filled
small_df[small_df['moving'].isnull()]
small_df
We can now make a function that will complete this for the whole dataset.
# Fill in the missing parts with what we've done above
def fill_interpolate(data, timestep = 60):
ts_seq =
ts_seq =
new_df =
new_df.ffill(inplace=True)
return new_df
# Now call a groupby method, applying the interpolate function
df = df.groupby('id').apply(fill_interpolate, include_groups=False)
df.reset_index(level = 0, inplace = True)
# Let's use the check function to see if it's worked
df_check = df.groupby('id').apply(check_num_points, include_groups=False)
df_check.value_counts()
Linear interpolation¶
For continuous data like the x and y coordinates, we can use linear interpolation, which fills in each missing point according to where it would fall on a straight line fitted between the surrounding real data points.
# We'll load in the original dataset to get some continuous data again.
interp_df = pd.read_csv(path)
# We'll check to see if any are missing datapoints
df_check = interp_df.groupby('id').apply(check_num_points, include_groups=False)
df_check.value_counts()
They're all missing points, so we'll use the same specimen as last time.
small_interp = interp_df[interp_df['id'] == '2016-04-04_17-38-06_019aee|20'].copy(deep=True)
small_interp
# Like before we'll make a new time series of the length and intervals we want
ts_seq = np.arange(min(small_interp['t']), max(small_interp['t']) + 60, 60)
# Call np.interp with the new time series first, the old time points second, and the corresponding data values third
new_seq = np.interp(ts_seq, small_interp['t'].to_numpy(), small_interp['x'].to_numpy())
new_seq
Extra task:¶
- Can you make a function that will use np.interp on the whole interp_df dataset per fly for variables 'x' and 'y'?
# Code space for extra task
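A possible sketch of such a function, following the same groupby pattern as the forward-fill version above (the function name is just an example, and NaN rows are dropped before interpolating so np.interp only sees real observations):

# linearly interpolate 'x' and 'y' onto a regular time grid for a single fly
def linear_interpolate(data, timestep = 60):
    ts_seq = np.arange(min(data['t']), max(data['t']) + timestep, timestep)
    new_df = pd.DataFrame({'t' : ts_seq})
    for col in ['x', 'y']:
        valid = data[data[col].notna()] # remove NaN rows so only real observations are used
        new_df[col] = np.interp(ts_seq, valid['t'].to_numpy(), valid[col].to_numpy())
    return new_df

# apply it per fly and bring 'id' back as a column
interp_filled = interp_df.groupby('id').apply(linear_interpolate, include_groups=False)
interp_filled.reset_index(level = 0, inplace = True)
interp_filled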
# we can convert the behaviour columns to 0s and 1s quickly with a for loop
# for loops are useful when you need to do the same thing over and over, with only a few things changed
for i in ['moving', 'micro', 'walk']:
small_df[i] = np.where(small_df[i] == True, 1, 0)
# The columns are now nicely binary
small_df
# Finally, we don't know whether each fly's data has a good number of data points
# Flies with a low count could indicate they died early or the tracking stopped working
len_check = df.groupby('id').agg(**{
'length' : ('t', len)
})
len_check['length'].value_counts()
# You can see most flies have over 9000 data points, but 2 have only 200-odd; we'll want to remove them
# We can find the length of each fly's subset dataframe
# get the ids of those with fewer than 300 data points
# then use this list to filter them out of the whole dataset
len_df = df.groupby('id').agg(**{
'len' : ('t', len)
})
filt_len = len_df[len_df['len'] < 300]
filt_list = filt_len.index.tolist()
df = df[~df['id'].isin(filt_list)]
df
Coding our categories¶
The HMM we'll be using is categorical, which means that if we want to use all the information from the 3 columns, we must create a new column of numbers representing which variable is true. Hmmlearn takes each observable as a number, with the first being 0, the next being 1, and so on. Here we have 3 observables: not moving, micro-moving, and walking, so we would like them to be 0, 1, and 2, respectively.
For this we'll be using np.where, a function that takes a logical statement and fills a column with one value where it's True and another where it's False. See here for more information.
# First we'll look for all rows where the flies aren't moving; if True we label them 0, if not we give them a NaN value for now
df['hmm'] = np.where(df['moving'] == 0, 0, np.nan)
# Next we'll look at micro. A fly cannot be both micro-moving and walking, they are distinct categories, so we can just select for the True cases.
# We make the False argument what the column was previously to keep the old category
df['hmm'] = np.where(df['micro'] == 1, 1, df['hmm'])
# Now we'll finish with walk, can you complete it?
# df['hmm'] = np.where()
df['hmm'] = np.where(df['walk'] == 1, 2, df['hmm'])
Before we move on, we need to set the new column to integers only. Pandas has a habit of turning data points into floats, i.e., numbers with a decimal point. This will cause problems when training, as a categorical HMM wants the input to be integers only.
# We can set columns types with the .astype() method
df = df.astype({'hmm' : 'int'})
df
# Next we'll save our cleaned dataframe as a pickle file
# Pickles are a popular format for saving many different kinds of Python objects
# Pandas has a built-in function to save the file
df.to_pickle('YOUR_PATH/data/cleaned_data.pkl')
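To load the saved file back in later (for example at the start of the next notebook), pandas has a matching read function:

# read the cleaned dataframe back from the pickle file
df = pd.read_pickle('YOUR_PATH/data/cleaned_data.pkl')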
Extra Tasks¶
1. Split the data by Male and Female into separate dataframes.¶
- In the same folder (data) as the training data, there is a csv file containing the metadata for the experiment. This includes a column called "sex", which denotes their sex with "M" for male and "F" for female, along with a column with their ID. Use both of these to filter and split the data into two training datasets (a possible sketch is shown below).
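A sketch of one possible approach, assuming the metadata's ID column is also called 'id' (check the csv to confirm the exact column names):

# load the metadata and merge the 'sex' column onto the cleaned data by fly id
meta = pd.read_csv('YOUR_PATH/data/training_data_metadata.csv')
df_meta = df.merge(meta[['id', 'sex']], on = 'id', how = 'left')

# split into two training datasets by sex
df_male = df_meta[df_meta['sex'] == 'M'].copy(deep = True)
df_female = df_meta[df_meta['sex'] == 'F'].copy(deep = True)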
2. Convert a continuous float column to a discrete categorical column¶
- Convert one of the continuous data columns ["x", "y", "phi", "w", "h"] into discrete categories and use it to train an HMM (see the sketch below for one way to discretise a column)
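A sketch of one way to do this with pd.cut, which bins a continuous column into a fixed number of labelled intervals (the choice of 'x' and of 5 bins here is arbitrary):

# re-load the raw data to get a continuous column, drop its NaN rows, then bin it into 5 integer categories
raw = pd.read_csv(path)
raw = raw[raw['x'].notna()].copy(deep = True)
raw['x_cat'] = pd.cut(raw['x'], bins = 5, labels = False)
raw['x_cat'].value_counts()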