Get started by using the Python API to download the Health Gym datasets, implement your first offline reinforcement learning algorithm in the tutorial, or give us feedback on how we can improve.


Health Gym data can be downloaded either directly from this website or using the Python API.

Note: This section is still in progress. The API will be available soon.

Install the Health Gym Python package with pip by running the following in your terminal:

pip install healthgym

To download and access the data from within Python:

import healthgym as hg

hypotension_data = hg.datasets.Hypotension(root='path/to/data/', download=True)


All datasets support the following parameters:

root (string): Root directory where the dataset exists, or where it will be saved if download is set to True.

download (bool, optional): If True, downloads the dataset from the internet and puts it in the root directory. If the dataset has already been downloaded, it is not downloaded again.


This tutorial illustrates a simple reinforcement learning approach to optimise the management of acutely hypotensive patients in the intensive care unit (ICU). The complete Jupyter notebook can be found on GitHub.

Start by downloading the hypotension dataset from … A detailed description of this dataset can be found at …, but essentially it contains vital signs (including mean arterial blood pressure, MAP), lab tests and treatments (fluid boluses and vasopressors) measured over 48 hours in 3,910 patients with acute hypotension.

Let’s start with the necessary imports:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.cross_decomposition import PLSCanonical
from sklearn.cluster import KMeans
Load the dataset and distinguish state- and action-related variables:
df = pd.read_pickle('data_fake.pkl')
num_patients = len(df) // 48
# Divide state and action variables
df_state = df.drop(['fluid_boluses', 'vasopressors'], axis='columns')
df_action = df[['fluid_boluses', 'vasopressors']]
Cross decomposition [1]https://scikit-learn.org/stable/modules/cross_decomposition.html is used to reduce the dimensionality of the state space to 5 variables, similar to the approach suggested by Liu et al. [2]https://arxiv.org/abs/2107.04491v1.
# One-hot encode the action columns (fluid_boluses, vasopressors)
df_action = pd.get_dummies(df_action, prefix='fluid_boluses', columns=['fluid_boluses'])
df_action = pd.get_dummies(df_action, prefix='vasopressors', columns=['vasopressors'])

# Canonical partial least squares (PLSCanonical)
plsca = PLSCanonical(n_components = 5)
X = df_state.astype(float).values
Y = df_action.astype(float).values
X_norm = (X-X.mean(axis=0))/(X.std(axis=0))
Y_norm = (Y-Y.mean(axis=0))/(Y.std(axis=0))
X_canonical, Y_canonical = plsca.fit_transform(X_norm, Y_norm)
K-means clustering is used to assign each patient (at each timepoint) to one of 100 clusters. An appropriate number of clusters can be determined, for example, by looking at improvements in the Davies-Bouldin score [3]https://scikit-learn.org/stable/modules/generated/sklearn.metrics.davies_bouldin_score.html.
num_states = 100
km = KMeans(n_clusters = num_states, n_init = 10, random_state = 123, verbose = True)
state_number = km.fit(X_canonical).labels_
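One way to compare candidate cluster counts is to fit K-means for several values and keep the one with the lowest Davies-Bouldin score (lower values indicate more compact, better separated clusters). The sketch below does this on three synthetic, well-separated blobs. The array `demo_features` and the candidate range are assumptions for illustration; on the real canonical scores the minimum may be much less clear-cut.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)

# Three synthetic blobs in a 5-dimensional space (stand-in for X_canonical)
centers = np.array([[0.0] * 5, [10.0] * 5, [-10.0] * 5])
demo_features = np.vstack(
    [center + rng.normal(scale=0.5, size=(50, 5)) for center in centers]
)

# Fit K-means for each candidate number of clusters and record the score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=123).fit_predict(demo_features)
    scores[k] = davies_bouldin_score(demo_features, labels)

# Lower Davies-Bouldin score is better
best_k = min(scores, key=scores.get)
print(scores)
print(best_k)
```

For the tutorial data, the same loop could be run over a coarser grid of candidates (for example 25, 50, 100, 200) to keep the runtime manageable.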
Before implementing the actual reinforcement learning algorithm, let’s add a few useful variables to the dataframe. ‘hour’ represents the hour (between 0 and 47) in which measurements were taken, ‘action_number’ is a number between 0 and 15 indicating the action taken (every combination of four levels of administered fluid boluses and four levels of vasopressors), ‘state_number’ indicates the cluster associated with the state at time t, and ‘state_number_tp1’ indicates the cluster associated with the state at time t+1.
df['hour'] = np.tile(range(48), num_patients)
df['fluid_boluses'] = df['fluid_boluses'].replace({'0': 0, '250': 1, '500': 2, '1000': 3})
df['vasopressors'] = df['vasopressors'].replace({'0.0': 0, '1e-06': 1, '8.4': 2, '20.28': 3})
df['action_number'] = 4*df['fluid_boluses'] + df['vasopressors']
df['state_number'] = state_number
df['state_number_tp1'] = df['state_number'].shift(-1)
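The encoding `action_number = 4 * fluid_boluses + vasopressors` can be inverted with integer division and remainder. A minimal sketch, assuming the level indices 0 to 3 defined above; the helper names and the label lists are hypothetical, with labels taken from the dose bins used later in this tutorial:

```python
# Labels for the four dose levels of each treatment
FLUID_LEVELS = ['[0, 250)', '[250, 500)', '[500, 1000)', '>= 1000']
VASO_LEVELS = ['0', '(0, 8.4)', '[8.4, 20.28)', '>= 20.28']


def encode_action(fluid_level, vaso_level):
    """Combine two level indices (0-3 each) into one action number (0-15)."""
    return 4 * fluid_level + vaso_level


def decode_action(action_number):
    """Recover (fluid_level, vaso_level) from an action number."""
    return divmod(action_number, 4)


def describe_action(action_number):
    """Return the dose-bin labels for an action number."""
    fluid_level, vaso_level = decode_action(action_number)
    return FLUID_LEVELS[fluid_level], VASO_LEVELS[vaso_level]


print(encode_action(2, 3))
print(describe_action(11))
```

Action 11, for instance, decodes to fluid level 2 and vasopressor level 3.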
The reward function for reinforcement learning is defined as a piecewise linear function of the MAP in the next state, following the approach by Gottesman et al. [4]https://arxiv.org/abs/2002.03478
def reward_function(row):
    if row.MAP > 65: # 0 at >65
        reward = 0
    elif row.MAP > 60: # -0.05 at 60 and 0 at 65
        reward = -0.05 * (65 - row.MAP) / 5            
    elif row.MAP > 55: # -0.15 at 55 and -0.05 at 60
        reward = -0.10 * (60 - row.MAP) / 5 - 0.05
    else: # -1 at 40 and -0.15 at 55
        reward = -0.85 * (55 - row.MAP) / 15 - 0.15
    if row.urine > 30 and row.MAP > 55:
        reward = 0
    return reward

df['reward'] = df.apply(reward_function, axis=1)
# Shift one row up so that the reward sits in the same row as the action that led to it
df['reward'] = df['reward'].shift(-1)

# Drop the last hour of each patient, since the reward of that action is not observed
# (copy to avoid assigning to a view of the original dataframe)
df = df[df['hour'] < 47].copy()
df['state_number_tp1'] = df['state_number_tp1'].astype(int)
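To check the knots of this piecewise linear function, the sketch below restates it for plain numbers and evaluates it at a few MAP values. The function name `map_reward` and its scalar signature are assumptions made for this example; the thresholds are copied from `reward_function` above.

```python
import math


def map_reward(map_value, urine):
    """Reward for a single observation, mirroring reward_function above."""
    if map_value > 65:
        reward = 0.0
    elif map_value > 60:
        reward = -0.05 * (65 - map_value) / 5
    elif map_value > 55:
        reward = -0.10 * (60 - map_value) / 5 - 0.05
    else:
        reward = -0.85 * (55 - map_value) / 15 - 0.15
    # Urine output above 30 cancels the penalty, but only while MAP is above 55
    if urine > 30 and map_value > 55:
        reward = 0.0
    return reward


# Knots of the piecewise linear function (urine output set to 0)
knots = [(70, 0.0), (65, 0.0), (60, -0.05), (55, -0.15), (40, -1.0)]
for map_value, expected in knots:
    assert math.isclose(map_reward(map_value, urine=0), expected, abs_tol=1e-12)

# The urine override applies at MAP 58 but not at MAP 50
assert map_reward(58, urine=50) == 0.0
assert map_reward(50, urine=50) < 0.0
```

Note that the last segment is not clipped: below a MAP of 40 it keeps decreasing linearly, so rewards lower than -1 are possible.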
A simple algorithm for offline reinforcement learning (where we are given a fixed batch of data and can’t interact with the environment to collect new data) is Batch-Constrained Q-learning (BCQL) [5] https://arxiv.org/abs/1812.02900. It determines an optimal policy in a way similar to standard Q-learning [6]https://en.wikipedia.org/wiki/Q-learning, except that the max over action values in the next state is taken only over actions that have actually been observed in the data for that state.
# Q-learning
Q = np.full((num_states, 16), np.nan, dtype='float') # 100 states, 16 actions
# Set to 0 if state-action combination has actually been observed in the data
# (indices are cast to int because iterrows() can upcast them to float)
for index, row in df.iterrows():
    Q[int(row['state_number']), int(row['action_number'])] = 0

num_iterations = 100
step_size = 0.1
diff_tracker = np.zeros((num_iterations, 1))
for Q_iter in range(num_iterations):
    Q_old = Q.copy()
    for index, row in df.iterrows():
        s, a = int(row['state_number']), int(row['action_number'])
        s_next = int(row['state_number_tp1'])
        # Discount factor gamma = 1; nanmax only considers observed (non-NaN) actions
        Q[s, a] += step_size * (row['reward'] + np.nanmax(Q[s_next, :]) - Q[s, a])
    # Mean absolute change of the Q-values, used to monitor convergence
    diff_tracker[Q_iter] = np.nanmean(np.abs(Q-Q_old))
    print([Q_iter, diff_tracker[Q_iter]])
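The effect of the batch constraint is easiest to see on a toy problem. The sketch below is a minimal tabular version under stated assumptions: three hypothetical transitions, a dictionary instead of a NaN-filled array, and a final state (2) with no observed actions whose value is taken to be 0. It is not the tutorial's implementation, only the same update rule on made-up data.

```python
def batch_constrained_q_learning(transitions, num_iterations=300,
                                 step_size=0.1, gamma=1.0):
    """Tabular Q-learning whose max only ranges over observed actions."""
    # Only state-action pairs that occur in the batch get a Q entry
    q = {(s, a): 0.0 for s, a, _, _ in transitions}
    observed_actions = {}
    for s, a, _, _ in transitions:
        observed_actions.setdefault(s, set()).add(a)

    for _ in range(num_iterations):
        for s, a, r, s_next in transitions:
            # Max over actions seen in the next state; 0 if none were seen
            next_value = max(
                (q[(s_next, b)] for b in observed_actions.get(s_next, ())),
                default=0.0,
            )
            q[(s, a)] += step_size * (r + gamma * next_value - q[(s, a)])
    return q


# (state, action, reward, next_state); action 1 is never observed in state 1
toy_transitions = [
    (0, 0, -1.0, 1),
    (0, 1, 0.0, 1),
    (1, 0, -0.5, 2),
]
q_toy = batch_constrained_q_learning(toy_transitions)
print(q_toy)
```

Here Q(0, 1) converges towards -0.5, because the only action observed in state 1 is worth -0.5. An unconstrained max over a zero-initialised table would instead pick the never-observed action 1 in state 1 (value 0) and estimate Q(0, 1) as 0.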
That’s it! The expected value of the RL policy can now be compared with that of the original clinical policy and of a random policy. Under the learned Q-values, the RL policy has the highest expected value of the three.
# Evaluate: average the Q-values at each patient's first hour
Q_RL = 0
Q_clinician = 0
Q_random = 0

for index, row in df.iterrows():
    if row['hour'] == 0:
        s, a = int(row['state_number']), int(row['action_number'])
        Q_RL += np.nanmax(Q[s, :])
        Q_clinician += Q[s, a]
        # Random policy: pick uniformly among the actions observed in this state
        h = Q[s, :]
        h = h[~np.isnan(h)]
        Q_random += np.random.choice(h)
Q_RL = Q_RL / num_patients
Q_clinician = Q_clinician / num_patients
Q_random = Q_random / num_patients

# 'marker' (not 'markers') sets the marker symbol for all points
sns.scatterplot(x=['RL', 'Clinician', 'Random'], y=[Q_RL, Q_clinician, Q_random], marker='s', s=100)
plt.ylabel('Expected policy value')
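The three estimates can be traced by hand on a small example. The sketch below uses a hypothetical 2-state, 3-action Q-table and three made-up patients; all values are assumptions for illustration. For the random policy it takes the mean over observed actions, which is the expected value of the uniform sampling used above rather than a single random draw.

```python
import numpy as np

# Hypothetical Q-table: 2 states x 3 actions, NaN marks unobserved pairs
Q_demo = np.array([[-1.0, np.nan, -0.2],
                   [np.nan, -0.5, -0.4]])
initial_states = [0, 1, 0]     # state of each patient at hour 0
clinician_actions = [0, 1, 2]  # action the clinician took at hour 0

# Greedy (RL) policy: best observed action in each initial state
value_rl = np.mean([np.nanmax(Q_demo[s]) for s in initial_states])
# Clinician policy: value of the action actually taken
value_clinician = np.mean(
    [Q_demo[s, a] for s, a in zip(initial_states, clinician_actions)]
)
# Random policy: expected value of a uniform choice among observed actions
value_random = np.mean([np.nanmean(Q_demo[s]) for s in initial_states])

print(value_rl, value_clinician, value_random)
```

Because the greedy value is a maximum over the same Q-table, it can never be lower than the other two estimates, so the comparison describes the learned Q-values and is not an independent evaluation of the policies.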
When comparing the RL policy to the original clinical policy, it appears that RL recommends administering high volumes of fluid boluses at earlier timepoints.
df_boxplot_RL = df[['MAP']].copy()
# Greedy action under the learned Q-table for every row
df_boxplot_RL['action_number'] = [int(np.nanargmax(Q[int(s), :])) for s in df['state_number']]
df_boxplot_RL['agent'] = 'RL'
df_boxplot_clinician = df[['MAP', 'action_number']].copy()
df_boxplot_clinician['agent'] = 'Clinician'
df_boxplot = pd.concat([df_boxplot_RL, df_boxplot_clinician], ignore_index=True, sort=False)

# Map each action number back to its fluid bolus and vasopressor dose bin
fluid_boluses_dict = {
    **dict.fromkeys([0, 1, 2, 3], '[0, 250)'),
    **dict.fromkeys([4, 5, 6, 7], '[250, 500)'),
    **dict.fromkeys([8, 9, 10, 11], '[500, 1000)'),
    **dict.fromkeys([12, 13, 14, 15], '>= 1000')
}
vasopressors_dict = {
    **dict.fromkeys([0, 4, 8, 12], '0'),
    **dict.fromkeys([1, 5, 9, 13], '(0, 8.4)'),
    **dict.fromkeys([2, 6, 10, 14], '[8.4, 20.28)'),
    **dict.fromkeys([3, 7, 11, 15], '>= 20.28')
}
df_boxplot['fluid_boluses'] = df_boxplot['action_number'].map(fluid_boluses_dict)
df_boxplot['vasopressors'] = df_boxplot['action_number'].map(vasopressors_dict)

# Draw each boxplot in its own figure so that the two plots do not overlap
plt.figure()
sns.boxplot(y='MAP', x='fluid_boluses', data=df_boxplot, palette="colorblind", hue='agent')
plt.figure()
sns.boxplot(y='MAP', x='vasopressors', data=df_boxplot, palette="colorblind", hue='agent', order=['0', '(0, 8.4)', '[8.4, 20.28)', '>= 20.28'])
This was just an illustrative example of the potential applications of reinforcement learning in health care. More advanced algorithms for offline RL include conservative Q-learning [7]https://arxiv.org/abs/2006.04779 and variations of actor-critic algorithms [8]https://arxiv.org/abs/2106.06860. Implementations of many existing RL algorithms can be found in the Stable-Baselines3 [9]https://github.com/DLR-RM/stable-baselines3 repository.

If you are a machine learning expert or have access to longitudinal health care data suitable for reinforcement learning, please consider contributing to the Health Gym project.



We welcome any feedback or suggestions for improving the Health Gym data and software.

Ask a Question or Provide Feedback

Let us know how we can help

Report Issue or Request Feature

Let us know what we could do better

Submit a Pull Request or a New Example
