Antiretroviral Therapy in HIV

This dataset comprises viral loads, CD4 counts, and drug regimen information for 8,916 patients with HIV.

What you should know about this dataset

These synthetic data were generated based on the EuResist Integrated Database. The cohort was defined to mimic the dataset used in Parbhoo et al. [1], but modified to incorporate a published WHO guideline [2] for the standardisation of antiretroviral drug use. All synthetic records have a length of 60 months.

[1] Parbhoo, S., Bogojeska, J., Zazzi, M., Roth, V. & Doshi-Velez, F. Combining Kernel and Model Based Learning for HIV Therapy Selection. American Medical Informatics Association Summits on …
[2] World Health Organisation. Consolidated Guidelines on the Use of Antiretroviral Drugs for Treating and Preventing HIV Infection: Recommendations for a Public Health Approach (2016).


The following table lists the variables in the synthetic data together with descriptive statistics.

| ID | Variable name | Data type | Unit | Statistics |
|----|---------------|-----------|------|------------|
| 1 | Viral Load (VL) | numeric | copies/mL | Median: 54.77 (Q1: 16.51, Q3: 209.03) |
| 2 | Absolute Count for CD4 (CD4) | numeric | cells/mm3 | Median: 465.81 (Q1: 279.26, Q3: 840.34) |
| 3 | Relative Count for CD4 (Rel CD4) | numeric | cells/mm3 | Median: 25.57 (Q1: 18.20, Q3: 35.72) |
| 4 | Gender | binary | - | Male: 93.42%, Female: 6.58% |
| 5 | Ethnic | categorical (4 classes) | - | Asian: 0.47%; Afro: 2.55%; Caucasian: 26.81%; Other: 70.17% |
| 6 | Base Drug Combination | categorical (6 classes) | - | FTC + TDF: 73.66%; 3TC + ABC: 14.08%; FTC + TAF: 0.98%; DRV + FTC + TDF: 5.50%; FTC + RTVB + TDF: 2.30%; Other: 3.47% |
| 7 | Complementary INI | categorical (4 classes) | - | DTG: 11.96%; RAL: 0.49%; EVG: 4.69%; Not Applied: 82.86% |
| 8 | Complementary NNRTI | categorical (4 classes) | - | NVP: 0.19%; EFV: 9.27%; RPV: 43.76%; Not Applied: 46.78% |
| 9 | Extra PI | categorical (6 classes) | - | DRV: 0.69%; RTVB: 4.02%; LPV: 1.08%; RTV: 2.02%; ATV: 4.26%; Not Applied: 87.92% |
| 10 | Extra pk Enhancer (Extra pk-En) | binary | - | False: 96.70%, True: 3.30% |
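As a rough illustration of how the kinds of statistics listed above (median with quartiles for numeric variables, class percentages for categorical and binary ones) can be computed, here is a minimal sketch using only the Python standard library. The helper names `numeric_summary` and `category_percentages` and the toy values are hypothetical, and the quantile method is an assumption, since the page does not say which one was used for the published figures.

```python
from collections import Counter
from statistics import quantiles


def numeric_summary(values):
    """Return (Q1, median, Q3) for a list of numbers.

    Assumption: the "inclusive" quantile method; the method used for the
    published statistics is not stated.
    """
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return q1, median, q3


def category_percentages(labels):
    """Return {label: percentage of records} for a list of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100.0 * count / total for label, count in counts.items()}


# Toy values standing in for dataset columns (made up, not real data).
toy_vl = [10.0, 20.0, 30.0, 40.0, 50.0]
toy_gender = ["Male", "Male", "Male", "Female"]

vl_q1, vl_median, vl_q3 = numeric_summary(toy_vl)
gender_pct = category_percentages(toy_gender)

print(f"Median: {vl_median:.2f} (Q1: {vl_q1:.2f}, Q3: {vl_q3:.2f})")
print(gender_pct)
```

Applied to the real CSV, the same two helpers would be called once per column; the column names in the file should be checked before use.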

Comparison of marginal distributions

Comparison of marginal distributions between the real and synthetic data for continuous, categorical and binary variables.

Numeric variable comparisons: VL, CD4, and Rel CD4

Categorical variable comparisons: Ethnicity and components of the drug regimen

Binary variable comparisons: Gender, use of pk Enhancer, and the measurement (M) variables

Comparison of correlations

The matrices below show:

1) Static correlations between pairs of variables for all patients at any time, in the real and synthetic data.

2) Dynamic correlations between variables in the real and synthetic data. Time series for each patient were linearly decomposed into trends (indicating a general upward or downward slope) and cycles (indicating periodic patterns). Correlations were then computed between trends and cycles for all patients.
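The two kinds of correlation described above can be sketched as follows. This is only one plausible reading, under stated assumptions: the trend is taken to be a least-squares straight line through a patient's series, the cycle is whatever remains after subtracting that line, and correlations are plain Pearson coefficients for a single patient. The page does not specify the exact decomposition or how results are pooled across patients, so the functions `pearson` and `linear_trend_and_cycle` and the toy series are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def linear_trend_and_cycle(series):
    """Split a series into a fitted straight line (trend) and the residual (cycle).

    Assumption: "linearly decomposed" is read as an ordinary least-squares
    line fit over the time index 0, 1, 2, ...
    """
    n = len(series)
    times = list(range(n))
    mean_t = sum(times) / n
    mean_y = sum(series) / n
    slope = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, series)) / sum(
        (t - mean_t) ** 2 for t in times
    )
    intercept = mean_y - slope * mean_t
    trend = [intercept + slope * t for t in times]
    cycle = [y - tr for y, tr in zip(series, trend)]
    return trend, cycle


# Toy monthly series for one patient (made up): VL falls while CD4 rises,
# with opposite short-term oscillations.
wiggle = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
toy_vl = [100.0 - 10.0 * t + w for t, w in enumerate(wiggle)]
toy_cd4 = [300.0 + 20.0 * t - 2.0 * w for t, w in enumerate(wiggle)]

# 1) Static correlation: computed directly on the raw values.
static_corr = pearson(toy_vl, toy_cd4)

# 2) Dynamic correlations: computed separately on trends and on cycles.
vl_trend, vl_cycle = linear_trend_and_cycle(toy_vl)
cd4_trend, cd4_cycle = linear_trend_and_cycle(toy_cd4)
trend_corr = pearson(vl_trend, cd4_trend)
cycle_corr = pearson(vl_cycle, cd4_cycle)
```

In this toy case both the trends and the cycles move in opposite directions, so both dynamic correlations are strongly negative. A constant series would have zero variance and needs separate handling before calling `pearson`.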

Real Data


Health Gym data can be downloaded either directly from this website or using the Python API.

Install the Health Gym Python package with pip by running the following in your terminal:

    pip install healthgym

To download and access the data from within Python:

    import healthgym as hg

    hiv_data = hg.datasets.HIV(root='path/to/data/', download=True)


All datasets support the following parameters:

    root (string)
    # Root directory where the dataset exists, or where it will be saved if download is set to True.

    download (bool, optional)
    # If True, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.
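The behaviour described for `root` and `download` (fetch once, then reuse the local copy) can be illustrated with a small standalone sketch. This is not the package's implementation: `ensure_dataset`, `fake_fetch`, and the file name `hiv.csv` are hypothetical, and raising an error when the file is missing and `download` is False is an assumption, since the description above does not cover that case.

```python
import os
import tempfile


def ensure_dataset(root, filename, fetch, download=False):
    """Return the path to `filename` under `root`, fetching it only if needed.

    `fetch` is a caller-supplied function that writes the file to the given
    path (hypothetical stand-in for a real download step).
    """
    path = os.path.join(root, filename)
    if os.path.exists(path):
        return path  # already present: do not download again
    if not download:
        # Assumed behaviour; the source does not say what happens here.
        raise FileNotFoundError(f"{path} not found; pass download=True to fetch it")
    os.makedirs(root, exist_ok=True)
    fetch(path)
    return path


calls = []


def fake_fetch(path):
    """Record the call and write a placeholder file instead of downloading."""
    calls.append(path)
    with open(path, "w") as handle:
        handle.write("placeholder\n")


with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, "data")
    first = ensure_dataset(root, "hiv.csv", fake_fetch, download=True)
    second = ensure_dataset(root, "hiv.csv", fake_fetch, download=True)
```

The second call finds the file already in `root` and returns without invoking the fetch step, which mirrors the "not downloaded again" behaviour described for `download`.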

Download the dataset

Direct download as CSV file or view GitHub repository

Be part of the community
Here is how to reach out

Join our community on GitHub.