
Conducting Testing By Nic 2nd-March-2022
This dataset comprises viral loads, CD4 counts, and drug regimen information for 8,916 patients with HIV.
What you should know about this dataset
- Variables
- Distributions
- Correlations
- Setup
Variables
The following table lists the variables in the synthetic data, together with descriptive statistics.
ID | Variable name | Data type | Unit | Statistics |
---|---|---|---|---|
1 | Viral Load (VL) | numeric | copies/mL | Median: 54.77 (Q1: 16.51, Q3: 209.03) |
2 | Absolute Count for CD4 (CD4) | numeric | cells/mm3 | Median: 465.81 (Q1: 279.26, Q3: 840.34) |
3 | Relative Count for CD4 (Rel CD4) | numeric | cells/mm3 | Median: 25.57 (Q1: 18.20, Q3: 35.72) |
4 | Gender | binary | - | Male: 93.42%, Female: 6.58% |
5 | Ethnic | categorical | - | 4 classes: Asian: 0.47%; Afro: 2.55%; Caucasian: 26.81%; Other: 70.17% |
6 | Base Drug Combination | categorical | - | 6 classes: FTC + TDF: 73.66%; 3TC + ABC: 14.08%; FTC + TAF: 0.98%; DRV + FTC + TDF: 5.50%; FTC + RTVB + TDF: 2.30%; Other: 3.47% |
7 | Complementary INI | categorical | - | 4 classes: DTG: 11.96%; RAL: 0.49%; EVG: 4.69%; Not Applied: 82.86% |
8 | Complementary NNRTI | categorical | - | 4 classes: NVP: 0.19%; EFV: 9.27%; RPV: 43.76%; Not Applied: 46.78% |
9 | Extra PI | categorical | - | 6 classes: DRV: 0.69%; RTVB: 4.02%; LPV: 1.08%; RTV: 2.02%; ATV: 4.26%; Not Applied: 87.92% |
10 | Extra pk Enhancer (Extra pk-En) | binary | - | False: 96.70%, True: 3.30% |
11 | VL Measured (VL (M)) | binary | - | False: 79.35%, True: 20.65% |
12 | CD4 (M) | binary | - | False: 83.39%, True: 16.61% |
13 | Drug Recorded (Drug (M)) | binary | - | False: 15.56%, True: 84.44% |
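The statistics in the table (median with quartiles for numeric variables, class percentages for the others) can be recomputed from a tabular copy of the data. The following is a minimal sketch using pandas; the helper `summarize`, the column names, and the toy values are assumptions for illustration and are not taken from the Health Gym files.

```python
import pandas as pd

def summarize(df, numeric_cols, categorical_cols):
    """Return median/Q1/Q3 for numeric columns and class percentages for the rest."""
    numeric = {}
    for col in numeric_cols:
        q1, median, q3 = df[col].quantile([0.25, 0.5, 0.75])
        numeric[col] = {"median": median, "q1": q1, "q3": q3}
    categorical = {
        col: (df[col].value_counts(normalize=True) * 100).round(2).to_dict()
        for col in categorical_cols
    }
    return numeric, categorical

# Tiny made-up frame (hypothetical column names, not real Health Gym data)
toy = pd.DataFrame({
    "VL": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Gender": ["Male", "Male", "Male", "Male", "Female"],
})
numeric_stats, class_shares = summarize(toy, ["VL"], ["Gender"])
print(numeric_stats)
print(class_shares)
```

With the real CSV, the same helper would be called on the loaded frame, using whatever column names the file actually contains.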
Comparison of marginal distributions
Numeric variable comparisons
VL, CD4, and Rel CD4
- Synthetic
- Real
Categorical variable comparisons
Ethnicity and Components of the Drug Regimen
- Synthetic
- Real
Binary variable comparisons
Gender, use of pk Enhancer, and the measurement (M) variables
- Synthetic
- Real
Comparison of correlations
The matrices below show:
1) Static correlations between pairs of variables for all patients at any time, in the real and synthetic data.
2) Dynamic correlations between variables in the real and synthetic data. Time series for each patient were linearly decomposed into trends (indicating a general upward or downward slope) and cycles (indicating periodic patterns). Correlations were then computed between trends and cycles for all patients.
- Static: Real Data, Synthetic
- Trends: Real Data, Synthetic
- Cycles: Real Data, Synthetic
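The static and dynamic correlations described above can be sketched as follows. This is an illustration under stated assumptions, not the method used to produce the matrices: the page does not name the linear decomposition, so a least-squares straight line stands in for the trend and the residual stands in for the cycle. The column names (`patient_id`, `VL`, `CD4`) and the random toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def decompose(series):
    """Split one series into a fitted straight-line trend and the residual cycle."""
    t = np.arange(len(series))
    slope, intercept = np.polyfit(t, series.to_numpy(dtype=float), deg=1)
    trend = pd.Series(slope * t + intercept, index=series.index)
    return trend, series - trend

def dynamic_correlations(df, id_col, value_cols):
    """Correlate trend and cycle components pooled over all patients."""
    trends, cycles = [], []
    for _, patient in df.groupby(id_col):
        parts = {col: decompose(patient[col]) for col in value_cols}
        trends.append(pd.DataFrame({c: p[0] for c, p in parts.items()}))
        cycles.append(pd.DataFrame({c: p[1] for c, p in parts.items()}))
    return pd.concat(trends).corr(), pd.concat(cycles).corr()

# Toy data: three hypothetical patients with 20 time steps each
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "patient_id": np.repeat([1, 2, 3], 20),
    "VL": rng.normal(size=60),
    "CD4": rng.normal(size=60),
})

# Static: correlations over all patients and time steps
static_corr = toy[["VL", "CD4"]].corr()
# Dynamic: correlations between trends and between cycles
trend_corr, cycle_corr = dynamic_correlations(toy, "patient_id", ["VL", "CD4"])
```

Running the same functions on the real and the synthetic data would give two sets of matrices to compare side by side.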
Setup
Health Gym data can be downloaded either directly from this website or using the Python API.
Install the Health Gym Python package through pip by running the following in your terminal.
pip install healthgym
To download and access the data from within Python:
import healthgym as hg
hiv_data = hg.datasets.HIV(root='path/to/data/', download=True)
All datasets support the following parameters.
root (string)
# Root directory of dataset where dataset exists or will be saved to if download is set to True.
download (bool, optional)
# If True, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.
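The behaviour described for `root` and `download` can be sketched in a few lines. This is not the healthgym implementation: `ensure_dataset`, the file name `hiv.csv`, and the stand-in `fake_fetch` are hypothetical, and raising an error when the file is missing and `download` is False is an assumption.

```python
from pathlib import Path
import tempfile

def ensure_dataset(root, filename, fetch, download=False):
    """Return the dataset path, fetching it only when missing and download is True."""
    path = Path(root) / filename
    if path.exists():
        return path  # already present, so it is not fetched again
    if not download:
        # Assumed behaviour; the real package may handle this case differently
        raise FileNotFoundError(f"{path} not found; pass download=True to fetch it")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(fetch())
    return path

# Stand-in fetcher that records how often it is called (no network access)
calls = []
def fake_fetch():
    calls.append(1)
    return b"VL,CD4\n"

with tempfile.TemporaryDirectory() as tmp:
    first = ensure_dataset(tmp, "hiv.csv", fake_fetch, download=True)
    second = ensure_dataset(tmp, "hiv.csv", fake_fetch, download=True)
    fetch_count = len(calls)
```

The second call finds the file already in `root` and returns its path without calling the fetcher again.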
Download the dataset
Direct download as CSV file or view GitHub repository