Plasmode Simulation

Following Franklin et al. (2014), the observed covariate and exposure data are resampled without modification in every simulated dataset, which preserves the associations among the variables listed below.
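The resampling step can be sketched as follows. This is a minimal illustration, assuming the covariates sit in a NumPy array with one row per subject; the function name `resample_cohort` and the toy data are hypothetical and not taken from Franklin et al.

```python
import numpy as np

def resample_cohort(covariates, exposure, size, rng):
    """Draw subjects with replacement, keeping each subject's
    covariates and exposure together so their associations are preserved."""
    idx = rng.integers(0, len(exposure), size=size)
    return covariates[idx], exposure[idx], idx

rng = np.random.default_rng(2014)

# Hypothetical toy data: 10 subjects, 3 covariates, binary exposure.
covariates = rng.normal(size=(10, 3))
exposure = rng.integers(0, 2, size=10)

X_sim, a_sim, idx = resample_cohort(covariates, exposure, size=6, rng=rng)
```

Because whole rows are drawn, no covariate or exposure value is altered; only the outcome is generated afresh in each simulated dataset.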

Variables used

Original demographic variables (9)

age, sex, education, race, marital status, income, born, cycle 

Original behaviour variables (5)

smoking, diet, high cholesterol, physical activity, sleep

Demographic, behaviour and health history / access variables are all binary or categorical variables.

Original health history / access variables (2)

diabetes family history, medical access

Transformed lab variables (6) (complex forms)

Transformed.var.1 = log(globulin)
Transformed.var.2 = protein*calcium
Transformed.var.3 = (diastolicBP/systolicBP)^2
Transformed.var.4 = sqrt(uric acid+bilirubin)/2
Transformed.var.5 = phosphorus^2/(sodium*potassium)
Transformed.var.6 = log(systolicBP+10)

Original lab variables were: uric acid, protein, bilirubin, phosphorus, sodium, potassium, globulin, calcium, systolicBP, diastolicBP.

Transformed var.1 + var.2 + var.3 + var.4 + var.5 + var.6 were used in the model (that is, the transformed lab variables replaced the original lab variables).
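As an illustration, the six transformations could be computed as below. This is a sketch under assumptions: the dictionary keys are hypothetical column names, var.3 is read as the squared ratio of diastolic to systolic BP, and var.4 is read literally as the square root divided by 2.

```python
import numpy as np

def transform_labs(lab):
    """lab: dict of 1-D arrays keyed by (hypothetical) lab variable names."""
    return {
        "var1": np.log(lab["globulin"]),
        "var2": lab["protein"] * lab["calcium"],
        # Assumed reading: squared diastolic-to-systolic ratio.
        "var3": (lab["diastolicBP"] / lab["systolicBP"]) ** 2,
        # Assumed reading: square root of the sum, then halved.
        "var4": np.sqrt(lab["uric_acid"] + lab["bilirubin"]) / 2,
        "var5": lab["phosphorus"] ** 2 / (lab["sodium"] * lab["potassium"]),
        "var6": np.log(lab["systolicBP"] + 10),
    }

# Hypothetical values for a single subject.
lab = {
    "globulin": np.array([1.0]), "protein": np.array([7.0]),
    "calcium": np.array([9.0]), "diastolicBP": np.array([80.0]),
    "systolicBP": np.array([120.0]), "uric_acid": np.array([5.0]),
    "bilirubin": np.array([4.0]), "phosphorus": np.array([4.0]),
    "sodium": np.array([140.0]), "potassium": np.array([4.0]),
}
transformed = transform_labs(lab)
```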

Count based prescription codes (1) (proxies of comorbidity)

Simple count = sum of selected ICD-10 CM codes

Simple count \(= \sum_{s=1}^{94} R_s\), where \(R_s\) are the selected recurrence covariates.

Using only a partial list of recurrence covariates

We selected only those binary recurrence covariates whose relative risk (RR) with the outcome was less than 0.8 or greater than 1.2. Of the 143 recurrence covariates, 94 met this criterion. The remaining 49 covariates were therefore not used in calculating the Simple count variable and can be considered noise.
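The selection rule and the resulting count can be sketched as follows. This is an illustrative example only: the function names and toy data are hypothetical, and the RR is taken here as the outcome risk among subjects with the covariate divided by the risk among those without it.

```python
import numpy as np

def relative_risk(covariate, outcome):
    """Outcome risk when covariate == 1 divided by risk when covariate == 0."""
    risk1 = outcome[covariate == 1].mean()
    risk0 = outcome[covariate == 0].mean()
    return risk1 / risk0

def simple_count(R, outcome, lower=0.8, upper=1.2):
    """Sum the binary columns of R whose RR falls outside [lower, upper]."""
    rr = np.array([relative_risk(R[:, j], outcome) for j in range(R.shape[1])])
    selected = (rr < lower) | (rr > upper)
    return R[:, selected].sum(axis=1), selected

# Hypothetical toy data: 8 subjects, 3 binary recurrence covariates.
outcome = np.array([1, 1, 1, 0, 0, 0, 1, 0])
R = np.column_stack([
    [1, 1, 1, 0, 0, 0, 0, 0],  # RR = 5    -> selected
    [1, 1, 0, 1, 1, 0, 0, 0],  # RR = 1    -> treated as noise
    [0, 0, 0, 1, 1, 1, 1, 0],  # RR = 1/3  -> selected
])
count, selected = simple_count(R, outcome)
```

In the toy data the second column has an RR of exactly 1, so it is excluded from the count in the same way the 49 noise covariates are.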

True outcome model

Diabetes (outcome) =  Obese (exposure) + 
  
                      demographic/behaviour/health history variables + 
  
                      transformed lab variables +
  
                      simple count with selected ICD-10 codes

Role of variables

The outcome model formula dictates the relationship between the outcome and the exposure and each covariate, whereas the observed associations between the exposure and the covariates are retained from the data. Some of these covariates may not be associated with the exposure, but their association with the outcome is guaranteed by the outcome model. To mimic the associations in the original data, we also retained the observed associations between the outcome and the covariates.

Simulation setup
  • Size of each cohort = 3,000
  • 500 simulations/cohort generation
  • Outcome rate = 0.4
  • Exposure prevalence = 0.2
  • True exposure effect: odds ratio = 1 (risk difference = 0)

We kept the coefficients of all other covariates consistent with the original data. Only the association of the ‘simple count’ with the outcome was amplified, by a factor of 5. By exaggerating the effect of the ‘simple count’ variable, we aimed to simulate a scenario in which a strong unmeasured confounder may exist.
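One way to generate outcomes under this setup is sketched below. It assumes a logistic outcome model and calibrates the intercept by bisection to reach the target outcome rate; both choices, along with every name, coefficient value and toy dataset, are assumptions made for illustration and are not stated in the source.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def calibrate_intercept(linpred, target_rate, lo=-20.0, hi=20.0, tol=1e-8):
    """Bisection on the intercept so the mean outcome probability hits target_rate."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expit(mid + linpred).mean() < target_rate:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def simulate_outcome(exposure, covariates, count, beta_cov, beta_count, rng,
                     log_or_exposure=0.0, amplify=5.0, target_rate=0.4):
    """Binary outcome with a null exposure effect (log OR = 0) and an
    amplified coefficient on the simple count."""
    linpred = (log_or_exposure * exposure
               + covariates @ beta_cov
               + amplify * beta_count * count)
    b0 = calibrate_intercept(linpred, target_rate)
    p = expit(b0 + linpred)
    return rng.binomial(1, p), p

rng = np.random.default_rng(500)
n = 3000                                  # cohort size from the setup
exposure = rng.binomial(1, 0.2, size=n)   # toy stand-in for resampled exposure
covariates = rng.normal(size=(n, 4))      # hypothetical covariates
count = rng.poisson(3, size=n)            # hypothetical simple count
beta_cov = np.array([0.3, -0.2, 0.1, 0.4])  # hypothetical coefficients
y, p = simulate_outcome(exposure, covariates, count, beta_cov, 0.1, rng)
```

Repeating the resampling and outcome generation 500 times would give the full set of simulated cohorts; with the exposure log odds ratio fixed at 0, any estimated exposure effect reflects confounding or sampling variability.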