Following Franklin et al. (2014), we resampled the observed covariate and exposure data without modification in all simulated datasets, which preserves the associations among the following variables.
Variables used
Original demographic variables (9)
age, sex, education, race, marital status, income, born, cycle
Original behaviour variables (5)
smoking, diet, high cholesterol, physical activity, sleep
Original health history / access variables (2)
diabetes family history, medical access
Demographic, behaviour and health history / access variables are all binary or categorical variables.
Transformed lab variables (6) (complex forms)
Transformed.var.1 = log(globulin)
Transformed.var.2 = protein*calcium
Transformed.var.3 = (diastolicBP/systolicBP)^2
Transformed.var.4 = sqrt(uric acid+bilirubin)/2
Transformed.var.5 = phosphorus^2/(sodium*potassium)
Transformed.var.6 = log(systolicBP+10)
Original lab variables were: uric acid, protein, bilirubin, phosphorus, sodium, potassium, globulin, calcium, systolicBP, diastolicBP.
The six transformed lab variables (Transformed.var.1 through Transformed.var.6), rather than the original lab variables, were used in the model.
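The six transformations can be sketched in a few lines of Python. This is only an illustration: the dictionary keys and the example lab values are hypothetical, all values are assumed to be positive so that the logs, ratios and square root are defined, and the fourth transformation is read literally as the square root of (uric acid + bilirubin), then divided by 2.

```python
import math

def transform_labs(lab):
    """Return the six transformed lab variables for one participant.

    `lab` is a dict with hypothetical keys for the ten original lab variables.
    """
    return {
        "var1": math.log(lab["globulin"]),
        "var2": lab["protein"] * lab["calcium"],
        "var3": (lab["diastolicBP"] / lab["systolicBP"]) ** 2,
        # Literal reading of sqrt(uric acid+bilirubin)/2: root first, then halve.
        "var4": math.sqrt(lab["uric_acid"] + lab["bilirubin"]) / 2,
        "var5": lab["phosphorus"] ** 2 / (lab["sodium"] * lab["potassium"]),
        "var6": math.log(lab["systolicBP"] + 10),
    }

# Made-up values for a single participant (not taken from the data).
example = {
    "globulin": 3.0, "protein": 7.0, "calcium": 9.5,
    "diastolicBP": 80.0, "systolicBP": 120.0,
    "uric_acid": 5.25, "bilirubin": 1.0,
    "phosphorus": 3.5, "sodium": 140.0, "potassium": 4.0,
}
transformed = transform_labs(example)
```

Only the values in `transformed` would enter the outcome model; the ten original lab values are dropped after this step.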
Count-based prescription codes (1) (proxies of comorbidity)
Simple count = sum of selected ICD-10-CM codes:
\(\text{Simple count} = \sum_{s=1}^{94} R_s\), where \(R_s\) are the selected binary recurrence covariates.
We selected only those binary recurrence covariates whose relative risk (RR) with respect to the outcome was less than 0.8 or greater than 1.2. Of the 143 recurrence covariates, 94 met this criterion. The remaining 49 covariates were not used in calculating the Simple count variable and can be considered noise.
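The selection rule and the resulting count can be illustrated with a small, self-contained Python sketch. It assumes a crude (unadjusted) relative risk, which the text does not specify, and uses three made-up binary covariates (`rec_a`, `rec_b`, `rec_c`) in place of the 143 recurrence covariates.

```python
def relative_risk(covariate, outcome):
    """Crude RR: outcome risk when covariate == 1 over risk when covariate == 0."""
    with_cov = [y for x, y in zip(covariate, outcome) if x == 1]
    without_cov = [y for x, y in zip(covariate, outcome) if x == 0]
    if not with_cov or not without_cov:
        return None  # RR undefined when one group is empty
    risk1 = sum(with_cov) / len(with_cov)
    risk0 = sum(without_cov) / len(without_cov)
    if risk0 == 0:
        return None  # avoid division by zero
    return risk1 / risk0

def select_and_count(covariates, outcome, low=0.8, high=1.2):
    """Keep covariates with RR < low or RR > high, then sum them per person."""
    selected = []
    for name, column in covariates.items():
        rr = relative_risk(column, outcome)
        if rr is not None and (rr < low or rr > high):
            selected.append(name)
    simple_count = [
        sum(covariates[name][i] for name in selected)
        for i in range(len(outcome))
    ]
    return selected, simple_count

# Toy data for eight people (hypothetical, for illustration only).
outcome = [1, 1, 0, 0, 1, 0, 1, 0]
covariates = {
    "rec_a": [1, 1, 0, 0, 1, 0, 0, 0],  # RR = 1.0 / 0.2 = 5
    "rec_b": [1, 0, 1, 0, 0, 1, 0, 1],  # RR = 0.25 / 0.75 = 1/3
    "rec_c": [1, 0, 1, 0, 0, 0, 0, 0],  # RR = 0.5 / 0.5 = 1 (noise)
}
selected, simple_count = select_and_count(covariates, outcome)
```

In this toy example `rec_c` plays the role of the 49 noise covariates: its RR falls inside [0.8, 1.2], so it does not contribute to the count.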
True outcome model
Diabetes (outcome) = Obese (exposure) +
demographic/behaviour/health history variables +
transformed lab variables +
simple count with selected ICD-10 codes
The outcome model formula dictates how the outcome depends on the exposure and on each covariate, whereas the observed associations between the exposure and the covariates are retained from the data. Some of these covariates may not be associated with the exposure, but their association with the outcome is guaranteed by the outcome model. To mimic the associations in the original data, we also retained the observed associations between the outcome and the covariates.
- Size of each cohort = 3,000
- 500 simulations/cohort generation
- Outcome rate = 0.4
- Exposure prevalence = 0.2
- True exposure odds ratio = 1 (equivalently, risk difference = 0)
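The settings above can be sketched for a single cohort as follows. This is a rough illustration rather than the procedure actually used: it assumes a logistic outcome model, resampling with replacement within exposure groups to fix the exposure prevalence, and an intercept chosen by bisection to reach the target outcome rate. The source rows, column names and coefficient values are all hypothetical.

```python
import math
import random

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def resample_cohort(rows, size, exposure_prev, rng):
    """Sample rows with replacement, separately for exposed and unexposed."""
    exposed = [r for r in rows if r["obese"] == 1]
    unexposed = [r for r in rows if r["obese"] == 0]
    n_exposed = round(size * exposure_prev)
    return (rng.choices(exposed, k=n_exposed)
            + rng.choices(unexposed, k=size - n_exposed))

def calibrate_intercept(linear_predictors, target_rate, lo=-20.0, hi=20.0):
    """Bisection search for the intercept whose mean probability hits the target."""
    for _ in range(60):
        mid = (lo + hi) / 2
        mean_p = sum(expit(mid + lp) for lp in linear_predictors) / len(linear_predictors)
        if mean_p < target_rate:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def simulate_cohort(rows, coefs, size=3000, exposure_prev=0.2,
                    outcome_rate=0.4, seed=0):
    rng = random.Random(seed)
    cohort = resample_cohort(rows, size, exposure_prev, rng)
    # Covariates and exposure are kept as observed; only the outcome is simulated.
    lps = [sum(coefs[k] * r[k] for k in coefs) for r in cohort]
    intercept = calibrate_intercept(lps, outcome_rate)
    probs = [expit(intercept + lp) for lp in lps]
    outcomes = [1 if rng.random() < p else 0 for p in probs]
    return cohort, probs, outcomes

# Hypothetical source data and coefficients (exposure log-odds ratio 0, i.e. OR = 1).
source_rng = random.Random(42)
source = [{"obese": int(i % 3 == 0),
           "simple_count": source_rng.randint(0, 5),
           "smoking": source_rng.randint(0, 1)} for i in range(300)]
coefs = {"obese": 0.0, "simple_count": 0.3, "smoking": 0.4}

cohort, probs, outcomes = simulate_cohort(source, coefs, seed=1)
```

Repeating the call with 500 different seeds would give the 500 simulated cohorts; a single cohort is generated here to keep the example quick to run.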
We kept the coefficients of all remaining covariates at the values estimated from the original data. Only the association of the ‘simple count’ with the outcome was amplified 5 times. By exaggerating the effect of the ‘simple count’ variable, we aimed to simulate a scenario in which a strong unmeasured confounder may exist.
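As a minimal sketch of this step, assuming the coefficients are on the log-odds scale of a logistic outcome model, the "true" coefficients could be derived from the fitted ones as below. The coefficient names and values are hypothetical, not estimates from the data.

```python
import math

def build_true_coefficients(fitted, exposure="obese", amplified="simple_count",
                            factor=5.0, true_odds_ratio=1.0):
    """Copy the fitted coefficients, fix the exposure effect, amplify one covariate."""
    true = dict(fitted)                               # leave the fitted values untouched
    true[exposure] = math.log(true_odds_ratio)        # OR = 1 gives a log-odds of 0
    true[amplified] = fitted[amplified] * factor      # 5 times the fitted association
    return true

# Hypothetical fitted log-odds coefficients (illustration only).
fitted = {"obese": 0.7, "smoking": 0.3, "simple_count": 0.12}
true_coefs = build_true_coefficients(fitted)
```

All other coefficients pass through unchanged, so the simulated outcome keeps the covariate associations estimated from the original data.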
Franklin, Jessica M, Sebastian Schneeweiss, Jennifer M Polinski, and Jeremy A Rassen. 2014. “Plasmode Simulation for the Evaluation of Pharmacoepidemiologic Methods in Complex Healthcare Databases.” Computational Statistics & Data Analysis 72: 219–26.
Karim, Mohammad Ehsanul, Menglan Pang, and Robert W Platt. 2018. “Can We Train Machine Learning Methods to Outperform the High-Dimensional Propensity Score Algorithm?” Epidemiology 29 (2): 191–98.
Pang, Menglan, Tibor Schuster, Kristian B Filion, Mireille E Schnitzer, Maria Eberg, and Robert W Platt. 2016. “Effect Estimation in Point-Exposure Studies with Binary Outcomes and High-Dimensional Covariate Data–a Comparison of Targeted Maximum Likelihood Estimation and Inverse Probability of Treatment Weighting.” The International Journal of Biostatistics 12 (2).