Chapter 1 Plasmode simulation

Healthcare claims databases contain numerous (usually thousands of) collected variables. Simulating such a high-dimensional dataset is problematic in a Monte Carlo study because it is difficult to recreate a realistic data generating process that takes into account the associations among a large number of covariates under consideration. Plasmode is a simulation technique that relies on resampling to obtain data that preserve the empirical associations among the covariates. During plasmode simulation, the analyst can assign a desired value for the true treatment effect in the data generating process. Such a plasmode study begins with an existing cohort and an assumed data generating process, as in the following equation, and the existing cohort is then modified by injecting known effects (signals) into it.

\[\begin{eqnarray}\label{plasmodeequation} \text{logit}\big[\Pr(Y = 1)\big] = \alpha_0 + \theta \times \alpha_1 T + \gamma \times \alpha_2 X, \end{eqnarray}\] where \(Y\) is the outcome (e.g., all-cause mortality following an acute myocardial infarction), \(T\) is the treatment indicator (whether or not the patient was treated with a statin), and \(X\) is the high-dimensional covariate matrix that includes the important investigator-specified covariates (listed in eTable ), additional investigator-specified covariates (listed in eTable ) and the empirical covariates created by running the hdPS algorithm on the complete statin user dataset of \(32,792\) patients. These empirical variables should act as proxies or surrogates for the unmeasured confounders. As for the parameters in the equation, \(\alpha_0\) is the intercept, \(\alpha_1\) is the treatment effect, \(\alpha_2\) is the vector of effects associated with the covariates in \(X\), \(\theta\) is the treatment effect multiplier, and \(\gamma\) is the covariate effect multiplier.
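
To make the outcome-generating model concrete, the following sketch shows how outcomes could be drawn from this logistic model. It is an illustrative Python example only (the function name `simulate_outcome` is hypothetical), not the implementation used in the studies cited below; the multipliers \(\theta\) and \(\gamma\) simply scale the empirical \(\alpha_1\) and \(\alpha_2\) estimates.

```python
import numpy as np

def simulate_outcome(T, X, alpha0, alpha1, alpha2, theta=1.0, gamma=1.0, rng=None):
    """Draw binary outcomes from the plasmode outcome-generating model:
    logit[Pr(Y = 1)] = alpha0 + theta * alpha1 * T + gamma * alpha2 . X
    """
    rng = np.random.default_rng() if rng is None else rng
    linpred = alpha0 + theta * alpha1 * T + gamma * (X @ alpha2)
    prob = 1.0 / (1.0 + np.exp(-linpred))   # inverse logit
    return rng.binomial(1, prob)             # Y ~ Bernoulli(prob)
```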

The plasmode simulation algorithm samples exposed and unexposed subjects with replacement from the empirical dataset in such a way that guarantees a desired study size (\(m\)) and a prevalence of exposure (\(p_E\)) in the simulated plasmode samples (J. M. Franklin et al. 2014, 2015; Jessica M. Franklin et al. 2017). Also, this simulation algorithm allows researchers to specify the intercept value in the outcome-generating model to guarantee a desired prevalence of outcome (\(p_Y\)) (J. M. Franklin et al. 2014, 2015).
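
A minimal sketch of these two devices, continuing the illustrative Python example above (the helper names `resample_by_exposure` and `calibrate_intercept` are hypothetical): exposed and unexposed subjects are drawn with replacement in the proportions needed to reach \(m\) and \(p_E\), and the intercept \(\alpha_0\) is then tuned by bisection so that the expected outcome prevalence matches \(p_Y\).

```python
import numpy as np

def resample_by_exposure(T, m, p_E, rng):
    """Draw row indices with replacement so the plasmode sample has
    size m and exposure prevalence p_E."""
    exposed = np.flatnonzero(T == 1)
    unexposed = np.flatnonzero(T == 0)
    n_exposed = int(round(m * p_E))
    idx = np.concatenate([rng.choice(exposed, n_exposed, replace=True),
                          rng.choice(unexposed, m - n_exposed, replace=True)])
    return rng.permutation(idx)

def calibrate_intercept(T, X, alpha1, alpha2, theta, gamma, p_Y):
    """Find alpha0 (by bisection) so that the mean outcome probability
    under the outcome-generating model equals p_Y."""
    offset = theta * alpha1 * T + gamma * (X @ alpha2)
    lo, hi = -20.0, 20.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        mean_prob = np.mean(1.0 / (1.0 + np.exp(-(mid + offset))))
        lo, hi = (mid, hi) if mean_prob < p_Y else (lo, mid)
    return (lo + hi) / 2.0
```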

Methodologically, the plasmode simulation generates realistic data by controlling the relationship with the outcome: the \(\alpha_2\) estimates (parameter estimates associated with the covariates) in the outcome-generating model are kept the same as the estimates obtained by fitting the model to the empirical data. The plasmode simulation uses resampling techniques such as the bootstrap to select patients into a given sample with replacement. Here, the bootstrap samples (of specified size \(m\)) are drawn from the complete covariate-exposure matrix \(Z = (T, X)\). Because none of the variables in \(Z\) are permuted or modified in any way, the relationships between exposure and covariates should remain intact in each bootstrap sample of a reasonable size (J. M. Franklin et al. 2014). Therefore, the covariate-outcome relationships are controlled by fixing the \(\alpha_2\) values in the outcome-generating model, and the bootstrap ensures that the joint distribution of exposure and covariates is unaltered, so there is no obvious reason why the covariate-exposure relationships should differ in the plasmode samples. In that sense, in the plasmode simulation, the 'amount of confounding' from a covariate (i.e., the relationship of a covariate with the outcome as well as with the exposure, both of which are required for a covariate to be considered a confounder) is controlled (J. M. Franklin et al. 2014).
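
Putting these pieces together, one possible end-to-end workflow is sketched below. This is again a hedged Python illustration that reuses the hypothetical helpers above (here `statsmodels` is used only as one convenient way to obtain the empirical \(\hat{\alpha}\) estimates), not the original plasmode implementation.

```python
import numpy as np
import statsmodels.api as sm

def plasmode_samples(Y, T, X, n_sim, m, p_E, p_Y, theta=1.0, gamma=1.0, seed=123):
    """Generate n_sim plasmode datasets (Y_b, T_b, X_b) from an empirical cohort."""
    rng = np.random.default_rng(seed)

    # 1. Fit the outcome model once on the empirical data; the resulting
    #    alpha1 (treatment) and alpha2 (covariate) estimates are held fixed.
    design = sm.add_constant(np.column_stack([T, X]))
    fit = sm.Logit(Y, design).fit(disp=0)
    alpha1, alpha2 = fit.params[1], fit.params[2:]

    for _ in range(n_sim):
        # 2. Bootstrap the covariate-exposure matrix Z = (T, X) intact,
        #    stratified by exposure so the sample hits m and p_E.
        idx = resample_by_exposure(T, m, p_E, rng)
        T_b, X_b = T[idx], X[idx]

        # 3. Tune the intercept to reach the desired outcome prevalence p_Y,
        #    then draw outcomes with the multipliers theta and gamma applied.
        alpha0 = calibrate_intercept(T_b, X_b, alpha1, alpha2, theta, gamma, p_Y)
        Y_b = simulate_outcome(T_b, X_b, alpha0, alpha1, alpha2, theta, gamma, rng)
        yield Y_b, T_b, X_b
```

Because \(\alpha_2\) is estimated once and then held fixed, and the rows of \(Z = (T, X)\) are resampled intact, the covariate-outcome and covariate-exposure relationships in each plasmode sample mirror those in the empirical cohort, as described above.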

However, among other things, this simulation mechanism does allow researchers to change the multipliers of the treatment effect and the covariate effects by changing the \(\theta\) parameter value and the \(\gamma\) parameter vector, respectively. For certain combinations of these parameter values, it is possible that an important confounder in the empirical study may not remain important in the plasmode samples. Future research should investigate this issue further. Plasmode simulations are built on a given empirical data setting, and the generalizability of the results is an issue for such simulations. See Karim, Pang, and Platt (2018) for an example of the use of plasmode simulation.

References

Franklin, J. M., W. Eddings, R. J. Glynn, and S. Schneeweiss. 2015. “Regularized Regression Versus the High-Dimensional Propensity Score for Confounding Adjustment in Secondary Database Analyses.” American Journal of Epidemiology 182 (7): 651–59.
Franklin, J. M., S. Schneeweiss, J. M. Polinski, and J. A. Rassen. 2014. “Plasmode Simulation for the Evaluation of Pharmacoepidemiologic Methods in Complex Healthcare Databases.” Computational Statistics & Data Analysis 72: 219–26.
Franklin, Jessica M., Wesley Eddings, Peter C. Austin, Elizabeth A. Stuart, and Sebastian Schneeweiss. 2017. “Comparing the Performance of Propensity Score Methods in Healthcare Database Studies with Rare Outcomes.” Statistics in Medicine. https://doi.org/10.1002/sim.7250.
Karim, Mohammad Ehsanul, Menglan Pang, and Robert W. Platt. 2018. “Can We Train Machine Learning Methods to Outperform the High-Dimensional Propensity Score Algorithm?” Epidemiology 29 (2): 191–98.