17  Statistical Approaches

17.1 Background

Recent work compares multiple variable selection strategies for hdPS analysis (Karim and Lei 2025). The study aims to identify methods that best balance bias, precision, and computational cost in causal inference using observational data. It is based on NHANES 2013–2018 data evaluating the association between obesity and diabetes.

17.2 Simulation Design

Element Details
Data Source NHANES 2013–2018
Sample Size 3,000 participants per iteration
Iterations 500
Prevalence Scenarios 1. Frequent exposure & frequent outcome
2. Rare exposure & frequent outcome
3. Frequent exposure & rare outcome
True Effect OR = 1 (null); RD = 0
Outcome Generation Included nonlinear transforms, interactions, and a comorbidity index from 94 proxies
Noise Variables 48 of 142 proxy covariates used as noise

17.3 Methods Compared

Method Description
Kitchen Sink Includes all investigator and proxy covariates (no selection)
Bross hdPS Selects top 100 proxies using the Bross formula
Hybrid (Bross + LASSO) First applies Bross, then refines with LASSO
LASSO Penalized regression with cross-validation
Elastic Net Combines LASSO and Ridge penalties to handle collinearity
Random Forest Ranks variables by importance using Gini impurity
XGBoost Boosted trees optimizing impurity reduction
Forward Selection Adds variables sequentially based on adjusted R²
Backward Elimination Removes variables iteratively based on adjusted R²
Genetic Algorithm Evolves variable subsets via stochastic search

17.4 Simulation Results

Figure 1. Bias across Methods in NHANES Plasmode Simulation

Figure 2. Coverage across Methods in NHANES Plasmode Simulation

See interactive results: 👉 Shiny App

17.5 Key Takeaways

  • Simpler methods (Forward/Backward selection) offer strong coverage with efficiency.
  • Bross-based and Hybrid hdPS methods remain reliable and interpretable.
  • Method choice should reflect the specific inferential goal: bias reduction vs variance minimization.