Using Machine Learning and Robust Methods to Harness High-Dimensional (HD) Proxies from Administrative Data for Minimizing Residual Confounding
25 Sep 2025
Slides: https://ehsanx.github.io/DRhdPS-Slides
This talk draws on several recent papers published in PDS, Epidemiology, Am Stat, PLoS ONE, Res Methods Med Health Sci, Am J Epidemiol, Pharm Stat, and J Clin Epidemiol.
Case Study
Methods
Results
Comparative performance of methods in HD settings.
Summary & Take-Home Messages
Practical guidance for HD proxy adjustment.
Variable | Description |
---|---|
A | Disease-modifying drug (DMD) use vs. no use |
Y | All-cause mortality |
Data source | Cohort of 19,360 with MS in BC (1996–2017) |
Clinical factors (disease severity, disability level) and lifestyle factors (smoking, alcohol use, physical activity) are unmeasured.
Modified disjunctive cause criterion (VanderWeele et al. 2019)
Unmeasured | Potential Proxy from BC Admin Data |
---|---|
Disease severity | Hospital/ER visits, specialist consults, symptom-drug use |
Disability | Mobility aid codes, rehab visits, fall-related hospitalizations |
Depression/Anxiety | Antidepressant Rx, mental health codes |
Smoking | Smoking-cessation Rx, COPD/lung cancer codes |
Alcohol use | Liver disease, alcohol-related emergencies |
Adjusting for too many proxies in the outcome regression can be problematic!
Regression: Y ~ A + C + P, where C denotes the measured confounders and P the proxies
PS: estimate P(A | C, P), then fit Y ~ A in the matched or weighted sample
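To make the PS route concrete, here is a minimal sketch of inverse probability weighting in plain Python. It assumes a single discrete `stratum` that stands in for the joint levels of C and P, and it estimates the PS as the empirical treated proportion in each stratum rather than with a logistic or machine learning model. The function names and the toy rows are hypothetical and are not taken from the MS cohort.

```python
from collections import defaultdict


def crude_risk_difference(rows):
    """Unadjusted risk difference: mean(Y | A=1) - mean(Y | A=0)."""
    treated = [y for a, y, _ in rows if a == 1]
    untreated = [y for a, y, _ in rows if a == 0]
    return sum(treated) / len(treated) - sum(untreated) / len(untreated)


def ipw_risk_difference(rows):
    """IPW risk difference with a stratum-specific propensity score.

    rows: list of (a, y, stratum) tuples with binary a and y.
    The stratum is an illustrative stand-in for the levels of (C, P).
    """
    # Step 1: PS = P(A=1 | stratum), estimated as an empirical proportion.
    n = defaultdict(int)
    n_treated = defaultdict(int)
    for a, _, s in rows:
        n[s] += 1
        n_treated[s] += a
    ps = {s: n_treated[s] / n[s] for s in n}

    # Positivity check: every stratum needs both treated and untreated.
    if any(e in (0.0, 1.0) for e in ps.values()):
        raise ValueError("positivity violated in at least one stratum")

    # Step 2: Y ~ A in the weighted sample (normalized weighted means).
    num1 = den1 = num0 = den0 = 0.0
    for a, y, s in rows:
        if a == 1:
            w = 1.0 / ps[s]
            num1 += w * y
            den1 += w
        else:
            w = 1.0 / (1.0 - ps[s])
            num0 += w * y
            den0 += w
    return num1 / den1 - num0 / den0


# Toy data (made up): within each stratum the risk is identical in both
# arms, but treatment is more common in the low-risk stratum.
rows = (
    [(1, 1, 0), (1, 0, 0)]                      # stratum 0, treated
    + [(0, 1, 0)] * 3 + [(0, 0, 0)] * 3         # stratum 0, untreated
    + [(1, 0, 1)] * 6                           # stratum 1, treated
    + [(0, 0, 1)] * 2                           # stratum 1, untreated
)

crude = crude_risk_difference(rows)
adjusted = ipw_risk_difference(rows)
```

In this toy example the crude contrast suggests a protective effect, while the weighted contrast is essentially zero, because only the weights (not the outcome model) carry the adjustment for the stratum.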
Schneeweiss et al. (2009, 2018): use health services utilization data as proxies, even when the codes are not directly interpretable for the research question.
Proxy | Measure | Codes |
---|---|---|
Frail health | Oxygen use | CPT-4 |
Sick, not critical | Hypertension during hospital stay | ICD-9, ICD-10 |
Health-seeking | Regular check-ups/screening | ICD-9, CPT-4, #PCP visits |
Chronically ill | Frequent specialist visits, many Rx | #specialist visits, NDC, ATC |
Forward selection and backward elimination performed comparably: Karim and Lei (2025).
Category | Example / Description |
---|---|
Standard methods | regression or PS without proxies |
hdPS | Proxy prioritization using the Bross formula |
ML extensions | Random forest, LASSO for PS |
Super Learner | Ensemble of multiple learners |
TMLE | Doubly robust (DR) estimator |
Double cross-fit TMLE | TMLE accommodating more flexible learners |
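As an illustration of the Bross formula row above, the sketch below computes the Bross bias multiplier for a binary proxy and ranks proxies by the absolute log of that multiplier, which is the prioritization commonly described for hdPS. The proxy names and prevalence values are hypothetical, and folding protective associations (RR < 1) to 1/RR is stated here as an assumption about the usual convention, not as the exact implementation used in the talk.

```python
import math


def bross_bias_multiplier(p_c1, p_c0, rr_cd):
    """Bross bias multiplier for one binary proxy.

    p_c1:  prevalence of the proxy among the exposed (A=1)
    p_c0:  prevalence of the proxy among the unexposed (A=0)
    rr_cd: relative risk linking the proxy to the outcome
    """
    if rr_cd <= 0:
        raise ValueError("rr_cd must be positive")
    # Assumed convention: fold protective associations so that RR >= 1.
    rr = rr_cd if rr_cd >= 1 else 1.0 / rr_cd
    return (p_c1 * (rr - 1.0) + 1.0) / (p_c0 * (rr - 1.0) + 1.0)


def rank_proxies(proxies, k):
    """Return the k proxy names with the largest |log(bias multiplier)|.

    proxies: dict mapping name -> (p_c1, p_c0, rr_cd)
    """
    scores = {
        name: abs(math.log(bross_bias_multiplier(*vals)))
        for name, vals in proxies.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical proxies with made-up prevalences and relative risks.
proxies = {
    "oxygen_use": (0.4, 0.2, 2.0),   # imbalanced and outcome-related
    "checkup": (0.3, 0.3, 5.0),      # balanced across arms, multiplier = 1
    "many_rx": (0.1, 0.5, 3.0),      # strongly imbalanced
}

top_two = rank_proxies(proxies, k=2)
```

A proxy that is balanced across exposure groups gets a multiplier of 1 regardless of how strongly it predicts the outcome, so it ranks last under this criterion.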
https://ehsank.shinyapps.io/hdPS-TMLE/
Contact: ehsan.karim@ubc.ca
Code & Resources: https://github.com/ehsanx/
CHSPR Seminar