
Using Machine Learning and Robust Methods to Harness High-Dimensional (HD) Proxies from Administrative Data for Minimizing Residual Confounding
25 Sep 2025

Slides: https://ehsanx.github.io/DRhdPS-Slides
This talk draws on several recent papers published in PDS, Epidemiology, Am Stat, PLoS ONE, Res Methods Med Health Sci, Am J Epidemiol, Pharm Stat, and J Clin Epidemiol.






Case Study
Methods
Results: Comparative performance of methods in HD.
Summary & Take-Home Messages: Practical guidance for HD proxy adjustment.
| Variable | Description |
|---|---|
| A | Disease-modifying drug (DMD) use vs. no use |
| Y | All-cause mortality |
| Data source | Cohort of 19,360 people with multiple sclerosis (MS) in British Columbia (BC), 1996–2017 |


Clinical factors (disease severity, disability level) and lifestyle factors (smoking, alcohol use, physical activity) are unmeasured.

Modified disjunctive cause criterion (VanderWeele et al. 2019)

| Unmeasured | Potential Proxy from BC Admin Data |
|---|---|
| Disease severity | Hospital/ER visits, specialist consults, symptom-drug use |
| Disability | Mobility aid codes, rehab visits, fall-related hospitalizations |
| Depression/anxiety | Antidepressant use, mental health codes |
| Smoking | Smoking-cessation Rx, COPD/lung cancer codes |
| Alcohol use | Liver disease, alcohol-related emergencies |

Adjusting for too many proxies in the outcome regression can be problematic!

Regression: Y ~ A + C + P, where C denotes the measured confounders and P the proxies.

PS: model P(A | C, P), then fit Y ~ A in the matched or weighted sample.

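The PS route above can be sketched in a few lines. This is only an illustration under stated assumptions: the counts are hypothetical (not the MS cohort), the covariate tuple `x = (C, P)` is binary, the propensity score is estimated as the treated proportion within each covariate stratum rather than with a fitted model, and the helper names `expand`, `crude_risk_difference`, and `ipw_risk_difference` are made up for this sketch. In these made-up data the treated are concentrated in the higher-risk stratum, so the crude contrast and the weighted contrast disagree.

```python
from collections import defaultdict

def expand(counts):
    """Turn {(a, x): (n, events)} counts into one (a, y, x) row per person."""
    rows = []
    for (a, x), (n, events) in counts.items():
        rows += [(a, 1, x)] * events + [(a, 0, x)] * (n - events)
    return rows

def crude_risk_difference(rows):
    """Unadjusted contrast: risk among treated minus risk among untreated."""
    treated = [y for a, y, _ in rows if a == 1]
    untreated = [y for a, y, _ in rows if a == 0]
    return sum(treated) / len(treated) - sum(untreated) / len(untreated)

def ipw_risk_difference(rows):
    """Y ~ A in the inverse-probability-weighted sample (normalized weights)."""
    # Step 1: PS = P(A = 1 | C, P), here the treated proportion per stratum.
    n, n_treated = defaultdict(int), defaultdict(int)
    for a, _, x in rows:
        n[x] += 1
        n_treated[x] += a
    ps = {x: n_treated[x] / n[x] for x in n}
    # Step 2: weighted outcome means in each treatment arm.
    num1 = den1 = num0 = den0 = 0.0
    for a, y, x in rows:
        if a == 1:
            w = 1 / ps[x]
            num1 += w * y
            den1 += w
        else:
            w = 1 / (1 - ps[x])
            num0 += w * y
            den0 += w
    return num1 / den1 - num0 / den0

# Hypothetical counts keyed by (A, (C, P)): (people, deaths).
counts = {
    (1, (0, 0)): (20, 2),
    (0, (0, 0)): (80, 16),
    (1, (1, 1)): (80, 24),
    (0, (1, 1)): (20, 8),
}
rows = expand(counts)
crude = crude_risk_difference(rows)
weighted = ipw_risk_difference(rows)
print(round(crude, 3), round(weighted, 3))
```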
Schneeweiss et al. (2009, 2018): use health services utilization data as proxies, even when individual codes are not directly interpretable for the research question.

| Proxy | Measure | Codes |
|---|---|---|
| Frail health | Oxygen use | CPT-4 |
| Sick, not critical | Hypertension during hospital stay | ICD-9, ICD-10 |
| Health-seeking | Regular check-ups/screening | ICD-9, CPT-4, #PCP visits |
| Chronically ill | Frequent specialist visits, many Rx | #specialist visits, NDC, ATC |

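One way utilization codes like those in the table are commonly turned into candidate covariates in the hdPS literature is by recurrence: each code yields three binary indicators (recorded at least once, at least the median number of times, at least the 75th percentile number of times). The sketch below assumes the cut-points are taken among patients with at least one record and uses a simple nearest-rank quantile; `nearest_rank`, `recurrence_indicators`, and the visit counts are hypothetical and not taken from the talk.

```python
import math

def nearest_rank(sorted_values, q):
    """Nearest-rank quantile of an ascending, non-empty list (0 < q <= 1)."""
    k = max(math.ceil(q * len(sorted_values)) - 1, 0)
    return sorted_values[k]

def recurrence_indicators(code_counts):
    """Map per-patient counts of one code to once/sporadic/frequent flags."""
    # Assumption: cut-points are computed among patients with >= 1 record.
    positive = sorted(c for c in code_counts if c > 0)
    if not positive:
        zeros = [0] * len(code_counts)
        return {"once": zeros, "sporadic": list(zeros), "frequent": list(zeros)}
    median = nearest_rank(positive, 0.50)
    p75 = nearest_rank(positive, 0.75)
    return {
        "once": [int(c >= 1) for c in code_counts],
        "sporadic": [int(c >= median) for c in code_counts],
        "frequent": [int(c >= p75) for c in code_counts],
    }

# Hypothetical number of specialist visits for five patients.
specialist_visits = [0, 1, 2, 3, 8]
indicators = recurrence_indicators(specialist_visits)
for name, flags in indicators.items():
    print(name, flags)
```

Each indicator then enters the prioritization step as its own candidate proxy, so a single code can contribute up to three binary covariates.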







Forward selection and backward elimination performed comparably: Karim and Lei (2025).





| Category | Example / Description |
|---|---|
| Standard methods | Regression or PS without proxies |
| hdPS | Proxies prioritized with the Bross formula |
| ML extensions | Random forest, LASSO for PS |
| Super Learner | Ensemble of multiple learners |
| TMLE | Targeted maximum likelihood estimation; doubly robust (DR) |
| Double cross-fit TMLE | TMLE accommodating more flexible learners |

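The Bross formula in the hdPS row ranks each binary proxy by how much confounding bias it could introduce if left unadjusted. A minimal sketch follows, assuming the usual form of the multiplier: `p_c1` and `p_c0` are the proxy's prevalence among the exposed and unexposed, `rr_cd` is the proxy-outcome relative risk (inverted when below 1), and proxies are ordered by the absolute log of the multiplier. The function names and the three example proxies with their numbers are hypothetical.

```python
import math

def bross_bias_multiplier(p_c1, p_c0, rr_cd):
    """Multiplicative bias from leaving one binary proxy unadjusted."""
    # A protective association is inverted so both directions rank alike.
    if rr_cd < 1:
        rr_cd = 1 / rr_cd
    return (p_c1 * (rr_cd - 1) + 1) / (p_c0 * (rr_cd - 1) + 1)

def rank_proxies(stats, k):
    """Return the k proxy names with the largest |log(bias multiplier)|."""
    def score(name):
        return abs(math.log(bross_bias_multiplier(*stats[name])))
    return sorted(stats, key=score, reverse=True)[:k]

# Hypothetical (prevalence in exposed, prevalence in unexposed, RR) per proxy.
stats = {
    "oxygen_use": (0.30, 0.10, 2.0),
    "regular_checkup": (0.20, 0.20, 3.0),
    "specialist_visits": (0.25, 0.20, 1.5),
}
top_two = rank_proxies(stats, k=2)
print(top_two)
```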
https://ehsank.shinyapps.io/hdPS-TMLE/


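The "DR" label on TMLE means the estimate remains consistent if either the propensity score model or the outcome model is correct. The sketch below illustrates that property with a simpler doubly robust estimator, AIPW (augmented inverse probability weighting), which is swapped in here for brevity and is not TMLE itself. The data are hypothetical counts with one binary stratum label `x`, the "correct" nuisance values are simply the stratum proportions of those made-up data, and all names are illustrative.

```python
def expand(counts):
    """Turn {(a, x): (n, events)} counts into one (a, y, x) row per person."""
    rows = []
    for (a, x), (n, events) in counts.items():
        rows += [(a, 1, x)] * events + [(a, 0, x)] * (n - events)
    return rows

def aipw_risk_difference(rows, ps, m1, m0):
    """AIPW risk difference given PS and outcome-model functions of x."""
    total = 0.0
    for a, y, x in rows:
        e = ps(x)
        # Outcome-model contrast plus an inverse-probability-weighted
        # correction based on each person's residual.
        outcome_part = m1(x) - m0(x)
        correction = a * (y - m1(x)) / e - (1 - a) * (y - m0(x)) / (1 - e)
        total += outcome_part + correction
    return total / len(rows)

# Hypothetical counts keyed by (A, x): (people, deaths).
counts = {
    (1, 0): (20, 2),
    (0, 0): (80, 16),
    (1, 1): (80, 24),
    (0, 1): (20, 8),
}
rows = expand(counts)

# Nuisance values that match the hypothetical data by construction.
ps_ok = {0: 0.2, 1: 0.8}
m1_ok = {0: 0.10, 1: 0.30}
m0_ok = {0: 0.20, 1: 0.40}

both_right = aipw_risk_difference(rows, ps_ok.get, m1_ok.get, m0_ok.get)
# Deliberately wrong outcome model (always 0), correct PS.
wrong_outcome = aipw_risk_difference(rows, ps_ok.get, lambda x: 0.0, lambda x: 0.0)
# Deliberately wrong PS (always 0.5), correct outcome model.
wrong_ps = aipw_risk_difference(rows, lambda x: 0.5, m1_ok.get, m0_ok.get)
print(round(both_right, 3), round(wrong_outcome, 3), round(wrong_ps, 3))
```

All three calls recover the same risk difference in these constructed data, which is the double robustness property in miniature. With real data the nuisance functions would be estimated (for example with a Super Learner), and cross-fitting would keep those fits separate from the observations they are evaluated on.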


Contact: ehsan.karim@ubc.ca
Code & Resources: https://github.com/ehsanx/

CHSPR Seminar