Boosting Real-World Evidence in Health Services:

Using Machine Learning and Robust Methods to Harness High-Dimensional (HD) Proxies from Administrative Data for Minimizing Residual Confounding

M. Ehsan Karim, UBC SPPH

25 Sep 2025

Acknowledgements


  • Slides: https://ehsanx.github.io/DRhdPS-Slides

  • This talk draws on several recent papers in PDS, Epidemiology, Am Stat, PLoS ONE, Res Methods Med Health Sci, Am J Epidemiol, Pharm Stat, and J Clin Epidemiol.

References

[1] Karim, Hossain, Ng, Zhu, Frank and Tremlett (2025), DOI: 10.1002/pds.70112
[2] Karim, Pang and Platt (2018), DOI: 10.1097/EDE.0000000000000787
[3]
[4] Karim and Lei (2025), DOI: 10.1371/journal.pone.0324639
[5] Karim and Wang (2025), under review
[6] Frank and Karim (2024), DOI: 10.1177/26320843231176662
[7] Mondol and Karim (2024), DOI: 10.1093/aje/kwae447
[8] Karim and Mondol (2025), DOI: 10.1002/pst.70022
[9] Karim and Lei (2025), DOI: 10.1002/pds.70155
[10] Hossain, Sadatsafavi, Wong, Cook, Johnston and Karim (2025), DOI: 10.1016/j.jclinepi.2025.111857
[11] Hossain, Wong, Sadatsafavi, Cook, Johnston and Karim (2025), DOI: 10.1002/pds.70172
[12] Hossain, Ng, Zhu, Tremlett and Karim (2025), DOI: 10.1002/pds.70174

Outline

  • Case Study

    • Multiple sclerosis (MS) cohort from BC administrative data.
    • Residual confounding in real-world data.
  • Methods

    • Standard adjustment, high-dimensional propensity score (hdPS), machine learning (ML), and doubly robust (DR) estimation.
  • Results

    • Comparative performance of methods in HD settings.
  • Summary & Take-Home Messages

    • Practical guidance for HD proxy adjustment.

MS Case Study: Karim, Hossain, Ng, Zhu, Frank and Tremlett (2025)

Variable       Description
A (exposure)   Disease-modifying drug (DMD) use vs. no use
Y (outcome)    All-cause mortality
Data source    Cohort of 19,360 people with MS in BC (1996–2017)

Confounding in MS Case Study

Clinical factors (disease severity, disability level) and lifestyle factors (smoking, alcohol use, physical activity) are unmeasured in the administrative data.

Confounder Selection in Real-World

Modified disjunctive cause criterion (VanderWeele et al. 2019)

Reducing Confounding in MS Study

Examples of 1-to-1 Proxies in MS

Unmeasured confounder   Potential proxy from BC admin data
Disease severity        Hospital/ER visits, specialist consults, symptom-drug use
Disability              Mobility aid codes, rehab visits, fall-related hospitalizations
Depression/anxiety      Antidepressant use, mental health codes
Smoking                 Smoking-cessation Rx, COPD/lung cancer codes
Alcohol use             Liver disease, alcohol-related emergencies

Regression vs. Propensity Score (PS)

Adjusting for too many proxies in the outcome regression can be problematic!

  1. Regression: Y ~ A + C + P, with measured confounders C and proxies P all entering the outcome model

  2. PS: model P(A = 1 | C, P), then fit Y ~ A in the matched or weighted sample
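
The PS route can be illustrated on a toy cohort. The sketch below is a minimal illustration, not the analysis from the cited papers: it assumes a single binary covariate x standing in for C + P, estimates the PS as the treated proportion within each stratum of x instead of fitting a model, and uses normalized inverse probability weights. The data and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy cohort: tuples of (a, y, x) with treatment a, outcome y,
# and one binary covariate x standing in for C + P.
# The outcome risk depends on x only, so the true treatment effect is zero.
rows = (
    [(1, 1, 1)] * 3 + [(1, 0, 1)] * 3    # treated, x = 1: 3 of 6 events
    + [(0, 1, 1)] * 1 + [(0, 0, 1)] * 1  # untreated, x = 1: 1 of 2 events
    + [(1, 0, 0)] * 2                    # treated, x = 0: no events
    + [(0, 0, 0)] * 6                    # untreated, x = 0: no events
)

def crude_risk_difference(rows):
    """Unadjusted difference in outcome risk, treated minus untreated."""
    treated = [y for a, y, _ in rows if a == 1]
    untreated = [y for a, y, _ in rows if a == 0]
    return sum(treated) / len(treated) - sum(untreated) / len(untreated)

def stratum_propensity(rows):
    """Estimate P(A = 1 | x) as the treated proportion in each stratum of x."""
    n, n_treated = defaultdict(int), defaultdict(int)
    for a, _, x in rows:
        n[x] += 1
        n_treated[x] += a
    return {x: n_treated[x] / n[x] for x in n}

def ipw_risk_difference(rows):
    """Normalized inverse-probability-weighted risk difference."""
    ps = stratum_propensity(rows)
    num = {0: 0.0, 1: 0.0}
    den = {0: 0.0, 1: 0.0}
    for a, y, x in rows:
        w = 1 / ps[x] if a == 1 else 1 / (1 - ps[x])
        num[a] += w * y
        den[a] += w
    return num[1] / den[1] - num[0] / den[0]

crude = crude_risk_difference(rows)
weighted = ipw_risk_difference(rows)
print(f"crude RD = {crude:.3f}, IPW RD = {weighted:.3f}")
```

In this toy cohort the crude risk difference is 0.25 even though treatment has no effect within either stratum; weighting by the PS brings the estimate back to approximately zero.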

hdPS: Select Proxies by the Bross Formula
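
As a rough sketch of how this prioritization works: for each binary proxy, the Bross formula combines the proxy's prevalence among the exposed and unexposed with its association with the outcome into a bias multiplier, and proxies are ranked by the absolute log of that multiplier. The proxy names and numbers below are hypothetical, and the helper names are assumptions for illustration, not an API from any hdPS package.

```python
import math

def bross_bias_multiplier(p_c1, p_c0, rr_cd):
    """Bross bias multiplier for one binary proxy.

    p_c1:  prevalence of the proxy among the exposed
    p_c0:  prevalence of the proxy among the unexposed
    rr_cd: relative risk linking the proxy to the outcome
    """
    if rr_cd < 1:
        rr_cd = 1 / rr_cd  # hdPS convention: invert protective associations
    return (p_c1 * (rr_cd - 1) + 1) / (p_c0 * (rr_cd - 1) + 1)

def rank_proxies(stats, k):
    """Return the k proxy names with the largest |log(bias multiplier)|."""
    scored = {
        name: abs(math.log(bross_bias_multiplier(*vals)))
        for name, vals in stats.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:k]

# Hypothetical proxies: (prevalence in exposed, prevalence in unexposed, RR)
proxy_stats = {
    "rehab_visit": (0.30, 0.10, 2.0),
    "flu_shot": (0.20, 0.20, 3.0),
    "mobility_aid": (0.05, 0.25, 4.0),
}

selected = rank_proxies(proxy_stats, k=2)
print(selected)
```

A proxy that is balanced across exposure groups, such as the hypothetical flu_shot above, gets a multiplier of 1 and ranks last however strongly it predicts the outcome.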

hdPS vs. ML: Karim, Pang and Platt (2018)

ML Extension Idea: Karim (2025)

hdPS vs. Stat Performance

Forward selection and backward elimination performed comparably: Karim and Lei (2025).

Learners Not Created Equal!!

Karim and Wang (2025)

Doubly Robust (DR) Estimator: Targeted Maximum Likelihood Estimation (TMLE)
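
For reference, one standard way to write TMLE for the average treatment effect with a binary outcome is sketched below. The notation is an assumption for illustration: W = (C, P) collects the measured confounders and selected proxies, \bar{Q} is the outcome model, and g is the PS. The implementations in the cited papers may differ in detail.

```latex
\begin{align*}
\hat{g}(W) &= \hat{P}(A = 1 \mid W), \qquad
  \bar{Q}^{0}(A, W) = \hat{E}[Y \mid A, W] \\
H(A, W) &= \frac{A}{\hat{g}(W)} - \frac{1 - A}{1 - \hat{g}(W)} \\
\operatorname{logit} \bar{Q}^{*}(A, W) &=
  \operatorname{logit} \bar{Q}^{0}(A, W) + \hat{\varepsilon}\, H(A, W) \\
\hat{\psi}_{\mathrm{TMLE}} &= \frac{1}{n} \sum_{i=1}^{n}
  \left[ \bar{Q}^{*}(1, W_i) - \bar{Q}^{*}(0, W_i) \right]
\end{align*}
```

Here \hat{\varepsilon} is obtained from a logistic regression of Y on the clever covariate H with offset logit \bar{Q}^{0}. The estimator is doubly robust in the sense that it remains consistent if either the outcome model or the PS model is consistently estimated.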

ML and DR methods in hdPS context

  • Handles a large number of proxies
  • Flexible ML: improves model specification, but can yield invalid inference when used on its own
  • DR + cross-fitting: allows valid inference with flexible models
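
The mechanics of combining a DR estimator with cross-fitting can be sketched on toy data. This is a simplified illustration under stated assumptions, not the double cross-fit TMLE from the cited work: it uses the augmented IPW (AIPW) estimator as a simpler DR estimator in place of TMLE, stratum means on one binary covariate x as stand-ins for flexible learners, and two folds. The data and function names are hypothetical.

```python
def fit_nuisances(train):
    """Stand-in 'learners': empirical P(A = 1 | x) and E[Y | a, x]."""
    n_x, a_sum, n_ax, y_sum = {}, {}, {}, {}
    for a, y, x in train:
        n_x[x] = n_x.get(x, 0) + 1
        a_sum[x] = a_sum.get(x, 0) + a
        n_ax[(a, x)] = n_ax.get((a, x), 0) + 1
        y_sum[(a, x)] = y_sum.get((a, x), 0) + y
    g = {x: a_sum[x] / n_x[x] for x in n_x}
    q = {key: y_sum[key] / n_ax[key] for key in n_ax}
    return g, q

def cross_fit_aipw(rows, folds):
    """AIPW risk difference; each row is scored with nuisance models
    fitted on the other folds only."""
    total = 0.0
    for k in sorted(set(folds)):
        train = [r for r, f in zip(rows, folds) if f != k]
        held_out = [r for r, f in zip(rows, folds) if f == k]
        g, q = fit_nuisances(train)
        for a, y, x in held_out:
            q1, q0 = q[(1, x)], q[(0, x)]
            total += (q1 - q0
                      + a * (y - q1) / g[x]
                      - (1 - a) * (y - q0) / (1 - g[x]))
    return total / len(rows)

# Hypothetical cohort of (a, y, x); treatment lowers risk by 0.5 in both strata.
block = (
    [(1, 1, 1)] * 3 + [(1, 0, 1)] * 3    # treated, x = 1: risk 0.5
    + [(0, 1, 1)] * 2                    # untreated, x = 1: risk 1.0
    + [(1, 0, 0)] * 2                    # treated, x = 0: risk 0.0
    + [(0, 1, 0)] * 3 + [(0, 0, 0)] * 3  # untreated, x = 0: risk 0.5
)
rows = block * 2
folds = [0] * len(block) + [1] * len(block)

estimate = cross_fit_aipw(rows, folds)
print(f"cross-fitted AIPW RD = {estimate:.3f}")
```

The crude risk difference in this cohort is -0.25, while the cross-fitted estimate recovers the stratum-specific difference of -0.5. In practice the stratum means would be replaced by ML learners, which is where fitting and scoring on separate folds matters.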

Standard, hdPS, ML vs. DR: Karim and Lei (2025)

Category                Example / Description
Standard methods        Regression or PS without proxies
hdPS                    Proxies selected by the Bross formula
ML extensions           Random forest, LASSO for PS
Super Learner           Ensemble of multiple learners
TMLE                    Doubly robust (DR) estimator
Double cross-fit TMLE   TMLE accommodating more flexible learners

Metric: Bias

https://ehsank.shinyapps.io/hdPS-TMLE/

Metric: 95% Coverage

https://ehsank.shinyapps.io/hdPS-TMLE/
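
Both metrics are computed across simulation replicates. A minimal sketch of the calculation, assuming each replicate returns a point estimate and a standard error and that Wald-type intervals are used; the numbers are hypothetical and are not results from the cited studies.

```python
def bias(estimates, truth):
    """Mean of the estimates minus the true parameter value."""
    return sum(estimates) / len(estimates) - truth

def coverage(estimates, std_errors, truth, z=1.96):
    """Share of Wald intervals (estimate +/- z * SE) that contain the truth."""
    hits = sum(
        1 for est, se in zip(estimates, std_errors)
        if est - z * se <= truth <= est + z * se
    )
    return hits / len(estimates)

# Hypothetical output from four simulation replicates
truth = 0.0
estimates = [0.10, -0.05, 0.02, 0.30]
std_errors = [0.10, 0.10, 0.10, 0.10]

sim_bias = bias(estimates, truth)
sim_coverage = coverage(estimates, std_errors, truth)
print(f"bias = {sim_bias:.4f}, coverage = {sim_coverage:.2f}")
```

A method with small bias can still show coverage well below the nominal 95% if its standard errors are too small, which is why the two metrics are reported separately.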

Other Extensions of HD Proxies

Summary

  • HD proxy adjustment (using health service use dimensions) can help reduce residual confounding,
    • even when the proxies are not directly interpretable for the research question.
  • ML can help, but not all learners are equal;
    • complex learner libraries can be unstable.
  • Simpler methods often perform better and are easier to interpret in high-dimensional settings.

Thank You!


Contact: ehsan.karim@ubc.ca

Code & Resources: https://github.com/ehsanx/