
High-Dimensional Propensity Scores
Background

The use of retrospective health care claims datasets is frequently criticized for lacking complete information on potential confounders. Ultimately, the treatment effects estimated utilizing such data sources may be subject to residual confounding. Digital electronic administrative records routinely collect a large volume of health-related information; and many of whom are usually not considered in conventional pharmacoepidemiological studies.
Proposal to reduce residual confounding bias
In 2009, a high-dimensional propensity score (hdPS) algorithm was proposed that utilizes such information as surrogates or proxies for mismeasured and unobserved confounders in an effort to reduce residual confounding bias. Since then, many machine learning and semi-parametric extensions of this algorithm have been proposed to exploit the wealth of high-dimensional proxy information properly.
Schneeweiss et al. (2009)
Purpose of the workshop
The primary goal of this workshop is to provide a practical foundation in hdPS. Specifically, we will:
Contrast Methodologies: Clarify the distinctions between standard Propensity Score analysis and the high-dimensional framework.
Systematize the hdPS Workflow: Walk through the underlying logic, sequential steps, and implementation guidelines using reproducible R code applied to an open-source dataset.
Algorithmic Extensions: The rationale for integrating Machine Learning with hdPS.
To provide a comprehensive view beyond our live session, this website includes supplementary modules designed for independent study. These materials offer deeper dives into:
Complex Data Structures: The adaptation and implementation of hdPS within survival analysis and longitudinal data frameworks to handle time-varying complexities.
Critical Appraisal & Publication: A discussion on current controversies, methodological advantages, and best practices for reporting hdPS results in peer-reviewed manuscripts.
Workshop prerequisite
Attendees should have prerequisite knowledge of multiple regression analysis and working knowledge in R (e.g., basic data manipulation and regression fitting).
R Codes
R Codes can be downloaded here:
Version history
Infographics and visual syntheses in this presentation were prepared using NotebookLM, while the statistical findings and R-based visualizations were generated by the presenter.
Different versions and updates of the materials were presented in the following sessions
- 2026 Biostatistics section of Statistical Society of Canada through Instats, Feb 17.
- 2025 Canadian Society for Epidemiology and Biostatistics, Montreal, Quebec, August 11.
- 2025 Society of Epidemiologic Research Workshops, July 11.
- 2025 Statistical Society of Canada, Biostatistics Workshop, May 25 (together with Md Belal Hossain)
- 2024 Society of Epidemiologic Research Workshops, May 10th.
- 2023 R/Medicine Conference, Virtual, June 5.
- 2023 Society of Epidemiologic Research Workshops, Virtual, May 4.
Additional relevant talks (selected):
- Statistical issues in administrative data, Banff International Research Station, Banff, Feb 2019.
- Statistics Conference in Genomics, Pharmaceutical Science, and Health Data Science, August 15-17, 2022 University of Victoria, Victoria, BC
- Work in Progress Seminar, CHEOS, St. Paul’s Hospital (Hurlburt Auditorium), Dec 14th, 2022.
- Statistics and Biostatistics seminar series, at the Department of Statistics and Actuarial Science, University of Waterloo, April 26, 2023.
- Conference on Statistics and Data Science with Applications in Biology, Genetics, Public Health, and Finance, Thompson Rivers University, Kamloops, August 21-24, 2023.
Citation
Karim, M. E. (2025). High-dimensional propensity score and its machine learning extensions in residual confounding control. The American Statistician, 79(1), 72-90. DOI: 10.1080/00031305.2024.2368794.

Comments
For any comments regarding this document, reach out to me.