ML in TB Research via Administrative Data

Predictive Research question 1

Prediction modelling with machine learning methods: How do we overcome missing data challenges?

Predicting rare outcomes, e.g., long-term mortality in tuberculosis patients, is critical for public health planning and intervention. However, using large health administrative datasets for this purpose is challenging because key predictors often have missing values. Multiple imputation is a widely used technique for handling missing predictor values. Its standard pooling step, Rubin’s rules, combines parameter estimates and their variances across the imputed datasets, but many machine learning methods do not produce outputs that can be pooled in this way. This limitation highlights the need for alternative approaches to handling missing data in risk prediction modelling. We conducted extensive data analyses and simulations to compare the performance of different epidemiological techniques for overcoming missing data challenges in prediction modelling with machine learning methods.
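To make the pooling problem concrete, the sketch below shows one workaround that is sometimes discussed: pooling at the prediction level, that is, fitting one model per imputed dataset and averaging the predicted risks for each individual. It is a minimal illustration under stated assumptions, not necessarily one of the techniques compared in this work. The imputation step (a random draw from observed values), the nearest-neighbour learner, the toy data, and every function name are hypothetical stand-ins chosen so the example runs with the Python standard library alone.

```python
import random
from statistics import mean


def impute_once(rows, rng):
    """Fill each missing value (None) with a random draw from that column's observed values.

    Assumption: a deliberately crude stand-in for a proper imputation model.
    """
    n_cols = len(rows[0])
    observed = [[r[j] for r in rows if r[j] is not None] for j in range(n_cols)]
    return [
        [v if v is not None else rng.choice(observed[j]) for j, v in enumerate(r)]
        for r in rows
    ]


def fit_knn(X, y, k=3):
    """Return a predictor: mean outcome among the k nearest training rows."""
    def predict(x):
        by_distance = sorted(
            range(len(X)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x)),
        )
        return mean(y[i] for i in by_distance[:k])
    return predict


def pooled_risk(X_train, y_train, X_new, m=5, seed=0):
    """Impute m times, fit one model per completed dataset, average the predicted risks."""
    rng = random.Random(seed)
    per_imputation = []
    for _ in range(m):
        X_completed = impute_once(X_train, rng)
        model = fit_knn(X_completed, y_train)
        per_imputation.append([model(x) for x in X_new])
    # Pool at the prediction level: one averaged risk per new individual.
    return [mean(risks) for risks in zip(*per_imputation)]


# Toy data (hypothetical): two predictors, None marks a missing value.
X_train = [[1.0, 0.2], [None, 0.1], [3.0, None], [4.0, 0.9], [5.0, 1.0], [0.5, None]]
y_train = [0, 0, 0, 1, 1, 0]
X_new = [[4.5, 0.95], [0.8, 0.15]]  # new individuals, fully observed

risks = pooled_risk(X_train, y_train, X_new)
print(risks)
```

Averaging predictions avoids the need for model coefficients, which is why it can be applied to learners that have none. It does not, by itself, provide the variance estimates that Rubin’s rules give for regression parameters.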

Predictive Research question 2

Can we supplement unmeasured predictors in risk prediction modelling using machine learning methods?

Prediction models for long-term health outcomes are increasingly developed using health administrative databases. However, these databases often lack clinical predictors, although they contain many other variables that correlate with the missing clinical measures and can therefore act as proxies for them. Traditionally, investigators build their models from the list of predictors available in the database. Such investigator-specified models are prone to poor predictive performance. To supplement unmeasured predictors in risk prediction modelling, we developed machine learning models that combine investigator-specified predictors with proxies (e.g., ICD-9/10 codes). Using extensive data analyses and simulations with administrative health data on participants diagnosed with tuberculosis (Hossain et al. 2023), we showed that our machine learning method outperformed the model with only investigator-specified predictors in predicting survival or binary outcomes.
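As an illustration of how proxies might enter such a model, the sketch below turns each sufficiently common diagnosis code into a 0/1 indicator column and appends it to the investigator-specified predictors. This is a minimal sketch under stated assumptions, not the authors' pipeline: the patient records, the variable names (`age`, `male`, `codes`), the prevalence threshold `min_count`, and the function `build_design_matrix` are all hypothetical.

```python
from collections import Counter


def build_design_matrix(patients, investigator_vars, min_count=2):
    """Append one 0/1 column per diagnosis code seen in at least min_count patients."""
    # Count each code once per patient, so repeated codes do not inflate prevalence.
    code_counts = Counter(code for p in patients for code in set(p["codes"]))
    proxy_codes = sorted(c for c, n in code_counts.items() if n >= min_count)

    columns = list(investigator_vars) + [f"code_{c}" for c in proxy_codes]
    rows = []
    for p in patients:
        seen = set(p["codes"])
        rows.append(
            [p[v] for v in investigator_vars] + [int(c in seen) for c in proxy_codes]
        )
    return columns, rows


# Hypothetical records: two investigator-specified predictors plus raw ICD-10 codes.
patients = [
    {"age": 54, "male": 1, "codes": ["E11", "I10"]},
    {"age": 37, "male": 0, "codes": ["I10"]},
    {"age": 68, "male": 1, "codes": ["E11", "J44", "E11"]},
]

columns, rows = build_design_matrix(patients, ["age", "male"])
print(columns)
for row in rows:
    print(row)
```

The resulting matrix could then be passed to a learner that selects among many candidate columns, for example a penalised regression or a tree-based method. Which learner and which selection rule are appropriate is a design choice that this sketch leaves open.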

References

Hossain, Md Belal, James C Johnston, Victoria J Cook, Mohsen Sadatsafavi, Hubert Wong, Kamila Romanowski, and Mohammad Ehsanul Karim. 2023. “Role of Latent Tuberculosis Infection on Elevated Risk of Cardiovascular Disease: A Population-Based Cohort Study of Immigrants in British Columbia, Canada, 1985–2019.” Epidemiology & Infection 151: e68.