Chapter 11 Critical Appraisal

Given the popularity of ML and AI methods, it is important to be able to critically appraise a paper that reports the results of ML or AI analyses.

11.1 Existing guidelines

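Liu et al. (2020) and Rivera et al. (2020) provided the CONSORT-AI and SPIRIT-AI extensions, respectively: reporting guidelines for clinical trial reports and for clinical trial protocols of interventions involving artificial intelligence. Vinny et al. (2021) described a number of ways to critically appraise an article that analysed its data using an ML approach.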

11.2 Key considerations

The table below summarizes the general considerations in the critical appraisal of an ML research article.

| Issues to consider | Details |
|---|---|
| Clinical utility | Is there an explanation or rationale for why the prediction model is being developed? Was the study aim clear? Is the aim prediction of an outcome, or identification of important features? |
| Data source and study description | How were the data collected? What was the study design (RCT, observational, cross-sectional, longitudinal, nationally representative survey)? Were the study start and end dates reported? What was the baseline? Are the data routinely measurable in a clinical setting, or are they measured irregularly? |
| Target population | Was it clear who the target population was, both where the model was developed and where it can be generalized? |
| Analytic data | How were the data pre-processed? Were the inclusion and exclusion criteria properly implemented to target the intended population? Were clinicians consulted about the appropriateness of the inclusion and exclusion criteria? Was a protocol published a priori? |
| Data dimension and split ratio | What were the total data size, the analytic data size, and the training, tuning, and testing data sizes? |
| Outcome label | How was the gold standard determined, and what was its quality? Is the prediction of this outcome clinically relevant? |
| Features | How many covariates or features were used? How were these variables selected? Were subject-area experts consulted in selecting and identifying some or all of these variables? Were any of these variables transformed, dichotomized, categorized, or combined? Was a table of baseline characteristics of the subjects, stratified by outcome label, presented? |
| Missing data | Was the amount of missing observations reported? Was there any explanation of why they were missing? How were the missing values handled (for example, complete-case analysis or multiple imputation)? |
| ML model choice | What was the rationale for the ML model choice (logistic regression, LASSO, CART or its extensions, ensemble, or others)? What was the model specification: additive, linear, or not? Was the amount of data adequate given the model complexity (number of parameters)? |
| ML model details | Were details of the ML model and its implementation reported? Was the model fine-tuned? Was the model customized in any way? Were the hyperparameters provided? |
| Optimism or overfitting | What method was used to address these issues? What performance measures were used? Was there any performance gap (between the tuned model and the internally validated model)? Was the model performance reasonable, or unrealistic? |
| Generalizability | Were external validation data present? Was the model tested in a real-world clinical setting? |
| Reproducibility | Is the analysis repeatable and reproducible? This can be assessed at three levels, (i) model, (ii) code, (iii) data, or their combinations. Was the software code provided? Which software and version were used? Was the computing time reported? |
| Interpretability | Were clinicians consulted? Were the results interpreted in collaboration with clinicians and subject-area experts? Are the model results believable and interpretable? |
| Subgroup | Were clinically important subgroups considered? |
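The "performance gap" named under optimism or overfitting can be made concrete with a small sketch. The Python example below is illustrative only: it assumes discrimination is measured by the AUC, and every label and predicted risk in it is a made-up value, not a result from any study. It computes the AUC on the data used to fit the model (apparent performance) and on held-out data (internal validation), then reports the difference.

```python
# Minimal sketch: apparent vs. internally validated AUC and the gap between them.
# All labels and predicted risks are hypothetical, for illustration only.

def auc(labels, scores):
    """AUC as the share of (event, non-event) pairs in which the event
    receives the higher predicted risk; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one event and one non-event")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical predictions on the data used to fit and tune the model
train_labels = [1, 1, 1, 0, 0, 0]
train_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]

# Hypothetical predictions on held-out (internal validation) data
valid_labels = [1, 1, 1, 0, 0, 0]
valid_scores = [0.8, 0.6, 0.3, 0.5, 0.4, 0.2]

apparent_auc = auc(train_labels, train_scores)
validated_auc = auc(valid_labels, valid_scores)
optimism_gap = apparent_auc - validated_auc

print(f"Apparent AUC:  {apparent_auc:.3f}")
print(f"Validated AUC: {validated_auc:.3f}")
print(f"Gap:           {optimism_gap:.3f}")
```

A large gap, or an apparent performance close to perfect, is a prompt to check how the authors addressed overfitting; what counts as "large" depends on the clinical context and is not fixed by this sketch.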

11.3 Example

  1. Download the article by Antman et al. (2000) (link). Try to identify how many of the above key considerations they reported in the process of developing a risk score.
  2. OpenSAFELY article in Nature: let’s discuss the research goal.
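One possible way to keep track of such an appraisal is a simple checklist tally. The Python sketch below is only a suggestion: the item names are taken from the table in Section 11.2, while the helper names `blank_appraisal` and `summarize` and the two filled-in values are hypothetical placeholders, not an appraisal of any particular article.

```python
# Minimal sketch of a checklist tally for appraising one article.
# True = reported, False = not reported, None = unclear or not yet assessed.

ITEMS = [
    "Clinical utility",
    "Data source and study description",
    "Target population",
    "Analytic data",
    "Data dimension and split ratio",
    "Outcome label",
    "Features",
    "Missing data",
    "ML model choice",
    "ML model details",
    "Optimism or overfitting",
    "Generalizability",
    "Reproducibility",
    "Interpretability",
    "Subgroup",
]

def blank_appraisal():
    """Start with every item marked as not yet assessed."""
    return {item: None for item in ITEMS}

def summarize(appraisal):
    """Count how many items fall in each status."""
    values = list(appraisal.values())
    return {
        "reported": sum(v is True for v in values),
        "not_reported": sum(v is False for v in values),
        "unclear": sum(v is None for v in values),
        "total": len(values),
    }

appraisal = blank_appraisal()
appraisal["Clinical utility"] = True   # arbitrary placeholder value
appraisal["Missing data"] = False      # arbitrary placeholder value

summary = summarize(appraisal)
print(summary)
```

The count is only a starting point for discussion: an item that is reported can still be handled poorly, so each entry deserves a short written justification alongside the tally.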

11.4 Exercise

Find an article in the medical literature that used a machine learning method to build a clinical prediction model (here is an example). The article should be published in a peer-reviewed journal, and it could be related to the area that you work in or are interested in. Critically appraise that article.


Antman, Elliott M, Marc Cohen, Peter JLM Bernink, Carolyn H McCabe, Thomas Horacek, Gary Papuchis, Branco Mautner, Ramon Corbalan, David Radley, and Eugene Braunwald. 2000. “The TIMI Risk Score for Unstable Angina/Non–ST Elevation MI: A Method for Prognostication and Therapeutic Decision Making.” JAMA 284 (7): 835–42.
Liu, Xiaoxuan, Samantha Cruz Rivera, David Moher, Melanie J Calvert, and Alastair K Denniston. 2020. “Reporting Guidelines for Clinical Trial Reports for Interventions Involving Artificial Intelligence: The CONSORT-AI Extension.” BMJ 370.
Rivera, Samantha Cruz, Xiaoxuan Liu, An-Wen Chan, Alastair K Denniston, and Melanie J Calvert. 2020. “Guidelines for Clinical Trial Protocols for Interventions Involving Artificial Intelligence: The SPIRIT-AI Extension.” BMJ 370.
Vinny, Pulikottil W, Rahul Garg, MV Padma Srivastava, Vivek Lal, and Venugopalan Y Vishnu. 2021. “Critical Appraisal of a Machine Learning Paper: A Guide for the Neurologist.” Annals of Indian Academy of Neurology 24 (4): 481.