Chapter 9 Model Validation Considerations

9.1 Reliability of outcome labels

Many clinical outcomes come from reliable data sources.
- Death data is considered ‘hard’ outcome, usually come from death registry, and are often not subject to much error.
However, there are other clinical outcomes that are considered ‘soft’ outcomes (e.g., subject to mis-classifications, mis-remembering).
- EDSS scale in multiple sclerosis may be measured differently in different jurisdictions or clinical training practices
- Number of relapses over past 2 years may be subject to recall bias

Unsurprising, quality of outcome data matters in building a good prediction model.

Blindly believing or reporting the ML results might not be useful. Researchers should always seek for explanation of analysis results. This is particularly true if the results are somewhat surprising or unexpected.
It is always a good idea to check with a person with clinical expert to assess and scrutinize the results.
Analysts should always be skeptic of results when there is a large gap between the performances of training (after tuning) and test sets.
Analyst should always care about reproducibility and repeatability of the analysis results (requires keeping good log of what parameter, hyperparameter, computational settings settings were used).
Indeed internal validity is helpful, but the external validity should also be assessed when possible. Often the models developed under laboratory condition may not perform well under real-world clinical settings.