Concepts (M)

Missing Data Analysis

This section is about understanding, categorizing, and addressing missing data in clinical and epidemiological research. It highlights the prevalence of missing data in these fields, the common use of complete case analysis without considering the implications, and the types of missingness: Missing Completely at Random (MCAR), Missing at Random (MAR), and Not Missing at Random (NMAR), each requiring different approaches and considerations. The consequences of not properly addressing missing data are detailed as bias, incorrect standard errors/precision, and a substantial loss of power.

This section also delves into strategies for addressing missing data, focusing on ad-hoc approaches and imputation methods. Ad-hoc approaches, such as ignoring missing data or using a missing category indicator, are generally dismissed as statistically invalid. In contrast, imputation, particularly multiple imputation (MI), is presented as a more robust and statistically sound method. Multiple imputation involves creating multiple complete datasets by predicting missing values and pooling the results to address the uncertainty associated with missing data. The section further discusses the types of imputation, the necessity of including a sufficient number of predictive variables, and the use of subject-area knowledge in building imputation models, providing a nuanced understanding of the challenges and solutions associated with missing data in research.

The Reporting Guideline section delves into the complexities of handling missing data in statistical analysis, primarily through MI methods, especially Multiple Imputation by Chained Equations (MICE). It lays out the missingness assumptions (MCAR, MAR, NMAR) that determine when these methods are valid. The guide also details how MICE works, using sequential regression imputation to create multiple imputed datasets, thereby allowing for more accurate and robust statistical inferences. Additionally, it provides comprehensive instructions on reporting a MICE analysis, including detailing the missingness rates, the reasons for missing data, the assumptions made, and the specifics of the imputation and pooling methods used, ensuring transparency and reproducibility in research.

Reading list

Key reference: (Sterne et al. 2009)

Optional reading: (Van Buuren 2018)

Further optional readings: (Lumley 2011; Granger, Sergeant, and Lunt 2019; Hughes et al. 2019)

Video Lessons

Missing Data Analysis

The Unseen Threat: Why Missing Data Matters in Research

Missing data is an inevitable feature of nearly all clinical and epidemiological research. From survey non-response to equipment malfunction or participant dropout, gaps in a dataset are the rule, not the exception. For many years, the profound impact of these gaps on the validity of research findings was often overlooked, partly because the statistical methods to properly address them were not readily accessible to most researchers. However, the landscape has changed. Powerful, principled methods for handling missing data, such as multiple imputation, are now available in standard statistical software, raising the standard of evidence and the expectation of rigor for all quantitative research. Understanding and correctly applying these methods is no longer a niche specialty but a core competency for any modern researcher.

The Critical Consequences of Inaction

Ignoring missing data or handling it improperly is not a neutral act; it actively degrades the quality of scientific inquiry. The consequences are severe and can undermine the very conclusions of a study. There are three primary ways in which missing data can corrupt research findings:

  1. Bias: When the missing values are systematically different from the observed ones, any analysis that ignores this fact will produce biased results. The estimates, such as regression coefficients or odds ratios, will be consistently wrong, misrepresenting the true relationships that exist in the population.
  2. Loss of Power: The most common (and often default) method for handling missing data is to simply discard any observation that has a missing value. This approach, known as complete case analysis, reduces the sample size. A smaller sample size diminishes the statistical power of a study, meaning it reduces the ability to detect real effects or relationships, even when they truly exist.
  3. Incorrect Precision: Improperly handling missing data can lead to incorrect standard errors. For instance, some naive methods make the data appear cleaner and less variable than it truly is. This results in standard errors that are too small and confidence intervals that are too narrow, giving a false sense of certainty in the findings.

Critique of Flawed “Ad-Hoc” Approaches

Given the challenges, researchers have often resorted to simple, “ad-hoc” solutions. While appealing in their simplicity, these methods are statistically invalid under most realistic conditions and should be avoided.

  • Complete Case Analysis (Listwise Deletion): This method, the default in many software packages, involves analyzing only the subset of observations with no missing data on any variable (a quick check of how much data this discards is sketched after this list). While simple, it is only statistically valid under a very strict and rare assumption about the missing data mechanism. Its widespread use without proper justification is one of the most common and serious errors in the literature. As a general rule of thumb, some methodologists suggest that complete case analysis could be considered for the primary analysis if the percentage of missing observations across all variables combined is below approximately 5%, but this requires a very strong justification and should not be based solely on a statistical test. Furthermore, if only the outcome variable has missing values, complete case analysis can be more statistically efficient than multiple imputation.
  • Single Imputation (e.g., Mean/Median): This approach involves “filling in” each missing value with a single number, such as the mean or median of the observed values for that variable. While this creates a complete dataset, it artificially reduces the natural variability of the data. All the imputed values are identical, which shrinks the standard deviation and leads to underestimated standard errors and overly optimistic (i.e., too small) p-values.
  • Indicator Method: Another flawed technique is to create a new “missing” category for a variable and include this indicator in a regression model. This is not a valid statistical approach and can introduce significant bias into the model’s estimates. This method treats the lack of information as if it were a meaningful, substantive category. For example, if income data is missing for lower-income individuals, creating a “Missing” category can mask the true relationship between income and health, potentially causing the model to underestimate the effect of income. The bias can be especially noticeable if the variable with the missing category is an important confounder.
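
Before settling on any of these approaches, it helps to quantify the problem first. The short sketch below (in R, assuming a hypothetical data frame named dat) reports the per-variable missingness and shows how many rows a complete case analysis would silently discard; md.pattern() from the mice package displays the joint pattern of missingness.

```r
library(mice)   # used here only for md.pattern(); dat is a hypothetical data frame

# Percentage of missing values per variable
round(100 * colMeans(is.na(dat)), 1)

# Number of rows that would survive a complete case (listwise deletion) analysis
sum(complete.cases(dat))   # compare with nrow(dat)

# Joint pattern of missingness across variables
md.pattern(dat, plot = FALSE)
```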

The persistence of these suboptimal methods points to a critical issue beyond mere statistical technique. The failure to explicitly report the extent of missing data or to justify the method used to handle it is a matter of scientific integrity. Principled missing data analysis is not just about getting a more accurate p-value; it is about a commitment to transparency and producing the most robust and honest results possible from the available evidence.


The “Why” Behind the “What”: Assumptions of Missingness

Before any action can be taken to address missing data, the researcher must make a reasoned judgment about the mechanism that caused the data to be missing. This is not a statistical procedure but a theoretical assessment based on subject-matter knowledge. The choice of assumption is the single most important step, as it dictates the entire analytical strategy that follows. There are three core mechanisms of missingness.

A Detailed Breakdown of the Three Core Mechanisms

  • Missing Completely at Random (MCAR): This is the simplest but most restrictive assumption. Data are said to be MCAR if the probability of a value being missing is completely unrelated to any other variable in the dataset, whether observed or unobserved. The missingness is a pure, random process. A classic, albeit rare, example would be a researcher accidentally dropping a test tube, causing a random data point to be lost. Under MCAR, the complete cases are a random subsample of the original target sample.
  • Missing at Random (MAR): This is a more relaxed and often more plausible assumption. Data are MAR if the probability of a value being missing can be fully explained by other observed variables in the dataset. The “at random” part of the name can be misleading; it does not mean the missingness is truly random. It means that conditional on the data we have observed, the missingness is random. For example, in a health survey, men might be less likely than women to answer questions about depression. Here, the probability of missingness on the depression variable depends on the ‘gender’ variable, which is observed. As long as we account for gender in our analysis, we can correct for the potential bias.
  • Not Missing at Random (NMAR): This is the most challenging scenario. Data are NMAR if the probability of a value being missing is related to the value of that variable itself, even after accounting for all other observed variables. In this case, the reason for the missingness is the unobserved value. For example, individuals with very high incomes may be less likely to report their income, or patients who are feeling very ill may be more likely to miss a follow-up appointment. Under NMAR, the missingness is non-ignorable, and standard methods are generally biased. (The three mechanisms are contrasted in the short simulation sketch after this list.)
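
The distinction between the three mechanisms becomes concrete when missingness is generated artificially. The simulation sketch below (in R, with made-up variables age and income) creates the same income variable under MCAR, MAR, and NMAR; only the rule deciding which values disappear changes.

```r
set.seed(1)
n      <- 1000
age    <- rnorm(n, mean = 50, sd = 10)
income <- rnorm(n, mean = 30 + 0.5 * age, sd = 5)

# MCAR: a fixed 20% of income values vanish, independent of everything
income_mcar <- ifelse(runif(n) < 0.20, NA, income)

# MAR: older respondents are more likely to skip the income question,
# so missingness depends only on the observed variable age
p_mar      <- plogis(-4 + 0.06 * age)
income_mar <- ifelse(runif(n) < p_mar, NA, income)

# NMAR: high earners are more likely to withhold their income,
# so missingness depends on the unobserved value itself
p_nmar      <- plogis(-10 + 0.15 * income)
income_nmar <- ifelse(runif(n) < p_nmar, NA, income)
```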

Table: Summary of Missing Data Mechanisms

A clear way to distinguish these abstract concepts is with a summary table. It serves as a quick reference and reinforces the key distinctions that guide the choice of analytical method.

| Mechanism | Definition | Implication for Analysis | Plausibility in Practice |
| --- | --- | --- | --- |
| MCAR | Missingness is a purely random process, unrelated to any data. | Complete Case Analysis is unbiased but may be inefficient (loss of power). | Rare. Often an unrealistic assumption. |
| MAR | Missingness is explainable by other observed variables. | Complete Case Analysis is biased. Principled methods like Multiple Imputation are required and valid. | Often considered a plausible working assumption, especially with a rich dataset containing many predictors of missingness. |
| NMAR | Missingness depends on the unobserved missing value itself. | Both Complete Case Analysis and standard Multiple Imputation are biased. Requires specialized sensitivity analyses. | Plausible in many scenarios, especially those involving social stigma, extreme values, or health outcomes. |

The crucial takeaway is that the most powerful and widely used methods for handling missing data, such as Multiple Imputation, operate under the MAR assumption. This leads to a fundamental challenge for the researcher. The slides explicitly state, “it is not possible to distinguish between MAR and MNAR using observed data”. This creates an apparent paradox: to proceed with the best available methods, one must make an assumption that cannot be statistically proven or disproven with the data at hand.

The resolution to this paradox lies in shifting the burden of proof from a statistical test to a well-reasoned, subject-matter argument. A researcher cannot simply run a test to “choose” MAR. Instead, they must build a compelling case for why MAR is a plausible assumption in their specific research context. This involves a deep understanding of the data collection process and the substantive area of study. The strength of the final analysis rests not on a p-value from a test, but on the plausibility of this foundational, untestable assumption.


Can We Test the Assumptions? The Role and Limits of MCAR Tests

Given the importance of the missingness assumption, it is natural to ask if there are formal statistical tests to guide the decision. While there is no test to distinguish between MAR and NMAR, there are tests for the strictest assumption, MCAR. These tests, however, should be seen as limited diagnostic tools, not as definitive oracles.

Conceptual Goal of MCAR Tests

Tests for MCAR, such as Little’s Chi-Squared Test, are designed to evaluate the null hypothesis that the data are, in fact, Missing Completely at Random. Conceptually, they work by partitioning the data based on the pattern of missingness (e.g., one group missing variable X, another group missing variable Y, a third group with complete data). The test then compares the characteristics of the observed data—typically the means of the variables—across these different groups. If the data are truly MCAR, one would expect the variable means to be similar across all patterns of missingness. A statistically significant test result suggests that the means differ, which provides evidence against the MCAR assumption.
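
In R, one readily available implementation of Little's test is mcar_test() from the naniar package (the function name and the output described here are assumptions about that package, not part of the original material). As a rough sketch:

```r
library(naniar)   # assumed to provide mcar_test(), an implementation of Little's test

# dat is a hypothetical data frame containing missing values
mcar_test(dat)    # returns a chi-squared statistic, degrees of freedom, and p-value
```

A significant p-value argues against MCAR; a non-significant one does not establish it, for the reasons discussed below.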

The Critical Limitations

While useful, it is essential to understand the significant limitations of MCAR tests to avoid misinterpreting their results.

  • A One-Way Street: Hypothesis tests are designed to reject, not accept, a null hypothesis. Therefore, a significant p-value from an MCAR test provides evidence to reject the MCAR assumption. However, a non-significant result does not prove that the data are MCAR. It simply means there was insufficient evidence in the data to reject the null hypothesis. This could be due to the data truly being MCAR, or it could be due to low statistical power.
  • The MAR/NMAR Blind Spot: This is the most critical limitation. An MCAR test provides no information whatsoever to help distinguish between MAR and NMAR. If the test rejects MCAR, the researcher knows the data are either MAR or NMAR, but the test offers no guidance on which is more likely. This is often the more crucial decision for the subsequent analysis.
  • Power Issues: MCAR tests can have low statistical power, especially in smaller datasets or when the departure from MCAR is subtle. This means the test might fail to detect a true deviation from MCAR, leading to a non-significant result even when the data are actually MAR or NMAR.

These tests should be viewed as one piece of exploratory evidence in a broader investigation of the missing data mechanism, not as a standalone decision-making algorithm. Their primary utility is to serve as a statistical “red flag.” If an MCAR test is significant, it provides strong evidence that a naive approach like complete case analysis is inappropriate and will likely lead to biased results. If the test is not significant, the researcher is not absolved of responsibility. They must still rely on their subject-matter expertise and knowledge of the data collection process to make a reasoned judgment about the plausibility of MAR versus NMAR before proceeding with more advanced methods.


The Solution: A Journey Through Imputation Methods

Once the missing data problem has been diagnosed and an assumption about its mechanism has been made, the next step is to implement a solution. The most principled solutions involve imputation, the process of filling in missing data with substituted values to create a complete dataset. This section explores the evolution of imputation techniques, from flawed single imputation methods to the more robust multiple imputation framework.

Single Imputation: A First Step

Single imputation methods replace each missing value with one plausible value. While this produces a conveniently complete dataset, it is a fundamentally flawed approach because it fails to account for the uncertainty inherent in the imputation process.

  • Mean Imputation: The simplest method, where each missing value is replaced by the mean of the observed values for that variable. This artificially reduces the variance of the variable and distorts its relationships with other variables.
  • Regression Imputation: An improvement that uses the relationships between variables. A regression model is built using the complete cases to predict the missing variable from other variables. The missing values are then filled in with their predicted values. However, this method is still flawed because all the imputed values fall perfectly on the regression line, understating the true variability of the data.
  • Stochastic Regression Imputation: This method addresses the flaw of regression imputation by adding a random error term to each predicted value. This restores the natural variance but can sometimes produce implausible values (e.g., negative height) if the error term is large.
  • Hot-Deck Imputation: In this method, a missing value is filled with an observed response from a “donor” individual who is similar on key matching variables. The donor is picked at random from a pool of similar individuals, ensuring the imputed value is a realistic, observed value from the dataset.
  • Predictive Mean Matching (PMM): A sophisticated and generally well-regarded single imputation method. Like regression imputation, it starts by generating a predicted value for each missing entry. However, instead of using this prediction directly, it identifies a small set of “donor” observations from the complete cases whose predicted values are closest to the prediction for the missing entry. It then randomly selects one of these donors and uses their actual, observed value as the imputed value. This ensures that all imputed values are plausible and realistic, as they are drawn from the set of observed data. (A minimal single-imputation sketch follows this list.)
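
As a minimal illustration (the data frame dat and the variable bmi are hypothetical), both mean imputation and PMM can be produced with the mice package by requesting a single imputed dataset (m = 1). Either result would still suffer from the unifying flaw described in the next subsection if used naively for inference.

```r
library(mice)

# Mean imputation for one hypothetical numeric variable, bmi
meth <- make.method(dat)    # sensible default method for every incomplete variable
meth["bmi"] <- "mean"       # overwrite: impute bmi with its observed mean
imp_mean <- mice(dat, m = 1, method = meth, maxit = 1, printFlag = FALSE)

# Single imputation with the default methods
# (predictive mean matching for numeric variables), still m = 1
imp_pmm <- mice(dat, m = 1, printFlag = FALSE, seed = 7)

# Extract the completed datasets for comparison
dat_mean <- complete(imp_mean)
dat_pmm  <- complete(imp_pmm)
```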

When Single Imputation May Be Considered

While generally discouraged for final inferential analysis, there are specific scenarios where single imputation may be considered a pragmatic choice:

  • Clinical Trials: It is often preferred for imputing missing baseline covariates in randomized clinical trials.
  • Missing Outcome with Auxiliary Variables: If only the outcome variable is missing and strong auxiliary variables (proxies for the outcome) are available, single imputation may be more effective than complete case analysis.
  • Prediction Problems: In machine learning contexts focused on prediction, single imputation methods can be used, though pooling results from multiple imputations is not straightforward.

The Unifying Flaw of Single Imputation

Despite their increasing sophistication, all single imputation methods share a critical, unifying flaw: the subsequent statistical analysis treats the imputed values as if they were real, observed data. This failure to acknowledge the uncertainty of the imputation process—the fact that we do not know the true missing value and have only made an educated guess—leads to standard errors that are too small, confidence intervals that are too narrow, and p-values that are artificially significant. The analysis becomes overly precise and overly optimistic.

Multiple Imputation (MI): The Gold Standard

Multiple Imputation (MI) was developed specifically to solve this uncertainty problem. Instead of creating one “complete” dataset, MI creates multiple (e.g., \(m=20\) or \(m=40\)) complete datasets. Each dataset is generated using a similar process to stochastic imputation, but because of the random component, the imputed values are slightly different in each of the m datasets. This collection of datasets explicitly represents our uncertainty about what the true missing values might have been.

By analyzing all m datasets and then formally combining the results, MI provides a single final estimate that correctly incorporates both the normal sampling uncertainty (from having a finite sample) and the additional uncertainty that arises from the missing data. This makes it the gold standard approach for handling missing data under the MAR assumption.


The Multiple Imputation Workflow in Detail

The MI process can be demystified by breaking it down into three conceptual steps: Impute, Analyze, and Pool. This workflow provides a flexible and powerful framework for obtaining valid statistical inferences in the presence of missing data.

Step 1: The Imputation Phase - Creating Plausible Realities

The goal of this phase is to generate m complete datasets where the imputed values are plausible draws from their predicted distribution, conditional on all the observed data.

  • Method (MICE): The most common and flexible algorithm for this phase is Multiple Imputation by Chained Equations (MICE), also known as Fully Conditional Specification (FCS). MICE is an iterative process that handles missing data on multiple variables at once. It tackles the problem one variable at a time, cycling through the variables with missing data. For each variable, it fits a regression model to predict it from all other variables in the dataset and then imputes the missing values based on that model’s predictions, including a random component. This cycle is repeated several times until the process converges, resulting in one complete dataset. The entire process is then repeated m times to generate the m imputed datasets.
  • Building the Imputation Model: The success of MI hinges on the quality of the imputation model. This model should be inclusive and, in general, more complex than the final scientific model. The goal of the imputation model is not to test a hypothesis but to accurately preserve the complex web of relationships (correlations, means, variances) among all variables in the dataset. A good imputation model should contain:
    • All variables from the final analysis model, including the outcome variable.
    • Auxiliary variables: These are variables that are correlated with the variables that have missingness, or are correlated with the missingness itself, even if they are not of scientific interest in the final analysis. Including them helps make the MAR assumption more plausible and can improve the precision of the final estimates.
    • Higher-order terms (e.g., squared terms) or interactions if they are thought to be important for capturing the relationships in the data.
  • Practical Considerations for the Imputation Model:
    • Number of Imputations (m): A common rule of thumb suggests that the number of imputations, m, should be at least as large as the percentage of subjects with any missing data. Modern recommendations often suggest between 20 and 100 imputations.
    • Number of Iterations: MICE is an iterative algorithm. In each cycle, it updates the imputed values based on the progressively improved predictions from the other variables. The algorithm is run for a set number of iterations to allow the imputed values to stabilize, a state known as convergence.
    • Handling Non-Normal Data: For continuous variables that are not normally distributed (e.g., skewed), one approach is to transform the variable before imputation and transform it back afterward. However, this can distort relationships and complicate interpretation. A more robust and often preferred strategy within MICE is to use Predictive Mean Matching (PMM), which is well-suited for non-normal data because it imputes values directly from the observed data, thereby preserving the original distribution.

A common point of confusion is why the outcome variable should be included as a predictor in the imputation model. This seems circular or like “cheating.” However, this stems from a misunderstanding of the imputation model’s goal. The goal is not merely to predict a missing covariate \(X\), but to impute \(X\) in a way that preserves its true relationship with the outcome Y. The outcome \(Y\) is often the single best predictor of \(X\). Excluding it from the imputation model would cause the imputed values of \(X\) to have a weaker relationship with \(Y\) than the observed values of \(X\) do, biasing any estimated association between \(X\) and \(Y\) towards zero. The imputation model’s purpose is structural preservation, which enables the subsequent analysis model to accurately test a specific hypothesis.
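
A minimal sketch of the imputation phase with the mice package follows; the data frame dat, its variables, and the imputation methods chosen for them are hypothetical. The outcome y is deliberately left in the default predictor matrix so that it informs the imputation of the covariates, as argued above.

```r
library(mice)

# One imputation method per incomplete variable
meth <- make.method(dat)
meth["bmi"]    <- "pmm"      # skewed continuous covariate: predictive mean matching
meth["smoker"] <- "logreg"   # binary covariate: logistic regression imputation

# By default every variable, including the outcome y and any auxiliary
# variables present in dat, predicts every other variable
pred <- make.predictorMatrix(dat)

imp <- mice(dat,
            m = 20,              # number of imputed datasets
            maxit = 10,          # iterations of the chained equations
            method = meth,
            predictorMatrix = pred,
            seed = 2024,
            printFlag = FALSE)
```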

Step 2: The Analysis Phase - Analyzing Each Reality

Once the m complete datasets have been generated, the researcher performs their primary scientific analysis independently on each of the datasets. For example, if the research question involves fitting a logistic regression model, that exact same model is fitted to dataset 1, dataset 2, and so on, up to dataset m. This step is straightforward and results in m different sets of parameter estimates (e.g., m regression coefficients) and m different standard errors.
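
Continuing the sketch from the imputation phase, the same (hypothetical) logistic regression model is fitted to each of the m completed datasets with with(), which returns a single object holding all m fits.

```r
# Fit the scientific model to each of the m imputed datasets
fit <- with(imp, glm(y ~ exposure + age + sex, family = binomial))

# fit$analyses is a list of m fitted models:
# m sets of coefficients and m sets of standard errors
```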

Step 3: The Pooling Phase - Synthesizing the Results with Rubin’s Rules

This is the final and crucial step where the results from the m separate analyses are combined into a single, valid inference using a set of formulas known as Rubin’s Rules.

  • The Pooled Estimate: The final point estimate for any parameter (e.g., a regression coefficient) is simply the average of the m estimates obtained in the analysis phase.
  • The Pooled Variance: This is the key to MI’s success. The total variance of the pooled estimate correctly accounts for all sources of uncertainty and is composed of two parts:
    1. Within-Imputation Variance (\(\bar{U}\)): This is the average of the variances from each of the m analyses. It represents the normal sampling uncertainty we would have if our data had been complete from the start.
    2. Between-Imputation Variance (\(B\)): This is the variance of the parameter estimates across the m datasets. It directly captures the extra uncertainty that is due to the missing data. If the missing data were not very influential, the estimates from all m datasets would be very similar, and \(B\) would be small. If the missing data were very influential, the estimates would vary more, and \(B\) would be large.

The formula for the total variance (\(T\)) is \(T = \bar{U} + B(1 + 1/m)\). This elegant formula shows how MI correctly inflates the standard error to account for the uncertainty from missing data (\(B\)), solving the primary problem of single imputation and yielding valid confidence intervals and p-values. The “fraction of missing information” (FMI) is a useful metric derived from this process, which quantifies the proportion of the total variance that is attributable to the missing data.
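
In practice, pool() applies Rubin's Rules automatically, but the arithmetic is simple enough to reproduce by hand for a single coefficient. The sketch below continues the earlier hypothetical example; the coefficient name "exposure" is an assumption about that model.

```r
# Pooling with mice
pooled <- pool(fit)
summary(pooled, conf.int = TRUE)   # pooled estimates, standard errors, CIs
pooled$pooled                      # also holds ubar, b, t, and the fmi per term

# The same arithmetic by hand for one coefficient
est <- sapply(fit$analyses, function(f) coef(f)["exposure"])
se  <- sapply(fit$analyses, function(f) sqrt(vcov(f)["exposure", "exposure"]))
m   <- length(est)

qbar  <- mean(est)              # pooled point estimate
ubar  <- mean(se^2)             # within-imputation variance (U-bar)
b     <- var(est)               # between-imputation variance (B)
total <- ubar + (1 + 1/m) * b   # total variance, T = U-bar + (1 + 1/m) B
sqrt(total)                     # pooled standard error
```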

Step 4: Convergence and Diagnostics

After running the imputation, it is essential to perform diagnostic checks. A key diagnostic is the convergence plot, which traces the mean and standard deviation of the imputed values for each variable across the iterations for each imputed dataset. For healthy convergence, these trace lines should appear as stationary, horizontal bands of random noise, without any clear upward or downward trends. This indicates that the algorithm has stabilized and the imputed values are reliable.
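
With mice, the standard convergence check is plot() on the imputation object, which draws exactly these trace lines; densityplot() additionally compares the distributions of observed and imputed values (both are lattice-based plots provided by the package). A minimal sketch, continuing the earlier example:

```r
# Trace plots of the mean and SD of the imputed values across iterations,
# one line per imputed dataset; healthy traces look like horizontal noise
plot(imp)

# Compare the distributions of observed and imputed values for each variable
densityplot(imp)
```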


Handling Missing Outcomes with MID

A common point of hesitation for researchers new to imputation is what to do when the dependent variable (outcome) itself is missing. There is often a fear that imputing the outcome might artificially create the very results the study aims to find. While this concern is understandable, simply deleting subjects with missing outcomes (complete case analysis) is often biased under the MAR assumption. A strategy known as ‘Multiple Imputation, then Deletion’ (MID) offers a principled solution.

The Dilemma of Imputing the Outcome

If a predictor variable \(X\) is missing for a subject, the value of their outcome \(Y\) can be very informative for imputing \(X\). Ignoring subjects with a missing outcome during the imputation phase means throwing away valuable information that could have improved the imputation of other variables. However, some argue that using the imputed outcomes in the final analysis model may add unnecessary noise, especially if the imputation model for the outcome is not perfectly specified.

The ‘Multiple Imputation, then Deletion’ (MID) Strategy

The MID approach cleverly navigates this dilemma with a three-step conceptual process, sketched in code after the list below. It is particularly popular when there is a high percentage of missing values in the outcome (e.g., 20%-50%).

  • Step A (Impute): Perform a standard multiple imputation on the entire dataset. Crucially, the outcome variable (\(Y\)) is included in the imputation model and is itself imputed. This ensures that all available information, including from subjects with missing outcomes, is used to create the best possible imputations for the predictor variables (\(X\)s).
  • Step B (Delete): After the imputation phase is complete and the m datasets have been generated, delete the observations for which the outcome variable was originally missing. This means the imputed values of \(Y\) are discarded and will not be used in the final analysis model.
  • Step C (Analyze & Pool): Proceed with the standard analysis and pooling steps using only the observations that had an observed outcome from the beginning. The analysis is performed on the m datasets, each of which now contains only subjects with observed outcomes but has fully imputed predictors.
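
A minimal sketch of MID with the mice package follows (the data frame dat, outcome y, and covariates are hypothetical); the key point is that the subset used in the analysis is defined by missingness in the original data, not in the imputed data.

```r
library(mice)

# Step A: impute the full dataset; the outcome y is imputed along with the predictors
imp <- mice(dat, m = 20, seed = 42, printFlag = FALSE)

# Step B: identify subjects whose outcome was observed in the original data
keep <- !is.na(dat$y)

# Step C: refit the analysis model on those subjects only, within each
# completed dataset, then pool with Rubin's Rules
fits <- lapply(complete(imp, action = "all"), function(d)
  glm(y ~ exposure + age + sex, family = binomial, data = d[keep, ]))
summary(pool(as.mira(fits)))
```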

Rationale, Extensions, and Sensitivity Analysis

The core idea behind MID is to separate the task of imputing predictors from the task of estimating the relationship of interest. It uses the information from the full dataset (including subjects with missing \(Y\)) to get the best possible imputations for the predictors, and then uses only the reliable, observed data to get the best possible estimate of the relationship between those predictors and the outcome. This strategy operates under the assumption that the imputed outcomes themselves do not add useful information to the regression analysis of interest and may only add statistical noise. The same MID logic can be applied if a key exposure variable is missing. When in doubt, MID can also be used as a sensitivity analysis: a researcher can compare the results from a full MI analysis with the results from an MID analysis to gauge the impact of the imputed outcomes on the final conclusions.


Effect Modification Analysis with MI

The flexible Impute -> Analyze -> Pool framework of MI is not limited to simple main effects models. It can be readily extended to investigate more complex scientific questions.

Effect modification occurs when the effect of an exposure on an outcome differs across levels of a third variable, the effect modifier. For example, a new drug’s effect on blood pressure might be stronger in women than in men. Here, gender is an effect modifier. Statistically, this is often tested by including an interaction term in a regression model (e.g., \(Y \sim \text{Drug} + \text{Gender} + \text{Drug} \times \text{Gender}\)).

To test for effect modification in the presence of missing data, the MI workflow is adapted as follows:

  • Step 1 (Impute): Perform multiple imputation as usual. It is critical that the exposure, the outcome, and the potential effect modifier are all included in the imputation model. To best preserve the potential interaction, it is also highly recommended to include the interaction term itself in the imputation model.
  • Step 2 (Analyze): In the analysis phase, fit the regression model that includes the interaction term (e.g., \(Y \sim X + Z + X \times Z\)) to each of the m imputed datasets.
  • Step 3 (Pool): Pool the results from the m models using Rubin’s Rules. This will yield a single pooled estimate, standard error, and p-value for the main effects of \(X\) and \(Z\), and, most importantly, for the interaction term \(X \times Z\). A statistically significant pooled interaction term provides evidence for effect modification.

While pooling the interaction term is statistically valid, interpreting the coefficient for an interaction term can be non-intuitive. A more practical and often more interpretable approach involves performing a stratified analysis in Step 2. Instead of fitting one interaction model, one can fit separate, simpler models for each level of the effect modifier. This process yields stratum-specific effect estimates (e.g., the final pooled Odds Ratio for treatment in males and the final pooled Odds Ratio for treatment in females). These can then be directly compared to assess effect modification in a way that is often easier to communicate and understand than an interaction coefficient.
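
Both routes can be sketched with mice, reusing an imputation object imp built as in the earlier examples; the exposure x, effect modifier z (assumed coded 0/1), and outcome y are hypothetical, and the interaction is assumed to have been represented in the imputation model.

```r
# Pooled interaction model across the m imputed datasets
fit_int <- with(imp, glm(y ~ x * z, family = binomial))
summary(pool(fit_int))            # the x:z row is the pooled interaction term

# Stratified alternative: refit within one level of z in each completed
# dataset and pool separately (repeat for the other level)
fit_z1 <- lapply(complete(imp, action = "all"), function(d)
  glm(y ~ x, family = binomial, data = d[d$z == 1, ]))
summary(pool(as.mira(fit_z1)))    # pooled effect of x among subjects with z == 1
```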

Variable Selection with MI

A common challenge is how to perform variable selection (e.g., stepwise regression) when using MI. Because the analysis is run on m different datasets, the variable selection procedure might choose a different set of “best” predictors for each one, making it difficult to pool the results into a single final model. Several strategies have been proposed to handle this (the Wald test approach is sketched after the list):

  • Majority Rule: Perform variable selection on each of the m imputed datasets. The final model includes only those variables that are selected in a majority (more than half) of the analyses.
  • Stacked Regression: Stack all m imputed datasets into one large dataset. Then, perform a single variable selection procedure on this large, stacked dataset.
  • Wald Test Approach: This method involves fitting nested models and using a pooled Wald test (or a similar test statistic) to compare them. This is generally considered a highly principled approach for variable selection with multiply imputed data.
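
For the Wald test approach, mice offers pooled model-comparison tests for nested models fitted on the same imputed datasets; D1() performs a multivariate Wald test (this usage reflects my understanding of the package, and the variable names are hypothetical).

```r
# Full model and a nested model without the candidate variable
fit_full <- with(imp, glm(y ~ exposure + age + sex + bmi, family = binomial))
fit_red  <- with(imp, glm(y ~ exposure + age + sex, family = binomial))

# Pooled multivariate Wald test comparing the nested models
D1(fit_full, fit_red)
```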

The Challenge of NMAR and Sensitivity Analysis

The most difficult missing data mechanism to handle is Not Missing at Random (NMAR), where the probability of missingness depends on the unobserved value itself.

Why NMAR Produces Bias

Standard methods like complete case analysis and MAR-based multiple imputation assume that the missingness can be explained by observed data. This assumption is violated under NMAR. For example, if patients who are sicker are more likely to drop out of a study, their missing health data is directly related to their unobserved, worsening health status. Because the reason for missingness cannot be directly observed or modeled from the available data, standard methods will produce biased estimates.

Sensitivity Analysis for NMAR

Since the NMAR assumption cannot be formally tested against MAR, the recommended approach is to conduct a sensitivity analysis. This involves intentionally imputing the missing values under different plausible NMAR scenarios to see how sensitive the study’s conclusions are to these changes. For example, one might impute missing health data under a “best-case” scenario (assuming dropouts were healthier than observed) and a “worst-case” scenario (assuming they were sicker). If the study’s main conclusions remain unchanged across these different scenarios, it provides greater confidence in the robustness of the findings. One common technique for this is delta-adjustment, where the imputed values are systematically shifted to reflect a hypothesized difference between the missing and observed groups.
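
One way to implement delta-adjustment with mice is through its post-processing hook, which lets an expression modify the freshly imputed values of a variable. The sketch below follows the pattern described in Van Buuren (2018), with a hypothetical outcome sbp and arbitrary delta values that would need to be justified on subject-matter grounds.

```r
library(mice)

deltas  <- c(0, -5, -10, -15)   # hypothesized shifts for the imputed values
results <- lapply(deltas, function(delta) {
  post <- make.post(dat)
  # after each imputation of sbp, shift the imputed values by delta
  post["sbp"] <- paste0("imp[[j]][, i] <- imp[[j]][, i] + ", delta)
  imp <- mice(dat, m = 20, post = post, seed = 1, printFlag = FALSE)
  summary(pool(with(imp, lm(sbp ~ treatment + age))))
})
names(results) <- paste0("delta = ", deltas)   # compare conclusions across scenarios
```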


Reporting guidelines when missing data is present

Best Practices for Transparent Reporting

This guide has journeyed from the fundamental problems caused by missing data—bias, power loss, and incorrect precision—to the principled, modern solution of Multiple Imputation. The core takeaways are that handling missing data requires careful thought, that the choice of an underlying assumption like MAR is a reasoned argument based on subject-matter knowledge rather than a statistical fact, and that the ultimate goal of imputation is not just to fill in blanks, but to do so in a way that preserves the original data structure and correctly represents our uncertainty about the missing values.

To ensure that research is both reproducible and credible, transparent reporting is paramount. Based on common problems identified in the scientific literature, any analysis using MI should be accompanied by a clear and detailed description of the process.

A Blueprint for Reporting Multiple Imputation

A robust report or publication should include the following key elements:

  • Extent of Missing Data: Report the percentage of missing observations for each variable included in the analysis.
  • Assumed Missing Data Mechanism: Explicitly state the assumed mechanism (e.g., MAR) and provide a brief, clear justification for why this assumption is plausible in the context of the study’s design and data collection procedures.
  • Imputation Software: State the specific software package and version used to perform the multiple imputation (e.g., mice package in R, version 3.13.0).
  • Imputation Model Specification: Describe the imputation model in detail. This includes listing all variables used as predictors in the imputation model, specifying any auxiliary variables that were included to improve the imputation, and noting the type of model used for each variable being imputed (e.g., predictive mean matching, logistic regression).
  • Number of Imputations: Report the number of imputed datasets (m) that were created.
  • Pooling Method: State that the results were combined across the m datasets using Rubin’s Rules.
  • Diagnostics: Briefly mention any diagnostic checks that were performed to assess the convergence of the imputation algorithm and the plausibility of the imputed values.

By following this blueprint, researchers can provide the necessary information for readers and reviewers to critically evaluate the analysis, thereby strengthening the credibility of the findings and contributing to a more rigorous and transparent scientific culture.

Video Lesson Slides

Missing data

Reporting guideline

References

Granger, Elizabeth, Jamie C. Sergeant, and Mark Lunt. 2019. “Avoiding Pitfalls When Combining Multiple Imputation and Propensity Scores.” Statistics in Medicine 38 (26): 5120–32.
Hughes, Rachael A., Jon Heron, Jonathan A. Sterne, and Kate Tilling. 2019. “Accounting for Missing Data in Statistical Analyses: Multiple Imputation Is Not Always the Answer.” International Journal of Epidemiology 1: 11.
Lumley, Thomas. 2011. Complex Surveys: A Guide to Analysis Using R. Vol. 565. John Wiley & Sons.
Sterne, Jonathan A., et al. 2009. “Multiple Imputation for Missing Data in Epidemiological and Clinical Research: Potential and Pitfalls.” BMJ 338: b2393.
Van Buuren, Stef. 2018. Flexible Imputation of Missing Data. Chapman & Hall/CRC.