Missing data analysis

Background

The chapter provides a comprehensive guide to missing data analysis, emphasizing various imputation techniques to address data gaps. It begins by introducing the concept of imputation and the different types of missing data: MCAR, MAR, and MNAR. It then delves into multiple imputation methods for complex survey data, highlighting the importance of visualizing missing data patterns, creating multiple imputed datasets, and pooling results for a consolidated analysis. The challenges of imputing dependent and exposure variables are addressed, with a focus on the benefits of using auxiliary variables. The chapter also explores the estimation of model performance in datasets with missing values, using metrics such as the AUC and the Archer-Lemeshow test. Special attention is given to handling subpopulations with missing observations, testing the MCAR assumption empirically, and understanding effect modification within multiple imputation.

In the vast landscape of survey data analysis, one challenge consistently emerges as both a hurdle and an opportunity: missing data. As we delve into this chapter, we’ll confront the often-encountered issue of incomplete or absent data points in survey datasets. Missing data isn’t just a challenge; it’s an invitation to refine our analytical techniques. This chapter will equip you with the tools and methodologies to handle missing data adeptly, ensuring that our survey data analysis remains robust and reliable.

Important

Datasets:

All of the datasets used in this tutorial can be accessed from this GitHub repository folder

Overview of tutorials

Missing data and imputation

Imputation is a technique used to replace missing data with substituted values. In health research, missing data is a common issue, and imputation helps ensure datasets are complete, leading to more accurate analyses. There are three types of missing data: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR). The type of missingness determines how the missing data should be handled. Various single imputation methods, such as mean imputation, regression imputation, and predictive mean matching, are used depending on the nature of the missing data. Multiple imputation is a process where the incomplete dataset is imputed multiple times and the results are pooled together for more accurate analyses. Variable selection is crucial when analyzing missing data, and approaches such as the majority, stack, and Wald methods are used to determine the best model. It is also essential to assess the impact of missing values on model fitting (convergence and diagnostics) to ensure the reliability of the results.
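As a minimal sketch of this workflow in R, the example below uses the mice package and its built-in nhanes toy dataset; the model formula is purely illustrative.

```r
# A minimal multiple imputation workflow with mice (illustrative only)
library(mice)

dat <- nhanes                 # small toy dataset shipped with mice
md.pattern(dat)               # inspect the missing data pattern

# Create 5 imputed datasets with predictive mean matching
imp <- mice(dat, m = 5, method = "pmm", seed = 123)

fit <- with(imp, lm(chl ~ age + bmi))  # fit the analysis model in each imputed dataset
summary(pool(fit))                     # pool the results with Rubin's rules
```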

Multiple imputation in complex survey data

This tutorial involves assessing missing data patterns and visualizing them to understand the extent of missingness. Multiple imputation is then performed to address the missing data, creating multiple versions of the dataset with varying imputations. After imputation, new variables are created or modified for analysis, and the integrity of the imputed data is checked visually. The tutorial also emphasizes the importance of combining multiple imputed datasets for analysis. Logistic regression is applied to the imputed datasets, and the results are pooled to get a single set of estimates. The tutorial concludes with a variable selection process to identify the most relevant variables for the model.
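A minimal sketch of combining imputation with a survey design is shown below; the data frame dat and the variables psu, strata, weight, outcome, exposure, and age are hypothetical placeholders for whatever the analytic dataset actually contains.

```r
# Hypothetical sketch: multiple imputation followed by design-adjusted analysis
library(mice)
library(survey)
library(mitools)

imp <- mice(dat, m = 5, seed = 123)                    # impute the analytic data
imp_list <- lapply(1:5, function(i) complete(imp, i))  # extract completed datasets

# Build one survey design object that holds all imputed datasets
des <- svydesign(ids = ~psu, strata = ~strata, weights = ~weight,
                 data = imputationList(imp_list), nest = TRUE)

# Fit a design-adjusted logistic regression in each dataset and pool the results
fits <- with(des, svyglm(outcome ~ exposure + age, family = quasibinomial()))
summary(MIcombine(fits))
```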

Multiple imputation then deletion (MID)

This tutorial emphasizes the challenges of imputing dependent and exposure variables. The tutorial underscores the potential benefits of using auxiliary variables in the imputation process. While traditional Multiple Imputation (MI) and MID can yield similar results, MID is particularly advantageous when there’s a significant percentage of missing values in the outcome variable. The tutorial walks through the process of data loading, identifying missing values, performing standard imputations, and adding missing indicators. Subsequent steps involve structuring the data for survey design, fitting statistical models to each imputed dataset, and pooling the results for a consolidated analysis. The final stages focus on calculating and presenting odds ratios to interpret the relationships between variables.
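The deletion step of MID is simple to express in code. Below is a hedged sketch, assuming a hypothetical data frame dat whose outcome column is Y: everything is imputed as usual, and subjects whose outcome was originally missing are then removed before analysis.

```r
# Hypothetical MID sketch: impute everything, then delete imputed outcomes
library(mice)

imp <- mice(dat, m = 5, seed = 123)
y_missing <- is.na(dat$Y)             # rows where the outcome was originally missing

mid_list <- lapply(1:5, function(i) {
  complete(imp, i)[!y_missing, ]      # the "then deletion" step of MID
})
# mid_list can now be wrapped in mitools::imputationList() and analyzed as usual
```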

Model performance from multiple imputed datasets

In the context of survey data analysis, this tutorial outlines the process of estimating model performance, particularly when dealing with weighted data that has missing values. The focus is on estimating treatment effects, both individually and in a pooled manner. Model performance is gauged using the Area Under the Curve (AUC) and the Archer-Lemeshow (AL) test. This is done for models with and without interactions. The results provide insights into the model’s accuracy and fit, with the AUC offering a measure of the model’s discriminative ability and the AL test indicating the model’s goodness of fit to the data. The appendix provides a closer look at the user-defined functions used throughout the analysis.
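As a sketch of per-dataset performance estimation, the code below computes the AUC in each imputed dataset with the pROC package and then pools on the logit scale; the lists fits and dats, the outcome variable, and the logit-scale pooling are assumptions of this example rather than the tutorial's exact method.

```r
# Hypothetical sketch: AUC per imputed dataset, then a pooled summary
library(pROC)

# `fits` is a list of fitted glm models; `dats` the matching imputed datasets
aucs <- mapply(function(fit, d) {
  as.numeric(auc(roc(d$outcome, predict(fit, type = "response"), quiet = TRUE)))
}, fits, dats)

# Pooling on the logit scale (one common convention, assumed here)
logit <- function(p) log(p / (1 - p))
inv_logit <- function(x) exp(x) / (1 + exp(x))
inv_logit(mean(logit(aucs)))          # pooled AUC across imputations
```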

Dealing with subpopulations with missing observations

The primary objective is to showcase how to handle missing data analysis with multiple imputation in the backdrop of complex surveys, particularly when we are interested in subpopulations. The process involves working with the analytic data, imputing missing values from this dataset, accounting for ineligible subjects from the complete data, and reincorporating these ineligible subjects into the imputed datasets. This ensures that the survey’s features can be utilized and the design subsetted accordingly. After importing and inspecting the dataset, the analysis subsets the data based on eligibility criteria, imputes missing values, and prepares the survey design. The subsequent steps involve design-adjusted logistic regression and pooling of estimates using Rubin’s rule.
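A hedged sketch of this reincorporation idea is below; the eligibility rule, variable names, and design variables are all hypothetical stand-ins.

```r
# Hypothetical sketch: impute among the eligible, add the ineligible back,
# then subset the survey design (not the data) to the subpopulation
library(mice)
library(survey)
library(mitools)

eligible <- complete_data$age >= 20                 # hypothetical eligibility rule
imp <- mice(complete_data[eligible, ], m = 5, seed = 123)

full_list <- lapply(1:5, function(i) {
  d <- complete_data                                # start from the full sample
  d[eligible, ] <- complete(imp, i)                 # overwrite eligible rows with imputed values
  d$eligible <- eligible
  d
})

des <- svydesign(ids = ~psu, strata = ~strata, weights = ~weight,
                 data = imputationList(full_list), nest = TRUE)

# Design-adjusted regression within the subpopulation, pooled with Rubin's rule
fits <- with(subset(des, eligible), svyglm(outcome ~ exposure, family = quasibinomial()))
summary(MIcombine(fits))
```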

Testing MCAR assumption empirically in the data

The tutorial discusses the process of testing for Missing Completely At Random (MCAR) in datasets. Initially, essential packages are loaded to facilitate the analysis. A DAG is defined to represent the causal relationships between dataset variables, and this DAG is used to simulate a dataset. The DAG is then visualized, and the simulated dataset undergoes random data omission to mimic MCAR scenarios. Various visualizations, such as margin plots, are employed to understand the distribution of missing values in relation to other variables. Little’s MCAR test, a statistical method, is applied to determine if the data is indeed MCAR. The test’s limitations are also discussed. Additionally, tests for multivariate normality and homoscedasticity are conducted. In a subsequent section, data is intentionally set to missing based on a specific rule, and similar analyses and visualizations are performed to understand the nature of this missingness.
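One way to run Little’s MCAR test in R is via the naniar package, paired with a VIM margin plot; the simulated data below are purely illustrative and are not the tutorial’s DAG-based dataset.

```r
# Illustrative sketch: simulate data, remove values at random, test for MCAR
library(naniar)
library(VIM)

set.seed(123)
x <- rnorm(200)
dat <- data.frame(x = x, y = 0.3 * x + rnorm(200))
dat$x[sample(200, 30)] <- NA        # deletions made completely at random

marginplot(dat[, c("x", "y")])      # margin plot of missingness in x against y
mcar_test(dat)                      # Little's MCAR test; a large p-value is consistent with MCAR
```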

Effect modification within multiple imputation

The tutorial delves into the intricacies of effect modification within the realm of survey data analysis. A dataset comprising several imputed datasets is used. The primary objective is to investigate how two specific variables interact in predicting a particular outcome. To this end, logistic regression models are constructed for each level of a categorical variable. Emphasis is placed on the significance of Odds Ratios (ORs) in interpreting these interactions. Subsequently, simple slopes analyses are performed for each imputed dataset, shedding light on the relationship between the predictor and the outcome at distinct levels of a moderating variable. The outcomes from each imputed dataset are then pooled to offer a comprehensive understanding of the effect modification.
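A minimal sketch of pooling an interaction model across imputed datasets with mice is shown below; the variable names outcome, exposure, and sex are hypothetical, and the simple slopes step is only noted in a comment.

```r
# Hypothetical sketch: effect modification across imputed datasets
library(mice)

imp <- mice(dat, m = 5, seed = 123)

# Interaction between the exposure and the moderator (here called sex)
fit <- with(imp, glm(outcome ~ exposure * sex, family = binomial()))

# Pooled odds ratios with 95% confidence intervals
summary(pool(fit), conf.int = TRUE, exponentiate = TRUE)

# For simple slopes, refit within each level of the moderator and pool as above
```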

Tip

Optional Content:

You’ll find that some sections conclude with an optional video walkthrough that demonstrates the code. Keep in mind that the content might have been updated since these videos were recorded. Watching these videos is optional.

Warning

Bug Report:

Fill out this form to report any issues with the tutorial.