Missing in outcome

This section provides a theoretical background on the concept of Multiple Imputation and then Deletion (MID). It highlights the challenges of imputing dependent and exposure variables and introduces the idea of using auxiliary variables to aid imputation. The section also contrasts the results of traditional MI with MID, especially when the number of imputed datasets is high.

Often researchers are reluctant to impute values in the dependent variable (and exposure variable). Particularly, for dependent variable, imputation might not help too much.
However, if you have a good auxiliary variable (e.g., strongly correlated predictor, that are not used in the main analysis), often multiple imputation method can help. Use of auxiliary variables is one of the greatest strengths of MI methods.
MI algorithm generally do not have any special treatment for dependent variable in its original form, and hence ignoring dependent variable completely may not be a good idea in many scenarios.
Multiple imputation followed by deletion of imputed outcomes is known as MID. This is very popular, especially when you have high percentage missing values in the outcome variable (e.g., 20%-50%). For low missing % in outcome, the advantage can be minimal.
We are extending this idea to deletion of imputed exposures as well (researchers are often reluctant to impute primary exposure of interest).
Original MI and MID may result in similar results when m (number of imputed datasets) is higher.

Data

In the initial chunk, we load several packages that provide functions and tools necessary for the subsequent analysis. These packages facilitate multiple imputation, data visualization, and statistical modeling among other tasks.

# Load required packages
library(mice)
library(DataExplorer)
library(VIM)
library(jtools)
library(survey)
library(mitools)

We load the necessary data:

load("Data/missingdata/NHANES17.RData")

The data is briefly inspected to understand its structure. An identifier column is added to uniquely identify each row or observation in the dataset.

require(mice)
nhanes2

nhanes2$id <- 1:nrow(nhanes2)
nhanes2

Outcome and exposure has missing

This chunk focuses on identifying which rows have missing values in both the outcome and exposure variables. The outcome and exposure variables are crucial for the analysis, so understanding where they are missing is essential.

# assume outcome = bmi and exposure = chl 
nhanes2.excludingYA <- subset(nhanes2, !is.na(bmi) & !is.na(chl) )
nhanes2.excludingYA # data without missing A and Y

# identify ids of subjects with missing A & Y 
nhanes2.excludingYA$id
#>  [1]  2  5  7  8  9 13 14 17 18 19 22 23 25

Impute as usual

Using the entire dataset, missing values are imputed. This is done by first initializing an imputation model and then performing the imputation to create multiple datasets where missing values are filled in. The result is a list of datasets with imputed values. That means, we impute Y and A for now, as well as other covariates with missing values.

# use full data to impute 
ini <- mice(nhanes2, pri = FALSE)
ini$method
#>      age      bmi      hyp      chl       id 
#>       ""    "pmm" "logreg"    "pmm"       ""
pred <- ini$predictorMatrix
pred
#>     age bmi hyp chl id
#> age   0   1   1   1  1
#> bmi   1   0   1   1  1
#> hyp   1   1   0   1  1
#> chl   1   1   1   0  1
#> id    1   1   1   1  0
pred[,"id"] <- 0 # as this is not a predictor
m <- 5
imp <- mice(data=nhanes2, m=m, maxit=3, seed=504007)
#> 
#>  iter imp variable
#>   1   1  bmi  hyp  chl
#>   1   2  bmi  hyp  chl
#>   1   3  bmi  hyp  chl
#>   1   4  bmi  hyp  chl
#>   1   5  bmi  hyp  chl
#>   2   1  bmi  hyp  chl
#>   2   2  bmi  hyp  chl
#>   2   3  bmi  hyp  chl
#>   2   4  bmi  hyp  chl
#>   2   5  bmi  hyp  chl
#>   3   1  bmi  hyp  chl
#>   3   2  bmi  hyp  chl
#>   3   3  bmi  hyp  chl
#>   3   4  bmi  hyp  chl
#>   3   5  bmi  hyp  chl
# list format in m data
impdata <- mice::complete(imp, action = "all")
impdata # all IDs are present
#> $`1`
#>      age  bmi hyp chl id
#> 1  20-39 20.4  no 187  1
#> 2  40-59 22.7  no 187  2
#> 3  20-39 27.4  no 187  3
#> 4  60-99 25.5 yes 204  4
#> 5  20-39 20.4  no 113  5
#> 6  60-99 22.5 yes 184  6
#> 7  20-39 22.5  no 118  7
#> 8  20-39 30.1  no 187  8
#> 9  40-59 22.0  no 238  9
#> 10 40-59 20.4  no 199 10
#> 11 20-39 27.5  no 199 11
#> 12 40-59 26.3  no 284 12
#> 13 60-99 21.7  no 206 13
#> 14 40-59 28.7 yes 204 14
#> 15 20-39 29.6  no 187 15
#> 16 20-39 35.3 yes 204 16
#> 17 60-99 27.2 yes 284 17
#> 18 40-59 26.3 yes 199 18
#> 19 20-39 35.3  no 218 19
#> 20 60-99 25.5 yes 284 20
#> 21 20-39 35.3 yes 184 21
#> 22 20-39 33.2  no 229 22
#> 23 20-39 27.5  no 131 23
#> 24 60-99 24.9  no 218 24
#> 25 40-59 27.4  no 186 25
#> 
#> $`2`
#>      age  bmi hyp chl id
#> 1  20-39 28.7 yes 238  1
#> 2  40-59 22.7  no 187  2
#> 3  20-39 30.1  no 187  3
#> 4  60-99 22.5 yes 218  4
#> 5  20-39 20.4  no 113  5
#> 6  60-99 20.4 yes 184  6
#> 7  20-39 22.5  no 118  7
#> 8  20-39 30.1  no 187  8
#> 9  40-59 22.0  no 238  9
#> 10 40-59 22.5 yes 238 10
#> 11 20-39 26.3 yes 118 11
#> 12 40-59 20.4  no 199 12
#> 13 60-99 21.7  no 206 13
#> 14 40-59 28.7 yes 204 14
#> 15 20-39 29.6  no 187 15
#> 16 20-39 24.9  no 238 16
#> 17 60-99 27.2 yes 284 17
#> 18 40-59 26.3 yes 199 18
#> 19 20-39 35.3  no 218 19
#> 20 60-99 25.5 yes 218 20
#> 21 20-39 27.5  no 187 21
#> 22 20-39 33.2  no 229 22
#> 23 20-39 27.5  no 131 23
#> 24 60-99 24.9  no 218 24
#> 25 40-59 27.4  no 186 25
#> 
#> $`3`
#>      age  bmi hyp chl id
#> 1  20-39 25.5  no 229  1
#> 2  40-59 22.7  no 187  2
#> 3  20-39 22.0  no 187  3
#> 4  60-99 20.4  no 199  4
#> 5  20-39 20.4  no 113  5
#> 6  60-99 22.7  no 184  6
#> 7  20-39 22.5  no 118  7
#> 8  20-39 30.1  no 187  8
#> 9  40-59 22.0  no 238  9
#> 10 40-59 27.4  no 184 10
#> 11 20-39 30.1  no 238 11
#> 12 40-59 22.0  no 186 12
#> 13 60-99 21.7  no 206 13
#> 14 40-59 28.7 yes 204 14
#> 15 20-39 29.6  no 229 15
#> 16 20-39 22.0  no 118 16
#> 17 60-99 27.2 yes 284 17
#> 18 40-59 26.3 yes 199 18
#> 19 20-39 35.3  no 218 19
#> 20 60-99 25.5 yes 218 20
#> 21 20-39 29.6  no 131 21
#> 22 20-39 33.2  no 229 22
#> 23 20-39 27.5  no 131 23
#> 24 60-99 24.9  no 218 24
#> 25 40-59 27.4  no 186 25
#> 
#> $`4`
#>      age  bmi hyp chl id
#> 1  20-39 24.9  no 187  1
#> 2  40-59 22.7  no 187  2
#> 3  20-39 27.4  no 187  3
#> 4  60-99 24.9 yes 206  4
#> 5  20-39 20.4  no 113  5
#> 6  60-99 22.5  no 184  6
#> 7  20-39 22.5  no 118  7
#> 8  20-39 30.1  no 187  8
#> 9  40-59 22.0  no 238  9
#> 10 40-59 26.3 yes 187 10
#> 11 20-39 29.6  no 229 11
#> 12 40-59 27.4  no 204 12
#> 13 60-99 21.7  no 206 13
#> 14 40-59 28.7 yes 204 14
#> 15 20-39 29.6  no 186 15
#> 16 20-39 35.3  no 184 16
#> 17 60-99 27.2 yes 284 17
#> 18 40-59 26.3 yes 199 18
#> 19 20-39 35.3  no 218 19
#> 20 60-99 25.5 yes 199 20
#> 21 20-39 27.5  no 187 21
#> 22 20-39 33.2  no 229 22
#> 23 20-39 27.5  no 131 23
#> 24 60-99 24.9  no 284 24
#> 25 40-59 27.4  no 186 25
#> 
#> $`5`
#>      age  bmi hyp chl id
#> 1  20-39 27.2  no 238  1
#> 2  40-59 22.7  no 187  2
#> 3  20-39 24.9  no 187  3
#> 4  60-99 20.4  no 229  4
#> 5  20-39 20.4  no 113  5
#> 6  60-99 21.7  no 184  6
#> 7  20-39 22.5  no 118  7
#> 8  20-39 30.1  no 187  8
#> 9  40-59 22.0  no 238  9
#> 10 40-59 20.4 yes 187 10
#> 11 20-39 25.5  no 118 11
#> 12 40-59 21.7  no 187 12
#> 13 60-99 21.7  no 206 13
#> 14 40-59 28.7 yes 204 14
#> 15 20-39 29.6  no 199 15
#> 16 20-39 27.5  no 187 16
#> 17 60-99 27.2 yes 284 17
#> 18 40-59 26.3 yes 199 18
#> 19 20-39 35.3  no 218 19
#> 20 60-99 25.5 yes 206 20
#> 21 20-39 33.2  no 206 21
#> 22 20-39 33.2  no 229 22
#> 23 20-39 27.5  no 131 23
#> 24 60-99 24.9  no 204 24
#> 25 40-59 27.4  no 186 25
#> 
#> attr(,"class")
#> [1] "mild" "list"

Include a missing indicator (Y & A)

For each imputed dataset, a new column is added to indicate whether the outcome and exposure variables were originally missing. This “missing indicator” column will be used later to subset the data.

# Define formula (making binary Y)
formula <- as.formula("I(bmi>25) ~ chl + hyp")
data.list <- vector("list", m)
# subset the data without Y and A's that had missing values
# and record those subset data
for (i in 1:m) {
  analytic.i <- impdata[[i]]
  analytic.i$miss <- 1
  analytic.i$miss[analytic.i$id %in% nhanes2.excludingYA$id] <- 0
  data.list[[i]] <- analytic.i
}
data.list  # only relevant IDs are present
#> [[1]]
#>      age  bmi hyp chl id miss
#> 1  20-39 20.4  no 187  1    1
#> 2  40-59 22.7  no 187  2    0
#> 3  20-39 27.4  no 187  3    1
#> 4  60-99 25.5 yes 204  4    1
#> 5  20-39 20.4  no 113  5    0
#> 6  60-99 22.5 yes 184  6    1
#> 7  20-39 22.5  no 118  7    0
#> 8  20-39 30.1  no 187  8    0
#> 9  40-59 22.0  no 238  9    0
#> 10 40-59 20.4  no 199 10    1
#> 11 20-39 27.5  no 199 11    1
#> 12 40-59 26.3  no 284 12    1
#> 13 60-99 21.7  no 206 13    0
#> 14 40-59 28.7 yes 204 14    0
#> 15 20-39 29.6  no 187 15    1
#> 16 20-39 35.3 yes 204 16    1
#> 17 60-99 27.2 yes 284 17    0
#> 18 40-59 26.3 yes 199 18    0
#> 19 20-39 35.3  no 218 19    0
#> 20 60-99 25.5 yes 284 20    1
#> 21 20-39 35.3 yes 184 21    1
#> 22 20-39 33.2  no 229 22    0
#> 23 20-39 27.5  no 131 23    0
#> 24 60-99 24.9  no 218 24    1
#> 25 40-59 27.4  no 186 25    0
#> 
#> [[2]]
#>      age  bmi hyp chl id miss
#> 1  20-39 28.7 yes 238  1    1
#> 2  40-59 22.7  no 187  2    0
#> 3  20-39 30.1  no 187  3    1
#> 4  60-99 22.5 yes 218  4    1
#> 5  20-39 20.4  no 113  5    0
#> 6  60-99 20.4 yes 184  6    1
#> 7  20-39 22.5  no 118  7    0
#> 8  20-39 30.1  no 187  8    0
#> 9  40-59 22.0  no 238  9    0
#> 10 40-59 22.5 yes 238 10    1
#> 11 20-39 26.3 yes 118 11    1
#> 12 40-59 20.4  no 199 12    1
#> 13 60-99 21.7  no 206 13    0
#> 14 40-59 28.7 yes 204 14    0
#> 15 20-39 29.6  no 187 15    1
#> 16 20-39 24.9  no 238 16    1
#> 17 60-99 27.2 yes 284 17    0
#> 18 40-59 26.3 yes 199 18    0
#> 19 20-39 35.3  no 218 19    0
#> 20 60-99 25.5 yes 218 20    1
#> 21 20-39 27.5  no 187 21    1
#> 22 20-39 33.2  no 229 22    0
#> 23 20-39 27.5  no 131 23    0
#> 24 60-99 24.9  no 218 24    1
#> 25 40-59 27.4  no 186 25    0
#> 
#> [[3]]
#>      age  bmi hyp chl id miss
#> 1  20-39 25.5  no 229  1    1
#> 2  40-59 22.7  no 187  2    0
#> 3  20-39 22.0  no 187  3    1
#> 4  60-99 20.4  no 199  4    1
#> 5  20-39 20.4  no 113  5    0
#> 6  60-99 22.7  no 184  6    1
#> 7  20-39 22.5  no 118  7    0
#> 8  20-39 30.1  no 187  8    0
#> 9  40-59 22.0  no 238  9    0
#> 10 40-59 27.4  no 184 10    1
#> 11 20-39 30.1  no 238 11    1
#> 12 40-59 22.0  no 186 12    1
#> 13 60-99 21.7  no 206 13    0
#> 14 40-59 28.7 yes 204 14    0
#> 15 20-39 29.6  no 229 15    1
#> 16 20-39 22.0  no 118 16    1
#> 17 60-99 27.2 yes 284 17    0
#> 18 40-59 26.3 yes 199 18    0
#> 19 20-39 35.3  no 218 19    0
#> 20 60-99 25.5 yes 218 20    1
#> 21 20-39 29.6  no 131 21    1
#> 22 20-39 33.2  no 229 22    0
#> 23 20-39 27.5  no 131 23    0
#> 24 60-99 24.9  no 218 24    1
#> 25 40-59 27.4  no 186 25    0
#> 
#> [[4]]
#>      age  bmi hyp chl id miss
#> 1  20-39 24.9  no 187  1    1
#> 2  40-59 22.7  no 187  2    0
#> 3  20-39 27.4  no 187  3    1
#> 4  60-99 24.9 yes 206  4    1
#> 5  20-39 20.4  no 113  5    0
#> 6  60-99 22.5  no 184  6    1
#> 7  20-39 22.5  no 118  7    0
#> 8  20-39 30.1  no 187  8    0
#> 9  40-59 22.0  no 238  9    0
#> 10 40-59 26.3 yes 187 10    1
#> 11 20-39 29.6  no 229 11    1
#> 12 40-59 27.4  no 204 12    1
#> 13 60-99 21.7  no 206 13    0
#> 14 40-59 28.7 yes 204 14    0
#> 15 20-39 29.6  no 186 15    1
#> 16 20-39 35.3  no 184 16    1
#> 17 60-99 27.2 yes 284 17    0
#> 18 40-59 26.3 yes 199 18    0
#> 19 20-39 35.3  no 218 19    0
#> 20 60-99 25.5 yes 199 20    1
#> 21 20-39 27.5  no 187 21    1
#> 22 20-39 33.2  no 229 22    0
#> 23 20-39 27.5  no 131 23    0
#> 24 60-99 24.9  no 284 24    1
#> 25 40-59 27.4  no 186 25    0
#> 
#> [[5]]
#>      age  bmi hyp chl id miss
#> 1  20-39 27.2  no 238  1    1
#> 2  40-59 22.7  no 187  2    0
#> 3  20-39 24.9  no 187  3    1
#> 4  60-99 20.4  no 229  4    1
#> 5  20-39 20.4  no 113  5    0
#> 6  60-99 21.7  no 184  6    1
#> 7  20-39 22.5  no 118  7    0
#> 8  20-39 30.1  no 187  8    0
#> 9  40-59 22.0  no 238  9    0
#> 10 40-59 20.4 yes 187 10    1
#> 11 20-39 25.5  no 118 11    1
#> 12 40-59 21.7  no 187 12    1
#> 13 60-99 21.7  no 206 13    0
#> 14 40-59 28.7 yes 204 14    0
#> 15 20-39 29.6  no 199 15    1
#> 16 20-39 27.5  no 187 16    1
#> 17 60-99 27.2 yes 284 17    0
#> 18 40-59 26.3 yes 199 18    0
#> 19 20-39 35.3  no 218 19    0
#> 20 60-99 25.5 yes 206 20    1
#> 21 20-39 33.2  no 206 21    1
#> 22 20-39 33.2  no 229 22    0
#> 23 20-39 27.5  no 131 23    0
#> 24 60-99 24.9  no 204 24    1
#> 25 40-59 27.4  no 186 25    0
# record the fits from each data

Design, subset and fit

For each imputed dataset, a statistical model is fitted. Before fitting, the data is structured to account for survey design features. Only rows without originally missing outcome and exposure values are used for model fitting. The results of the model fitting for each dataset are stored for later analysis.

require(survey)
fit.list <- vector("list", 5)
for (i in 1:m) {
  analytic.i <- data.list[[i]]
  # assigning survey features = 1
  w.design0 <- svydesign(id=~1, weights=~1,
                        data=analytic.i)
  w.design <- subset(w.design0, miss == 0)
  fit <- svyglm(formula, design=w.design, family=binomial)
  fit.list[[i]] <-  fit
}

Pooled results

After fitting models to each imputed dataset, the results are combined or “pooled”. This pooled result provides a more robust estimate by considering the variability across the imputed datasets.

require(mitools)
pooled.estimates <- MIcombine(fit.list)
pooled.estimates
#> Multiple imputation results:
#>       MIcombine.default(fit.list)
#>                  results         se
#> (Intercept) -1.769918773 2.73723080
#> chl          0.009747869 0.01499608
#> hypyes      18.128613035 1.05775401

# or you can do it this way
betas<-MIextract(fit.list,fun=coef)
vars<-MIextract(fit.list, fun=vcov)
summary(MIcombine(betas,vars))
#> Multiple imputation results:
#>       MIcombine.default(betas, vars)
#>                  results         se      (lower      upper) missInfo
#> (Intercept) -1.769918773 2.73723080 -7.13479255  3.59495501      0 %
#> chl          0.009747869 0.01499608 -0.01964391  0.03913964      0 %
#> hypyes      18.128613035 1.05775401 16.05545326 20.20177281      0 %

# report beta coef
sum.pooled <- summary(pooled.estimates, digits = 2)
#> Multiple imputation results:
#>       MIcombine.default(fit.list)
#>             results    se (lower upper) missInfo
#> (Intercept) -1.7699 2.737  -7.13  3.595      0 %
#> chl          0.0097 0.015  -0.02  0.039      0 %
#> hypyes      18.1286 1.058  16.06 20.202      0 %
sum.pooled

Report OR

The pooled results are further processed to calculate and report odds ratios, which provide insights into the relationships between variables in the context of logistic regression.

sum.pooled.OR <- summary(pooled.estimates, logeffect=TRUE, digits = 2)
#> Multiple imputation results:
#>       MIcombine.default(fit.list)
#>             results      se  (lower  upper) missInfo
#> (Intercept) 1.7e-01 4.7e-01 8.0e-04 3.6e+01      0 %
#> chl         1.0e+00 1.5e-02 9.8e-01 1.0e+00      0 %
#> hypyes      7.5e+07 7.9e+07 9.4e+06 5.9e+08      0 %
sum.pooled.OR

Using publish package may be possible, but requires complicated process. Look for publish.MIresult (but this can be complicated).