NHANES: Subsetting
The tutorial demonstrates how to work with subset of complex survey data, specifically focusing on an NHANES example.
The required packages are loaded.
Load data
Survey data is loaded into the R environment.
Check missingness
A subset of variables is selected, and the presence of missing data is visualized.
A new variable is also created to categorize cholesterol levels as “healthy” or “unhealthy.”
Subsetting Complex Survey data
We are subsetting based on whether the subjects have missing observation (e.g., only retaining those with complete information). This is often an eligibility criteria in studies. In missing data analysis, we will learn more about more appropriate approaches.
dim(analytic.full.data)
#> [1] 9254 10
head(analytic.full.data$ID) # full data
#> [1] 93703 93704 93705 93706 93707 93708
analytic.complete.case.only <- as.data.frame(na.omit(analytic.full.data))
dim(analytic.complete.case.only)
#> [1] 6636 10
head(analytic.complete.case.only$ID) # complete case
#> [1] 93705 93706 93707 93708 93709 93711
head(analytic.full.data$ID[analytic.full.data$ID %in% analytic.complete.case.only$ID])
#> [1] 93705 93706 93707 93708 93709 93711
Below we show how to identify who has missing observations vs not based on full (analytic.full.data
) and complete case (analytic.complete.case.only
) data. See Heeringa et al (2010) book page 114 (section 4.5.3 “Preparation for Subclass analyses”) and also page 218 (section 7.5.4 “appropriate analysis: incorporating all Sample Design Features”). This is done for 2 reasons:
- full complex survey design structure is taken into account, so that variance estimation is done correctly. If one or more PSU were excluded because none of the complete cases were observed in those PSU, the sub-population (complete cases) will not have complete information of how many PSU were actually present in the original complex design. Then in the population, a reduced number of PSUs would be used to calculate variance (number of SPU is a component of the variance calculation formula, see equation (5.2) in Heeringa et al (2010) textbook. Same is true for strata.), and will result in a wrong/biased variance estimate. Also see West et al. doi: 10.1177/1536867X0800800404
- size of sub-population (here, those with complete cases) is recognized as a random variable; not just a fixed size.
table(analytic.full.data$miss)
#>
#> 0 1
#> 6636 2618
# IDs not in complete case data
head(analytic.full.data$ID[analytic.full.data$miss==1])
#> [1] 93703 93704 93710 93720 93724 93725
# IDs in complete case data
head(analytic.full.data$ID[analytic.full.data$miss==0])
#> [1] 93705 93706 93707 93708 93709 93711
Logistic regression on sub-population
A logistic regression model is run on the subset of data that has no missing values. Here, it distinguishes between correct and incorrect approaches to account for the complex survey design.
Wrong approach
Correct approach
Full model
fit <- svyglm(model.formula, family = quasibinomial,
design = w.design) # subset design
publish(fit)
#> Variable Units Coefficient CI.95 p-value
#> diabetes No Ref
#> Yes 0.38 [0.20;0.57] 0.0049202
#> gender Female Ref
#> Male 0.22 [0.03;0.40] 0.0568343
#> born Born in 50 US states or Washingt Ref
#> Others -0.66 [-0.84;-0.47] 0.0002304
#> Refused -12.26 [-13.65;-10.88] < 1e-04
#> race Black Ref
#> Hispanic 0.20 [-0.08;0.47] 0.2075536
#> Other -0.17 [-0.38;0.03] 0.1439474
#> White -0.37 [-0.66;-0.09] 0.0355030
#> bmi -0.04 [-0.05;-0.02] 0.0007697
Variable selection
Finally, we discuss variable selection methods. We employ backward elimination to determine which variables are significant predictors while retaining an important variable in the model. If unsure about usefulness of some (gender, born, race, bmi) variables in predicting the outcome, check via backward elimination while keeping important variable (diabetes, say, that has been established in the literature) in the model
model.formula <- as.formula("I(cholesterol.bin=='healthy')~
diabetes+gender+born+race+bmi")
scope <- list(upper = ~ diabetes+gender+born+race+bmi, lower = ~ diabetes)
fit <- svyglm(model.formula, design=w.design, # subset design
family=quasibinomial)
fitstep <- step(fit, scope = scope, trace = FALSE, direction = "backward")
publish(fitstep) # final model
#> Variable Units Coefficient CI.95 p-value
#> diabetes No Ref
#> Yes 0.38 [0.20;0.57] 0.0049202
#> gender Female Ref
#> Male 0.22 [0.03;0.40] 0.0568343
#> born Born in 50 US states or Washingt Ref
#> Others -0.66 [-0.84;-0.47] 0.0002304
#> Refused -12.26 [-13.65;-10.88] < 1e-04
#> race Black Ref
#> Hispanic 0.20 [-0.08;0.47] 0.2075536
#> Other -0.17 [-0.38;0.03] 0.1439474
#> White -0.37 [-0.66;-0.09] 0.0355030
#> bmi -0.04 [-0.05;-0.02] 0.0007697
Also see (Stata 2023) for further details.
Video content (optional)
For those who prefer a video walkthrough, feel free to watch the video below, which offers a description of an earlier version of the above content.