Exercise 1 Solution (D)

We will use the article. We will use the following article by Flegal et al. (2016).

We will reproduce some results from the article. The authors used NHANES 2013-14 dataset to create their main analytic dataset. The dataset contains 10,175 subjects with 12 relevant variables:

SEQN: Respondent sequence number
RIDAGEYR: Age in years at screening
RIAGENDR: Gender
DMDEDUC2: Education level
RIDRETH3: Race/ethnicity
RIDEXPRG: Pregnancy status at exam
WTINT2YR: Full sample 2 year weights
SDMVPSU: Masked variance pseudo-PSU
SDMVSTRA: Masked variance pseudo-stratum
BMXBMI: Body mass index in kg/m**2
SMQ020: Whether smoked at least 100 cigarettes in life
SMQ040: Current status of smoking (Do you now smoke cigarettes?)

Question 1: Creating data and table

1(a) Importing dataset

# load the dataset
load("Data/surveydata/Flegal2016.RData")
ls()
#> [1] "dat.full"
names(dat.full)
#>  [1] "SEQN"     "RIDAGEYR" "RIAGENDR" "DMDEDUC2" "RIDRETH3" "RIDEXPRG"
#>  [7] "WTINT2YR" "SDMVPSU"  "SDMVSTRA" "BMXBMI"   "SMQ020"   "SMQ040"

1(b) Subsetting according to eligibility

Subset the dataset according to the eligibility criteria described in the second paragraph of the Methods section.

Hint: The authors restricted their study to
- adults aged 20 years and more,
- non-missing body mass index, and
- non-pregnant.

Your analytic sample size should be 5,455, as described in the first sentence in the Results section.

# 20+
dat.analytic <- subset(dat.full, RIDAGEYR>=20) # N = 5,769

# Non-missing outcome
dat.analytic <- subset(dat.analytic, !is.na(BMXBMI)) # N = 5,520

# Non-pregnant
dat.analytic <- subset(dat.analytic, is.na(RIDEXPRG) | RIDEXPRG != 
                         "Yes, positive lab pregnancy test") # N = 5,455

dim(dat.analytic)
#> [1] 5455   12

1(c) Reproduce Table 1

Reproduce Table 1 of the article.

Hint 1: The authors reported unweighted frequencies, and thus, survey features should not be utilized to answer this question. Please be advised to order the categories as shown in the table. tableone package could be helpful.
Hint 2: the authors did not show the results for the Other race category. But in your table, you could include all race categories.

library(tableone)

dat <- dat.analytic

# Age
dat$age <- cut(dat$RIDAGEYR, c(20, 40, 60, Inf), right = FALSE)

# Gender
dat$gender <- dat$RIAGENDR

# Race/Hispanic origin group
dat$race <- dat$RIDRETH3
dat$race <- car::recode(dat$race, " 'Non-Hispanic White'='White'; 'Non-Hispanic Black'=
                        'Black'; 'Non-Hispanic Asian'='Asian'; c('Mexican American',
                        'Other Hispanic')='Hispanic'; 'Other Race - Including Multi-Rac'=
                        'Other'; else=NA", levels = c("White", "Black", "Asian",
                                                      "Hispanic", "Other"))

# Table 1: Overall 
tab11 <- CreateTableOne(vars = "age", strata = "race", data = dat, test = F, 
                        addOverall = T)

# Table 1: Male
tab12 <- CreateTableOne(vars = "age", strata = "race", test = F, addOverall = T, 
                        data = subset(dat, gender == "Male"))

# Table 1: Female
tab13 <- CreateTableOne(vars = "age", strata = "race", test = F, addOverall = T,
                        data = subset(dat, gender == "Female"))

# Reproducing Table 1
tab1a <- list(Overall = tab11, Male = tab12, Female = tab13)
print(tab1a, format = "f") # Showing only frequencies 
#> $Overall
#>              Stratified by race
#>               Overall White Black Asian Hispanic Other
#>   n           5455    2343  1115  623   1214     160  
#>   age                                                 
#>      [20,40)  1810     734   362  216    412      86  
#>      [40,60)  1896     759   383  251    449      54  
#>      [60,Inf) 1749     850   370  156    353      20  
#> 
#> $Male
#>              Stratified by race
#>               Overall White Black Asian Hispanic Other
#>   n           2638    1130  556   300   573      79   
#>   age                                                 
#>      [20,40)   909     386  182   106   189      46   
#>      [40,60)   897     360  179   120   215      23   
#>      [60,Inf)  832     384  195    74   169      10   
#> 
#> $Female
#>              Stratified by race
#>               Overall White Black Asian Hispanic Other
#>   n           2817    1213  559   323   641      81   
#>   age                                                 
#>      [20,40)   901     348  180   110   223      40   
#>      [40,60)   999     399  204   131   234      31   
#>      [60,Inf)  917     466  175    82   184      10

Question 2

2(a) Reproduce Table 1 with survey features

Not in this article but in many other articles, you would see n comes from the analytic sample and % comes from the survey design that accounts for survey features such as strata, clusters and survey weights. In Question 1, you see how n comes from the analytic sample. Your task for Question 2(a) is to create % part of the Table 1 with survey features, i.e., % should come from the survey design that accounts for strata, clusters and survey weights. You do not need to show the frequiencis but show only the percentages (for categorical variables).

Hints:

Subset the design, not the sample. For this step, you need to work with your full data. If you have generated a variable in your analytic dataset, that variable should also be present in the full dataset.
Generate age, gender, and race variable in your full data. Codes shown in Question 1 could be helpful.
Make the design on the full data and then subset the design.
Reproduce Table 1 with the design from the previous step. The svyCreateTableOne function could be a helpful function.

## Create all variables in the full data
# Age
dat.full$age <- cut(dat.full$RIDAGEYR, c(20, 40, 60, Inf), right = FALSE)

# Gender
dat.full$gender <- dat.full$RIAGENDR

# Race/Hispanic origin group
dat.full$race <- dat.full$RIDRETH3
dat.full$race <- car::recode(dat.full$race, " 'Non-Hispanic White'='White'; 
                             'Non-Hispanic Black'='Black'; 'Non-Hispanic Asian'='Asian'; 
                             c('Mexican American','Other Hispanic')='Hispanic'; 
                             'Other Race - Including Multi-Rac'='Other'; 
                             else=NA", levels = c("White", "Black", "Asian",
                                                  "Hispanic", "Other"))

# Survey features
dat.full$survey.weight <- dat.full$WTINT2YR
dat.full$psu <- dat.full$SDMVPSU
dat.full$strata <- dat.full$SDMVSTRA

# Subset the design
dat.full$miss <- 1
dat.full$miss[dat.full$SEQN %in% dat.analytic$SEQN] <- 0
svy.design0 <- svydesign(strata = ~strata, id = ~psu, weights = ~survey.weight,
                         data = dat.full, nest = TRUE)
svy.design <- subset(svy.design0, miss == 0)

# Table 1: Overall 
tab11 <- svyCreateTableOne(vars = "age", strata = "race", data = svy.design, 
                           test = F, addOverall = T)

# Table 1: Male
tab12 <- svyCreateTableOne(vars = "age", strata = "race", test = F, addOverall = T, 
                        data = subset(svy.design, gender == "Male"))

# Table 1: Female
tab13 <- svyCreateTableOne(vars = "age", strata = "race", test = F, addOverall = T,
                        data = subset(svy.design, gender == "Female"))

# Reproducing Table 1
tab1b <- list(Overall = tab11, Male = tab12, Female = tab13)
print(tab1b, format = "p") # Showing only percentages    
#> $Overall
#>              Stratified by race
#>               Overall     White       Black      Asian      Hispanic  
#>   n           217464332.1 143721046.2 24697511.9 11380679.8 31863752.6
#>   age (%)                                                             
#>      [20,40)         35.5        30.8       38.9       40.2       49.6
#>      [40,60)         37.5        37.5       39.3       38.5       35.6
#>      [60,Inf)        27.0        31.7       21.8       21.2       14.8
#>              Stratified by race
#>               Other    
#>   n           5801341.5
#>   age (%)              
#>      [20,40)       49.2
#>      [40,60)       38.4
#>      [60,Inf)      12.4
#> 
#> $Male
#>              Stratified by race
#>               Overall     White      Black      Asian     Hispanic   Other    
#>   n           105744090.8 70126957.2 11285409.5 5233819.5 15947523.0 3150381.6
#>   age (%)                                                                     
#>      [20,40)         37.2       32.3       41.3      41.7       51.6      52.9
#>      [40,60)         37.6       37.9       39.0      38.5       35.1      36.1
#>      [60,Inf)        25.1       29.8       19.7      19.8       13.2      11.0
#> 
#> $Female
#>              Stratified by race
#>               Overall     White      Black      Asian     Hispanic   Other    
#>   n           111720241.3 73594089.0 13412102.4 6146860.3 15916229.6 2650960.0
#>   age (%)                                                                     
#>      [20,40)         33.8       29.4       36.9      39.0       47.6      44.7
#>      [40,60)         37.4       37.1       39.5      38.6       36.0      41.2
#>      [60,Inf)        28.8       33.5       23.6      22.5       16.4      14.1

2(b) Reproduce Table 3

Reproduce the first column of Table 3 of the article (i.e., among men, explore the relationship between obesity and four predictors shown in the table).

If necessary, re-level or re-order the levels.
You need to generate obesity as BMI \(\ge 30 \text{ kg/m}^2\)
You need to generate smoking status and education. The unweighted frequencies should be matched with the frequencies in eTable 1 and eTable 2. Make sure these variables are in your full dataset as well.
Subset the design, not the sample.
Fit the model. Do not need to report the model summary.
The authors used SAS to produce the results vs. We are using R. The estimates could be slightly different (in second decimal point) from the estimates presented in Table 3, but they should be approximately similar.
You can use Publish or jtools package to report the odds ratios. Your odds ratios could be look like as follows:

# Obese
dat.full$obese <- ifelse(dat.full$BMXBMI >= 30, "Yes", "No")

# Smoking status
dat.full$smoking <- dat.full$SMQ020
dat.full$smoking <- car::recode(dat.full$smoking, " 'Yes'='Current smoker'; 
                                'No'='Never smoker'; else=NA",
                                levels = c("Never smoker", "Former smoker", 
                                           "Current smoker"))
dat.full$smoking[dat.full$SMQ040 == "Not at all"] <- "Former smoker"

# Education
dat.full$education <- dat.full$DMDEDUC2
dat.full$education <- car::recode(dat.full$education, " c('Some college or AA degree', 
                             'College graduate or above') = '>High school';
                             'High school graduate/GED or equi' = 'High school';
                             c('Less than 9th grade',
                             '9-11th grade (Includes 12th grad') = '<High school';
                             else = NA", levels = c("High school", "<High school",
                                                    ">High school"))

# Survey features
dat.full$survey.weight <- dat.full$WTINT2YR
dat.full$psu <- dat.full$SDMVPSU
dat.full$strata <- dat.full$SDMVSTRA

# Subset the design
dat.full$miss <- 1
dat.full$miss[dat.full$SEQN %in% dat.analytic$SEQN] <- 0
svy.design0 <- svydesign(strata = ~strata, id = ~psu, weights = ~survey.weight,
                         data = dat.full, nest = TRUE)
svy.design <- subset(svy.design0, miss == 0)

# Table 3 - column 1
fit.male <- svyglm(I(obese=="Yes") ~ age + race + smoking + education, 
                   family = binomial, design = subset(svy.design, gender == "Male"))
#> Warning in eval(family$initialize): non-integer #successes in a binomial glm!
publish(fit.male)
#>   Variable          Units OddsRatio       CI.95    p-value 
#>        age        [20,40)       Ref                        
#>                   [40,60)      1.28 [0.94;1.74]   0.173590 
#>                  [60,Inf)      1.20 [0.77;1.85]   0.458803 
#>       race          White       Ref                        
#>                     Black      1.24 [0.93;1.65]   0.202249 
#>                     Asian      0.27 [0.20;0.37]   0.000345 
#>                  Hispanic      1.22 [0.88;1.69]   0.293132 
#>                     Other      1.23 [0.67;2.23]   0.532168 
#>    smoking   Never smoker       Ref                        
#>             Former smoker      1.25 [0.96;1.64]   0.157176 
#>            Current smoker      0.71 [0.55;0.92]   0.047198 
#>  education    High school       Ref                        
#>              <High school      0.93 [0.67;1.29]   0.685577 
#>              >High school      0.97 [0.72;1.29]   0.821357

2(c) Model selection

From the literature, you know that age and race needs to be adjusted in the model, but you are not sure about smoking and education. Run an AIC based backward selection process to figure out whether you want to add smoking or education, or both in the final model in 2(b). What is your conclusion, i.e., which variables are selected/dropped [Expected answer: one short sentence]?

Hints:

Your design must be free from missing values. Even after applying eligibility criteria, you may have some missing values on multiple variables (see eTable 1 and eTable 2). This is especially important for model selection process.
You can create a complete case analytic dataset (i.e., dataset without missing values in obesity, four predictors, and survey features). Then create the design on the full data and subset the design for the complete case samples.
step function could be helpful.

# Obese
dat$obese <- ifelse(dat$BMXBMI >= 30, "Yes", "No")
dat$obese <- factor(dat$obese, levels = c("No", "Yes"))

# Smoking status
dat$smoking <- dat$SMQ020
dat$smoking <- car::recode(dat$smoking, " 'Yes'='Current smoker'; 
                                'No'='Never smoker'; else=NA",
                                levels = c("Never smoker", "Former smoker", 
                                           "Current smoker"))
dat$smoking[dat$SMQ040 == "Not at all"] <- "Former smoker"

# Education
dat$education <- dat$DMDEDUC2
dat$education <- car::recode(dat$education, " c('Some college or AA degree', 
                             'College graduate or above') = '>High school';
                             'High school graduate/GED or equi' = 'High school';
                             c('Less than 9th grade',
                             '9-11th grade (Includes 12th grad') = '<High school';
                             else = NA", levels = c("High school", "<High school",
                                                    ">High school"))

# Survey design
dat$survey.weight <- dat$WTINT2YR
dat$psu <- dat$SDMVPSU
dat$strata <- dat$SDMVSTRA

# Select only relevant variables
dat2 <- dat[,c("SEQN", "age", "race", "smoking", "education", 
                         "survey.weight", "psu", "strata")]

# Remove missing values
dat2 <- dat2[complete.cases(dat2),] # 5 missing values dropped

# Subset the design
dat.full$miss <- 1
dat.full$miss[dat.full$SEQN %in% dat2$SEQN] <- 0
svy.design0 <- svydesign(strata = ~strata, id = ~psu, weights = ~survey.weight,
                         data = dat.full, nest = TRUE)
svy.design <- subset(svy.design0, miss == 0)

# Initial model
fit1 <- svyglm(I(obese=="Yes") ~ age + race + smoking + education, 
                   family = binomial, design = subset(svy.design, gender == "Male"))
#> Warning in eval(family$initialize): non-integer #successes in a binomial glm!

scope <- list(upper = ~ age + race + smoking + education, 
              lower = ~ age + race)

# Final model: Backward elimination using the AIC criteria
fit2 <- step(fit1, scope = scope, trace = FALSE, k = 2, direction = "backward")
#> Warning in eval(family$initialize): non-integer #successes in a binomial glm!
#> Warning in eval(family$initialize): non-integer #successes in a binomial glm!
#> Warning in eval(family$initialize): non-integer #successes in a binomial glm!
#> Warning in eval(family$initialize): non-integer #successes in a binomial glm!
#> Warning in eval(family$initialize): non-integer #successes in a binomial glm!
#> Warning in eval(family$initialize): non-integer #successes in a binomial glm!
#> Warning in eval(family$initialize): non-integer #successes in a binomial glm!
#> Warning in eval(family$initialize): non-integer #successes in a binomial glm!

# Summary of the final model
publish(fit2)
#>  Variable          Units OddsRatio       CI.95   p-value 
#>       age        [20,40)       Ref                       
#>                  [40,60)      1.28 [0.94;1.73]   0.15579 
#>                 [60,Inf)      1.19 [0.76;1.86]   0.46397 
#>      race          White       Ref                       
#>                    Black      1.23 [0.92;1.65]   0.19682 
#>                    Asian      0.27 [0.20;0.37]   < 1e-04 
#>                 Hispanic      1.21 [0.90;1.61]   0.24507 
#>                    Other      1.23 [0.68;2.23]   0.52048 
#>   smoking   Never smoker       Ref                       
#>            Former smoker      1.25 [0.96;1.64]   0.14296 
#>           Current smoker      0.71 [0.54;0.92]   0.03895

The education variable is dropped from the final model.

2(d) Testing for interactions

Check whether the interaction between age and smoking should be added in the 2(b) model (yes or no answer required, along with the code and p-value):

# Model with interaction between age and smoking
fit2 <- update(fit.male, .~. + age:smoking)
#> Warning in eval(family$initialize): non-integer #successes in a binomial glm!
anova(fit.male, fit2)$p
#> Warning in eval(family$initialize): non-integer #successes in a binomial glm!
#> [1] 0.404668

No. We won’t keep the age and smoking interaction.