Exercise 1 solution (S)
We will use the article by Moon et al. (2021). We will reproduce some results from the article. The authors aggregated 4 NHANES cycles 2005-12 to create their analytic dataset. The full dataset contains 40,790 subjects with the following relevant variables for this exercise:
Survey information
- SEQN: Respondent sequence number
- strata: Masked pseudo strata (strata is nested within PSU)
- psu: Masked pseudo PSU
- survey.weight: Full sample 8 year interview weight divided by 4
- survey.cycle: NHANES cycle
Outcome variable
- cvd: Cardiovascular disease
Exposure
- nocturia: Binary nocturia
Confounders and other variables
- age: Age in years at screening
- gender: Gender
- race: Race/Ethnicity
- smoking: 100+ cigarettes in life
- alcohol: Alcohol consumption (12+ drinks in 1 year)
- sleep: Sleep duration, h
- bmi: Body Mass Index in kg/m\(^2\)
- systolic: Systolic blood pressure, mmHg
- diastolic: Diastolic blood pressure, mmHg
- tcholesterol: Total cholesterol, mg/dl
- triglycerides: Triglycerides, mg/dl
- hdl: HDL‐cholesterol, mg/dl
- diabetes: Diabetes mellitus
- hypertension: Hypertension
Two important warnings before we start:
In this paper, there is insufficient information to create the analytic dataset. This is mainly because of not sufficiently defining the covariates and not explicitly explaining the inclusion/exclusion criteria.
The authors did not consider survey features. Since we will utilize survey features in our analysis, our results will likely be different than the results shown by the authors in Table 2.
Problem 1:
1(a) Importing dataset
1(b) Subsetting according to eligibility
# Age 20+
dat.analytic <- dat.full[complete.cases(dat.full$age),]
# Complete outcome and exposure information
dat.analytic <- dat.analytic[complete.cases(dat.analytic$cvd),]
dat.analytic <- dat.analytic[complete.cases(dat.analytic$nocturia),]
# Keep important variables only
vars <- c(
# Survey features
"SEQN", "strata", "psu", "survey.weight",
# Survey cycle
"survey.cycle",
# Binary exposure
"nocturia",
# Outcome
"cvd",
# Covariates
"age", "gender", "race" , "smoking", "alcohol", "sleep", "bmi", "diabetes",
"hypertension", "tcholesterol", "triglycerides", "hdl", "systolic", "diastolic")
dat.analytic <- dat.analytic[,vars]
# Complete case
dat.analytic <- na.omit(dat.analytic) # N = 15,404 (numbers do not match with Fig 1)
dim(dat.analytic)
#> [1] 15404 21
1(c) Run the design-adjusted logistic regression
Create the first column of Table 2 of the article, i.e., explore the relationship between binary nocturia and CVD among adults aged 20 years and more. Adjust the model for the following covariates: age, gender, race, body mass index, smoking status, alcohol consumption, sleep duration, total cholesterol, triglycerides, HDL-cholesterol, hypertension, diabetes mellitus, and survey cycles.
Note:
The authors did not utilize the survey features (e.g., strata, psu, survey weights). But you should utilize the survey features to answer this question.
You must create your design on the full data and then subset the design.
Report the odds ratio with the 95% CI.
# Create an indicator variable in the full data
dat.full$indicator <- 1
dat.full$indicator[dat.full$SEQN %in% dat.analytic$SEQN] <- 0
# Design setup
svy.design0 <- svydesign(strata = ~strata, id = ~psu, weights = ~survey.weight,
data = dat.full, nest = TRUE)
# Subset the design
svy.design <- subset(svy.design0, indicator == 0)
# Design-adjusted logistic
fit.logit <- svyglm(I(cvd == "Yes") ~ nocturia + age + gender + race + bmi +
smoking + alcohol + sleep + tcholesterol + triglycerides +
hdl + hypertension + diabetes + survey.cycle,
family = binomial, design = svy.design)
publish(fit.logit)
#> Variable Units OddsRatio CI.95 p-value
#> nocturia <2 Ref
#> 2+ 1.44 [1.21;1.71] 0.0001496
#> age [20,40) Ref
#> [40,60) 4.21 [3.05;5.82] < 1e-04
#> [60,80) 11.46 [7.89;16.64] < 1e-04
#> [80,Inf) 25.28 [17.51;36.50] < 1e-04
#> gender Male Ref
#> Female 0.68 [0.58;0.79] < 1e-04
#> race Hispanics Ref
#> Non-Hispanic White 1.32 [1.10;1.57] 0.0036168
#> Non-Hispanic Black 1.15 [0.92;1.44] 0.2362499
#> Other races 1.55 [1.05;2.30] 0.0319116
#> bmi 1.02 [1.01;1.03] 0.0003273
#> smoking No Ref
#> Yes 1.74 [1.46;2.07] < 1e-04
#> alcohol No Ref
#> Yes 0.92 [0.59;1.45] 0.7273627
#> sleep 0.96 [0.90;1.01] 0.1146287
#> tcholesterol 0.99 [0.99;0.99] < 1e-04
#> triglycerides 1.00 [1.00;1.00] 0.4801803
#> hdl 0.99 [0.98;1.00] 0.0416900
#> hypertension No Ref
#> Yes 2.73 [2.27;3.29] < 1e-04
#> diabetes No Ref
#> Yes 1.83 [1.51;2.22] < 1e-04
#> survey.cycle 2005-06 Ref
#> 2007-08 0.84 [0.65;1.07] 0.1644272
#> 2009-10 0.91 [0.73;1.12] 0.3793696
#> 2011-11 0.82 [0.68;0.99] 0.0398975
Problem 2: Propensity score matching by DuGoff et al. (2014)
2(a): 1:1 matching
Create the second column of Table 2 (exploring the relationship between binary nocturia and CVD; the same exposure and outcome used in Problem 1) using the propensity score 1:1 matching analysis as per DuGoff et al. (2014) recommendations.
You should consider all four steps in the propensity score (PS) analysis:
Step 1: Fit the PS model by considering survey features as covariates. Other covariates for the PS model are the covariates used in 1(c).
Step 2: Match an exposed subject (nocturia \(\ge2\) times) with a control subject (nocturia <2 times) without replacement within the caliper of 0.2 times the standard deviation of the logit of PS. Set your seed to 123.
Step 3: Balance checking using SMD. Consider SMD <0.1 as a good covariate balancing.
Step 4: Fit the outcome model on the matched data. If needed, adjust for imbalanced covariates in the outcome model. Report the odds ratio with the 95% CI. You should utilize the survey feature as the design (NOT covariates).
Step 1
## Specify the PS model to estimate propensity scores
ps.formula <- as.formula(I(nocturia=="2+") ~ age + gender + race + bmi + smoking +
alcohol + sleep + tcholesterol + triglycerides + hdl +
hypertension + diabetes + survey.cycle +
psu + strata + survey.weight)
## Fit the PS model
ps.fit <- glm(ps.formula, data = dat.analytic, family = binomial("logit"))
dat.analytic$PS <- predict(ps.fit, type = "response")
#summary(dat.analytic$PS)
## Caliper: 0.2*sd(logit of PS)
caliper <- 0.2*sd(log(dat.analytic$PS/(1-dat.analytic$PS)))
#caliper
Step 2
# Match exposed and unexposed subjects
set.seed(123)
match.obj <- matchit(ps.formula,
data = dat.analytic,
distance = dat.analytic$PS,
method = "nearest",
replace = FALSE,
caliper = caliper,
ratio = 1)
dat.analytic$PS <- match.obj$distance
#summary(match.obj$distance)
## See how many matched
#match.obj
## Extract matched data
matched.data <- match.data(match.obj)
Step 3
## Summary of PS by the exposure
#tapply(matched.data$distance, matched.data$nocturia, summary)
## Balance checking
vars <- c("age", "gender", "race", "bmi", "smoking", "alcohol", "sleep", "tcholesterol",
"triglycerides", "hdl", "hypertension", "diabetes", "survey.cycle")
tab1m <- CreateTableOne(vars = vars, strata = "nocturia", data = matched.data,
test = F, includeNA = F)
print(tab1m, smd = TRUE, format = "p") # All SMDs <0.1
#> Stratified by nocturia
#> <2 2+ SMD
#> n 4502 4502
#> age (%) 0.047
#> [20,40) 19.5 20.2
#> [40,60) 32.9 31.7
#> [60,80) 39.2 38.7
#> [80,Inf) 8.4 9.5
#> gender = Female (%) 50.7 49.8 0.017
#> race (%) 0.020
#> Hispanics 24.4 24.3
#> Non-Hispanic White 46.0 45.3
#> Non-Hispanic Black 25.4 26.0
#> Other races 4.2 4.4
#> bmi (mean (SD)) 29.96 (7.07) 30.18 (7.06) 0.031
#> smoking = Yes (%) 57.5 57.0 0.009
#> alcohol = Yes (%) 4.0 3.5 0.022
#> sleep (mean (SD)) 6.77 (1.39) 6.73 (1.60) 0.024
#> tcholesterol (mean (SD)) 199.01 (42.48) 197.51 (44.94) 0.034
#> triglycerides (mean (SD)) 163.55 (140.06) 164.00 (144.94) 0.003
#> hdl (mean (SD)) 53.44 (16.85) 53.20 (16.81) 0.014
#> hypertension = Yes (%) 47.3 48.7 0.029
#> diabetes = Yes (%) 15.3 17.4 0.058
#> survey.cycle (%) 0.030
#> 2005-06 22.8 22.0
#> 2007-08 27.5 27.3
#> 2009-10 28.2 28.0
#> 2011-11 21.6 22.8
Step 4
## Setup the design with survey features
dat.full$matched <- 0
dat.full$matched[dat.full$SEQN %in% matched.data$SEQN] <- 1
## Survey setup for full data
w.design0 <- svydesign(strata = ~strata, id = ~psu, weights = ~survey.weight,
data = dat.full, nest = TRUE)
## Subset matched data
w.design.m <- subset(w.design0, matched == 1)
## Outcome model
fit.psm <- svyglm(I(cvd == "Yes") ~ nocturia, design = w.design.m, family = binomial)
publish(fit.psm)
#> Variable Units OddsRatio CI.95 p-value
#> nocturia <2 Ref
#> 2+ 1.33 [1.14;1.57] 0.0007554
2(b): Interpretation
Compare your results with the results reported by the authors. [Expected answer: 1-2 sentences]
In the US population aged 20 years or more, the odds of CVD was 33% higher among adults with \(\ge 2\) times nocturia compared to PS matched controls with <2 nocturia. Unlike the results shown by the authors in Table 2, this odds ratio of 1.33 is a population-level estimate (i.e., the estimate is PATT).
Problem 3: Propensity score matching by Austin et al. (2018)
3(a): 1:4 matching
Repeat Problem 2(a), i.e., create the second column of Table 2 (exploring the relationship between binary nocturia and CVD), but using the propensity score 1:4 matching analysis as per Austin et al. (2018) recommendations.
You should consider all four steps in the propensity score (PS) analysis:
Step 1: Fit the PS model by considering survey features as design, i.e., fit the design-adjusted PS model. Other covariates for the PS model are the covariates used in 1(c).
Step 2: Match an exposed subject (nocturia \(\ge2\) times) with 4 control subjects (nocturia <2 times) with replacement within the caliper of 0.2 times the standard deviation of the logit of PS. Set your seed to 123.
Step 3: Balance checking using SMD. Consider SMD <0.1 as a good covariate balancing. Remember, you need to consider matching weights in checking the covariate balance.
Step 4: Fit the outcome model on the matched data. If needed, adjust for imbalanced covariates in the outcome model. Report the odds ratio with the 95% CI. You should utilize the survey feature as the design (NOT covariates).
Note:
- For step 4, you need to multiply matching weights and survey weights when creating your design. After creating the design with the new weight, subset the design for the matched sample. This step is required to get survey-based estimates.
Step 1
## Specify the PS model to estimate propensity scores
ps.formula3 <- as.formula(I(nocturia=="2+") ~ age + gender + race + bmi + smoking +
alcohol + sleep + tcholesterol + triglycerides + hdl +
hypertension + diabetes + survey.cycle)
ps.design <- svydesign(id = ~psu, weights = ~survey.weight, strata = ~strata,
data = dat.analytic, nest = TRUE)
## Fit the PS model
ps.fit3 <- svyglm(ps.formula3, design = ps.design, family = binomial)
dat.analytic$PS3 <- predict(ps.fit3, type = "response")
#summary(dat.analytic$PS3)
## Caliper: 0.2*sd(logit of PS)
caliper3 <- 0.2*sd(log(dat.analytic$PS3/(1-dat.analytic$PS3)))
#caliper3
Step 2
# Match exposed and unexposed subjects
set.seed(123)
match.obj3 <- matchit(ps.formula3,
data = dat.analytic,
distance = dat.analytic$PS3,
method = "nearest",
replace = TRUE,
caliper = caliper3,
ratio = 4)
dat.analytic$PS3 <- match.obj$distance
#summary(match.obj$distance)
## See how many matched
#match.obj3
## Extract matched data
matched.data3 <- match.data(match.obj3)
Step 3
## Summary of PS by the exposure
#tapply(matched.data3$distance, matched.data3$nocturia, summary)
## Balance checking
vars <- c("age", "gender", "race", "bmi", "smoking", "alcohol", "sleep", "tcholesterol",
"triglycerides", "hdl", "hypertension", "diabetes", "survey.cycle")
## Setup the design with survey features
dat.full$matched <- 0
dat.full$matched[dat.full$SEQN %in% matched.data3$SEQN] <- 1
## Merging PS weights in full data
dat.full$psweight <- 0
dat.full$psweight[dat.full$SEQN %in% matched.data3$SEQN] <- matched.data3$weights
## Survey setup for full data with ps weights - for balance checking
w.design0 <- svydesign(strata = ~strata, id = ~psu, weights = ~psweight,
data = dat.full, nest = TRUE)
## Subset matched data
w.design.m <- subset(w.design0, matched == 1)
tab13m <- svyCreateTableOne(vars = vars, strata = "nocturia", data = w.design.m,
test = F, includeNA = F)
print(tab13m, smd = TRUE, format = "p") # All SMDs <0.1
#> Stratified by nocturia
#> <2 2+ SMD
#> n 7185.0 4544.0
#> age (%) 0.053
#> [20,40) 18.4 20.0
#> [40,60) 33.3 31.4
#> [60,80) 39.3 38.9
#> [80,Inf) 9.1 9.7
#> gender = Female (%) 51.8 49.9 0.038
#> race (%) 0.015
#> Hispanics 24.3 24.1
#> Non-Hispanic White 44.6 45.1
#> Non-Hispanic Black 26.9 26.5
#> Other races 4.2 4.4
#> bmi (mean (SD)) 29.97 (7.09) 30.21 (7.08) 0.035
#> smoking = Yes (%) 58.3 57.2 0.022
#> alcohol = Yes (%) 4.0 3.6 0.023
#> sleep (mean (SD)) 6.73 (1.38) 6.73 (1.60) 0.004
#> tcholesterol (mean (SD)) 197.49 (42.18) 197.44 (45.02) 0.001
#> triglycerides (mean (SD)) 159.71 (129.75) 163.66 (144.57) 0.029
#> hdl (mean (SD)) 53.27 (16.71) 53.25 (16.82) 0.001
#> hypertension = Yes (%) 47.6 49.2 0.031
#> diabetes = Yes (%) 16.9 18.0 0.028
#> survey.cycle (%) 0.016
#> 2005-06 21.4 22.0
#> 2007-08 27.7 27.2
#> 2009-10 28.0 27.8
#> 2011-11 23.0 23.0
Step 4
## Survey setup for full data with new weight = survey weights * ps weights
w.design0 <- svydesign(strata = ~strata, id = ~psu, weights = ~survey.weight*psweight,
data = dat.full, nest = TRUE)
## Subset matched data
w.design.m <- subset(w.design0, matched == 1)
## Outcome model
fit.psm <- svyglm(I(cvd == "Yes") ~ nocturia, design = w.design.m, family = binomial)
publish(fit.psm)
#> Variable Units OddsRatio CI.95 p-value
#> nocturia <2 Ref
#> 2+ 1.33 [1.14;1.54] 0.0005248
3(b): Interpretation
Compare the results with Problem 2. What’s the overall conclusion? [Expected answer: 2-3 sentences]
In this exercise, we observed that the association between nocturia and CVD is approximately the same using DuGoff et al.’s 1:1 matching and Austin et al.’s 1:4 matching. At the population-level, we observed 33% higher odds of CVD among adults with \(\ge 2\) times nocturia than PS matched controls with <2 nocturia.