Exercise 1 solution (S)

We will use the article by Moon et al. (2021). We will reproduce some results from the article. The authors aggregated 4 NHANES cycles 2005-12 to create their analytic dataset. The full dataset contains 40,790 subjects with the following relevant variables for this exercise:

Survey information

Outcome variable

Exposure

Confounders and other variables

Two important warnings before we start:

Problem 1:

1(a) Importing dataset

load(file = "Data/propensityscore/Moon2021.RData")
ls()
#> [1] "dat.full"

1(b) Subsetting according to eligibility

# Age 20+
dat.analytic <- dat.full[complete.cases(dat.full$age),]

# Complete outcome and exposure information
dat.analytic <- dat.analytic[complete.cases(dat.analytic$cvd),] 
dat.analytic <- dat.analytic[complete.cases(dat.analytic$nocturia),] 

# Keep important variables only
vars <- c(
  # Survey features
  "SEQN", "strata", "psu", "survey.weight", 
  
  # Survey cycle
  "survey.cycle", 
  
  # Binary exposure
  "nocturia",
  
  # Outcome
  "cvd",
  
  # Covariates
  "age", "gender", "race" , "smoking", "alcohol", "sleep", "bmi", "diabetes", 
  "hypertension", "tcholesterol", "triglycerides", "hdl", "systolic", "diastolic")

dat.analytic <- dat.analytic[,vars]

# Complete case
dat.analytic <- na.omit(dat.analytic) #  N = 15,404 (numbers do not match with Fig 1)
dim(dat.analytic)
#> [1] 15404    21

1(c) Run the design-adjusted logistic regression

Create the first column of Table 2 of the article, i.e., explore the relationship between binary nocturia and CVD among adults aged 20 years and more. Adjust the model for the following covariates: age, gender, race, body mass index, smoking status, alcohol consumption, sleep duration, total cholesterol, triglycerides, HDL-cholesterol, hypertension, diabetes mellitus, and survey cycles.

Note:

  • The authors did not utilize the survey features (e.g., strata, psu, survey weights). But you should utilize the survey features to answer this question.

  • You must create your design on the full data and then subset the design.

  • Report the odds ratio with the 95% CI.

# Create an indicator variable in the full data
dat.full$indicator <- 1
dat.full$indicator[dat.full$SEQN %in% dat.analytic$SEQN] <- 0

# Design setup
svy.design0 <- svydesign(strata = ~strata, id = ~psu, weights = ~survey.weight,
                         data = dat.full, nest = TRUE)

# Subset the design
svy.design <- subset(svy.design0, indicator == 0)

# Design-adjusted logistic
fit.logit <- svyglm(I(cvd == "Yes") ~ nocturia + age + gender + race + bmi + 
                      smoking + alcohol + sleep + tcholesterol + triglycerides + 
                      hdl + hypertension + diabetes + survey.cycle, 
                    family = binomial, design = svy.design)
publish(fit.logit)
#>       Variable              Units OddsRatio         CI.95     p-value 
#>       nocturia                 <2       Ref                           
#>                                2+      1.44   [1.21;1.71]   0.0001496 
#>            age            [20,40)       Ref                           
#>                           [40,60)      4.21   [3.05;5.82]     < 1e-04 
#>                           [60,80)     11.46  [7.89;16.64]     < 1e-04 
#>                          [80,Inf)     25.28 [17.51;36.50]     < 1e-04 
#>         gender               Male       Ref                           
#>                            Female      0.68   [0.58;0.79]     < 1e-04 
#>           race          Hispanics       Ref                           
#>                Non-Hispanic White      1.32   [1.10;1.57]   0.0036168 
#>                Non-Hispanic Black      1.15   [0.92;1.44]   0.2362499 
#>                       Other races      1.55   [1.05;2.30]   0.0319116 
#>            bmi                         1.02   [1.01;1.03]   0.0003273 
#>        smoking                 No       Ref                           
#>                               Yes      1.74   [1.46;2.07]     < 1e-04 
#>        alcohol                 No       Ref                           
#>                               Yes      0.92   [0.59;1.45]   0.7273627 
#>          sleep                         0.96   [0.90;1.01]   0.1146287 
#>   tcholesterol                         0.99   [0.99;0.99]     < 1e-04 
#>  triglycerides                         1.00   [1.00;1.00]   0.4801803 
#>            hdl                         0.99   [0.98;1.00]   0.0416900 
#>   hypertension                 No       Ref                           
#>                               Yes      2.73   [2.27;3.29]     < 1e-04 
#>       diabetes                 No       Ref                           
#>                               Yes      1.83   [1.51;2.22]     < 1e-04 
#>   survey.cycle            2005-06       Ref                           
#>                           2007-08      0.84   [0.65;1.07]   0.1644272 
#>                           2009-10      0.91   [0.73;1.12]   0.3793696 
#>                           2011-11      0.82   [0.68;0.99]   0.0398975

Problem 2: Propensity score matching by DuGoff et al. (2014)

2(a): 1:1 matching

Create the second column of Table 2 (exploring the relationship between binary nocturia and CVD; the same exposure and outcome used in Problem 1) using the propensity score 1:1 matching analysis as per DuGoff et al. (2014) recommendations.

You should consider all four steps in the propensity score (PS) analysis:

  • Step 1: Fit the PS model by considering survey features as covariates. Other covariates for the PS model are the covariates used in 1(c).

  • Step 2: Match an exposed subject (nocturia \(\ge2\) times) with a control subject (nocturia <2 times) without replacement within the caliper of 0.2 times the standard deviation of the logit of PS. Set your seed to 123.

  • Step 3: Balance checking using SMD. Consider SMD <0.1 as a good covariate balancing.

  • Step 4: Fit the outcome model on the matched data. If needed, adjust for imbalanced covariates in the outcome model. Report the odds ratio with the 95% CI. You should utilize the survey feature as the design (NOT covariates).

Step 1

## Specify the PS model to estimate propensity scores
ps.formula <- as.formula(I(nocturia=="2+") ~ age + gender + race + bmi + smoking + 
                           alcohol + sleep + tcholesterol + triglycerides + hdl + 
                           hypertension + diabetes + survey.cycle + 
                           psu + strata + survey.weight)

## Fit the PS model
ps.fit <- glm(ps.formula, data = dat.analytic, family = binomial("logit"))
dat.analytic$PS <- predict(ps.fit, type = "response")
#summary(dat.analytic$PS)

## Caliper: 0.2*sd(logit of PS)
caliper <- 0.2*sd(log(dat.analytic$PS/(1-dat.analytic$PS)))
#caliper

Step 2

# Match exposed and unexposed subjects 
set.seed(123)
match.obj <- matchit(ps.formula, 
                     data = dat.analytic,
                     distance = dat.analytic$PS, 
                     method = "nearest", 
                     replace = FALSE,
                     caliper = caliper, 
                     ratio = 1)
dat.analytic$PS <- match.obj$distance
#summary(match.obj$distance)

## See how many matched
#match.obj

## Extract matched data
matched.data <- match.data(match.obj) 

Step 3

## Summary of PS by the exposure
#tapply(matched.data$distance, matched.data$nocturia, summary)

## Balance checking
vars <- c("age", "gender", "race", "bmi", "smoking", "alcohol", "sleep", "tcholesterol",
         "triglycerides", "hdl", "hypertension", "diabetes", "survey.cycle")


tab1m <- CreateTableOne(vars = vars, strata = "nocturia", data = matched.data, 
                           test = F, includeNA = F)
print(tab1m, smd = TRUE, format = "p") # All SMDs <0.1
#>                            Stratified by nocturia
#>                             <2              2+              SMD   
#>   n                           4502            4502                
#>   age (%)                                                    0.047
#>      [20,40)                  19.5            20.2                
#>      [40,60)                  32.9            31.7                
#>      [60,80)                  39.2            38.7                
#>      [80,Inf)                  8.4             9.5                
#>   gender = Female (%)         50.7            49.8           0.017
#>   race (%)                                                   0.020
#>      Hispanics                24.4            24.3                
#>      Non-Hispanic White       46.0            45.3                
#>      Non-Hispanic Black       25.4            26.0                
#>      Other races               4.2             4.4                
#>   bmi (mean (SD))            29.96 (7.07)    30.18 (7.06)    0.031
#>   smoking = Yes (%)           57.5            57.0           0.009
#>   alcohol = Yes (%)            4.0             3.5           0.022
#>   sleep (mean (SD))           6.77 (1.39)     6.73 (1.60)    0.024
#>   tcholesterol (mean (SD))  199.01 (42.48)  197.51 (44.94)   0.034
#>   triglycerides (mean (SD)) 163.55 (140.06) 164.00 (144.94)  0.003
#>   hdl (mean (SD))            53.44 (16.85)   53.20 (16.81)   0.014
#>   hypertension = Yes (%)      47.3            48.7           0.029
#>   diabetes = Yes (%)          15.3            17.4           0.058
#>   survey.cycle (%)                                           0.030
#>      2005-06                  22.8            22.0                
#>      2007-08                  27.5            27.3                
#>      2009-10                  28.2            28.0                
#>      2011-11                  21.6            22.8

Step 4

## Setup the design with survey features
dat.full$matched <- 0
dat.full$matched[dat.full$SEQN %in% matched.data$SEQN] <- 1

## Survey setup for full data 
w.design0 <- svydesign(strata = ~strata, id = ~psu, weights = ~survey.weight, 
                      data = dat.full, nest = TRUE)

## Subset matched data
w.design.m <- subset(w.design0, matched == 1)

## Outcome model
fit.psm <- svyglm(I(cvd == "Yes") ~ nocturia, design = w.design.m, family = binomial)
publish(fit.psm)
#>  Variable Units OddsRatio       CI.95     p-value 
#>  nocturia    <2       Ref                         
#>              2+      1.33 [1.14;1.57]   0.0007554

2(b): Interpretation

Compare your results with the results reported by the authors. [Expected answer: 1-2 sentences]

In the US population aged 20 years or more, the odds of CVD was 33% higher among adults with \(\ge 2\) times nocturia compared to PS matched controls with <2 nocturia. Unlike the results shown by the authors in Table 2, this odds ratio of 1.33 is a population-level estimate (i.e., the estimate is PATT).

Problem 3: Propensity score matching by Austin et al. (2018)

3(a): 1:4 matching

Repeat Problem 2(a), i.e., create the second column of Table 2 (exploring the relationship between binary nocturia and CVD), but using the propensity score 1:4 matching analysis as per Austin et al. (2018) recommendations.

You should consider all four steps in the propensity score (PS) analysis:

  • Step 1: Fit the PS model by considering survey features as design, i.e., fit the design-adjusted PS model. Other covariates for the PS model are the covariates used in 1(c).

  • Step 2: Match an exposed subject (nocturia \(\ge2\) times) with 4 control subjects (nocturia <2 times) with replacement within the caliper of 0.2 times the standard deviation of the logit of PS. Set your seed to 123.

  • Step 3: Balance checking using SMD. Consider SMD <0.1 as a good covariate balancing. Remember, you need to consider matching weights in checking the covariate balance.

  • Step 4: Fit the outcome model on the matched data. If needed, adjust for imbalanced covariates in the outcome model. Report the odds ratio with the 95% CI. You should utilize the survey feature as the design (NOT covariates).

Note:

  • For step 4, you need to multiply matching weights and survey weights when creating your design. After creating the design with the new weight, subset the design for the matched sample. This step is required to get survey-based estimates.

Step 1

## Specify the PS model to estimate propensity scores
ps.formula3 <- as.formula(I(nocturia=="2+") ~ age + gender + race + bmi + smoking + 
                           alcohol + sleep + tcholesterol + triglycerides + hdl + 
                           hypertension + diabetes + survey.cycle)

ps.design <- svydesign(id = ~psu, weights = ~survey.weight, strata = ~strata,
                       data = dat.analytic, nest = TRUE)

## Fit the PS model
ps.fit3 <- svyglm(ps.formula3, design = ps.design, family = binomial)
dat.analytic$PS3 <- predict(ps.fit3, type = "response")
#summary(dat.analytic$PS3)

## Caliper: 0.2*sd(logit of PS)
caliper3 <- 0.2*sd(log(dat.analytic$PS3/(1-dat.analytic$PS3)))
#caliper3

Step 2

# Match exposed and unexposed subjects 
set.seed(123)
match.obj3 <- matchit(ps.formula3, 
                      data = dat.analytic,
                      distance = dat.analytic$PS3, 
                      method = "nearest", 
                      replace = TRUE,
                      caliper = caliper3, 
                      ratio = 4)
dat.analytic$PS3 <- match.obj$distance
#summary(match.obj$distance)

## See how many matched
#match.obj3

## Extract matched data
matched.data3 <- match.data(match.obj3) 

Step 3

## Summary of PS by the exposure
#tapply(matched.data3$distance, matched.data3$nocturia, summary)

## Balance checking
vars <- c("age", "gender", "race", "bmi", "smoking", "alcohol", "sleep", "tcholesterol",
         "triglycerides", "hdl", "hypertension", "diabetes", "survey.cycle")

## Setup the design with survey features
dat.full$matched <- 0
dat.full$matched[dat.full$SEQN %in% matched.data3$SEQN] <- 1

## Merging PS weights in full data
dat.full$psweight <- 0
dat.full$psweight[dat.full$SEQN %in% matched.data3$SEQN] <- matched.data3$weights

## Survey setup for full data with ps weights - for balance checking
w.design0 <- svydesign(strata = ~strata, id = ~psu, weights = ~psweight, 
                      data = dat.full, nest = TRUE)

## Subset matched data
w.design.m <- subset(w.design0, matched == 1)

tab13m <- svyCreateTableOne(vars = vars, strata = "nocturia", data = w.design.m, 
                           test = F, includeNA = F)
print(tab13m, smd = TRUE, format = "p") # All SMDs <0.1
#>                            Stratified by nocturia
#>                             <2               2+               SMD   
#>   n                          7185.0           4544.0                
#>   age (%)                                                      0.053
#>      [20,40)                   18.4             20.0                
#>      [40,60)                   33.3             31.4                
#>      [60,80)                   39.3             38.9                
#>      [80,Inf)                   9.1              9.7                
#>   gender = Female (%)          51.8             49.9           0.038
#>   race (%)                                                     0.015
#>      Hispanics                 24.3             24.1                
#>      Non-Hispanic White        44.6             45.1                
#>      Non-Hispanic Black        26.9             26.5                
#>      Other races                4.2              4.4                
#>   bmi (mean (SD))             29.97 (7.09)     30.21 (7.08)    0.035
#>   smoking = Yes (%)            58.3             57.2           0.022
#>   alcohol = Yes (%)             4.0              3.6           0.023
#>   sleep (mean (SD))            6.73 (1.38)      6.73 (1.60)    0.004
#>   tcholesterol (mean (SD))   197.49 (42.18)   197.44 (45.02)   0.001
#>   triglycerides (mean (SD))  159.71 (129.75)  163.66 (144.57)  0.029
#>   hdl (mean (SD))             53.27 (16.71)    53.25 (16.82)   0.001
#>   hypertension = Yes (%)       47.6             49.2           0.031
#>   diabetes = Yes (%)           16.9             18.0           0.028
#>   survey.cycle (%)                                             0.016
#>      2005-06                   21.4             22.0                
#>      2007-08                   27.7             27.2                
#>      2009-10                   28.0             27.8                
#>      2011-11                   23.0             23.0

Step 4

## Survey setup for full data with new weight = survey weights * ps weights
w.design0 <- svydesign(strata = ~strata, id = ~psu, weights = ~survey.weight*psweight, 
                      data = dat.full, nest = TRUE)

## Subset matched data
w.design.m <- subset(w.design0, matched == 1)

## Outcome model
fit.psm <- svyglm(I(cvd == "Yes") ~ nocturia, design = w.design.m, family = binomial)
publish(fit.psm)
#>  Variable Units OddsRatio       CI.95     p-value 
#>  nocturia    <2       Ref                         
#>              2+      1.33 [1.14;1.54]   0.0005248

3(b): Interpretation

Compare the results with Problem 2. What’s the overall conclusion? [Expected answer: 2-3 sentences]

In this exercise, we observed that the association between nocturia and CVD is approximately the same using DuGoff et al.’s 1:1 matching and Austin et al.’s 1:4 matching. At the population-level, we observed 33% higher odds of CVD among adults with \(\ge 2\) times nocturia than PS matched controls with <2 nocturia.