PS Weighting (US)

Propensity analysis problem

In this chapter, we will use propensity score weighting (using SMD cut-point 0.2, may adjust for imbalanced and/or all covariates in the outcome model, if any) analysis as per the following recommendations - (Zanutto 2006) - (DuGoff, Schuler, and Stuart 2014) - (Austin, Jembere, and Chiu 2018)

Dataset

  • The following modified NHANES dataset
    • NHANES15lab5.RData
  • use the data set “analytic.with.miss” within this file.
    • for obtaining the final treatment effect estimates, you can omit missing values, but only after creating the design (e.g., subset the design, not the data itself directly).
  • The same dataset was used for propensity score weighting

Variables

  • Outcome: diabetes
    • ‘No’ as the reference category
  • Exposure: bmi
    • convert to binary with >25 vs. <= 25,
    • with > 25 as the reference category
  • Confounder list:
    • gender
    • age
      • assume continuous
    • race
    • income
    • education
    • married
    • cholesterol
    • diastolicBP
    • systolicBP
  • Mediator:
    • physical.work
      • ‘No’ as the reference category
  • Survey features
    • psu
    • strata
    • weight

Pre-processing

Load data

load(file="Data/propensityscore/NHANES15lab5.RData")

Variable summary

# Full data
dat.full <- analytic.with.miss

# Exposure
dat.full$bmi <- with(dat.full, ifelse(bmi>25, "Overweight", 
                                      ifelse(bmi<=25, "Not overweight", NA)))
dat.full$bmi <- as.factor(dat.full$bmi)
dat.full$bmi <- relevel(dat.full$bmi, ref = "Overweight")

# Drop unnecessary variables 
dat.full$born <- NULL
dat.full$physical.work <- NULL

# Rename the weight variable into interview.weight
names(dat.full)[names(dat.full) == "weight"] <- "interview.weight"

Complete case data

We will use the complete case data to perform the analysis.

# Complete case data 
analytic.data <- dat.full[complete.cases(dat.full),]
dim(analytic.data)
#> [1] 6316   15

Reproducibility

set.seed(504)

Approach by Zanutto (2006)

Step 1

# Specify the PS model to estimate propensity scores
ps.formula <- as.formula(I(bmi=="Not overweight") ~ gender + age + race + income + education + 
                           married + cholesterol + diastolicBP + systolicBP)

Step 2

For the second step, we will calculate the both unstabilized and stabilized weights. However, stabilized inverse probability weight is often recommended to prevent from extreme weights (Hernán and Robins 2006).

# Propensity scores
ps.fit <- glm(ps.formula, data = analytic.data, family = binomial("logit"))
analytic.data$ps <- predict(ps.fit, type = "response", newdata = analytic.data)

# Unstabilized weight
analytic.data$usweight <- with(analytic.data, ifelse(I(bmi=="Not overweight"), 
                                                     1/ps, 1/(1-ps)))

# Unstabilized weight summary
round(summary(analytic.data$usweight), 2)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1.01    1.33    1.63    2.10    2.14   62.31

# Stabilized weight
analytic.data$sweight <- with(analytic.data, ifelse(I(bmi=="Not overweight"), 
                                                    mean(I(bmi=="Not overweight"))/ps, 
                                                    (1-mean(I(bmi=="Not overweight")))/(1-ps)))

# Stabilized weight summary
round(summary(analytic.data$sweight), 2)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    0.44    0.70    0.84    1.04    1.08   25.40

We can see that the mean of stabilized weights is 1, while it is approximately 2.1 for unstabilized weights. For both unstabilized and stabilized weights, it seems there are extreme weights, particularly for the unstabilized weights. Extreme weights could be dealt with weight truncation, typically truncated at the 1st and 99th percentiles (Cole and Hernán 2008).

Let us truncate the weights at the 1st and 99th percentiles

# Truncating unstabilized weight
analytic.data <- analytic.data %>% 
  mutate(usweight_t = pmin(pmax(usweight, quantile(usweight, 0.01)), 
                           quantile(usweight, 0.99)))
summary(analytic.data$usweight_t)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   1.055   1.335   1.629   2.020   2.140  10.912

# Truncating stabilized weight
analytic.data <- analytic.data %>% 
  mutate(sweight_t = pmin(pmax(sweight, quantile(sweight, 0.01)), 
                          quantile(sweight, 0.99)))
summary(analytic.data$sweight_t)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.4861  0.7035  0.8400  1.0044  1.0833  4.4901

Step 3

Now we will check the distribution of the covariates by the exposure status on the pseudo population in terms of pre-specified SMD.

# Covariates
vars <- c("gender", "age", "race", "income", "education", "married", "cholesterol", 
         "diastolicBP", "systolicBP")

# Design with truncated unstabilized weight
design.unstab <- svydesign(ids = ~ID, weights = ~usweight_t, data = analytic.data)

# Design with truncated stabilized weight
design.stab <- svydesign(ids = ~ID, weights = ~sweight_t, data = analytic.data)

# Balance checking with truncated unstabilized weight
tab.unstab <- svyCreateTableOne(vars = vars, strata = "bmi", data = design.unstab, test = F)
print(tab.unstab, smd = T)
#>                          Stratified by bmi
#>                           Overweight      Not overweight  SMD   
#>   n                        6199.3          6559.7               
#>   gender = Male (%)        2999.3 (48.4)   3126.6 (47.7)   0.014
#>   age (mean (SD))           48.38 (17.38)   49.83 (18.11)  0.082
#>   race (%)                                                 0.047
#>      Black                 1304.7 (21.0)   1407.8 (21.5)        
#>      Hispanic              1901.9 (30.7)   1873.7 (28.6)        
#>      Other                  956.4 (15.4)   1029.3 (15.7)        
#>      White                 2036.2 (32.8)   2249.0 (34.3)        
#>   income (%)                                               0.033
#>      <25k                  1664.3 (26.8)   1829.5 (27.9)        
#>      Between.25kto54k      1986.5 (32.0)   2133.5 (32.5)        
#>      Between.55kto99k      1449.1 (23.4)   1466.6 (22.4)        
#>      Over100k              1099.4 (17.7)   1130.0 (17.2)        
#>   education (%)                                            0.019
#>      College               3469.5 (56.0)   3610.4 (55.0)        
#>      High.School           2047.7 (33.0)   2217.9 (33.8)        
#>      School                 682.1 (11.0)    731.4 (11.1)        
#>   married (%)                                              0.040
#>      Married               3728.2 (60.1)   3853.6 (58.7)        
#>      Never.married         1101.9 (17.8)   1147.0 (17.5)        
#>      Previously.married    1369.2 (22.1)   1559.1 (23.8)        
#>   cholesterol (mean (SD))  181.58 (40.93)  183.13 (43.59)  0.037
#>   diastolicBP (mean (SD))   66.31 (14.64)   66.87 (14.78)  0.038
#>   systolicBP (mean (SD))   121.04 (16.61)  123.60 (23.62)  0.125

# Balance checking with truncated stabilized weight
tab.stab <- svyCreateTableOne(vars = vars, strata = "bmi", data = design.stab, test = F)
print(tab.stab, smd = T)
#>                          Stratified by bmi
#>                           Overweight      Not overweight  SMD   
#>   n                        3665.9          2677.9               
#>   gender = Male (%)        1771.7 (48.3)   1276.5 (47.7)   0.013
#>   age (mean (SD))           48.40 (17.38)   49.84 (18.12)  0.081
#>   race (%)                                                 0.048
#>      Black                  772.6 (21.1)    575.0 (21.5)        
#>      Hispanic              1126.3 (30.7)    764.6 (28.6)        
#>      Other                  561.1 (15.3)    420.5 (15.7)        
#>      White                 1206.0 (32.9)    917.9 (34.3)        
#>   income (%)                                               0.033
#>      <25k                   982.8 (26.8)    747.3 (27.9)        
#>      Between.25kto54k      1176.4 (32.1)    870.8 (32.5)        
#>      Between.55kto99k       857.6 (23.4)    598.5 (22.3)        
#>      Over100k               649.2 (17.7)    461.3 (17.2)        
#>   education (%)                                            0.019
#>      College               2051.4 (56.0)   1473.3 (55.0)        
#>      High.School           1210.5 (33.0)    905.9 (33.8)        
#>      School                 404.0 (11.0)    298.7 (11.2)        
#>   married (%)                                              0.040
#>      Married               2205.9 (60.2)   1572.9 (58.7)        
#>      Never.married          650.0 (17.7)    468.0 (17.5)        
#>      Previously.married     810.1 (22.1)    637.1 (23.8)        
#>   cholesterol (mean (SD))  181.62 (40.91)  183.15 (43.61)  0.036
#>   diastolicBP (mean (SD))   66.37 (14.53)   66.87 (14.81)  0.034
#>   systolicBP (mean (SD))   121.06 (16.58)  123.63 (23.65)  0.126

As we can see, all SMDs are less than our specified cut-point of 0.2, indicating that there is good covariate balancing. next, we will fit the outcome model on the pseudo population (i.e., weighted data). Note that we must utilize the survey feature as the design for the population-level estimate. For this step, we will multiply propensity score weight and survey weight and create a new weight variable.

Step 4 - with unstabilized weight

require(survey)
require(jtools)

# Create an indicator variable in the full dataset
dat.full$ind <- 0
dat.full$ind[dat.full$ID %in% analytic.data$ID] <- 1

# New weight = interview weight * unstabilized weight 
analytic.data$new.usweight_t <- with(analytic.data, interview.weight * usweight_t)
summary(analytic.data$new.usweight_t)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    6437   26567   42862   77768   88106 1831548

#  New weight variable in the full dataset
dat.full$new.usweight_t <- 0
dat.full$new.usweight_t[dat.full$ID %in% analytic.data$ID] <- 
  analytic.data$new.usweight_t
summary(dat.full$new.usweight_t)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>       0       0   24068   49261   55161 1831548

# Survey setup with full data 
w.design0 <- svydesign(id = ~psu, strata = ~strata, weights = ~new.usweight_t, 
                      data = dat.full, nest = TRUE)

# Subset the design for analytic sample
w.design.s <- subset(w.design0, ind == 1)

# Outcome model
out.formula <- as.formula(I(diabetes == "Yes") ~ bmi)
fit <- svyglm(out.formula, design = w.design.s, family = binomial("logit"))
summ(fit, exp = TRUE, confint = TRUE, digits = 3)
Observations 6316
Dependent variable I(diabetes == "Yes")
Type Survey-weighted generalized linear model
Family binomial
Link logit
Pseudo-R² (Cragg-Uhler) 0.062
Pseudo-R² (McFadden) 0.047
AIC 3371.907
exp(Est.) 2.5% 97.5% t val. p
(Intercept) 0.164 0.141 0.190 -24.213 0.000
bmiNot overweight 0.280 0.217 0.361 -9.814 0.000
Standard errors: Robust

Step 4 - with stabilized weight

Similarly, we can fit the outcome model with stabilized weights.

# Create an indicator variable in the full dataset
dat.full$ind <- 0
dat.full$ind[dat.full$ID %in% analytic.data$ID] <- 1

# New weight = interview weight * stabilized weight
analytic.data$new.sweight_t <- with(analytic.data, interview.weight * sweight_t)
summary(analytic.data$new.sweight_t)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    2985   13311   22494   39174   43896  753678

#  New weight variable in the full dataset
dat.full$new.sweight_t <- 0
dat.full$new.sweight_t[dat.full$ID %in% analytic.data$ID] <- analytic.data$new.sweight_t
summary(dat.full$new.sweight_t)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>       0       0   12115   24814   28232  753678

# Survey setup with full data 
w.design0 <- svydesign(id = ~psu, strata = ~strata, weights = ~new.sweight_t, 
                       data = dat.full, nest = TRUE)

# Subset the design for analytic sample
w.design.s2 <- subset(w.design0, ind == 1)

# Outcome model
out.formula <- as.formula(I(diabetes == "Yes") ~ bmi)
fit.stab <- svyglm(out.formula, design = w.design.s2, family = binomial("logit"))
summ(fit.stab, exp = TRUE, confint = TRUE, digits = 3)
Observations 6316
Dependent variable I(diabetes == "Yes")
Type Survey-weighted generalized linear model
Family binomial
Link logit
Pseudo-R² (Cragg-Uhler) 0.055
Pseudo-R² (McFadden) 0.041
AIC 3712.451
exp(Est.) 2.5% 97.5% t val. p
(Intercept) 0.164 0.141 0.190 -24.235 0.000
bmiNot overweight 0.280 0.217 0.361 -9.815 0.000
Standard errors: Robust

Double adjustment

library(survey)
# Outcome model with covariates adjustment
fit.DA <- svyglm(I(diabetes == "Yes") ~ bmi + gender + age + race + income + 
                 education + married + cholesterol + diastolicBP + systolicBP, 
               design = w.design.s2, family = binomial("logit"))
summ(fit.DA, exp = TRUE, confint = TRUE, digits = 3)
Observations 6316
Dependent variable I(diabetes == "Yes")
Type Survey-weighted generalized linear model
Family binomial
Link logit
Pseudo-R² (Cragg-Uhler) 0.194
Pseudo-R² (McFadden) 0.149
AIC 3331.238
exp(Est.) 2.5% 97.5% t val. p
(Intercept) 0.045 0.015 0.133 -5.608 NA
bmiNot overweight 0.234 0.175 0.312 -9.826 NA
genderMale 1.402 1.190 1.652 4.038 NA
age 1.050 1.041 1.058 11.793 NA
raceHispanic 0.790 0.600 1.040 -1.677 NA
raceOther 0.849 0.428 1.683 -0.470 NA
raceWhite 0.541 0.371 0.789 -3.192 NA
incomeBetween.25kto54k 0.723 0.478 1.093 -1.540 NA
incomeBetween.55kto99k 0.647 0.411 1.018 -1.883 NA
incomeOver100k 0.476 0.309 0.733 -3.371 NA
educationHigh.School 0.937 0.725 1.211 -0.498 NA
educationSchool 0.930 0.568 1.524 -0.286 NA
marriedNever.married 0.895 0.628 1.275 -0.615 NA
marriedPreviously.married 0.791 0.529 1.184 -1.137 NA
cholesterol 0.993 0.990 0.997 -3.760 NA
diastolicBP 1.007 0.999 1.014 1.744 NA
systolicBP 1.002 0.993 1.012 0.529 NA
Standard errors: Robust

# Log odds ratio with p-values
summary(fit.DA, df.resid = degf(w.design.s2))
#> 
#> Call:
#> svyglm(formula = I(diabetes == "Yes") ~ bmi + gender + age + 
#>     race + income + education + married + cholesterol + diastolicBP + 
#>     systolicBP, design = w.design.s2, family = binomial("logit"))
#> 
#> Survey design:
#> subset(w.design0, ind == 1)
#> 
#> Coefficients:
#>                            Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)               -3.095813   0.552073  -5.608 4.99e-05 ***
#> bmiNot overweight         -1.454514   0.148024  -9.826 6.29e-08 ***
#> genderMale                 0.338112   0.083733   4.038  0.00107 ** 
#> age                        0.048468   0.004110  11.793 5.48e-09 ***
#> raceHispanic              -0.235285   0.140287  -1.677  0.11422    
#> raceOther                 -0.164132   0.349277  -0.470  0.64517    
#> raceWhite                 -0.613643   0.192233  -3.192  0.00606 ** 
#> incomeBetween.25kto54k    -0.324911   0.211048  -1.540  0.14451    
#> incomeBetween.55kto99k    -0.435472   0.231288  -1.883  0.07927 .  
#> incomeOver100k            -0.742995   0.220399  -3.371  0.00420 ** 
#> educationHigh.School      -0.065157   0.130721  -0.498  0.62540    
#> educationSchool           -0.072034   0.251806  -0.286  0.77874    
#> marriedNever.married      -0.110932   0.180496  -0.615  0.54803    
#> marriedPreviously.married -0.233875   0.205646  -1.137  0.27327    
#> cholesterol               -0.006873   0.001828  -3.760  0.00189 ** 
#> diastolicBP                0.006728   0.003859   1.744  0.10169    
#> systolicBP                 0.002438   0.004605   0.529  0.60430    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Zero or negative residual df; p-values not defined
#> 
#> (Dispersion parameter for binomial family taken to be 0.9452878)
#> 
#> Number of Fisher Scoring iterations: 6
Tip

Double adjustment:

As explained in the previous chapter, double adjustment should be applied thoughtfully, with careful consideration of model specification, covariate selection, and underlying assumptions to ensure valid and reliable results. Always consider the specific context of the study and consult statistical guidelines or experts when applying advanced methods like double adjustment in propensity score analysis.

Approach by DuGoff et al. (2014)

Step 1

# Specify the PS model to estimate propensity scores
ps.formula2 <- as.formula(I(bmi == "Not overweight") ~ gender + age + race + income + education + 
                           married + cholesterol + diastolicBP + systolicBP + 
                           psu + strata + interview.weight)

Step 2

# Propensity scores
ps.fit2 <- glm(ps.formula2, data = analytic.data, family = binomial("logit"))
analytic.data$ps2 <- fitted(ps.fit2)

# Stabilized weight
analytic.data$sweight.dug <- with(analytic.data, ifelse(I(bmi=="Not overweight"), 
                                                    mean(I(bmi=="Not overweight"))/ps2, 
                                                    (1-mean(I(bmi=="Not overweight")))/(1-ps2)))
summary(analytic.data$sweight.dug)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.4349  0.6942  0.8296  1.0307  1.0859 23.0226

# Truncating stabilized weight
analytic.data <- analytic.data %>% 
  mutate(sweight.dug_t = pmin(pmax(sweight.dug, quantile(sweight.dug, 0.01)), 
                          quantile(sweight.dug, 0.99)))
summary(analytic.data$sweight.dug_t)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.4802  0.6942  0.8296  1.0019  1.0859  4.4041

Step 3

# Balance checking
cov2 <- c("gender", "age", "race", "income", "education", "married", "cholesterol", 
         "diastolicBP", "systolicBP")

# Design with truncated stabilized weight
design.stab <- svydesign(ids = ~ID, weights = ~sweight.dug_t, data = analytic.data)

# Balance checking with truncated stabilized weight
tab.stab2 <- svyCreateTableOne(vars = cov2, strata = "bmi", data = design.stab, test = F)
print(tab.stab2, smd = T)
#>                          Stratified by bmi
#>                           Overweight      Not overweight  SMD   
#>   n                        3652.9          2675.0               
#>   gender = Male (%)        1766.5 (48.4)   1276.8 (47.7)   0.013
#>   age (mean (SD))           48.47 (17.42)   49.80 (17.99)  0.075
#>   race (%)                                                 0.056
#>      Black                  769.1 (21.1)    568.7 (21.3)        
#>      Hispanic              1121.5 (30.7)    755.4 (28.2)        
#>      Other                  553.0 (15.1)    417.0 (15.6)        
#>      White                 1209.4 (33.1)    933.9 (34.9)        
#>   income (%)                                               0.023
#>      <25k                   982.1 (26.9)    734.0 (27.4)        
#>      Between.25kto54k      1172.1 (32.1)    871.1 (32.6)        
#>      Between.55kto99k       846.1 (23.2)    597.0 (22.3)        
#>      Over100k               652.6 (17.9)    472.9 (17.7)        
#>   education (%)                                            0.012
#>      College               2049.1 (56.1)   1484.2 (55.5)        
#>      High.School           1202.5 (32.9)    893.7 (33.4)        
#>      School                 401.3 (11.0)    297.1 (11.1)        
#>   married (%)                                              0.035
#>      Married               2201.3 (60.3)   1581.2 (59.1)        
#>      Never.married          647.4 (17.7)    465.6 (17.4)        
#>      Previously.married     804.2 (22.0)    628.2 (23.5)        
#>   cholesterol (mean (SD))  181.74 (40.94)  182.97 (43.25)  0.029
#>   diastolicBP (mean (SD))   66.46 (14.42)   66.88 (14.79)  0.029
#>   systolicBP (mean (SD))   121.14 (16.59)  123.39 (23.38)  0.111

All SMDs are less than our specified cut-point of 0.2.

Step 4

# Create an indicator variable in the full dataset
dat.full$ind <- 0
dat.full$ind[dat.full$ID %in% analytic.data$ID] <- 1

# New weight = interview weight * stabilized weight
analytic.data$new.sweight.dug_t <- with(analytic.data, interview.weight * sweight.dug_t)
summary(analytic.data$new.sweight.dug_t)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    2996   13343   22260   39436   43914  855145

#  New weight variable in the full dataset
dat.full$new.sweight.dug_t <- 0
dat.full$new.sweight.dug_t[dat.full$ID %in% analytic.data$ID] <- analytic.data$new.sweight.dug_t
summary(dat.full$new.sweight.dug_t)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>       0       0   12058   24980   28225  855145

# Survey setup with full data 
w.design0 <- svydesign(id = ~psu, strata = ~strata, weights = ~new.sweight.dug_t, 
                       data = dat.full, nest = TRUE)

# Subset the design for analytic sample
w.design.s2 <- subset(w.design0, ind == 1)

# Outcome model
out.formula <- as.formula(I(diabetes == "Yes") ~ bmi)
fit.stab.dug <- svyglm(out.formula, design = w.design.s2, family = binomial("logit"))
summ(fit.stab.dug, exp = TRUE, confint = TRUE, digits = 3)
Observations 6316
Dependent variable I(diabetes == "Yes")
Type Survey-weighted generalized linear model
Family binomial
Link logit
Pseudo-R² (Cragg-Uhler) 0.060
Pseudo-R² (McFadden) 0.045
AIC 3518.691
exp(Est.) 2.5% 97.5% t val. p
(Intercept) 0.161 0.140 0.185 -26.027 0.000
bmiNot overweight 0.272 0.203 0.364 -8.727 0.000
Standard errors: Robust

Double adjustment

# Outcome model with covariates adjustment
fit2.DA <- svyglm(I(diabetes == "Yes") ~ bmi + gender + age + race + income + 
                 education + married + cholesterol + diastolicBP + systolicBP, 
               design = w.design.s2, family = binomial("logit"))
summ(fit2.DA, exp = TRUE, confint = TRUE, digits = 3)
Observations 6316
Dependent variable I(diabetes == "Yes")
Type Survey-weighted generalized linear model
Family binomial
Link logit
Pseudo-R² (Cragg-Uhler) 0.196
Pseudo-R² (McFadden) 0.152
AIC 3154.781
exp(Est.) 2.5% 97.5% t val. p
(Intercept) 0.043 0.014 0.131 -5.537 NA
bmiNot overweight 0.235 0.169 0.325 -8.730 NA
genderMale 1.393 1.150 1.688 3.381 NA
age 1.049 1.042 1.056 13.564 NA
raceHispanic 0.813 0.624 1.061 -1.524 NA
raceOther 0.841 0.454 1.558 -0.549 NA
raceWhite 0.535 0.369 0.774 -3.316 NA
incomeBetween.25kto54k 0.720 0.492 1.053 -1.693 NA
incomeBetween.55kto99k 0.659 0.440 0.985 -2.035 NA
incomeOver100k 0.491 0.338 0.713 -3.735 NA
educationHigh.School 0.930 0.753 1.150 -0.667 NA
educationSchool 0.994 0.625 1.580 -0.026 NA
marriedNever.married 0.982 0.671 1.439 -0.092 NA
marriedPreviously.married 0.783 0.534 1.150 -1.245 NA
cholesterol 0.993 0.989 0.996 -3.913 NA
diastolicBP 1.006 0.998 1.013 1.426 NA
systolicBP 1.004 0.995 1.012 0.853 NA
Standard errors: Robust

# Log odds ratio with p-values
summary(fit2.DA, df.resid = degf(w.design.s2))
#> 
#> Call:
#> svyglm(formula = I(diabetes == "Yes") ~ bmi + gender + age + 
#>     race + income + education + married + cholesterol + diastolicBP + 
#>     systolicBP, design = w.design.s2, family = binomial("logit"))
#> 
#> Survey design:
#> subset(w.design0, ind == 1)
#> 
#> Coefficients:
#>                            Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)               -3.143943   0.567807  -5.537 5.70e-05 ***
#> bmiNot overweight         -1.450047   0.166099  -8.730 2.89e-07 ***
#> genderMale                 0.331561   0.098056   3.381  0.00411 ** 
#> age                        0.047790   0.003523  13.564 7.97e-10 ***
#> raceHispanic              -0.206657   0.135568  -1.524  0.14822    
#> raceOther                 -0.172590   0.314341  -0.549  0.59105    
#> raceWhite                 -0.625821   0.188746  -3.316  0.00471 ** 
#> incomeBetween.25kto54k    -0.328863   0.194223  -1.693  0.11107    
#> incomeBetween.55kto99k    -0.417687   0.205223  -2.035  0.05989 .  
#> incomeOver100k            -0.711500   0.190509  -3.735  0.00199 ** 
#> educationHigh.School      -0.072053   0.108040  -0.667  0.51496    
#> educationSchool           -0.006136   0.236588  -0.026  0.97965    
#> marriedNever.married      -0.017832   0.194660  -0.092  0.92822    
#> marriedPreviously.married -0.244058   0.196041  -1.245  0.23226    
#> cholesterol               -0.007067   0.001806  -3.913  0.00138 ** 
#> diastolicBP                0.005516   0.003869   1.426  0.17440    
#> systolicBP                 0.003739   0.004381   0.853  0.40682    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Zero or negative residual df; p-values not defined
#> 
#> (Dispersion parameter for binomial family taken to be 0.9373219)
#> 
#> Number of Fisher Scoring iterations: 6

Approach by Austin et al. (2018)

Step 1

# Specify the PS model to estimate propensity scores
ps.formula3 <- as.formula(I(bmi == "Not overweight") ~ gender + age + race + income + education + 
                           married + cholesterol + diastolicBP + systolicBP)

# Survey design
require(survey)
analytic.design <- svydesign(id = ~psu, weights = ~interview.weight, strata = ~strata,
                             data = analytic.data, nest = TRUE)

Step 2

# Propensity scores
ps.fit3 <- svyglm(ps.formula3, design = analytic.design, family = binomial("logit"))
analytic.data$ps3 <- fitted(ps.fit3)

# Stabilized weight
analytic.data$sweight.aus <- with(analytic.data, ifelse(I(bmi=="Not overweight"), 
                                                    mean(I(bmi=="Not overweight"))/ps3, 
                                                    (1-mean(I(bmi=="Not overweight")))/(1-ps3)))
summary(analytic.data$sweight.aus)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.4436  0.7138  0.8410  1.0392  1.0645 26.2776

# Truncating stabilized weight
analytic.data <- analytic.data %>% 
  mutate(sweight.aus_t = pmin(pmax(sweight.aus, quantile(sweight.aus, 0.01)), 
                          quantile(sweight.aus, 0.99)))
summary(analytic.data$sweight.aus_t)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.5198  0.7138  0.8410  1.0071  1.0645  4.8830

Step 3

# Balance checking
vars <- c("gender", "age", "race", "income", "education", "married", "cholesterol", 
          "diastolicBP", "systolicBP")

# Design with truncated stabilized weight
design.stab <- svydesign(ids = ~ID, weights = ~sweight.aus_t, data = analytic.data)

# Balance checking with truncated stabilized weight
tab.stab.aus <- svyCreateTableOne(vars = vars, strata = "bmi", data = design.stab, test = F)
print(tab.stab.aus, smd = T)
#>                          Stratified by bmi
#>                           Overweight      Not overweight  SMD   
#>   n                        3434.3          2926.8               
#>   gender = Male (%)        1609.8 (46.9)   1458.6 (49.8)   0.059
#>   age (mean (SD))           48.51 (17.32)   49.95 (18.14)  0.081
#>   race (%)                                                 0.099
#>      Black                  737.1 (21.5)    623.8 (21.3)        
#>      Hispanic              1085.0 (31.6)    826.0 (28.2)        
#>      Other                  471.2 (13.7)    489.7 (16.7)        
#>      White                 1141.0 (33.2)    987.3 (33.7)        
#>   income (%)                                               0.041
#>      <25k                   933.7 (27.2)    813.0 (27.8)        
#>      Between.25kto54k      1108.2 (32.3)    946.8 (32.3)        
#>      Between.55kto99k       776.7 (22.6)    685.4 (23.4)        
#>      Over100k               615.7 (17.9)    481.6 (16.5)        
#>   education (%)                                            0.039
#>      College               1908.2 (55.6)   1601.1 (54.7)        
#>      High.School           1129.9 (32.9)   1011.4 (34.6)        
#>      School                 396.2 (11.5)    314.3 (10.7)        
#>   married (%)                                              0.030
#>      Married               2063.9 (60.1)   1737.6 (59.4)        
#>      Never.married          605.8 (17.6)    501.4 (17.1)        
#>      Previously.married     764.6 (22.3)    687.8 (23.5)        
#>   cholesterol (mean (SD))  182.67 (41.11)  182.70 (43.38)  0.001
#>   diastolicBP (mean (SD))   66.76 (14.28)   66.85 (14.85)  0.006
#>   systolicBP (mean (SD))   121.39 (16.74)  123.98 (23.72)  0.126

All SMDs are less than our specified cut-point of 0.2.

Step 4

# Create an indicator variable in the full dataset
dat.full$ind <- 0
dat.full$ind[dat.full$ID %in% analytic.data$ID] <- 1

# New weight = interview weight * stabilized weight
analytic.data$new.sweight.aus_t <- with(analytic.data, interview.weight * sweight.aus_t)
summary(analytic.data$new.sweight.aus_t)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    3255   13604   22527   38953   43663  819622

#  New weight variable in the full dataset
dat.full$new.sweight.aus_t <- 0
dat.full$new.sweight.aus_t[dat.full$ID %in% analytic.data$ID] <- analytic.data$new.sweight.aus_t
summary(dat.full$new.sweight.aus_t)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>       0       0   12356   24675   28169  819622

# Survey setup with full data 
w.design0 <- svydesign(id = ~psu, strata = ~strata, weights = ~new.sweight.aus_t, 
                       data = dat.full, nest = TRUE)

# Subset the design for analytic sample
w.design.s2 <- subset(w.design0, ind == 1)

# Outcome model
out.formula <- as.formula(I(diabetes == "Yes") ~ bmi)
fit.stab.aus <- svyglm(out.formula, design = w.design.s2, family = binomial("logit"))
summ(fit.stab.aus, exp = TRUE, confint = TRUE, digits = 3)
Observations 6316
Dependent variable I(diabetes == "Yes")
Type Survey-weighted generalized linear model
Family binomial
Link logit
Pseudo-R² (Cragg-Uhler) 0.055
Pseudo-R² (McFadden) 0.041
AIC 3622.951
exp(Est.) 2.5% 97.5% t val. p
(Intercept) 0.162 0.140 0.187 -24.912 0.000
bmiNot overweight 0.291 0.225 0.377 -9.397 0.000
Standard errors: Robust

Double adjustment

# Outcome model with covariates adjustment
fit3.DA <- svyglm(I(diabetes == "Yes") ~ bmi + gender + age + race + income + 
                 education + married + cholesterol + diastolicBP + systolicBP, 
               design = w.design.s2, family = binomial("logit"))
summ(fit3.DA, exp = TRUE, confint = TRUE, digits = 3)
Observations 6316
Dependent variable I(diabetes == "Yes")
Type Survey-weighted generalized linear model
Family binomial
Link logit
Pseudo-R² (Cragg-Uhler) 0.193
Pseudo-R² (McFadden) 0.149
AIC 3247.730
exp(Est.) 2.5% 97.5% t val. p
(Intercept) 0.045 0.015 0.129 -5.735 NA
bmiNot overweight 0.233 0.175 0.311 -9.886 NA
genderMale 1.416 1.186 1.690 3.850 NA
age 1.049 1.041 1.057 12.066 NA
raceHispanic 0.773 0.585 1.022 -1.808 NA
raceOther 0.867 0.451 1.666 -0.428 NA
raceWhite 0.520 0.353 0.765 -3.316 NA
incomeBetween.25kto54k 0.700 0.461 1.062 -1.677 NA
incomeBetween.55kto99k 0.650 0.407 1.039 -1.801 NA
incomeOver100k 0.467 0.303 0.721 -3.437 NA
educationHigh.School 0.948 0.744 1.209 -0.428 NA
educationSchool 0.976 0.598 1.592 -0.099 NA
marriedNever.married 0.896 0.629 1.276 -0.609 NA
marriedPreviously.married 0.784 0.525 1.169 -1.195 NA
cholesterol 0.993 0.990 0.997 -3.653 NA
diastolicBP 1.005 0.998 1.013 1.412 NA
systolicBP 1.004 0.995 1.012 0.826 NA
Standard errors: Robust

# Log odds ratio with p-values
summary(fit3.DA, df.resid = degf(w.design.s2))
#> 
#> Call:
#> svyglm(formula = I(diabetes == "Yes") ~ bmi + gender + age + 
#>     race + income + education + married + cholesterol + diastolicBP + 
#>     systolicBP, design = w.design.s2, family = binomial("logit"))
#> 
#> Survey design:
#> subset(w.design0, ind == 1)
#> 
#> Coefficients:
#>                            Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)               -3.109181   0.542121  -5.735 3.94e-05 ***
#> bmiNot overweight         -1.456869   0.147369  -9.886 5.81e-08 ***
#> genderMale                 0.347655   0.090292   3.850  0.00157 ** 
#> age                        0.047457   0.003933  12.066 4.01e-09 ***
#> raceHispanic              -0.257364   0.142343  -1.808  0.09069 .  
#> raceOther                 -0.142755   0.333175  -0.428  0.67440    
#> raceWhite                 -0.653902   0.197190  -3.316  0.00470 ** 
#> incomeBetween.25kto54k    -0.357122   0.212996  -1.677  0.11432    
#> incomeBetween.55kto99k    -0.430289   0.238964  -1.801  0.09190 .  
#> incomeOver100k            -0.761311   0.221498  -3.437  0.00367 ** 
#> educationHigh.School      -0.052990   0.123816  -0.428  0.67475    
#> educationSchool           -0.024734   0.249884  -0.099  0.92246    
#> marriedNever.married      -0.109989   0.180532  -0.609  0.55148    
#> marriedPreviously.married -0.243725   0.203990  -1.195  0.25072    
#> cholesterol               -0.006605   0.001808  -3.653  0.00235 ** 
#> diastolicBP                0.005253   0.003721   1.412  0.17842    
#> systolicBP                 0.003618   0.004380   0.826  0.42172    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Zero or negative residual df; p-values not defined
#> 
#> (Dispersion parameter for binomial family taken to be 0.9416613)
#> 
#> Number of Fisher Scoring iterations: 6

References

Austin, Peter C, Nathaniel Jembere, and Maria Chiu. 2018. “Propensity Score Matching and Complex Surveys.” Statistical Methods in Medical Research 27 (4): 1240–57.
Cole, Stephen R, and Miguel A Hernán. 2008. “Constructing Inverse Probability Weights for Marginal Structural Models.” American Journal of Epidemiology 168 (6): 656–64.
DuGoff, Eva H, Megan Schuler, and Elizabeth A Stuart. 2014. “Generalizing Observational Study Results: Applying Propensity Score Methods to Complex Surveys.” Health Services Research 49 (1): 284–303.
Hernán, Miguel A, and James M Robins. 2006. “Estimating Causal Effects from Epidemiological Data.” Journal of Epidemiology and Community Health 60 (7): 578.
Zanutto, Elaine L. 2006. “A Comparison of Propensity Score and Linear Regression Analysis of Complex Survey Data.” Journal of Data Science 4 (1): 67–91.