PS Weighting (US)

Propensity analysis problem

In this chapter, we will use propensity score weighting (using SMD cut-point 0.2, may adjust for imbalanced and/or all covariates in the outcome model, if any) analysis as per the following recommendations - (Zanutto 2006) - (DuGoff, Schuler, and Stuart 2014) - (Austin, Jembere, and Chiu 2018)

Dataset

The following modified NHANES dataset
- NHANES15lab5.RData
use the data set “analytic.with.miss” within this file.
- for obtaining the final treatment effect estimates, you can omit missing values, but only after creating the design (e.g., subset the design, not the data itself directly).
The same dataset was used for propensity score weighting

Variables

Outcome: diabetes
- ‘No’ as the reference category
Exposure: bmi
- convert to binary with >25 vs. <= 25,
- with > 25 as the reference category
Confounder list:
- gender
- age
  - assume continuous
- race
- income
- education
- married
- cholesterol
- diastolicBP
- systolicBP
Mediator:
- physical.work
  - ‘No’ as the reference category
Survey features
- psu
- strata
- weight

Pre-processing

Load data

load(file="Data/propensityscore/NHANES15lab5.RData")

Variable summary

# Full data
dat.full <- analytic.with.miss

# Exposure
dat.full$bmi <- with(dat.full, ifelse(bmi>25, "Overweight", 
                                      ifelse(bmi<=25, "Not overweight", NA)))
dat.full$bmi <- as.factor(dat.full$bmi)
dat.full$bmi <- relevel(dat.full$bmi, ref = "Overweight")

# Drop unnecessary variables 
dat.full$born <- NULL
dat.full$physical.work <- NULL

# Rename the weight variable into interview.weight
names(dat.full)[names(dat.full) == "weight"] <- "interview.weight"

Complete case data

We will use the complete case data to perform the analysis.

# Complete case data 
analytic.data <- dat.full[complete.cases(dat.full),]
dim(analytic.data)
#> [1] 6316   15

Reproducibility

set.seed(504)

Approach by Zanutto (2006)

Step 1

# Specify the PS model to estimate propensity scores
ps.formula <- as.formula(I(bmi=="Not overweight") ~ gender + age + race + income + education + 
                           married + cholesterol + diastolicBP + systolicBP)

Step 2

For the second step, we will calculate the both unstabilized and stabilized weights. However, stabilized inverse probability weight is often recommended to prevent from extreme weights (Hernán and Robins 2006).

# Propensity scores
ps.fit <- glm(ps.formula, data = analytic.data, family = binomial("logit"))
analytic.data$ps <- predict(ps.fit, type = "response", newdata = analytic.data)

# Unstabilized weight
analytic.data$usweight <- with(analytic.data, ifelse(I(bmi=="Not overweight"), 
                                                     1/ps, 1/(1-ps)))

# Unstabilized weight summary
round(summary(analytic.data$usweight), 2)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1.01    1.33    1.63    2.10    2.14   62.31

# Stabilized weight
analytic.data$sweight <- with(analytic.data, ifelse(I(bmi=="Not overweight"), 
                                                    mean(I(bmi=="Not overweight"))/ps, 
                                                    (1-mean(I(bmi=="Not overweight")))/(1-ps)))

# Stabilized weight summary
round(summary(analytic.data$sweight), 2)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    0.44    0.70    0.84    1.04    1.08   25.40

We can see that the mean of stabilized weights is 1, while it is approximately 2.1 for unstabilized weights. For both unstabilized and stabilized weights, it seems there are extreme weights, particularly for the unstabilized weights. Extreme weights could be dealt with weight truncation, typically truncated at the 1st and 99th percentiles (Cole and Hernán 2008).

Let us truncate the weights at the 1st and 99th percentiles

# Truncating unstabilized weight
analytic.data <- analytic.data %>% 
  mutate(usweight_t = pmin(pmax(usweight, quantile(usweight, 0.01)), 
                           quantile(usweight, 0.99)))
summary(analytic.data$usweight_t)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   1.055   1.335   1.629   2.020   2.140  10.912

# Truncating stabilized weight
analytic.data <- analytic.data %>% 
  mutate(sweight_t = pmin(pmax(sweight, quantile(sweight, 0.01)), 
                          quantile(sweight, 0.99)))
summary(analytic.data$sweight_t)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.4861  0.7035  0.8400  1.0044  1.0833  4.4901

Step 3

Now we will check the distribution of the covariates by the exposure status on the pseudo population in terms of pre-specified SMD.

# Covariates
vars <- c("gender", "age", "race", "income", "education", "married", "cholesterol", 
         "diastolicBP", "systolicBP")

# Design with truncated unstabilized weight
design.unstab <- svydesign(ids = ~ID, weights = ~usweight_t, data = analytic.data)

# Design with truncated stabilized weight
design.stab <- svydesign(ids = ~ID, weights = ~sweight_t, data = analytic.data)

# Balance checking with truncated unstabilized weight
tab.unstab <- svyCreateTableOne(vars = vars, strata = "bmi", data = design.unstab, test = F)
print(tab.unstab, smd = T)
#>                          Stratified by bmi
#>                           Overweight      Not overweight  SMD   
#>   n                        6199.3          6559.7               
#>   gender = Male (%)        2999.3 (48.4)   3126.6 (47.7)   0.014
#>   age (mean (SD))           48.38 (17.38)   49.83 (18.11)  0.082
#>   race (%)                                                 0.047
#>      Black                 1304.7 (21.0)   1407.8 (21.5)        
#>      Hispanic              1901.9 (30.7)   1873.7 (28.6)        
#>      Other                  956.4 (15.4)   1029.3 (15.7)        
#>      White                 2036.2 (32.8)   2249.0 (34.3)        
#>   income (%)                                               0.033
#>      <25k                  1664.3 (26.8)   1829.5 (27.9)        
#>      Between.25kto54k      1986.5 (32.0)   2133.5 (32.5)        
#>      Between.55kto99k      1449.1 (23.4)   1466.6 (22.4)        
#>      Over100k              1099.4 (17.7)   1130.0 (17.2)        
#>   education (%)                                            0.019
#>      College               3469.5 (56.0)   3610.4 (55.0)        
#>      High.School           2047.7 (33.0)   2217.9 (33.8)        
#>      School                 682.1 (11.0)    731.4 (11.1)        
#>   married (%)                                              0.040
#>      Married               3728.2 (60.1)   3853.6 (58.7)        
#>      Never.married         1101.9 (17.8)   1147.0 (17.5)        
#>      Previously.married    1369.2 (22.1)   1559.1 (23.8)        
#>   cholesterol (mean (SD))  181.58 (40.93)  183.13 (43.59)  0.037
#>   diastolicBP (mean (SD))   66.31 (14.64)   66.87 (14.78)  0.038
#>   systolicBP (mean (SD))   121.04 (16.61)  123.60 (23.62)  0.125

# Balance checking with truncated stabilized weight
tab.stab <- svyCreateTableOne(vars = vars, strata = "bmi", data = design.stab, test = F)
print(tab.stab, smd = T)
#>                          Stratified by bmi
#>                           Overweight      Not overweight  SMD   
#>   n                        3665.9          2677.9               
#>   gender = Male (%)        1771.7 (48.3)   1276.5 (47.7)   0.013
#>   age (mean (SD))           48.40 (17.38)   49.84 (18.12)  0.081
#>   race (%)                                                 0.048
#>      Black                  772.6 (21.1)    575.0 (21.5)        
#>      Hispanic              1126.3 (30.7)    764.6 (28.6)        
#>      Other                  561.1 (15.3)    420.5 (15.7)        
#>      White                 1206.0 (32.9)    917.9 (34.3)        
#>   income (%)                                               0.033
#>      <25k                   982.8 (26.8)    747.3 (27.9)        
#>      Between.25kto54k      1176.4 (32.1)    870.8 (32.5)        
#>      Between.55kto99k       857.6 (23.4)    598.5 (22.3)        
#>      Over100k               649.2 (17.7)    461.3 (17.2)        
#>   education (%)                                            0.019
#>      College               2051.4 (56.0)   1473.3 (55.0)        
#>      High.School           1210.5 (33.0)    905.9 (33.8)        
#>      School                 404.0 (11.0)    298.7 (11.2)        
#>   married (%)                                              0.040
#>      Married               2205.9 (60.2)   1572.9 (58.7)        
#>      Never.married          650.0 (17.7)    468.0 (17.5)        
#>      Previously.married     810.1 (22.1)    637.1 (23.8)        
#>   cholesterol (mean (SD))  181.62 (40.91)  183.15 (43.61)  0.036
#>   diastolicBP (mean (SD))   66.37 (14.53)   66.87 (14.81)  0.034
#>   systolicBP (mean (SD))   121.06 (16.58)  123.63 (23.65)  0.126

As we can see, all SMDs are less than our specified cut-point of 0.2, indicating that there is good covariate balancing. next, we will fit the outcome model on the pseudo population (i.e., weighted data). Note that we must utilize the survey feature as the design for the population-level estimate. For this step, we will multiply propensity score weight and survey weight and create a new weight variable.

Step 4 - with unstabilized weight

require(survey)
require(jtools)

# Create an indicator variable in the full dataset
dat.full$ind <- 0
dat.full$ind[dat.full$ID %in% analytic.data$ID] <- 1

# New weight = interview weight * unstabilized weight 
analytic.data$new.usweight_t <- with(analytic.data, interview.weight * usweight_t)
summary(analytic.data$new.usweight_t)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    6437   26567   42862   77768   88106 1831548

#  New weight variable in the full dataset
dat.full$new.usweight_t <- 0
dat.full$new.usweight_t[dat.full$ID %in% analytic.data$ID] <- 
  analytic.data$new.usweight_t
summary(dat.full$new.usweight_t)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>       0       0   24068   49261   55161 1831548

# Survey setup with full data 
w.design0 <- svydesign(id = ~psu, strata = ~strata, weights = ~new.usweight_t, 
                      data = dat.full, nest = TRUE)

# Subset the design for analytic sample
w.design.s <- subset(w.design0, ind == 1)

# Outcome model
out.formula <- as.formula(I(diabetes == "Yes") ~ bmi)
fit <- svyglm(out.formula, design = w.design.s, family = binomial("logit"))
summ(fit, exp = TRUE, confint = TRUE, digits = 3)

Observations	6316
Dependent variable	I(diabetes == "Yes")
Type	Survey-weighted generalized linear model
Family	binomial
Link	logit

Pseudo-R² (Cragg-Uhler)	0.062
Pseudo-R² (McFadden)	0.047
AIC	3371.907

	exp(Est.)	2.5%	97.5%	t val.	p
(Intercept)	0.164	0.141	0.190	-24.213	0.000
bmiNot overweight	0.280	0.217	0.361	-9.814	0.000
Standard errors: Robust

Step 4 - with stabilized weight

Similarly, we can fit the outcome model with stabilized weights.

# Create an indicator variable in the full dataset
dat.full$ind <- 0
dat.full$ind[dat.full$ID %in% analytic.data$ID] <- 1

# New weight = interview weight * stabilized weight
analytic.data$new.sweight_t <- with(analytic.data, interview.weight * sweight_t)
summary(analytic.data$new.sweight_t)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    2985   13311   22494   39174   43896  753678

#  New weight variable in the full dataset
dat.full$new.sweight_t <- 0
dat.full$new.sweight_t[dat.full$ID %in% analytic.data$ID] <- analytic.data$new.sweight_t
summary(dat.full$new.sweight_t)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>       0       0   12115   24814   28232  753678

# Survey setup with full data 
w.design0 <- svydesign(id = ~psu, strata = ~strata, weights = ~new.sweight_t, 
                       data = dat.full, nest = TRUE)

# Subset the design for analytic sample
w.design.s2 <- subset(w.design0, ind == 1)

# Outcome model
out.formula <- as.formula(I(diabetes == "Yes") ~ bmi)
fit.stab <- svyglm(out.formula, design = w.design.s2, family = binomial("logit"))
summ(fit.stab, exp = TRUE, confint = TRUE, digits = 3)

Observations	6316
Dependent variable	I(diabetes == "Yes")
Type	Survey-weighted generalized linear model
Family	binomial
Link	logit

Pseudo-R² (Cragg-Uhler)	0.055
Pseudo-R² (McFadden)	0.041
AIC	3712.451

	exp(Est.)	2.5%	97.5%	t val.	p
(Intercept)	0.164	0.141	0.190	-24.235	0.000
bmiNot overweight	0.280	0.217	0.361	-9.815	0.000
Standard errors: Robust

Double adjustment

library(survey)
# Outcome model with covariates adjustment
fit.DA <- svyglm(I(diabetes == "Yes") ~ bmi + gender + age + race + income + 
                 education + married + cholesterol + diastolicBP + systolicBP, 
               design = w.design.s2, family = binomial("logit"))
summ(fit.DA, exp = TRUE, confint = TRUE, digits = 3)

Observations	6316
Dependent variable	I(diabetes == "Yes")
Type	Survey-weighted generalized linear model
Family	binomial
Link	logit

Pseudo-R² (Cragg-Uhler)	0.194
Pseudo-R² (McFadden)	0.149
AIC	3331.238

	exp(Est.)	2.5%	97.5%	t val.	p
(Intercept)	0.045	0.015	0.133	-5.608	NA
bmiNot overweight	0.234	0.175	0.312	-9.826	NA
genderMale	1.402	1.190	1.652	4.038	NA
age	1.050	1.041	1.058	11.793	NA
raceHispanic	0.790	0.600	1.040	-1.677	NA
raceOther	0.849	0.428	1.683	-0.470	NA
raceWhite	0.541	0.371	0.789	-3.192	NA
incomeBetween.25kto54k	0.723	0.478	1.093	-1.540	NA
incomeBetween.55kto99k	0.647	0.411	1.018	-1.883	NA
incomeOver100k	0.476	0.309	0.733	-3.371	NA
educationHigh.School	0.937	0.725	1.211	-0.498	NA
educationSchool	0.930	0.568	1.524	-0.286	NA
marriedNever.married	0.895	0.628	1.275	-0.615	NA
marriedPreviously.married	0.791	0.529	1.184	-1.137	NA
cholesterol	0.993	0.990	0.997	-3.760	NA
diastolicBP	1.007	0.999	1.014	1.744	NA
systolicBP	1.002	0.993	1.012	0.529	NA
Standard errors: Robust


# Log odds ratio with p-values
summary(fit.DA, df.resid = degf(w.design.s2))
#> 
#> Call:
#> svyglm(formula = I(diabetes == "Yes") ~ bmi + gender + age + 
#>     race + income + education + married + cholesterol + diastolicBP + 
#>     systolicBP, design = w.design.s2, family = binomial("logit"))
#> 
#> Survey design:
#> subset(w.design0, ind == 1)
#> 
#> Coefficients:
#>                            Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)               -3.095813   0.552073  -5.608 4.99e-05 ***
#> bmiNot overweight         -1.454514   0.148024  -9.826 6.29e-08 ***
#> genderMale                 0.338112   0.083733   4.038  0.00107 ** 
#> age                        0.048468   0.004110  11.793 5.48e-09 ***
#> raceHispanic              -0.235285   0.140287  -1.677  0.11422    
#> raceOther                 -0.164132   0.349277  -0.470  0.64517    
#> raceWhite                 -0.613643   0.192233  -3.192  0.00606 ** 
#> incomeBetween.25kto54k    -0.324911   0.211048  -1.540  0.14451    
#> incomeBetween.55kto99k    -0.435472   0.231288  -1.883  0.07927 .  
#> incomeOver100k            -0.742995   0.220399  -3.371  0.00420 ** 
#> educationHigh.School      -0.065157   0.130721  -0.498  0.62540    
#> educationSchool           -0.072034   0.251806  -0.286  0.77874    
#> marriedNever.married      -0.110932   0.180496  -0.615  0.54803    
#> marriedPreviously.married -0.233875   0.205646  -1.137  0.27327    
#> cholesterol               -0.006873   0.001828  -3.760  0.00189 ** 
#> diastolicBP                0.006728   0.003859   1.744  0.10169    
#> systolicBP                 0.002438   0.004605   0.529  0.60430    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Zero or negative residual df; p-values not defined
#> 
#> (Dispersion parameter for binomial family taken to be 0.9452878)
#> 
#> Number of Fisher Scoring iterations: 6

Tip

Double adjustment:

As explained in the previous chapter, double adjustment should be applied thoughtfully, with careful consideration of model specification, covariate selection, and underlying assumptions to ensure valid and reliable results. Always consider the specific context of the study and consult statistical guidelines or experts when applying advanced methods like double adjustment in propensity score analysis.

Approach by DuGoff et al. (2014)

Step 1

# Specify the PS model to estimate propensity scores
ps.formula2 <- as.formula(I(bmi == "Not overweight") ~ gender + age + race + income + education + 
                           married + cholesterol + diastolicBP + systolicBP + 
                           psu + strata + interview.weight)

Step 2

# Propensity scores
ps.fit2 <- glm(ps.formula2, data = analytic.data, family = binomial("logit"))
analytic.data$ps2 <- fitted(ps.fit2)

# Stabilized weight
analytic.data$sweight.dug <- with(analytic.data, ifelse(I(bmi=="Not overweight"), 
                                                    mean(I(bmi=="Not overweight"))/ps2, 
                                                    (1-mean(I(bmi=="Not overweight")))/(1-ps2)))
summary(analytic.data$sweight.dug)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.4349  0.6942  0.8296  1.0307  1.0859 23.0226

# Truncating stabilized weight
analytic.data <- analytic.data %>% 
  mutate(sweight.dug_t = pmin(pmax(sweight.dug, quantile(sweight.dug, 0.01)), 
                          quantile(sweight.dug, 0.99)))
summary(analytic.data$sweight.dug_t)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.4802  0.6942  0.8296  1.0019  1.0859  4.4041

Step 3

# Balance checking
cov2 <- c("gender", "age", "race", "income", "education", "married", "cholesterol", 
         "diastolicBP", "systolicBP")

# Design with truncated stabilized weight
design.stab <- svydesign(ids = ~ID, weights = ~sweight.dug_t, data = analytic.data)

# Balance checking with truncated stabilized weight
tab.stab2 <- svyCreateTableOne(vars = cov2, strata = "bmi", data = design.stab, test = F)
print(tab.stab2, smd = T)
#>                          Stratified by bmi
#>                           Overweight      Not overweight  SMD   
#>   n                        3652.9          2675.0               
#>   gender = Male (%)        1766.5 (48.4)   1276.8 (47.7)   0.013
#>   age (mean (SD))           48.47 (17.42)   49.80 (17.99)  0.075
#>   race (%)                                                 0.056
#>      Black                  769.1 (21.1)    568.7 (21.3)        
#>      Hispanic              1121.5 (30.7)    755.4 (28.2)        
#>      Other                  553.0 (15.1)    417.0 (15.6)        
#>      White                 1209.4 (33.1)    933.9 (34.9)        
#>   income (%)                                               0.023
#>      <25k                   982.1 (26.9)    734.0 (27.4)        
#>      Between.25kto54k      1172.1 (32.1)    871.1 (32.6)        
#>      Between.55kto99k       846.1 (23.2)    597.0 (22.3)        
#>      Over100k               652.6 (17.9)    472.9 (17.7)        
#>   education (%)                                            0.012
#>      College               2049.1 (56.1)   1484.2 (55.5)        
#>      High.School           1202.5 (32.9)    893.7 (33.4)        
#>      School                 401.3 (11.0)    297.1 (11.1)        
#>   married (%)                                              0.035
#>      Married               2201.3 (60.3)   1581.2 (59.1)        
#>      Never.married          647.4 (17.7)    465.6 (17.4)        
#>      Previously.married     804.2 (22.0)    628.2 (23.5)        
#>   cholesterol (mean (SD))  181.74 (40.94)  182.97 (43.25)  0.029
#>   diastolicBP (mean (SD))   66.46 (14.42)   66.88 (14.79)  0.029
#>   systolicBP (mean (SD))   121.14 (16.59)  123.39 (23.38)  0.111

All SMDs are less than our specified cut-point of 0.2.

Step 4

# Create an indicator variable in the full dataset
dat.full$ind <- 0
dat.full$ind[dat.full$ID %in% analytic.data$ID] <- 1

# New weight = interview weight * stabilized weight
analytic.data$new.sweight.dug_t <- with(analytic.data, interview.weight * sweight.dug_t)
summary(analytic.data$new.sweight.dug_t)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    2996   13343   22260   39436   43914  855145

#  New weight variable in the full dataset
dat.full$new.sweight.dug_t <- 0
dat.full$new.sweight.dug_t[dat.full$ID %in% analytic.data$ID] <- analytic.data$new.sweight.dug_t
summary(dat.full$new.sweight.dug_t)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>       0       0   12058   24980   28225  855145

# Survey setup with full data 
w.design0 <- svydesign(id = ~psu, strata = ~strata, weights = ~new.sweight.dug_t, 
                       data = dat.full, nest = TRUE)

# Subset the design for analytic sample
w.design.s2 <- subset(w.design0, ind == 1)

# Outcome model
out.formula <- as.formula(I(diabetes == "Yes") ~ bmi)
fit.stab.dug <- svyglm(out.formula, design = w.design.s2, family = binomial("logit"))
summ(fit.stab.dug, exp = TRUE, confint = TRUE, digits = 3)

Observations	6316
Dependent variable	I(diabetes == "Yes")
Type	Survey-weighted generalized linear model
Family	binomial
Link	logit

Pseudo-R² (Cragg-Uhler)	0.060
Pseudo-R² (McFadden)	0.045
AIC	3518.691

	exp(Est.)	2.5%	97.5%	t val.	p
(Intercept)	0.161	0.140	0.185	-26.027	0.000
bmiNot overweight	0.272	0.203	0.364	-8.727	0.000
Standard errors: Robust

Double adjustment

# Outcome model with covariates adjustment
fit2.DA <- svyglm(I(diabetes == "Yes") ~ bmi + gender + age + race + income + 
                 education + married + cholesterol + diastolicBP + systolicBP, 
               design = w.design.s2, family = binomial("logit"))
summ(fit2.DA, exp = TRUE, confint = TRUE, digits = 3)

Observations	6316
Dependent variable	I(diabetes == "Yes")
Type	Survey-weighted generalized linear model
Family	binomial
Link	logit

Pseudo-R² (Cragg-Uhler)	0.196
Pseudo-R² (McFadden)	0.152
AIC	3154.781

	exp(Est.)	2.5%	97.5%	t val.	p
(Intercept)	0.043	0.014	0.131	-5.537	NA
bmiNot overweight	0.235	0.169	0.325	-8.730	NA
genderMale	1.393	1.150	1.688	3.381	NA
age	1.049	1.042	1.056	13.564	NA
raceHispanic	0.813	0.624	1.061	-1.524	NA
raceOther	0.841	0.454	1.558	-0.549	NA
raceWhite	0.535	0.369	0.774	-3.316	NA
incomeBetween.25kto54k	0.720	0.492	1.053	-1.693	NA
incomeBetween.55kto99k	0.659	0.440	0.985	-2.035	NA
incomeOver100k	0.491	0.338	0.713	-3.735	NA
educationHigh.School	0.930	0.753	1.150	-0.667	NA
educationSchool	0.994	0.625	1.580	-0.026	NA
marriedNever.married	0.982	0.671	1.439	-0.092	NA
marriedPreviously.married	0.783	0.534	1.150	-1.245	NA
cholesterol	0.993	0.989	0.996	-3.913	NA
diastolicBP	1.006	0.998	1.013	1.426	NA
systolicBP	1.004	0.995	1.012	0.853	NA
Standard errors: Robust


# Log odds ratio with p-values
summary(fit2.DA, df.resid = degf(w.design.s2))
#> 
#> Call:
#> svyglm(formula = I(diabetes == "Yes") ~ bmi + gender + age + 
#>     race + income + education + married + cholesterol + diastolicBP + 
#>     systolicBP, design = w.design.s2, family = binomial("logit"))
#> 
#> Survey design:
#> subset(w.design0, ind == 1)
#> 
#> Coefficients:
#>                            Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)               -3.143943   0.567807  -5.537 5.70e-05 ***
#> bmiNot overweight         -1.450047   0.166099  -8.730 2.89e-07 ***
#> genderMale                 0.331561   0.098056   3.381  0.00411 ** 
#> age                        0.047790   0.003523  13.564 7.97e-10 ***
#> raceHispanic              -0.206657   0.135568  -1.524  0.14822    
#> raceOther                 -0.172590   0.314341  -0.549  0.59105    
#> raceWhite                 -0.625821   0.188746  -3.316  0.00471 ** 
#> incomeBetween.25kto54k    -0.328863   0.194223  -1.693  0.11107    
#> incomeBetween.55kto99k    -0.417687   0.205223  -2.035  0.05989 .  
#> incomeOver100k            -0.711500   0.190509  -3.735  0.00199 ** 
#> educationHigh.School      -0.072053   0.108040  -0.667  0.51496    
#> educationSchool           -0.006136   0.236588  -0.026  0.97965    
#> marriedNever.married      -0.017832   0.194660  -0.092  0.92822    
#> marriedPreviously.married -0.244058   0.196041  -1.245  0.23226    
#> cholesterol               -0.007067   0.001806  -3.913  0.00138 ** 
#> diastolicBP                0.005516   0.003869   1.426  0.17440    
#> systolicBP                 0.003739   0.004381   0.853  0.40682    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Zero or negative residual df; p-values not defined
#> 
#> (Dispersion parameter for binomial family taken to be 0.9373219)
#> 
#> Number of Fisher Scoring iterations: 6

Approach by Austin et al. (2018)

Step 1

# Specify the PS model to estimate propensity scores
ps.formula3 <- as.formula(I(bmi == "Not overweight") ~ gender + age + race + income + education + 
                           married + cholesterol + diastolicBP + systolicBP)

# Survey design
require(survey)
analytic.design <- svydesign(id = ~psu, weights = ~interview.weight, strata = ~strata,
                             data = analytic.data, nest = TRUE)

Step 2

# Propensity scores
ps.fit3 <- svyglm(ps.formula3, design = analytic.design, family = binomial("logit"))
analytic.data$ps3 <- fitted(ps.fit3)

# Stabilized weight
analytic.data$sweight.aus <- with(analytic.data, ifelse(I(bmi=="Not overweight"), 
                                                    mean(I(bmi=="Not overweight"))/ps3, 
                                                    (1-mean(I(bmi=="Not overweight")))/(1-ps3)))
summary(analytic.data$sweight.aus)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.4436  0.7138  0.8410  1.0392  1.0645 26.2776

# Truncating stabilized weight
analytic.data <- analytic.data %>% 
  mutate(sweight.aus_t = pmin(pmax(sweight.aus, quantile(sweight.aus, 0.01)), 
                          quantile(sweight.aus, 0.99)))
summary(analytic.data$sweight.aus_t)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.5198  0.7138  0.8410  1.0071  1.0645  4.8830

Step 3

# Balance checking
vars <- c("gender", "age", "race", "income", "education", "married", "cholesterol", 
          "diastolicBP", "systolicBP")

# Design with truncated stabilized weight
design.stab <- svydesign(ids = ~ID, weights = ~sweight.aus_t, data = analytic.data)

# Balance checking with truncated stabilized weight
tab.stab.aus <- svyCreateTableOne(vars = vars, strata = "bmi", data = design.stab, test = F)
print(tab.stab.aus, smd = T)
#>                          Stratified by bmi
#>                           Overweight      Not overweight  SMD   
#>   n                        3434.3          2926.8               
#>   gender = Male (%)        1609.8 (46.9)   1458.6 (49.8)   0.059
#>   age (mean (SD))           48.51 (17.32)   49.95 (18.14)  0.081
#>   race (%)                                                 0.099
#>      Black                  737.1 (21.5)    623.8 (21.3)        
#>      Hispanic              1085.0 (31.6)    826.0 (28.2)        
#>      Other                  471.2 (13.7)    489.7 (16.7)        
#>      White                 1141.0 (33.2)    987.3 (33.7)        
#>   income (%)                                               0.041
#>      <25k                   933.7 (27.2)    813.0 (27.8)        
#>      Between.25kto54k      1108.2 (32.3)    946.8 (32.3)        
#>      Between.55kto99k       776.7 (22.6)    685.4 (23.4)        
#>      Over100k               615.7 (17.9)    481.6 (16.5)        
#>   education (%)                                            0.039
#>      College               1908.2 (55.6)   1601.1 (54.7)        
#>      High.School           1129.9 (32.9)   1011.4 (34.6)        
#>      School                 396.2 (11.5)    314.3 (10.7)        
#>   married (%)                                              0.030
#>      Married               2063.9 (60.1)   1737.6 (59.4)        
#>      Never.married          605.8 (17.6)    501.4 (17.1)        
#>      Previously.married     764.6 (22.3)    687.8 (23.5)        
#>   cholesterol (mean (SD))  182.67 (41.11)  182.70 (43.38)  0.001
#>   diastolicBP (mean (SD))   66.76 (14.28)   66.85 (14.85)  0.006
#>   systolicBP (mean (SD))   121.39 (16.74)  123.98 (23.72)  0.126

All SMDs are less than our specified cut-point of 0.2.

Step 4

# Create an indicator variable in the full dataset
dat.full$ind <- 0
dat.full$ind[dat.full$ID %in% analytic.data$ID] <- 1

# New weight = interview weight * stabilized weight
analytic.data$new.sweight.aus_t <- with(analytic.data, interview.weight * sweight.aus_t)
summary(analytic.data$new.sweight.aus_t)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    3255   13604   22527   38953   43663  819622

#  New weight variable in the full dataset
dat.full$new.sweight.aus_t <- 0
dat.full$new.sweight.aus_t[dat.full$ID %in% analytic.data$ID] <- analytic.data$new.sweight.aus_t
summary(dat.full$new.sweight.aus_t)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>       0       0   12356   24675   28169  819622

# Survey setup with full data 
w.design0 <- svydesign(id = ~psu, strata = ~strata, weights = ~new.sweight.aus_t, 
                       data = dat.full, nest = TRUE)

# Subset the design for analytic sample
w.design.s2 <- subset(w.design0, ind == 1)

# Outcome model
out.formula <- as.formula(I(diabetes == "Yes") ~ bmi)
fit.stab.aus <- svyglm(out.formula, design = w.design.s2, family = binomial("logit"))
summ(fit.stab.aus, exp = TRUE, confint = TRUE, digits = 3)

Observations	6316
Dependent variable	I(diabetes == "Yes")
Type	Survey-weighted generalized linear model
Family	binomial
Link	logit

Pseudo-R² (Cragg-Uhler)	0.055
Pseudo-R² (McFadden)	0.041
AIC	3622.951

	exp(Est.)	2.5%	97.5%	t val.	p
(Intercept)	0.162	0.140	0.187	-24.912	0.000
bmiNot overweight	0.291	0.225	0.377	-9.397	0.000
Standard errors: Robust

Double adjustment

# Outcome model with covariates adjustment
fit3.DA <- svyglm(I(diabetes == "Yes") ~ bmi + gender + age + race + income + 
                 education + married + cholesterol + diastolicBP + systolicBP, 
               design = w.design.s2, family = binomial("logit"))
summ(fit3.DA, exp = TRUE, confint = TRUE, digits = 3)

Observations	6316
Dependent variable	I(diabetes == "Yes")
Type	Survey-weighted generalized linear model
Family	binomial
Link	logit

Pseudo-R² (Cragg-Uhler)	0.193
Pseudo-R² (McFadden)	0.149
AIC	3247.730

	exp(Est.)	2.5%	97.5%	t val.	p
(Intercept)	0.045	0.015	0.129	-5.735	NA
bmiNot overweight	0.233	0.175	0.311	-9.886	NA
genderMale	1.416	1.186	1.690	3.850	NA
age	1.049	1.041	1.057	12.066	NA
raceHispanic	0.773	0.585	1.022	-1.808	NA
raceOther	0.867	0.451	1.666	-0.428	NA
raceWhite	0.520	0.353	0.765	-3.316	NA
incomeBetween.25kto54k	0.700	0.461	1.062	-1.677	NA
incomeBetween.55kto99k	0.650	0.407	1.039	-1.801	NA
incomeOver100k	0.467	0.303	0.721	-3.437	NA
educationHigh.School	0.948	0.744	1.209	-0.428	NA
educationSchool	0.976	0.598	1.592	-0.099	NA
marriedNever.married	0.896	0.629	1.276	-0.609	NA
marriedPreviously.married	0.784	0.525	1.169	-1.195	NA
cholesterol	0.993	0.990	0.997	-3.653	NA
diastolicBP	1.005	0.998	1.013	1.412	NA
systolicBP	1.004	0.995	1.012	0.826	NA
Standard errors: Robust


# Log odds ratio with p-values
summary(fit3.DA, df.resid = degf(w.design.s2))
#> 
#> Call:
#> svyglm(formula = I(diabetes == "Yes") ~ bmi + gender + age + 
#>     race + income + education + married + cholesterol + diastolicBP + 
#>     systolicBP, design = w.design.s2, family = binomial("logit"))
#> 
#> Survey design:
#> subset(w.design0, ind == 1)
#> 
#> Coefficients:
#>                            Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)               -3.109181   0.542121  -5.735 3.94e-05 ***
#> bmiNot overweight         -1.456869   0.147369  -9.886 5.81e-08 ***
#> genderMale                 0.347655   0.090292   3.850  0.00157 ** 
#> age                        0.047457   0.003933  12.066 4.01e-09 ***
#> raceHispanic              -0.257364   0.142343  -1.808  0.09069 .  
#> raceOther                 -0.142755   0.333175  -0.428  0.67440    
#> raceWhite                 -0.653902   0.197190  -3.316  0.00470 ** 
#> incomeBetween.25kto54k    -0.357122   0.212996  -1.677  0.11432    
#> incomeBetween.55kto99k    -0.430289   0.238964  -1.801  0.09190 .  
#> incomeOver100k            -0.761311   0.221498  -3.437  0.00367 ** 
#> educationHigh.School      -0.052990   0.123816  -0.428  0.67475    
#> educationSchool           -0.024734   0.249884  -0.099  0.92246    
#> marriedNever.married      -0.109989   0.180532  -0.609  0.55148    
#> marriedPreviously.married -0.243725   0.203990  -1.195  0.25072    
#> cholesterol               -0.006605   0.001808  -3.653  0.00235 ** 
#> diastolicBP                0.005253   0.003721   1.412  0.17842    
#> systolicBP                 0.003618   0.004380   0.826  0.42172    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Zero or negative residual df; p-values not defined
#> 
#> (Dispersion parameter for binomial family taken to be 0.9416613)
#> 
#> Number of Fisher Scoring iterations: 6

References

Austin, Peter C, Nathaniel Jembere, and Maria Chiu. 2018. “Propensity Score Matching and Complex Surveys.” Statistical Methods in Medical Research 27 (4): 1240–57.

Cole, Stephen R, and Miguel A Hernán. 2008. “Constructing Inverse Probability Weights for Marginal Structural Models.” American Journal of Epidemiology 168 (6): 656–64.

DuGoff, Eva H, Megan Schuler, and Elizabeth A Stuart. 2014. “Generalizing Observational Study Results: Applying Propensity Score Methods to Complex Surveys.” Health Services Research 49 (1): 284–303.

Hernán, Miguel A, and James M Robins. 2006. “Estimating Causal Effects from Epidemiological Data.” Journal of Epidemiology and Community Health 60 (7): 578.

Zanutto, Elaine L. 2006. “A Comparison of Propensity Score and Linear Regression Analysis of Complex Survey Data.” Journal of Data Science 4 (1): 67–91.