PSM in BMI-diabetes

Propensity analysis problem

Perform a propensity score matching (pair matching, without replacement, using SMD cut-point 0.2, may adjust for imbalanced and/or all covariates in the outcome model, if any) analysis as per the following recommendations
- Zanutto (2006)
- DuGoff, Schuler, and Stuart (2014)
- Austin, Jembere, and Chiu (2018)

See explained in the previous chapter on PSM in OA-CVD (US), there are four steps of the propensity score matching:

Step 1: PS model specification
Step 2: Matching based on the estimated propensity scores
Step 3: Balance checking
Step 4: Outcome modelling

Only Step 1 is different for Zanutto (2006), DuGoff et al. (2014), and Austin et al. (2018) approaches.

Dataset

The following modified NHANES dataset
- NHANES15lab5.RData
use the data set “analytic.with.miss” within this file.
- for obtaining the final treatment effect estimates, you can omit missing values, but only after creating the design (e.g., subset the design, not the data itself directly).

Variables

Outcome: diabetes
- ‘No’ as the reference category
Exposure: bmi
- convert to binary with >25 vs. <= 25,
- with > 25 as the reference category
Confounder list:
- gender
- age
  - assume continuous
- race
- income
- education
- married
- cholesterol
- diastolicBP
- systolicBP
Mediator:
- physical.work
  - ‘No’ as the reference category
Survey features
- psu
- strata
- weight

Pre-processing

Load data

load(file="Data/propensityscore/NHANES15lab5.RData")

Variable summary

# Exposure
analytic.with.miss$bmi <- with(analytic.with.miss, 
                                   ifelse(bmi>25, "Overweight", 
                                          ifelse(bmi<=25, "Not overweight", NA)))
analytic.with.miss$bmi <- as.factor(analytic.with.miss$bmi)
analytic.with.miss$bmi <- relevel(analytic.with.miss$bmi, ref = "Overweight")

# Drop unnecessary variables 
analytic.with.miss$born <- NULL
analytic.with.miss$physical.work <- NULL

# Rename the weight variable into interview.weight
names(analytic.with.miss)[names(analytic.with.miss) == "weight"] <- "interview.weight"

Reproducibility

set.seed(504)

Approach by Zanutto (2006)

Step 1

# Specify the PS model to estimate propensity scores
ps.formula <- as.formula(I(bmi=="Not overweight") ~ gender + age + race + income + education + 
                           married + cholesterol + diastolicBP + systolicBP)

Step 2

require(MatchIt)
# Complete data for matching
analytic.data <- analytic.with.miss[complete.cases(analytic.with.miss),]

# Create a missing variable in the original dataset
analytic.with.miss$miss <- 1
analytic.with.miss$miss[analytic.with.miss$ID %in% analytic.data$ID] <- 0

# Propensity scores
ps.fit <- glm(ps.formula, data = analytic.data, family = binomial("logit"))
analytic.data$PS <- predict(ps.fit, type = "response", newdata = analytic.data)

# Caliper fixing to 0.2*sd(logit of PS)
caliper <- 0.2*sd(log(analytic.data$PS/(1-analytic.data$PS)))

# Match exposed and unexposed subjects 
match.obj <- matchit(ps.formula, data = analytic.data,
                     distance = analytic.data$PS, 
                     method = "nearest", 
                     replace = FALSE,
                     caliper = caliper, 
                     ratio = 1)
analytic.data$PS <- match.obj$distance

# Extract matched data
matched.data <- match.data(match.obj)

Step 3

# Balance checking
cov <- c("gender", "age", "race", "income", 
         "education", "married", "cholesterol", 
         "diastolicBP", "systolicBP")

tab1m <- CreateTableOne(strata = "bmi", vars = cov, data = matched.data, test = F)
print(tab1m, smd = TRUE)
#>                          Stratified by bmi
#>                           Overweight     Not overweight SMD   
#>   n                         2047           2047               
#>   gender = Male (%)          987 (48.2)     993 (48.5)   0.006
#>   age (mean (SD))          46.96 (17.55)  46.35 (17.64)  0.035
#>   race (%)                                               0.041
#>      Black                   452 (22.1)     424 (20.7)        
#>      Hispanic                558 (27.3)     584 (28.5)        
#>      Other                   328 (16.0)     338 (16.5)        
#>      White                   709 (34.6)     701 (34.2)        
#>   income (%)                                             0.042
#>      <25k                    537 (26.2)     509 (24.9)        
#>      Between.25kto54k        670 (32.7)     661 (32.3)        
#>      Between.55kto99k        472 (23.1)     484 (23.6)        
#>      Over100k                368 (18.0)     393 (19.2)        
#>   education (%)                                          0.023
#>      College                1199 (58.6)    1202 (58.7)        
#>      High.School             667 (32.6)     652 (31.9)        
#>      School                  181 ( 8.8)     193 ( 9.4)        
#>   married (%)                                            0.036
#>      Married                1215 (59.4)    1243 (60.7)        
#>      Never.married           402 (19.6)     374 (18.3)        
#>      Previously.married      430 (21.0)     430 (21.0)        
#>   cholesterol (mean (SD)) 176.45 (39.57) 174.62 (38.70)  0.047
#>   diastolicBP (mean (SD))  64.97 (13.65)  64.19 (13.47)  0.057
#>   systolicBP (mean (SD))  118.07 (15.23) 116.19 (17.60)  0.114

All SMDs are less than your specified cut-point? If yes, that indicates that there is good covariate balancing.

Step 4

require(survey)
# Setup the design with survey features
analytic.with.miss$matched <- 0
analytic.with.miss$matched[analytic.with.miss$ID %in% matched.data$ID] <- 1

# Survey setup for full data
w.design0 <- svydesign(strata = ~strata, id = ~psu, weights = ~interview.weight, 
                      data = analytic.with.miss, nest = TRUE)

# Subset matched data
w.design.m <- subset(w.design0, matched == 1)

# Outcome model
out.formula <- as.formula(I(diabetes == "Yes") ~ bmi)
fit <- svyglm(out.formula, design = w.design.m, family = binomial("logit"))
require(jtools)
summ(fit, exp = TRUE, confint = TRUE, digits = 3)

Observations	4094
Dependent variable	I(diabetes == "Yes")
Type	Survey-weighted generalized linear model
Family	binomial
Link	logit

Pseudo-R² (Cragg-Uhler)	0.071
Pseudo-R² (McFadden)	0.054
AIC	2336.725

	exp(Est.)	2.5%	97.5%	t val.	p
(Intercept)	0.165	0.141	0.195	-21.678	0.000
bmiNot overweight	0.245	0.173	0.346	-7.980	0.000
Standard errors: Robust

Double adjustment

library(survey)
# Outcome model with covariates adjustment
fit.DA <- svyglm(I(diabetes == "Yes") ~ bmi + gender + age + race + income + 
                 education + married + cholesterol + diastolicBP + systolicBP, 
               design = w.design.m, family = binomial("logit"))
summ(fit.DA, exp = TRUE, confint = TRUE, digits = 3)

Observations	4094
Dependent variable	I(diabetes == "Yes")
Type	Survey-weighted generalized linear model
Family	binomial
Link	logit

Pseudo-R² (Cragg-Uhler)	0.212
Pseudo-R² (McFadden)	0.166
AIC	2063.466

	exp(Est.)	2.5%	97.5%	t val.	p
(Intercept)	0.045	0.010	0.199	-4.101	NA
bmiNot overweight	0.235	0.159	0.348	-7.219	NA
genderMale	1.317	1.082	1.604	2.740	NA
age	1.051	1.044	1.059	13.280	NA
raceHispanic	0.896	0.649	1.237	-0.665	NA
raceOther	1.158	0.565	2.374	0.401	NA
raceWhite	0.778	0.507	1.193	-1.152	NA
incomeBetween.25kto54k	0.830	0.551	1.251	-0.889	NA
incomeBetween.55kto99k	0.872	0.621	1.224	-0.794	NA
incomeOver100k	0.669	0.418	1.072	-1.669	NA
educationHigh.School	0.948	0.733	1.225	-0.408	NA
educationSchool	0.872	0.448	1.695	-0.405	NA
marriedNever.married	1.055	0.677	1.646	0.237	NA
marriedPreviously.married	0.849	0.534	1.348	-0.694	NA
cholesterol	0.991	0.986	0.995	-3.810	NA
diastolicBP	1.005	0.992	1.018	0.749	NA
systolicBP	1.003	0.993	1.012	0.559	NA
Standard errors: Robust

Approach by DuGoff et al. (2014)

Step 1

# Specify the PS model to estimate propensity scores
ps.formula2 <- as.formula(I(bmi == "Not overweight") ~ gender + age + race + income + education + 
                           married + cholesterol + diastolicBP + systolicBP + 
                           psu + strata + interview.weight)

Step 2

# Propensity scores
ps.fit2 <- glm(ps.formula2, data = analytic.data, family = binomial("logit"))
analytic.data$PS2 <- fitted(ps.fit2)

# Caliper fixing to 0.2*sd(logit of PS)
caliper2 <- 0.2*sd(log(analytic.data$PS2/(1-analytic.data$PS2)))

# Match exposed and unexposed subjects 
set.seed(504)
match.obj2 <- matchit(ps.formula2, data = analytic.data,
                     distance = analytic.data$PS2, 
                     method = "nearest", 
                     replace = FALSE,
                     caliper = caliper2, 
                     ratio = 1)
analytic.data$PS2 <- match.obj2$distance

# Extract matched data
matched.data2 <- match.data(match.obj2)

Step 3

# Balance checking
cov2 <- c("gender", "age", "race", "income", 
          "education", "married", "cholesterol", 
         "diastolicBP", "systolicBP", 
         "psu", "strata", "interview.weight")

tab1m2 <- CreateTableOne(strata = "bmi", vars = cov2, data = matched.data2, test = F)
print(tab1m2, smd = TRUE)
#>                               Stratified by bmi
#>                                Overweight          Not overweight      SMD   
#>   n                                2037                2037                  
#>   gender = Male (%)                 991 (48.6)          985 (48.4)      0.006
#>   age (mean (SD))                 47.10 (17.50)       46.41 (17.72)     0.039
#>   race (%)                                                              0.044
#>      Black                          453 (22.2)          438 (21.5)           
#>      Hispanic                       553 (27.1)          579 (28.4)           
#>      Other                          324 (15.9)          342 (16.8)           
#>      White                          707 (34.7)          678 (33.3)           
#>   income (%)                                                            0.023
#>      <25k                           522 (25.6)          522 (25.6)           
#>      Between.25kto54k               679 (33.3)          662 (32.5)           
#>      Between.55kto99k               471 (23.1)          473 (23.2)           
#>      Over100k                       365 (17.9)          380 (18.7)           
#>   education (%)                                                         0.007
#>      College                       1199 (58.9)         1192 (58.5)           
#>      High.School                    644 (31.6)          648 (31.8)           
#>      School                         194 ( 9.5)          197 ( 9.7)           
#>   married (%)                                                           0.018
#>      Married                       1209 (59.4)         1224 (60.1)           
#>      Never.married                  395 (19.4)          394 (19.3)           
#>      Previously.married             433 (21.3)          419 (20.6)           
#>   cholesterol (mean (SD))        176.73 (39.68)      174.82 (38.77)     0.049
#>   diastolicBP (mean (SD))         65.16 (13.49)       64.39 (13.22)     0.057
#>   systolicBP (mean (SD))         118.27 (15.07)      116.33 (17.57)     0.119
#>   psu (mean (SD))                  1.49 (0.50)         1.49 (0.50)      0.008
#>   strata (mean (SD))             126.39 (4.16)       126.27 (4.29)      0.029
#>   interview.weight (mean (SD)) 38621.02 (34640.92) 37195.99 (36180.33)  0.040

All SMDs are less than your specified cut-point? If yes, that indicates that there is good covariate balancing.

Step 4

# Setup the design with survey features
analytic.with.miss$matched2 <- 0
analytic.with.miss$matched2[analytic.with.miss$ID %in% matched.data2$ID] <- 1

# Survey setup for full data
w.design02 <- svydesign(strata = ~strata, id = ~psu, weights = ~interview.weight, 
                       data = analytic.with.miss, nest = TRUE)

# Subset matched data
w.design.m2 <- subset(w.design02, matched2 == 1)

# Outcome model
out.formula <- as.formula(I(diabetes == "Yes") ~ bmi)
fit2 <- svyglm(out.formula, design = w.design.m2, family = binomial("logit"))
summ(fit2, exp = TRUE, confint = TRUE, digits = 3)

Observations	4074
Dependent variable	I(diabetes == "Yes")
Type	Survey-weighted generalized linear model
Family	binomial
Link	logit

Pseudo-R² (Cragg-Uhler)	0.083
Pseudo-R² (McFadden)	0.063
AIC	2322.836

	exp(Est.)	2.5%	97.5%	t val.	p
(Intercept)	0.174	0.149	0.202	-22.325	0.000
bmiNot overweight	0.221	0.163	0.299	-9.749	0.000
Standard errors: Robust

Double adjustment

# Outcome model with covariates adjustment
fit2.DA <- svyglm(I(diabetes == "Yes") ~ bmi + gender + age + race + income + 
                 education + married + cholesterol + diastolicBP + systolicBP, 
               design = w.design.m, family = binomial("logit"))
summ(fit2.DA, exp = TRUE, confint = TRUE, digits = 3)

Observations	4094
Dependent variable	I(diabetes == "Yes")
Type	Survey-weighted generalized linear model
Family	binomial
Link	logit

Pseudo-R² (Cragg-Uhler)	0.212
Pseudo-R² (McFadden)	0.166
AIC	2063.466

	exp(Est.)	2.5%	97.5%	t val.	p
(Intercept)	0.045	0.010	0.199	-4.101	NA
bmiNot overweight	0.235	0.159	0.348	-7.219	NA
genderMale	1.317	1.082	1.604	2.740	NA
age	1.051	1.044	1.059	13.280	NA
raceHispanic	0.896	0.649	1.237	-0.665	NA
raceOther	1.158	0.565	2.374	0.401	NA
raceWhite	0.778	0.507	1.193	-1.152	NA
incomeBetween.25kto54k	0.830	0.551	1.251	-0.889	NA
incomeBetween.55kto99k	0.872	0.621	1.224	-0.794	NA
incomeOver100k	0.669	0.418	1.072	-1.669	NA
educationHigh.School	0.948	0.733	1.225	-0.408	NA
educationSchool	0.872	0.448	1.695	-0.405	NA
marriedNever.married	1.055	0.677	1.646	0.237	NA
marriedPreviously.married	0.849	0.534	1.348	-0.694	NA
cholesterol	0.991	0.986	0.995	-3.810	NA
diastolicBP	1.005	0.992	1.018	0.749	NA
systolicBP	1.003	0.993	1.012	0.559	NA
Standard errors: Robust

Tip

Double adjustment:

Double adjustment in PSM can offer an additional layer of control for confounding and enhance the robustness of treatment effect estimates. The primary goal of double adjustment is to further minimize bias in the estimated treatment effect, especially when the propensity score model may not fully balance all covariates across treatment groups. in this approach, The propensity score is estimated, typically using logistic regression, where the treatment assignment is regressed on observed covariates. Units (e.g., individuals) in the treatment group are matched with units in the control group based on their propensity scores. This aims to create comparable groups and balance observed covariates across treatment and control groups. After matching, an outcome model is fitted to estimate the treatment effect while additionally adjusting for the same covariates used in the propensity score model. This second adjustment in the outcome model is what constitutes the “double” in “double adjustment.”

However, it should be applied thoughtfully, with careful consideration of model specification, covariate selection, and underlying assumptions to ensure valid and reliable results. Always consider the specific context of the study and consult statistical guidelines or experts when applying advanced methods like double adjustment in PSM.

Approach by Austin et al. (2018)

Step 1

# Specify the PS model to estimate propensity scores
ps.formula3 <- as.formula(I(bmi == "Not overweight") ~ gender + age + race + income + education + 
                           married + cholesterol + diastolicBP + systolicBP)

# Survey design
require(survey)
analytic.design <- svydesign(id = ~psu, weights = ~interview.weight, strata = ~strata,
                             data = analytic.data, nest = TRUE)

Step 2

# Propensity scores
ps.fit3 <- svyglm(ps.formula3, design = analytic.design, family = binomial("logit"))
analytic.data$PS3 <- fitted(ps.fit3)

# Caliper fixing to 0.2*sd(logit of PS)
caliper3 <- 0.2*sd(log(analytic.data$PS3/(1-analytic.data$PS3)))

# Match exposed and unexposed subjects 
set.seed(504)
match.obj3 <- matchit(ps.formula, data = analytic.data,
                     distance = analytic.data$PS3, 
                     method = "nearest", 
                     replace = FALSE,
                     caliper = caliper3, 
                     ratio = 1)
analytic.data$PS3 <- match.obj3$distance

# Extract matched data
matched.data3 <- match.data(match.obj3)

Step 3

# Balance checking
cov <- c("gender", "age", "race", "income", 
         "education", "married", "cholesterol", 
         "diastolicBP", "systolicBP")

tab1m3 <- CreateTableOne(strata = "bmi", vars = cov, data = matched.data3, test = F)
print(tab1m3, smd = TRUE)
#>                          Stratified by bmi
#>                           Overweight     Not overweight SMD   
#>   n                         2029           2029               
#>   gender = Male (%)          917 (45.2)    1009 (49.7)   0.091
#>   age (mean (SD))          47.20 (17.50)  46.26 (17.62)  0.054
#>   race (%)                                               0.071
#>      Black                   443 (21.8)     416 (20.5)        
#>      Hispanic                552 (27.2)     583 (28.7)        
#>      Other                   313 (15.4)     351 (17.3)        
#>      White                   721 (35.5)     679 (33.5)        
#>   income (%)                                             0.067
#>      <25k                    528 (26.0)     506 (24.9)        
#>      Between.25kto54k        688 (33.9)     656 (32.3)        
#>      Between.55kto99k        439 (21.6)     495 (24.4)        
#>      Over100k                374 (18.4)     372 (18.3)        
#>   education (%)                                          0.034
#>      College                1165 (57.4)    1186 (58.5)        
#>      High.School             654 (32.2)     653 (32.2)        
#>      School                  210 (10.3)     190 ( 9.4)        
#>   married (%)                                            0.026
#>      Married                1216 (59.9)    1231 (60.7)        
#>      Never.married           396 (19.5)     375 (18.5)        
#>      Previously.married      417 (20.6)     423 (20.8)        
#>   cholesterol (mean (SD)) 176.62 (38.32) 175.13 (38.65)  0.039
#>   diastolicBP (mean (SD))  65.02 (13.44)  64.48 (13.37)  0.041
#>   systolicBP (mean (SD))  117.98 (15.46) 116.37 (17.62)  0.097

All SMDs are less than your specified cut-point? If yes, that indicates that there is good covariate balancing.

Step 4

# Setup the design with survey features
analytic.with.miss$matched3 <- 0
analytic.with.miss$matched3[analytic.with.miss$ID %in% matched.data3$ID] <- 1

# Survey setup for full data
w.design03 <- svydesign(strata = ~strata, id = ~psu, weights = ~interview.weight, 
                       data = analytic.with.miss, nest = TRUE)

# Subset matched data
w.design.m3 <- subset(w.design03, matched3 == 1)

# Outcome model
out.formula <- as.formula(I(diabetes == "Yes") ~ bmi)
fit3 <- svyglm(out.formula, design = w.design.m3, family = binomial("logit"))
summ(fit3, exp = TRUE, confint = TRUE, digits = 3)

Observations	4058
Dependent variable	I(diabetes == "Yes")
Type	Survey-weighted generalized linear model
Family	binomial
Link	logit

Pseudo-R² (Cragg-Uhler)	0.071
Pseudo-R² (McFadden)	0.054
AIC	2364.398

	exp(Est.)	2.5%	97.5%	t val.	p
(Intercept)	0.169	0.142	0.202	-19.891	0.000
bmiNot overweight	0.244	0.178	0.334	-8.825	0.000
Standard errors: Robust

Double adjustment

# Outcome model with covariates adjustment
fit3.DA <- svyglm(I(diabetes == "Yes") ~ bmi + gender + age + race + income + 
                 education + married + cholesterol + diastolicBP + systolicBP, 
               design = w.design.m, family = binomial("logit"))
summ(fit3.DA, exp = TRUE, confint = TRUE, digits = 3)

Observations	4094
Dependent variable	I(diabetes == "Yes")
Type	Survey-weighted generalized linear model
Family	binomial
Link	logit

Pseudo-R² (Cragg-Uhler)	0.212
Pseudo-R² (McFadden)	0.166
AIC	2063.466

	exp(Est.)	2.5%	97.5%	t val.	p
(Intercept)	0.045	0.010	0.199	-4.101	NA
bmiNot overweight	0.235	0.159	0.348	-7.219	NA
genderMale	1.317	1.082	1.604	2.740	NA
age	1.051	1.044	1.059	13.280	NA
raceHispanic	0.896	0.649	1.237	-0.665	NA
raceOther	1.158	0.565	2.374	0.401	NA
raceWhite	0.778	0.507	1.193	-1.152	NA
incomeBetween.25kto54k	0.830	0.551	1.251	-0.889	NA
incomeBetween.55kto99k	0.872	0.621	1.224	-0.794	NA
incomeOver100k	0.669	0.418	1.072	-1.669	NA
educationHigh.School	0.948	0.733	1.225	-0.408	NA
educationSchool	0.872	0.448	1.695	-0.405	NA
marriedNever.married	1.055	0.677	1.646	0.237	NA
marriedPreviously.married	0.849	0.534	1.348	-0.694	NA
cholesterol	0.991	0.986	0.995	-3.810	NA
diastolicBP	1.005	0.992	1.018	0.749	NA
systolicBP	1.003	0.993	1.012	0.559	NA
Standard errors: Robust

References

Austin, Peter C, Nathaniel Jembere, and Maria Chiu. 2018. “Propensity Score Matching and Complex Surveys.” Statistical Methods in Medical Research 27 (4): 1240–57.

DuGoff, Eva H, Megan Schuler, and Elizabeth A Stuart. 2014. “Generalizing Observational Study Results: Applying Propensity Score Methods to Complex Surveys.” Health Services Research 49 (1): 284–303.

Zanutto, Elaine L. 2006. “A Comparison of Propensity Score and Linear Regression Analysis of Complex Survey Data.” Journal of Data Science 4 (1): 67–91.