PSM in OA-CVD (CCHS)

This tutorial is a comprehensive guide on implementing Propensity Score Matching (PSM) using R, particularly focusing on a OA - CVD health study from the Canadian Community Health Survey (CCHS). This PSM method is used to reduce bias due to confounding variables in observational studies by matching treated and control units with similar propensity scores. The tutorial illustrates that PSM is an iterative process, where researchers may need to refine their matching strategy to achieve satisfactory balance in the matched sample. Different strategies for estimating the treatment effect in the matched sample are explored, each with its own assumptions and implications.

Load packages

At first, various R packages are loaded to utilize their functions for data manipulation, statistical analysis, and visualization.

# Load required packages
library(MatchIt)
require(tableone)
require(survey)
require(cobalt)
require(Publish)
require(optmatch)

Load data

The dataset is loaded into the R environment. Variables are renamed to avoid conflicts in subsequent analyses.

load(file="Data/propensityscore/cchs123c.RData")
head(analytic11n)

# later we will create another variable called weights
# hence to avoid any conflict/ambiguity,
# renaming weight variable to survey.weight
analytic.miss$survey.weight <- analytic.miss$weight
analytic11n$survey.weight <- analytic11n$weight
analytic.miss$weight <- analytic11n$weight <- NULL

Analysis

We are going to apply propensity score analysis (Matching) in our OA - CVD problem from CCHS. For computation considerations, we will only work with cycle 1.1, and the people from Northern provinces in Canada (analytic11n data).

Step 1

Specify PS

A logistic regression model formula is specified to calculate the propensity scores (PS), which is the probability of receiving the treatment given the observed covariates.

ps.formula <- as.formula("OA ~ age + sex + stress + married + 
                         income + race + bmi + phyact + smoke +
                        doctor + drink + bp + 
                         immigrate + fruit + diab + edu")
var.names <- c("age", "sex", "stress", "married", 
               "income", "race", "bmi", "phyact", "smoke", 
               "doctor", "drink", "bp", 
               "immigrate", "fruit", "diab", "edu")

Fit model

The software fits the PS model using a logistic regression by default. This package actually performs step 1 and 2 with one command matchit.

Look at the website for arguments of matchit (RDocumentation 2023)]. It looks like this

matchit(formula, data, model="logit", discard=0, reestimate=FALSE, nearest=TRUE,
                 replace=FALSE, m.order=2, ratio=1, caliper=0, calclosest=FALSE,
                 subclass=0, sub.by="treat", mahvars=NULL, exact=FALSE, counter=TRUE, full=FALSE, full.options=list(),...)
Tip

Nearest-Neighbor Matching:

Nearest-neighbor matching is a widely used technique in PSM to pair treated and control units based on the proximity of their propensity scores. It is straightforward and computationally efficient, making it a popular choice in many applications of PSM. Nearest-neighbor matching is often termed a “greedy” algorithm because it matches units in order, without considering the global distribution of propensity scores. Once a match is made, it is not revisited, even if a later unit would have been a better match. The method seeks to minimize bias by creating closely matched pairs but can increase variance if the pool of potential matches is reduced too much (e.g., using a very narrow caliper). It is essential to ensure that there is a common support region where the distributions of propensity scores for treated and control units overlap, ensuring comparability.

Step 2

Match subjects by PS

We are going to match using a Nearest neighbor algorithm. This is a greedy matching algorithm. Note that we are not even defining any caliper.

Tip

Caliper:

In the context of PSM, a caliper is a predefined maximum allowable difference between the propensity scores of matched units. Essentially, it sets a threshold for how dissimilar matched units can be in terms of their propensity scores. When a caliper is used, a treated unit is only matched with a control unit if the absolute difference in their propensity scores is less than or equal to the specified caliper width. The caliper is used to avoid bad matches and thereby minimize bias in the estimated treatment effect. The size of the caliper is crucial. Too wide a caliper may allow poor matches, while too narrow a caliper may result in many units going unmatched. Implementing a caliper involves a trade-off between bias and efficiency. Using a caliper reduces bias by avoiding poor matches but may increase variance by reducing the number of matched pairs available for analysis. Therefore, the use of a caliper in PSM is a strategic decision to enhance the quality of matches and thereby improve the validity of causal inferences drawn from observational data. It is a practical tool to ensure that matched units are sufficiently similar in terms of their propensity scores, reducing the likelihood of bias due to poor matches.

# set seed
set.seed(123)
# match
matching.obj <- matchit(ps.formula,
                        data = analytic11n,
                        method = "nearest",
                        ratio = 1)
# see how many matched
matching.obj
#> A matchit object
#>  - method: 1:1 nearest neighbor matching without replacement
#>  - distance: Propensity score
#>              - estimated with logistic regression
#>  - number of obs.: 1424 (original), 134 (matched)
#>  - target estimand: ATT
#>  - covariates: age, sex, stress, married, income, race, bmi, phyact, smoke, doctor, drink, bp, immigrate, fruit, diab, edu
# create the "matched"" data
OACVD.match <- match.data(matching.obj)
# see the dimension
dim(analytic11n)
#> [1] 1424   23
dim(OACVD.match)
#> [1] 134  26

Let’s try to understand how this is working.

Extract matched IDs

m.mat<-matching.obj$match.matrix
head(m.mat)
#>       [,1]    
#> 17864 "96719" 
#> 17921 "17846" 
#> 18191 "97168" 
#> 18256 "111999"
#> 18264 "17989" 
#> 18383 "108197"

Extract the matched treated IDs

treated.id<-as.numeric(row.names(m.mat))
treated.id # basically row names
#>  [1]  17864  17921  18191  18256  18264  18383  18389  18475  39105  96344
#> [11]  96364  96407  96424  96460  96484  96571  96582  96625  96632  96641
#> [21]  96657  96686  96693  96696  96705  96734  96795  96840  96913  97027
#> [31]  97065  97125 108178 108183 108185 108192 111809 111813 111856 111859
#> [41] 111895 111896 111920 111942 112014 112026 112046 112083 112086 112114
#> [51] 112122 112151 112167 112189 112197 112215 112232 112245 112275 112284
#> [61] 112289 112290 112300 112325 112375 126477 126522

Extract the matched untreated IDs

untreated.id <- as.numeric(m.mat)
untreated.id # basically row names
#>  [1]  96719  17846  97168 111999  17989 108197 112384  17909 126561 111880
#> [11] 112184  18117  96865  18120  97023 112379  97017 126562  96356 126470
#> [21] 126385  96374  18203  18262  96972 111924  96354  96983  18235  96882
#> [31] 112054  18321  18426 112349  38996 126516 111814 112087  96569 111932
#> [41] 126539  18315  96665  18225 112052 112324 112165  18329  96609 126376
#> [51]  96474 126570 126547 126343  96680  96558 111931  96718  96533 111823
#> [61] 112177  17953  17904 111908 111962  96644  96576

Extract the matched treated data

tx <- analytic11n[rownames(analytic11n) %in% treated.id,]
head(tx[c("OA", "CVD", "sex", "age", "race", "edu")])

Extract the matched untreated data

utx <- analytic11n[rownames(analytic11n) %in% untreated.id,]
head(utx[c("OA", "CVD", "sex", "age", "race", "edu")])

Extract the matched data altogether

Simply using match.data is enough (as done earlier).

OACVD.match <- match.data(matching.obj)

Assign match ID

OACVD.match$match.ID <- NA
OACVD.match$match.ID[rownames(OACVD.match) %in% treated.id] <- 1:length(treated.id)
OACVD.match$match.ID[rownames(OACVD.match) %in% untreated.id] <- 1:length(untreated.id)
table(OACVD.match$match.ID)
#> 
#>  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
#>  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2 
#> 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 
#>  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2 
#> 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 
#>  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2

Take a look at individual matches for the first match

na.omit(OACVD.match[OACVD.match$match.ID == 1,])

Take a look at individual matches for the second match

na.omit(OACVD.match[OACVD.match$match.ID == 2,])

Step 3

Both graphical and numerical methods are used to assess the quality of the matches and the balance of covariates in the matched sample.

Examining PS graphically

Visually inspect the PS and assess the balance of covariates in the matched sample. Various plots are generated to visualize the distribution of PS across treatment groups and to check the balance of covariates before and after matching.

matchit package

# plot(matching.obj) # covariate balance
plot(matching.obj, type = "jitter") # propensity score locations

#> [1] "To identify the units, use first mouse button; to stop, use second."
#> integer(0)
plot(matching.obj, type = "hist") #check matched treated vs matched control

summrize.output <- summary(matching.obj, standardize = TRUE)
plot(summrize.output)

Overalp check

# plot propensity scores by exposure group
plot(density(OACVD.match$distance[OACVD.match$OA==1]), 
     col = "red", main = "")
lines(density(OACVD.match$distance[OACVD.match$OA==0]), 
      col = "blue", lty = 2)
legend("topright", c("Non-arthritis","OA"), 
       col = c("red", "blue"), lty=1:2)

cobalt package

Overlap check in a more convenient way

# different badwidth
bal.plot(matching.obj, var.name = "distance")

Look at the data

# what is distance variable here?
head(OACVD.match)

Numerical values of PS

summary(OACVD.match$distance)
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> 0.004671 0.044834 0.099094 0.138969 0.200576 0.611166
by(OACVD.match$distance, OACVD.match$OA, summary)
#> OACVD.match$OA: 0
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> 0.004671 0.047344 0.099215 0.135042 0.199279 0.418206 
#> ------------------------------------------------------------ 
#> OACVD.match$OA: 1
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> 0.004671 0.047346 0.098973 0.142897 0.198741 0.611166

Question for the students

  • Are you happy with the matching after reviewing the plots?

Covariate balance in matched sample

Covariate balance is assessed numerically using standardized mean differences (SMD).

Tip

Standardized mean differences: SMD is a versatile and widely used statistical measure that facilitates the comparison of groups in research by providing a scale-free metric of difference and balance. In the context of propensity score matching, achieving low SMD values for covariates after matching is crucial to ensuring the validity of causal inferences drawn from the matched sample.

Benifits:

  • SMD is not influenced by the scale of the measured variable, making it suitable for comparing the balance of different variables measured on different scales.
  • Unlike hypothesis testing, SMD is not affected by sample size, making it a reliable measure for assessing balance in matched samples.
tab1 <- CreateTableOne(strata = "OA", data = OACVD.match, 
                       test = FALSE, vars = var.names)
print(tab1, smd = TRUE)
#>                        Stratified by OA
#>                         0          1          SMD   
#>   n                     67         67               
#>   age (%)                                      0.190
#>      20-29 years         4 ( 6.0)   4 ( 6.0)        
#>      30-39 years        16 (23.9)  11 (16.4)        
#>      40-49 years        16 (23.9)  18 (26.9)        
#>      50-59 years        28 (41.8)  31 (46.3)        
#>      60-64 years         3 ( 4.5)   3 ( 4.5)        
#>      65 years and over   0 ( 0.0)   0 ( 0.0)        
#>      teen                0 ( 0.0)   0 ( 0.0)        
#>   sex = Male (%)        28 (41.8)  32 (47.8)   0.120
#>   stress = stressed (%) 20 (29.9)  21 (31.3)   0.032
#>   married = single (%)  22 (32.8)  23 (34.3)   0.032
#>   income (%)                                   0.183
#>      $29,999 or less    11 (16.4)  13 (19.4)        
#>      $30,000-$49,999    19 (28.4)  15 (22.4)        
#>      $50,000-$79,999    14 (20.9)  12 (17.9)        
#>      $80,000 or more    23 (34.3)  27 (40.3)        
#>   race = White (%)      43 (64.2)  44 (65.7)   0.031
#>   bmi (%)                                     <0.001
#>      Underweight         0 ( 0.0)   0 ( 0.0)        
#>      healthy weight     18 (26.9)  18 (26.9)        
#>      Overweight         49 (73.1)  49 (73.1)        
#>   phyact (%)                                   0.041
#>      Active             16 (23.9)  16 (23.9)        
#>      Inactive           40 (59.7)  39 (58.2)        
#>      Moderate           11 (16.4)  12 (17.9)        
#>   smoke (%)                                    0.258
#>      Never smoker       14 (20.9)   9 (13.4)        
#>      Current smoker     20 (29.9)  27 (40.3)        
#>      Former smoker      33 (49.3)  31 (46.3)        
#>   doctor = Yes (%)      51 (76.1)  47 (70.1)   0.135
#>   drink (%)                                    0.116
#>      Never drank         2 ( 3.0)   2 ( 3.0)        
#>      Current drinker    54 (80.6)  51 (76.1)        
#>      Former driker      11 (16.4)  14 (20.9)        
#>   bp = Yes (%)           7 (10.4)   5 ( 7.5)   0.105
#>   immigrate (%)                                0.180
#>      not immigrant      64 (95.5)  61 (91.0)        
#>      > 10 years          3 ( 4.5)   6 ( 9.0)        
#>      recent              0 ( 0.0)   0 ( 0.0)        
#>   fruit (%)                                    0.146
#>      0-3 daily serving  19 (28.4)  19 (28.4)        
#>      4-6 daily serving  28 (41.8)  32 (47.8)        
#>      6+ daily serving   20 (29.9)  16 (23.9)        
#>   diab = Yes (%)         1 ( 1.5)   4 ( 6.0)   0.238
#>   edu (%)                                      0.105
#>      < 2ndary           14 (20.9)  13 (19.4)        
#>      2nd grad.           1 ( 1.5)   2 ( 3.0)        
#>      Other 2nd grad.     1 ( 1.5)   1 ( 1.5)        
#>      Post-2nd grad.     51 (76.1)  51 (76.1)

Question for the students

  • All SMD < 0.20?

Other balance measures

Individual categories

If you want to check balance at each category (not very useful in general situations). We are generally interested if the variables are balanced or not (not categories).

baltab <- bal.tab(matching.obj)
baltab
#> Call
#>  matchit(formula = ps.formula, data = analytic11n, method = "nearest", 
#>     ratio = 1)
#> 
#> Balance Measures
#>                             Type Diff.Adj
#> distance                Distance   0.0597
#> age_20-29 years           Binary   0.0000
#> age_30-39 years           Binary  -0.0746
#> age_40-49 years           Binary   0.0299
#> age_50-59 years           Binary   0.0448
#> age_60-64 years           Binary   0.0000
#> sex_Male                  Binary   0.0597
#> stress_stressed           Binary   0.0149
#> married_single            Binary   0.0149
#> income_$29,999 or less    Binary   0.0299
#> income_$30,000-$49,999    Binary  -0.0597
#> income_$50,000-$79,999    Binary  -0.0299
#> income_$80,000 or more    Binary   0.0597
#> race_White                Binary   0.0149
#> bmi_Underweight           Binary   0.0000
#> bmi_healthy weight        Binary   0.0000
#> bmi_Overweight            Binary   0.0000
#> phyact_Active             Binary   0.0000
#> phyact_Inactive           Binary  -0.0149
#> phyact_Moderate           Binary   0.0149
#> smoke_Never smoker        Binary  -0.0746
#> smoke_Current smoker      Binary   0.1045
#> smoke_Former smoker       Binary  -0.0299
#> doctor_Yes                Binary  -0.0597
#> drink_Never drank         Binary   0.0000
#> drink_Current drinker     Binary  -0.0448
#> drink_Former driker       Binary   0.0448
#> bp_Yes                    Binary  -0.0299
#> immigrate_not immigrant   Binary  -0.0448
#> immigrate_> 10 years      Binary   0.0448
#> immigrate_recent          Binary   0.0000
#> fruit_0-3 daily serving   Binary   0.0000
#> fruit_4-6 daily serving   Binary   0.0597
#> fruit_6+ daily serving    Binary  -0.0597
#> diab_Yes                  Binary   0.0448
#> edu_< 2ndary              Binary  -0.0149
#> edu_2nd grad.             Binary   0.0149
#> edu_Other 2nd grad.       Binary   0.0000
#> edu_Post-2nd grad.        Binary   0.0000
#> 
#> Sample sizes
#>           Control Treated
#> All          1357      67
#> Matched        67      67
#> Unmatched    1290       0
Individual plots

You could plot each variables individually

bal.plot(matching.obj, var.name = "income")

bal.plot(matching.obj, var.name = "age")

bal.plot(matching.obj, var.name = "race")

bal.plot(matching.obj, var.name = "diab")

bal.plot(matching.obj, var.name = "immigrate")

Love plot
# Individual categories again
love.plot(baltab, threshold = .2)
#> Warning: Unadjusted values are missing. This can occur when un = FALSE and
#> quick = TRUE in the original call to bal.tab().
#> Warning: Standardized mean differences and raw mean differences are present in the same plot. 
#> Use the 'stars' argument to distinguish between them and appropriately label the x-axis.

Repeat of Step 1-3 again

Covariate balance is reassessed in each step to ensure the quality of the match.

Add caliper

The matching process is repeated, this time introducing a caliper to ensure that matches are only made within a specified range of PS.

logitPS <-  -log(1/OACVD.match$distance - 1) 
# logit of the propensity score
.2*sd(logitPS) # suggested in the literature
#> [1] 0.2334615


# Step 1 and 2
matching.obj <- matchit(ps.formula,
                        data = analytic11n,
                        method = "nearest",
                        ratio = 1,
                        caliper = .2*sd(logitPS))
# see how many matched
matching.obj
#> A matchit object
#>  - method: 1:1 nearest neighbor matching without replacement
#>  - distance: Propensity score [caliper]
#>              - estimated with logistic regression
#>  - caliper: <distance> (0.015)
#>  - number of obs.: 1424 (original), 128 (matched)
#>  - target estimand: ATT
#>  - covariates: age, sex, stress, married, income, race, bmi, phyact, smoke, doctor, drink, bp, immigrate, fruit, diab, edu
OACVD.match <- match.data(matching.obj)
# Step 3
by(OACVD.match$distance, OACVD.match$OA, summary)
#> OACVD.match$OA: 0
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> 0.004671 0.041740 0.094963 0.124665 0.184103 0.418206 
#> ------------------------------------------------------------ 
#> OACVD.match$OA: 1
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> 0.004671 0.041694 0.095089 0.125262 0.183739 0.424895
tab1 <- CreateTableOne(strata = "OA", data = OACVD.match, 
                       test = FALSE, vars = var.names)
print(tab1, smd = TRUE)
#>                        Stratified by OA
#>                         0          1          SMD   
#>   n                     64         64               
#>   age (%)                                      0.196
#>      20-29 years         4 ( 6.2)   4 ( 6.2)        
#>      30-39 years        16 (25.0)  11 (17.2)        
#>      40-49 years        16 (25.0)  18 (28.1)        
#>      50-59 years        25 (39.1)  28 (43.8)        
#>      60-64 years         3 ( 4.7)   3 ( 4.7)        
#>      65 years and over   0 ( 0.0)   0 ( 0.0)        
#>      teen                0 ( 0.0)   0 ( 0.0)        
#>   sex = Male (%)        27 (42.2)  30 (46.9)   0.094
#>   stress = stressed (%) 18 (28.1)  18 (28.1)  <0.001
#>   married = single (%)  21 (32.8)  23 (35.9)   0.066
#>   income (%)                                   0.204
#>      $29,999 or less    11 (17.2)  13 (20.3)        
#>      $30,000-$49,999    19 (29.7)  14 (21.9)        
#>      $50,000-$79,999    13 (20.3)  12 (18.8)        
#>      $80,000 or more    21 (32.8)  25 (39.1)        
#>   race = White (%)      40 (62.5)  42 (65.6)   0.065
#>   bmi (%)                                     <0.001
#>      Underweight         0 ( 0.0)   0 ( 0.0)        
#>      healthy weight     18 (28.1)  18 (28.1)        
#>      Overweight         46 (71.9)  46 (71.9)        
#>   phyact (%)                                   0.096
#>      Active             14 (21.9)  16 (25.0)        
#>      Inactive           39 (60.9)  36 (56.2)        
#>      Moderate           11 (17.2)  12 (18.8)        
#>   smoke (%)                                    0.267
#>      Never smoker       14 (21.9)   9 (14.1)        
#>      Current smoker     19 (29.7)  26 (40.6)        
#>      Former smoker      31 (48.4)  29 (45.3)        
#>   doctor = Yes (%)      48 (75.0)  44 (68.8)   0.139
#>   drink (%)                                    0.123
#>      Never drank         2 ( 3.1)   2 ( 3.1)        
#>      Current drinker    52 (81.2)  49 (76.6)        
#>      Former driker      10 (15.6)  13 (20.3)        
#>   bp = Yes (%)           7 (10.9)   5 ( 7.8)   0.107
#>   immigrate (%)                                0.260
#>      not immigrant      62 (96.9)  58 (90.6)        
#>      > 10 years          2 ( 3.1)   6 ( 9.4)        
#>      recent              0 ( 0.0)   0 ( 0.0)        
#>   fruit (%)                                    0.116
#>      0-3 daily serving  19 (29.7)  19 (29.7)        
#>      4-6 daily serving  27 (42.2)  30 (46.9)        
#>      6+ daily serving   18 (28.1)  15 (23.4)        
#>   diab = Yes (%)         0 ( 0.0)   3 ( 4.7)   0.314
#>   edu (%)                                      0.108
#>      < 2ndary           14 (21.9)  13 (20.3)        
#>      2nd grad.           1 ( 1.6)   2 ( 3.1)        
#>      Other 2nd grad.     1 ( 1.6)   1 ( 1.6)        
#>      Post-2nd grad.     48 (75.0)  48 (75.0)

Question for the students

  • Did all of the SMDs decrease?

Look at the data

# what is weights variable for pair matching?
head(OACVD.match)
summary(OACVD.match$weights)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>       1       1       1       1       1       1

Step 4

Estimate treatment effect for matched data

Different models (e.g., unconditional logistic regression, survey design) are fitted to estimate the treatment effect in the matched sample.

Unconditional logistic

# Wrong model for population!!
outcome.model <- glm(CVD ~ OA, data = OACVD.match, family = binomial())
publish(outcome.model)
#>  Variable Units OddsRatio       CI.95  p-value 
#>        OA            0.48 [0.09;2.74]   0.4119

Survey design

Convert data to design

The matched data is converted to a survey design object to account for the matched pairs in the analysis.

analytic.miss$matched <- 0
length(analytic.miss$ID) # full data
#> [1] 397173
length(OACVD.match$ID) # matched data
#> [1] 128
length(analytic.miss$ID[analytic.miss$ID %in% OACVD.match$ID])
#> [1] 128
analytic.miss$matched[analytic.miss$ID %in% OACVD.match$ID] <- 1
table(analytic.miss$matched)
#> 
#>      0      1 
#> 397045    128
w.design0 <- svydesign(id=~1, weights=~survey.weight, 
                      data=analytic.miss)
w.design.m <- subset(w.design0, matched == 1)
Balance in matched population?
tab1 <- svyCreateTableOne(strata = "OA", data = w.design.m, 
                       test = FALSE, vars = var.names)
print(tab1, smd = TRUE)
#>                        Stratified by OA
#>                         0              1              SMD   
#>   n                     1783.6         2002.1               
#>   age (%)                                              0.307
#>      20-29 years         131.0 ( 7.3)   120.9 ( 6.0)        
#>      30-39 years         388.0 (21.8)   237.8 (11.9)        
#>      40-49 years         518.3 (29.1)   572.7 (28.6)        
#>      50-59 years         680.1 (38.1)   999.3 (49.9)        
#>      60-64 years          66.1 ( 3.7)    71.4 ( 3.6)        
#>      65 years and over     0.0 ( 0.0)     0.0 ( 0.0)        
#>      teen                  0.0 ( 0.0)     0.0 ( 0.0)        
#>   sex = Male (%)         852.4 (47.8)   985.1 (49.2)   0.028
#>   stress = stressed (%)  544.6 (30.5)   531.8 (26.6)   0.088
#>   married = single (%)   419.8 (23.5)   427.5 (21.4)   0.052
#>   income (%)                                           0.222
#>      $29,999 or less     266.6 (14.9)   352.4 (17.6)        
#>      $30,000-$49,999     462.8 (25.9)   348.8 (17.4)        
#>      $50,000-$79,999     298.6 (16.7)   315.7 (15.8)        
#>      $80,000 or more     755.5 (42.4)   985.2 (49.2)        
#>   race = White (%)      1129.2 (63.3)  1364.9 (68.2)   0.103
#>   bmi (%)                                              0.045
#>      Underweight           0.0 ( 0.0)     0.0 ( 0.0)        
#>      healthy weight      483.3 (27.1)   583.3 (29.1)        
#>      Overweight         1300.2 (72.9)  1418.8 (70.9)        
#>   phyact (%)                                           0.054
#>      Active              448.0 (25.1)   493.9 (24.7)        
#>      Inactive           1075.2 (60.3)  1176.6 (58.8)        
#>      Moderate            260.3 (14.6)   331.5 (16.6)        
#>   smoke (%)                                            0.288
#>      Never smoker        400.2 (22.4)   265.8 (13.3)        
#>      Current smoker      548.8 (30.8)   836.6 (41.8)        
#>      Former smoker       834.6 (46.8)   899.6 (44.9)        
#>   doctor = Yes (%)      1376.0 (77.1)  1430.1 (71.4)   0.131
#>   drink (%)                                            0.194
#>      Never drank          44.0 ( 2.5)   112.6 ( 5.6)        
#>      Current drinker    1464.1 (82.1)  1510.6 (75.5)        
#>      Former driker       275.4 (15.4)   378.9 (18.9)        
#>   bp = Yes (%)           166.6 ( 9.3)   153.5 ( 7.7)   0.060
#>   immigrate (%)                                        0.235
#>      not immigrant      1694.8 (95.0)  1774.1 (88.6)        
#>      > 10 years           88.8 ( 5.0)   228.0 (11.4)        
#>      recent                0.0 ( 0.0)     0.0 ( 0.0)        
#>   fruit (%)                                            0.293
#>      0-3 daily serving   426.1 (23.9)   480.8 (24.0)        
#>      4-6 daily serving   748.1 (41.9)  1082.3 (54.1)        
#>      6+ daily serving    609.4 (34.2)   439.0 (21.9)        
#>   diab = Yes (%)           0.0 ( 0.0)    83.6 ( 4.2)   0.295
#>   edu (%)                                              0.172
#>      < 2ndary            324.9 (18.2)   342.8 (17.1)        
#>      2nd grad.            18.8 ( 1.1)    47.6 ( 2.4)        
#>      Other 2nd grad.      15.2 ( 0.9)    52.2 ( 2.6)        
#>      Post-2nd grad.     1424.7 (79.9)  1559.4 (77.9)
Outcome analysis
fit.design <- svyglm(CVD ~ OA, design = w.design.m, 
       family = binomial(logit))
#> Warning in eval(family$initialize): non-integer #successes in a binomial glm!
publish(fit.design)
#>  Variable Units OddsRatio       CI.95  p-value 
#>        OA            0.50 [0.08;3.09]   0.4535

Matched data with increase ratio

The matching process is repeated with a different ratio (e.g., 1:5) to explore how changing the ratio affects the covariate balance and treatment effect estimation.

# Step 1 and 2
matching.obj <- matchit(ps.formula,
                        data = analytic11n,
                        method = "nearest",
                        ratio = 5,
                        caliper = 0.2)
# see how many matched
matching.obj
#> A matchit object
#>  - method: 5:1 nearest neighbor matching without replacement
#>  - distance: Propensity score [caliper]
#>              - estimated with logistic regression
#>  - caliper: <distance> (0.013)
#>  - number of obs.: 1424 (original), 349 (matched)
#>  - target estimand: ATT
#>  - covariates: age, sex, stress, married, income, race, bmi, phyact, smoke, doctor, drink, bp, immigrate, fruit, diab, edu
OACVD.match <- match.data(matching.obj)
# Step 3
by(OACVD.match$distance, OACVD.match$OA, summary)
#> OACVD.match$OA: 0
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> 0.004643 0.039181 0.084421 0.101576 0.146403 0.418206 
#> ------------------------------------------------------------ 
#> OACVD.match$OA: 1
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> 0.004671 0.041694 0.095089 0.125262 0.183739 0.424895
tab1 <- CreateTableOne(strata = "OA", data = OACVD.match, 
                       test = FALSE, vars = var.names)
print(tab1, smd = TRUE)
#>                        Stratified by OA
#>                         0           1          SMD   
#>   n                     285         64               
#>   age (%)                                       0.217
#>      20-29 years         24 ( 8.4)   4 ( 6.2)        
#>      30-39 years         59 (20.7)  11 (17.2)        
#>      40-49 years         94 (33.0)  18 (28.1)        
#>      50-59 years         98 (34.4)  28 (43.8)        
#>      60-64 years         10 ( 3.5)   3 ( 4.7)        
#>      65 years and over    0 ( 0.0)   0 ( 0.0)        
#>      teen                 0 ( 0.0)   0 ( 0.0)        
#>   sex = Male (%)        132 (46.3)  30 (46.9)   0.011
#>   stress = stressed (%)  81 (28.4)  18 (28.1)   0.007
#>   married = single (%)   91 (31.9)  23 (35.9)   0.085
#>   income (%)                                    0.065
#>      $29,999 or less     64 (22.5)  13 (20.3)        
#>      $30,000-$49,999     65 (22.8)  14 (21.9)        
#>      $50,000-$79,999     51 (17.9)  12 (18.8)        
#>      $80,000 or more    105 (36.8)  25 (39.1)        
#>   race = White (%)      169 (59.3)  42 (65.6)   0.131
#>   bmi (%)                                       0.097
#>      Underweight          0 ( 0.0)   0 ( 0.0)        
#>      healthy weight      68 (23.9)  18 (28.1)        
#>      Overweight         217 (76.1)  46 (71.9)        
#>   phyact (%)                                    0.136
#>      Active              57 (20.0)  16 (25.0)        
#>      Inactive           178 (62.5)  36 (56.2)        
#>      Moderate            50 (17.5)  12 (18.8)        
#>   smoke (%)                                     0.097
#>      Never smoker        46 (16.1)   9 (14.1)        
#>      Current smoker     123 (43.2)  26 (40.6)        
#>      Former smoker      116 (40.7)  29 (45.3)        
#>   doctor = Yes (%)      183 (64.2)  44 (68.8)   0.096
#>   drink (%)                                     0.051
#>      Never drank         10 ( 3.5)   2 ( 3.1)        
#>      Current drinker    212 (74.4)  49 (76.6)        
#>      Former driker       63 (22.1)  13 (20.3)        
#>   bp = Yes (%)           22 ( 7.7)   5 ( 7.8)   0.003
#>   immigrate (%)                                 0.100
#>      not immigrant      266 (93.3)  58 (90.6)        
#>      > 10 years          19 ( 6.7)   6 ( 9.4)        
#>      recent               0 ( 0.0)   0 ( 0.0)        
#>   fruit (%)                                     0.149
#>      0-3 daily serving  104 (36.5)  19 (29.7)        
#>      4-6 daily serving  124 (43.5)  30 (46.9)        
#>      6+ daily serving    57 (20.0)  15 (23.4)        
#>   diab = Yes (%)         12 ( 4.2)   3 ( 4.7)   0.023
#>   edu (%)                                       0.137
#>      < 2ndary            67 (23.5)  13 (20.3)        
#>      2nd grad.            9 ( 3.2)   2 ( 3.1)        
#>      Other 2nd grad.      9 ( 3.2)   1 ( 1.6)        
#>      Post-2nd grad.     200 (70.2)  48 (75.0)
Question for the students
  • Did all of the SMDs decrease?
Look at the data
# what is weights variable now for 1:5 ratio?
head(OACVD.match)
summary(OACVD.match$weights)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.8906  0.8906  0.8906  1.0000  0.8906  4.4531

Combining matching weights

Different approaches to incorporating weights (e.g., matching weights, survey weights) are explored.

Not incorporating matching weights

analytic.miss$matched <- 0
length(analytic.miss$ID) # full data
#> [1] 397173
length(OACVD.match$ID) # matched data
#> [1] 349
length(analytic.miss$ID[analytic.miss$ID %in% OACVD.match$ID])
#> [1] 349
analytic.miss$matched[analytic.miss$ID %in% OACVD.match$ID] <- 1
table(analytic.miss$matched)
#> 
#>      0      1 
#> 396824    349
w.design0 <- svydesign(id=~1, weights=~survey.weight, 
                      data=analytic.miss)
w.design.m <- subset(w.design0, matched == 1)
fit.design <- svyglm(CVD ~ OA, design = w.design.m, 
       family = binomial(logit))
#> Warning in eval(family$initialize): non-integer #successes in a binomial glm!
publish(fit.design)
#>  Variable Units OddsRatio       CI.95 p-value 
#>        OA            0.79 [0.23;2.80]   0.721
Incorporating matching weights
analytic.miss$matched <- 0
length(analytic.miss$ID) # full data
#> [1] 397173
length(OACVD.match$ID) # matched data
#> [1] 349
length(analytic.miss$ID[analytic.miss$ID %in% OACVD.match$ID])
#> [1] 349
analytic.miss$matched[analytic.miss$ID %in% OACVD.match$ID] <- 1
table(analytic.miss$matched)
#> 
#>      0      1 
#> 396824    349
# multiply with matching (ratio) weights with survey weights
analytic.miss$combined.weight <- 0
analytic.miss$combined.weight[analytic.miss$ID %in% OACVD.match$ID] <-
  OACVD.match$weights*OACVD.match$survey.weight
w.design0 <- svydesign(id=~1, weights=~combined.weight, 
                      data=analytic.miss)
w.design.m <- subset(w.design0, matched == 1)
fit.design <- svyglm(I(CVD=="event") ~ OA, design = w.design.m, 
       family = binomial(logit))
#> Warning in eval(family$initialize): non-integer #successes in a binomial glm!
publish(fit.design)
#>  Variable Units OddsRatio       CI.95  p-value 
#>        OA            1.14 [0.32;4.07]   0.8409

Matched with replacement

Matching is performed with replacement, allowing control units to be used in more than one match.

# Step 1 and 2
matching.obj <- matchit(ps.formula,
                        data = analytic11n,
                        method = "nearest",
                        ratio = 5,
                        caliper = 0.2,
                        replace = TRUE)
# see how many matched
matching.obj
#> A matchit object
#>  - method: 5:1 nearest neighbor matching with replacement
#>  - distance: Propensity score [caliper]
#>              - estimated with logistic regression
#>  - caliper: <distance> (0.013)
#>  - number of obs.: 1424 (original), 308 (matched)
#>  - target estimand: ATT
#>  - covariates: age, sex, stress, married, income, race, bmi, phyact, smoke, doctor, drink, bp, immigrate, fruit, diab, edu
OACVD.match <- match.data(matching.obj)
# Step 3
by(OACVD.match$distance, OACVD.match$OA, summary)
#> OACVD.match$OA: 0
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> 0.004643 0.034244 0.067224 0.100845 0.148958 0.418206 
#> ------------------------------------------------------------ 
#> OACVD.match$OA: 1
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> 0.004671 0.042322 0.097877 0.129743 0.189255 0.424895
tab1 <- CreateTableOne(strata = "OA", data = OACVD.match, 
                       test = FALSE, vars = var.names)
print(tab1, smd = TRUE)
#>                        Stratified by OA
#>                         0           1          SMD   
#>   n                     243         65               
#>   age (%)                                       0.266
#>      20-29 years         22 ( 9.1)   4 ( 6.2)        
#>      30-39 years         57 (23.5)  11 (16.9)        
#>      40-49 years         74 (30.5)  18 (27.7)        
#>      50-59 years         82 (33.7)  29 (44.6)        
#>      60-64 years          8 ( 3.3)   3 ( 4.6)        
#>      65 years and over    0 ( 0.0)   0 ( 0.0)        
#>      teen                 0 ( 0.0)   0 ( 0.0)        
#>   sex = Male (%)        113 (46.5)  31 (47.7)   0.024
#>   stress = stressed (%)  67 (27.6)  19 (29.2)   0.037
#>   married = single (%)   75 (30.9)  23 (35.4)   0.096
#>   income (%)                                    0.079
#>      $29,999 or less     53 (21.8)  13 (20.0)        
#>      $30,000-$49,999     57 (23.5)  14 (21.5)        
#>      $50,000-$79,999     44 (18.1)  12 (18.5)        
#>      $80,000 or more     89 (36.6)  26 (40.0)        
#>   race = White (%)      146 (60.1)  42 (64.6)   0.094
#>   bmi (%)                                       0.088
#>      Underweight          0 ( 0.0)   0 ( 0.0)        
#>      healthy weight      58 (23.9)  18 (27.7)        
#>      Overweight         185 (76.1)  47 (72.3)        
#>   phyact (%)                                    0.160
#>      Active              45 (18.5)  16 (24.6)        
#>      Inactive           155 (63.8)  37 (56.9)        
#>      Moderate            43 (17.7)  12 (18.5)        
#>   smoke (%)                                     0.095
#>      Never smoker        40 (16.5)   9 (13.8)        
#>      Current smoker     101 (41.6)  26 (40.0)        
#>      Former smoker      102 (42.0)  30 (46.2)        
#>   doctor = Yes (%)      152 (62.6)  45 (69.2)   0.141
#>   drink (%)                                     0.059
#>      Never drank          9 ( 3.7)   2 ( 3.1)        
#>      Current drinker    181 (74.5)  50 (76.9)        
#>      Former driker       53 (21.8)  13 (20.0)        
#>   bp = Yes (%)           19 ( 7.8)   5 ( 7.7)   0.005
#>   immigrate (%)                                 0.115
#>      not immigrant      228 (93.8)  59 (90.8)        
#>      > 10 years          15 ( 6.2)   6 ( 9.2)        
#>      recent               0 ( 0.0)   0 ( 0.0)        
#>   fruit (%)                                     0.147
#>      0-3 daily serving   87 (35.8)  19 (29.2)        
#>      4-6 daily serving  109 (44.9)  31 (47.7)        
#>      6+ daily serving    47 (19.3)  15 (23.1)        
#>   diab = Yes (%)         11 ( 4.5)   3 ( 4.6)   0.004
#>   edu (%)                                       0.175
#>      < 2ndary            58 (23.9)  13 (20.0)        
#>      2nd grad.            8 ( 3.3)   2 ( 3.1)        
#>      Other 2nd grad.      9 ( 3.7)   1 ( 1.5)        
#>      Post-2nd grad.     168 (69.1)  49 (75.4)

Question for the students

  • Did all of the SMDs decrease?

Look at the data

# what is weights variable now for 1:5 ratio?
head(OACVD.match)
summary(OACVD.match$weights)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.7477  0.7477  0.7477  1.0000  1.0000  7.4769

Survey design

The matched data is converted into a survey design object, and the treatment effect is estimated while accounting for the complex survey design.

analytic.miss$matched <- 0
length(analytic.miss$ID) # full data
#> [1] 397173
length(OACVD.match$ID) # matched data
#> [1] 308
length(analytic.miss$ID[analytic.miss$ID %in% OACVD.match$ID])
#> [1] 308
analytic.miss$matched[analytic.miss$ID %in% OACVD.match$ID] <- 1
table(analytic.miss$matched)
#> 
#>      0      1 
#> 396865    308
# multiply with matching (ratio) weights with survey weights
analytic.miss$combined.weight <- 0
analytic.miss$combined.weight[analytic.miss$ID %in% OACVD.match$ID] <-
  OACVD.match$weights*OACVD.match$survey.weight
w.design0 <- svydesign(id=~1, weights=~combined.weight, 
                      data=analytic.miss)
w.design.m <- subset(w.design0, matched == 1)
fit.design <- svyglm(I(CVD=="event") ~ OA, design = w.design.m, 
       family = binomial(logit))
#> Warning in eval(family$initialize): non-integer #successes in a binomial glm!
publish(fit.design)
#>  Variable Units OddsRatio       CI.95  p-value 
#>        OA            0.99 [0.26;3.73]   0.9932

Video content (optional)

Tip

For those who prefer a video walkthrough, feel free to watch the video below, which offers a description of an earlier version of the above content.

RDocumentation. 2023. “Matchit: Matchit: Matching Software for Causal Inference.” https://www.rdocumentation.org/packages/MatchIt/versions/1.0-1/topics/matchit.