CCHS: Bivariate analysis

The following tutorial is performing bivariate analysis on our CCHS analytic dataset to examine relationships between two variables (association question).

We load several R packages required for bivariate analysis, statistical tests, and data visualization.

# Load required packages
library(survey)
library(knitr)
library(car)
library(tableone)
library(DataExplorer)
library(Publish)
library(ROCR)
library(WeightedROC)
library(jtools)

Load data

We load the dataset into the R environment and lists all available variables and objects.

load("Data/surveydata/cchs123b.RData")
ls()
#> [1] "analytic.miss"   "analytic2"       "has_annotations"

Preparing data

Weights

Here, the weights of survey respondents are accumulated, to account for the combination of different cycles of the data.

analytic.miss$weight <- analytic.miss$weight/3 # 3 cycles combined

Fixing variable types

We convert several variables to categorical or “factor” types, which are better suited for some statistical analysis when variables have categories.

var.names <- c("CVD", "age", "sex", "married", "race", "edu", "income", "bmi", 
               "phyact", "doctor", "stress", "smoke", "drink", "fruit", "bp", 
               "diab", "province", "OA", "immigrate")
analytic.miss[var.names] <- lapply(analytic.miss[var.names] , factor)
str(analytic.miss)
#> 'data.frame':    397173 obs. of  22 variables:
#>  $ CVD      : Factor w/ 2 levels "event","no event": 1 2 2 2 2 2 2 2 2 2 ...
#>  $ age      : Factor w/ 7 levels "20-29 years",..: 6 6 2 6 1 6 3 7 1 1 ...
#>  $ sex      : Factor w/ 2 levels "Female","Male": 1 1 2 1 1 2 2 2 1 2 ...
#>  $ married  : Factor w/ 2 levels "not single","single": 2 2 1 2 2 1 1 2 2 2 ...
#>  $ race     : Factor w/ 2 levels "Non-white","White": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ edu      : Factor w/ 4 levels "< 2ndary","2nd grad.",..: 2 4 4 4 4 4 4 1 4 4 ...
#>  $ income   : Factor w/ 4 levels "$29,999 or less",..: 1 1 4 1 2 2 1 1 NA 4 ...
#>  $ bmi      : Factor w/ 3 levels "Underweight",..: NA NA 2 NA 2 NA 3 NA 2 3 ...
#>  $ phyact   : Factor w/ 3 levels "Active","Inactive",..: 2 2 2 2 2 2 1 1 2 3 ...
#>  $ doctor   : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ stress   : Factor w/ 2 levels "Not too stressed",..: 1 1 2 1 1 1 1 NA 1 1 ...
#>  $ smoke    : Factor w/ 3 levels "Never smoker",..: 3 1 3 3 2 2 3 1 2 2 ...
#>  $ drink    : Factor w/ 3 levels "Never drank",..: 2 1 2 2 2 2 3 1 2 2 ...
#>  $ fruit    : Factor w/ 3 levels "0-3 daily serving",..: 2 2 2 3 3 3 2 2 2 2 ...
#>  $ bp       : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ diab     : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ province : Factor w/ 2 levels "South","North": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ weight   : num  47.6 23.8 56.1 23.8 65.4 ...
#>  $ cycle    : Factor w/ 3 levels "11","21","31": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ ID       : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ OA       : Factor w/ 2 levels "Control","OA": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ immigrate: Factor w/ 3 levels "not immigrant",..: 1 1 3 1 1 1 1 1 1 1 ...

The code identifies rows where data is missing and labels them for later analyses.

analytic.miss$miss <- 1
head(analytic.miss$ID) # full data
#> [1] 1 2 3 4 5 6
head(analytic2$ID) # complete case
#> [1]  3  5  7 10 11 13
head(analytic.miss$ID[analytic.miss$ID %in% analytic2$ID])
#> [1]  3  5  7 10 11 13
analytic.miss$miss[analytic.miss$ID %in% analytic2$ID] <- 0
table(analytic.miss$miss)
#> 
#>      0      1 
#> 185613 211560

Setting Design

The code sets up the survey design, specifying weights (but no specific clustering and stratification, as they are unavailable for CCHS public access data), for use in survey-weighted analyses.

require(survey)
summary(analytic.miss$weight)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    0.39   21.76   42.21   66.70   81.07 2384.98
w.design0 <- svydesign(id=~1, weights=~weight, 
                      data=analytic.miss)
summary(weights(w.design0))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    0.39   21.76   42.21   66.70   81.07 2384.98
sd(weights(w.design0))
#> [1] 80.34263

This creates a subset of the data where there are no missing values. Note that subset was done to the design object w.design0, not the data analytic.miss.

w.design <- subset(w.design0, miss == 0)
summary(weights(w.design))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    0.39   23.85   45.98   71.54   87.30 2384.98
sd(weights(w.design))
#> [1] 84.97819

Bivariate analysis

Table 1 (weighted)

Stratified by exposure

These tables contain descriptive statistics, stratified by different categories. They can be useful for understanding how variables relate to the exposure or outcome in the data.

require(tableone)
var.names <- c("CVD", "age", "sex", "married", "race", "edu", "income", "bmi", 
               "phyact", "doctor", "stress", "smoke", "drink", "fruit", "bp", 
               "diab", "province", "immigrate") # exclude "OA"
# tab1 <- CreateTableOne(var = var.names, strata= "OA", data=analytic.miss, test = TRUE)
# print(tab1)
tab2 <- svyCreateTableOne(var = var.names, strata= "OA", 
                          data=w.design, test = TRUE)
print(tab2)
#>                        Stratified by OA
#>                         Control            OA                p      test
#>   n                     12124961.5         1153392.2                    
#>   CVD = no event (%)    11786450.6 (97.2)  1020965.5 (88.5)  <0.001     
#>   age (%)                                                    <0.001     
#>      20-29 years         2668880.8 (22.0)    28317.5 ( 2.5)             
#>      30-39 years         3009426.7 (24.8)    77159.3 ( 6.7)             
#>      40-49 years         3108300.9 (25.6)   211515.1 (18.3)             
#>      50-59 years         1900845.3 (15.7)   350264.7 (30.4)             
#>      60-64 years          546041.8 ( 4.5)   163724.7 (14.2)             
#>      65 years and over    624706.7 ( 5.2)   322033.4 (27.9)             
#>      teen                 266759.3 ( 2.2)      377.4 ( 0.0)             
#>   sex = Male (%)         6374765.5 (52.6)   379850.5 (32.9)  <0.001     
#>   married = single (%)   4120611.0 (34.0)   367647.7 (31.9)  <0.001     
#>   race = White (%)      10312228.2 (85.0)  1081778.6 (93.8)  <0.001     
#>   edu (%)                                                    <0.001     
#>      < 2ndary            1752318.3 (14.5)   309652.8 (26.8)             
#>      2nd grad.           2314713.1 (19.1)   203437.5 (17.6)             
#>      Other 2nd grad.     1078645.2 ( 8.9)    79255.1 ( 6.9)             
#>      Post-2nd grad.      6979284.9 (57.6)   561046.8 (48.6)             
#>   income (%)                                                 <0.001     
#>      $29,999 or less     2051640.6 (16.9)   353862.9 (30.7)             
#>      $30,000-$49,999     2436063.7 (20.1)   272484.1 (23.6)             
#>      $50,000-$79,999     3495902.5 (28.8)   275115.8 (23.9)             
#>      $80,000 or more     4141354.6 (34.2)   251929.4 (21.8)             
#>   bmi (%)                                                    <0.001     
#>      Underweight          346004.9 ( 2.9)    22064.6 ( 1.9)             
#>      healthy weight      6019004.1 (49.6)   431570.2 (37.4)             
#>      Overweight          5759952.5 (47.5)   699757.4 (60.7)             
#>   phyact (%)                                                 <0.001     
#>      Active              3037314.2 (25.1)   216879.5 (18.8)             
#>      Inactive            5982492.3 (49.3)   647856.2 (56.2)             
#>      Moderate            3105154.9 (25.6)   288656.5 (25.0)             
#>   doctor = Yes (%)      10087473.8 (83.2)  1090763.9 (94.6)  <0.001     
#>   stress = stressed (%)  3123770.9 (25.8)   301895.2 (26.2)   0.420     
#>   smoke (%)                                                  <0.001     
#>      Never smoker        4043479.9 (33.3)   320323.7 (27.8)             
#>      Current smoker      3219168.6 (26.5)   275835.7 (23.9)             
#>      Former smoker       4862313.0 (40.1)   557232.7 (48.3)             
#>   drink (%)                                                  <0.001     
#>      Never drank          678435.7 ( 5.6)    66085.0 ( 5.7)             
#>      Current drinker    10297713.4 (84.9)   887808.2 (77.0)             
#>      Former driker       1148812.3 ( 9.5)   199498.9 (17.3)             
#>   fruit (%)                                                  <0.001     
#>      0-3 daily serving   3214156.0 (26.5)   236483.8 (20.5)             
#>      4-6 daily serving   6001124.3 (49.5)   588323.8 (51.0)             
#>      6+ daily serving    2909681.1 (24.0)   328584.5 (28.5)             
#>   bp = Yes (%)           1212548.2 (10.0)   347269.8 (30.1)  <0.001     
#>   diab = Yes (%)          377876.2 ( 3.1)   104541.0 ( 9.1)  <0.001     
#>   province = North (%)     27124.3 ( 0.2)     1825.1 ( 0.2)  <0.001     
#>   immigrate (%)                                              <0.001     
#>      not immigrant       9898636.6 (81.6)   994682.5 (86.2)             
#>      > 10 years          1384672.6 (11.4)   146879.8 (12.7)             
#>      recent               841652.3 ( 6.9)    11829.8 ( 1.0)
Stratified by outcome

This table is generally useful for logistic regression analysis

var.names <- c("OA", "age", "sex", "married", "race", "edu", "income", "bmi", 
               "phyact", "doctor", "stress", "smoke", "drink", "fruit", "bp", 
               "diab", "province", "immigrate") # exclude "CVD"
tab3 <- svyCreateTableOne(var = var.names, strata= "CVD", data=w.design, test = TRUE)
print(tab3)
#>                        Stratified by CVD
#>                         event            no event           p      test
#>   n                     470937.5         12807416.1                    
#>   OA = OA (%)           132426.7 (28.1)   1020965.5 ( 8.0)  <0.001     
#>   age (%)                                                   <0.001     
#>      20-29 years         14966.0 ( 3.2)   2682232.3 (20.9)             
#>      30-39 years         24105.5 ( 5.1)   3062480.5 (23.9)             
#>      40-49 years         63520.1 (13.5)   3256296.0 (25.4)             
#>      50-59 years        122613.0 (26.0)   2128497.0 (16.6)             
#>      60-64 years         72328.3 (15.4)    637438.1 ( 5.0)             
#>      65 years and over  172742.2 (36.7)    773997.8 ( 6.0)             
#>      teen                  662.3 ( 0.1)    266474.4 ( 2.1)             
#>   sex = Male (%)        267743.8 (56.9)   6486872.2 (50.6)  <0.001     
#>   married = single (%)  143747.0 (30.5)   4344511.8 (33.9)  <0.001     
#>   race = White (%)      434591.6 (92.3)  10959415.2 (85.6)  <0.001     
#>   edu (%)                                                   <0.001     
#>      < 2ndary           147338.4 (31.3)   1914632.6 (14.9)             
#>      2nd grad.           77705.6 (16.5)   2440445.0 (19.1)             
#>      Other 2nd grad.     30921.3 ( 6.6)   1126979.0 ( 8.8)             
#>      Post-2nd grad.     214972.2 (45.6)   7325359.4 (57.2)             
#>   income (%)                                                <0.001     
#>      $29,999 or less    164929.4 (35.0)   2240574.2 (17.5)             
#>      $30,000-$49,999    109988.2 (23.4)   2598559.6 (20.3)             
#>      $50,000-$79,999    103091.1 (21.9)   3667927.1 (28.6)             
#>      $80,000 or more     92928.7 (19.7)   4300355.3 (33.6)             
#>   bmi (%)                                                   <0.001     
#>      Underweight          8844.4 ( 1.9)    359225.0 ( 2.8)             
#>      healthy weight     173475.1 (36.8)   6277099.2 (49.0)             
#>      Overweight         288617.9 (61.3)   6171091.9 (48.2)             
#>   phyact (%)                                                <0.001     
#>      Active              85140.3 (18.1)   3169053.4 (24.7)             
#>      Inactive           274968.8 (58.4)   6355379.7 (49.6)             
#>      Moderate           110828.4 (23.5)   3282983.0 (25.6)             
#>   doctor = Yes (%)      445493.3 (94.6)  10732744.5 (83.8)  <0.001     
#>   stress = stressed (%) 113282.5 (24.1)   3312383.7 (25.9)   0.023     
#>   smoke (%)                                                 <0.001     
#>      Never smoker       119434.6 (25.4)   4244368.9 (33.1)             
#>      Current smoker      97328.0 (20.7)   3397676.3 (26.5)             
#>      Former smoker      254174.9 (54.0)   5165370.9 (40.3)             
#>   drink (%)                                                 <0.001     
#>      Never drank         29444.3 ( 6.3)    715076.4 ( 5.6)             
#>      Current drinker    344405.1 (73.1)  10841116.6 (84.6)             
#>      Former driker       97088.1 (20.6)   1251223.1 ( 9.8)             
#>   fruit (%)                                                  0.001     
#>      0-3 daily serving  111803.5 (23.7)   3338836.3 (26.1)             
#>      4-6 daily serving  233403.5 (49.6)   6356044.7 (49.6)             
#>      6+ daily serving   125730.4 (26.7)   3112535.1 (24.3)             
#>   bp = Yes (%)          209257.0 (44.4)   1350561.0 (10.5)  <0.001     
#>   diab = Yes (%)         78762.9 (16.7)    403654.4 ( 3.2)  <0.001     
#>   province = North (%)     702.8 ( 0.1)     28246.6 ( 0.2)   0.005     
#>   immigrate (%)                                             <0.001     
#>      not immigrant      389553.0 (82.7)  10503766.2 (82.0)             
#>      > 10 years          69008.0 (14.7)   1462544.4 (11.4)             
#>      recent              12376.5 ( 2.6)    841105.5 ( 6.6)

How did they calculate the p-values? Hint: svychisq (see below).

Proportions and Design Effect

This part computes proportions and design effects, which help understand the influence of the sampling design on the estimated statistics.

require(survey)
# Computing survey statistics on subsets of a survey defined by factor(s).
fit0a <- svyby(~CVD,~OA,design=w.design, svymean,deff=TRUE)
fit0a
confint(fit0a)
#>                          2.5 %    97.5 %
#> Control:CVDevent    0.02681661 0.0290204
#> OA:CVDevent         0.10847000 0.1211599
#> Control:CVDno event 0.97097960 0.9731834
#> OA:CVDno event      0.87884010 0.8915300
# 7.45% OA patients estimated to have CVD event.
# 95% CI:  (0.067, 0.0816)

Let

  • \(\theta\) = parameter (population slope) and
  • \(\hat(\theta)\) = statistic (estimated slope).

\(b = \frac{\sum[w (y_i-\bar{y}) (x_i-\bar{x})]}{\sum[w (x_i-\bar{x})^2]}\)

DE = Effect of complex survey on the SEs, relative to a SRS of equal size.

  • \(D^2(\hat{\theta}) = \frac{Var(\hat{\theta})_{Complex Survey}}{Var(\hat{\theta})_{SRS}}\)
  • \(D^2(\hat{\theta}) = \frac{SE(\hat{\theta})^2_{Complex Survey}}{SE(\hat{\theta})^2_{SRS}}\)

Note:

  1. SE increases as value of weight increases (CCHS).
  2. NHANES has more things to worry about (strata, PSU)

DEFF = 2 means that the variance of the sample proportion, when choosing the sample by complex survey sampling, is nearly 2 times as large as the variance of the same estimator under simple random sampling/SRS.

fit0b <- svyby(~CVD,~diab,design=w.design, svymean,deff=TRUE)
fit0b
confint(fit0b)
#>                     2.5 %     97.5 %
#> No:CVDevent     0.0295633 0.03173344
#> Yes:CVDevent    0.1505917 0.17594247
#> No:CVDno event  0.9682666 0.97043670
#> Yes:CVDno event 0.8240575 0.84940825

Testing association

Here, Chi-square tests are conducted to test the association between different variables. Two variants of the test are used: Rao-Scott and Thomas-Rao modifications. These adaptations are used when the data come from a complex survey design.

  • Tests for hypothesis
    • Rao-Scott modifications (chi-sq)
    • Thomas-Rao modifications (F)
# Rao-Scott modifications (chi-sq)
svychisq(~CVD+OA,design=w.design, statistic="Chisq")
#> 
#>  Pearson's X^2: Rao & Scott adjustment
#> 
#> data:  svychisq(~CVD + OA, design = w.design, statistic = "Chisq")
#> X-squared = 3249.7, df = 1, p-value < 2.2e-16

# Thomas-Rao modifications (F)
svychisq(~CVD+OA,design=w.design, statistic="F") 
#> 
#>  Pearson's X^2: Rao & Scott adjustment
#> 
#> data:  svychisq(~CVD + OA, design = w.design, statistic = "F")
#> F = 1863, ndf = 1, ddf = 185612, p-value < 2.2e-16

# Both provide strong evidence to reject the null hypothesis.
# Conclusion: there is a significant (at 5%) association 
# between CVD prevalence and OA.
svychisq(~CVD+fruit,design=w.design, statistic="F") 
#> 
#>  Pearson's X^2: Rao & Scott adjustment
#> 
#> data:  svychisq(~CVD + fruit, design = w.design, statistic = "F")
#> F = 7.1241, ndf = 1.9758e+00, ddf = 3.6673e+05, p-value = 0.0008503
svychisq(~CVD+province,design=w.design, statistic="Chisq") 
#> 
#>  Pearson's X^2: Rao & Scott adjustment
#> 
#> data:  svychisq(~CVD + province, design = w.design, statistic = "Chisq")
#> X-squared = 1.4848, df = 1, p-value = 0.00492

Saving data

Finally, the dataset, along with any new variables or subsets created during the analysis, is saved for future use.

save(w.design, analytic.miss, analytic2, file = "Data/surveydata/cchs123w.RData")

Video content (optional)

Tip

For those who prefer a video walkthrough, feel free to watch the video below, which offers a description of an earlier version of the above content.