CCHS: Bivariate analysis
The following tutorial is performing bivariate analysis on our CCHS analytic dataset to examine relationships between two variables (association question).
We load several R packages required for bivariate analysis, statistical tests, and data visualization.
Load data
We load the dataset into the R environment and lists all available variables and objects.
Preparing data
Weights
Here, the weights of survey respondents are accumulated, to account for the combination of different cycles of the data.
Fixing variable types
We convert several variables to categorical or “factor” types, which are better suited for some statistical analysis when variables have categories.
var.names <- c("CVD", "age", "sex", "married", "race", "edu", "income", "bmi",
"phyact", "doctor", "stress", "smoke", "drink", "fruit", "bp",
"diab", "province", "OA", "immigrate")
analytic.miss[var.names] <- lapply(analytic.miss[var.names] , factor)
str(analytic.miss)
#> 'data.frame': 397173 obs. of 22 variables:
#> $ CVD : Factor w/ 2 levels "event","no event": 1 2 2 2 2 2 2 2 2 2 ...
#> $ age : Factor w/ 7 levels "20-29 years",..: 6 6 2 6 1 6 3 7 1 1 ...
#> $ sex : Factor w/ 2 levels "Female","Male": 1 1 2 1 1 2 2 2 1 2 ...
#> $ married : Factor w/ 2 levels "not single","single": 2 2 1 2 2 1 1 2 2 2 ...
#> $ race : Factor w/ 2 levels "Non-white","White": 2 2 2 2 2 2 2 2 2 2 ...
#> $ edu : Factor w/ 4 levels "< 2ndary","2nd grad.",..: 2 4 4 4 4 4 4 1 4 4 ...
#> $ income : Factor w/ 4 levels "$29,999 or less",..: 1 1 4 1 2 2 1 1 NA 4 ...
#> $ bmi : Factor w/ 3 levels "Underweight",..: NA NA 2 NA 2 NA 3 NA 2 3 ...
#> $ phyact : Factor w/ 3 levels "Active","Inactive",..: 2 2 2 2 2 2 1 1 2 3 ...
#> $ doctor : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
#> $ stress : Factor w/ 2 levels "Not too stressed",..: 1 1 2 1 1 1 1 NA 1 1 ...
#> $ smoke : Factor w/ 3 levels "Never smoker",..: 3 1 3 3 2 2 3 1 2 2 ...
#> $ drink : Factor w/ 3 levels "Never drank",..: 2 1 2 2 2 2 3 1 2 2 ...
#> $ fruit : Factor w/ 3 levels "0-3 daily serving",..: 2 2 2 3 3 3 2 2 2 2 ...
#> $ bp : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
#> $ diab : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
#> $ province : Factor w/ 2 levels "South","North": 1 1 1 1 1 1 1 1 1 1 ...
#> $ weight : num 47.6 23.8 56.1 23.8 65.4 ...
#> $ cycle : Factor w/ 3 levels "11","21","31": 1 1 1 1 1 1 1 1 1 1 ...
#> $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ OA : Factor w/ 2 levels "Control","OA": 1 1 1 1 1 1 1 1 1 1 ...
#> $ immigrate: Factor w/ 3 levels "not immigrant",..: 1 1 3 1 1 1 1 1 1 1 ...
The code identifies rows where data is missing and labels them for later analyses.
analytic.miss$miss <- 1
head(analytic.miss$ID) # full data
#> [1] 1 2 3 4 5 6
head(analytic2$ID) # complete case
#> [1] 3 5 7 10 11 13
head(analytic.miss$ID[analytic.miss$ID %in% analytic2$ID])
#> [1] 3 5 7 10 11 13
analytic.miss$miss[analytic.miss$ID %in% analytic2$ID] <- 0
table(analytic.miss$miss)
#>
#> 0 1
#> 185613 211560
Setting Design
The code sets up the survey design, specifying weights (but no specific clustering and stratification, as they are unavailable for CCHS public access data), for use in survey-weighted analyses.
require(survey)
summary(analytic.miss$weight)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.39 21.76 42.21 66.70 81.07 2384.98
w.design0 <- svydesign(id=~1, weights=~weight,
data=analytic.miss)
summary(weights(w.design0))
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.39 21.76 42.21 66.70 81.07 2384.98
sd(weights(w.design0))
#> [1] 80.34263
This creates a subset of the data where there are no missing values. Note that subset was done to the design object w.design0
, not the data analytic.miss
.
Bivariate analysis
Table 1 (weighted)
Stratified by exposure
These tables contain descriptive statistics, stratified by different categories. They can be useful for understanding how variables relate to the exposure or outcome in the data.
require(tableone)
var.names <- c("CVD", "age", "sex", "married", "race", "edu", "income", "bmi",
"phyact", "doctor", "stress", "smoke", "drink", "fruit", "bp",
"diab", "province", "immigrate") # exclude "OA"
# tab1 <- CreateTableOne(var = var.names, strata= "OA", data=analytic.miss, test = TRUE)
# print(tab1)
tab2 <- svyCreateTableOne(var = var.names, strata= "OA",
data=w.design, test = TRUE)
print(tab2)
#> Stratified by OA
#> Control OA p test
#> n 12124961.5 1153392.2
#> CVD = no event (%) 11786450.6 (97.2) 1020965.5 (88.5) <0.001
#> age (%) <0.001
#> 20-29 years 2668880.8 (22.0) 28317.5 ( 2.5)
#> 30-39 years 3009426.7 (24.8) 77159.3 ( 6.7)
#> 40-49 years 3108300.9 (25.6) 211515.1 (18.3)
#> 50-59 years 1900845.3 (15.7) 350264.7 (30.4)
#> 60-64 years 546041.8 ( 4.5) 163724.7 (14.2)
#> 65 years and over 624706.7 ( 5.2) 322033.4 (27.9)
#> teen 266759.3 ( 2.2) 377.4 ( 0.0)
#> sex = Male (%) 6374765.5 (52.6) 379850.5 (32.9) <0.001
#> married = single (%) 4120611.0 (34.0) 367647.7 (31.9) <0.001
#> race = White (%) 10312228.2 (85.0) 1081778.6 (93.8) <0.001
#> edu (%) <0.001
#> < 2ndary 1752318.3 (14.5) 309652.8 (26.8)
#> 2nd grad. 2314713.1 (19.1) 203437.5 (17.6)
#> Other 2nd grad. 1078645.2 ( 8.9) 79255.1 ( 6.9)
#> Post-2nd grad. 6979284.9 (57.6) 561046.8 (48.6)
#> income (%) <0.001
#> $29,999 or less 2051640.6 (16.9) 353862.9 (30.7)
#> $30,000-$49,999 2436063.7 (20.1) 272484.1 (23.6)
#> $50,000-$79,999 3495902.5 (28.8) 275115.8 (23.9)
#> $80,000 or more 4141354.6 (34.2) 251929.4 (21.8)
#> bmi (%) <0.001
#> Underweight 346004.9 ( 2.9) 22064.6 ( 1.9)
#> healthy weight 6019004.1 (49.6) 431570.2 (37.4)
#> Overweight 5759952.5 (47.5) 699757.4 (60.7)
#> phyact (%) <0.001
#> Active 3037314.2 (25.1) 216879.5 (18.8)
#> Inactive 5982492.3 (49.3) 647856.2 (56.2)
#> Moderate 3105154.9 (25.6) 288656.5 (25.0)
#> doctor = Yes (%) 10087473.8 (83.2) 1090763.9 (94.6) <0.001
#> stress = stressed (%) 3123770.9 (25.8) 301895.2 (26.2) 0.420
#> smoke (%) <0.001
#> Never smoker 4043479.9 (33.3) 320323.7 (27.8)
#> Current smoker 3219168.6 (26.5) 275835.7 (23.9)
#> Former smoker 4862313.0 (40.1) 557232.7 (48.3)
#> drink (%) <0.001
#> Never drank 678435.7 ( 5.6) 66085.0 ( 5.7)
#> Current drinker 10297713.4 (84.9) 887808.2 (77.0)
#> Former driker 1148812.3 ( 9.5) 199498.9 (17.3)
#> fruit (%) <0.001
#> 0-3 daily serving 3214156.0 (26.5) 236483.8 (20.5)
#> 4-6 daily serving 6001124.3 (49.5) 588323.8 (51.0)
#> 6+ daily serving 2909681.1 (24.0) 328584.5 (28.5)
#> bp = Yes (%) 1212548.2 (10.0) 347269.8 (30.1) <0.001
#> diab = Yes (%) 377876.2 ( 3.1) 104541.0 ( 9.1) <0.001
#> province = North (%) 27124.3 ( 0.2) 1825.1 ( 0.2) <0.001
#> immigrate (%) <0.001
#> not immigrant 9898636.6 (81.6) 994682.5 (86.2)
#> > 10 years 1384672.6 (11.4) 146879.8 (12.7)
#> recent 841652.3 ( 6.9) 11829.8 ( 1.0)
Stratified by outcome
This table is generally useful for logistic regression analysis
var.names <- c("OA", "age", "sex", "married", "race", "edu", "income", "bmi",
"phyact", "doctor", "stress", "smoke", "drink", "fruit", "bp",
"diab", "province", "immigrate") # exclude "CVD"
tab3 <- svyCreateTableOne(var = var.names, strata= "CVD", data=w.design, test = TRUE)
print(tab3)
#> Stratified by CVD
#> event no event p test
#> n 470937.5 12807416.1
#> OA = OA (%) 132426.7 (28.1) 1020965.5 ( 8.0) <0.001
#> age (%) <0.001
#> 20-29 years 14966.0 ( 3.2) 2682232.3 (20.9)
#> 30-39 years 24105.5 ( 5.1) 3062480.5 (23.9)
#> 40-49 years 63520.1 (13.5) 3256296.0 (25.4)
#> 50-59 years 122613.0 (26.0) 2128497.0 (16.6)
#> 60-64 years 72328.3 (15.4) 637438.1 ( 5.0)
#> 65 years and over 172742.2 (36.7) 773997.8 ( 6.0)
#> teen 662.3 ( 0.1) 266474.4 ( 2.1)
#> sex = Male (%) 267743.8 (56.9) 6486872.2 (50.6) <0.001
#> married = single (%) 143747.0 (30.5) 4344511.8 (33.9) <0.001
#> race = White (%) 434591.6 (92.3) 10959415.2 (85.6) <0.001
#> edu (%) <0.001
#> < 2ndary 147338.4 (31.3) 1914632.6 (14.9)
#> 2nd grad. 77705.6 (16.5) 2440445.0 (19.1)
#> Other 2nd grad. 30921.3 ( 6.6) 1126979.0 ( 8.8)
#> Post-2nd grad. 214972.2 (45.6) 7325359.4 (57.2)
#> income (%) <0.001
#> $29,999 or less 164929.4 (35.0) 2240574.2 (17.5)
#> $30,000-$49,999 109988.2 (23.4) 2598559.6 (20.3)
#> $50,000-$79,999 103091.1 (21.9) 3667927.1 (28.6)
#> $80,000 or more 92928.7 (19.7) 4300355.3 (33.6)
#> bmi (%) <0.001
#> Underweight 8844.4 ( 1.9) 359225.0 ( 2.8)
#> healthy weight 173475.1 (36.8) 6277099.2 (49.0)
#> Overweight 288617.9 (61.3) 6171091.9 (48.2)
#> phyact (%) <0.001
#> Active 85140.3 (18.1) 3169053.4 (24.7)
#> Inactive 274968.8 (58.4) 6355379.7 (49.6)
#> Moderate 110828.4 (23.5) 3282983.0 (25.6)
#> doctor = Yes (%) 445493.3 (94.6) 10732744.5 (83.8) <0.001
#> stress = stressed (%) 113282.5 (24.1) 3312383.7 (25.9) 0.023
#> smoke (%) <0.001
#> Never smoker 119434.6 (25.4) 4244368.9 (33.1)
#> Current smoker 97328.0 (20.7) 3397676.3 (26.5)
#> Former smoker 254174.9 (54.0) 5165370.9 (40.3)
#> drink (%) <0.001
#> Never drank 29444.3 ( 6.3) 715076.4 ( 5.6)
#> Current drinker 344405.1 (73.1) 10841116.6 (84.6)
#> Former driker 97088.1 (20.6) 1251223.1 ( 9.8)
#> fruit (%) 0.001
#> 0-3 daily serving 111803.5 (23.7) 3338836.3 (26.1)
#> 4-6 daily serving 233403.5 (49.6) 6356044.7 (49.6)
#> 6+ daily serving 125730.4 (26.7) 3112535.1 (24.3)
#> bp = Yes (%) 209257.0 (44.4) 1350561.0 (10.5) <0.001
#> diab = Yes (%) 78762.9 (16.7) 403654.4 ( 3.2) <0.001
#> province = North (%) 702.8 ( 0.1) 28246.6 ( 0.2) 0.005
#> immigrate (%) <0.001
#> not immigrant 389553.0 (82.7) 10503766.2 (82.0)
#> > 10 years 69008.0 (14.7) 1462544.4 (11.4)
#> recent 12376.5 ( 2.6) 841105.5 ( 6.6)
How did they calculate the p-values? Hint: svychisq
(see below).
Proportions and Design Effect
This part computes proportions and design effects, which help understand the influence of the sampling design on the estimated statistics.
require(survey)
# Computing survey statistics on subsets of a survey defined by factor(s).
fit0a <- svyby(~CVD,~OA,design=w.design, svymean,deff=TRUE)
fit0a
Let
-
\(\theta\) = parameter (population slope) and
- \(\hat(\theta)\) = statistic (estimated slope).
\(b = \frac{\sum[w (y_i-\bar{y}) (x_i-\bar{x})]}{\sum[w (x_i-\bar{x})^2]}\)
DE = Effect of complex survey on the SEs, relative to a SRS of equal size.
- \(D^2(\hat{\theta}) = \frac{Var(\hat{\theta})_{Complex Survey}}{Var(\hat{\theta})_{SRS}}\)
- \(D^2(\hat{\theta}) = \frac{SE(\hat{\theta})^2_{Complex Survey}}{SE(\hat{\theta})^2_{SRS}}\)
Note:
- SE increases as value of weight increases (CCHS).
- NHANES has more things to worry about (strata, PSU)
DEFF = 2 means that the variance of the sample proportion, when choosing the sample by complex survey sampling, is nearly 2 times as large as the variance of the same estimator under simple random sampling/SRS.
Testing association
Here, Chi-square tests are conducted to test the association between different variables. Two variants of the test are used: Rao-Scott and Thomas-Rao modifications. These adaptations are used when the data come from a complex survey design.
- Tests for hypothesis
- Rao-Scott modifications (chi-sq)
- Thomas-Rao modifications (F)
# Rao-Scott modifications (chi-sq)
svychisq(~CVD+OA,design=w.design, statistic="Chisq")
#>
#> Pearson's X^2: Rao & Scott adjustment
#>
#> data: svychisq(~CVD + OA, design = w.design, statistic = "Chisq")
#> X-squared = 3249.7, df = 1, p-value < 2.2e-16
# Thomas-Rao modifications (F)
svychisq(~CVD+OA,design=w.design, statistic="F")
#>
#> Pearson's X^2: Rao & Scott adjustment
#>
#> data: svychisq(~CVD + OA, design = w.design, statistic = "F")
#> F = 1863, ndf = 1, ddf = 185612, p-value < 2.2e-16
# Both provide strong evidence to reject the null hypothesis.
# Conclusion: there is a significant (at 5%) association
# between CVD prevalence and OA.
svychisq(~CVD+fruit,design=w.design, statistic="F")
#>
#> Pearson's X^2: Rao & Scott adjustment
#>
#> data: svychisq(~CVD + fruit, design = w.design, statistic = "F")
#> F = 7.1241, ndf = 1.9758e+00, ddf = 3.6673e+05, p-value = 0.0008503
svychisq(~CVD+province,design=w.design, statistic="Chisq")
#>
#> Pearson's X^2: Rao & Scott adjustment
#>
#> data: svychisq(~CVD + province, design = w.design, statistic = "Chisq")
#> X-squared = 1.4848, df = 1, p-value = 0.00492
Saving data
Finally, the dataset, along with any new variables or subsets created during the analysis, is saved for future use.
Video content (optional)
For those who prefer a video walkthrough, feel free to watch the video below, which offers a description of an earlier version of the above content.