CCHS: Revisiting PICOT

Welcome to this tutorial, where we revisit the research question presented in the Causal question-1 tutorial. This time we work with a more comprehensive set of covariates, following the guidelines from the research article by Rahman et al. (2013) (DOI:10.1136/bmjopen-2013-002624). We will also work properly with survey feature variables (e.g., sampling weights).

Remembering PICOT

Before diving into the data, let’s recall the research parameters using the PICOT framework:

  1. Target population: Canadian adults (CCHS data)

  2. Outcome (\(Y\)): CVD (Heart disease/Cardiovascular Disease)

  3. Exposure group (\(A\)): Osteoarthritis (OA)

  4. Control group: People without OA

  5. Timeline: Data collected from 2001 to 2005

In addition, we’ll identify potential confounders based on literature to better understand the relationship between exposure and outcome.

Creating dataset

To start, we’ll load the required R packages:

# Load required packages
library(survey)
library(knitr)
library(car)

Load data

We’ll be using CCHS data from various cycles. Use the following code to load this data into your R environment:

# CCHS 1.1
load("Data/surveydata/cchsc1.RData")

# CCHS 2.1
load("Data/surveydata/cchsc2.RData")

# CCHS 3.1
load("Data/surveydata/cchsc3.RData")

# objects loaded
ls()
#> [1] "c1"              "c2"              "c3"              "has_annotations"

To check the dimensions of each data set:

# Dimensions
dim(c1)
#> [1] 130880    117
dim(c2)
#> [1] 134072    112
dim(c3)
#> [1] 132221    112

Subset the data (subset variables)

Understand variable coding

Before subsetting the data, we need to understand the variables used in CCHS 3.1, which are described in detail in the official CCHS documentation.

The table below maps each variable concept to its name in the three CCHS cycles.

Variable concept                                               CCHS 1.1  CCHS 2.1  CCHS 3.1
Has heart disease                                              CCCA_121  CCCC_121  CCCE_121
Has arthritis or rheumatism                                    CCCA_051  CCCC_051  CCCE_051
Kind of arthritis                                              CCCA_05A  CCCC_05A  CCCE_05A
Age                                                            DHHAGAGE  DHHCGAGE  DHHEGAGE
Sex                                                            DHHA_SEX  DHHC_SEX  DHHE_SEX
Marital Status                                                 DHHAGMS   DHHCGMS   DHHEGMS
Cultural / racial origin                                       SDCAGRAC  SDCCGRAC  SDCEGCGT
Immigrant status                                               SDCAFIMM  SDCCFIMM  SDCEFIMM
Length of time in Canada since immigration                     SDCAGRES  SDCCGRES  SDCEGRES
Highest level of education - respondent                        EDUADR04  EDUCDR04  EDUEDR04
Total household income from all sources                        INCAGHH   INCCGHH   INCEGHH
Body mass index                                                HWTAGBMI  HWTCGBMI  HWTEGBMI
Physical activity index                                        PACADPAI  PACCDPAI  PACEDPAI
Has a regular medical doctor                                   TWDA_5    HCUC_1AA  HCUE_1AA
Self-perceived stress                                          GENA_07   GENC_07   GENE_07
Type of smoker                                                 SMKADSTY  SMKCDSTY  SMKEDSTY
Type of drinker                                                ALCADTYP  ALCCDTYP  ALCEDTYP
Daily consumption - total fruits and vegetables                FVCADTOT  FVCCDTOT  FVCEDTOT
Has high blood pressure                                        CCCA_071  CCCC_071  CCCE_071
Has emphysema or chronic obstructive pulmonary disease (COPD)  CCCA_91B  CCCC_91B  CCCE_91F
Has diabetes                                                   CCCA_101  CCCC_101  CCCE_101
Province                                                       GEOAGPRV  GEOCGPRV  GEOEGPRV
Sampling weight - master weight                                WTSAM     WTSC_M    WTSE_M

While most variables in CCHS 3.1 are universally applicable to ‘All respondents,’ there are some exceptions. For example:

  • CCCE_05A universe: (Kind of arthritis / rheumatism)

    • Respondents who answered CCCE_051 = (1, 7 or 8) or CCCE_011 = 8
      • CCCE_051: All respondents
      • CCCE_011: All respondents
  • SDCEGRES universe: (Length of time in Canada since immigration)

    • Respondents who answered SDCE_2 = (2, 7 or 8) or SDCE_1 = (97 or 98)
      • SDCE_2 doesn’t exist in the Public Use Microdata File (PUMF)!
        • (master file variable; not available in PUMF)
      • SDCE_1 doesn’t exist in the PUMF either!
        • (master file variable; not available in PUMF)
  • HWTEGBMI universe: (Body Mass Index (BMI) / self-report)

    • All respondents excluding pregnant women (MAME_037 = 1)
      • MAME_037 doesn’t exist!
      • (master file variable; not available in PUMF)
  • GENE_07 universe: (Self-perceived stress)

    • Respondents aged 15 and over
  • CCCE_91F universe: (Has chronic obstructive pulmonary disease)

    • Respondents aged 30 and over
  • FVCEDTOT universe: (Daily consumption - total fruits and vegetables)

    • Respondents with FVCEFOPT = 1
      • FVCEFOPT: Optional module: Fruit and vegetable consumption - (F)

Ref:

  • page 66 in CCHS 3.1 guide (Canada 2005)

  • Potentially problematic variables (restricted universe):

    • Self-perceived stress
    • Has chronic obstructive pulmonary disease / copd
    • Daily consumption - total fruits and vegetables

We will make decisions about these variables later: for now, let’s keep them.
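
One quick way to spot variables with a restricted universe is to compute the share of NOT APPLICABLE responses in each column. The sketch below uses a small invented data frame (`toy`) rather than the CCHS files, and the helper name `share_not_applicable` is hypothetical; it only illustrates the idea.

```r
# Toy data standing in for a few CCHS columns (values are invented)
toy <- data.frame(
  stress = factor(c("A BIT", "NOT APPLICABLE", "NOT VERY", "NOT APPLICABLE")),
  copd   = factor(c("NO", "NOT APPLICABLE", "NOT APPLICABLE", "NOT APPLICABLE")),
  diab   = factor(c("NO", "YES", "NO", "NO"))
)

# Hypothetical helper: proportion of rows coded 'NOT APPLICABLE' per column
share_not_applicable <- function(df) {
  sapply(df, function(x) mean(as.character(x) == "NOT APPLICABLE"))
}

na_share <- share_not_applicable(toy)
print(round(na_share, 2))
```

Columns with a large share are candidates for a closer look at their universe in the data dictionary.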

Restrict the dataset with variables of interest only

Cycle 1.1
  • We define a vector of variable names var.names1 that are of interest for the first cycle of the Canadian Community Health Survey (CCHS 1.1). These variables cover a range of topics such as heart disease, age, sex, etc.
  • Then we create a new data frame cc11 by subsetting the original data frame c1 to include only the columns specified in var.names1.
var.names1 <- c("CCCA_121", "CCCA_051", "CCCA_05A", "DHHAGAGE", 
                "DHHA_SEX", "DHHAGMS", "SDCAGRAC", "SDCAFIMM", 
                "SDCAGRES", "EDUADR04", "INCAGHH", "HWTAGBMI", 
                "PACADPAI", "TWDA_5", "GENA_07", "SMKADSTY", 
                "ALCADTYP", "FVCADTOT", "CCCA_071", "CCCA_91B",
                "CCCA_101", "GEOAGPRV", "WTSAM")
cc11 <- c1[var.names1]
dim(cc11)
#> [1] 130880     23
table(cc11$CCCA_051)
#> 
#>            YES             NO NOT APPLICABLE     DON'T KNOW        REFUSAL 
#>          24511         106231              0            110              3 
#>     NOT STATED 
#>             25

The subsequent two code chunks do the same for CCHS 2.1 and CCHS 3.1, respectively, resulting in new data frames cc21 and cc31.

Cycle 2.1
var.names2 <- c("CCCC_121", "CCCC_051", "CCCC_05A", "DHHCGAGE", 
                "DHHC_SEX", "DHHCGMS", "SDCCGRAC", "SDCCFIMM", 
                "SDCCGRES", "EDUCDR04", "INCCGHH", "HWTCGBMI", 
                "PACCDPAI", "HCUC_1AA", "GENC_07", "SMKCDSTY", 
                "ALCCDTYP", "FVCCDTOT", "CCCC_071", "CCCC_91B",
                "CCCC_101", "GEOCGPRV", "WTSC_M")
cc21 <- c2[var.names2]
dim(cc21)
#> [1] 134072     23
table(cc21$CCCC_051)
#> 
#>            YES             NO NOT APPLICABLE     DON'T KNOW        REFUSAL 
#>          29293         104530              0            208             11 
#>     NOT STATED 
#>             30
Cycle 3.1
var.names3 <- c("CCCE_121", "CCCE_051", "CCCE_05A", "DHHEGAGE", 
                "DHHE_SEX", "DHHEGMS", "SDCEGCGT", "SDCEFIMM", 
                "SDCEGRES", "EDUEDR04", "INCEGHH", "HWTEGBMI", 
                "PACEDPAI", "HCUE_1AA", "GENE_07", "SMKEDSTY", 
                "ALCEDTYP", "FVCEDTOT","CCCE_071", "CCCE_91F", 
                "CCCE_101", "GEOEGPRV", "WTSE_M")
cc31 <- c3[var.names3]
dim(cc31)
#> [1] 132221     23
table(cc31$CCCE_051)
#> 
#>            YES             NO NOT APPLICABLE     DON'T KNOW        REFUSAL 
#>          28221         103781              0            191              4 
#>     NOT STATED 
#>             24
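
Subsetting a data frame by a character vector stops with an error if any requested name is absent, which matters here because some variables exist only in the master file. As a minimal sketch on an invented data frame (`toy` and `wanted` are illustrative names, not part of the tutorial's data), the requested names can be checked first:

```r
# Toy data frame standing in for one CCHS cycle (invented values)
toy <- data.frame(CCCE_121 = c("YES", "NO"),
                  DHHE_SEX = c("MALE", "FEMALE"),
                  WTSE_M   = c(120.5, 88.1))

# Requested variables; SDCE_2 is deliberately not in the toy data
wanted <- c("CCCE_121", "DHHE_SEX", "WTSE_M", "SDCE_2")

# Names requested but absent from the data
missing.vars <- setdiff(wanted, names(toy))
print(missing.vars)

# Subset only on the names that are actually present
toy.sub <- toy[intersect(wanted, names(toy))]
dim(toy.sub)
```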

Making variable names the same

We now create a new set of more readable and consistent variable names.

new.var.names <- c("CVD", "arthritis", "arthritis.kind", "age", 
                   "sex", "married", "race", "immigration", 
                   "recent.immigrant", "edu", "income", "bmi", 
                   "phyact", "doctor", "stress", "smoke",
                   "drink", "fruit", "bp", "copd", 
                   "diab",  "province", "weight")
cbind(new.var.names, var.names1, var.names2, var.names3)
#>       new.var.names      var.names1 var.names2 var.names3
#>  [1,] "CVD"              "CCCA_121" "CCCC_121" "CCCE_121"
#>  [2,] "arthritis"        "CCCA_051" "CCCC_051" "CCCE_051"
#>  [3,] "arthritis.kind"   "CCCA_05A" "CCCC_05A" "CCCE_05A"
#>  [4,] "age"              "DHHAGAGE" "DHHCGAGE" "DHHEGAGE"
#>  [5,] "sex"              "DHHA_SEX" "DHHC_SEX" "DHHE_SEX"
#>  [6,] "married"          "DHHAGMS"  "DHHCGMS"  "DHHEGMS" 
#>  [7,] "race"             "SDCAGRAC" "SDCCGRAC" "SDCEGCGT"
#>  [8,] "immigration"      "SDCAFIMM" "SDCCFIMM" "SDCEFIMM"
#>  [9,] "recent.immigrant" "SDCAGRES" "SDCCGRES" "SDCEGRES"
#> [10,] "edu"              "EDUADR04" "EDUCDR04" "EDUEDR04"
#> [11,] "income"           "INCAGHH"  "INCCGHH"  "INCEGHH" 
#> [12,] "bmi"              "HWTAGBMI" "HWTCGBMI" "HWTEGBMI"
#> [13,] "phyact"           "PACADPAI" "PACCDPAI" "PACEDPAI"
#> [14,] "doctor"           "TWDA_5"   "HCUC_1AA" "HCUE_1AA"
#> [15,] "stress"           "GENA_07"  "GENC_07"  "GENE_07" 
#> [16,] "smoke"            "SMKADSTY" "SMKCDSTY" "SMKEDSTY"
#> [17,] "drink"            "ALCADTYP" "ALCCDTYP" "ALCEDTYP"
#> [18,] "fruit"            "FVCADTOT" "FVCCDTOT" "FVCEDTOT"
#> [19,] "bp"               "CCCA_071" "CCCC_071" "CCCE_071"
#> [20,] "copd"             "CCCA_91B" "CCCC_91B" "CCCE_91F"
#> [21,] "diab"             "CCCA_101" "CCCC_101" "CCCE_101"
#> [22,] "province"         "GEOAGPRV" "GEOCGPRV" "GEOEGPRV"
#> [23,] "weight"           "WTSAM"    "WTSC_M"   "WTSE_M"
names(cc11) <- names(cc21) <- names(cc31) <- new.var.names

table(cc11$arthritis)
#> 
#>            YES             NO NOT APPLICABLE     DON'T KNOW        REFUSAL 
#>          24511         106231              0            110              3 
#>     NOT STATED 
#>             25
table(cc21$arthritis)
#> 
#>            YES             NO NOT APPLICABLE     DON'T KNOW        REFUSAL 
#>          29293         104530              0            208             11 
#>     NOT STATED 
#>             30
table(cc31$arthritis)
#> 
#>            YES             NO NOT APPLICABLE     DON'T KNOW        REFUSAL 
#>          28221         103781              0            191              4 
#>     NOT STATED 
#>             24

cc11$cycle <- 11
cc21$cycle <- 21
cc31$cycle <- 31

Appending

We now combine the data frames cc11, cc21, and cc31 by stacking them on top of each other.

cc123a <- rbind(cc11,cc21,cc31)
dim(cc123a)
#> [1] 397173     24
names(cc123a)
#>  [1] "CVD"              "arthritis"        "arthritis.kind"   "age"             
#>  [5] "sex"              "married"          "race"             "immigration"     
#>  [9] "recent.immigrant" "edu"              "income"           "bmi"             
#> [13] "phyact"           "doctor"           "stress"           "smoke"           
#> [17] "drink"            "fruit"            "bp"               "copd"            
#> [21] "diab"             "province"         "weight"           "cycle"
cc123a$ID <- 1:nrow(cc123a)
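
When data frames with factor columns are stacked, the resulting factor carries the union of the levels from each piece. This is why labels that differ between cycles sit side by side in the combined tables later on. A small sketch with invented data frames `a` and `b`:

```r
# Two toy 'cycles' whose income labels differ (invented rows)
a <- data.frame(income = factor(c("NO INCOME", "LESS THAN 15,000")),
                cycle = 11)
b <- data.frame(income = factor(c("NO OR <$15,000", "$15,000-$29,999")),
                cycle = 21)

# rbind() requires matching column names and merges the factor levels
stacked <- rbind(a, b)
levels(stacked$income)
table(stacked$cycle)
```

Checking `levels()` right after stacking shows which labels need to be harmonised in the recoding steps.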

Variables

Sampling weight

We use the summary function to provide basic statistics (like mean, median, min, max, etc.) for the weight column in the data frame cc123a.

summary(cc123a$weight)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1.17   65.28  126.63  200.09  243.21 7154.95
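
To see why the weights matter, the sketch below compares an unweighted and a weighted proportion on five invented respondents. It covers point estimates only; standard errors require the survey design to be declared, which is not shown here.

```r
# Invented respondents: exposure indicator and sampling weight
toy <- data.frame(
  OA     = c(1, 0, 0, 1, 0),
  weight = c(50, 200, 150, 25, 75)
)

# Unweighted proportion: every respondent counts equally
unweighted <- mean(toy$OA)

# Weighted proportion: each respondent counts for 'weight' people
weighted <- weighted.mean(toy$OA, w = toy$weight)

c(unweighted = unweighted, weighted = weighted)
#> unweighted   weighted 
#>       0.40       0.15
```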

Exposure

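The following chunk creates frequency tables for the ‘arthritis’ and ‘arthritis.kind’ columns. The two sums at the end confirm that everyone who answered NO to the arthritis question is coded NOT APPLICABLE for the kind of arthritis.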

table(cc123a$arthritis)
#> 
#>            YES             NO NOT APPLICABLE     DON'T KNOW        REFUSAL 
#>          82025         314542              0            509             18 
#>     NOT STATED 
#>             79
table(cc123a$arthritis.kind)
#> 
#> RHEUMATOID ARTH  OSTEOARTHRITIS           OTHER  NOT APPLICABLE      DON'T KNOW 
#>           19099           40943            7305          314542           12354 
#>         REFUSAL      NOT STATED      RHEUMATISM 
#>             215             619            2096
sum(cc123a$arthritis=="NO")
#> [1] 314542
sum(cc123a$arthritis.kind=="NOT APPLICABLE")
#> [1] 314542

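We create the exposure variable: respondents with osteoarthritis (OA) versus controls (respondents without arthritis). All other kinds of arthritis and all non-responses are set to missing.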

c123sub1 <- cc123a
c123sub1$arthritis.kind <- car::recode(c123sub1$arthritis.kind, 
                            "'OSTEOARTHRITIS'='OSTEOARTHRITIS';
                            'NOT APPLICABLE' = 'NOT APPLICABLE';
                            else = NA",  
                            as.factor = FALSE)
table(c123sub1$arthritis.kind, useNA = "always")
#> 
#> NOT APPLICABLE OSTEOARTHRITIS           <NA> 
#>         314542          40943          41688
# c123sub1 <- subset(cc123a, arthritis.kind == "OSTEOARTHRITIS" | 
#                      arthritis.kind == "NOT APPLICABLE" )
# dim(c123sub1)
table(c123sub1$arthritis.kind)
#> 
#> NOT APPLICABLE OSTEOARTHRITIS 
#>         314542          40943
table(c123sub1$arthritis)
#> 
#>            YES             NO NOT APPLICABLE     DON'T KNOW        REFUSAL 
#>          82025         314542              0            509             18 
#>     NOT STATED 
#>             79
require(car)
c123sub1$arthritis.kind <- car::recode(c123sub1$arthritis.kind, 
                                  "'OSTEOARTHRITIS'='OA';
                         'NOT APPLICABLE'='Control';
                         else=NA", as.factor = FALSE)
table(c123sub1$arthritis.kind, useNA = "always")
#> 
#> Control      OA    <NA> 
#>  314542   40943   41688
c123sub1$OA <- c123sub1$arthritis.kind
c123sub1$arthritis.kind <- NULL
c123sub1$arthritis <- NULL
dim(c123sub1)
#> [1] 397173     24
dim(c123sub1)[1]-dim(cc123a)[1]
#> [1] 0
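
The logic of a recode with `else = NA` can also be written in base R with a named lookup vector. This is not the approach used in this tutorial, only a sketch on an invented vector (`kind`) to make the mapping explicit: any label missing from the lookup becomes NA.

```r
# Invented labels standing in for arthritis.kind
kind <- c("OSTEOARTHRITIS", "NOT APPLICABLE", "RHEUMATISM",
          "DON'T KNOW", "OSTEOARTHRITIS")

# Named lookup: old label -> new label
lookup <- c("OSTEOARTHRITIS" = "OA", "NOT APPLICABLE" = "Control")

# Indexing by name returns NA for labels not in the lookup
OA <- unname(lookup[kind])
table(OA, useNA = "always")
```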

Outcome

We create the outcome variable: YES becomes ‘event’, NO becomes ‘no event’, and every other response is set to missing.

c123sub2 <- c123sub1
table(c123sub2$CVD)
#> 
#>            YES             NO NOT APPLICABLE     DON'T KNOW        REFUSAL 
#>          25524         371121              0            399             50 
#>     NOT STATED 
#>             79
c123sub2$CVD <- car::recode(c123sub2$CVD, 
                            "'YES'='YES';
                            'NO' = 'NO';
                            else = NA",  
                            as.factor = FALSE)
# table(c123sub1$CVD)
# c123sub2 <- subset(c123sub1, CVD == "YES" | CVD == "NO")
# table(c123sub2$CVD)
dim(c123sub2)
#> [1] 397173     24
c123sub2$CVD <- car::recode(c123sub2$CVD, 
                       "'YES'='event';
                       'NO'='no event';
                       else=NA",
                       as.factor = FALSE)
table(c123sub2$CVD, useNA = "always")
#> 
#>    event no event     <NA> 
#>    25524   371121      528
dim(c123sub2)
#> [1] 397173     24
dim(c123sub2)[1]-dim(c123sub1)[1]
#> [1] 0

Covariates

Age

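Recodes the ‘age’ column into broader age categories. Note that the final else clause sends every label not listed above to ‘65 years and over’, so an unexpected label would be counted there rather than set to NA.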

c123sub2$age <- car::recode(c123sub2$age, 
                            "c('12 TO 14 YEARS','15 TO 19 YEARS', 
                            '15 TO 17 YEARS', '18 TO 19 YEARS')='teen';
                            c('20 TO 24 YEARS','25 TO 29 YEARS')=
                            '20-29 years'; 
                            c('30 TO 34 YEARS','35 TO 39 YEARS')=
                            '30-39 years';
                            c('40 TO 44 YEARS','45 TO 49 YEARS')=
                            '40-49 years';
                            c('50 TO 54 YEARS','55 TO 59 YEARS')=
                            '50-59 years';
                            c('60 TO 64 YEARS')='60-64 years';
                            else='65 years and over'",  
                            as.factor = FALSE)
table(c123sub2$age)
#> 
#>       20-29 years       30-39 years       40-49 years       50-59 years 
#>             48652             63810             65111             61035 
#>       60-64 years 65 years and over              teen 
#>             25265             80913             52387
c123sub3 <- c123sub2
# c123sub3$age[c123sub3$age == 'teen'] <- NA
# c123sub3$age[c123sub3$age == '65 years and over'] <- NA
# dim(c123sub2)
# c123sub3 <- subset(c123sub2, age != 'teen' & age != '65 years and over')
table(c123sub3$age, useNA = "always")
#> 
#>       20-29 years       30-39 years       40-49 years       50-59 years 
#>             48652             63810             65111             61035 
#>       60-64 years 65 years and over              teen              <NA> 
#>             25265             80913             52387                 0
dim(c123sub3)
#> [1] 397173     24
Sex

Recodes ‘sex’ to ‘Male’ or ‘Female’.

table(c123sub3$sex)
#> 
#>           MALE         FEMALE NOT APPLICABLE     NOT STATED     DON'T KNOW 
#>         182523         214650              0              0              0 
#>        REFUSAL 
#>              0
c123sub3$sex <- car::recode(c123sub3$sex, 
                            "'MALE'='Male';
                            'FEMALE' = 'Female';
                            else = NA",  
                            as.factor = FALSE)
table(c123sub3$sex, useNA = "always")
#> 
#> Female   Male   <NA> 
#> 214650 182523      0
Marital status

Recodes the ‘married’ column into two categories: ‘not single’ and ‘single’.

table(c123sub3$married)
#> 
#>          MARRIED       COMMON-LAW    WIDOW/SEP/DIV           SINGLE 
#>           171304            30909            75885            78261 
#>          REFUSAL       NOT STATED   NOT APPLICABLE       DON'T KNOW 
#>                0              661                0                0 
#> SINGLE/NEVER MAR 
#>            40153
c123sub3$married <- car::recode(c123sub3$married, 
                             "c('MARRIED', 'COMMON-LAW')='not single';
                             c('WIDOW/SEP/DIV', 'SINGLE', 
                             'SINGLE/NEVER MAR') = 'single';
                             else = NA",  
                             as.factor = FALSE)
table(c123sub3$married, useNA = "always")
#> 
#> not single     single       <NA> 
#>     202213     194299        661
Race/ethnicity

Recodes ‘race’ into ‘White’ and ‘Non-white’.

table(c123sub3$race)
#> 
#>            WHITE VISIBLE MINORITY   NOT APPLICABLE       NOT STATED 
#>           349222            38641                0             9310 
#>       DON'T KNOW          REFUSAL 
#>                0                0
c123sub3$race <- car::recode(c123sub3$race, 
                             "'WHITE'='White';
                             'VISIBLE MINORITY' = 'Non-white';
                             else = NA",  
                             as.factor = FALSE)
table(c123sub3$race, useNA = "always")
#> 
#> Non-white     White      <NA> 
#>     38641    349222      9310
Immigration

Creates a new column for immigration status based on the ‘recent.immigrant’ column, then removes the original column.

table(c123sub3$recent.immigrant)
#> 
#>     0 TO 9 YEARS 10 YEARS OR MORE   NOT APPLICABLE       NOT STATED 
#>            10644            26746           338078             7975 
#>       DON'T KNOW          REFUSAL 10 OR MORE YEARS 
#>                0                0            13730
# The cycles word the same category differently:
# '10 YEARS OR MORE' vs '10 OR MORE YEARS'; both are mapped here
c123sub3$immigrate <- car::recode(c123sub3$recent.immigrant,
                            "'0 TO 9 YEARS'='recent';
                            c('10 YEARS OR MORE',
                            '10 OR MORE YEARS') = '> 10 years';
                            'NOT APPLICABLE' = 'not immigrant';
                            else = NA",
                            as.factor = FALSE)
table(c123sub3$immigrate, useNA = "always")
#> 
#>    > 10 years not immigrant        recent          <NA> 
#>         40476        338078         10644          7975
c123sub3$recent.immigrant <- NULL
c123sub3$immigration <- NULL
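
Because `else = NA` silently absorbs any label that is not listed, it helps to cross-tabulate the original labels against the recoded ones and see exactly which labels ended up missing. The sketch below uses an invented vector and a base R lookup that deliberately omits one spelling variant:

```r
# Invented labels, including two spellings of the same category
old <- c("0 TO 9 YEARS", "10 YEARS OR MORE", "10 OR MORE YEARS",
         "NOT APPLICABLE", "NOT STATED")

# Lookup that (on purpose) forgets '10 OR MORE YEARS'
lookup <- c("0 TO 9 YEARS"     = "recent",
            "10 YEARS OR MORE" = "> 10 years",
            "NOT APPLICABLE"   = "not immigrant")
new <- unname(lookup[old])

# Rows of the cross-tab with counts in the NA column were dropped
table(old, new, useNA = "ifany")

# Labels that were turned into NA
lost <- unique(old[is.na(new)])
print(lost)
```

Any substantive label appearing in `lost` points to a gap in the mapping rather than to a genuine non-response.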
Education

Recode educational status into specified categories.

table(c123sub3$edu)
#> 
#> < THAN SECONDARY  SECONDARY GRAD.  OTHER POST-SEC.  POST-SEC. GRAD. 
#>           124425            64753            29000           171972 
#>   NOT APPLICABLE       NOT STATED       DON'T KNOW          REFUSAL 
#>                0             7023                0                0
c123sub3$edu <- car::recode(c123sub3$edu,
                            "'< THAN SECONDARY'='< 2ndary';
                            'SECONDARY GRAD.' = '2nd grad.';
                            'POST-SEC. GRAD.' = 'Post-2nd grad.';
                            'OTHER POST-SEC.' = 'Other 2nd grad.';
                            else = NA",
                            as.factor = FALSE)
table(c123sub3$edu, useNA = "always")
#> 
#>        < 2ndary       2nd grad. Other 2nd grad.  Post-2nd grad.            <NA> 
#>          124425           64753           29000          171972            7023
Income

Recodes income levels into broader categories.

table(c123sub3$income)
#> 
#>        NO INCOME LESS THAN 15,000  $15,000-$29,999  $30,000-$49,999 
#>            12636            14103            63766            77940 
#>  $50,000-$79,999  $80,000 OR MORE   NOT APPLICABLE       NOT STATED 
#>            85108            75715                0            57079 
#>       DON'T KNOW          REFUSAL   NO OR <$15,000 
#>                0                0            10826
# cycle 1.1 has: 'NO INCOME','LESS THAN 15,000'
# Other cycles have: 'NO OR <$15,000'
c123sub3$income <- car::recode(c123sub3$income, 
                               "c('NO OR <$15,000', 'NO INCOME',
                               'LESS THAN 15,000',
                               '$15,000-$29,999')='$29,999 or less';
                               '$30,000-$49,999' = '$30,000-$49,999';
                               '$50,000-$79,999' = '$50,000-$79,999';
                               '$80,000 OR MORE' = '$80,000 or more';
                               else = NA",  
                               as.factor = FALSE)
table(c123sub3$income, useNA = "always")
#> 
#> $29,999 or less $30,000-$49,999 $50,000-$79,999 $80,000 or more            <NA> 
#>          101331           77940           85108           75715           57079
BMI

Converts ‘bmi’ column values to numerical data and also categorizes it based on the BMI value.

If you want to reuse the continuous variable later (usually a good idea in a statistical sense), keep a second copy of the variable.

# table(c123sub3$bmi)
sum(c123sub3$bmi=="NOT APPLICABLE")+
  sum(c123sub3$bmi=="REFUSAL")+
  sum(c123sub3$bmi=="NOT STATED")+
  sum(c123sub3$bmi=="DON'T KNOW")
#> [1] 72303
c123sub3$bmi <- car::recode(c123sub3$bmi, 
                               'c("NOT APPLICABLE", "REFUSAL", 
                               "NOT STATED", "DON\'T KNOW")=NA',
                               as.factor = FALSE)
#table(c123sub3$bmi, useNA = "always")
c123sub3$bmi <- as.numeric(as.character(c123sub3$bmi))
summary(c123sub3$bmi)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>   11.91   22.47   25.30   25.86   28.50   57.90   72303
c123sub3$bmi2 <- cut(c123sub3$bmi,
                       breaks = c(0,18.5,25,Inf),
                       right = TRUE,
                       labels = c("Underweight", 
                                  "healthy weight", 
                                  "Overweight"))
c123sub3$bmi <- c123sub3$bmi2
c123sub3$bmi2 <- NULL
table(c123sub3$bmi, useNA = "always")
#> 
#>    Underweight healthy weight     Overweight           <NA> 
#>          10116         148533         166221          72303
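
How `cut()` treats values that sit exactly on a break is easy to misread. With `right = TRUE` each interval is open on the left and closed on the right, so the lowest break itself is excluded unless `include.lowest = TRUE` is set. A sketch with invented BMI values (the value 0 is there only to show the edge case):

```r
bmi <- c(0, 18.5, 18.6, 25, 25.1)
labs <- c("Underweight", "healthy weight", "Overweight")

# Intervals: (0,18.5], (18.5,25], (25,Inf]
default <- cut(bmi, breaks = c(0, 18.5, 25, Inf),
               right = TRUE, labels = labs)
as.character(default)
#> [1] NA               "Underweight"    "healthy weight" "healthy weight"
#> [5] "Overweight"

# Closing the first interval on the left: [0,18.5]
closed <- cut(bmi, breaks = c(0, 18.5, 25, Inf),
              right = TRUE, include.lowest = TRUE, labels = labs)
as.character(closed)[1]
#> [1] "Underweight"
```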
Physical activity

Recodes physical activity levels.

table(c123sub3$phyact)
#> 
#>         ACTIVE       MODERATE       INACTIVE NOT APPLICABLE     NOT STATED 
#>          98571          93507         190739              0          14356 
#>     DON'T KNOW        REFUSAL 
#>              0              0
c123sub3$phyact <- car::recode(c123sub3$phyact,
                               "'ACTIVE'='Active';
                               'MODERATE' = 'Moderate';
                               'INACTIVE' = 'Inactive';
                               else = NA",
                               as.factor = FALSE)
table(c123sub3$phyact, useNA = "always")
#> 
#>   Active Inactive Moderate     <NA> 
#>    98571   190739    93507    14356
Doctor

Recodes whether someone has access to a doctor into ‘Yes’ or ‘No’.

table(c123sub3$doctor)
#> 
#>            YES             NO NOT APPLICABLE     DON'T KNOW        REFUSAL 
#>         338105          58623              0            406             39 
#>     NOT STATED 
#>              0
c123sub3$doctor <- car::recode(c123sub3$doctor, 
                            "'YES'='Yes';
                            'NO' = 'No';
                            else = NA",  
                            as.factor = FALSE)
table(c123sub3$doctor, useNA = "always")
#> 
#>     No    Yes   <NA> 
#>  58623 338105    445
Stress

Recodes stress levels into ‘Not too stressed’ or ‘stressed’.

table(c123sub3$stress)
#> 
#>     NOT AT ALL       NOT VERY          A BIT    QUITE A BIT      EXTREMELY 
#>          51020          89612         146112          68203          14094 
#> NOT APPLICABLE     DON'T KNOW        REFUSAL     NOT STATED 
#>          26737           1244            145              6
c123sub3$stress <- car::recode(c123sub3$stress,
                              "c('NOT AT ALL','NOT VERY','A BIT')=
                              'Not too stressed';
                              c('QUITE A BIT','EXTREMELY') = 
                              'stressed';
                              else = NA",
                              as.factor = FALSE)
table(c123sub3$stress, useNA = "always")
#> 
#> Not too stressed         stressed             <NA> 
#>           286744            82297            28132
Smoking

Recodes smoking status into ‘Current smoker’, ‘Former smoker’, or ‘Never smoker’.

table(c123sub3$smoke)
#> 
#>            DAILY       OCCASIONAL ALWAYS OCCASION.     FORMER DAILY 
#>            79878            10931             7043            99450 
#> FORMER OCCASION.     NEVER SMOKED   NOT APPLICABLE       NOT STATED 
#>            58120           140017                0             1734 
#>       DON'T KNOW          REFUSAL 
#>                0                0
# 'ALWAYS OCCASION.' is counted as a current smoker, so it is
# listed only once (a second listing under 'Former smoker' had no effect)
c123sub3$smoke <- car::recode(c123sub3$smoke,
                              "c('DAILY','OCCASIONAL',
                              'ALWAYS OCCASION.')='Current smoker';
                              c('FORMER DAILY',
                              'FORMER OCCASION.') = 'Former smoker';
                              'NEVER SMOKED' = 'Never smoker';
                              else = NA",
                              as.factor = FALSE)
table(c123sub3$smoke, useNA = "always")
#> 
#> Current smoker  Former smoker   Never smoker           <NA> 
#>          97852         157570         140017           1734
Alcohol

Recodes drinking habits into ‘Current drinker’, ‘Former drinker’, or ‘Never drank’.

table(c123sub3$drink)
#> 
#> REGULAR DRINKER    OCC. DRINKER  FORMER DRINKER     NEVER DRANK  NOT APPLICABLE 
#>          219334           76399           55299           40670               0 
#>      NOT STATED      DON'T KNOW         REFUSAL 
#>            5471               0               0
c123sub3$drink <- car::recode(c123sub3$drink,
                              "c('REGULAR DRINKER',
                              'OCC. DRINKER')='Current drinker';
                              c('FORMER DRINKER') = 'Former drinker';
                              'NEVER DRANK' = 'Never drank';
                              else = NA",
                              as.factor = FALSE)
table(c123sub3$drink, useNA = "always")
#> 
#> Current drinker  Former drinker     Never drank            <NA> 
#>          295733           55299           40670            5471
Fruit and vegetable consumption

Converts fruit and vegetable consumption to numerical values and then categorizes it.

If you want to reuse the continuous variable later (usually a good idea in a statistical sense), keep a second copy of the variable.

str(c123sub3$fruit)
#>  Factor w/ 303 levels "0","0.1","0.2",..: 57 48 42 72 67 120 39 54 61 51 ...
#c123sub3$fruit.cont <- c123sub3$fruit
c123sub3$fruit2 <- as.numeric(as.character(c123sub3$fruit))
# Note: do not use as.numeric(c123sub3$fruit)
summary(c123sub3$fruit2)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>    0.00    3.00    4.30    4.75    6.00   80.00   75347
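# include.lowest = TRUE keeps respondents reporting exactly 0 servings
# in the first category; without it they would be turned into NA
c123sub3$fruit2 <- cut(c123sub3$fruit2,
                       breaks = c(0,3,6,Inf),
                       right = TRUE,
                       include.lowest = TRUE,
                       labels = c("0-3 daily serving",
                                  "4-6 daily serving",
                                  "6+ daily serving"))
table(c123sub3$fruit2, useNA = "always")
#> 
#> 0-3 daily serving 4-6 daily serving  6+ daily serving              <NA> 
#>             83589            159380             78857             75347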
c123sub3$fruit <- c123sub3$fruit2
c123sub3$fruit2 <- NULL
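
The warning against calling `as.numeric()` directly on a factor deserves a small demonstration. On a factor, `as.numeric()` returns the internal level codes, not the numbers printed as labels. A sketch with an invented factor `f`:

```r
# Invented factor whose labels look like numbers
f <- factor(c("4.3", "0.5", "10", "NOT STATED"))

# Wrong: integer codes of the levels, not the values
wrong <- as.numeric(f)

# Right: convert labels to text first, then to numbers
# (non-numeric labels become NA, with a warning suppressed here)
right <- suppressWarnings(as.numeric(as.character(f)))

print(wrong)
print(right)
```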
Hypertension

We recode the blood pressure column into ‘Yes’ or ‘No’, showing the distribution of values before and after recoding.

table(c123sub3$bp)
#> 
#>            YES             NO NOT APPLICABLE     DON'T KNOW        REFUSAL 
#>          68071         328164              0            793             66 
#>     NOT STATED 
#>             79
c123sub3$bp <- car::recode(c123sub3$bp,
                           "'YES'='Yes';
                           'NO' = 'No';
                           else = NA",
                           as.factor = FALSE)
table(c123sub3$bp, useNA = "always")
#> 
#>     No    Yes   <NA> 
#> 328164  68071    938
COPD

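We recode the COPD column into ‘Yes’ or ‘No’. The large NOT APPLICABLE count reflects the restricted universe of this question (in CCHS 3.1, respondents aged 30 and over); those responses become NA.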

table(c123sub3$copd)
#> 
#>            YES             NO NOT APPLICABLE     DON'T KNOW        REFUSAL 
#>           4508         291191         101039            329             26 
#>     NOT STATED 
#>             80
c123sub3$copd <- car::recode(c123sub3$copd,
                             "'YES'='Yes';
                             'NO' = 'No';
                             else = NA",
                             as.factor = FALSE)
table(c123sub3$copd, useNA = "always")
#> 
#>     No    Yes   <NA> 
#> 291191   4508 101474
Diabetes

We recode the diabetes column into ‘Yes’ or ‘No’ in the same way.

table(c123sub3$diab)
#> 
#>            YES             NO NOT APPLICABLE     DON'T KNOW        REFUSAL 
#>          22231         374589              0            236             38 
#>     NOT STATED 
#>             79
c123sub3$diab <- car::recode(c123sub3$diab,
                             "'YES'='Yes';
                             'NO' = 'No';
                             else = NA",
                             as.factor = FALSE)
table(c123sub3$diab, useNA = "always")
#> 
#>     No    Yes   <NA> 
#> 374589  22231    353
North

This section is concerned with recoding the province column. It groups provinces into “North” and “South” based on their names.

  • Note that the category names might not exactly match the data dictionary:
    • In particular, there are two versions of QUEBEC
    • PEI is abbreviated
    • No respondents fall in the following categories:
      • DON’T KNOW
      • REFUSAL
      • NOT STATED
      • NOT APPLICABLE
c123sub3$province.check <- c123sub3$province
table(c123sub3$province)
#> 
#>     NEWFOUNDLAND              PEI      NOVA SCOTIA    NEW BRUNSWICK 
#>             7924             7744            15341            15025 
#>        QU\xc9BEC          ONTARIO         MANITOBA     SASKATCHEWAN 
#>            22012           123821            23454            23361 
#>          ALBERTA BRITISH COLUMBIA YUKON/NWT/NUNAVT   NOT APPLICABLE 
#>            40127            49767             5064                0 
#>       DON'T KNOW          REFUSAL       NOT STATED           QUEBEC 
#>                0                0                0            56764 
#>      NFLD & LAB.  YUKON/NWT/NUNA. 
#>             4111             2658
c123sub3$province <- car::recode(c123sub3$province,
                              "c('YUKON/NWT/NUNAVT','YUKON/NWT/NUNA.')=
                              'North'; else = 'South'",
                              as.factor = FALSE)
table(c123sub3$province, useNA = "always")
#> 
#>  North  South   <NA> 
#>   7722 389451      0
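
When the same category is spelled differently across cycles, listing every variant explicitly (as above) is the most transparent option. As an alternative sketch, and only under the assumption that all northern labels begin with "YUKON", a prefix match catches both spellings. The vector `prov` is invented:

```r
# Invented province labels, including both territorial spellings
prov <- c("ONTARIO", "YUKON/NWT/NUNAVT", "QUEBEC",
          "YUKON/NWT/NUNA.", "PEI")

# Prefix match: assumes every northern label starts with 'YUKON'
north <- ifelse(startsWith(prov, "YUKON"), "North", "South")
table(north)
#> north
#> North South 
#>     2     3
```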
Dimension testing

This chunk verifies that no rows were lost during recoding (the row counts before and after are compared) and converts the cycle variable to a factor.

save.c123sub3 <- c123sub3
names(c123sub3)
#>  [1] "CVD"            "age"            "sex"            "married"       
#>  [5] "race"           "edu"            "income"         "bmi"           
#>  [9] "phyact"         "doctor"         "stress"         "smoke"         
#> [13] "drink"          "fruit"          "bp"             "copd"          
#> [17] "diab"           "province"       "weight"         "cycle"         
#> [21] "ID"             "OA"             "immigrate"      "province.check"
# c123sub3$phyact <- NULL
# c123sub3$stress <- NULL
# c123sub3$fruit <- NULL
# c123sub3$copd <- NULL
# c123sub3$province.check <- NULL
dim(c123sub3)
#> [1] 397173     24
dim(c123sub2)[1]-dim(c123sub3)[1]
#> [1] 0
analytic <- c123sub3
dim(analytic)
#> [1] 397173     24
analytic$cycle <- as.factor(analytic$cycle)
names(analytic)
#>  [1] "CVD"            "age"            "sex"            "married"       
#>  [5] "race"           "edu"            "income"         "bmi"           
#>  [9] "phyact"         "doctor"         "stress"         "smoke"         
#> [13] "drink"          "fruit"          "bp"             "copd"          
#> [17] "diab"           "province"       "weight"         "cycle"         
#> [21] "ID"             "OA"             "immigrate"      "province.check"
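
Since the recoding steps set many responses to NA without dropping rows, a per-variable missingness count is a useful companion to the dimension check. A minimal sketch on an invented data frame (`toy`):

```r
# Invented analytic data with some missing values
toy <- data.frame(
  CVD = c("event", "no event", NA, "no event"),
  OA  = c("OA", NA, NA, "Control"),
  sex = c("Male", "Female", "Female", "Male")
)

# Number of missing values in each column
n.missing <- colSums(is.na(toy))
print(n.missing)

# Number of rows with no missing value in any column
n.complete <- sum(complete.cases(toy))
print(n.complete)
```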

Saving data

We save the data for future use:

save(cc123a, analytic, file = "Data/surveydata/cchs123.RData")
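
To confirm that a saved file contains what is expected, it can be loaded into a separate environment so that nothing in the current workspace is overwritten. The sketch below uses a temporary file and an invented data frame rather than the tutorial's path:

```r
# Invented data frame and a temporary file (not the tutorial's path)
toy <- data.frame(ID = 1:3, weight = c(10, 20, 30))
path <- tempfile(fileext = ".RData")
save(toy, file = path)

# Load into a fresh environment; load() returns the object names
env <- new.env()
loaded <- load(path, envir = env)
print(loaded)

# The reloaded object should match the original
identical(env$toy, toy)

unlink(path)  # clean up the temporary file
```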

References

Canada, Statistics. 2005. “Canadian Community Health Survey (CCHS), Cycle 3.1.” Ottawa: Statistics Canada.
Karim, Ehsan. 2023. “Case Study 2: Risk of Cardiovascular Disease Among Osteoarthritis Patients.” https://ssc.ca/en/case-study/case-study-2-risk-cardiovascular-disease-among-osteoarthritis-patients.
Rahman, M Mushfiqur, Jacek A Kopec, Jolanda Cibere, Charlie H Goldsmith, and Aslam H Anis. 2013. “The Relationship Between Osteoarthritis and Cardiovascular Disease in a Population Health Survey: A Cross-Sectional Study.” BMJ Open 3 (5): e002624.