CCHS: Revisiting PICOT
Welcome to this tutorial, where we examine the same research question presented in the Causal question-1 tutorial. This time, our approach is enriched by working with a more comprehensive set of covariates. We will follow the guidelines from the research article by Rahman et al. (2013) (DOI:10.1136/bmjopen-2013-002624). We will also work properly with survey feature variables (e.g., sampling weights).
Remembering PICOT
Before diving into the data, let’s recall the research parameters using the PICOT framework:
Target population: Canadian adults (CCHS data)
Outcome (\(Y\)): CVD (Heart disease/Cardiovascular Disease)
Exposure group (\(A\)): Osteoarthritis (OA)
Control group: People without OA
Timeline: Data collected from 2001 to 2005
In addition, we’ll identify potential confounders based on literature to better understand the relationship between exposure and outcome.
Creating dataset
To start, we’ll load the required R packages:
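A minimal sketch of this step, assuming that `car` (the only package the code in this section calls by name, through `car::recode`) is the one to attach. Any other packages the full analysis needs are not known from this section and would have to be added to `pkgs`.

```r
# Assumption: 'car' is the only package named explicitly in this section's code.
pkgs <- c("car")

# Check which packages are installed without attaching them yet.
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Attach the installed ones; report the rest instead of stopping with an error.
for (p in pkgs[is_installed]) {
  library(p, character.only = TRUE)
}
if (any(!is_installed)) {
  message("Install first: ", paste(pkgs[!is_installed], collapse = ", "))
}
```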
Load data
We’ll be using CCHS data from various cycles. Use the following code to load this data into your R environment:
To check the dimensions of each data set:
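A sketch of the loading and dimension check, written so that it runs anywhere by using a throwaway `.RData` file. The file name `cchs_toy.RData` and the tiny data frames are placeholders; only the object names `c1`, `c2`, and `c3` are the ones the later code expects.

```r
# Placeholder data standing in for the three CCHS cycles (not real CCHS records).
c1 <- data.frame(CCCA_121 = c("YES", "NO"), DHHA_SEX = c("MALE", "FEMALE"))
c2 <- data.frame(CCCC_121 = c("NO", "NO", "YES"),
                 DHHC_SEX = c("MALE", "MALE", "FEMALE"))
c3 <- data.frame(CCCE_121 = "NO", DHHE_SEX = "FEMALE")

# Hypothetical file name; the real file and path depend on your setup.
toy_file <- file.path(tempdir(), "cchs_toy.RData")
save(c1, c2, c3, file = toy_file)
rm(c1, c2, c3)

# load() restores the objects under their saved names and returns those names.
loaded <- load(toy_file)

# Rows and columns of each cycle, one line of output per data set.
dims <- t(sapply(list(c1 = c1, c2 = c2, c3 = c3), dim))
colnames(dims) <- c("rows", "cols")
dims
```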
Subset the data (subset variables)
Understand variable coding
Before subsetting the data, we need to comprehend the variables used in CCHS 3.1, which are described in detail in the official reference:
page 60 in CCHS 3.1 guide (Canada 2005)
link for variable list: (Karim 2023)
A table mapping variable concepts across different CCHS cycles can be found below.
Variable Concept | CCHS 1.1 | CCHS 2.1 | CCHS 3.1 |
---|---|---|---|
Has heart disease | CCCA_121 | CCCC_121 | CCCE_121 |
Has arthritis or rheumatism | CCCA_051 | CCCC_051 | CCCE_051 |
Kind of arthritis | CCCA_05A | CCCC_05A | CCCE_05A |
Age | DHHAGAGE | DHHCGAGE | DHHEGAGE |
Sex | DHHA_SEX | DHHC_SEX | DHHE_SEX |
Marital Status | DHHAGMS | DHHCGMS | DHHEGMS |
Cultural / racial origin | SDCAGRAC | SDCCGRAC | SDCEGCGT |
Immigrant status | SDCAFIMM | SDCCFIMM | SDCEFIMM |
Length of time in Canada since immigration | SDCAGRES | SDCCGRES | SDCEGRES |
Highest level of education - respondent | EDUADR04 | EDUCDR04 | EDUEDR04 |
Total household income from all sources | INCAGHH | INCCGHH | INCEGHH |
Body mass index | HWTAGBMI | HWTCGBMI | HWTEGBMI |
Physical activity index | PACADPAI | PACCDPAI | PACEDPAI |
Has a regular medical doctor | TWDA_5 | HCUC_1AA | HCUE_1AA |
Self-perceived stress | GENA_07 | GENC_07 | GENE_07 |
Type of smoker | SMKADSTY | SMKCDSTY | SMKEDSTY |
Type of drinker | ALCADTYP | ALCCDTYP | ALCEDTYP |
Daily consumption - total fruits and vegetables | FVCADTOT | FVCCDTOT | FVCEDTOT |
Has high blood pressure | CCCA_071 | CCCC_071 | CCCE_071 |
Has emphysema or chronic obstructive pulmonary disease (COPD) | CCCA_91B | CCCC_91B | CCCE_91F |
Has diabetes | CCCA_101 | CCCC_101 | CCCE_101 |
Province | GEOAGPRV | GEOCGPRV | GEOEGPRV |
Sampling weight - master weight | WTSAM | WTSC_M | WTSE_M |
While most variables in CCHS 3.1 are universally applicable to ‘All respondents,’ there are some exceptions. For example:
- CCCE_05A universe: (Kind of arthritis / rheumatism)
  - Respondents who answered CCCE_051 = (1, 7 or 8) or CCCE_011 = 8
  - CCCE_051: All respondents
  - CCCE_011: All respondents
- SDCEGRES universe: (Length of time in Canada since immigration)
  - Respondents who answered SDCE_2 = (2, 7 or 8) or SDCE_1 = (97 or 98)
  - SDCE_2 doesn’t exist! (master file variable; not available in the Public Use Microdata File, PUMF)
  - SDCE_1 doesn’t exist! (master file variable; not available in PUMF)
- HWTEGBMI universe: (Body Mass Index (BMI) / self-report)
  - All respondents excluding pregnant women (MAME_037 = 1)
  - MAME_037 doesn’t exist! (master file variable; not available in PUMF)
- GENE_07 universe: (Self-perceived stress)
  - Respondents aged 15 and over
- CCCE_91F universe: (Has chronic obstructive pulmonary disease)
  - Respondents aged 30 and over
- FVCEDTOT universe: (Daily consumption - total fruits and vegetables)
  - Respondents with FVCEFOPT = 1
  - FVCEFOPT: Optional module: Fruit and vegetable consumption - (F)

Ref: page 66 in CCHS 3.1 guide (Canada 2005)
Potentially problematic variables:
- Self-perceived stress
- Has chronic obstructive pulmonary disease / copd
- Daily consumption - total fruits and vegetables
We will make decisions about these variables later: for now, let’s keep them.
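A restricted universe shows up in the data as 'NOT APPLICABLE' for everyone outside it. As an illustration on invented records (the age and stress labels mimic those that appear later in this tutorial, but the rows are made up), a cross-tabulation makes the pattern visible:

```r
# Invented records: stress is asked only of respondents aged 15 and over.
toy <- data.frame(
  age    = c("12 TO 14 YEARS", "12 TO 14 YEARS", "20 TO 24 YEARS",
             "30 TO 34 YEARS", "30 TO 34 YEARS"),
  stress = c("NOT APPLICABLE", "NOT APPLICABLE", "A BIT",
             "NOT VERY", "QUITE A BIT")
)

# Flag rows that fall outside the variable's universe.
toy$out_of_universe <- toy$stress == "NOT APPLICABLE"

# Rows = age group, columns = outside the universe (TRUE) or not (FALSE).
universe_tab <- table(toy$age, toy$out_of_universe)
universe_tab
```

If the 'NOT APPLICABLE' rows line up with a group that will be excluded anyway, the restriction costs nothing; otherwise those rows become missing values once the variable is recoded.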
Restrict the dataset with variables of interest only
Cycle 1.1
- We define a vector of variable names, var.names1, that are of interest for the first cycle of the Canadian Community Health Survey (CCHS 1.1). These variables cover a range of topics such as heart disease, age, and sex.
- We then create a new data frame, cc11, by subsetting the original data frame c1 to include only the columns specified in var.names1.
var.names1 <- c("CCCA_121", "CCCA_051", "CCCA_05A", "DHHAGAGE",
"DHHA_SEX", "DHHAGMS", "SDCAGRAC", "SDCAFIMM",
"SDCAGRES", "EDUADR04", "INCAGHH", "HWTAGBMI",
"PACADPAI", "TWDA_5", "GENA_07", "SMKADSTY",
"ALCADTYP", "FVCADTOT", "CCCA_071", "CCCA_91B",
"CCCA_101", "GEOAGPRV", "WTSAM")
cc11 <- c1[var.names1]
dim(cc11)
#> [1] 130880 23
table(cc11$CCCA_051)
#>
#> YES NO NOT APPLICABLE DON'T KNOW REFUSAL
#> 24511 106231 0 110 3
#> NOT STATED
#> 25
The subsequent two code chunks do the same for CCHS 2.1 and CCHS 3.1, respectively, resulting in new data frames cc21 and cc31.
Cycle 2.1
var.names2 <- c("CCCC_121", "CCCC_051", "CCCC_05A", "DHHCGAGE",
"DHHC_SEX", "DHHCGMS", "SDCCGRAC", "SDCCFIMM",
"SDCCGRES", "EDUCDR04", "INCCGHH", "HWTCGBMI",
"PACCDPAI", "HCUC_1AA", "GENC_07", "SMKCDSTY",
"ALCCDTYP", "FVCCDTOT", "CCCC_071", "CCCC_91B",
"CCCC_101", "GEOCGPRV", "WTSC_M")
cc21 <- c2[var.names2]
dim(cc21)
#> [1] 134072 23
table(cc21$CCCC_051)
#>
#> YES NO NOT APPLICABLE DON'T KNOW REFUSAL
#> 29293 104530 0 208 11
#> NOT STATED
#> 30
Cycle 3.1
var.names3 <- c("CCCE_121", "CCCE_051", "CCCE_05A", "DHHEGAGE",
"DHHE_SEX", "DHHEGMS", "SDCEGCGT", "SDCEFIMM",
"SDCEGRES", "EDUEDR04", "INCEGHH", "HWTEGBMI",
"PACEDPAI", "HCUE_1AA", "GENE_07", "SMKEDSTY",
"ALCEDTYP", "FVCEDTOT","CCCE_071", "CCCE_91F",
"CCCE_101", "GEOEGPRV", "WTSE_M")
cc31 <- c3[var.names3]
dim(cc31)
#> [1] 132221 23
table(cc31$CCCE_051)
#>
#> YES NO NOT APPLICABLE DON'T KNOW REFUSAL
#> 28221 103781 0 191 4
#> NOT STATED
#> 24
Making variable names the same
We now create a new set of more readable and consistent variable names.
new.var.names <- c("CVD", "arthritis", "arthritis.kind", "age",
"sex", "married", "race", "immigration",
"recent.immigrant", "edu", "income", "bmi",
"phyact", "doctor", "stress", "smoke",
"drink", "fruit", "bp", "copd",
"diab", "province", "weight")
cbind(new.var.names, var.names1, var.names2, var.names3)
#> new.var.names var.names1 var.names2 var.names3
#> [1,] "CVD" "CCCA_121" "CCCC_121" "CCCE_121"
#> [2,] "arthritis" "CCCA_051" "CCCC_051" "CCCE_051"
#> [3,] "arthritis.kind" "CCCA_05A" "CCCC_05A" "CCCE_05A"
#> [4,] "age" "DHHAGAGE" "DHHCGAGE" "DHHEGAGE"
#> [5,] "sex" "DHHA_SEX" "DHHC_SEX" "DHHE_SEX"
#> [6,] "married" "DHHAGMS" "DHHCGMS" "DHHEGMS"
#> [7,] "race" "SDCAGRAC" "SDCCGRAC" "SDCEGCGT"
#> [8,] "immigration" "SDCAFIMM" "SDCCFIMM" "SDCEFIMM"
#> [9,] "recent.immigrant" "SDCAGRES" "SDCCGRES" "SDCEGRES"
#> [10,] "edu" "EDUADR04" "EDUCDR04" "EDUEDR04"
#> [11,] "income" "INCAGHH" "INCCGHH" "INCEGHH"
#> [12,] "bmi" "HWTAGBMI" "HWTCGBMI" "HWTEGBMI"
#> [13,] "phyact" "PACADPAI" "PACCDPAI" "PACEDPAI"
#> [14,] "doctor" "TWDA_5" "HCUC_1AA" "HCUE_1AA"
#> [15,] "stress" "GENA_07" "GENC_07" "GENE_07"
#> [16,] "smoke" "SMKADSTY" "SMKCDSTY" "SMKEDSTY"
#> [17,] "drink" "ALCADTYP" "ALCCDTYP" "ALCEDTYP"
#> [18,] "fruit" "FVCADTOT" "FVCCDTOT" "FVCEDTOT"
#> [19,] "bp" "CCCA_071" "CCCC_071" "CCCE_071"
#> [20,] "copd" "CCCA_91B" "CCCC_91B" "CCCE_91F"
#> [21,] "diab" "CCCA_101" "CCCC_101" "CCCE_101"
#> [22,] "province" "GEOAGPRV" "GEOCGPRV" "GEOEGPRV"
#> [23,] "weight" "WTSAM" "WTSC_M" "WTSE_M"
names(cc11) <- names(cc21) <- names(cc31) <- new.var.names
table(cc11$arthritis)
#>
#> YES NO NOT APPLICABLE DON'T KNOW REFUSAL
#> 24511 106231 0 110 3
#> NOT STATED
#> 25
table(cc21$arthritis)
#>
#> YES NO NOT APPLICABLE DON'T KNOW REFUSAL
#> 29293 104530 0 208 11
#> NOT STATED
#> 30
table(cc31$arthritis)
#>
#> YES NO NOT APPLICABLE DON'T KNOW REFUSAL
#> 28221 103781 0 191 4
#> NOT STATED
#> 24
cc11$cycle <- 11
cc21$cycle <- 21
cc31$cycle <- 31
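Renaming by position only works if every cycle's vector lists the same concepts in the same order. A small sketch of a safeguard, shown on shortened versions of the name vectors; the helper `rename_cycle` and the toy data frames are hypothetical and not part of the tutorial.

```r
# Shortened versions of the name vectors, for illustration only.
new.names  <- c("CVD", "arthritis", "weight")
old.names1 <- c("CCCA_121", "CCCA_051", "WTSAM")
old.names2 <- c("CCCC_121", "CCCC_051", "WTSC_M")

# Hypothetical helper: select columns by old name, then rename by position.
rename_cycle <- function(dat, old, new) {
  stopifnot(length(old) == length(new), all(old %in% names(dat)))
  out <- dat[old]
  names(out) <- new
  out
}

# Toy cycles with columns deliberately stored in a different order.
d1 <- data.frame(WTSAM = 10.5, CCCA_051 = "NO", CCCA_121 = "YES")
d2 <- data.frame(CCCC_051 = "YES", WTSC_M = 20, CCCC_121 = "NO")

r1 <- rename_cycle(d1, old.names1, new.names)
r2 <- rename_cycle(d2, old.names2, new.names)

# Identical names are what rbind() needs when the cycles are stacked.
identical(names(r1), names(r2))
```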
Appending
We now combine the data frames cc11, cc21, and cc31 by stacking them on top of each other.
cc123a <- rbind(cc11,cc21,cc31)
dim(cc123a)
#> [1] 397173 24
names(cc123a)
#> [1] "CVD" "arthritis" "arthritis.kind" "age"
#> [5] "sex" "married" "race" "immigration"
#> [9] "recent.immigrant" "edu" "income" "bmi"
#> [13] "phyact" "doctor" "stress" "smoke"
#> [17] "drink" "fruit" "bp" "copd"
#> [21] "diab" "province" "weight" "cycle"
cc123a$ID <- 1:nrow(cc123a)
Variables
Sampling weight
We use the summary function to provide basic statistics (like mean, median, min, max, etc.) for the weight column in the data frame cc123a.
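A minimal sketch of such a summary on invented weights; only the column names `weight` and `cycle` come from the tutorial, and the numbers are made up.

```r
# Invented sampling weights for two cycles.
toy <- data.frame(
  weight = c(12.5, 300, 48.2, 150, 75),
  cycle  = c(11, 11, 21, 21, 21)
)

# Min, quartiles, median, mean and max of the weight column.
weight_summary <- summary(toy$weight)
weight_summary

# The same summary within each cycle, since each cycle supplies its own master weight.
by_cycle <- tapply(toy$weight, toy$cycle, summary)
by_cycle
```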
Exposure
The following chunk creates frequency tables for the ‘arthritis’ and ‘arthritis.kind’ columns.
table(cc123a$arthritis)
#>
#> YES NO NOT APPLICABLE DON'T KNOW REFUSAL
#> 82025 314542 0 509 18
#> NOT STATED
#> 79
table(cc123a$arthritis.kind)
#>
#> RHEUMATOID ARTH OSTEOARTHRITIS OTHER NOT APPLICABLE DON'T KNOW
#> 19099 40943 7305 314542 12354
#> REFUSAL NOT STATED RHEUMATISM
#> 215 619 2096
sum(cc123a$arthritis=="NO")
#> [1] 314542
sum(cc123a$arthritis.kind=="NOT APPLICABLE")
#> [1] 314542
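Matching totals suggest, but do not prove, that the two conditions pick out the same respondents. A cross-tabulation checks this row by row. The sketch below uses invented rows with the tutorial's column names:

```r
# Invented rows; labels follow the CCHS categories shown above.
toy <- data.frame(
  arthritis      = c("YES", "YES", "NO", "NO", "DON'T KNOW"),
  arthritis.kind = c("OSTEOARTHRITIS", "RHEUMATOID ARTH",
                     "NOT APPLICABLE", "NOT APPLICABLE", "NOT STATED")
)

# Every 'NO' should sit in the 'NOT APPLICABLE' column and nowhere else.
xtab <- table(toy$arthritis, toy$arthritis.kind)
xtab

# TRUE when the two conditions select exactly the same rows.
same_rows <- all((toy$arthritis == "NO") ==
                 (toy$arthritis.kind == "NOT APPLICABLE"))
same_rows
```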
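We create the exposure variable, which distinguishes respondents with OA from controls (respondents who report no arthritis); all other respondents are set to missing.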
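# The subset is left commented out so that all rows are kept;
# respondents who are neither OA nor controls are set to NA below.
# c123sub1 <- subset(cc123a, arthritis.kind == "OSTEOARTHRITIS" |
# arthritis.kind == "NOT APPLICABLE" )
c123sub1 <- cc123a
# dim(c123sub1)
table(c123sub1$arthritis.kind)
#>
#> RHEUMATOID ARTH OSTEOARTHRITIS OTHER NOT APPLICABLE DON'T KNOW
#> 19099 40943 7305 314542 12354
#> REFUSAL NOT STATED RHEUMATISM
#> 215 619 2096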
table(c123sub1$arthritis)
#>
#> YES NO NOT APPLICABLE DON'T KNOW REFUSAL
#> 82025 314542 0 509 18
#> NOT STATED
#> 79
require(car)
c123sub1$arthritis.kind <- car::recode(c123sub1$arthritis.kind,
"'OSTEOARTHRITIS'='OA';
'NOT APPLICABLE'='Control';
else=NA", as.factor = FALSE)
table(c123sub1$arthritis.kind, useNA = "always")
#>
#> Control OA <NA>
#> 314542 40943 41688
c123sub1$OA <- c123sub1$arthritis.kind
c123sub1$arthritis.kind <- NULL
c123sub1$arthritis <- NULL
dim(c123sub1)
#> [1] 397173 24
dim(c123sub1)[1]-dim(cc123a)[1]
#> [1] 0
Outcome
We create the outcome variable.
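# table(c123sub1$CVD)
# Subset left commented out: all rows are kept and non-responses become NA below.
# c123sub2 <- subset(c123sub1, CVD == "YES" | CVD == "NO")
c123sub2 <- c123sub1
# table(c123sub2$CVD)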
dim(c123sub2)
#> [1] 397173 24
c123sub2$CVD <- car::recode(c123sub2$CVD,
"'YES'='event';
'NO'='no event';
else=NA",
as.factor = FALSE)
table(c123sub2$CVD, useNA = "always")
#>
#> event no event <NA>
#> 25524 371121 528
dim(c123sub2)
#> [1] 397173 24
dim(c123sub2)[1]-dim(c123sub1)[1]
#> [1] 0
Covariates
Age
Recodes the ‘age’ column into broader age categories.
c123sub2$age <- car::recode(c123sub2$age,
"c('12 TO 14 YEARS','15 TO 19 YEARS',
'15 TO 17 YEARS', '18 TO 19 YEARS')='teen';
c('20 TO 24 YEARS','25 TO 29 YEARS')=
'20-29 years';
c('30 TO 34 YEARS','35 TO 39 YEARS')=
'30-39 years';
c('40 TO 44 YEARS','45 TO 49 YEARS')=
'40-49 years';
c('50 TO 54 YEARS','55 TO 59 YEARS')=
'50-59 years';
c('60 TO 64 YEARS')='60-64 years';
else='65 years and over'",
as.factor = FALSE)
table(c123sub2$age)
#>
#> 20-29 years 30-39 years 40-49 years 50-59 years
#> 48652 63810 65111 61035
#> 60-64 years 65 years and over teen
#> 25265 80913 52387
# dim(c123sub2)
# Age restriction left commented out: all age groups are kept at this stage.
# c123sub3 <- subset(c123sub2, age != 'teen' & age != '65 years and over')
c123sub3 <- c123sub2
table(c123sub3$age, useNA = "always")
#>
#> 20-29 years 30-39 years 40-49 years 50-59 years
#> 48652 63810 65111 61035
#> 60-64 years 65 years and over teen <NA>
#> 25265 80913 52387 0
dim(c123sub3)
#> [1] 397173 24
Sex
Recodes ‘sex’ to ‘Male’ or ‘Female’.
table(c123sub3$sex)
#>
#> MALE FEMALE NOT APPLICABLE NOT STATED DON'T KNOW
#> 182523 214650 0 0 0
#> REFUSAL
#> 0
c123sub3$sex <- car::recode(c123sub3$sex,
"'MALE'='Male';
'FEMALE' = 'Female';
else = NA",
as.factor = FALSE)
table(c123sub3$sex, useNA = "always")
#>
#> Female Male <NA>
#> 214650 182523 0
Marital status
Recodes the ‘married’ column into two categories: ‘not single’ and ‘single’.
table(c123sub3$married)
#>
#> MARRIED COMMON-LAW WIDOW/SEP/DIV SINGLE
#> 171304 30909 75885 78261
#> REFUSAL NOT STATED NOT APPLICABLE DON'T KNOW
#> 0 661 0 0
#> SINGLE/NEVER MAR
#> 40153
c123sub3$married <- car::recode(c123sub3$married,
"c('MARRIED', 'COMMON-LAW')='not single';
c('WIDOW/SEP/DIV', 'SINGLE',
'SINGLE/NEVER MAR') = 'single';
else = NA",
as.factor = FALSE)
table(c123sub3$married, useNA = "always")
#>
#> not single single <NA>
#> 202213 194299 661
Race/ethnicity
Recodes ‘race’ into ‘White’ and ‘Non-white’.
table(c123sub3$race)
#>
#> WHITE VISIBLE MINORITY NOT APPLICABLE NOT STATED
#> 349222 38641 0 9310
#> DON'T KNOW REFUSAL
#> 0 0
c123sub3$race <- car::recode(c123sub3$race,
"'WHITE'='White';
'VISIBLE MINORITY' = 'Non-white';
else = NA",
as.factor = FALSE)
table(c123sub3$race, useNA = "always")
#>
#> Non-white White <NA>
#> 38641 349222 9310
Immigration
Creates a new column, ‘immigrate’, for immigration status based on the ‘recent.immigrant’ column, then removes the original ‘recent.immigrant’ and ‘immigration’ columns.
table(c123sub3$recent.immigrant)
#>
#> 0 TO 9 YEARS 10 YEARS OR MORE NOT APPLICABLE NOT STATED
#> 10644 26746 338078 7975
#> DON'T KNOW REFUSAL 10 OR MORE YEARS
#> 0 0 13730
# The label for long-term immigrants differs across cycles:
# '10 YEARS OR MORE' vs '10 OR MORE YEARS'
c123sub3$immigrate <- car::recode(c123sub3$recent.immigrant,
"'0 TO 9 YEARS'='recent';
c('10 YEARS OR MORE',
'10 OR MORE YEARS') = '> 10 years';
'NOT APPLICABLE' = 'not immigrant';
else = NA",
as.factor = FALSE)
table(c123sub3$immigrate, useNA = "always")
#>
#> > 10 years not immigrant recent <NA>
#> 40476 338078 10644 7975
c123sub3$recent.immigrant <- NULL
c123sub3$immigration <- NULL
Education
Recode educational status into specified categories.
table(c123sub3$edu)
#>
#> < THAN SECONDARY SECONDARY GRAD. OTHER POST-SEC. POST-SEC. GRAD.
#> 124425 64753 29000 171972
#> NOT APPLICABLE NOT STATED DON'T KNOW REFUSAL
#> 0 7023 0 0
c123sub3$edu <- car::recode(c123sub3$edu,
"'< THAN SECONDARY'='< 2ndary';
'SECONDARY GRAD.' = '2nd grad.';
'POST-SEC. GRAD.' = 'Post-2nd grad.';
'OTHER POST-SEC.' = 'Other 2nd grad.';
else = NA",
as.factor = FALSE)
table(c123sub3$edu, useNA = "always")
#>
#> < 2ndary 2nd grad. Other 2nd grad. Post-2nd grad. <NA>
#> 124425 64753 29000 171972 7023
Income
Recodes income levels into broader categories.
table(c123sub3$income)
#>
#> NO INCOME LESS THAN 15,000 $15,000-$29,999 $30,000-$49,999
#> 12636 14103 63766 77940
#> $50,000-$79,999 $80,000 OR MORE NOT APPLICABLE NOT STATED
#> 85108 75715 0 57079
#> DON'T KNOW REFUSAL NO OR <$15,000
#> 0 0 10826
# cycle 1.1 has: 'NO INCOME','LESS THAN 15,000'
# Other cycles have: 'NO OR <$15,000'
c123sub3$income <- car::recode(c123sub3$income,
"c('NO OR <$15,000', 'NO INCOME',
'LESS THAN 15,000',
'$15,000-$29,999')='$29,999 or less';
'$30,000-$49,999' = '$30,000-$49,999';
'$50,000-$79,999' = '$50,000-$79,999';
'$80,000 OR MORE' = '$80,000 or more';
else = NA",
as.factor = FALSE)
table(c123sub3$income, useNA = "always")
#>
#> $29,999 or less $30,000-$49,999 $50,000-$79,999 $80,000 or more <NA>
#> 101331 77940 85108 75715 57079
BMI
Converts ‘bmi’ column values to numerical data and also categorizes it based on the BMI value.
If you want to reuse the continuous variable later (usually a good idea in a statistical sense), keep a second copy of it before the column is overwritten with the categories.
# table(c123sub3$bmi)
sum(c123sub3$bmi=="NOT APPLICABLE")+
sum(c123sub3$bmi=="REFUSAL")+
sum(c123sub3$bmi=="NOT STATED")+
sum(c123sub3$bmi=="DON'T KNOW")
#> [1] 72303
c123sub3$bmi <- car::recode(c123sub3$bmi,
'c("NOT APPLICABLE", "REFUSAL",
"NOT STATED", "DON\'T KNOW")=NA',
as.factor = FALSE)
#table(c123sub3$bmi, useNA = "always")
c123sub3$bmi <- as.numeric(as.character(c123sub3$bmi))
summary(c123sub3$bmi)
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> 11.91 22.47 25.30 25.86 28.50 57.90 72303
c123sub3$bmi2 <- cut(c123sub3$bmi,
breaks = c(0,18.5,25,Inf),
right = TRUE,
labels = c("Underweight",
"healthy weight",
"Overweight"))
c123sub3$bmi <- c123sub3$bmi2
c123sub3$bmi2 <- NULL
table(c123sub3$bmi, useNA = "always")
#>
#> Underweight healthy weight Overweight <NA>
#> 10116 148533 166221 72303
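Keeping the continuous copy mentioned above could look like the following sketch. The column name `bmi.cont` and the toy values are assumptions for illustration, and base R replaces `car::recode` to keep the sketch dependency-free; the cut points and labels are the ones used above.

```r
# Invented BMI values stored as a factor, with one non-numeric category.
toy <- data.frame(bmi = c("17.2", "18.5", "22.4", "25", "31.8", "NOT STATED"),
                  stringsAsFactors = TRUE)

# Turn survey categories into NA before converting to numbers.
bmi_chr <- as.character(toy$bmi)
bmi_chr[bmi_chr %in% c("NOT APPLICABLE", "REFUSAL",
                       "NOT STATED", "DON'T KNOW")] <- NA
toy$bmi.cont <- as.numeric(bmi_chr)   # hypothetical name for the continuous copy

# Categorical version; right = TRUE puts 18.5 and 25 in the lower interval.
toy$bmi <- cut(toy$bmi.cont,
               breaks = c(0, 18.5, 25, Inf),
               right = TRUE,
               labels = c("Underweight", "healthy weight", "Overweight"))

toy
```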
Physical activity
Recodes physical activity levels.
table(c123sub3$phyact)
#>
#> ACTIVE MODERATE INACTIVE NOT APPLICABLE NOT STATED
#> 98571 93507 190739 0 14356
#> DON'T KNOW REFUSAL
#> 0 0
c123sub3$phyact <- car::recode(c123sub3$phyact,
"'ACTIVE'='Active';
'MODERATE' = 'Moderate';
'INACTIVE' = 'Inactive';
else = NA",
as.factor = FALSE)
table(c123sub3$phyact, useNA = "always")
#>
#> Active Inactive Moderate <NA>
#> 98571 190739 93507 14356
Doctor
Recodes whether someone has access to a doctor into ‘Yes’ or ‘No’.
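A sketch of what this recoding might look like. It assumes the column carries 'YES' and 'NO' plus the usual non-response labels, which is an assumption based on the other yes/no items in this tutorial rather than on the codebook. A named lookup vector in base R gives the same kind of result as the `car::recode` pattern used elsewhere:

```r
# Invented values; the label set is assumed, not taken from the codebook.
doctor <- factor(c("YES", "NO", "YES", "NOT STATED", "DON'T KNOW"))

# Named lookup: names are the old labels, values are the new ones.
lookup <- c("YES" = "Yes", "NO" = "No")

# Labels missing from the lookup come back as NA, like 'else = NA'.
doctor_new <- unname(lookup[as.character(doctor)])

table(doctor_new, useNA = "always")
```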
Stress
Recodes stress levels into ‘Not too stressed’ or ‘stressed’.
table(c123sub3$stress)
#>
#> NOT AT ALL NOT VERY A BIT QUITE A BIT EXTREMELY
#> 51020 89612 146112 68203 14094
#> NOT APPLICABLE DON'T KNOW REFUSAL NOT STATED
#> 26737 1244 145 6
c123sub3$stress <- car::recode(c123sub3$stress,
"c('NOT AT ALL','NOT VERY','A BIT')=
'Not too stressed';
c('QUITE A BIT','EXTREMELY') =
'stressed';
else = NA",
as.factor = FALSE)
table(c123sub3$stress, useNA = "always")
#>
#> Not too stressed stressed <NA>
#> 286744 82297 28132
Smoking
Recodes smoking status into ‘Current smoker’, ‘Former smoker’, or ‘Never smoker’.
table(c123sub3$smoke)
#>
#> DAILY OCCASIONAL ALWAYS OCCASION. FORMER DAILY
#> 79878 10931 7043 99450
#> FORMER OCCASION. NEVER SMOKED NOT APPLICABLE NOT STATED
#> 58120 140017 0 1734
#> DON'T KNOW REFUSAL
#> 0 0
c123sub3$smoke <- car::recode(c123sub3$smoke,
"c('DAILY','OCCASIONAL',
'ALWAYS OCCASION.')='Current smoker';
c('FORMER DAILY',
'FORMER OCCASION.') = 'Former smoker';
'NEVER SMOKED' = 'Never smoker';
else = NA",
as.factor = FALSE)
table(c123sub3$smoke, useNA = "always")
#>
#> Current smoker Former smoker Never smoker <NA>
#> 97852 157570 140017 1734
Alcohol
Recodes drinking habits into ‘Current drinker’, ‘Former drinker’, or ‘Never drank’.
table(c123sub3$drink)
#>
#> REGULAR DRINKER OCC. DRINKER FORMER DRINKER NEVER DRANK NOT APPLICABLE
#> 219334 76399 55299 40670 0
#> NOT STATED DON'T KNOW REFUSAL
#> 5471 0 0
c123sub3$drink <- car::recode(c123sub3$drink,
"c('REGULAR DRINKER',
'OCC. DRINKER')='Current drinker';
c('FORMER DRINKER') = 'Former drinker';
'NEVER DRANK' = 'Never drank';
else = NA",
as.factor = FALSE)
table(c123sub3$drink, useNA = "always")
#>
#> Current drinker Former drinker Never drank <NA>
#> 295733 55299 40670 5471
Fruit and vegetable consumption
Converts fruit and vegetable consumption to numerical values and then categorizes it.
If you want to reuse the continuous variable later (usually a good idea in a statistical sense), keep a second copy of it before the column is overwritten with the categories.
str(c123sub3$fruit)
#> Factor w/ 303 levels "0","0.1","0.2",..: 57 48 42 72 67 120 39 54 61 51 ...
#c123sub3$fruit.cont <- c123sub3$fruit
c123sub3$fruit2 <- as.numeric(as.character(c123sub3$fruit))
# Note: do not use as.numeric(c123sub3$fruit)
summary(c123sub3$fruit2)
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> 0.00 3.00 4.30 4.75 6.00 80.00 75347
c123sub3$fruit2 <- cut(c123sub3$fruit2,
breaks = c(0,3,6,Inf),
right = TRUE,
include.lowest = TRUE, # keep respondents reporting 0 servings
labels = c("0-3 daily serving",
"4-6 daily serving",
"6+ daily serving"))
table(c123sub3$fruit2, useNA = "always")
#>
#> 0-3 daily serving 4-6 daily serving 6+ daily serving <NA>
#> 83589 159380 78857 75347
c123sub3$fruit <- c123sub3$fruit2
c123sub3$fruit2 <- NULL
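The warning against `as.numeric(c123sub3$fruit)` is worth a short demonstration: applied directly to a factor, `as.numeric()` returns the internal level codes rather than the printed values. The values below are invented.

```r
# A factor whose labels look like numbers.
fruit <- factor(c("4.3", "0.5", "12", "4.3"))

# Wrong: returns the position of each label among the sorted levels.
codes <- as.numeric(fruit)

# Right: go through the character labels first.
values <- as.numeric(as.character(fruit))

rbind(codes, values)
```

The two rows differ, and nothing warns about it, which is why the conversion always goes through `as.character()` in this tutorial.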
Hypertension
The following chunk of code is concerned with recoding the blood pressure column. The code also shows the distribution of the values before and after recoding.
COPD
The following chunk of code is concerned with recoding the COPD column.
Diabetes
The following chunk of code is concerned with recoding the diabetes column.
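A sketch of one way to handle the blood pressure, COPD, and diabetes columns together, printing the distribution before and after. It assumes each column carries 'YES' and 'NO' plus non-response labels; the new labels 'Yes' and 'No', the helper `recode_yes_no`, and the toy rows are also assumptions.

```r
# Invented rows using the tutorial's column names.
toy <- data.frame(
  bp   = c("YES", "NO", "NO", "NOT STATED"),
  copd = c("NO", "NOT APPLICABLE", "YES", "NO"),
  diab = c("NO", "NO", "DON'T KNOW", "YES"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: anything other than YES/NO becomes NA.
recode_yes_no <- function(x) {
  x <- as.character(x)
  ifelse(x == "YES", "Yes", ifelse(x == "NO", "No", NA_character_))
}

for (v in c("bp", "copd", "diab")) {
  cat("\n", v, "before:\n")
  print(table(toy[[v]], useNA = "always"))
  toy[[v]] <- recode_yes_no(toy[[v]])
  cat(v, "after:\n")
  print(table(toy[[v]], useNA = "always"))
}
```

For `copd`, 'NOT APPLICABLE' reflects the age-30-and-over universe described earlier, so under this scheme those rows become missing rather than 'No'.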
North
This section is concerned with recoding the province column. It groups provinces into “North” and “South” based on their names.
- Note that the category names might not exactly match the data dictionary:
  - Particularly note the two versions of QUEBEC
  - PEI is written in short form
  - None in:
    - DON’T KNOW
    - REFUSAL
    - NOT STATED
    - NOT APPLICABLE
c123sub3$province.check <- c123sub3$province
table(c123sub3$province)
#>
#> NEWFOUNDLAND PEI NOVA SCOTIA NEW BRUNSWICK
#> 7924 7744 15341 15025
#> QU\xc9BEC ONTARIO MANITOBA SASKATCHEWAN
#> 22012 123821 23454 23361
#> ALBERTA BRITISH COLUMBIA YUKON/NWT/NUNAVT NOT APPLICABLE
#> 40127 49767 5064 0
#> DON'T KNOW REFUSAL NOT STATED QUEBEC
#> 0 0 0 56764
#> NFLD & LAB. YUKON/NWT/NUNA.
#> 4111 2658
c123sub3$province <- car::recode(c123sub3$province,
"c('YUKON/NWT/NUNAVT','YUKON/NWT/NUNA.')=
'North'; else = 'South'",
as.factor = FALSE)
table(c123sub3$province, useNA = "always")
#>
#> North South <NA>
#> 7722 389451 0
Dimension testing
This chunk verifies the dimensions of the modified data frames and performs some other operations like setting factors.
save.c123sub3 <- c123sub3
names(c123sub3)
#> [1] "CVD" "age" "sex" "married"
#> [5] "race" "edu" "income" "bmi"
#> [9] "phyact" "doctor" "stress" "smoke"
#> [13] "drink" "fruit" "bp" "copd"
#> [17] "diab" "province" "weight" "cycle"
#> [21] "ID" "OA" "immigrate" "province.check"
# c123sub3$phyact <- NULL
# c123sub3$stress <- NULL
# c123sub3$fruit <- NULL
# c123sub3$copd <- NULL
# c123sub3$province.check <- NULL
dim(c123sub3)
#> [1] 397173 24
dim(c123sub2)[1]-dim(c123sub3)[1]
#> [1] 0
analytic <- c123sub3
dim(analytic)
#> [1] 397173 24
analytic$cycle <- as.factor(analytic$cycle)
names(analytic)
#> [1] "CVD" "age" "sex" "married"
#> [5] "race" "edu" "income" "bmi"
#> [9] "phyact" "doctor" "stress" "smoke"
#> [13] "drink" "fruit" "bp" "copd"
#> [17] "diab" "province" "weight" "cycle"
#> [21] "ID" "OA" "immigrate" "province.check"
Saving data
We save the data for future use:
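A minimal sketch of the save step, writing to a temporary directory. The file name `cchs123_toy.RData` is a placeholder, and only the object name `analytic` comes from the tutorial.

```r
# Stand-in for the analytic data set built above.
analytic <- data.frame(CVD = c("event", "no event"), OA = c("OA", "Control"))

# Placeholder location; point this at your own project folder.
out_file <- file.path(tempdir(), "cchs123_toy.RData")
save(analytic, file = out_file)

# Reload into a clean environment to confirm the round trip.
check_env <- new.env()
load(out_file, envir = check_env)
identical(check_env$analytic, analytic)
```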
Video content (optional)
For those who prefer a video walkthrough, feel free to watch the video below, which offers a description of an earlier version of the above content.