Exercise 1 Solution (W)
Use the functions we learned in Lab 1 to complete Lab 1 Exercise. We will use Right Heart Catheterization Dataset saved in the folder named ‘Data/wrangling/’. The variable list and description can be accessed from Vanderbilt Biostatistics website.
A paper you can access the original table from this paper (doi: 10.1001/jama.1996.03540110043030). We have modified the table and corrected some issues. Please knit your file once you finished and submit the knitted file ONLY.
Problem 1: Basic Manipulation
- Continuous to Categories: Change the Age variable into categories below 50, 50 to below 60, 60 to below 70, 70 to below 80, 80 and above [Hint: the
cut
function could be helpful]
- Re-order: Re-order the levels of race to white, black and other
- Set reference: Change the reference category for gender to Male
- Count levels: Check how many levels does the variable “cat1” (Primary disease category) have? Regroup the levels for disease categories to “ARF”,“CHF”,“MOSF”,“Other”. [Hint: the
nlevels
andlist
functions could be helpful]
nlevels(as.factor(rhc$cat1)) # There are nine levels
#> [1] 9
rhc$cat1 <- as.factor(rhc$cat1)
levels(rhc$cat1) <- list(ARF = "ARF", CHF = "CHF",
MOSF = c("MOSF w/Malignancy", "MOSF w/Sepsis"),
Other = c("Cirrhosis", "Colon Cancer", "Coma", "COPD",
"Lung Cancer"))
rhc$cat1 <- factor(rhc$cat1, levels = c("ARF", "CHF", "MOSF", "Other"))
levels(rhc$cat1)
#> [1] "ARF" "CHF" "MOSF" "Other"
- Rename levels: Rename the levels of “ca” (Cancer) to “Metastatic”,“None” and “Localized (Yes)”, then re-order the levels to “None”,“Localized (Yes)” and “Metastatic”
- comorbidities:
- Create a new variable called “numcom” to count number of comorbidities illness for each person (12 categories) [Hint: the
rowSums
command could be helpful], - Report maximum and minimum values of numcom:
# See head of comorbidities
# head(rhc[,c("cardiohx", "chfhx", "dementhx", "psychhx", "chrpulhx", "renalhx",
# "liverhx", "gibledhx", "malighx", "immunhx", "transhx", "amihx")])
# number of comorbidities
rhc$numcom <- rowSums(rhc[,c("cardiohx", "chfhx", "dementhx", "psychhx", "chrpulhx",
"renalhx", "liverhx", "gibledhx", "malighx", "immunhx",
"transhx", "amihx") ])
# maximum and minimum
cbind(Maximum = max(rhc$numcom), Minimum = min(rhc$numcom))
#> Maximum Minimum
#> [1,] 6 0
- Anlaytic data: Create a dataset that has only the following variables
age, sex, race, cat1, ca, dnr1, aps1, surv2md1, numcom, adld3p, das2d3pc, temp1, hrt1, meanbp1, resp1, wblc1, pafi1, paco21, ph1, crea1, alb1, scoma1, swang1
name the dataset as rhc2
Problem 2: Table 1
Re-produce the sample table 1 from the rhc2 data (see the Table below). In your table, the variables should be ordered as the same as the sample. Please re-level or re-order the levels if needed. [Hint: the tableone
package might be useful]
No RHC | RHC | |
---|---|---|
n | 3551 | 2184 |
age (%) | ||
[-Inf,50) | 884 (24.9) | 540 (24.7) |
[50,60) | 546 (15.4) | 371 (17.0) |
[60,70) | 812 (22.9) | 577 (26.4) |
[70,80) | 809 (22.8) | 529 (24.2) |
[80, Inf) | 500 (14.1) | 167 ( 7.6) |
sex = Female (%) | 1637 (46.1) | 906 (41.5) |
race (%) | ||
white | 2753 (77.5) | 1707 (78.2) |
black | 585 (16.5) | 335 (15.3) |
other | 213 ( 6.0) | 142 ( 6.5) |
cat1 (%) | ||
ARF | 1581 (44.5) | 909 (41.6) |
CHF | 247 ( 7.0) | 209 ( 9.6) |
Other | 955 (26.9) | 208 ( 9.5) |
MOSF | 768 (21.6) | 858 (39.3) |
ca (%) | ||
None | 2652 (74.7) | 1727 (79.1) |
Localized (Yes) | 638 (18.0) | 334 (15.3) |
Metastatic | 261 ( 7.4) | 123 ( 5.6) |
dnr1 = Yes (%) | 499 (14.1) | 155 ( 7.1) |
aps1 (mean (SD)) | 50.93 (18.81) | 60.74 (20.27) |
surv2md1 (mean (SD)) | 0.61 (0.19) | 0.57 (0.20) |
numcom (mean (SD)) | 1.52 (1.17) | 1.48 (1.13) |
adld3p (mean (SD)) | 1.24 (1.86) | 1.02 (1.69) |
das2d3pc (mean (SD)) | 20.37 (5.48) | 20.70 (5.03) |
temp1 (mean (SD)) | 37.63 (1.74) | 37.59 (1.83) |
hrt1 (mean (SD)) | 112.87 (40.94) | 118.93 (41.47) |
meanbp1 (mean (SD)) | 84.87 (38.87) | 68.20 (34.24) |
resp1 (mean (SD)) | 28.98 (13.95) | 26.65 (14.17) |
wblc1 (mean (SD)) | 15.26 (11.41) | 16.27 (12.55) |
pafi1 (mean (SD)) | 240.63 (116.66) | 192.43 (105.54) |
paco21 (mean (SD)) | 39.95 (14.24) | 36.79 (10.97) |
ph1 (mean (SD)) | 7.39 (0.11) | 7.38 (0.11) |
crea1 (mean (SD)) | 1.92 (2.03) | 2.47 (2.05) |
alb1 (mean (SD)) | 3.16 (0.67) | 2.98 (0.93) |
scoma1 (mean (SD)) | 22.25 (31.37) | 18.97 (28.26) |
rhc2$cat1 <- factor(rhc2$cat1, levels = c("ARF", "CHF", "Other", "MOSF"))
vars <- c("age", "sex", "race", "cat1", "ca", "dnr1", "aps1", "surv2md1", "numcom",
"adld3p", "das2d3pc", "temp1", "hrt1", "meanbp1", "resp1", "wblc1",
"pafi1", "paco21", "ph1", "crea1", "alb1", "scoma1")
library(tableone)
tab1a <- CreateTableOne(vars = vars, data = rhc2, strata = "swang1", includeNA = TRUE,
test = F)
print(tab1a)
#> Stratified by swang1
#> No RHC RHC
#> n 3551 2184
#> age (%)
#> [-Inf,50) 884 (24.9) 540 (24.7)
#> [50,60) 546 (15.4) 371 (17.0)
#> [60,70) 812 (22.9) 577 (26.4)
#> [70,80) 809 (22.8) 529 (24.2)
#> [80, Inf) 500 (14.1) 167 ( 7.6)
#> sex = Female (%) 1637 (46.1) 906 (41.5)
#> race (%)
#> white 2753 (77.5) 1707 (78.2)
#> black 585 (16.5) 335 (15.3)
#> other 213 ( 6.0) 142 ( 6.5)
#> cat1 (%)
#> ARF 1581 (44.5) 909 (41.6)
#> CHF 247 ( 7.0) 209 ( 9.6)
#> Other 955 (26.9) 208 ( 9.5)
#> MOSF 768 (21.6) 858 (39.3)
#> ca (%)
#> None 2652 (74.7) 1727 (79.1)
#> Localized (Yes) 638 (18.0) 334 (15.3)
#> Metastatic 261 ( 7.4) 123 ( 5.6)
#> dnr1 = Yes (%) 499 (14.1) 155 ( 7.1)
#> aps1 (mean (SD)) 50.93 (18.81) 60.74 (20.27)
#> surv2md1 (mean (SD)) 0.61 (0.19) 0.57 (0.20)
#> numcom (mean (SD)) 1.52 (1.17) 1.48 (1.13)
#> adld3p (mean (SD)) 1.24 (1.86) 1.02 (1.69)
#> das2d3pc (mean (SD)) 20.37 (5.48) 20.70 (5.03)
#> temp1 (mean (SD)) 37.63 (1.74) 37.59 (1.83)
#> hrt1 (mean (SD)) 112.87 (40.94) 118.93 (41.47)
#> meanbp1 (mean (SD)) 84.87 (38.87) 68.20 (34.24)
#> resp1 (mean (SD)) 28.98 (13.95) 26.65 (14.17)
#> wblc1 (mean (SD)) 15.26 (11.41) 16.27 (12.55)
#> pafi1 (mean (SD)) 240.63 (116.66) 192.43 (105.54)
#> paco21 (mean (SD)) 39.95 (14.24) 36.79 (10.97)
#> ph1 (mean (SD)) 7.39 (0.11) 7.38 (0.11)
#> crea1 (mean (SD)) 1.92 (2.03) 2.47 (2.05)
#> alb1 (mean (SD)) 3.16 (0.67) 2.98 (0.93)
#> scoma1 (mean (SD)) 22.25 (31.37) 18.97 (28.26)
Problem 3: Table 1 for subset
Produce a similar table as Problem 2 but with only male sex and ARF primary disease category (cat1). Add the overall column in the same table. [Hint: filter
command could be useful]
rhc2.M <- filter(rhc2, sex == "Male" & cat1 == "ARF")
vars <- c("age", "race", "ca", "dnr1", "aps1", "surv2md1", "numcom", "adld3p",
"das2d3pc", "temp1", "hrt1", "meanbp1", "resp1", "wblc1", "pafi1",
"paco21", "ph1", "crea1", "alb1", "scoma1")
tab1b <- CreateTableOne(vars = vars, data = rhc2.M, strata = "swang1", includeNA = T,
addOverall = T, test = F)
print(tab1b)
#> Stratified by swang1
#> Overall No RHC RHC
#> n 1382 888 494
#> age (%)
#> [-Inf,50) 382 (27.6) 267 (30.1) 115 (23.3)
#> [50,60) 198 (14.3) 127 (14.3) 71 (14.4)
#> [60,70) 299 (21.6) 174 (19.6) 125 (25.3)
#> [70,80) 340 (24.6) 201 (22.6) 139 (28.1)
#> [80, Inf) 163 (11.8) 119 (13.4) 44 ( 8.9)
#> race (%)
#> white 1116 (80.8) 700 (78.8) 416 (84.2)
#> black 192 (13.9) 141 (15.9) 51 (10.3)
#> other 74 ( 5.4) 47 ( 5.3) 27 ( 5.5)
#> ca (%)
#> None 1068 (77.3) 670 (75.5) 398 (80.6)
#> Localized (Yes) 253 (18.3) 173 (19.5) 80 (16.2)
#> Metastatic 61 ( 4.4) 45 ( 5.1) 16 ( 3.2)
#> dnr1 = Yes (%) 136 ( 9.8) 105 (11.8) 31 ( 6.3)
#> aps1 (mean (SD)) 54.51 (18.94) 51.90 (18.21) 59.20 (19.34)
#> surv2md1 (mean (SD)) 0.61 (0.17) 0.63 (0.17) 0.59 (0.17)
#> numcom (mean (SD)) 1.34 (1.16) 1.32 (1.16) 1.38 (1.15)
#> adld3p (mean (SD)) 1.00 (1.79) 1.00 (1.78) 1.01 (1.80)
#> das2d3pc (mean (SD)) 21.74 (5.62) 21.67 (5.72) 21.87 (5.44)
#> temp1 (mean (SD)) 37.96 (1.71) 38.02 (1.69) 37.84 (1.76)
#> hrt1 (mean (SD)) 115.96 (39.26) 115.52 (39.39) 116.76 (39.06)
#> meanbp1 (mean (SD)) 79.08 (36.38) 83.69 (36.81) 70.80 (34.11)
#> resp1 (mean (SD)) 29.01 (14.35) 30.27 (14.21) 26.73 (14.33)
#> wblc1 (mean (SD)) 15.80 (12.03) 15.92 (11.50) 15.58 (12.93)
#> pafi1 (mean (SD)) 188.09 (100.74) 208.05 (102.50) 152.20 (86.70)
#> paco21 (mean (SD)) 37.45 (10.03) 38.08 (10.56) 36.32 (8.89)
#> ph1 (mean (SD)) 7.40 (0.10) 7.40 (0.10) 7.39 (0.10)
#> crea1 (mean (SD)) 2.22 (2.25) 2.10 (2.33) 2.45 (2.09)
#> alb1 (mean (SD)) 3.07 (0.68) 3.12 (0.66) 2.98 (0.70)
#> scoma1 (mean (SD)) 18.42 (27.05) 19.48 (28.07) 16.51 (25.02)
Problem 4: Considering eligibility criteria
Produce a similar table as Problem 2 but only for the subjects who meet all of the following eligibility criteria: (i) age is equal to or above 50, (ii) age is below 80 (iii) Glasgow Coma Score is below 61 and (iv) Primary disease categories are either ARF or MOSF. [Hint: droplevels.data.frame
can be a useful function]
rhc3 <- rhc2
rhc3$eligible <- with(rhc3, ifelse((age=="[50,60)" | age=="[60,70)" | age=="[70,80)") &
scoma1 < 61 & (cat1=="ARF" | cat1 == "MOSF"), 1, 0))
rhc3 <- rhc3[rhc3$eligible==1,]
rhc3 <- droplevels.data.frame(rhc3, exclude = c("[-Inf,50)", "[80, Inf)", "61+",
"CHF", "Other"))
vars <- c("age", "sex", "race", "cat1", "ca", "dnr1", "aps1", "surv2md1", "numcom",
"adld3p", "das2d3pc", "temp1", "hrt1", "meanbp1", "resp1", "wblc1",
"pafi1", "paco21", "ph1", "crea1", "alb1", "scoma1")
CreateTableOne(vars = vars, data = rhc3, strata = "swang1", includeNA = F, test = F)
#> Stratified by swang1
#> No RHC RHC
#> n 1226 1102
#> age (%)
#> [50,60) 321 (26.2) 263 (23.9)
#> [60,70) 429 (35.0) 416 (37.7)
#> [70,80) 476 (38.8) 423 (38.4)
#> sex = Female (%) 550 (44.9) 469 (42.6)
#> race (%)
#> white 977 (79.7) 908 (82.4)
#> black 187 (15.3) 135 (12.3)
#> other 62 ( 5.1) 59 ( 5.4)
#> cat1 = MOSF (%) 417 (34.0) 545 (49.5)
#> ca (%)
#> None 812 (66.2) 816 (74.0)
#> Localized (Yes) 283 (23.1) 212 (19.2)
#> Metastatic 131 (10.7) 74 ( 6.7)
#> dnr1 = Yes (%) 150 (12.2) 67 ( 6.1)
#> aps1 (mean (SD)) 54.57 (18.13) 61.82 (18.77)
#> surv2md1 (mean (SD)) 0.59 (0.16) 0.55 (0.16)
#> numcom (mean (SD)) 1.51 (1.14) 1.50 (1.12)
#> adld3p (mean (SD)) 1.18 (1.82) 1.22 (1.90)
#> das2d3pc (mean (SD)) 20.22 (5.22) 20.57 (4.82)
#> temp1 (mean (SD)) 37.85 (1.72) 37.67 (1.82)
#> hrt1 (mean (SD)) 116.68 (39.28) 120.02 (40.26)
#> meanbp1 (mean (SD)) 79.99 (37.36) 67.57 (33.53)
#> resp1 (mean (SD)) 30.57 (12.90) 26.34 (13.90)
#> wblc1 (mean (SD)) 16.35 (13.38) 16.91 (13.06)
#> pafi1 (mean (SD)) 226.30 (110.48) 178.37 (94.37)
#> paco21 (mean (SD)) 37.94 (11.25) 36.38 (10.44)
#> ph1 (mean (SD)) 7.40 (0.10) 7.38 (0.10)
#> crea1 (mean (SD)) 2.20 (2.34) 2.53 (2.06)
#> alb1 (mean (SD)) 3.09 (0.66) 2.94 (1.08)
#> scoma1 (mean (SD)) 14.22 (18.48) 13.65 (17.91)
Optional
Optional 1: Missing values
- Any variables included in rhc2 data had missing values? Name that variable. [Hint:
apply
function could be helpful]
- Count how many NAs does that variable have?
- Produce a table 1 for a complete case data (no missing observations) stratified by
swang1
.
rhc.complete <- rhc2[complete.cases(rhc2), ]
vars <- c("age", "sex", "race", "cat1", "ca", "dnr1", "aps1", "surv2md1", "numcom",
"adld3p", "das2d3pc", "temp1", "hrt1", "meanbp1", "resp1", "wblc1",
"pafi1", "paco21", "ph1", "crea1", "alb1", "scoma1")
CreateTableOne(vars = vars, data = rhc.complete, strata = "swang1", includeNA = F,
test = F)
#> Stratified by swang1
#> No RHC RHC
#> n 1049 390
#> age (%)
#> [-Inf,50) 264 (25.2) 113 (29.0)
#> [50,60) 160 (15.3) 85 (21.8)
#> [60,70) 261 (24.9) 99 (25.4)
#> [70,80) 238 (22.7) 70 (17.9)
#> [80, Inf) 126 (12.0) 23 ( 5.9)
#> sex = Female (%) 480 (45.8) 137 (35.1)
#> race (%)
#> white 813 (77.5) 297 (76.2)
#> black 176 (16.8) 67 (17.2)
#> other 60 ( 5.7) 26 ( 6.7)
#> cat1 (%)
#> ARF 429 (40.9) 127 (32.6)
#> CHF 174 (16.6) 129 (33.1)
#> Other 266 (25.4) 24 ( 6.2)
#> MOSF 180 (17.2) 110 (28.2)
#> ca (%)
#> None 797 (76.0) 324 (83.1)
#> Localized (Yes) 171 (16.3) 46 (11.8)
#> Metastatic 81 ( 7.7) 20 ( 5.1)
#> dnr1 = Yes (%) 87 ( 8.3) 11 ( 2.8)
#> aps1 (mean (SD)) 48.36 (16.34) 49.38 (19.71)
#> surv2md1 (mean (SD)) 0.70 (0.15) 0.69 (0.17)
#> numcom (mean (SD)) 1.74 (1.22) 1.76 (1.23)
#> adld3p (mean (SD)) 1.24 (1.86) 1.02 (1.69)
#> das2d3pc (mean (SD)) 20.36 (7.28) 20.36 (6.96)
#> temp1 (mean (SD)) 37.35 (1.66) 37.24 (1.61)
#> hrt1 (mean (SD)) 112.23 (38.20) 108.66 (39.22)
#> meanbp1 (mean (SD)) 87.35 (37.97) 70.91 (33.38)
#> resp1 (mean (SD)) 30.43 (11.65) 25.25 (12.73)
#> wblc1 (mean (SD)) 14.45 (11.16) 14.75 (13.09)
#> pafi1 (mean (SD)) 250.90 (112.53) 238.90 (104.11)
#> paco21 (mean (SD)) 41.77 (14.86) 37.16 (8.57)
#> ph1 (mean (SD)) 7.39 (0.10) 7.40 (0.09)
#> crea1 (mean (SD)) 2.03 (2.27) 2.22 (2.05)
#> alb1 (mean (SD)) 3.26 (0.65) 3.19 (0.64)
#> scoma1 (mean (SD)) 5.25 (15.83) 6.54 (17.20)
Optional 2: Calculating variance of a sample
Write a function
for Bessel’s correction to calculate an unbiased estimate of the population variance from a finite sample (a vector of 100 observations, consisting of numbers from 1 to 100).
Hint: Take a closer look at the functions, loops and algorithms shown in lab materials. Use a for loop
, utilizing the following pseudocode of the algorithm:
Verify that estimated variance with the following variance function output in R: