# Data
mpg <- read.csv("Data/wrangling/mpg.csv", header = TRUE)
# Frequency table for drv
table(mpg$drv)
#>
#> 4 f r
#> 103 106 25
# Frequency table for manufacturer
table(mpg$manufacturer)
#>
#>       audi  chevrolet      dodge       ford      honda    hyundai       jeep
#>         18         19         37         25          9         14          8
#> land rover    lincoln    mercury     nissan    pontiac     subaru     toyota
#>          4          3          4         13          5         14         34
#> volkswagen
#>         27
## Example: create a summary table between manufacturer and drv
table(mpg$drv, mpg$manufacturer)
#>
#>     audi chevrolet dodge ford honda hyundai jeep land rover lincoln mercury
#>   4   11         4    26   13     0       0    8          4       0       4
#>   f    7         5    11    0     9      14    0          0       0       0
#>   r    0        10     0   12     0       0    0          0       3       0
#>
#>     nissan pontiac subaru toyota volkswagen
#>   4      4       0     14     15          0
#>   f      9       5      0     19         27
#>   r      0       0      0      0          0
Summary tables
Medical research and epidemiology often involve large, complex datasets. Data summarization is a vital step that transforms these vast datasets into concise, understandable insights. In medical contexts, these summaries can highlight patterns, indicate data inconsistencies, and guide further research. This tutorial will teach you how to use R to efficiently summarize medical data.
In epidemiology and medical research, “Table 1” typically refers to the first table in a research paper or report that provides descriptive statistics of the study population. It offers a snapshot of the baseline characteristics of the study groups, whether in a cohort study, clinical trial, or any other study design.
The code above first reads the mpg data from a CSV file with read.csv(). It then uses the table() function to build one-way frequency tables for drv (drive) and manufacturer, and finally a contingency table (cross-tabulation) between these two categorical variables, which counts how many times each combination of drv and manufacturer appears in the dataset.
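As a small illustration of the cross-tabulation step, the sketch below builds a two-way table from a toy data frame and appends row and column totals with addmargins(). The data frame `cars` and its values are made up for this example and only stand in for the mpg columns.

```r
# Toy data standing in for the mpg columns (values are made up)
cars <- data.frame(
  drv          = c("4", "f", "f", "r", "4", "f"),
  manufacturer = c("audi", "audi", "honda", "ford", "ford", "honda")
)

# Two-way contingency table: rows = drv, columns = manufacturer
tab <- table(drv = cars$drv, manufacturer = cars$manufacturer)
tab

# Append a "Sum" row and a "Sum" column holding the marginal totals
tab_totals <- addmargins(tab)
tab_totals
```

Combinations that never occur, such as drv "r" with manufacturer "audi" here, appear in the table with a count of 0.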
## Get column-wise proportions using prop.table
prop.table(table(mpg$drv, mpg$manufacturer), margin = 2)
#>
#>          audi chevrolet     dodge      ford     honda   hyundai      jeep
#>   4 0.6111111 0.2105263 0.7027027 0.5200000 0.0000000 0.0000000 1.0000000
#>   f 0.3888889 0.2631579 0.2972973 0.0000000 1.0000000 1.0000000 0.0000000
#>   r 0.0000000 0.5263158 0.0000000 0.4800000 0.0000000 0.0000000 0.0000000
#>
#>     land rover   lincoln   mercury    nissan   pontiac    subaru    toyota
#>   4  1.0000000 0.0000000 1.0000000 0.3076923 0.0000000 1.0000000 0.4411765
#>   f  0.0000000 0.0000000 0.0000000 0.6923077 1.0000000 0.0000000 0.5588235
#>   r  0.0000000 1.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#>
#>     volkswagen
#>   4  0.0000000
#>   f  1.0000000
#>   r  0.0000000
## margin = 1: proportions within each row; margin = 2: within each column
This code calculates column-wise proportions for each combination of drv and manufacturer. The prop.table() function divides each count by a marginal total. The margin = 2 argument divides by the column totals, so the proportions within each manufacturer sum to 1; margin = 1 would divide by the row totals, so the proportions within each drv level sum to 1. Multiply the result by 100 to express the proportions as percentages.
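The difference between the two margins can be checked directly. The sketch below reuses the audi and chevrolet counts from the cross-tabulation shown earlier, typed in by hand as a matrix (the object names are illustrative only), and confirms which direction sums to 1.

```r
# Counts for two manufacturers, copied from the drv-by-manufacturer table
counts <- matrix(c(11, 7, 0,    # audi
                    4, 5, 10),  # chevrolet
                 nrow = 3,
                 dimnames = list(drv = c("4", "f", "r"),
                                 manufacturer = c("audi", "chevrolet")))

# margin = 2: divide by column totals, so each column sums to 1
col_prop <- prop.table(counts, margin = 2)
colSums(col_prop)

# margin = 1: divide by row totals, so each row sums to 1
row_prop <- prop.table(counts, margin = 1)
rowSums(row_prop)

# Express the column-wise proportions as percentages with one decimal
col_pct <- round(100 * col_prop, 1)
col_pct
```

The audi column of `col_prop` reproduces the 0.6111111 and 0.3888889 shown in the output above.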
tableone package
The tableone package offers the CreateTableOne function, which creates “Table 1” type summary tables of the kind commonly used in epidemiological studies. Type ?tableone::CreateTableOne to see more details.
require(tableone)
#> Loading required package: tableone
CreateTableOne(vars = c("cyl", "drv", "hwy", "cty"), data = mpg,
strata = "trans", includeNA = TRUE, test = FALSE)
#> Stratified by trans
#> auto(av) auto(l3) auto(l4) auto(l5)
#> n 5 2 83 39
#> cyl (mean (SD)) 5.20 (1.10) 4.00 (0.00) 6.14 (1.62) 6.56 (1.45)
#> drv (%)
#> 4 0 ( 0.0) 0 ( 0.0) 34 (41.0) 29 (74.4)
#> f 5 (100.0) 2 (100.0) 37 (44.6) 8 (20.5)
#> r 0 ( 0.0) 0 ( 0.0) 12 (14.5) 2 ( 5.1)
#> hwy (mean (SD)) 27.80 (2.59) 27.00 (4.24) 21.96 (5.64) 20.72 (6.04)
#> cty (mean (SD)) 20.00 (2.00) 21.00 (4.24) 15.94 (3.98) 14.72 (3.49)
#> Stratified by trans
#> auto(l6) auto(s4) auto(s5) auto(s6)
#> n 6 3 3 16
#> cyl (mean (SD)) 7.33 (1.03) 5.33 (2.31) 6.00 (2.00) 6.00 (1.59)
#> drv (%)
#> 4 2 (33.3) 2 (66.7) 1 (33.3) 7 (43.8)
#> f 2 (33.3) 1 (33.3) 2 (66.7) 8 (50.0)
#> r 2 (33.3) 0 ( 0.0) 0 ( 0.0) 1 ( 6.2)
#> hwy (mean (SD)) 20.00 (2.37) 25.67 (1.15) 25.33 (6.66) 25.19 (3.99)
#> cty (mean (SD)) 13.67 (1.86) 18.67 (2.31) 17.33 (5.03) 17.38 (3.22)
#> Stratified by trans
#> manual(m5) manual(m6)
#> n 58 19
#> cyl (mean (SD)) 5.00 (1.30) 6.00 (1.76)
#> drv (%)
#> 4 21 (36.2) 7 (36.8)
#> f 33 (56.9) 8 (42.1)
#> r 4 ( 6.9) 4 (21.1)
#> hwy (mean (SD)) 26.29 (5.99) 24.21 (5.75)
#> cty (mean (SD)) 19.26 (4.56) 16.89 (3.83)
The CreateTableOne function creates a summary table for the variables cyl, drv, hwy, and cty from the mpg dataset. Because cyl is stored as a number, it is summarized by its mean and SD rather than as a categorical variable. The strata = "trans" argument means that the summary is stratified by the trans variable. The includeNA = TRUE argument means that missing values (NAs) in categorical variables are treated as a regular level and therefore counted in the summary. The test = FALSE argument indicates that no statistical tests should be applied to the data (such tests are often used to compare the groups in the table).
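To make the cell contents concrete, here is a rough base-R sketch of the kind of numbers a stratified table reports: mean (SD) for a continuous variable and counts with column percentages for a categorical one. It is not how tableone is implemented, and the toy data frame `cars` and its values are made up for illustration.

```r
# Toy data standing in for mpg (values are made up for illustration)
cars <- data.frame(
  trans = c("auto", "auto", "auto", "manual", "manual", "manual"),
  drv   = c("f", "f", "4", "4", "r", "f"),
  hwy   = c(30, 28, 20, 26, 24, 31)
)

# Continuous variable: mean and SD of hwy within each level of trans
hwy_mean <- tapply(cars$hwy, cars$trans, mean)
hwy_sd   <- tapply(cars$hwy, cars$trans, sd)
sprintf("%.2f (%.2f)", hwy_mean, hwy_sd)

# Categorical variable: counts of drv within each level of trans ...
drv_n <- table(drv = cars$drv, trans = cars$trans)
drv_n

# ... and the matching column percentages (each stratum sums to 100)
drv_pct <- round(100 * prop.table(drv_n, margin = 2), 1)
drv_pct
```

The percentages use margin = 2 because the strata are the columns, which matches the layout of the stratified table above.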
table1 package
This section introduces another package, table1, which can also be used to create “Table 1” type summary tables.
require(table1)
#> Loading required package: table1
#>
#> Attaching package: 'table1'
#> The following objects are masked from 'package:base':
#>
#> units, units<-
table1(~ cyl + drv + hwy + cty | trans, data=mpg)
|  | auto(av) (N=5) | auto(l3) (N=2) | auto(l4) (N=83) | auto(l5) (N=39) | auto(l6) (N=6) | auto(s4) (N=3) | auto(s5) (N=3) | auto(s6) (N=16) | manual(m5) (N=58) | manual(m6) (N=19) | Overall (N=234) |
|---|---|---|---|---|---|---|---|---|---|---|---|
cyl | |||||||||||
Mean (SD) | 5.20 (1.10) | 4.00 (0) | 6.14 (1.62) | 6.56 (1.45) | 7.33 (1.03) | 5.33 (2.31) | 6.00 (2.00) | 6.00 (1.59) | 5.00 (1.30) | 6.00 (1.76) | 5.89 (1.61) |
Median [Min, Max] | 6.00 [4.00, 6.00] | 4.00 [4.00, 4.00] | 6.00 [4.00, 8.00] | 6.00 [4.00, 8.00] | 8.00 [6.00, 8.00] | 4.00 [4.00, 8.00] | 6.00 [4.00, 8.00] | 6.00 [4.00, 8.00] | 4.00 [4.00, 8.00] | 6.00 [4.00, 8.00] | 6.00 [4.00, 8.00] |
drv | |||||||||||
f | 5 (100%) | 2 (100%) | 37 (44.6%) | 8 (20.5%) | 2 (33.3%) | 1 (33.3%) | 2 (66.7%) | 8 (50.0%) | 33 (56.9%) | 8 (42.1%) | 106 (45.3%) |
4 | 0 (0%) | 0 (0%) | 34 (41.0%) | 29 (74.4%) | 2 (33.3%) | 2 (66.7%) | 1 (33.3%) | 7 (43.8%) | 21 (36.2%) | 7 (36.8%) | 103 (44.0%) |
r | 0 (0%) | 0 (0%) | 12 (14.5%) | 2 (5.1%) | 2 (33.3%) | 0 (0%) | 0 (0%) | 1 (6.3%) | 4 (6.9%) | 4 (21.1%) | 25 (10.7%) |
hwy | |||||||||||
Mean (SD) | 27.8 (2.59) | 27.0 (4.24) | 22.0 (5.64) | 20.7 (6.04) | 20.0 (2.37) | 25.7 (1.15) | 25.3 (6.66) | 25.2 (3.99) | 26.3 (5.99) | 24.2 (5.75) | 23.4 (5.95) |
Median [Min, Max] | 27.0 [25.0, 31.0] | 27.0 [24.0, 30.0] | 22.0 [14.0, 41.0] | 19.0 [12.0, 36.0] | 19.0 [18.0, 23.0] | 25.0 [25.0, 27.0] | 27.0 [18.0, 31.0] | 26.0 [18.0, 29.0] | 26.0 [16.0, 44.0] | 26.0 [12.0, 32.0] | 24.0 [12.0, 44.0] |
cty | |||||||||||
Mean (SD) | 20.0 (2.00) | 21.0 (4.24) | 15.9 (3.98) | 14.7 (3.49) | 13.7 (1.86) | 18.7 (2.31) | 17.3 (5.03) | 17.4 (3.22) | 19.3 (4.56) | 16.9 (3.83) | 16.9 (4.26) |
Median [Min, Max] | 19.0 [18.0, 23.0] | 21.0 [18.0, 24.0] | 16.0 [11.0, 29.0] | 14.0 [9.00, 25.0] | 13.0 [12.0, 16.0] | 20.0 [16.0, 20.0] | 18.0 [12.0, 22.0] | 17.0 [12.0, 22.0] | 19.0 [11.0, 35.0] | 16.0 [9.00, 23.0] | 17.0 [9.00, 35.0] |
The table1() function generates a summary table for the specified variables. In the formula-like syntax ~ cyl + drv + hwy + cty | trans, the variables listed before | are summarized, and the variable after it, trans, defines the strata. The output also includes an Overall column that summarizes all rows together.
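The "Median [Min, Max]" rows and the Overall column can be approximated in base R. The sketch below is only an illustration of what those cells contain, not the table1 implementation; the helper name `median_range` and the toy data frame `cars` are hypothetical.

```r
# Toy data standing in for mpg (values are made up for illustration)
cars <- data.frame(
  trans = c("auto", "auto", "auto", "manual", "manual", "manual"),
  cty   = c(18, 21, 15, 19, 17, 24)
)

# Hypothetical helper: format "Median [Min, Max]" for a numeric vector
median_range <- function(x) {
  sprintf("%.1f [%.1f, %.1f]", median(x), min(x), max(x))
}

# One entry per stratum of trans
by_trans <- sapply(split(cars$cty, cars$trans), median_range)

# Add an "Overall" entry computed on all rows, ignoring the strata
summary_row <- c(by_trans, Overall = median_range(cars$cty))
summary_row
```

The Overall entry is computed from the pooled data, so it is not an average of the per-stratum entries.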