Summary tables

Medical research and epidemiology often involve large, complex datasets. Data summarization is a vital step that transforms these vast datasets into concise, understandable insights. In medical contexts, these summaries can highlight patterns, indicate data inconsistencies, and guide further research. This tutorial will teach you how to use R to efficiently summarize medical data.

In epidemiology and medical research, “Table 1” typically refers to the first table in a research paper or report that provides descriptive statistics of the study population. It offers a snapshot of the baseline characteristics of the study groups, whether in a cohort study, clinical trial, or any other study design.

# Data
mpg <- read.csv("Data/wrangling/mpg.csv", header = TRUE)

# Frequency table for drv
table(mpg$drv)
#> 
#>   4   f   r 
#> 103 106  25

# Frequency table for manufacturer
table(mpg$manufacturer)
#> 
#>       audi  chevrolet      dodge       ford      honda    hyundai       jeep 
#>         18         19         37         25          9         14          8 
#> land rover    lincoln    mercury     nissan    pontiac     subaru     toyota 
#>          4          3          4         13          5         14         34 
#> volkswagen 
#>         27

## Example: cross-tabulate drv (rows) by manufacturer (columns)
table(mpg$drv, mpg$manufacturer)
#>    
#>     audi chevrolet dodge ford honda hyundai jeep land rover lincoln mercury
#>   4   11         4    26   13     0       0    8          4       0       4
#>   f    7         5    11    0     9      14    0          0       0       0
#>   r    0        10     0   12     0       0    0          0       3       0
#>    
#>     nissan pontiac subaru toyota volkswagen
#>   4      4       0     14     15          0
#>   f      9       5      0     19         27
#>   r      0       0      0      0          0

The first line reads the CSV file into a data frame called mpg. Calling table() on a single variable returns its frequency table; calling it on two categorical variables returns a contingency table (cross-tabulation), here of drv (drive train: 4 = four-wheel, f = front-wheel, r = rear-wheel) against manufacturer. Each cell counts how many rows of the dataset have that combination of drv and manufacturer.
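As a minimal sketch of the same idea on data small enough to check by hand, the example below builds a toy data frame (the values are invented for illustration and are not taken from mpg), tabulates it, and uses addmargins() to append row and column totals.

```r
# Hypothetical toy data standing in for mpg (values invented for illustration)
toy <- data.frame(
  drv          = c("4", "f", "f", "r", "4", "f"),
  manufacturer = c("audi", "audi", "honda", "ford", "ford", "honda")
)

# One-way frequency table: number of rows for each drv level
drv_counts <- table(toy$drv)

# Two-way contingency table; naming the arguments labels the dimensions
drv_by_make <- table(drv = toy$drv, manufacturer = toy$manufacturer)

# addmargins() appends row and column totals, labelled "Sum"
drv_by_make_totals <- addmargins(drv_by_make)
print(drv_by_make_totals)
```

The "Sum" row and column make it easy to confirm that the cell counts add up to the number of rows in the data.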

## Get column-wise proportions using prop.table()
prop.table(table(mpg$drv, mpg$manufacturer), margin = 2)
#>    
#>          audi chevrolet     dodge      ford     honda   hyundai      jeep
#>   4 0.6111111 0.2105263 0.7027027 0.5200000 0.0000000 0.0000000 1.0000000
#>   f 0.3888889 0.2631579 0.2972973 0.0000000 1.0000000 1.0000000 0.0000000
#>   r 0.0000000 0.5263158 0.0000000 0.4800000 0.0000000 0.0000000 0.0000000
#>    
#>     land rover   lincoln   mercury    nissan   pontiac    subaru    toyota
#>   4  1.0000000 0.0000000 1.0000000 0.3076923 0.0000000 1.0000000 0.4411765
#>   f  0.0000000 0.0000000 0.0000000 0.6923077 1.0000000 0.0000000 0.5588235
#>   r  0.0000000 1.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#>    
#>     volkswagen
#>   4  0.0000000
#>   f  1.0000000
#>   r  0.0000000
## margin = 1: proportions within each row (rows sum to 1)
## margin = 2: proportions within each column (columns sum to 1)

This code converts the counts to column-wise proportions (values between 0 and 1, not percentages). prop.table() divides each cell by a marginal total: with margin = 2 the divisor is the column total, so the proportions within each manufacturer sum to 1; with margin = 1 the divisor is the row total, so the proportions within each drv level sum to 1. For example, 11 of the 18 audi cars are four-wheel drive, giving 11/18 ≈ 0.611.
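The difference between the two margins can be checked directly. The sketch below reuses a small toy data frame (hypothetical values, not the mpg data), computes both versions, and multiplies by 100 to turn the proportions into rounded percentages.

```r
# Hypothetical toy data standing in for mpg (values invented for illustration)
toy <- data.frame(
  drv          = c("4", "f", "f", "r", "4", "f"),
  manufacturer = c("audi", "audi", "honda", "ford", "ford", "honda")
)

tab <- table(toy$drv, toy$manufacturer)

# margin = 1: divide each cell by its row total, so every row sums to 1
by_row <- prop.table(tab, margin = 1)

# margin = 2: divide each cell by its column total, so every column sums to 1
by_col <- prop.table(tab, margin = 2)

# Convert proportions to percentages, rounded to one decimal place
pct_by_col <- round(100 * by_col, 1)
print(pct_by_col)
```

Calling rowSums(by_row) or colSums(by_col) is a quick way to confirm which margin was used.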

tableone package

Tip

The CreateTableOne() function from the tableone package is a convenient way to build a summary table. Type ?tableone::CreateTableOne for more details.

This section introduces the tableone package, which offers the CreateTableOne function. This function helps in creating “Table 1” type summary tables, commonly used in epidemiological studies.

library(tableone)  # stops with an error if the package is not installed
CreateTableOne(vars = c("cyl", "drv", "hwy", "cty"), data = mpg, 
               strata = "trans", includeNA = TRUE, test = FALSE)
#>                  Stratified by trans
#>                   auto(av)       auto(l3)       auto(l4)      auto(l5)     
#>   n                   5              2             83            39        
#>   cyl (mean (SD))  5.20 (1.10)    4.00 (0.00)    6.14 (1.62)   6.56 (1.45) 
#>   drv (%)                                                                  
#>      4                0 (  0.0)      0 (  0.0)     34 (41.0)     29 (74.4) 
#>      f                5 (100.0)      2 (100.0)     37 (44.6)      8 (20.5) 
#>      r                0 (  0.0)      0 (  0.0)     12 (14.5)      2 ( 5.1) 
#>   hwy (mean (SD)) 27.80 (2.59)   27.00 (4.24)   21.96 (5.64)  20.72 (6.04) 
#>   cty (mean (SD)) 20.00 (2.00)   21.00 (4.24)   15.94 (3.98)  14.72 (3.49) 
#>                  Stratified by trans
#>                   auto(l6)      auto(s4)      auto(s5)      auto(s6)     
#>   n                   6             3             3            16        
#>   cyl (mean (SD))  7.33 (1.03)   5.33 (2.31)   6.00 (2.00)   6.00 (1.59) 
#>   drv (%)                                                                
#>      4                2 (33.3)      2 (66.7)      1 (33.3)      7 (43.8) 
#>      f                2 (33.3)      1 (33.3)      2 (66.7)      8 (50.0) 
#>      r                2 (33.3)      0 ( 0.0)      0 ( 0.0)      1 ( 6.2) 
#>   hwy (mean (SD)) 20.00 (2.37)  25.67 (1.15)  25.33 (6.66)  25.19 (3.99) 
#>   cty (mean (SD)) 13.67 (1.86)  18.67 (2.31)  17.33 (5.03)  17.38 (3.22) 
#>                  Stratified by trans
#>                   manual(m5)    manual(m6)   
#>   n                  58            19        
#>   cyl (mean (SD))  5.00 (1.30)   6.00 (1.76) 
#>   drv (%)                                    
#>      4               21 (36.2)      7 (36.8) 
#>      f               33 (56.9)      8 (42.1) 
#>      r                4 ( 6.9)      4 (21.1) 
#>   hwy (mean (SD)) 26.29 (5.99)  24.21 (5.75) 
#>   cty (mean (SD)) 19.26 (4.56)  16.89 (3.83)

CreateTableOne() summarizes the variables cyl, drv, hwy, and cty from the mpg data. The strata = "trans" argument stratifies the summary by transmission type, giving one column per level of trans. Numeric variables (cyl, hwy, cty) are reported as mean (SD), and categorical variables (drv) as count (%). With includeNA = TRUE, missing values (NA) in categorical variables are treated as a level of their own rather than dropped. With test = FALSE, no hypothesis tests comparing the strata are run, so the table has no p-value column.
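A variable such as cyl takes only a few distinct values, so it may be more informative to report it as counts than as a mean. The sketch below, which assumes the tableone package is installed and uses a hypothetical toy data frame rather than mpg, passes factorVars to treat cyl as categorical and uses the nonnormal argument of print() to show a median and interquartile range for hwy.

```r
library(tableone)

# Hypothetical toy data standing in for mpg (values invented for illustration)
toy <- data.frame(
  trans = rep(c("auto", "manual"), each = 6),
  cyl   = c(4, 4, 6, 6, 8, 8, 4, 4, 4, 6, 6, 8),
  drv   = c("f", "f", "4", "4", "r", "4", "f", "f", "4", "f", "r", "4"),
  hwy   = c(29, 27, 24, 22, 17, 16, 31, 30, 26, 25, 23, 18)
)

tab <- CreateTableOne(
  vars       = c("cyl", "drv", "hwy"),
  strata     = "trans",
  data       = toy,
  factorVars = "cyl",  # summarize cyl as count (%) instead of mean (SD)
  test       = FALSE   # no group comparison tests
)

# nonnormal = "hwy" prints median [IQR] for hwy;
# showAllLevels = TRUE lists every level of each categorical variable
print(tab, nonnormal = "hwy", showAllLevels = TRUE)
```

Whether to treat a numeric variable as categorical is a judgment call that depends on how the variable will be interpreted in the report.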

table1 package

This section introduces another package, table1, which can also be used to create “Table 1” type summary tables.

library(table1)  # stops with an error if the package is not installed
table1(~ cyl + drv + hwy + cty | trans, data=mpg)
                     auto(av)           auto(l3)           auto(l4)           auto(l5)
                     (N=5)              (N=2)              (N=83)             (N=39)
cyl
  Mean (SD)          5.20 (1.10)        4.00 (0)           6.14 (1.62)        6.56 (1.45)
  Median [Min, Max]  6.00 [4.00, 6.00]  4.00 [4.00, 4.00]  6.00 [4.00, 8.00]  6.00 [4.00, 8.00]
drv
  f                  5 (100%)           2 (100%)           37 (44.6%)         8 (20.5%)
  4                  0 (0%)             0 (0%)             34 (41.0%)         29 (74.4%)
  r                  0 (0%)             0 (0%)             12 (14.5%)         2 (5.1%)
hwy
  Mean (SD)          27.8 (2.59)        27.0 (4.24)        22.0 (5.64)        20.7 (6.04)
  Median [Min, Max]  27.0 [25.0, 31.0]  27.0 [24.0, 30.0]  22.0 [14.0, 41.0]  19.0 [12.0, 36.0]
cty
  Mean (SD)          20.0 (2.00)        21.0 (4.24)        15.9 (3.98)        14.7 (3.49)
  Median [Min, Max]  19.0 [18.0, 23.0]  21.0 [18.0, 24.0]  16.0 [11.0, 29.0]  14.0 [9.00, 25.0]

                     auto(l6)           auto(s4)           auto(s5)           auto(s6)
                     (N=6)              (N=3)              (N=3)              (N=16)
cyl
  Mean (SD)          7.33 (1.03)        5.33 (2.31)        6.00 (2.00)        6.00 (1.59)
  Median [Min, Max]  8.00 [6.00, 8.00]  4.00 [4.00, 8.00]  6.00 [4.00, 8.00]  6.00 [4.00, 8.00]
drv
  f                  2 (33.3%)          1 (33.3%)          2 (66.7%)          8 (50.0%)
  4                  2 (33.3%)          2 (66.7%)          1 (33.3%)          7 (43.8%)
  r                  2 (33.3%)          0 (0%)             0 (0%)             1 (6.3%)
hwy
  Mean (SD)          20.0 (2.37)        25.7 (1.15)        25.3 (6.66)        25.2 (3.99)
  Median [Min, Max]  19.0 [18.0, 23.0]  25.0 [25.0, 27.0]  27.0 [18.0, 31.0]  26.0 [18.0, 29.0]
cty
  Mean (SD)          13.7 (1.86)        18.7 (2.31)        17.3 (5.03)        17.4 (3.22)
  Median [Min, Max]  13.0 [12.0, 16.0]  20.0 [16.0, 20.0]  18.0 [12.0, 22.0]  17.0 [12.0, 22.0]

                     manual(m5)         manual(m6)         Overall
                     (N=58)             (N=19)             (N=234)
cyl
  Mean (SD)          5.00 (1.30)        6.00 (1.76)        5.89 (1.61)
  Median [Min, Max]  4.00 [4.00, 8.00]  6.00 [4.00, 8.00]  6.00 [4.00, 8.00]
drv
  f                  33 (56.9%)         8 (42.1%)          106 (45.3%)
  4                  21 (36.2%)         7 (36.8%)          103 (44.0%)
  r                  4 (6.9%)           4 (21.1%)          25 (10.7%)
hwy
  Mean (SD)          26.3 (5.99)        24.2 (5.75)        23.4 (5.95)
  Median [Min, Max]  26.0 [16.0, 44.0]  26.0 [12.0, 32.0]  24.0 [12.0, 44.0]
cty
  Mean (SD)          19.3 (4.56)        16.9 (3.83)        16.9 (4.26)
  Median [Min, Max]  19.0 [11.0, 35.0]  16.0 [9.00, 23.0]  17.0 [9.00, 35.0]

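The table1() function generates a summary table from a one-sided formula. In ~ cyl + drv + hwy + cty | trans, the variables to the left of the vertical bar are the ones summarized, and the variable to the right (trans) defines the strata. Each stratum gets its own column, and an Overall column summarizes the full dataset. Numeric variables are reported as mean (SD) and median [min, max], and categorical variables as count (%).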
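The row labels in a table1 table can be made more readable by attaching labels and units to the variables before building the table. The sketch below assumes the table1 package is installed and uses a hypothetical toy data frame rather than mpg; the label text and the column title "Total" are illustrative choices.

```r
library(table1)

# Hypothetical toy data standing in for mpg (values invented for illustration)
toy <- data.frame(
  trans = rep(c("auto", "manual"), each = 6),
  cyl   = c(4, 4, 6, 6, 8, 8, 4, 4, 4, 6, 6, 8),
  drv   = c("f", "f", "4", "4", "r", "4", "f", "f", "4", "f", "r", "4"),
  hwy   = c(29, 27, 24, 22, 17, 16, 31, 30, 26, 25, 23, 18)
)

# Convert cyl to a factor so it is summarized as counts (%) per level
toy$cyl <- factor(toy$cyl)

# Attach a display label and units; table1 uses these for the row header
label(toy$hwy) <- "Highway mileage"
units(toy$hwy) <- "mpg"

# overall = "Total" renames the combined column (default title: "Overall")
t1 <- table1(~ cyl + drv + hwy | trans, data = toy, overall = "Total")
t1
```

The result renders as an HTML table, so it displays best in RStudio's viewer or in a knitted R Markdown or Quarto document.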