Chapter 8 Data Summary with tableone

8.1 Instructions

In this tutorial, we will be exploring how to summarize all variables of our datasets in one single table. We will familiarize ourselves with the R package tableone and its associated functions. This tutorial will show you how to be more efficient in analyzing data in R.

Accompanying this tutorial is a short Google quiz for your own self-assessment. The instructions of this tutorial will clearly indicate when you should answer which question.

8.2 Learning Objectives

Understand the basics of the tableone package and its applications.
Efficiently summarize whole datasets into one single table.
Be familiar with the function CreateTableOne() and a few of its basic arguments.
Know how to tell tableone which variables are continuous and which variables are categorical.
Be familiar with different print() arguments to customize a tableone.

8.3 Set Up

For this tutorial, the main package that we will be working with is the tableone package. We will also need the dplyr package for a few basic functions and data from the nhanesA package. Let’s go ahead and load them in our session!

#install.packages("tableone")
library(tableone)

#install.packages("dplyr")
library(dplyr)

#install.packages("nhanesA")
library(nhanesA)

Alright, so we are going back to the NHANES dataset for this tutorial. Let’s, once again, download the “DEMO_H” dataset and save it in an object called “demo_original.”

demo_original <- nhanes("DEMO_H")

## Processing SAS dataset DEMO_H     ..

Just a reminder to everyone that this is what our raw dataset looks like.

head(demo_original)

##    SEQN SDDSRVYR RIDSTATR RIAGENDR RIDAGEYR RIDAGEMN RIDRETH1 RIDRETH3 RIDEXMON
## 1 73557        8        2        1       69       NA        4        4        1
## 2 73558        8        2        1       54       NA        3        3        1
## 3 73559        8        2        1       72       NA        3        3        2
## 4 73560        8        2        1        9       NA        3        3        1
## 5 73561        8        2        2       73       NA        3        3        1
## 6 73562        8        2        1       56       NA        1        1        1
##   RIDEXAGM DMQMILIZ DMQADFC DMDBORN4 DMDCITZN DMDYRSUS DMDEDUC3 DMDEDUC2
## 1       NA        1       1        1        1       NA       NA        3
## 2       NA        2      NA        1        1       NA       NA        3
## 3       NA        1       1        1        1       NA       NA        4
## 4      119       NA      NA        1        1       NA        3       NA
## 5       NA        2      NA        1        1       NA       NA        5
## 6       NA        1       2        1        1       NA       NA        4
##   DMDMARTL RIDEXPRG SIALANG SIAPROXY SIAINTRP FIALANG FIAPROXY FIAINTRP MIALANG
## 1        4       NA       1        2        2       1        2        2       1
## 2        1       NA       1        2        2       1        2        2       1
## 3        1       NA       1        2        2       1        2        2       1
## 4       NA       NA       1        1        2       1        2        2       1
## 5        1       NA       1        2        2       1        2        2       1
## 6        3       NA       1        2        2       1        2        2       1
##   MIAPROXY MIAINTRP AIALANGA DMDHHSIZ DMDFMSIZ DMDHHSZA DMDHHSZB DMDHHSZE
## 1        2        2        1        3        3        0        0        2
## 2        2        2        1        4        4        0        2        0
## 3        2        2       NA        2        2        0        0        2
## 4        2        2        1        4        4        0        2        0
## 5        2        2       NA        2        2        0        0        2
## 6        2        2        1        1        1        0        0        0
##   DMDHRGND DMDHRAGE DMDHRBR4 DMDHREDU DMDHRMAR DMDHSEDU WTINT2YR WTMEC2YR
## 1        1       69        1        3        4       NA 13281.24 13481.04
## 2        1       54        1        3        1        1 23682.06 24471.77
## 3        1       72        1        4        1        3 57214.80 57193.29
## 4        1       33        1        3        1        4 55201.18 55766.51
## 5        1       78        1        5        1        5 63709.67 65541.87
## 6        1       56        1        4        3       NA 24978.14 25344.99
##   SDMVPSU SDMVSTRA INDHHIN2 INDFMIN2 INDFMPIR
## 1       1      112        4        4     0.84
## 2       1      108        7        7     1.78
## 3       1      109       10       10     4.51
## 4       2      109        9        9     2.52
## 5       2      116       15       15     5.00
## 6       1      111        9        9     4.79

As we can see, the data is quite overwhelming! Let’s only select a few familiar variables to make the summary a bit more manageable and comprehensible.

demo <- select(demo_original, 
               c("RIAGENDR", # Gender
                 "RIDAGEYR", # Age
                 "RIDRETH3", # Race
                 "DMDEDUC2") # Education
               )

head(demo)

##   RIAGENDR RIDAGEYR RIDRETH3 DMDEDUC2
## 1        1       69        4        3
## 2        1       54        3        3
## 3        1       72        3        4
## 4        1        9        3       NA
## 5        2       73        3        5
## 6        1       56        1        4

Awesome, our data is looking much better now!

We have learned how to analyze it with dplyr and visualize it with ggplot. But in this tutorial, we are going to learn how to summarize the data in this large dataset into one simple table.

8.4 What is tableone?

tableone is an R package that helps us construct “Table 1,” the baseline characteristics table that we see in biomedical research papers. This package gives us access to a lot of useful data summary functions that we can use to summarize both categorical and continuous data. In addition, we can also identify which continuous variables are normal and which are nonnormal so that R can summarize them more appropriately.

tableone is unique in that it is very simple and easy to use. One single function can do tremendous data summary as we will see in the later sections in this tutorial.

DO QUESTIONS 1 & 2 OF THE QUIZ NOW

tableone is part of the tidyverse core. (True or False)
What sort of data can tableone summarize? (Select all that apply)

8.5 Creating a tableone

8.5.1 CreateTableOne

The simplest way to use tableone is to call the function CreateTableOne() with our dataset placed between the parentheses (), like so:

CreateTableOne(data = demo)

##                       
##                        Overall      
##   n                    10175        
##   RIAGENDR (mean (SD))  1.51 (0.50) 
##   RIDAGEYR (mean (SD)) 31.48 (24.42)
##   RIDRETH3 (mean (SD))  3.29 (1.61) 
##   DMDEDUC2 (mean (SD))  3.52 (1.24)

As we can see in the output above, this function has cleanly summarized all of our data into one table. It gives us how many records there are in the dataset (n), as well as the mean and standard deviation of all of our variables!

It looks pretty neat right now, but recall that the variables RIAGENDR (Gender), RIDRETH3 (Race), and DMDEDUC2 (Education) are all categorical! Their numbers are only codes for categories, so it does not make any sense to have a mean for these variables at all. Only RIDAGEYR (Age) is truly continuous.
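A quick way to see why a mean is misleading for a coded categorical variable is to look at how R stores it. The sketch below uses a small made-up data frame (toy and its values are hypothetical, not NHANES records) in which gender is stored as the numbers 1 and 2:

```r
# Hypothetical toy data: gender is coded 1 = Male, 2 = Female
toy <- data.frame(gender_code = c(1, 2, 2, 1, 2),
                  age = c(34, 51, 8, 67, 22))

# R sees a numeric column, so a "mean gender" is computed without complaint
class(toy$gender_code)
mean(toy$gender_code)

# Converting the codes to a factor makes R count categories instead
gender_factor <- factor(toy$gender_code)
class(gender_factor)
table(gender_factor)
```

The mean of the codes is a number between 1 and 2 that does not describe any participant, while the table of the factor gives the count for each category.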

But do not worry! There are at least two ways that we can solve this problem. First, we can use nhanesTranslate() to convert the coded variables into categorical ones. Second, we can use the factorVars argument in CreateTableOne() to identify the categorical variables ourselves.

8.5.2 Solution 1: nhanesTranslate & CreateTableOne

First, let’s translate all of our variables using the nhanesTranslate() function that we have learned in previous tutorials like so.

demo_translate <- nhanesTranslate("DEMO_H",
                 c("RIAGENDR",
                   "RIDAGEYR",
                   "RIDRETH3", 
                   "DMDEDUC2"),
                   data = demo)

## Translated columns: RIAGENDR RIDRETH3 DMDEDUC2

After that, for ease of communication, let’s also change the column names to something that we can all understand.

names(demo_translate) <- c("Gender", "Age", "Race", "Education")
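Assigning to names() relies on the columns being in exactly the order we expect. As a hedged alternative, the rename() function from dplyr pairs each new name with the old one explicitly. The sketch below uses a hypothetical two-row data frame (toy) rather than the real demo_translate object:

```r
library(dplyr)

# Hypothetical stand-in for a translated NHANES data frame
toy <- data.frame(RIAGENDR = c("Male", "Female"),
                  RIDAGEYR = c(69, 73))

# rename() takes new_name = old_name pairs, so column order does not matter
toy_renamed <- rename(toy,
                      Gender = RIAGENDR,
                      Age = RIDAGEYR)

names(toy_renamed)
```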

Try it yourself 8.1

Challenge: Why do you think we need to change the names of our variables AFTER we translate them?

Hint: Think about the data = demo argument in nhanesTranslate()

Now, this is what our dataset should look like. Look familiar?

head(demo_translate)

##   Gender Age               Race                        Education
## 1   Male  69 Non-Hispanic Black High school graduate/GED or equi
## 2   Male  54 Non-Hispanic White High school graduate/GED or equi
## 3   Male  72 Non-Hispanic White        Some college or AA degree
## 4   Male   9 Non-Hispanic White                             <NA>
## 5 Female  73 Non-Hispanic White        College graduate or above
## 6   Male  56   Mexican American        Some college or AA degree

This table should look exactly like the one that you have seen in previous tutorials! The only difference here is that, in this tutorial, we are using and summarizing the ENTIRE dataset! We will not be scaling down to only analyzing or visualizing the first or last few rows!

Now if we use the CreateTableOne() function again but on our new demo_translate object, we should be able to see a quite different table.

(tab_nhanes <- CreateTableOne(data = demo_translate))

##                                      
##                                       Overall      
##   n                                   10175        
##   Gender = Female (%)                  5172 (50.8) 
##   Age (mean (SD))                     31.48 (24.42)
##   Race (%)                                         
##      Mexican American                  1730 (17.0) 
##      Other Hispanic                     960 ( 9.4) 
##      Non-Hispanic White                3674 (36.1) 
##      Non-Hispanic Black                2267 (22.3) 
##      Non-Hispanic Asian                1074 (10.6) 
##      Other Race - Including Multi-Rac   470 ( 4.6) 
##   Education (%)                                    
##      Less than 9th grade                455 ( 7.9) 
##      9-11th grade (Includes 12th grad   791 (13.7) 
##      High school graduate/GED or equi  1303 (22.6) 
##      Some college or AA degree         1770 (30.7) 
##      College graduate or above         1443 (25.0) 
##      Refused                              2 ( 0.0) 
##      Don't Know                           5 ( 0.1)

The count of records (n) is still there and we are still provided with the mean and standard deviation of participants’ age. However, instead of a single mean and standard deviation for gender, race, and education, we now have all of the categories of these variables fleshed out. In addition, we are also given the count and percentage of each category!

You may have also noticed that “Female” is the only gender that is shown in this table. This is because this variable only has two levels: Female and Male. For this reason, we can infer the count and percentage of the other category just based on the one that tableone gives us. There is a way that we can force tableone to show all categories of a variable. We will cover this in a later section of this tutorial.
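The numbers in the table can be checked by hand. The sketch below copies the counts from the output above and recomputes the hidden Male row and the Education percentages. It assumes that tableone computes percentages among participants with a non-missing value, which is consistent with the printed figures:

```r
# Counts copied from the table above
n_total <- 10175
n_female <- 5172

# The hidden "Male" row follows from the "Female" row
n_male <- n_total - n_female
pct_male <- round(100 * n_male / n_total, 1)
n_male
pct_male

# Education counts, in the order shown in the table
education <- c(455, 791, 1303, 1770, 1443, 2, 5)

# Participants with a recorded education level (fewer than n_total,
# because Education is missing for many participants)
n_education <- sum(education)
n_education

# Percentages use the non-missing total as the denominator
pct_education <- round(100 * education / n_education, 1)
pct_education
```

Note that the Education percentages add up to 100 even though only 5769 of the 10175 participants have a recorded education level.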

DO QUESTIONS 3 & 4 OF THE QUIZ NOW

What kind of information is summarized when the data is continuous?
What kind of information is summarized when the data is categorical?

8.5.3 Solution 2: Identify Numerical Categorical Data

Before we hop to this second solution, again, let’s rename all of our variables to something more comprehensible so that everything is easier to understand. In this subsection, however, we will be renaming our demo dataset, instead of the demo_translate dataset that we renamed earlier.

names(demo) <- c("Gender", "Age", "Race", "Education")

Okay, now we are ready to go! Note that this second solution is more transferrable and will work for datasets that do not come from NHANES.

The second way that we can help tableone know which variable is categorical is by telling it directly using the argument factorVars. factorVars is especially useful for identifying numerical categorical data like the ones that we have.

Coupled with factorVars is also vars. vars is used to select which variables we want to keep in our tableone. Combining what we have learned about CreateTableOne() so far with factorVars and vars, this is what our function with clearly identified numerical categorical data should look like:

CreateTableOne(data = demo,
               vars = c("Gender", "Age", "Race", "Education"),
               factorVars = c("Gender", "Race", "Education")
              )

##                  
##                   Overall      
##   n               10175        
##   Gender = 2 (%)   5172 (50.8) 
##   Age (mean (SD)) 31.48 (24.42)
##   Race (%)                     
##      1             1730 (17.0) 
##      2              960 ( 9.4) 
##      3             3674 (36.1) 
##      4             2267 (22.3) 
##      6             1074 (10.6) 
##      7              470 ( 4.6) 
##   Education (%)                
##      1              455 ( 7.9) 
##      2              791 (13.7) 
##      3             1303 (22.6) 
##      4             1770 (30.7) 
##      5             1443 (25.0) 
##      7                2 ( 0.0) 
##      9                5 ( 0.1)

As we can see, this tableone that we just created should look similar to the table that we created above. The only difference is that because we did not use nhanesTranslate, all of the categories in our categorical variables are numerical. This will not be an issue if we know which number corresponds to which gender, race, or education level of the participants. Other than that, the counts and percentages of these categorical variables should be identical.

If the number of vectors c() and strings in the CreateTableOne() call above is a bit confusing and hard on our eyes, we can also define factorVars and vars before inputting them into CreateTableOne() like so:

vars <- c("Gender", "Age", "Race", "Education")

factorVars <- c("Gender", "Race", "Education")

CreateTableOne(data = demo,
               vars = vars,
               factorVars = factorVars
              )

##                  
##                   Overall      
##   n               10175        
##   Gender = 2 (%)   5172 (50.8) 
##   Age (mean (SD)) 31.48 (24.42)
##   Race (%)                     
##      1             1730 (17.0) 
##      2              960 ( 9.4) 
##      3             3674 (36.1) 
##      4             2267 (22.3) 
##      6             1074 (10.6) 
##      7              470 ( 4.6) 
##   Education (%)                
##      1              455 ( 7.9) 
##      2              791 (13.7) 
##      3             1303 (22.6) 
##      4             1770 (30.7) 
##      5             1443 (25.0) 
##      7                2 ( 0.0) 
##      9                5 ( 0.1)

We should be able to see that both tables in this subsection are identical!
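Both tables in this subsection show numeric codes rather than labels. If we would rather see labels without relying on nhanesTranslate(), one option is to convert the codes with base R's factor() before building the table. The sketch below uses a hypothetical data frame (toy); the Gender coding (1 = Male, 2 = Female) matches the translated output shown earlier:

```r
library(tableone)

# Hypothetical data frame with a numerically coded Gender column
toy <- data.frame(Gender = c(1, 2, 2, 1, 2, 1, 2, 2),
                  Age = c(69, 54, 72, 9, 73, 56, 31, 40))

# Attach a label to each code; the variable becomes a factor
toy$Gender <- factor(toy$Gender,
                     levels = c(1, 2),
                     labels = c("Male", "Female"))

# Factors are treated as categorical, so factorVars is not needed here
toy_table <- CreateTableOne(data = toy, vars = c("Gender", "Age"))
print(toy_table, showAllLevels = TRUE)
```

This approach also works for datasets that do not come from NHANES, as long as we know what each code means.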

Try it yourself 8.2

Create a tableone without the vars argument. What do you see? Do you think the vars argument is necessary in our case? If not, in what situation(s) do you think it would be necessary?

8.6 Other Arguments to Customize tableone

There are other arguments of CreateTableOne() and print() that we can use to customize and adjust our tableone!

8.6.1 Show All Levels

Recall how our Gender variable only shows the “Female” category. If we want both categories “Female” and “Male” to be shown, we can add showAllLevels = TRUE to our print() function like so:

print(tab_nhanes, 
      showAllLevels = TRUE)

##                  
##                   level                            Overall      
##   n                                                10175        
##   Gender (%)      Male                              5003 (49.2) 
##                   Female                            5172 (50.8) 
##   Age (mean (SD))                                  31.48 (24.42)
##   Race (%)        Mexican American                  1730 (17.0) 
##                   Other Hispanic                     960 ( 9.4) 
##                   Non-Hispanic White                3674 (36.1) 
##                   Non-Hispanic Black                2267 (22.3) 
##                   Non-Hispanic Asian                1074 (10.6) 
##                   Other Race - Including Multi-Rac   470 ( 4.6) 
##   Education (%)   Less than 9th grade                455 ( 7.9) 
##                   9-11th grade (Includes 12th grad   791 (13.7) 
##                   High school graduate/GED or equi  1303 (22.6) 
##                   Some college or AA degree         1770 (30.7) 
##                   College graduate or above         1443 (25.0) 
##                   Refused                              2 ( 0.0) 
##                   Don't Know                           5 ( 0.1)

Another way that we can show both Male and Female is to use cramVars. But this argument only works on 2-level variables (i.e., variables with only 2 categories) because all categories will be placed in the same row.

print(tab_nhanes, 
     cramVars = "Gender")

##                                      
##                                       Overall               
##   n                                       10175             
##   Gender = Male/Female (%)            5003/5172 (49.2/50.8) 
##   Age (mean (SD))                         31.48 (24.42)     
##   Race (%)                                                  
##      Mexican American                      1730 (17.0)      
##      Other Hispanic                         960 ( 9.4)      
##      Non-Hispanic White                    3674 (36.1)      
##      Non-Hispanic Black                    2267 (22.3)      
##      Non-Hispanic Asian                    1074 (10.6)      
##      Other Race - Including Multi-Rac       470 ( 4.6)      
##   Education (%)                                             
##      Less than 9th grade                    455 ( 7.9)      
##      9-11th grade (Includes 12th grad       791 (13.7)      
##      High school graduate/GED or equi      1303 (22.6)      
##      Some college or AA degree             1770 (30.7)      
##      College graduate or above             1443 (25.0)      
##      Refused                                  2 ( 0.0)      
##      Don't Know                               5 ( 0.1)

DO QUESTION 5 OF THE QUIZ NOW

What is the difference between showAllLevels and cramVars?

8.6.2 Nonnormal

Right now, our tableones assume that the data of all of our continuous variables are normal, but what if our data is not normal?

If we know that some or all of our continuous variables are not normal, we can tell R this by using the nonnormal argument of print(). For example, if our Age variable is nonnormal, then:

print(tab_nhanes, 
      showAllLevels = TRUE,
      nonnormal = "Age"
     )

##                     
##                      level                            Overall             
##   n                                                   10175               
##   Gender (%)         Male                              5003 (49.2)        
##                      Female                            5172 (50.8)        
##   Age (median [IQR])                                  26.00 [10.00, 52.00]
##   Race (%)           Mexican American                  1730 (17.0)        
##                      Other Hispanic                     960 ( 9.4)        
##                      Non-Hispanic White                3674 (36.1)        
##                      Non-Hispanic Black                2267 (22.3)        
##                      Non-Hispanic Asian                1074 (10.6)        
##                      Other Race - Including Multi-Rac   470 ( 4.6)        
##   Education (%)      Less than 9th grade                455 ( 7.9)        
##                      9-11th grade (Includes 12th grad   791 (13.7)        
##                      High school graduate/GED or equi  1303 (22.6)        
##                      Some college or AA degree         1770 (30.7)        
##                      College graduate or above         1443 (25.0)        
##                      Refused                              2 ( 0.0)        
##                      Don't Know                           5 ( 0.1)

In the table above, we can see that instead of the usual mean and standard deviation, we are provided with the median and interquartile range (IQR) for our nonnormal Age variable!
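To see what the nonnormal argument changes, we can compute both kinds of summary by hand in base R. The ages below are made up for illustration and deliberately include one very large value:

```r
# Hypothetical right-skewed ages
ages <- c(1, 2, 3, 5, 8, 13, 21, 34, 80)

# Summary reported for variables treated as normal
mean(ages)
sd(ages)

# Summary reported for variables treated as nonnormal
median(ages)
quantile(ages, probs = c(0.25, 0.75))
```

Here the single large value pulls the mean well above the median, which is why the median and IQR are often preferred for skewed data.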

Try it yourself 8.3

How do you know if a variable is nonnormal? Try using the function summary() on our tableone object and look at the number under skew. How do you decide if something is normal or nonnormal? Is the decision to make “Age” nonnormal accurate?

DO QUESTION 6 OF THE QUIZ NOW

The decision to make “Age” nonnormal is accurate. (True or False)

8.6.3 Show Categorical or Continuous Variables Only

We also have the option to show only the categorical or only the continuous variables of a tableone.

## Categorical variables only

tab_nhanes$CatTable

##                                      
##                                       Overall     
##   n                                   10175       
##   Gender = Female (%)                 5172 (50.8) 
##   Race (%)                                        
##      Mexican American                 1730 (17.0) 
##      Other Hispanic                    960 ( 9.4) 
##      Non-Hispanic White               3674 (36.1) 
##      Non-Hispanic Black               2267 (22.3) 
##      Non-Hispanic Asian               1074 (10.6) 
##      Other Race - Including Multi-Rac  470 ( 4.6) 
##   Education (%)                                   
##      Less than 9th grade               455 ( 7.9) 
##      9-11th grade (Includes 12th grad  791 (13.7) 
##      High school graduate/GED or equi 1303 (22.6) 
##      Some college or AA degree        1770 (30.7) 
##      College graduate or above        1443 (25.0) 
##      Refused                             2 ( 0.0) 
##      Don't Know                          5 ( 0.1)

## Continuous variables only

print(tab_nhanes$ContTable, nonnormal = "Age")

##                     
##                      Overall             
##   n                  10175               
##   Age (median [IQR]) 26.00 [10.00, 52.00]

8.6.4 Strata

In a way, strata is like the function group_by() in dplyr or facets in ggplot. It groups data together into groups or “strata” and then summarizes each group individually.

Note that while showAllLevels and nonnormal are arguments of the function print(), strata is an argument of the function CreateTableOne().
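To make the comparison with group_by() concrete, the sketch below summarizes a hypothetical data frame (toy) by group with dplyr. It is only an analogy for what strata does, not how tableone is implemented:

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(Gender = c("Male", "Female", "Female", "Male", "Female", "Male"),
                  Age = c(69, 54, 72, 9, 73, 56))

# One summary row per group, much like one column per stratum in tableone
by_gender <- toy %>%
  group_by(Gender) %>%
  summarise(n = n(),
            median_age = median(Age))

by_gender
```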

For example, if we want to separate our data summary by Gender, we would need to write code like so:

strata <- CreateTableOne(data = demo_translate,
                         vars = c("Age", "Race", "Education"), ## Note that Gender is not included because we already have strata = Gender
                         factorVars = c("Race","Education"), ## Again, Gender is not included because it is in the strata argument
                         strata = "Gender"
                         )

print(strata, 
      nonnormal = "Age")

##                                      Stratified by Gender
##                                       Male                Female              
##   n                                    5003                5172               
##   Age (median [IQR])                  25.00 [9.00, 51.00] 28.00 [10.00, 52.00]
##   Race (%)                                                                    
##      Mexican American                   833 (16.7)          897 (17.3)        
##      Other Hispanic                     449 ( 9.0)          511 ( 9.9)        
##      Non-Hispanic White                1811 (36.2)         1863 (36.0)        
##      Non-Hispanic Black                1152 (23.0)         1115 (21.6)        
##      Non-Hispanic Asian                 521 (10.4)          553 (10.7)        
##      Other Race - Including Multi-Rac   237 ( 4.7)          233 ( 4.5)        
##   Education (%)                                                               
##      Less than 9th grade                230 ( 8.3)          225 ( 7.5)        
##      9-11th grade (Includes 12th grad   393 (14.2)          398 (13.2)        
##      High school graduate/GED or equi   665 (24.1)          638 (21.2)        
##      Some college or AA degree          754 (27.3)         1016 (33.7)        
##      College graduate or above          713 (25.9)          730 (24.2)        
##      Refused                              0 ( 0.0)            2 ( 0.1)        
##      Don't Know                           3 ( 0.1)            2 ( 0.1)        
##                                      Stratified by Gender
##                                       p      test   
##   n                                                 
##   Age (median [IQR])                   0.001 nonnorm
##   Race (%)                             0.317        
##      Mexican American                               
##      Other Hispanic                                 
##      Non-Hispanic White                             
##      Non-Hispanic Black                             
##      Non-Hispanic Asian                             
##      Other Race - Including Multi-Rac               
##   Education (%)                       <0.001        
##      Less than 9th grade                            
##      9-11th grade (Includes 12th grad               
##      High school graduate/GED or equi               
##      Some college or AA degree                      
##      College graduate or above                      
##      Refused                                        
##      Don't Know

Let’s unpack this table together. First, we have the usual summaries: the median and IQR for Age (because we marked it as nonnormal) and the count and percentage for each category of Race and Education. Except now, all of these summaries are stratified by Gender, with one column for each group.

Second, we can also see a second block below our usual table, containing p-values and the test used. This only appears when we have stratified our data into two or more groups for comparison. The default test for categorical variables is chisq.test(), and the default for continuous variables is oneway.test() (regular ANOVA). For continuous variables that we mark as nonnormal, tableone uses kruskal.test() instead and indicates this with the word “nonnorm” under “test,” as it did for Age in the table above.
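The p-values in a stratified table come from ordinary base R tests, so we can run the same kinds of tests ourselves. The sketch below uses a hypothetical data frame (toy, including a made-up Smoker variable) and is not a reproduction of tableone's internal code:

```r
# Hypothetical toy data: 40 participants in two groups
toy <- data.frame(
  Gender = factor(rep(c("Male", "Female"), each = 20)),
  Smoker = factor(rep(c("Yes", "No", "No", "No"), times = 10)),
  Age = c(21:40, 26:45)
)

# Categorical variable across groups: chi-squared test
chisq_result <- chisq.test(table(toy$Gender, toy$Smoker))
chisq_result$p.value

# Continuous variable treated as normal: one-way ANOVA
anova_result <- oneway.test(Age ~ Gender, data = toy, var.equal = TRUE)
anova_result$p.value

# Continuous variable treated as nonnormal: Kruskal-Wallis test
kruskal_result <- kruskal.test(Age ~ Gender, data = toy)
kruskal_result$p.value
```

Running the tests directly like this can be a useful cross-check when a p-value in the table looks surprising.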

Try it yourself 8.4

Create a tableone using the demo_translate dataset. Keep all variables and stratify the data using “Age.” What do you see? Do you think this is a helpful tableone?

DO QUESTION 7 OF THE QUIZ NOW

Which of the following is the least appropriate to stratify our dataset by?

8.7 Export tableone

Finally, let’s export our tableone!

Recall that we can use the function write.csv() to export data from R to a csv file. But before we can use this function, we first need to save the printed table into an object using print(), like so:

tab_csv <- print(strata,
                 nonnormal = "Age",
                 printToggle = FALSE)

DO QUESTION 8 OF THE QUIZ NOW

What does the argument printToggle = FALSE do?

Now we can use our write.csv() function like normal.

write.csv(tab_csv, file = "data/NHANES_Summary.csv")

Tada! Now our table is saved as a csv file in the data folder of our working directory!

dir()

##  [1] "_book"                                 
##  [2] "_bookdown.yml"                         
##  [3] "_bookdown_files"                       
##  [4] "_build.sh"                             
##  [5] "_deploy.sh"                            
##  [6] "_output.yml"                           
##  [7] "0-r-and-rstudio-set-up.Rmd"            
##  [8] "1-introduction-to-r.Rmd"               
##  [9] "2-importing-data-into-r-with-readr.Rmd"
## [10] "3-introduction-to-nhanes.Rmd"          
## [11] "4-data-analysis-with-dplyr.Rmd"        
## [12] "5-data-visualization-with-ggplot.Rmd"  
## [13] "6-date-time-data-with-lubridate.Rmd"   
## [14] "7-data-summary-with-tableone.Rmd"      
## [15] "8-Exercise-Solutions.Rmd"              
## [16] "9-references.Rmd"                      
## [17] "book.bib"                              
## [18] "data"                                  
## [19] "DESCRIPTION"                           
## [20] "Dockerfile"                            
## [21] "docs"                                  
## [22] "header.html"                           
## [23] "images"                                
## [24] "index.Rmd"                             
## [25] "intro2R.log"                           
## [26] "intro2R.Rmd"                           
## [27] "intro2R.tex"                           
## [28] "intro2R_cache"                         
## [29] "intro2R_files"                         
## [30] "LICENSE"                               
## [31] "now.json"                              
## [32] "packages.bib"                          
## [33] "preamble.tex"                          
## [34] "R.Rproj"                               
## [35] "README.md"                             
## [36] "style.css"                             
## [37] "toc.css"
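One practical caveat with the export step: write.csv() does not create folders, so the call fails if the data folder does not exist yet. The sketch below shows the full round trip on a hypothetical data frame (toy), writing to a temporary folder so that it does not touch the project files. The file name toy_summary.csv is also made up for illustration:

```r
library(tableone)

# Hypothetical toy data with two groups
toy <- data.frame(Group = factor(rep(c("A", "B"), each = 5)),
                  Age = c(23, 35, 41, 52, 60, 19, 28, 33, 47, 58))

toy_table <- CreateTableOne(data = toy, vars = "Age", strata = "Group")

# Save the formatted table as an object (same pattern as tab_csv above)
toy_matrix <- print(toy_table, printToggle = FALSE)

# Create the output folder if it is missing, then write the file
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE)
out_file <- file.path(out_dir, "toy_summary.csv")
write.csv(toy_matrix, file = out_file)

# Confirm that the file was written
file.exists(out_file)
```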

DO QUESTIONS 9 & 10 OF THE QUIZ NOW

Which of the following arguments can be nested in CreateTableOne()?
Which of the following arguments can be nested in print()?

8.8 Alternatives to tableone

Data summary is one of the many things that R is good at. That said, there are multiple other R packages that also do data summary aside from tableone. We will not go over any of these packages in detail, but know that each package has its own strengths and is best suited to different situations.

Here are some other data summary tools and their main data summary functions:

8.8.1 base R

In base R, we have summary() and by():

summary(demo_translate)

##     Gender          Age                                      Race     
##  Male  :5003   Min.   : 0.00   Mexican American                :1730  
##  Female:5172   1st Qu.:10.00   Other Hispanic                  : 960  
##                Median :26.00   Non-Hispanic White              :3674  
##                Mean   :31.48   Non-Hispanic Black              :2267  
##                3rd Qu.:52.00   Non-Hispanic Asian              :1074  
##                Max.   :80.00   Other Race - Including Multi-Rac: 470  
##                                                                       
##                             Education   
##  Some college or AA degree       :1770  
##  College graduate or above       :1443  
##  High school graduate/GED or equi:1303  
##  9-11th grade (Includes 12th grad: 791  
##  Less than 9th grade             : 455  
##  (Other)                         :   7  
##  NA's                            :4406

by(demo_translate, demo_translate$Gender, summary)

## demo_translate$Gender: Male
##     Gender          Age                                      Race     
##  Male  :5003   Min.   : 0.00   Mexican American                : 833  
##  Female:   0   1st Qu.: 9.00   Other Hispanic                  : 449  
##                Median :25.00   Non-Hispanic White              :1811  
##                Mean   :30.69   Non-Hispanic Black              :1152  
##                3rd Qu.:51.00   Non-Hispanic Asian              : 521  
##                Max.   :80.00   Other Race - Including Multi-Rac: 237  
##                                                                       
##                             Education   
##  Some college or AA degree       : 754  
##  College graduate or above       : 713  
##  High school graduate/GED or equi: 665  
##  9-11th grade (Includes 12th grad: 393  
##  Less than 9th grade             : 230  
##  (Other)                         :   3  
##  NA's                            :2245  
## ------------------------------------------------------------ 
## demo_translate$Gender: Female
##     Gender          Age                                      Race     
##  Male  :   0   Min.   : 0.00   Mexican American                : 897  
##  Female:5172   1st Qu.:10.00   Other Hispanic                  : 511  
##                Median :28.00   Non-Hispanic White              :1863  
##                Mean   :32.25   Non-Hispanic Black              :1115  
##                3rd Qu.:52.00   Non-Hispanic Asian              : 553  
##                Max.   :80.00   Other Race - Including Multi-Rac: 233  
##                                                                       
##                             Education   
##  Some college or AA degree       :1016  
##  College graduate or above       : 730  
##  High school graduate/GED or equi: 638  
##  9-11th grade (Includes 12th grad: 398  
##  Less than 9th grade             : 225  
##  (Other)                         :   4  
##  NA's                            :2161
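When only one statistic per group is needed, tapply() gives a more compact result than by() with summary(). The sketch below uses a small made-up data frame (the name toy and its four rows are hypothetical, not the NHANES data) so that it runs without any download.

```r
# Hypothetical four-row data frame standing in for demo_translate
toy <- data.frame(
  Gender = factor(c("Male", "Male", "Female", "Female"),
                  levels = c("Male", "Female")),
  Age = c(69, 54, 72, 9)
)

# Full summary of every column within each Gender group
by(toy, toy$Gender, summary)

# Mean age per group only, returned as a named one-dimensional array
age_means <- tapply(toy$Age, toy$Gender, mean)
age_means
```

The by() call mirrors the one used above, while tapply() keeps only the group means, which is easier to reuse in later calculations.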

8.8.2 Hmisc

In Hmisc, we have describe():

#install.packages("Hmisc")
library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

describe(demo_translate)

## demo_translate 
## 
##  4  Variables      10175  Observations
## --------------------------------------------------------------------------------
## Gender 
##        n  missing distinct 
##    10175        0        2 
##                         
## Value        Male Female
## Frequency    5003   5172
## Proportion  0.492  0.508
## --------------------------------------------------------------------------------
## Age : Age in years at screening 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    10175        0       81        1    31.48    27.75        1        3 
##      .25      .50      .75      .90      .95 
##       10       26       52       68       75 
## 
## lowest :  0  1  2  3  4, highest: 76 77 78 79 80
## --------------------------------------------------------------------------------
## Race 
##        n  missing distinct 
##    10175        0        6 
## 
## lowest : Mexican American                 Other Hispanic                   Non-Hispanic White               Non-Hispanic Black               Non-Hispanic Asian              
## highest: Other Hispanic                   Non-Hispanic White               Non-Hispanic Black               Non-Hispanic Asian               Other Race - Including Multi-Rac
## 
## Mexican American (1730, 0.170), Other Hispanic (960, 0.094), Non-Hispanic White
## (3674, 0.361), Non-Hispanic Black (2267, 0.223), Non-Hispanic Asian (1074,
## 0.106), Other Race - Including Multi-Rac (470, 0.046)
## --------------------------------------------------------------------------------
## Education 
##        n  missing distinct 
##     5769     4406        7 
## 
## lowest : Less than 9th grade              9-11th grade (Includes 12th grad High school graduate/GED or equi Some college or AA degree        College graduate or above       
## highest: High school graduate/GED or equi Some college or AA degree        College graduate or above        Refused                          Don't Know                      
## 
## Less than 9th grade (455, 0.079), 9-11th grade (Includes 12th grad (791,
## 0.137), High school graduate/GED or equi (1303, 0.226), Some college or AA
## degree (1770, 0.307), College graduate or above (1443, 0.250), Refused (2,
## 0.000), Don't Know (5, 0.001)
## --------------------------------------------------------------------------------

8.8.3 psych

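In psych, we have describe() and describeBy(). Because psych is loaded after Hmisc, its describe() masks the Hmisc version (see the message below), so the call here runs the psych function. Note how the categorical variables are marked with an asterisk (*): psych converts them to their numeric level codes, so statistics such as the mean and sd in those rows describe the codes, not the categories.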

#install.packages("psych")
library(psych)

## 
## Attaching package: 'psych'

## The following object is masked from 'package:Hmisc':
## 
##     describe

## The following object is masked from 'package:car':
## 
##     logit

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

describe(demo_translate)

##            vars     n  mean    sd median trimmed   mad min max range  skew
## Gender*       1 10175  1.51  0.50      2    1.51  0.00   1   2     1 -0.03
## Age           2 10175 31.48 24.42     26   29.82 28.17   0  80    80  0.44
## Race*         3 10175  3.14  1.35      3    3.11  1.48   1   6     5  0.04
## Education*    4  5769  3.52  1.23      4    3.62  1.48   1   7     6 -0.47
##            kurtosis   se
## Gender*       -2.00 0.00
## Age           -1.09 0.24
## Race*         -0.51 0.01
## Education*    -0.69 0.02

describeBy(demo_translate, demo_translate$Gender)

## 
##  Descriptive statistics by group 
## group: Male
##            vars    n  mean    sd median trimmed   mad min max range  skew
## Gender*       1 5003  1.00  0.00      1    1.00  0.00   1   1     0   NaN
## Age           2 5003 30.69 24.39     25   28.89 28.17   0  80    80  0.48
## Race*         3 5003  3.16  1.34      3    3.14  1.48   1   6     5  0.03
## Education*    4 2758  3.49  1.25      4    3.58  1.48   1   7     6 -0.40
##            kurtosis   se
## Gender*         NaN 0.00
## Age           -1.07 0.34
## Race*         -0.49 0.02
## Education*    -0.78 0.02
## ------------------------------------------------------------ 
## group: Female
##            vars    n  mean    sd median trimmed   mad min max range  skew
## Gender*       1 5172  2.00  0.00      2    2.00  0.00   2   2     0   NaN
## Age           2 5172 32.25 24.43     28   30.72 29.65   0  80    80  0.40
## Race*         3 5172  3.12  1.35      3    3.09  1.48   1   6     5  0.06
## Education*    4 3011  3.55  1.21      4    3.65  1.48   1   7     6 -0.53
##            kurtosis   se
## Gender*         NaN 0.00
## Age           -1.12 0.34
## Race*         -0.54 0.02
## Education*    -0.59 0.02
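To see where the numbers in the asterisk rows come from, it can help to reproduce one by hand. The sketch below is a minimal base R illustration, assuming a toy factor named gender (hypothetical) with the same level order as our Gender variable. The final lines redo the arithmetic with the Male and Female counts reported earlier.

```r
# Toy factor with the same level order as Gender (Male = 1, Female = 2)
gender <- factor(c("Male", "Female", "Female"), levels = c("Male", "Female"))

# The numeric level codes behind the factor
gender_codes <- as.numeric(gender)
gender_codes

# Mean of the codes; for a two-level factor this is 1 plus the share of level 2
mean_code <- mean(gender_codes)
mean_code

# The same arithmetic using the NHANES counts (5003 Male, 5172 Female)
nhanes_mean_code <- (5003 * 1 + 5172 * 2) / (5003 + 5172)
round(nhanes_mean_code, 2)
```

The last value matches the mean of 1.51 shown for Gender* in the describe() output, which confirms that the row summarizes level codes.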

8.8.4 desctable

In desctable, we have desctable():

# install.packages("desctable")
library(desctable)

## Loading required package: pander

## 
## Attaching package: 'desctable'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test, IQR

desctable(demo_translate)

##                                                    N           % Median IQR
## 1                                       Gender 10175          NA     NA  NA
## 2                                 Gender: Male  5003 49.16953317     NA  NA
## 3                               Gender: Female  5172 50.83046683     NA  NA
## 4                                          Age 10175          NA     26  42
## 5                                         Race 10175          NA     NA  NA
## 6                       Race: Mexican American  1730 17.00245700     NA  NA
## 7                         Race: Other Hispanic   960  9.43488943     NA  NA
## 8                     Race: Non-Hispanic White  3674 36.10810811     NA  NA
## 9                     Race: Non-Hispanic Black  2267 22.28009828     NA  NA
## 10                    Race: Non-Hispanic Asian  1074 10.55528256     NA  NA
## 11      Race: Other Race - Including Multi-Rac   470  4.61916462     NA  NA
## 12                                   Education  5769          NA     NA  NA
## 13              Education: Less than 9th grade   455  7.88698215     NA  NA
## 14 Education: 9-11th grade (Includes 12th grad   791 13.71121512     NA  NA
## 15 Education: High school graduate/GED or equi  1303 22.58623678     NA  NA
## 16        Education: Some college or AA degree  1770 30.68122725     NA  NA
## 17        Education: College graduate or above  1443 25.01300052     NA  NA
## 18                          Education: Refused     2  0.03466805     NA  NA
## 19                       Education: Don't Know     5  0.08667013     NA  NA
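The percentages above are printed with many decimal places. As a rough sketch of how such a column is derived and rounded, the base R snippet below recomputes the Gender percentages from the counts in the table. The object names counts and pct are ours, not part of desctable.

```r
# Counts taken from the Gender rows of the desctable output
counts <- c(Male = 5003, Female = 5172)

# Share of each level as a percentage, rounded to one decimal place
pct <- round(100 * prop.table(counts), 1)
pct
```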

8.8.5 skimr

In skimr, we have skim():

#install.packages("skimr")
library(skimr)

skim(demo_translate)

## Warning: Couldn't find skimmers for class: labelled, integer, numeric; No user-
## defined `sfl` provided. Falling back to `character`.

Table 8.1: Data summary
Name	demo_translate
Number of rows	10175
Number of columns	4
_______________________
Column type frequency:
character	1
factor	3
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Age	0	1	1	2	0	81	0

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
Gender	0	1.00	FALSE	2	Fem: 5172, Mal: 5003
Race	0	1.00	FALSE	6	Non: 3674, Non: 2267, Mex: 1730, Non: 1074
Education	4406	0.57	FALSE	7	Som: 1770, Col: 1443, Hig: 1303, 9-1: 791
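The warning above explains why Age was summarized as a character variable: the column carries a labelled class that skim() does not recognize. One possible workaround, assuming the label attribute is not needed, is to convert the column to a plain numeric vector first. The sketch below demonstrates the idea in base R on a hypothetical stand-in vector (age_labelled is made up for illustration).

```r
# Hypothetical stand-in for the Age column: integers carrying a "labelled" class
age_labelled <- structure(
  c(69L, 54L, 72L, 9L),
  class = c("labelled", "integer"),
  label = "Age in years at screening"
)
class(age_labelled)

# as.numeric() drops the extra class and the label attribute
age_plain <- as.numeric(age_labelled)
class(age_plain)
summary(age_plain)
```

On the tutorial data, the equivalent step would be demo_translate$Age <- as.numeric(demo_translate$Age) before calling skim(demo_translate). Age should then be reported under the numeric variable type, although we have not rerun the output here.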

8.9 Summary and Takeaways

Congratulations on finishing this tutorial on Data Summary with tableone! After this tutorial, you should be familiar with the R package tableone as well as the function CreateTableOne(). In addition, you should be familiar with the different arguments of print() that customize a tableone.

The tableone package has many more powerful functions than the ones covered here. You are free to explore them on your own in the package documentation.