Chapter 8 Data Summary with tableone
8.1 Instructions
In this tutorial, we will explore how to summarize all the variables of a dataset in one single table. We will familiarize ourselves with the R package tableone and its associated functions. This tutorial will show you how to be more efficient in analyzing data in R.
Accompanying this tutorial is a short Google quiz for your own self-assessment. The instructions of this tutorial will clearly indicate when you should answer which question.
8.2 Learning Objectives
- Understand the basics of the tableone package and its applications.
- Efficiently summarize whole datasets into one single table.
- Be familiar with the function CreateTableOne() and a few of its basic arguments.
- Know how to tell tableone which variables are continuous and which variables are categorical.
- Be familiar with different print() arguments to customize a tableone.
8.3 Set Up
For this tutorial, the main package that we will be working with is the tableone package. We will also need the dplyr package for a few basic functions and data from the nhanesA package. Let’s go ahead and load them in our session!
#install.packages("tableone")
library(tableone)
#install.packages("dplyr")
library(dplyr)
#install.packages("nhanesA")
library(nhanesA)
Alright, so we are going back to the NHANES dataset for this tutorial. Let’s, once again, download the “DEMO_H” dataset and save it in an object called “demo_original.”
demo_original <- nhanes("DEMO_H")
## Processing SAS dataset DEMO_H ..
Just a reminder to everyone that this is what our raw dataset looks like.
head(demo_original)
## SEQN SDDSRVYR RIDSTATR RIAGENDR RIDAGEYR RIDAGEMN RIDRETH1 RIDRETH3 RIDEXMON
## 1 73557 8 2 1 69 NA 4 4 1
## 2 73558 8 2 1 54 NA 3 3 1
## 3 73559 8 2 1 72 NA 3 3 2
## 4 73560 8 2 1 9 NA 3 3 1
## 5 73561 8 2 2 73 NA 3 3 1
## 6 73562 8 2 1 56 NA 1 1 1
## RIDEXAGM DMQMILIZ DMQADFC DMDBORN4 DMDCITZN DMDYRSUS DMDEDUC3 DMDEDUC2
## 1 NA 1 1 1 1 NA NA 3
## 2 NA 2 NA 1 1 NA NA 3
## 3 NA 1 1 1 1 NA NA 4
## 4 119 NA NA 1 1 NA 3 NA
## 5 NA 2 NA 1 1 NA NA 5
## 6 NA 1 2 1 1 NA NA 4
## DMDMARTL RIDEXPRG SIALANG SIAPROXY SIAINTRP FIALANG FIAPROXY FIAINTRP MIALANG
## 1 4 NA 1 2 2 1 2 2 1
## 2 1 NA 1 2 2 1 2 2 1
## 3 1 NA 1 2 2 1 2 2 1
## 4 NA NA 1 1 2 1 2 2 1
## 5 1 NA 1 2 2 1 2 2 1
## 6 3 NA 1 2 2 1 2 2 1
## MIAPROXY MIAINTRP AIALANGA DMDHHSIZ DMDFMSIZ DMDHHSZA DMDHHSZB DMDHHSZE
## 1 2 2 1 3 3 0 0 2
## 2 2 2 1 4 4 0 2 0
## 3 2 2 NA 2 2 0 0 2
## 4 2 2 1 4 4 0 2 0
## 5 2 2 NA 2 2 0 0 2
## 6 2 2 1 1 1 0 0 0
## DMDHRGND DMDHRAGE DMDHRBR4 DMDHREDU DMDHRMAR DMDHSEDU WTINT2YR WTMEC2YR
## 1 1 69 1 3 4 NA 13281.24 13481.04
## 2 1 54 1 3 1 1 23682.06 24471.77
## 3 1 72 1 4 1 3 57214.80 57193.29
## 4 1 33 1 3 1 4 55201.18 55766.51
## 5 1 78 1 5 1 5 63709.67 65541.87
## 6 1 56 1 4 3 NA 24978.14 25344.99
## SDMVPSU SDMVSTRA INDHHIN2 INDFMIN2 INDFMPIR
## 1 1 112 4 4 0.84
## 2 1 108 7 7 1.78
## 3 1 109 10 10 4.51
## 4 2 109 9 9 2.52
## 5 2 116 15 15 5.00
## 6 1 111 9 9 4.79
As we can see, the data is quite overwhelming! Let’s only select a few familiar variables to make the summary a bit more manageable and comprehensible.
demo <- select(demo_original,
               c("RIAGENDR", # Gender
                 "RIDAGEYR", # Age
                 "RIDRETH3", # Race
                 "DMDEDUC2") # Education
               )
head(demo)
## RIAGENDR RIDAGEYR RIDRETH3 DMDEDUC2
## 1 1 69 4 3
## 2 1 54 3 3
## 3 1 72 3 4
## 4 1 9 3 NA
## 5 2 73 3 5
## 6 1 56 1 4
Awesome, our data is looking much better now!
We have learned how to analyze it with dplyr and visualize it with ggplot. But in this tutorial, we are going to learn how to summarize the data in this large dataset into one simple table.
8.4 What is tableone?
tableone is an R package that helps us construct “Table 1,” or the baseline table that we see in biomedical research papers. This package gives us access to a lot of useful functions that we can use to summarize both categorical and continuous data. In addition, we can also identify normal and nonnormal variables so that R can summarize them more accurately.
tableone is unique in that it is very simple and easy to use. One single function can do a tremendous amount of data summary, as we will see in the later sections of this tutorial.
DO QUESTIONS 1 & 2 OF THE QUIZ NOW
tableone is part of the tidyverse core. (True or False)
What sort of data can tableone summarize? (Select all that apply)
8.5 Creating a tableone
8.5.1 CreateTableOne
The simplest way to use tableone is to call the function CreateTableOne() with the dataset passed inside the parentheses (), like so:
CreateTableOne(data = demo)
##
## Overall
## n 10175
## RIAGENDR (mean (SD)) 1.51 (0.50)
## RIDAGEYR (mean (SD)) 31.48 (24.42)
## RIDRETH3 (mean (SD)) 3.29 (1.61)
## DMDEDUC2 (mean (SD)) 3.52 (1.24)
As we can see in the output above, this function has cleanly summarized all of our data into one table. It gives us how many records there are in the dataset (n), as well as the mean and standard deviation of all of our variables!
It looks pretty neat right now, but recall that the variables RIAGENDR (Gender), RIDRETH3 (Race), and DMDEDUC2 (Education) are all categorical! So it does not make any sense to have a mean for these variables at all.
But do not worry at all! There are actually several ways that we can solve this problem:
1. First solution is, we can use nhanesTranslate() and these variables will instantly be converted to categorical, and
2. Second solution is, we can use the factorVars argument in CreateTableOne() to identify categorical variables.
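To see why a mean of category codes is not meaningful, here is a minimal base R sketch on a small made-up vector (gender_code and its values are hypothetical, not NHANES data). It assumes the same coding as our dataset, where 1 is Male and 2 is Female.

```r
# Hypothetical toy codes: 1 = Male, 2 = Female
gender_code <- c(1, 2, 2, 1, 2)

# Treating the codes as numbers gives a mean with no real-world meaning
mean(gender_code)

# Treating them as categories gives counts and percentages instead
gender <- factor(gender_code, levels = c(1, 2), labels = c("Male", "Female"))
counts <- table(gender)
percents <- round(100 * prop.table(counts), 1)
counts
percents
```

Counts and percentages are the kind of summary that we want tableone to report for Gender, Race, and Education.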
8.5.2 Solution 1: nhanesTranslate & CreateTableOne
First, let’s translate all of our variables using the nhanesTranslate() function that we have learned in previous tutorials like so.
demo_translate <- nhanesTranslate("DEMO_H",
                                  c("RIAGENDR",
                                    "RIDAGEYR",
                                    "RIDRETH3",
                                    "DMDEDUC2"),
                                  data = demo)
## Translated columns: RIAGENDR RIDRETH3 DMDEDUC2
After that, for ease of communication, let’s also change the column names to something that we can all understand.
names(demo_translate) <- c("Gender", "Age", "Race", "Education")
Try it yourself 8.1
Challenge: Why do you think we need to change the names of our variables AFTER we translate them?
Hint: Think about the data = demo argument in nhanesTranslate()
Now, this is what our dataset should look like. Look familiar?
head(demo_translate)
## Gender Age Race Education
## 1 Male 69 Non-Hispanic Black High school graduate/GED or equi
## 2 Male 54 Non-Hispanic White High school graduate/GED or equi
## 3 Male 72 Non-Hispanic White Some college or AA degree
## 4 Male 9 Non-Hispanic White <NA>
## 5 Female 73 Non-Hispanic White College graduate or above
## 6 Male 56 Mexican American Some college or AA degree
This table should look exactly like the one that you have seen in previous tutorials! The only difference here is that, in this tutorial, we are using and summarizing the ENTIRE dataset! We will not be scaling down to only analyzing or visualizing the first or last few rows!
Now if we use the CreateTableOne() function again but on our new demo_translate object, we should be able to see quite a different table.
(tab_nhanes <- CreateTableOne(data = demo_translate))
##
## Overall
## n 10175
## Gender = Female (%) 5172 (50.8)
## Age (mean (SD)) 31.48 (24.42)
## Race (%)
## Mexican American 1730 (17.0)
## Other Hispanic 960 ( 9.4)
## Non-Hispanic White 3674 (36.1)
## Non-Hispanic Black 2267 (22.3)
## Non-Hispanic Asian 1074 (10.6)
## Other Race - Including Multi-Rac 470 ( 4.6)
## Education (%)
## Less than 9th grade 455 ( 7.9)
## 9-11th grade (Includes 12th grad 791 (13.7)
## High school graduate/GED or equi 1303 (22.6)
## Some college or AA degree 1770 (30.7)
## College graduate or above 1443 (25.0)
## Refused 2 ( 0.0)
## Don't Know 5 ( 0.1)
The count of records (n) is still there and we are still provided with the mean and standard deviation of participants’ age. However, instead of a single mean and standard deviation for gender, race, and education, we now have all of the categories of these variables fleshed out. In addition, we are also given the count and percentage of each category!
You may have also noticed that “Female” is the only gender that is shown in this table. This is because this variable only has two levels: Female and Male. For this reason, we can infer the count and percentage of the other category just based on the one that tableone gives us. There is a way that we can force tableone to show all categories of a variable. We will cover this in a later section of this tutorial.
DO QUESTIONS 3 & 4 OF THE QUIZ NOW
What kind of information is summarized when the data is continuous?
What kind of information is summarized when the data is categorical?
8.5.3 Solution 2: Identify Numerical Categorical Data
Before we hop to this second solution, again, let’s rename all of our variables to something more comprehensible so that everything is easier to understand. In this subsection, however, we will be renaming our demo dataset, instead of the demo_translate dataset that we renamed earlier.
names(demo) <- c("Gender", "Age", "Race", "Education")
Okay, now we are ready to go! Note that this second solution is more transferable and will work for datasets that do not come from NHANES.
The second way that we can help tableone know which variables are categorical is by telling it directly using the argument factorVars. factorVars is especially useful for identifying numerical categorical data like the ones that we have.
Coupled with factorVars is vars, which is used to select which variables we want to keep in our tableone. Combining what we have learned about CreateTableOne() so far with factorVars and vars, this is what our function with clearly identified numerical categorical data should look like:
CreateTableOne(data = demo,
vars = c("Gender", "Age", "Race", "Education"),
factorVars = c("Gender", "Race", "Education")
)
##
## Overall
## n 10175
## Gender = 2 (%) 5172 (50.8)
## Age (mean (SD)) 31.48 (24.42)
## Race (%)
## 1 1730 (17.0)
## 2 960 ( 9.4)
## 3 3674 (36.1)
## 4 2267 (22.3)
## 6 1074 (10.6)
## 7 470 ( 4.6)
## Education (%)
## 1 455 ( 7.9)
## 2 791 (13.7)
## 3 1303 (22.6)
## 4 1770 (30.7)
## 5 1443 (25.0)
## 7 2 ( 0.0)
## 9 5 ( 0.1)
As we can see, the tableone that we just created looks similar to the table that we created above. The only difference is that because we did not use nhanesTranslate(), all of the categories in our categorical variables are numerical. This will not be an issue if we know which number corresponds to which gender, race, or education level of the participants. Other than that, the counts and percentages of these categorical variables should be identical.
If the number of vectors c() and strings in the code above is a bit confusing and hard on our eyes, we can also define factorVars and vars before inputting them into CreateTableOne() like so:
vars <- c("Gender", "Age", "Race", "Education")
factorVars <- c("Gender", "Race", "Education")
CreateTableOne(data = demo,
vars = vars,
factorVars = factorVars
)
##
## Overall
## n 10175
## Gender = 2 (%) 5172 (50.8)
## Age (mean (SD)) 31.48 (24.42)
## Race (%)
## 1 1730 (17.0)
## 2 960 ( 9.4)
## 3 3674 (36.1)
## 4 2267 (22.3)
## 6 1074 (10.6)
## 7 470 ( 4.6)
## Education (%)
## 1 455 ( 7.9)
## 2 791 (13.7)
## 3 1303 (22.6)
## 4 1770 (30.7)
## 5 1443 (25.0)
## 7 2 ( 0.0)
## 9 5 ( 0.1)
We should be able to see that both tables in this subsection are identical!
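For datasets that have no translation helper like nhanesTranslate(), the numeric codes can also be labeled by hand before the table is built. The following is a minimal sketch on a made-up data frame (toy, tab_toy, and the values are hypothetical); the code-to-label mapping is one that we supply ourselves, here assuming 1 is Male and 2 is Female.

```r
library(tableone)

# Hypothetical toy data using numeric category codes
toy <- data.frame(
  Gender = c(1, 2, 2, 1, 2, 1),
  Age    = c(69, 54, 72, 9, 73, 56)
)

# Attach labels to the codes; this mapping is our own assumption
toy$Gender <- factor(toy$Gender, levels = c(1, 2), labels = c("Male", "Female"))

# Columns that are already factors are summarized as categorical,
# so factorVars is not needed for them
tab_toy <- CreateTableOne(data = toy, vars = c("Gender", "Age"))
print(tab_toy, showAllLevels = TRUE)
```

Compared with factorVars alone, this approach keeps readable labels in the printed table, at the cost of having to look up what each code means first.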
Try it yourself 8.2
Create a tableone without the vars argument. What do you see?
Do you think the vars argument is necessary in our case? If not, in what situation(s) do you think it would be necessary?
8.6 Other Arguments to Customize tableone
There are other arguments of CreateTableOne() and print() that we can use to customize and adjust our tableone!
8.6.1 Show All Levels
Recall how our Gender variable only shows the “Female” category. If we want both categories “Female” and “Male” to be shown, we can add showAllLevels = TRUE to our print() function like so:
print(tab_nhanes,
showAllLevels = TRUE)
##
## level Overall
## n 10175
## Gender (%) Male 5003 (49.2)
## Female 5172 (50.8)
## Age (mean (SD)) 31.48 (24.42)
## Race (%) Mexican American 1730 (17.0)
## Other Hispanic 960 ( 9.4)
## Non-Hispanic White 3674 (36.1)
## Non-Hispanic Black 2267 (22.3)
## Non-Hispanic Asian 1074 (10.6)
## Other Race - Including Multi-Rac 470 ( 4.6)
## Education (%) Less than 9th grade 455 ( 7.9)
## 9-11th grade (Includes 12th grad 791 (13.7)
## High school graduate/GED or equi 1303 (22.6)
## Some college or AA degree 1770 (30.7)
## College graduate or above 1443 (25.0)
## Refused 2 ( 0.0)
## Don't Know 5 ( 0.1)
Another way that we can show both Male and Female is to use cramVars. But this argument only works on 2-level variables (i.e. variables with only 2 categories) because all categories will be placed in the same row.
print(tab_nhanes,
cramVars = "Gender")
##
## Overall
## n 10175
## Gender = Male/Female (%) 5003/5172 (49.2/50.8)
## Age (mean (SD)) 31.48 (24.42)
## Race (%)
## Mexican American 1730 (17.0)
## Other Hispanic 960 ( 9.4)
## Non-Hispanic White 3674 (36.1)
## Non-Hispanic Black 2267 (22.3)
## Non-Hispanic Asian 1074 (10.6)
## Other Race - Including Multi-Rac 470 ( 4.6)
## Education (%)
## Less than 9th grade 455 ( 7.9)
## 9-11th grade (Includes 12th grad 791 (13.7)
## High school graduate/GED or equi 1303 (22.6)
## Some college or AA degree 1770 (30.7)
## College graduate or above 1443 (25.0)
## Refused 2 ( 0.0)
## Don't Know 5 ( 0.1)
DO QUESTION 5 OF THE QUIZ NOW
- What is the difference between showAllLevels and cramVars?
8.6.2 Nonnormal
Right now, our tableones assume that all of our continuous variables are normally distributed, but what if our data is not normal?
If we know that some or all of our continuous variables are not normal, we can tell R this by using the nonnormal argument of print(). For example, if our Age variable is nonnormal, then:
print(tab_nhanes,
showAllLevels = TRUE,
nonnormal = "Age"
)
##
## level Overall
## n 10175
## Gender (%) Male 5003 (49.2)
## Female 5172 (50.8)
## Age (median [IQR]) 26.00 [10.00, 52.00]
## Race (%) Mexican American 1730 (17.0)
## Other Hispanic 960 ( 9.4)
## Non-Hispanic White 3674 (36.1)
## Non-Hispanic Black 2267 (22.3)
## Non-Hispanic Asian 1074 (10.6)
## Other Race - Including Multi-Rac 470 ( 4.6)
## Education (%) Less than 9th grade 455 ( 7.9)
## 9-11th grade (Includes 12th grad 791 (13.7)
## High school graduate/GED or equi 1303 (22.6)
## Some college or AA degree 1770 (30.7)
## College graduate or above 1443 (25.0)
## Refused 2 ( 0.0)
## Don't Know 5 ( 0.1)
In the table above, we can see that instead of the usual mean and standard deviation, we are provided with the median and interquartile range (IQR) for our nonnormal Age variable!
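To make the two kinds of summary concrete, here is a small base R sketch on made-up ages (the vector age is hypothetical). It computes a mean and standard deviation, and then a median with the 25th and 75th percentiles. Using quantile() with its default settings is an assumption here; it may not match tableone's internal calculation in every case.

```r
# Hypothetical toy ages with a long right tail
age <- c(2, 5, 9, 14, 26, 31, 52, 60, 78)

# Summary reported for variables treated as normal
c(mean = mean(age), sd = sd(age))

# Summary reported for variables flagged as nonnormal
med <- median(age)
q   <- quantile(age, probs = c(0.25, 0.75))
sprintf("%.2f [%.2f, %.2f]", med, q[1], q[2])
```

The median and quartiles are less affected by a few very large or very small values, which is why they are preferred for skewed variables.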
Try it yourself 8.3
How do you know if a variable is nonnormal? Try using the function summary() on the tableone object and look at the number under skew. How do you decide if something is normal or nonnormal? Is the decision to make “Age” nonnormal accurate?
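As background for reading a skew value, here is a minimal base R sketch of a moment-based skewness on two made-up vectors. The helper skewness() is hypothetical and uses one common definition, which may differ slightly from the formula behind the skew column that summary() reports.

```r
# Moment-based sample skewness (one common definition; an assumption here)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

symmetric    <- c(1, 2, 3, 4, 5)
right_skewed <- c(1, 1, 2, 2, 3, 10)

skewness(symmetric)     # 0 for a perfectly symmetric sample
skewness(right_skewed)  # positive: long tail to the right
```

A value near 0 suggests a roughly symmetric distribution, while values far from 0 suggest skew. Where to draw the line is a judgment call, and a histogram is a useful companion to the number.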
DO QUESTION 6 OF THE QUIZ NOW
- The decision to make “Age” nonnormal is accurate. (True or False)
8.6.3 Show Categorical or Continuous Variables Only
We also have the option to show only the categorical or only the continuous variables of a tableone.
## Categorical variables only
tab_nhanes$CatTable
##
## Overall
## n 10175
## Gender = Female (%) 5172 (50.8)
## Race (%)
## Mexican American 1730 (17.0)
## Other Hispanic 960 ( 9.4)
## Non-Hispanic White 3674 (36.1)
## Non-Hispanic Black 2267 (22.3)
## Non-Hispanic Asian 1074 (10.6)
## Other Race - Including Multi-Rac 470 ( 4.6)
## Education (%)
## Less than 9th grade 455 ( 7.9)
## 9-11th grade (Includes 12th grad 791 (13.7)
## High school graduate/GED or equi 1303 (22.6)
## Some college or AA degree 1770 (30.7)
## College graduate or above 1443 (25.0)
## Refused 2 ( 0.0)
## Don't Know 5 ( 0.1)
## Continuous variables only
print(tab_nhanes$ContTable, nonnormal = "Age")
##
## Overall
## n 10175
## Age (median [IQR]) 26.00 [10.00, 52.00]
8.6.4 Strata
In a way, strata is like the function group_by() in dplyr or facets in ggplot. It groups data together into groups or “strata” and then summarizes each group individually.
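To make the comparison with group_by() concrete, here is a minimal dplyr sketch on a small made-up data frame (toy, by_gender, and the values are hypothetical, not NHANES data). It computes one summary row per Gender group, which is roughly what strata asks tableone to do for every variable at once.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  Gender = c("Male", "Female", "Female", "Male", "Female", "Male"),
  Age    = c(69, 54, 72, 9, 73, 56)
)

# One summary row per group
by_gender <- toy %>%
  group_by(Gender) %>%
  summarise(n = n(), median_age = median(Age))
by_gender
```

With group_by() we have to write each summary ourselves, whereas strata applies the appropriate summary to each variable and places the groups side by side.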
Note that while showAllLevels and nonnormal are arguments of the function print(), strata is an argument of the function CreateTableOne().
For example, if we want to separate our data summary by Gender, we would need to write code like so:
strata <- CreateTableOne(data = demo_translate,
                         vars = c("Age", "Race", "Education"), ## Note that Gender is not included because we already have strata = Gender
                         factorVars = c("Race","Education"), ## Again, Gender is not included because it is in the strata argument
                         strata = "Gender"
                         )
print(strata,
nonnormal = "Age",
cramVars = "Gender")
## Stratified by Gender
## Male Female
## n 5003 5172
## Age (median [IQR]) 25.00 [9.00, 51.00] 28.00 [10.00, 52.00]
## Race (%)
## Mexican American 833 (16.7) 897 (17.3)
## Other Hispanic 449 ( 9.0) 511 ( 9.9)
## Non-Hispanic White 1811 (36.2) 1863 (36.0)
## Non-Hispanic Black 1152 (23.0) 1115 (21.6)
## Non-Hispanic Asian 521 (10.4) 553 (10.7)
## Other Race - Including Multi-Rac 237 ( 4.7) 233 ( 4.5)
## Education (%)
## Less than 9th grade 230 ( 8.3) 225 ( 7.5)
## 9-11th grade (Includes 12th grad 393 (14.2) 398 (13.2)
## High school graduate/GED or equi 665 (24.1) 638 (21.2)
## Some college or AA degree 754 (27.3) 1016 (33.7)
## College graduate or above 713 (25.9) 730 (24.2)
## Refused 0 ( 0.0) 2 ( 0.1)
## Don't Know 3 ( 0.1) 2 ( 0.1)
## Stratified by Gender
## p test
## n
## Age (median [IQR]) 0.001 nonnorm
## Race (%) 0.317
## Mexican American
## Other Hispanic
## Non-Hispanic White
## Non-Hispanic Black
## Non-Hispanic Asian
## Other Race - Including Multi-Rac
## Education (%) <0.001
## Less than 9th grade
## 9-11th grade (Includes 12th grad
## High school graduate/GED or equi
## Some college or AA degree
## College graduate or above
## Refused
## Don't Know
Let’s unpack this table together. Firstly, we have the usual summaries: the median and IQR for Age (which we flagged as nonnormal) and the count and percentage for each category of Race and Education. Except now, we can see that all of the variables and their categories are summarized by or stratified by Gender.
Second of all, we can also see a second table below our usual table with p-values and tests. This only appears when we have stratified our data into two or more groups for comparison. The default test for categorical variables is chisq.test() and the default for continuous variables is oneway.test() (regular ANOVA). For continuous variables that we flag as nonnormal, tableone uses kruskal.test() instead, which is indicated by the word “nonnorm” under “test” in the table above.
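The three tests named above are ordinary base R functions, so we can call them directly to see what kind of p-value each one produces. This is a minimal sketch on made-up data (toy and its columns are hypothetical); the arguments that tableone passes to these functions internally are an assumption here, with var.equal = TRUE chosen to match "regular ANOVA".

```r
# Hypothetical toy data with two groups of 20 records each
toy <- data.frame(
  group  = rep(c("A", "B"), each = 20),
  age    = c(seq(20, 58, by = 2), seq(25, 63, by = 2)),
  smoker = c(rep("yes", 8), rep("no", 12), rep("yes", 12), rep("no", 8))
)

# Categorical variable vs. group: chi-squared test
p_cat <- chisq.test(table(toy$smoker, toy$group))$p.value

# Continuous variable treated as normal: one-way ANOVA
p_norm <- oneway.test(age ~ group, data = toy, var.equal = TRUE)$p.value

# Continuous variable treated as nonnormal: Kruskal-Wallis test
p_nonnorm <- kruskal.test(age ~ group, data = toy)$p.value

round(c(categorical = p_cat, normal = p_norm, nonnormal = p_nonnorm), 3)
```

Each p-value tests whether the variable differs between the groups, so a stratified tableone is effectively running one such test per row of the table.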
Try it yourself 8.4
Create a tableone using the demo_translate dataset. Keep all variables and stratify the data using “Age.” What do you see? Do you think this is a helpful tableone?
DO QUESTION 7 OF THE QUIZ NOW
- Which of the following is the least appropriate to stratify our dataset by?
8.7 Export tableone
Finally, let’s export our tableone!
Recall that we can use the function write.csv() to export data from R to a csv file. But before we can use this function, we first need to save the table into an object using print() like so:
tab_csv <- print(strata,
                 nonnormal = "Age",
                 printToggle = FALSE)
DO QUESTION 8 OF THE QUIZ NOW
- What does the argument printToggle = FALSE do?
Now we can use our write.csv() function like normal.
write.csv(tab_csv, file = "data/NHANES_Summary.csv")
Tada! Now our table is saved as a csv file in the data folder of our working directory!
dir()
## [1] "_book"
## [2] "_bookdown.yml"
## [3] "_bookdown_files"
## [4] "_build.sh"
## [5] "_deploy.sh"
## [6] "_output.yml"
## [7] "0-r-and-rstudio-set-up.Rmd"
## [8] "1-introduction-to-r.Rmd"
## [9] "2-importing-data-into-r-with-readr.Rmd"
## [10] "3-introduction-to-nhanes.Rmd"
## [11] "4-data-analysis-with-dplyr.Rmd"
## [12] "5-data-visualization-with-ggplot.Rmd"
## [13] "6-date-time-data-with-lubridate.Rmd"
## [14] "7-data-summary-with-tableone.Rmd"
## [15] "8-Exercise-Solutions.Rmd"
## [16] "9-references.Rmd"
## [17] "book.bib"
## [18] "data"
## [19] "DESCRIPTION"
## [20] "Dockerfile"
## [21] "docs"
## [22] "header.html"
## [23] "images"
## [24] "index.Rmd"
## [25] "intro2R.log"
## [26] "intro2R.Rmd"
## [27] "intro2R.tex"
## [28] "intro2R_cache"
## [29] "intro2R_files"
## [30] "LICENSE"
## [31] "now.json"
## [32] "packages.bib"
## [33] "preamble.tex"
## [34] "R.Rproj"
## [35] "README.md"
## [36] "style.css"
## [37] "toc.css"
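If you would like to check an export without touching your project folders, the following minimal sketch repeats the same steps on a made-up data frame and writes to a temporary file. The names toy, tab_toy, tab_mat, and out_file are hypothetical. The noSpaces = TRUE argument of print() is optional; it strips the padding spaces that are only useful for on-screen alignment.

```r
library(tableone)

# Hypothetical toy data
toy <- data.frame(
  Gender = factor(c("Male", "Female", "Female", "Male", "Female", "Male")),
  Age    = c(69, 54, 72, 9, 73, 56)
)
tab_toy <- CreateTableOne(data = toy)

# Save the formatted table as an object instead of printing it
tab_mat <- print(tab_toy, printToggle = FALSE, noSpaces = TRUE)
class(tab_mat)

# Write to a temporary file, then read it back to confirm the export
out_file <- tempfile(fileext = ".csv")
write.csv(tab_mat, file = out_file)
read.csv(out_file)
```

Reading the file back with read.csv() is a quick way to confirm that the row names and summary values survived the export.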
DO QUESTIONS 9 & 10 OF THE QUIZ NOW
Which of the following arguments can be nested in CreateTableOne()?
Which of the following arguments can be nested in print()?
8.8 Alternatives to tableone
Data summary is one of the many applications that R specializes in. With this said, there are multiple other R packages that also do data summary aside from tableone. We will not go over any of these packages in detail, but know that each package has its own strengths and so is best suited to different situations.
Here are the other data summary packages and their main data summary functions:
8.8.1 base R
In base R, we have summary() and by():
summary(demo_translate)
## Gender Age Race
## Male :5003 Min. : 0.00 Mexican American :1730
## Female:5172 1st Qu.:10.00 Other Hispanic : 960
## Median :26.00 Non-Hispanic White :3674
## Mean :31.48 Non-Hispanic Black :2267
## 3rd Qu.:52.00 Non-Hispanic Asian :1074
## Max. :80.00 Other Race - Including Multi-Rac: 470
##
## Education
## Some college or AA degree :1770
## College graduate or above :1443
## High school graduate/GED or equi:1303
## 9-11th grade (Includes 12th grad: 791
## Less than 9th grade : 455
## (Other) : 7
## NA's :4406
by(demo_translate, demo_translate$Gender, summary)
## demo_translate$Gender: Male
## Gender Age Race
## Male :5003 Min. : 0.00 Mexican American : 833
## Female: 0 1st Qu.: 9.00 Other Hispanic : 449
## Median :25.00 Non-Hispanic White :1811
## Mean :30.69 Non-Hispanic Black :1152
## 3rd Qu.:51.00 Non-Hispanic Asian : 521
## Max. :80.00 Other Race - Including Multi-Rac: 237
##
## Education
## Some college or AA degree : 754
## College graduate or above : 713
## High school graduate/GED or equi: 665
## 9-11th grade (Includes 12th grad: 393
## Less than 9th grade : 230
## (Other) : 3
## NA's :2245
## ------------------------------------------------------------
## demo_translate$Gender: Female
## Gender Age Race
## Male : 0 Min. : 0.00 Mexican American : 897
## Female:5172 1st Qu.:10.00 Other Hispanic : 511
## Median :28.00 Non-Hispanic White :1863
## Mean :32.25 Non-Hispanic Black :1115
## 3rd Qu.:52.00 Non-Hispanic Asian : 553
## Max. :80.00 Other Race - Including Multi-Rac: 233
##
## Education
## Some college or AA degree :1016
## College graduate or above : 730
## High school graduate/GED or equi: 638
## 9-11th grade (Includes 12th grad: 398
## Less than 9th grade : 225
## (Other) : 4
## NA's :2161
8.8.2 Hmisc
In Hmisc, we have describe():
#install.packages("Hmisc")
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
describe(demo_translate)
## demo_translate
##
## 4 Variables 10175 Observations
## --------------------------------------------------------------------------------
## Gender
## n missing distinct
## 10175 0 2
##
## Value Male Female
## Frequency 5003 5172
## Proportion 0.492 0.508
## --------------------------------------------------------------------------------
## Age : Age in years at screening
## n missing distinct Info Mean Gmd .05 .10
## 10175 0 81 1 31.48 27.75 1 3
## .25 .50 .75 .90 .95
## 10 26 52 68 75
##
## lowest : 0 1 2 3 4, highest: 76 77 78 79 80
## --------------------------------------------------------------------------------
## Race
## n missing distinct
## 10175 0 6
##
## lowest : Mexican American Other Hispanic Non-Hispanic White Non-Hispanic Black Non-Hispanic Asian
## highest: Other Hispanic Non-Hispanic White Non-Hispanic Black Non-Hispanic Asian Other Race - Including Multi-Rac
##
## Mexican American (1730, 0.170), Other Hispanic (960, 0.094), Non-Hispanic White
## (3674, 0.361), Non-Hispanic Black (2267, 0.223), Non-Hispanic Asian (1074,
## 0.106), Other Race - Including Multi-Rac (470, 0.046)
## --------------------------------------------------------------------------------
## Education
## n missing distinct
## 5769 4406 7
##
## lowest : Less than 9th grade 9-11th grade (Includes 12th grad High school graduate/GED or equi Some college or AA degree College graduate or above
## highest: High school graduate/GED or equi Some college or AA degree College graduate or above Refused Don't Know
##
## Less than 9th grade (455, 0.079), 9-11th grade (Includes 12th grad (791,
## 0.137), High school graduate/GED or equi (1303, 0.226), Some college or AA
## degree (1770, 0.307), College graduate or above (1443, 0.250), Refused (2,
## 0.000), Don't Know (5, 0.001)
## --------------------------------------------------------------------------------
8.8.3 psych
In psych, we have describe() and describeBy(). Note how the categorical variables are marked with an asterisk (*).
#install.packages("psych")
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
##
## describe
## The following object is masked from 'package:car':
##
## logit
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describe(demo_translate)
## vars n mean sd median trimmed mad min max range skew
## Gender* 1 10175 1.51 0.50 2 1.51 0.00 1 2 1 -0.03
## Age 2 10175 31.48 24.42 26 29.82 28.17 0 80 80 0.44
## Race* 3 10175 3.14 1.35 3 3.11 1.48 1 6 5 0.04
## Education* 4 5769 3.52 1.23 4 3.62 1.48 1 7 6 -0.47
## kurtosis se
## Gender* -2.00 0.00
## Age -1.09 0.24
## Race* -0.51 0.01
## Education* -0.69 0.02
describeBy(demo_translate, demo_translate$Gender)
##
## Descriptive statistics by group
## group: Male
## vars n mean sd median trimmed mad min max range skew
## Gender* 1 5003 1.00 0.00 1 1.00 0.00 1 1 0 NaN
## Age 2 5003 30.69 24.39 25 28.89 28.17 0 80 80 0.48
## Race* 3 5003 3.16 1.34 3 3.14 1.48 1 6 5 0.03
## Education* 4 2758 3.49 1.25 4 3.58 1.48 1 7 6 -0.40
## kurtosis se
## Gender* NaN 0.00
## Age -1.07 0.34
## Race* -0.49 0.02
## Education* -0.78 0.02
## ------------------------------------------------------------
## group: Female
## vars n mean sd median trimmed mad min max range skew
## Gender* 1 5172 2.00 0.00 2 2.00 0.00 2 2 0 NaN
## Age 2 5172 32.25 24.43 28 30.72 29.65 0 80 80 0.40
## Race* 3 5172 3.12 1.35 3 3.09 1.48 1 6 5 0.06
## Education* 4 3011 3.55 1.21 4 3.65 1.48 1 7 6 -0.53
## kurtosis se
## Gender* NaN 0.00
## Age -1.12 0.34
## Race* -0.54 0.02
## Education* -0.59 0.02
8.8.4 desctable
In desctable, we have desctable():
# install.packages("desctable")
library(desctable)
## Loading required package: pander
##
## Attaching package: 'desctable'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test, IQR
desctable(demo_translate)
## N % Median IQR
## 1 Gender 10175 NA NA NA
## 2 Gender: Male 5003 49.16953317 NA NA
## 3 Gender: Female 5172 50.83046683 NA NA
## 4 Age 10175 NA 26 42
## 5 Race 10175 NA NA NA
## 6 Race: Mexican American 1730 17.00245700 NA NA
## 7 Race: Other Hispanic 960 9.43488943 NA NA
## 8 Race: Non-Hispanic White 3674 36.10810811 NA NA
## 9 Race: Non-Hispanic Black 2267 22.28009828 NA NA
## 10 Race: Non-Hispanic Asian 1074 10.55528256 NA NA
## 11 Race: Other Race - Including Multi-Rac 470 4.61916462 NA NA
## 12 Education 5769 NA NA NA
## 13 Education: Less than 9th grade 455 7.88698215 NA NA
## 14 Education: 9-11th grade (Includes 12th grad 791 13.71121512 NA NA
## 15 Education: High school graduate/GED or equi 1303 22.58623678 NA NA
## 16 Education: Some college or AA degree 1770 30.68122725 NA NA
## 17 Education: College graduate or above 1443 25.01300052 NA NA
## 18 Education: Refused 2 0.03466805 NA NA
## 19 Education: Don't Know 5 0.08667013 NA NA
8.8.5 skimr
In skimr, we have skim():
#install.packages("skimr")
library(skimr)
skim(demo_translate)
## Warning: Couldn't find skimmers for class: labelled, integer, numeric; No user-
## defined `sfl` provided. Falling back to `character`.
Name | demo_translate |
Number of rows | 10175 |
Number of columns | 4 |
_______________________ | |
Column type frequency: | |
character | 1 |
factor | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Age | 0 | 1 | 1 | 2 | 0 | 81 | 0 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Gender | 0 | 1.00 | FALSE | 2 | Fem: 5172, Mal: 5003 |
Race | 0 | 1.00 | FALSE | 6 | Non: 3674, Non: 2267, Mex: 1730, Non: 1074 |
Education | 4406 | 0.57 | FALSE | 7 | Som: 1770, Col: 1443, Hig: 1303, 9-1: 791 |
8.9 Summary and Takeaways
Congratulations on finishing tutorial 7 on Data Summary with tableone! After this tutorial, you should be familiar with the R package tableone as well as the function CreateTableOne(). In addition, you should also be familiar with the different arguments of print() to customize your own tableone.
There are a lot more powerful functions in the tableone package. You are free to explore them on your own using this document.