Chapter 5 Data Analysis with dplyr
5.1 Instructions
This tutorial is aiming to introduce you to how to manipulate data and transfer the raw data into ready-to-analysis form in R. It will guide you to learn and practice the basic and useful functions in the dplyr package. In this tutorial, follow the step-by-step instruction as well as the examples which demonstrating how the functions work.
Accompanying this tutorial is a short Google quiz for your own self-assessment. The instructions of this tutorial will clearly indicate when you should answer which question.
5.2 Learning Objectives
- Be able to import nhanesA and dplyr package. Be able to understand dataframe.
- Be able to use rename() and select() to rename and select variables (columns) in a dataframe.
- Be able to use filter() to subset a dataframe based on conditions.
- Be able to use arrange() to re-order the rows in a dataframe.
- Be able to use mutate() and transmute() to add new variables that are computed from existing variables in a dataframe.
- Be able to use summarize() to get summary statistics from a dataframe; Be able to perform grouping with summarize(), filter(), and mutate().
- Be able to use pipe to re-write mutilpe operations in a more readable way.
- Be able to check missing values existence and deal with missing values while using the above functions.
5.3 Set up
5.3.1 Install and load packages
First, we need to install and load the dplyr package as well as bring in our National Health and Nutrition Examination Survey (NHANES) dataset. This particular tutorial uses data from 2013-2014.
For more information about NHANES, you can visit this website. It is recommended that you explore this website to familiar yourself with the data that we will be using throughout this tutorial.
# install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# install.packages("nhanesA")
library(nhanesA)
5.3.2 Set working directory
We want to set the working directory using setwd()
- you may need to change the working directory to match your hard drive set-up.
# setwd('/kaggle/working')
5.3.3 Import dataset
This particular tutorial uses the Demographics dataset and Blood pressure dataset from NHANES dataset (2013-2014).
The Demographics dataset contains a huge records of demographics information such as gender, race, annual income for each participant. The Blood pressure dataset is in the Examination data and contains a hugh records of measurements that related to blood pressure measurement.
More information on the imported dataset can be found here:
- Demographics data
- Examination data
- Complete variable dictionary
- DEMO_H Code book (Demographic Variables)
- BPX_H Code book (Blood Pressure)
<- nhanes('DEMO_H') demo
## Processing SAS dataset DEMO_H ..
<- nhanes('BPX_H') bpx
## Processing SAS dataset BPX_H ..
Functions debunked
nhanes()
is the function we use to import our NHANES data and save it in a dataframe. What is a dataframe? We’ll introduce it below. The arguments are as follows:
nhanes(‘Name of Dataset’)
For example: nhanes('DEMO_H')
5.3.4 Explore our dataset
Recall the basic functions that we learned in tutorial 1, it is good practice for us to explore our datasets before doing any analysis.
We can check the dimension of the datasets:
dim(demo)
## [1] 10175 47
dim(bpx)
## [1] 9813 23
We can also check their first few rows:
head(demo, 3)
## SEQN SDDSRVYR RIDSTATR RIAGENDR RIDAGEYR RIDAGEMN RIDRETH1 RIDRETH3 RIDEXMON
## 1 73557 8 2 1 69 NA 4 4 1
## 2 73558 8 2 1 54 NA 3 3 1
## 3 73559 8 2 1 72 NA 3 3 2
## RIDEXAGM DMQMILIZ DMQADFC DMDBORN4 DMDCITZN DMDYRSUS DMDEDUC3 DMDEDUC2
## 1 NA 1 1 1 1 NA NA 3
## 2 NA 2 NA 1 1 NA NA 3
## 3 NA 1 1 1 1 NA NA 4
## DMDMARTL RIDEXPRG SIALANG SIAPROXY SIAINTRP FIALANG FIAPROXY FIAINTRP MIALANG
## 1 4 NA 1 2 2 1 2 2 1
## 2 1 NA 1 2 2 1 2 2 1
## 3 1 NA 1 2 2 1 2 2 1
## MIAPROXY MIAINTRP AIALANGA DMDHHSIZ DMDFMSIZ DMDHHSZA DMDHHSZB DMDHHSZE
## 1 2 2 1 3 3 0 0 2
## 2 2 2 1 4 4 0 2 0
## 3 2 2 NA 2 2 0 0 2
## DMDHRGND DMDHRAGE DMDHRBR4 DMDHREDU DMDHRMAR DMDHSEDU WTINT2YR WTMEC2YR
## 1 1 69 1 3 4 NA 13281.24 13481.04
## 2 1 54 1 3 1 1 23682.06 24471.77
## 3 1 72 1 4 1 3 57214.80 57193.29
## SDMVPSU SDMVSTRA INDHHIN2 INDFMIN2 INDFMPIR
## 1 1 112 4 4 0.84
## 2 1 108 7 7 1.78
## 3 1 109 10 10 4.51
head(bpx, 3)
## SEQN PEASCST1 PEASCTM1 PEASCCT1 BPXCHR BPAARM BPACSZ BPXPLS BPXPULS BPXPTY
## 1 73557 1 620 NA NA 1 4 86 1 1
## 2 73558 1 766 NA NA 1 4 74 1 1
## 3 73559 1 665 NA NA 1 4 68 1 1
## BPXML1 BPXSY1 BPXDI1 BPAEN1 BPXSY2 BPXDI2 BPAEN2 BPXSY3 BPXDI3 BPAEN3 BPXSY4
## 1 140 122 72 2 114 76 2 102 74 2 NA
## 2 170 156 62 2 160 80 2 156 42 2 NA
## 3 160 140 90 2 140 76 2 146 80 2 NA
## BPXDI4 BPAEN4
## 1 NA NA
## 2 NA NA
## 3 NA NA
It does not hurt to also check the last few rows of the datasets:
tail(demo, 3)
## SEQN SDDSRVYR RIDSTATR RIAGENDR RIDAGEYR RIDAGEMN RIDRETH1 RIDRETH3
## 10173 83729 8 2 2 42 NA 4 4
## 10174 83730 8 2 1 7 NA 2 2
## 10175 83731 8 2 1 11 NA 5 6
## RIDEXMON RIDEXAGM DMQMILIZ DMQADFC DMDBORN4 DMDCITZN DMDYRSUS DMDEDUC3
## 10173 2 NA 2 NA 2 1 6 NA
## 10174 1 84 NA NA 1 1 NA 0
## 10175 1 140 NA NA 1 1 NA 5
## DMDEDUC2 DMDMARTL RIDEXPRG SIALANG SIAPROXY SIAINTRP FIALANG FIAPROXY
## 10173 5 3 2 1 2 2 1 2
## 10174 NA NA NA 1 1 2 1 2
## 10175 NA NA NA 1 1 2 1 2
## FIAINTRP MIALANG MIAPROXY MIAINTRP AIALANGA DMDHHSIZ DMDFMSIZ DMDHHSZA
## 10173 2 NA NA NA NA 1 1 0
## 10174 2 NA NA NA NA 4 4 1
## 10175 2 1 2 2 NA 4 4 0
## DMDHHSZB DMDHHSZE DMDHRGND DMDHRAGE DMDHRBR4 DMDHREDU DMDHRMAR DMDHSEDU
## 10173 0 0 2 42 2 5 3 NA
## 10174 1 0 2 30 2 4 1 3
## 10175 2 0 1 43 2 5 1 5
## WTINT2YR WTMEC2YR SDMVPSU SDMVSTRA INDHHIN2 INDFMIN2 INDFMPIR
## 10173 24122.25 26902.344 1 104 7 7 3.66
## 10174 25521.88 26686.026 2 109 6 6 1.05
## 10175 8930.18 9700.873 2 106 15 15 5.00
tail(bpx, 3)
## SEQN PEASCST1 PEASCTM1 PEASCCT1 BPXCHR BPAARM BPACSZ BPXPLS BPXPULS
## 9811 83729 1 679 NA NA 1 4 80 1
## 9812 83730 1 381 NA 72 NA NA NA 1
## 9813 83731 1 498 NA NA 1 3 90 1
## BPXPTY BPXML1 BPXSY1 BPXDI1 BPAEN1 BPXSY2 BPXDI2 BPAEN2 BPXSY3 BPXDI3
## 9811 1 150 136 82 2 130 82 2 138 80
## 9812 NA NA NA NA NA NA NA NA NA NA
## 9813 1 120 94 68 2 94 56 2 90 62
## BPAEN3 BPXSY4 BPXDI4 BPAEN4
## 9811 2 NA NA NA
## 9812 NA NA NA NA
## 9813 2 NA NA NA
5.3.4.1 is.na()
In real world, missing values are unavoidable. There are many reasons for missing values - may be a result from nonresponse or incorrect data collection and it happens all the time. It is always a good idea to check if there is any missing value before proceeding.
In a dataframe, you will see NAs if there’re missing values and NA indicates missing values in R.
Note: Be careful when doing manipulation on real dataset!!!Missing values may be recorded in other ways, for example, infinity or other numbers. To deal with missing values in this case, one way is to convert these values into NAs and then treat them as regular NAs. Details will not be discussed in this tutorial but will likely be included in later tutorials.
To check if there is any missing values (NAs) in a dataframe, use is.na()
:
head(is.na(demo),5)
## SEQN SDDSRVYR RIDSTATR RIAGENDR RIDAGEYR RIDAGEMN RIDRETH1 RIDRETH3
## [1,] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## RIDEXMON RIDEXAGM DMQMILIZ DMQADFC DMDBORN4 DMDCITZN DMDYRSUS DMDEDUC3
## [1,] FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
## [2,] FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE
## [3,] FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
## [4,] FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE
## [5,] FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE
## DMDEDUC2 DMDMARTL RIDEXPRG SIALANG SIAPROXY SIAINTRP FIALANG FIAPROXY
## [1,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [4,] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## FIAINTRP MIALANG MIAPROXY MIAINTRP AIALANGA DMDHHSIZ DMDFMSIZ DMDHHSZA
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## DMDHHSZB DMDHHSZE DMDHRGND DMDHRAGE DMDHRBR4 DMDHREDU DMDHRMAR DMDHSEDU
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## WTINT2YR WTMEC2YR SDMVPSU SDMVSTRA INDHHIN2 INDFMIN2 INDFMPIR
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
To find the total number of missing values in a dataframe:
sum(is.na(demo))
## [1] 82092
To check if there is any missing values (NAs) in a column,for example, gender:
head(is.na(demo['RIAGENDR']),5)
## RIAGENDR
## [1,] FALSE
## [2,] FALSE
## [3,] FALSE
## [4,] FALSE
## [5,] FALSE
To find the total number of missing values in a column:
sum(is.na(demo['RIAGENDR']))
## [1] 0
To find the number of missing values for each column, use the summary() function:
summary(demo)
If there is a missing value in one cell, is.na() will return TRUE; if there is no missing value in one cell, it will return FALSE
For example, (1,1) (1st row 1st column) is FALSE - it means that (1,1) is not a missing value; (1,6) (1st row 6th column) is TRUE - it means that (1,6) is NA.
is.na() is the function we use to inspect any missing values in a dataframe - it is housed in the base package.
The arguments are as follows:
is.na( > NAME OF DATAFRAME
)
For example: is.na(demo)
DO QUESTIONS 1-3 OF THE QUIZ NOW
What package does the function
nhanes()
belong to?Which code below is the correct one to print the last 10 rows of a dataframe called temp?
Suppose a cell contains NA. What is the output after perform is.na() on it?
5.4 Dataset Preparation
Now that we’ve successfully imported the two datasets and we’re ready to do further data manipulations. But before we proceed, there are three issues in the datasets:
The two datasets are Huge. Recall the dimensions of the two datasets in section 1, there are 10175 rows and 47 columns in the demo dataset and 9813 rows and 23 columns in the bpx dataset.
For demostration purpose, this tutorial will only focus on parts of the two datasets.
For the demo dataset, we want to keep the following variables as our primary interests only:
- SEQN (Respondent sequence number)
- RIAGENDR (Gender)
- RIDAGEYR (Age in years at screening)
- RIDRETH3 (Race)
- DMDEDUC2 (Education level - Adults 20+)
For the bpx dataset, we want to keep the following variables as our primary interests only:
- SEQN (Respondent sequence number)
- PEASCST1 (Blood Pressure Status)
- PEASCTM1 (Blood Pressure Time in Seconds)
- BPXSY1 (Systolic: Blood pres (1st rdg) mm Hg)
- BPXDI1 (Diastolic: Blood pres (1st rdg) mm Hg)
For both datasets, we want to keep the top 5 rows only.
There is something odd about the demo dataset. For example,if you run demo alone, you will see that RIAGENDR (gender) is coded as 1 and 2. For ease of future use, we want to translate this 1 and 2 into male and female.
The variable names are long are ambiguous. For ease of further use, we want to rename the variables so that they can be easily understand.
For the demo dataframe, we want to rename the variables in this way:
- SEQN -> id
- RIAGENDR -> gender
- RIDAGEYR -> age
- RIDRETH3 -> race
- DMDEDUC2 -> race
For the bpx dataframe, we want to rename the variables in this way:
- SEQN -> id
- PEASCST1 -> bp_status
- PEASCTM1 -> bpt_sec
- BPXSY1 -> systolic
- BPXDI1 -> diastolic
Note, there are many different ways of renaming as long as the new names are straight-forward. We will use the above rename strategy in this tutorial for the sake of consistency.
With the three issues listed above, here’s an example of how to resolve them in the demo dataset:
First, we need to translate the way of variable encoding using the nhanesTranslate()
function:
<- nhanesTranslate('DEMO_H',
demo_translate c('SEQN', # Respondent sequence number
'RIAGENDR', # Gender
'RIDAGEYR', # Age in years at screening
'RIDRETH3', # Race
'DMDEDUC2'), # Education level - Adults 20+
data = demo)
## Warning in FUN(X[[i]], ...): No translation table is available for SEQN
## Translated columns: RIAGENDR RIDRETH3 DMDEDUC2
head(demo_translate,5)
## SEQN SDDSRVYR RIDSTATR RIAGENDR RIDAGEYR RIDAGEMN RIDRETH1
## 1 73557 8 2 Male 69 NA 4
## 2 73558 8 2 Male 54 NA 3
## 3 73559 8 2 Male 72 NA 3
## 4 73560 8 2 Male 9 NA 3
## 5 73561 8 2 Female 73 NA 3
## RIDRETH3 RIDEXMON RIDEXAGM DMQMILIZ DMQADFC DMDBORN4 DMDCITZN
## 1 Non-Hispanic Black 1 NA 1 1 1 1
## 2 Non-Hispanic White 1 NA 2 NA 1 1
## 3 Non-Hispanic White 2 NA 1 1 1 1
## 4 Non-Hispanic White 1 119 NA NA 1 1
## 5 Non-Hispanic White 1 NA 2 NA 1 1
## DMDYRSUS DMDEDUC3 DMDEDUC2 DMDMARTL RIDEXPRG SIALANG
## 1 NA NA High school graduate/GED or equi 4 NA 1
## 2 NA NA High school graduate/GED or equi 1 NA 1
## 3 NA NA Some college or AA degree 1 NA 1
## 4 NA 3 <NA> NA NA 1
## 5 NA NA College graduate or above 1 NA 1
## SIAPROXY SIAINTRP FIALANG FIAPROXY FIAINTRP MIALANG MIAPROXY MIAINTRP
## 1 2 2 1 2 2 1 2 2
## 2 2 2 1 2 2 1 2 2
## 3 2 2 1 2 2 1 2 2
## 4 1 2 1 2 2 1 2 2
## 5 2 2 1 2 2 1 2 2
## AIALANGA DMDHHSIZ DMDFMSIZ DMDHHSZA DMDHHSZB DMDHHSZE DMDHRGND DMDHRAGE
## 1 1 3 3 0 0 2 1 69
## 2 1 4 4 0 2 0 1 54
## 3 NA 2 2 0 0 2 1 72
## 4 1 4 4 0 2 0 1 33
## 5 NA 2 2 0 0 2 1 78
## DMDHRBR4 DMDHREDU DMDHRMAR DMDHSEDU WTINT2YR WTMEC2YR SDMVPSU SDMVSTRA
## 1 1 3 4 NA 13281.24 13481.04 1 112
## 2 1 3 1 1 23682.06 24471.77 1 108
## 3 1 4 1 3 57214.80 57193.29 1 109
## 4 1 3 1 4 55201.18 55766.51 2 109
## 5 1 5 1 5 63709.67 65541.87 2 116
## INDHHIN2 INDFMIN2 INDFMPIR
## 1 4 4 0.84
## 2 7 7 1.78
## 3 10 10 4.51
## 4 9 9 2.52
## 5 15 15 5.00
Second, we want to keep the variabls which we’re interested in only:
<- select(demo_translate,
new_demo c(SEQN,
RIAGENDR,
RIDAGEYR,
RIDRETH3,
DMDEDUC2))head(new_demo,5)
## SEQN RIAGENDR RIDAGEYR RIDRETH3 DMDEDUC2
## 1 73557 Male 69 Non-Hispanic Black High school graduate/GED or equi
## 2 73558 Male 54 Non-Hispanic White High school graduate/GED or equi
## 3 73559 Male 72 Non-Hispanic White Some college or AA degree
## 4 73560 Male 9 Non-Hispanic White <NA>
## 5 73561 Female 73 Non-Hispanic White College graduate or above
Third, we want to rename the variables using the rename()
function:
<- rename(new_demo,
new_demo id = SEQN, # Respondent sequence number
gender = RIAGENDR, # Gender
age = RIDAGEYR, # Age in years at screening
race = RIDRETH3, # Race/Hispanic origin
edu = DMDEDUC2, # Education level - Adults 20+
)head(new_demo,5)
## id gender age race edu
## 1 73557 Male 69 Non-Hispanic Black High school graduate/GED or equi
## 2 73558 Male 54 Non-Hispanic White High school graduate/GED or equi
## 3 73559 Male 72 Non-Hispanic White Some college or AA degree
## 4 73560 Male 9 Non-Hispanic White <NA>
## 5 73561 Female 73 Non-Hispanic White College graduate or above
Last, we want to keep the top 10 rows using the head()
function and save the dataframe as final_demo
:
<- head(new_demo,10)
final_demo final_demo
## id gender age race edu
## 1 73557 Male 69 Non-Hispanic Black High school graduate/GED or equi
## 2 73558 Male 54 Non-Hispanic White High school graduate/GED or equi
## 3 73559 Male 72 Non-Hispanic White Some college or AA degree
## 4 73560 Male 9 Non-Hispanic White <NA>
## 5 73561 Female 73 Non-Hispanic White College graduate or above
## 6 73562 Male 56 Mexican American Some college or AA degree
## 7 73563 Male 0 Non-Hispanic White <NA>
## 8 73564 Female 61 Non-Hispanic White College graduate or above
## 9 73565 Male 42 Other Hispanic High school graduate/GED or equi
## 10 73566 Female 56 Non-Hispanic White High school graduate/GED or equi
rename() is the function we use to rename the variables and return all variables - it is housed in the dplyr package. The arguments are as follows:
rename(Name of Data Frame, New Variable Name = Old Variable Name)
For example: rename(demo, id = SEQN)
select() is the function we use to only return the varibles we want - it is housed in the dplyr package. The arguments are as follows:
select(Name of Data Frame, Variables (can be written as a vector))
For example: select(demo, c(id,gender,age))
For example: select(demo, id:age))
Try it yourself 5.1
In the bpx dataframe, keep the following variables and top 5 rows only:
- SEQN
- PEASCST1
- PEASCTM1
- BPXSY1
- BPXDI1
Try it yourself 5.2
rename the variables in the following way:
- SEQN -> id
- PEASCST1 -> bp_status
- PEASCTM1 -> bpt_sec
- BPXSY1 -> systolic
- BPXDI1 -> diastolic
DO QUESTIONS 4 & 5 OF THE QUIZ NOW
Given the following code:
select(dataset, A, B)
. What is the purpose of the code?Which code below is the correct one to rename column A to B in a dataframe called dataset?
5.5 Filter
Sometimes, we want to focus on a subset of the dataset that satisifying some conditions for further analysis. In this case, we need to filter the dataset based on variables’ values and conditions.
For example, we use the following code to filter the observations that patients are 40 years old at screening.
filter(final_demo, age == 40)
## [1] id gender age race edu
## <0 rows> (or 0-length row.names)
We may also interested in the observations that patients are 40 or 41 years old at screening:
filter(final_demo, age == 40 | age == 41)
## [1] id gender age race edu
## <0 rows> (or 0-length row.names)
Another way of writing the code above is to use the %in%
operator:
x %in% y
is equivalent to the condition that the value of x is in one of the values of y
filter(final_demo, age %in% c(40, 41))
## [1] id gender age race edu
## <0 rows> (or 0-length row.names)
Functions debunked
filter() is the function we use to subset the dataset based on their values and conditions - it is housed in the dplyr package. The arguments are as follows:
filter(Name of Dataset, Condition 1, Condition 2, …, Condition n)
For example: filter(new_demo, gender == 1)
Note:
- Pay attention to the difference between
=
and==
>=
is assignment operator
==
is comparison operator
More comparison operators are <
, <=
,>
, >=
,!=
.
- Logical operators
Mutiple conditions can be combined using logical operators. Common logical operators are:
and
is&
or
is|
not
is!
For example: filter(new_demo, age == 40 | age == 41)
5.5.0.1 Missing values
Note, we don’t have any missing values in the age column (can be checked using is.na(new_demo['age'])
).
What if we have missing values and how does filter() treat missing values?
filter() only keeps the rows where the condition is TRUE and remove the rows where the condition is failed due to FALSE OR NA.
If you want to keep the NAs, you need to add the condition explicitly: is.na(VARIABLE_NAME)
and use |
to combine with other conditions.
For example: filter(new_demo, is.na(income) | income <= 50)
Try it yourself 5.3
In the demo dataframe, find all the records that:
The participant who is a male
The participant who is a male and is older tha 50 years old
The education level is missing
DO QUESTIONS 6-8 OF THE QUIZ NOW
Fill in the blank to answer the ‘Try it yourself’ sections:
filter(final_demo, gender
___
)filter(final_demo, gender == ‘Male’
___
age > 50)filter(final_demo,
___
)
5.6 Re-order the Rows
Suppose we want to re-order the observations by certain variables, we can use the arrange() function in R.
First, let’s look at the top 5 rows in the demo dataframe:
final_demo
## id gender age race edu
## 1 73557 Male 69 Non-Hispanic Black High school graduate/GED or equi
## 2 73558 Male 54 Non-Hispanic White High school graduate/GED or equi
## 3 73559 Male 72 Non-Hispanic White Some college or AA degree
## 4 73560 Male 9 Non-Hispanic White <NA>
## 5 73561 Female 73 Non-Hispanic White College graduate or above
## 6 73562 Male 56 Mexican American Some college or AA degree
## 7 73563 Male 0 Non-Hispanic White <NA>
## 8 73564 Female 61 Non-Hispanic White College graduate or above
## 9 73565 Male 42 Other Hispanic High school graduate/GED or equi
## 10 73566 Female 56 Non-Hispanic White High school graduate/GED or equi
We notice that the column “age” doesn’t follow any order. We use the following code to re-order the dataframe by age:
arrange(final_demo, age)
## id gender age race edu
## 1 73563 Male 0 Non-Hispanic White <NA>
## 2 73560 Male 9 Non-Hispanic White <NA>
## 3 73565 Male 42 Other Hispanic High school graduate/GED or equi
## 4 73558 Male 54 Non-Hispanic White High school graduate/GED or equi
## 5 73562 Male 56 Mexican American Some college or AA degree
## 6 73566 Female 56 Non-Hispanic White High school graduate/GED or equi
## 7 73564 Female 61 Non-Hispanic White College graduate or above
## 8 73557 Male 69 Non-Hispanic Black High school graduate/GED or equi
## 9 73559 Male 72 Non-Hispanic White Some college or AA degree
## 10 73561 Female 73 Non-Hispanic White College graduate or above
If we want to change it to descending order, use desc():
arrange(final_demo, desc(age))
## id gender age race edu
## 1 73561 Female 73 Non-Hispanic White College graduate or above
## 2 73559 Male 72 Non-Hispanic White Some college or AA degree
## 3 73557 Male 69 Non-Hispanic Black High school graduate/GED or equi
## 4 73564 Female 61 Non-Hispanic White College graduate or above
## 5 73562 Male 56 Mexican American Some college or AA degree
## 6 73566 Female 56 Non-Hispanic White High school graduate/GED or equi
## 7 73558 Male 54 Non-Hispanic White High school graduate/GED or equi
## 8 73565 Male 42 Other Hispanic High school graduate/GED or equi
## 9 73560 Male 9 Non-Hispanic White <NA>
## 10 73563 Male 0 Non-Hispanic White <NA>
What if we want to change the order by multiple columns?
We can simply add multiple columns in the arguments! It’s also useful when there are ties in the values of one column and the subsquent columns are used to break the ties.
For example, re-order the observations by age,id:
arrange(final_demo,age,id)
## id gender age race edu
## 1 73563 Male 0 Non-Hispanic White <NA>
## 2 73560 Male 9 Non-Hispanic White <NA>
## 3 73565 Male 42 Other Hispanic High school graduate/GED or equi
## 4 73558 Male 54 Non-Hispanic White High school graduate/GED or equi
## 5 73562 Male 56 Mexican American Some college or AA degree
## 6 73566 Female 56 Non-Hispanic White High school graduate/GED or equi
## 7 73564 Female 61 Non-Hispanic White College graduate or above
## 8 73557 Male 69 Non-Hispanic Black High school graduate/GED or equi
## 9 73559 Male 72 Non-Hispanic White Some college or AA degree
## 10 73561 Female 73 Non-Hispanic White College graduate or above
Functions debunked
arrange() is the function we use to change the order of the observations by columns - it is housed in the dplyr package. The arguments are as follows:
arrange( > NAME OF DATAFRAME,
COLUMN 1
COLUMN 2
…
COLUMN n
)
For example: arrange(new_demo, gender)
Use desc(COLUMN_NAME)
to reorder the dataframe by COLUMN_NAME in descending order.
5.6.0.1 Missing values
In the last example, we notice that there are missing values in column ‘income’ and ‘income_ratio.’
How does arrange() deal with missing value? All NAs will be retained at the end. Check the output above!
Try it yourself 5.4
Re-order the rows in the bpx dataset by Blood Pressure Time in Seconds (bpt_sec) in descending order:
DO QUESTION 9 OF THE QUIZ NOW
Fill in the blank to answer the ‘Try it yourself’ section:
___
(final_bpx,___
)
5.7 Add new variables
Suppose we want to gain more information about our observations and we want to constrcut a new variable based on the variables we have. The mutate() and transmute() in R helps us to add new variables into a dataframe.
For example, we want to add a new variable called born_year in the demo dataframe and born_year is calculated in this way: born_year = 2021- age:
mutate(final_demo,
born_year = 2021 - age
)
## id gender age race edu
## 1 73557 Male 69 Non-Hispanic Black High school graduate/GED or equi
## 2 73558 Male 54 Non-Hispanic White High school graduate/GED or equi
## 3 73559 Male 72 Non-Hispanic White Some college or AA degree
## 4 73560 Male 9 Non-Hispanic White <NA>
## 5 73561 Female 73 Non-Hispanic White College graduate or above
## 6 73562 Male 56 Mexican American Some college or AA degree
## 7 73563 Male 0 Non-Hispanic White <NA>
## 8 73564 Female 61 Non-Hispanic White College graduate or above
## 9 73565 Male 42 Other Hispanic High school graduate/GED or equi
## 10 73566 Female 56 Non-Hispanic White High school graduate/GED or equi
## born_year
## 1 1952
## 2 1967
## 3 1949
## 4 2012
## 5 1948
## 6 1965
## 7 2021
## 8 1960
## 9 1979
## 10 1965
If we want to keep the added variables only, use transmute() instead:
transmute(final_demo,
born_year = 2021 - age
)
## born_year
## 1 1952
## 2 1967
## 3 1949
## 4 2012
## 5 1948
## 6 1965
## 7 2021
## 8 1960
## 9 1979
## 10 1965
Functions debunked
mutate() is the function we use to create new variables based on variables we have and add them to the original dataframe - it is housed in the dplyr package. The arguments are as follows:
mutate( > NAME OF DATAFRAME,
NEW_VARIABLE = FUNCTION OF EXISTING VARIBALES
… )
For example: mutate(new_demo, rescale_income = income/2)
transmute() is the function we use to create new variables based on variables we have and only keep the added variables - it is housed in the dplyr package. The arguments are as follows:
transmute( > NAME OF DATAFRAME,
NEW_VARIABLE = FUNCTION OF EXISTING VARIBALES
…
)
For example: transmute(new_demo, rescale_income = income/2)
You can find more details in how to use creation functions to create new variables in Chapter I. Explore Chapter 3.
Try it yourself 5.5
Create a new variable called called rescale_bpt_sec that records the Blood Pressure Time in miuntes. Keep both original and new variables.
Create rescale_bpt_sec in the same way above and only keep new variables.
Note: Try to avoid using select().
DO QUESTION 10 OF THE QUIZ NOW
Fill in the blank to answer both parts of the ‘Try it yourself’ sections.
___
(final_bpx, rescale_bpt_sec = bpt_sec/60)
5.8 Summary Statistics and group_by
Information about the whole dataframe such as mean is very useful for data analysis. However, things could be very complex when aggregating multiple functions. We’ll start with the simple one: find a summary statistic from the whole dataset.
For example, find the average age in the whole demo dataframe:
summarize(final_demo,average_age = mean(age,na.rm = TRUE))
## average_age
## 1 49.2
It gets more complex when we change the unit of analysis into groups, i.e, find summary statistics grouping by variables. To do so, we first convert the dataframe into a grouped dataframe using group_by() and then apply the summarize() to the grouped dataframe.
For example, find the average age in the demo dataframe per gender:
<- group_by(final_demo,gender)
by_gender summarize(by_gender, average_age = mean(age, na.rm = TRUE))
## # A tibble: 2 x 2
## gender average_age
## <fct> <dbl>
## 1 Male 43.1
## 2 Female 63.3
Functions debunked
summarize() is the function we use to compute the summary statistics for the whole dataframe - it is housed in the dplyr package. The arguments are as follows:
summarize(Name of Data Frame, Name of Summary Statistic = Function())
For example: summarize(new_demo, mean(age,na.rm = TRUE))
group_by() is the function we use to create a grouped dataframe grouping by one or more variables - it is housed in the dplyr package. The arguments are as follows:
group_by(Name of Data Frame, Name of Variable)
For example: group_by(new_demo,age)
Note: if there is missing value in the variable, group_by() treats it as a new group
Try it yourself 5.6
Find the average age in the demo dataframe per education level
DO QUESTION 11 OF THE QUIZ NOW
Fill in the blanks to answer the ‘Try it yourself’ section:
by_edu <- group_by(
___
)summarize(
___
, average_age =___
(age, na.rm = TRUE))
5.8.1 Group_by() extension
group_by() is also useful when conjuncting with filter() and mutate().
For example, return the observations which has more than 2 records in each education group.
<-group_by(final_demo,edu)
by_edu filter(by_edu,n() > 2)
## # A tibble: 4 x 5
## # Groups: edu [1]
## id gender age race edu
## <labelled> <fct> <labelled> <fct> <fct>
## 1 73557 Male 69 Non-Hispanic Black High school graduate/GED or e~
## 2 73558 Male 54 Non-Hispanic White High school graduate/GED or e~
## 3 73565 Male 42 Other Hispanic High school graduate/GED or e~
## 4 73566 Female 56 Non-Hispanic White High school graduate/GED or e~
Here’s another example: using group_by() with mutate() to compute the difference in each age and the average mean in each gender group
<-group_by(final_demo,gender)
by_gender mutate(by_gender,
diff_age = age - mean(age, na.rm = T))
## # A tibble: 10 x 6
## # Groups: gender [2]
## id gender age race edu diff_age
## <labelle> <fct> <labelle> <fct> <fct> <labelled>
## 1 73557 Male 69 Non-Hispanic ~ High school graduate/GE~ 25.857143
## 2 73558 Male 54 Non-Hispanic ~ High school graduate/GE~ 10.857143
## 3 73559 Male 72 Non-Hispanic ~ Some college or AA degr~ 28.857143
## 4 73560 Male 9 Non-Hispanic ~ <NA> -34.142857
## 5 73561 Female 73 Non-Hispanic ~ College graduate or abo~ 9.666667
## 6 73562 Male 56 Mexican Ameri~ Some college or AA degr~ 12.857143
## 7 73563 Male 0 Non-Hispanic ~ <NA> -43.142857
## 8 73564 Female 61 Non-Hispanic ~ College graduate or abo~ -2.333333
## 9 73565 Male 42 Other Hispanic High school graduate/GE~ -1.142857
## 10 73566 Female 56 Non-Hispanic ~ High school graduate/GE~ -7.333333
Functions debunked
n() is the function we use to count the number of observations in each group - it is housed in the dplyr package. It can only be used with the existence of summarize(), filter(), and mutate() and there is no argument in it.
Try it yourself 5.7
Return the observations which has more than 3 records in each gender group.
Try it yourself 5.8
Compute the difference in each age and the average mean in each education level group
DO QUESTIONS 12 & 13 OF THE QUIZ NOW
Fill in the blanks to answer the ‘Try it yourself’ sections:
by_gender <-group_by(final_demo,gender)
- filter(by_edu,
___
)
- filter(by_edu,
by_edu <-group_by(final_demo,edu)
- mutate(by_edu,
___
)
- mutate(by_edu,
5.9 Pipe
You may have noticed that when doing multiple-step operations, we need to assign the output to a new variable every time when a step is done. It becomes more annoying when there are more steps to do. The pipe operator %>% - which is housed in the magrittr package - saves us from create many unnecessary variables: it takes the output from one function as an input to the following function.
Here’s example without using pipe: we want to first keep the observations in the demo dataframe with an age greater than 40 and then create a new variable called born_year calculated by 2021 - age.
We need to do this in two steps.The first step is to filter the dataframe and save the output as a new variable temp:
<- filter(final_demo, age > 40)
temp temp
## id gender age race edu
## 1 73557 Male 69 Non-Hispanic Black High school graduate/GED or equi
## 2 73558 Male 54 Non-Hispanic White High school graduate/GED or equi
## 3 73559 Male 72 Non-Hispanic White Some college or AA degree
## 4 73561 Female 73 Non-Hispanic White College graduate or above
## 5 73562 Male 56 Mexican American Some college or AA degree
## 6 73564 Female 61 Non-Hispanic White College graduate or above
## 7 73565 Male 42 Other Hispanic High school graduate/GED or equi
## 8 73566 Female 56 Non-Hispanic White High school graduate/GED or equi
The second step is to create the new variable:
<- mutate(temp, born_year = 2021 - age)
temp temp
## id gender age race edu
## 1 73557 Male 69 Non-Hispanic Black High school graduate/GED or equi
## 2 73558 Male 54 Non-Hispanic White High school graduate/GED or equi
## 3 73559 Male 72 Non-Hispanic White Some college or AA degree
## 4 73561 Female 73 Non-Hispanic White College graduate or above
## 5 73562 Male 56 Mexican American Some college or AA degree
## 6 73564 Female 61 Non-Hispanic White College graduate or above
## 7 73565 Male 42 Other Hispanic High school graduate/GED or equi
## 8 73566 Female 56 Non-Hispanic White High school graduate/GED or equi
## born_year
## 1 1952
## 2 1967
## 3 1949
## 4 1948
## 5 1965
## 6 1960
## 7 1979
## 8 1965
The intermediate variable temp can be avoided by using the pipe operatore:
%>%
final_demo filter(age > 40) %>%
mutate(born_year = 2021 - age)
## id gender age race edu
## 1 73557 Male 69 Non-Hispanic Black High school graduate/GED or equi
## 2 73558 Male 54 Non-Hispanic White High school graduate/GED or equi
## 3 73559 Male 72 Non-Hispanic White Some college or AA degree
## 4 73561 Female 73 Non-Hispanic White College graduate or above
## 5 73562 Male 56 Mexican American Some college or AA degree
## 6 73564 Female 61 Non-Hispanic White College graduate or above
## 7 73565 Male 42 Other Hispanic High school graduate/GED or equi
## 8 73566 Female 56 Non-Hispanic White High school graduate/GED or equi
## born_year
## 1 1952
## 2 1967
## 3 1949
## 4 1948
## 5 1965
## 6 1960
## 7 1979
## 8 1965
Try it yourself 5.9
Re-write the following code using pipe operator
# temp <- filter(final_bpx, systolic > 120)
# temp <- mutate(temp, bpt_min = bpt_sec/60)
# temp
DO QUESTION 14 OF THE QUIZ NOW
Fill in the blanks to answer the’Try it yourself’ section:
___
%>%- filter(
___
,systolic > 120) %>% - mutate(
___
,bpt_min = bpt_sec/60)
5.10 Summary of dealing with missing values
Dealing with missing values can be complex. You need to be careful when using functions since different functions have different ways in dealing with missing values.
filter(): excludes missing values. If you do want to retain missing values, use is.na() explicitly.
arrange(): retains missing values and sorts them at the end
select(): retains missing values
mutate(): retains missing values
summarize(): retains missing values. If you want to remove missing values, set na.rm = TRUE in aggregation functions.
group_by(): treats missing values as a group
5.11 Alternatives to NHANESTranslate()
5.11.1 case_when()
case_when()
is another useful function that we can use that will aid our data analysis. This function acts like mutate()
, except it changes the actual values of the variable. Here is an example of how we can use case_when()
and mutate()
together to change values within a variable:
<- c(97, 36, 55, 50, 49, 65, 46, 87, 100) x
case_when(x < 50 ~ "Fail",
> 65 ~ "Pass",
x TRUE ~ "Retake recommended")
## [1] "Pass" "Fail" "Retake recommended"
## [4] "Retake recommended" "Fail" "Retake recommended"
## [7] "Fail" "Pass" "Pass"
In the example above, we have a vector of student grades. Using case_when()
, we have translated the grades to “Fail,” “Pass,” and “Retake recommended” depending on the student grades.
* Fail if students receive a grade of less than 50,
* Pass if students receive a grade of greater than 65,
* Retake recommended if students receive a grade between 50 and 65.
5.12 Translating NHANES using case_when()
Another general method for us to translate our NHANES data into conventional language is to use mutate()
and case_when()
! However, this method requires us to know what each numerical value of each variable means. For example, if we want to translate the RIDRETH3 values, then we need to check the Codebook and Frequencies for the following translations:
- 1: “Mexican American”
- 2: “Other Hispanic”
- 3: “Non-Hispanic White”
- 4: “Non-Hispanic Black”
- 6: “Non-Hispanic Asian”
- 7: “Other Race - Including Multi-Racial”
- .: “Missing”
Knowing this, we can use mutate()
and case_when()
like so:
<- demo %>%
translated_demo mutate(Race = case_when(
== 1 ~ "Mexican American",
RIDRETH3 == 2 ~ "Other Hispanic",
RIDRETH3 == 3 ~ "Non-Hispanic White",
RIDRETH3 == 4 ~ "Non-Hispanic Black",
RIDRETH3 == 6 ~ "Non-Hispanic Asian",
RIDRETH3 == 7 ~ "Other Race - Including Multi-Racial",
RIDRETH3 == "." ~ "Missing"
RIDRETH3 %>%
)) select(ID = SEQN, Race)
head(translated_demo, 10)
## ID Race
## 1 73557 Non-Hispanic Black
## 2 73558 Non-Hispanic White
## 3 73559 Non-Hispanic White
## 4 73560 Non-Hispanic White
## 5 73561 Non-Hispanic White
## 6 73562 Mexican American
## 7 73563 Non-Hispanic White
## 8 73564 Non-Hispanic White
## 9 73565 Other Hispanic
## 10 73566 Non-Hispanic White
But what if we want to have less categories or want to combine some of the categories? What do we do then?
case_when()
is still the way to go! In this case, our codes should look like this:
<- demo %>%
less_categories mutate(Race = case_when(
== 1 ~ "Hispanic",
RIDRETH3 == 2 ~ "Hispanic",
RIDRETH3 == 3 ~ "White",
RIDRETH3 == 4 ~ "Black",
RIDRETH3 == 6 ~ "Asian",
RIDRETH3 == 7 ~ "Other",
RIDRETH3 == "." ~ "Missing"
RIDRETH3 %>%
)) select(ID = SEQN, Race)
head(less_categories, 10)
## ID Race
## 1 73557 Black
## 2 73558 White
## 3 73559 White
## 4 73560 White
## 5 73561 White
## 6 73562 Hispanic
## 7 73563 White
## 8 73564 White
## 9 73565 Hispanic
## 10 73566 White
Functions debunked
case_when() is a function that we use to change or translate the values of our variables into something else that is still meaningful. The arguments are as follows:
case_when(A Variable == Value as it is written in the original dataset ~ “New/Translated Value” )
Note that if none of the cases match, then R will automatically regard it as a missing (NA) value.
For example: case_when(x < 50 ~ "Fail", x > 65 ~ "Pass", TRUE ~ "Retake recommended")
5.13 recode()
from dplyr
Another alternative is recode()
. This function works the same way as case_when()
, except you only need to identify the variable once like so:
# demo_translate2 <- demo %>%
# mutate(Race = dplyr::recode(RIDRETH3,
# `1` = "Hispanic",
# `2` = "Hispanic",
# `3` = "White",
# `4` = "Black",
# `6` = "Asian",
# `7` = "Other",
# .default = "Missing"
# )) %>%
# select(ID = SEQN, Race)
# head(demo_translate2, 10)
5.14 recode()
from car
Another option is to use recode()
from the car package. This function is mostly used for translating numerical data into more meaningful character data or strings.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
While looking at the codes below, take note of the different types of quotation marks: single (’’) or double ("") and the semi-colon (;) as the argument separator. Note also how we can use 1:2
instead of separating them into 1 and 2 like what we did with earlier functions.
<- demo %>%
demo_translate3 mutate(Race = car::recode(RIDRETH3,
"1:2 = 'Hispanic';
3 = 'White';
4 = 'Black';
6 = 'Asian';
7 = 'Other';
else = 'Missing'"
%>%
)) select(ID = SEQN, Race)
head(demo_translate3, 10)
## ID Race
## 1 73557 Black
## 2 73558 White
## 3 73559 White
## 4 73560 White
## 5 73561 White
## 6 73562 Hispanic
## 7 73563 White
## 8 73564 White
## 9 73565 Hispanic
## 10 73566 White
5.15 Summary and Takeaways
By the end of this tutorial, you should be familiar with the dplyr package and be able to do basic data wrangling by yourself.
For more study materials on the dplyr package, check out this textbook.