Chapter 5 Data Analysis with dplyr

5.1 Instructions

This tutorial is aiming to introduce you to how to manipulate data and transfer the raw data into ready-to-analysis form in R. It will guide you to learn and practice the basic and useful functions in the dplyr package. In this tutorial, follow the step-by-step instruction as well as the examples which demonstrating how the functions work.

Accompanying this tutorial is a short Google quiz for your own self-assessment. The instructions of this tutorial will clearly indicate when you should answer which question.

5.2 Learning Objectives

Be able to import nhanesA and dplyr package. Be able to understand dataframe.
Be able to use rename() and select() to rename and select variables (columns) in a dataframe.
Be able to use filter() to subset a dataframe based on conditions.
Be able to use arrange() to re-order the rows in a dataframe.
Be able to use mutate() and transmute() to add new variables that are computed from existing variables in a dataframe.
Be able to use summarize() to get summary statistics from a dataframe; Be able to perform grouping with summarize(), filter(), and mutate().
Be able to use pipe to re-write mutilpe operations in a more readable way.
Be able to check missing values existence and deal with missing values while using the above functions.

5.3 Set up

5.3.1 Install and load packages

First, we need to install and load the dplyr package as well as bring in our National Health and Nutrition Examination Survey (NHANES) dataset. This particular tutorial uses data from 2013-2014.

For more information about NHANES, you can visit this website. It is recommended that you explore this website to familiar yourself with the data that we will be using throughout this tutorial.

# install.packages("dplyr")
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# install.packages("nhanesA")
library(nhanesA)

5.3.2 Set working directory

We want to set the working directory using setwd() - you may need to change the working directory to match your hard drive set-up.

# setwd('/kaggle/working')

5.3.3 Import dataset

This particular tutorial uses the Demographics dataset and Blood pressure dataset from NHANES dataset (2013-2014).

The Demographics dataset contains a huge records of demographics information such as gender, race, annual income for each participant. The Blood pressure dataset is in the Examination data and contains a hugh records of measurements that related to blood pressure measurement.

More information on the imported dataset can be found here:

Demographics data
Examination data
Complete variable dictionary
DEMO_H Code book (Demographic Variables)
BPX_H Code book (Blood Pressure)

demo <- nhanes('DEMO_H')

## Processing SAS dataset DEMO_H     ..

bpx <- nhanes('BPX_H')

## Processing SAS dataset BPX_H      ..

Functions debunked

nhanes() is the function we use to import our NHANES data and save it in a dataframe. What is a dataframe? We’ll introduce it below. The arguments are as follows:

nhanes(‘Name of Dataset’)

For example: nhanes('DEMO_H')

5.3.4 Explore our dataset

Recall the basic functions that we learned in tutorial 1, it is good practice for us to explore our datasets before doing any analysis.

We can check the dimension of the datasets:

dim(demo)

## [1] 10175    47

dim(bpx)

## [1] 9813   23

We can also check their first few rows:

head(demo, 3)

##    SEQN SDDSRVYR RIDSTATR RIAGENDR RIDAGEYR RIDAGEMN RIDRETH1 RIDRETH3 RIDEXMON
## 1 73557        8        2        1       69       NA        4        4        1
## 2 73558        8        2        1       54       NA        3        3        1
## 3 73559        8        2        1       72       NA        3        3        2
##   RIDEXAGM DMQMILIZ DMQADFC DMDBORN4 DMDCITZN DMDYRSUS DMDEDUC3 DMDEDUC2
## 1       NA        1       1        1        1       NA       NA        3
## 2       NA        2      NA        1        1       NA       NA        3
## 3       NA        1       1        1        1       NA       NA        4
##   DMDMARTL RIDEXPRG SIALANG SIAPROXY SIAINTRP FIALANG FIAPROXY FIAINTRP MIALANG
## 1        4       NA       1        2        2       1        2        2       1
## 2        1       NA       1        2        2       1        2        2       1
## 3        1       NA       1        2        2       1        2        2       1
##   MIAPROXY MIAINTRP AIALANGA DMDHHSIZ DMDFMSIZ DMDHHSZA DMDHHSZB DMDHHSZE
## 1        2        2        1        3        3        0        0        2
## 2        2        2        1        4        4        0        2        0
## 3        2        2       NA        2        2        0        0        2
##   DMDHRGND DMDHRAGE DMDHRBR4 DMDHREDU DMDHRMAR DMDHSEDU WTINT2YR WTMEC2YR
## 1        1       69        1        3        4       NA 13281.24 13481.04
## 2        1       54        1        3        1        1 23682.06 24471.77
## 3        1       72        1        4        1        3 57214.80 57193.29
##   SDMVPSU SDMVSTRA INDHHIN2 INDFMIN2 INDFMPIR
## 1       1      112        4        4     0.84
## 2       1      108        7        7     1.78
## 3       1      109       10       10     4.51

head(bpx, 3)

##    SEQN PEASCST1 PEASCTM1 PEASCCT1 BPXCHR BPAARM BPACSZ BPXPLS BPXPULS BPXPTY
## 1 73557        1      620       NA     NA      1      4     86       1      1
## 2 73558        1      766       NA     NA      1      4     74       1      1
## 3 73559        1      665       NA     NA      1      4     68       1      1
##   BPXML1 BPXSY1 BPXDI1 BPAEN1 BPXSY2 BPXDI2 BPAEN2 BPXSY3 BPXDI3 BPAEN3 BPXSY4
## 1    140    122     72      2    114     76      2    102     74      2     NA
## 2    170    156     62      2    160     80      2    156     42      2     NA
## 3    160    140     90      2    140     76      2    146     80      2     NA
##   BPXDI4 BPAEN4
## 1     NA     NA
## 2     NA     NA
## 3     NA     NA

It does not hurt to also check the last few rows of the datasets:

tail(demo, 3)

##        SEQN SDDSRVYR RIDSTATR RIAGENDR RIDAGEYR RIDAGEMN RIDRETH1 RIDRETH3
## 10173 83729        8        2        2       42       NA        4        4
## 10174 83730        8        2        1        7       NA        2        2
## 10175 83731        8        2        1       11       NA        5        6
##       RIDEXMON RIDEXAGM DMQMILIZ DMQADFC DMDBORN4 DMDCITZN DMDYRSUS DMDEDUC3
## 10173        2       NA        2      NA        2        1        6       NA
## 10174        1       84       NA      NA        1        1       NA        0
## 10175        1      140       NA      NA        1        1       NA        5
##       DMDEDUC2 DMDMARTL RIDEXPRG SIALANG SIAPROXY SIAINTRP FIALANG FIAPROXY
## 10173        5        3        2       1        2        2       1        2
## 10174       NA       NA       NA       1        1        2       1        2
## 10175       NA       NA       NA       1        1        2       1        2
##       FIAINTRP MIALANG MIAPROXY MIAINTRP AIALANGA DMDHHSIZ DMDFMSIZ DMDHHSZA
## 10173        2      NA       NA       NA       NA        1        1        0
## 10174        2      NA       NA       NA       NA        4        4        1
## 10175        2       1        2        2       NA        4        4        0
##       DMDHHSZB DMDHHSZE DMDHRGND DMDHRAGE DMDHRBR4 DMDHREDU DMDHRMAR DMDHSEDU
## 10173        0        0        2       42        2        5        3       NA
## 10174        1        0        2       30        2        4        1        3
## 10175        2        0        1       43        2        5        1        5
##       WTINT2YR  WTMEC2YR SDMVPSU SDMVSTRA INDHHIN2 INDFMIN2 INDFMPIR
## 10173 24122.25 26902.344       1      104        7        7     3.66
## 10174 25521.88 26686.026       2      109        6        6     1.05
## 10175  8930.18  9700.873       2      106       15       15     5.00

tail(bpx, 3)

##       SEQN PEASCST1 PEASCTM1 PEASCCT1 BPXCHR BPAARM BPACSZ BPXPLS BPXPULS
## 9811 83729        1      679       NA     NA      1      4     80       1
## 9812 83730        1      381       NA     72     NA     NA     NA       1
## 9813 83731        1      498       NA     NA      1      3     90       1
##      BPXPTY BPXML1 BPXSY1 BPXDI1 BPAEN1 BPXSY2 BPXDI2 BPAEN2 BPXSY3 BPXDI3
## 9811      1    150    136     82      2    130     82      2    138     80
## 9812     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 9813      1    120     94     68      2     94     56      2     90     62
##      BPAEN3 BPXSY4 BPXDI4 BPAEN4
## 9811      2     NA     NA     NA
## 9812     NA     NA     NA     NA
## 9813      2     NA     NA     NA

5.3.4.1 is.na()

In real world, missing values are unavoidable. There are many reasons for missing values - may be a result from nonresponse or incorrect data collection and it happens all the time. It is always a good idea to check if there is any missing value before proceeding.

In a dataframe, you will see NAs if there’re missing values and NA indicates missing values in R.

Note: Be careful when doing manipulation on real dataset!!!Missing values may be recorded in other ways, for example, infinity or other numbers. To deal with missing values in this case, one way is to convert these values into NAs and then treat them as regular NAs. Details will not be discussed in this tutorial but will likely be included in later tutorials.

To check if there is any missing values (NAs) in a dataframe, use is.na():

head(is.na(demo),5)

##       SEQN SDDSRVYR RIDSTATR RIAGENDR RIDAGEYR RIDAGEMN RIDRETH1 RIDRETH3
## [1,] FALSE    FALSE    FALSE    FALSE    FALSE     TRUE    FALSE    FALSE
## [2,] FALSE    FALSE    FALSE    FALSE    FALSE     TRUE    FALSE    FALSE
## [3,] FALSE    FALSE    FALSE    FALSE    FALSE     TRUE    FALSE    FALSE
## [4,] FALSE    FALSE    FALSE    FALSE    FALSE     TRUE    FALSE    FALSE
## [5,] FALSE    FALSE    FALSE    FALSE    FALSE     TRUE    FALSE    FALSE
##      RIDEXMON RIDEXAGM DMQMILIZ DMQADFC DMDBORN4 DMDCITZN DMDYRSUS DMDEDUC3
## [1,]    FALSE     TRUE    FALSE   FALSE    FALSE    FALSE     TRUE     TRUE
## [2,]    FALSE     TRUE    FALSE    TRUE    FALSE    FALSE     TRUE     TRUE
## [3,]    FALSE     TRUE    FALSE   FALSE    FALSE    FALSE     TRUE     TRUE
## [4,]    FALSE    FALSE     TRUE    TRUE    FALSE    FALSE     TRUE    FALSE
## [5,]    FALSE     TRUE    FALSE    TRUE    FALSE    FALSE     TRUE     TRUE
##      DMDEDUC2 DMDMARTL RIDEXPRG SIALANG SIAPROXY SIAINTRP FIALANG FIAPROXY
## [1,]    FALSE    FALSE     TRUE   FALSE    FALSE    FALSE   FALSE    FALSE
## [2,]    FALSE    FALSE     TRUE   FALSE    FALSE    FALSE   FALSE    FALSE
## [3,]    FALSE    FALSE     TRUE   FALSE    FALSE    FALSE   FALSE    FALSE
## [4,]     TRUE     TRUE     TRUE   FALSE    FALSE    FALSE   FALSE    FALSE
## [5,]    FALSE    FALSE     TRUE   FALSE    FALSE    FALSE   FALSE    FALSE
##      FIAINTRP MIALANG MIAPROXY MIAINTRP AIALANGA DMDHHSIZ DMDFMSIZ DMDHHSZA
## [1,]    FALSE   FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE
## [2,]    FALSE   FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE
## [3,]    FALSE   FALSE    FALSE    FALSE     TRUE    FALSE    FALSE    FALSE
## [4,]    FALSE   FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE
## [5,]    FALSE   FALSE    FALSE    FALSE     TRUE    FALSE    FALSE    FALSE
##      DMDHHSZB DMDHHSZE DMDHRGND DMDHRAGE DMDHRBR4 DMDHREDU DMDHRMAR DMDHSEDU
## [1,]    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE     TRUE
## [2,]    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE
## [3,]    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE
## [4,]    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE
## [5,]    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE
##      WTINT2YR WTMEC2YR SDMVPSU SDMVSTRA INDHHIN2 INDFMIN2 INDFMPIR
## [1,]    FALSE    FALSE   FALSE    FALSE    FALSE    FALSE    FALSE
## [2,]    FALSE    FALSE   FALSE    FALSE    FALSE    FALSE    FALSE
## [3,]    FALSE    FALSE   FALSE    FALSE    FALSE    FALSE    FALSE
## [4,]    FALSE    FALSE   FALSE    FALSE    FALSE    FALSE    FALSE
## [5,]    FALSE    FALSE   FALSE    FALSE    FALSE    FALSE    FALSE

To find the total number of missing values in a dataframe:

sum(is.na(demo))

## [1] 82092

To check if there is any missing values (NAs) in a column,for example, gender:

head(is.na(demo['RIAGENDR']),5)

##      RIAGENDR
## [1,]    FALSE
## [2,]    FALSE
## [3,]    FALSE
## [4,]    FALSE
## [5,]    FALSE

To find the total number of missing values in a column:

sum(is.na(demo['RIAGENDR']))

## [1] 0

To find the number of missing values for each column, use the summary() function:

summary(demo)

If there is a missing value in one cell, is.na() will return TRUE; if there is no missing value in one cell, it will return FALSE

For example, (1,1) (1st row 1st column) is FALSE - it means that (1,1) is not a missing value; (1,6) (1st row 6th column) is TRUE - it means that (1,6) is NA.

is.na() is the function we use to inspect any missing values in a dataframe - it is housed in the base package.

The arguments are as follows:

is.na( > NAME OF DATAFRAME

)

For example: is.na(demo)

DO QUESTIONS 1-3 OF THE QUIZ NOW

What package does the function nhanes() belong to?
Which code below is the correct one to print the last 10 rows of a dataframe called temp?
Suppose a cell contains NA. What is the output after perform is.na() on it?

5.4 Dataset Preparation

Now that we’ve successfully imported the two datasets and we’re ready to do further data manipulations. But before we proceed, there are three issues in the datasets:

The two datasets are Huge. Recall the dimensions of the two datasets in section 1, there are 10175 rows and 47 columns in the demo dataset and 9813 rows and 23 columns in the bpx dataset.
- For demostration purpose, this tutorial will only focus on parts of the two datasets.
  - For the demo dataset, we want to keep the following variables as our primary interests only:
    - SEQN (Respondent sequence number)
    - RIAGENDR (Gender)
    - RIDAGEYR (Age in years at screening)
    - RIDRETH3 (Race)
    - DMDEDUC2 (Education level - Adults 20+)
  - For the bpx dataset, we want to keep the following variables as our primary interests only:
    - SEQN (Respondent sequence number)
    - PEASCST1 (Blood Pressure Status)
    - PEASCTM1 (Blood Pressure Time in Seconds)
    - BPXSY1 (Systolic: Blood pres (1st rdg) mm Hg)
    - BPXDI1 (Diastolic: Blood pres (1st rdg) mm Hg)
  - For both datasets, we want to keep the top 5 rows only.
There is something odd about the demo dataset. For example,if you run demo alone, you will see that RIAGENDR (gender) is coded as 1 and 2. For ease of future use, we want to translate this 1 and 2 into male and female.
The variable names are long are ambiguous. For ease of further use, we want to rename the variables so that they can be easily understand.
- For the demo dataframe, we want to rename the variables in this way:
  - SEQN -> id
  - RIAGENDR -> gender
  - RIDAGEYR -> age
  - RIDRETH3 -> race
  - DMDEDUC2 -> race
- For the bpx dataframe, we want to rename the variables in this way:
  - SEQN -> id
  - PEASCST1 -> bp_status
  - PEASCTM1 -> bpt_sec
  - BPXSY1 -> systolic
  - BPXDI1 -> diastolic
- Note, there are many different ways of renaming as long as the new names are straight-forward. We will use the above rename strategy in this tutorial for the sake of consistency.

With the three issues listed above, here’s an example of how to resolve them in the demo dataset:

First, we need to translate the way of variable encoding using the nhanesTranslate() function:

demo_translate <- nhanesTranslate('DEMO_H',
                                  c('SEQN', # Respondent sequence number
                                    'RIAGENDR', # Gender
                                    'RIDAGEYR', # Age in years at screening
                                    'RIDRETH3', # Race
                                    'DMDEDUC2'), # Education level - Adults 20+
                                 data = demo)

## Warning in FUN(X[[i]], ...): No translation table is available for SEQN

## Translated columns: RIAGENDR RIDRETH3 DMDEDUC2

head(demo_translate,5)

##    SEQN SDDSRVYR RIDSTATR RIAGENDR RIDAGEYR RIDAGEMN RIDRETH1
## 1 73557        8        2     Male       69       NA        4
## 2 73558        8        2     Male       54       NA        3
## 3 73559        8        2     Male       72       NA        3
## 4 73560        8        2     Male        9       NA        3
## 5 73561        8        2   Female       73       NA        3
##             RIDRETH3 RIDEXMON RIDEXAGM DMQMILIZ DMQADFC DMDBORN4 DMDCITZN
## 1 Non-Hispanic Black        1       NA        1       1        1        1
## 2 Non-Hispanic White        1       NA        2      NA        1        1
## 3 Non-Hispanic White        2       NA        1       1        1        1
## 4 Non-Hispanic White        1      119       NA      NA        1        1
## 5 Non-Hispanic White        1       NA        2      NA        1        1
##   DMDYRSUS DMDEDUC3                         DMDEDUC2 DMDMARTL RIDEXPRG SIALANG
## 1       NA       NA High school graduate/GED or equi        4       NA       1
## 2       NA       NA High school graduate/GED or equi        1       NA       1
## 3       NA       NA        Some college or AA degree        1       NA       1
## 4       NA        3                             <NA>       NA       NA       1
## 5       NA       NA        College graduate or above        1       NA       1
##   SIAPROXY SIAINTRP FIALANG FIAPROXY FIAINTRP MIALANG MIAPROXY MIAINTRP
## 1        2        2       1        2        2       1        2        2
## 2        2        2       1        2        2       1        2        2
## 3        2        2       1        2        2       1        2        2
## 4        1        2       1        2        2       1        2        2
## 5        2        2       1        2        2       1        2        2
##   AIALANGA DMDHHSIZ DMDFMSIZ DMDHHSZA DMDHHSZB DMDHHSZE DMDHRGND DMDHRAGE
## 1        1        3        3        0        0        2        1       69
## 2        1        4        4        0        2        0        1       54
## 3       NA        2        2        0        0        2        1       72
## 4        1        4        4        0        2        0        1       33
## 5       NA        2        2        0        0        2        1       78
##   DMDHRBR4 DMDHREDU DMDHRMAR DMDHSEDU WTINT2YR WTMEC2YR SDMVPSU SDMVSTRA
## 1        1        3        4       NA 13281.24 13481.04       1      112
## 2        1        3        1        1 23682.06 24471.77       1      108
## 3        1        4        1        3 57214.80 57193.29       1      109
## 4        1        3        1        4 55201.18 55766.51       2      109
## 5        1        5        1        5 63709.67 65541.87       2      116
##   INDHHIN2 INDFMIN2 INDFMPIR
## 1        4        4     0.84
## 2        7        7     1.78
## 3       10       10     4.51
## 4        9        9     2.52
## 5       15       15     5.00

Second, we want to keep the variabls which we’re interested in only:

new_demo <- select(demo_translate,
                   c(SEQN,
                    RIAGENDR,
                    RIDAGEYR,
                    RIDRETH3,
                    DMDEDUC2))
head(new_demo,5)

##    SEQN RIAGENDR RIDAGEYR           RIDRETH3                         DMDEDUC2
## 1 73557     Male       69 Non-Hispanic Black High school graduate/GED or equi
## 2 73558     Male       54 Non-Hispanic White High school graduate/GED or equi
## 3 73559     Male       72 Non-Hispanic White        Some college or AA degree
## 4 73560     Male        9 Non-Hispanic White                             <NA>
## 5 73561   Female       73 Non-Hispanic White        College graduate or above

Third, we want to rename the variables using the rename() function:

new_demo <- rename(new_demo,
       id = SEQN, # Respondent sequence number
       gender = RIAGENDR, # Gender
       age = RIDAGEYR, # Age in years at screening
       race = RIDRETH3, # Race/Hispanic origin
       edu = DMDEDUC2, # Education level - Adults 20+
       )
head(new_demo,5)

##      id gender age               race                              edu
## 1 73557   Male  69 Non-Hispanic Black High school graduate/GED or equi
## 2 73558   Male  54 Non-Hispanic White High school graduate/GED or equi
## 3 73559   Male  72 Non-Hispanic White        Some college or AA degree
## 4 73560   Male   9 Non-Hispanic White                             <NA>
## 5 73561 Female  73 Non-Hispanic White        College graduate or above

Last, we want to keep the top 10 rows using the head() function and save the dataframe as final_demo:

final_demo <- head(new_demo,10)
final_demo

##       id gender age               race                              edu
## 1  73557   Male  69 Non-Hispanic Black High school graduate/GED or equi
## 2  73558   Male  54 Non-Hispanic White High school graduate/GED or equi
## 3  73559   Male  72 Non-Hispanic White        Some college or AA degree
## 4  73560   Male   9 Non-Hispanic White                             <NA>
## 5  73561 Female  73 Non-Hispanic White        College graduate or above
## 6  73562   Male  56   Mexican American        Some college or AA degree
## 7  73563   Male   0 Non-Hispanic White                             <NA>
## 8  73564 Female  61 Non-Hispanic White        College graduate or above
## 9  73565   Male  42     Other Hispanic High school graduate/GED or equi
## 10 73566 Female  56 Non-Hispanic White High school graduate/GED or equi

rename() is the function we use to rename the variables and return all variables - it is housed in the dplyr package. The arguments are as follows:

rename(Name of Data Frame, New Variable Name = Old Variable Name)

For example: rename(demo, id = SEQN)

select() is the function we use to only return the varibles we want - it is housed in the dplyr package. The arguments are as follows:

select(Name of Data Frame, Variables (can be written as a vector))

For example: select(demo, c(id,gender,age))

For example: select(demo, id:age))

Try it yourself 5.1

In the bpx dataframe, keep the following variables and top 5 rows only:

SEQN
PEASCST1
PEASCTM1
BPXSY1
BPXDI1

Try it yourself 5.2

rename the variables in the following way:

SEQN -> id
PEASCST1 -> bp_status
PEASCTM1 -> bpt_sec
BPXSY1 -> systolic
BPXDI1 -> diastolic

DO QUESTIONS 4 & 5 OF THE QUIZ NOW

Given the following code: select(dataset, A, B). What is the purpose of the code?
Which code below is the correct one to rename column A to B in a dataframe called dataset?

5.5 Filter

Sometimes, we want to focus on a subset of the dataset that satisifying some conditions for further analysis. In this case, we need to filter the dataset based on variables’ values and conditions.

For example, we use the following code to filter the observations that patients are 40 years old at screening.

filter(final_demo, age == 40)

## [1] id     gender age    race   edu   
## <0 rows> (or 0-length row.names)

We may also interested in the observations that patients are 40 or 41 years old at screening:

filter(final_demo, age == 40 | age == 41)

## [1] id     gender age    race   edu   
## <0 rows> (or 0-length row.names)

Another way of writing the code above is to use the %in% operator:

x %in% y is equivalent to the condition that the value of x is in one of the values of y

filter(final_demo, age %in% c(40, 41))

## [1] id     gender age    race   edu   
## <0 rows> (or 0-length row.names)

Functions debunked

filter() is the function we use to subset the dataset based on their values and conditions - it is housed in the dplyr package. The arguments are as follows:

filter(Name of Dataset, Condition 1, Condition 2, …, Condition n)

For example: filter(new_demo, gender == 1)

Note:

Pay attention to the difference between = and == > = is assignment operator

== is comparison operator

More comparison operators are <, <=,>, >=,!=.

Logical operators

Mutiple conditions can be combined using logical operators. Common logical operators are:

and is &

or is |

not is !

For example: filter(new_demo, age == 40 | age == 41)

5.5.0.1 Missing values

Note, we don’t have any missing values in the age column (can be checked using is.na(new_demo['age'])).

What if we have missing values and how does filter() treat missing values?

filter() only keeps the rows where the condition is TRUE and remove the rows where the condition is failed due to FALSE OR NA.

If you want to keep the NAs, you need to add the condition explicitly: is.na(VARIABLE_NAME) and use | to combine with other conditions.

For example: filter(new_demo, is.na(income) | income <= 50)

Try it yourself 5.3

In the demo dataframe, find all the records that:

The participant who is a male
The participant who is a male and is older tha 50 years old
The education level is missing

DO QUESTIONS 6-8 OF THE QUIZ NOW

Fill in the blank to answer the ‘Try it yourself’ sections:

filter(final_demo, gender ___)
filter(final_demo, gender == ‘Male’ ___ age > 50)
filter(final_demo, ___)

5.6 Re-order the Rows

Suppose we want to re-order the observations by certain variables, we can use the arrange() function in R.

First, let’s look at the top 5 rows in the demo dataframe:

final_demo

##       id gender age               race                              edu
## 1  73557   Male  69 Non-Hispanic Black High school graduate/GED or equi
## 2  73558   Male  54 Non-Hispanic White High school graduate/GED or equi
## 3  73559   Male  72 Non-Hispanic White        Some college or AA degree
## 4  73560   Male   9 Non-Hispanic White                             <NA>
## 5  73561 Female  73 Non-Hispanic White        College graduate or above
## 6  73562   Male  56   Mexican American        Some college or AA degree
## 7  73563   Male   0 Non-Hispanic White                             <NA>
## 8  73564 Female  61 Non-Hispanic White        College graduate or above
## 9  73565   Male  42     Other Hispanic High school graduate/GED or equi
## 10 73566 Female  56 Non-Hispanic White High school graduate/GED or equi

We notice that the column “age” doesn’t follow any order. We use the following code to re-order the dataframe by age:

arrange(final_demo, age)

##       id gender age               race                              edu
## 1  73563   Male   0 Non-Hispanic White                             <NA>
## 2  73560   Male   9 Non-Hispanic White                             <NA>
## 3  73565   Male  42     Other Hispanic High school graduate/GED or equi
## 4  73558   Male  54 Non-Hispanic White High school graduate/GED or equi
## 5  73562   Male  56   Mexican American        Some college or AA degree
## 6  73566 Female  56 Non-Hispanic White High school graduate/GED or equi
## 7  73564 Female  61 Non-Hispanic White        College graduate or above
## 8  73557   Male  69 Non-Hispanic Black High school graduate/GED or equi
## 9  73559   Male  72 Non-Hispanic White        Some college or AA degree
## 10 73561 Female  73 Non-Hispanic White        College graduate or above

If we want to change it to descending order, use desc():

arrange(final_demo, desc(age))

##       id gender age               race                              edu
## 1  73561 Female  73 Non-Hispanic White        College graduate or above
## 2  73559   Male  72 Non-Hispanic White        Some college or AA degree
## 3  73557   Male  69 Non-Hispanic Black High school graduate/GED or equi
## 4  73564 Female  61 Non-Hispanic White        College graduate or above
## 5  73562   Male  56   Mexican American        Some college or AA degree
## 6  73566 Female  56 Non-Hispanic White High school graduate/GED or equi
## 7  73558   Male  54 Non-Hispanic White High school graduate/GED or equi
## 8  73565   Male  42     Other Hispanic High school graduate/GED or equi
## 9  73560   Male   9 Non-Hispanic White                             <NA>
## 10 73563   Male   0 Non-Hispanic White                             <NA>

What if we want to change the order by multiple columns?

We can simply add multiple columns in the arguments! It’s also useful when there are ties in the values of one column and the subsquent columns are used to break the ties.

For example, re-order the observations by age,id:

arrange(final_demo,age,id)

##       id gender age               race                              edu
## 1  73563   Male   0 Non-Hispanic White                             <NA>
## 2  73560   Male   9 Non-Hispanic White                             <NA>
## 3  73565   Male  42     Other Hispanic High school graduate/GED or equi
## 4  73558   Male  54 Non-Hispanic White High school graduate/GED or equi
## 5  73562   Male  56   Mexican American        Some college or AA degree
## 6  73566 Female  56 Non-Hispanic White High school graduate/GED or equi
## 7  73564 Female  61 Non-Hispanic White        College graduate or above
## 8  73557   Male  69 Non-Hispanic Black High school graduate/GED or equi
## 9  73559   Male  72 Non-Hispanic White        Some college or AA degree
## 10 73561 Female  73 Non-Hispanic White        College graduate or above

Functions debunked

arrange() is the function we use to change the order of the observations by columns - it is housed in the dplyr package. The arguments are as follows:

arrange( > NAME OF DATAFRAME,

COLUMN 1

COLUMN 2

…

COLUMN n

)

For example: arrange(new_demo, gender)

Use desc(COLUMN_NAME) to reorder the dataframe by COLUMN_NAME in descending order.

5.6.0.1 Missing values

In the last example, we notice that there are missing values in column ‘income’ and ‘income_ratio.’

How does arrange() deal with missing value? All NAs will be retained at the end. Check the output above!

Try it yourself 5.4

Re-order the rows in the bpx dataset by Blood Pressure Time in Seconds (bpt_sec) in descending order:

DO QUESTION 9 OF THE QUIZ NOW

Fill in the blank to answer the ‘Try it yourself’ section:

___(final_bpx,___)

5.7 Add new variables

Suppose we want to gain more information about our observations and we want to constrcut a new variable based on the variables we have. The mutate() and transmute() in R helps us to add new variables into a dataframe.

For example, we want to add a new variable called born_year in the demo dataframe and born_year is calculated in this way: born_year = 2021- age:

mutate(final_demo,
       born_year = 2021 - age 
      )

##       id gender age               race                              edu
## 1  73557   Male  69 Non-Hispanic Black High school graduate/GED or equi
## 2  73558   Male  54 Non-Hispanic White High school graduate/GED or equi
## 3  73559   Male  72 Non-Hispanic White        Some college or AA degree
## 4  73560   Male   9 Non-Hispanic White                             <NA>
## 5  73561 Female  73 Non-Hispanic White        College graduate or above
## 6  73562   Male  56   Mexican American        Some college or AA degree
## 7  73563   Male   0 Non-Hispanic White                             <NA>
## 8  73564 Female  61 Non-Hispanic White        College graduate or above
## 9  73565   Male  42     Other Hispanic High school graduate/GED or equi
## 10 73566 Female  56 Non-Hispanic White High school graduate/GED or equi
##    born_year
## 1       1952
## 2       1967
## 3       1949
## 4       2012
## 5       1948
## 6       1965
## 7       2021
## 8       1960
## 9       1979
## 10      1965

If we want to keep the added variables only, use transmute() instead:

transmute(final_demo,
    born_year = 2021 - age
     )

##    born_year
## 1       1952
## 2       1967
## 3       1949
## 4       2012
## 5       1948
## 6       1965
## 7       2021
## 8       1960
## 9       1979
## 10      1965

Functions debunked

mutate() is the function we use to create new variables based on variables we have and add them to the original dataframe - it is housed in the dplyr package. The arguments are as follows:

mutate( > NAME OF DATAFRAME,

NEW_VARIABLE = FUNCTION OF EXISTING VARIBALES

… )

For example: mutate(new_demo, rescale_income = income/2)

transmute() is the function we use to create new variables based on variables we have and only keep the added variables - it is housed in the dplyr package. The arguments are as follows:

transmute( > NAME OF DATAFRAME,

NEW_VARIABLE = FUNCTION OF EXISTING VARIBALES

…

)

For example: transmute(new_demo, rescale_income = income/2)

You can find more details in how to use creation functions to create new variables in Chapter I. Explore Chapter 3.

Try it yourself 5.5

Create a new variable called called rescale_bpt_sec that records the Blood Pressure Time in miuntes. Keep both original and new variables.
Create rescale_bpt_sec in the same way above and only keep new variables.

Note: Try to avoid using select().

DO QUESTION 10 OF THE QUIZ NOW

Fill in the blank to answer both parts of the ‘Try it yourself’ sections.

___(final_bpx, rescale_bpt_sec = bpt_sec/60)

5.8 Summary Statistics and group_by

Information about the whole dataframe such as mean is very useful for data analysis. However, things could be very complex when aggregating multiple functions. We’ll start with the simple one: find a summary statistic from the whole dataset.

For example, find the average age in the whole demo dataframe:

summarize(final_demo,average_age = mean(age,na.rm = TRUE))

##   average_age
## 1        49.2

It gets more complex when we change the unit of analysis into groups, i.e, find summary statistics grouping by variables. To do so, we first convert the dataframe into a grouped dataframe using group_by() and then apply the summarize() to the grouped dataframe.

For example, find the average age in the demo dataframe per gender:

by_gender <- group_by(final_demo,gender)
summarize(by_gender, average_age = mean(age, na.rm = TRUE))

## # A tibble: 2 x 2
##   gender average_age
##   <fct>        <dbl>
## 1 Male          43.1
## 2 Female        63.3

Functions debunked

summarize() is the function we use to compute the summary statistics for the whole dataframe - it is housed in the dplyr package. The arguments are as follows:

summarize(Name of Data Frame, Name of Summary Statistic = Function())

For example: summarize(new_demo, mean(age,na.rm = TRUE))

group_by() is the function we use to create a grouped dataframe grouping by one or more variables - it is housed in the dplyr package. The arguments are as follows:

group_by(Name of Data Frame, Name of Variable)

For example: group_by(new_demo,age)

Note: if there is missing value in the variable, group_by() treats it as a new group

Try it yourself 5.6

Find the average age in the demo dataframe per education level

DO QUESTION 11 OF THE QUIZ NOW

Fill in the blanks to answer the ‘Try it yourself’ section:

by_edu <- group_by(___)
summarize(___, average_age = ___(age, na.rm = TRUE))

5.8.1 Group_by() extension

group_by() is also useful when conjuncting with filter() and mutate().

For example, return the observations which has more than 2 records in each education group.

by_edu <-group_by(final_demo,edu)
filter(by_edu,n() > 2)

## # A tibble: 4 x 5
## # Groups:   edu [1]
##   id         gender age        race               edu                           
##   <labelled> <fct>  <labelled> <fct>              <fct>                         
## 1 73557      Male   69         Non-Hispanic Black High school graduate/GED or e~
## 2 73558      Male   54         Non-Hispanic White High school graduate/GED or e~
## 3 73565      Male   42         Other Hispanic     High school graduate/GED or e~
## 4 73566      Female 56         Non-Hispanic White High school graduate/GED or e~

Here’s another example: using group_by() with mutate() to compute the difference in each age and the average mean in each gender group

by_gender <-group_by(final_demo,gender)
mutate(by_gender,
      diff_age = age - mean(age, na.rm = T))

## # A tibble: 10 x 6
## # Groups:   gender [2]
##    id        gender age       race           edu                      diff_age  
##    <labelle> <fct>  <labelle> <fct>          <fct>                    <labelled>
##  1 73557     Male   69        Non-Hispanic ~ High school graduate/GE~  25.857143
##  2 73558     Male   54        Non-Hispanic ~ High school graduate/GE~  10.857143
##  3 73559     Male   72        Non-Hispanic ~ Some college or AA degr~  28.857143
##  4 73560     Male    9        Non-Hispanic ~ <NA>                     -34.142857
##  5 73561     Female 73        Non-Hispanic ~ College graduate or abo~   9.666667
##  6 73562     Male   56        Mexican Ameri~ Some college or AA degr~  12.857143
##  7 73563     Male    0        Non-Hispanic ~ <NA>                     -43.142857
##  8 73564     Female 61        Non-Hispanic ~ College graduate or abo~  -2.333333
##  9 73565     Male   42        Other Hispanic High school graduate/GE~  -1.142857
## 10 73566     Female 56        Non-Hispanic ~ High school graduate/GE~  -7.333333

Functions debunked

n() is the function we use to count the number of observations in each group - it is housed in the dplyr package. It can only be used with the existence of summarize(), filter(), and mutate() and there is no argument in it.

Try it yourself 5.7

Return the observations which has more than 3 records in each gender group.

Try it yourself 5.8

Compute the difference in each age and the average mean in each education level group

DO QUESTIONS 12 & 13 OF THE QUIZ NOW

Fill in the blanks to answer the ‘Try it yourself’ sections:

by_gender <-group_by(final_demo,gender)
- filter(by_edu,___)
by_edu <-group_by(final_demo,edu)
- mutate(by_edu, ___)

5.9 Pipe

You may have noticed that when doing multiple-step operations, we need to assign the output to a new variable every time when a step is done. It becomes more annoying when there are more steps to do. The pipe operator %>% - which is housed in the magrittr package - saves us from create many unnecessary variables: it takes the output from one function as an input to the following function.

Here’s example without using pipe: we want to first keep the observations in the demo dataframe with an age greater than 40 and then create a new variable called born_year calculated by 2021 - age.

We need to do this in two steps.The first step is to filter the dataframe and save the output as a new variable temp:

temp <- filter(final_demo, age > 40)
temp

##      id gender age               race                              edu
## 1 73557   Male  69 Non-Hispanic Black High school graduate/GED or equi
## 2 73558   Male  54 Non-Hispanic White High school graduate/GED or equi
## 3 73559   Male  72 Non-Hispanic White        Some college or AA degree
## 4 73561 Female  73 Non-Hispanic White        College graduate or above
## 5 73562   Male  56   Mexican American        Some college or AA degree
## 6 73564 Female  61 Non-Hispanic White        College graduate or above
## 7 73565   Male  42     Other Hispanic High school graduate/GED or equi
## 8 73566 Female  56 Non-Hispanic White High school graduate/GED or equi

The second step is to create the new variable:

temp <- mutate(temp, born_year = 2021 - age)
temp

##      id gender age               race                              edu
## 1 73557   Male  69 Non-Hispanic Black High school graduate/GED or equi
## 2 73558   Male  54 Non-Hispanic White High school graduate/GED or equi
## 3 73559   Male  72 Non-Hispanic White        Some college or AA degree
## 4 73561 Female  73 Non-Hispanic White        College graduate or above
## 5 73562   Male  56   Mexican American        Some college or AA degree
## 6 73564 Female  61 Non-Hispanic White        College graduate or above
## 7 73565   Male  42     Other Hispanic High school graduate/GED or equi
## 8 73566 Female  56 Non-Hispanic White High school graduate/GED or equi
##   born_year
## 1      1952
## 2      1967
## 3      1949
## 4      1948
## 5      1965
## 6      1960
## 7      1979
## 8      1965

The intermediate variable temp can be avoided by using the pipe operatore:

final_demo %>%
  filter(age > 40) %>%
  mutate(born_year = 2021 - age)

##      id gender age               race                              edu
## 1 73557   Male  69 Non-Hispanic Black High school graduate/GED or equi
## 2 73558   Male  54 Non-Hispanic White High school graduate/GED or equi
## 3 73559   Male  72 Non-Hispanic White        Some college or AA degree
## 4 73561 Female  73 Non-Hispanic White        College graduate or above
## 5 73562   Male  56   Mexican American        Some college or AA degree
## 6 73564 Female  61 Non-Hispanic White        College graduate or above
## 7 73565   Male  42     Other Hispanic High school graduate/GED or equi
## 8 73566 Female  56 Non-Hispanic White High school graduate/GED or equi
##   born_year
## 1      1952
## 2      1967
## 3      1949
## 4      1948
## 5      1965
## 6      1960
## 7      1979
## 8      1965

Try it yourself 5.9

Re-write the following code using pipe operator

# temp <- filter(final_bpx, systolic > 120)
# temp <- mutate(temp, bpt_min = bpt_sec/60)
# temp

DO QUESTION 14 OF THE QUIZ NOW

Fill in the blanks to answer the’Try it yourself’ section:

___ %>%
filter(___,systolic > 120) %>%
mutate(___,bpt_min = bpt_sec/60)

5.10 Summary of dealing with missing values

Dealing with missing values can be complex. You need to be careful when using functions since different functions have different ways in dealing with missing values.

filter(): excludes missing values. If you do want to retain missing values, use is.na() explicitly.

arrange(): retains missing values and sorts them at the end

select(): retains missing values

mutate(): retains missing values

summarize(): retains missing values. If you want to remove missing values, set na.rm = TRUE in aggregation functions.

group_by(): treats missing values as a group

5.11 Alternatives to NHANESTranslate()

5.11.1 `case_when()`

case_when() is another useful function that we can use that will aid our data analysis. This function acts like mutate(), except it changes the actual values of the variable. Here is an example of how we can use case_when() and mutate() together to change values within a variable:

x <- c(97, 36, 55, 50, 49, 65, 46, 87, 100)

case_when(x < 50 ~ "Fail", 
          x > 65 ~ "Pass", 
          TRUE  ~ "Retake recommended")

## [1] "Pass"               "Fail"               "Retake recommended"
## [4] "Retake recommended" "Fail"               "Retake recommended"
## [7] "Fail"               "Pass"               "Pass"

In the example above, we have a vector of student grades. Using case_when(), we have translated the grades to “Fail,” “Pass,” and “Retake recommended” depending on the student grades. * Fail if students receive a grade of less than 50, * Pass if students receive a grade of greater than 65, * Retake recommended if students receive a grade between 50 and 65.

5.12 Translating NHANES using `case_when()`

Another general method for us to translate our NHANES data into conventional language is to use mutate() and case_when()! However, this method requires us to know what each numerical value of each variable means. For example, if we want to translate the RIDRETH3 values, then we need to check the Codebook and Frequencies for the following translations:

1: “Mexican American”
2: “Other Hispanic”
3: “Non-Hispanic White”
4: “Non-Hispanic Black”
6: “Non-Hispanic Asian”
7: “Other Race - Including Multi-Racial”
.: “Missing”

Knowing this, we can use mutate() and case_when() like so:

translated_demo <- demo %>% 
    mutate(Race = case_when(
          RIDRETH3 == 1 ~ "Mexican American",
          RIDRETH3 == 2 ~ "Other Hispanic",
          RIDRETH3 == 3 ~ "Non-Hispanic White",
          RIDRETH3 == 4 ~ "Non-Hispanic Black",
          RIDRETH3 == 6 ~ "Non-Hispanic Asian",
          RIDRETH3 == 7 ~ "Other Race - Including Multi-Racial",
          RIDRETH3 == "." ~ "Missing"
      )) %>%
    select(ID = SEQN, Race)

head(translated_demo, 10)

##       ID               Race
## 1  73557 Non-Hispanic Black
## 2  73558 Non-Hispanic White
## 3  73559 Non-Hispanic White
## 4  73560 Non-Hispanic White
## 5  73561 Non-Hispanic White
## 6  73562   Mexican American
## 7  73563 Non-Hispanic White
## 8  73564 Non-Hispanic White
## 9  73565     Other Hispanic
## 10 73566 Non-Hispanic White

But what if we want to have less categories or want to combine some of the categories? What do we do then?

case_when() is still the way to go! In this case, our codes should look like this:

less_categories <- demo %>% 
    mutate(Race = case_when(
          RIDRETH3 == 1 ~ "Hispanic",
          RIDRETH3 == 2 ~ "Hispanic",
          RIDRETH3 == 3 ~ "White",
          RIDRETH3 == 4 ~ "Black",
          RIDRETH3 == 6 ~ "Asian",
          RIDRETH3 == 7 ~ "Other",
          RIDRETH3 == "." ~ "Missing"
      )) %>%
    select(ID = SEQN, Race)

head(less_categories, 10)

##       ID     Race
## 1  73557    Black
## 2  73558    White
## 3  73559    White
## 4  73560    White
## 5  73561    White
## 6  73562 Hispanic
## 7  73563    White
## 8  73564    White
## 9  73565 Hispanic
## 10 73566    White

Functions debunked

case_when() is a function that we use to change or translate the values of our variables into something else that is still meaningful. The arguments are as follows:

case_when(A Variable == Value as it is written in the original dataset ~ “New/Translated Value” )

Note that if none of the cases match, then R will automatically regard it as a missing (NA) value.

For example: case_when(x < 50 ~ "Fail", x > 65 ~ "Pass", TRUE ~ "Retake recommended")

5.13 `recode()` from dplyr

Another alternative is recode(). This function works the same way as case_when(), except you only need to identify the variable once like so:

# demo_translate2 <- demo %>%
#     mutate(Race = dplyr::recode(RIDRETH3,
#            `1` = "Hispanic",
#            `2` = "Hispanic",
#            `3` = "White",
#            `4` = "Black",
#            `6` = "Asian",
#            `7` = "Other",
#           .default = "Missing"
#     )) %>%
#     select(ID = SEQN, Race)

# head(demo_translate2, 10)

5.14 `recode()` from car

Another option is to use recode() from the car package. This function is mostly used for translating numerical data into more meaningful character data or strings.

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

While looking at the codes below, take note of the different types of quotation marks: single (’’) or double ("") and the semi-colon (;) as the argument separator. Note also how we can use 1:2 instead of separating them into 1 and 2 like what we did with earlier functions.

demo_translate3 <- demo %>%
    mutate(Race = car::recode(RIDRETH3,
        "1:2 = 'Hispanic';
         3 = 'White';
         4 = 'Black';
         6 = 'Asian';
         7 = 'Other';
         else = 'Missing'"
    )) %>%
    select(ID = SEQN, Race)

head(demo_translate3, 10)

##       ID     Race
## 1  73557    Black
## 2  73558    White
## 3  73559    White
## 4  73560    White
## 5  73561    White
## 6  73562 Hispanic
## 7  73563    White
## 8  73564    White
## 9  73565 Hispanic
## 10 73566    White

5.15 Summary and Takeaways

By the end of this tutorial, you should be familiar with the dplyr package and be able to do basic data wrangling by yourself.

For more study materials on the dplyr package, check out this textbook.

Chapter 5 Data Analysis with dplyr

5.1 Instructions

5.2 Learning Objectives

5.3 Set up

5.3.1 Install and load packages

5.3.2 Set working directory

5.3.3 Import dataset

Functions debunked

5.3.4 Explore our dataset

5.3.4.1 is.na()

DO QUESTIONS 1-3 OF THE QUIZ NOW

5.4 Dataset Preparation

DO QUESTIONS 4 & 5 OF THE QUIZ NOW

5.5 Filter

Functions debunked

5.5.0.1 Missing values

DO QUESTIONS 6-8 OF THE QUIZ NOW

5.6 Re-order the Rows

Functions debunked

5.6.0.1 Missing values

DO QUESTION 9 OF THE QUIZ NOW

5.7 Add new variables

Functions debunked

DO QUESTION 10 OF THE QUIZ NOW

5.8 Summary Statistics and group_by

Functions debunked

DO QUESTION 11 OF THE QUIZ NOW

5.8.1 Group_by() extension

Functions debunked

DO QUESTIONS 12 & 13 OF THE QUIZ NOW

5.9 Pipe

DO QUESTION 14 OF THE QUIZ NOW

5.10 Summary of dealing with missing values

5.11 Alternatives to NHANESTranslate()

5.11.1 case_when()

5.12 Translating NHANES using case_when()

Functions debunked

5.13 recode() from dplyr

5.14 recode() from car

5.15 Summary and Takeaways

5.11.1 `case_when()`

5.12 Translating NHANES using `case_when()`

5.13 `recode()` from dplyr

5.14 `recode()` from car