Data manipulation

Data manipulation is a foundational skill for data analysis. This guide introduces common methods for subsetting datasets, handling variable types, creating summary tables, and dealing with missing values using R.

Load dataset

Understanding the dataset’s structure is the first step in data manipulation. Here, we’re using the mpg dataset, which provides information on various car models:

mpg <- read.csv("Data/wrangling/mpg.csv", header = TRUE)

Subset

Often, you’ll need to subset your data for analysis. Here, we’ll explore different methods to both drop unwanted variables and keep desired observations.

Drop variables

Sometimes, only part of the variables will be used in your analysis. Therefore, you may want to drop the variables you do not need. There are multiple ways to drop variables from a dataset. Below are two examples without using any package and using the dplyr package.

Tip

Option 1: No package needed

dataset.name[, c(columns_names_you_want_to_KEEP)]

Say, we want to keep only three variables in the mpg dataset: manufacturer, model and cyl. For Option 1 (without package), we can use the following R codes to keep these three variables:

mpg1 <- mpg[, c("manufacturer", "model", "cyl")]
head(mpg1)
ABCDEFGHIJ0123456789
 
 
manufacturer
<chr>
model
<chr>
cyl
<int>
1audia44
2audia44
3audia44
4audia44
5audia46
6audia46

Here mpg1 is a new dataset containing only three variables (manufacturer, model and cyl).

Tip

Option 2: use select in dplyr

select(dataset.name, c(columns_names_you_want_to_KEEP))

For Option 2, the dplyr package offers the select function, which provides a more intuitive way to subset data.

mpg2 <- select(mpg, c("manufacturer", "model", "cyl"))
head(mpg2)
ABCDEFGHIJ0123456789
 
 
manufacturer
<chr>
model
<chr>
cyl
<int>
1audia44
2audia44
3audia44
4audia44
5audia46
6audia46

We can also exclude any variables from the dataset by using the minus (-) sign with the select function. For example, we we want to drop trans, drv, and cty from the mpg dataset, we can use the following codes:

mpg3 <- select(mpg, -c("trans", "drv", "cty"))
head(mpg3)
ABCDEFGHIJ0123456789
 
 
manufacturer
<chr>
model
<chr>
displ
<dbl>
year
<int>
cyl
<int>
hwy
<int>
fl
<chr>
class
<chr>
1audia41.81999429pcompact
2audia41.81999429pcompact
3audia42.02008431pcompact
4audia42.02008430pcompact
5audia42.81999626pcompact
6audia42.81999626pcompact

This mpg3 is a new dataset from mpg after dropping three variables (trans, drv, and cty).

Keep observations

It often happens that we only want to investigate a subset of a population which only requires a subset of our dataset. In this case, we need to subset the dataset to meet certain requirements. Again, there are multiple ways to do this task. Below is an example without a package and with the dplyr package:

Tip

Option 1: No package needed

dataset.name[rows_you_want_to_KEEP, ]

Tip

Option 2: No package needed

subset(dataset.name, rows_you_want_to_KEEP)

Tip

Option 3: use filter in dplyr

filter(dataset.name, rows_you_want_to_KEEP)

We can use the logical tests for the rows you want to keep or drop.

Tip

Common logical tests are:

Syntax Meaning
X <(=) Y Smaller (equal) than
X >(=) Y Larger (equal) than
X == Y Equal to
X != Y Not equal to
is.na(X) is NA/missing?

Say, we want to keep the observations for which cars are manufactured in 2008. We can use the following R codes to do it:

# Option 1
mpg4 <- mpg[mpg$year == "2008",]
head(mpg4)
ABCDEFGHIJ0123456789
 
 
manufacturer
<chr>
model
<chr>
displ
<dbl>
year
<int>
cyl
<int>
trans
<chr>
drv
<chr>
cty
<int>
hwy
<int>
3audia42.020084manual(m6)f2031
4audia42.020084auto(av)f2130
7audia43.120086auto(av)f1827
10audia4 quattro2.020084manual(m6)42028
11audia4 quattro2.020084auto(s6)41927
14audia4 quattro3.120086auto(s6)41725

The following codes with the subset and filter function will do the same:

# Option 2
mpg5 <- subset(mpg, year == "2008")
head(mpg5)
ABCDEFGHIJ0123456789
 
 
manufacturer
<chr>
model
<chr>
displ
<dbl>
year
<int>
cyl
<int>
trans
<chr>
drv
<chr>
cty
<int>
hwy
<int>
3audia42.020084manual(m6)f2031
4audia42.020084auto(av)f2130
7audia43.120086auto(av)f1827
10audia4 quattro2.020084manual(m6)42028
11audia4 quattro2.020084auto(s6)41927
14audia4 quattro3.120086auto(s6)41725
# Option 3
mpg6 <- filter(mpg, year == "2008") 
head(mpg6)
ABCDEFGHIJ0123456789
 
 
manufacturer
<chr>
model
<chr>
displ
<dbl>
year
<int>
cyl
<int>
trans
<chr>
drv
<chr>
cty
<int>
hwy
<int>
1audia42.020084manual(m6)f2031
2audia42.020084auto(av)f2130
3audia43.120086auto(av)f1827
4audia4 quattro2.020084manual(m6)42028
5audia4 quattro2.020084auto(s6)41927
6audia4 quattro3.120086auto(s6)41725

The filter function can also work when you have multiple criteria (i.e., multiple logical tests) to satisfy. Here, we need Boolean operators to connect different logical tests.

Tip

Common boolean operators are:

Syntax Meaning
& and
| or
! not
== equals to
!= not equal to
> greater than
< less than
>= greater than or equal to
<= less than or equal to

Say, we want to keep the observations for 6 and 8 cylinders (cyl) and engine displacement (displ) greater than or equal to 4 litres. We can use the following codes to do the task:

mpg7 <- filter(mpg, cyl %in% c("6","8") & displ >= 4)
head(mpg7)
ABCDEFGHIJ0123456789
 
 
manufacturer
<chr>
model
<chr>
displ
<dbl>
year
<int>
cyl
<int>
trans
<chr>
drv
<chr>
cty
<int>
hwy
<int>
1audia6 quattro4.220088auto(s6)41623
2chevroletc1500 suburban 2wd5.320088auto(l4)r1420
3chevroletc1500 suburban 2wd5.320088auto(l4)r1115
4chevroletc1500 suburban 2wd5.320088auto(l4)r1420
5chevroletc1500 suburban 2wd5.719998auto(l4)r1317
6chevroletc1500 suburban 2wd6.020088auto(l4)r1217

The %in% operator is used to determine whether the values of the first argument are present in the second argument.

Handling Variable Types

Tip

Most common types of variable in R are

  • numbers,
  • factors and
  • strings(or character).

Understanding and manipulating these types are crucial for data analysis.

  1. Identifying Variable Type

When we analyze the data, we usually just deal with numbers and factors. If there are variables are strings, we could convert them to factors using as.factors(variable.name)

mode(mpg$trans)
#> [1] "character"
str(mpg$trans)
#>  chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" "auto(l5)" ...
  1. Converting Characters to Factors

Sometimes, it’s necessary to treat text data as categorical by converting them into factors. as.numeric() converts other types of variables to numbers. For a factor variable, we usually we want to access the categories (or levels) it has. We can use a build-in function to explore: levels(variable.name)

# no levels for character
levels(mpg$trans)
#> NULL
## Ex check how many different trans the dataset has
mpg$trans <- as.factor(mpg$trans)
levels(mpg$trans)
#>  [1] "auto(av)"   "auto(l3)"   "auto(l4)"   "auto(l5)"   "auto(l6)"  
#>  [6] "auto(s4)"   "auto(s5)"   "auto(s6)"   "manual(m5)" "manual(m6)"

The levels usually will be ordered alphabetically. The first level is called “baseline”. However, the users may/may not want to keep this baseline and want to relevel/change the reference group. We can do it using the relevel function:

relevel(variable.name, ref=)

mpg$trans <- relevel(mpg$trans, ref = "auto(s6)")
levels(mpg$trans)
#>  [1] "auto(s6)"   "auto(av)"   "auto(l3)"   "auto(l4)"   "auto(l5)"  
#>  [6] "auto(l6)"   "auto(s4)"   "auto(s5)"   "manual(m5)" "manual(m6)"
nlevels(mpg$trans)
#> [1] 10

The factor function can also be used to combine multiple factors into one factor.

## EX re-group trans to "auto" and "manual"
levels(mpg$trans) <- list(auto = c("auto(av)", "auto(l3)", "auto(l4)", "auto(l5)", "auto(l6)", 
                                   "auto(s4)", "auto(s5)", "auto(s6)"), 
                          manual = c("manual(m5)", "manual(m6)"))
levels(mpg$trans)
#> [1] "auto"   "manual"

You can also change the order of all factors using the following code: factor(variable.name, levels = c(“new order”))

## EX. Change the order of trans to manual
mpg$trans <- factor(mpg$trans, levels = c("manual", "auto"))
levels(mpg$trans)
#> [1] "manual" "auto"

In R, the use of factors with multiple levels is primarily a memory optimization strategy. While users may not directly see this, R assigns internal numerical identifiers to each level, which is a more memory-efficient way of handling such data. Unlike some other software packages that generate multiple dummy variables to represent a single variable, R’s approach is generally more resource-efficient.

  1. Converting back from Factors Characters

You can also convert factor back to character using the as.character function.

# Convert factor back to character
mpg$trans <- as.character(mpg$trans)
levels(mpg$trans) # no levels for character
#> NULL

Convert continuous variables to categorical variables

Tip

ifelse, cut, recode all are helpful functions to convert numerical variables to categorical variables.

Let’s see the summary of the cty variable first.

summary(mpg$cty)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    9.00   14.00   17.00   16.86   19.00   35.00

say, we may want to change continuous ‘cty’ into groups 0-14, 15-18, and 18-40. Below is an example with the cut function.

## EX. change the cty into two categories (0,14], (14,18] and (18,40]
mpg$cty.num <- cut(mpg$cty, c(0, 14, 18, 40), right = TRUE)
table(mpg$cty.num)
#> 
#>  (0,14] (14,18] (18,40] 
#>      73      85      76
## Try this: do you see a difference?: [0,14), [14,18) and [18,40)
mpg$cty.num2 <- cut(mpg$cty, c(0, 14, 18, 40), right = FALSE)
table(mpg$cty.num2)
#> 
#>  [0,14) [14,18) [18,40) 
#>      54      78     102

] stands for closed interval, i.e., right = TRUE. On the other hand, ) means open interval. Hence, there will be a huge difference when setting right = TRUE vs. right = FALSE

Missing value

Tip

Incomplete datasets can distort analysis. Identifying and managing these missing values is thus crucial.

We can check how many missing values we have by: table(is.na(variable.name))

Let’s us check whether the cty variable contains any missing values:

table(is.na(mpg$cty))
#> 
#> FALSE 
#>   234

If you want to return all non-missing values, i.e., complete case values: na.omit(variable.name). For more extensive methods on handling missing values, see subsequent tutorials.