Data manipulation
Data manipulation is a foundational skill for data analysis. This guide introduces common methods for subsetting datasets, handling variable types, creating summary tables, and dealing with missing values using R.
Load dataset
Understanding the dataset’s structure is the first step in data manipulation. Here, we’re using the mpg
dataset, which provides information on various car models:
Subset
Often, you’ll need to subset your data for analysis. Here, we’ll explore different methods to both drop unwanted variables and keep desired observations.
Drop variables
Sometimes, only part of the variables will be used in your analysis. Therefore, you may want to drop the variables you do not need. There are multiple ways to drop variables from a dataset. Below are two examples without using any package and using the dplyr
package.
Option 1: No package needed
dataset.name[, c(columns_names_you_want_to_KEEP)]
Say, we want to keep only three variables in the mpg dataset: manufacturer, model and cyl. For Option 1 (without package), we can use the following R codes to keep these three variables:
Here mpg1
is a new dataset containing only three variables (manufacturer, model and cyl).
Option 2: use select in dplyr
select(dataset.name, c(columns_names_you_want_to_KEEP))
For Option 2, the dplyr
package offers the select
function, which provides a more intuitive way to subset data.
We can also exclude any variables from the dataset by using the minus (-) sign with the select
function. For example, we we want to drop trans
, drv
, and cty
from the mpg
dataset, we can use the following codes:
This mpg3
is a new dataset from mpg
after dropping three variables (trans, drv, and cty).
Keep observations
It often happens that we only want to investigate a subset of a population which only requires a subset of our dataset. In this case, we need to subset the dataset to meet certain requirements. Again, there are multiple ways to do this task. Below is an example without a package and with the dplyr
package:
Option 1: No package needed
dataset.name[rows_you_want_to_KEEP, ]
Option 2: No package needed
subset(dataset.name, rows_you_want_to_KEEP)
Option 3: use filter in dplyr
filter(dataset.name, rows_you_want_to_KEEP)
We can use the logical tests for the rows you want to keep or drop.
Common logical tests are:
Syntax | Meaning |
---|---|
X <(=) Y | Smaller (equal) than |
X >(=) Y | Larger (equal) than |
X == Y | Equal to |
X != Y | Not equal to |
is.na(X) | is NA/missing? |
Say, we want to keep the observations for which cars are manufactured in 2008. We can use the following R codes to do it:
The following codes with the subset
and filter
function will do the same:
The filter function can also work when you have multiple criteria (i.e., multiple logical tests) to satisfy. Here, we need Boolean operators to connect different logical tests.
Common boolean operators are:
Syntax | Meaning |
---|---|
& | and |
| | or |
! | not |
== | equals to |
!= | not equal to |
> | greater than |
< | less than |
>= | greater than or equal to |
<= | less than or equal to |
Say, we want to keep the observations for 6 and 8 cylinders (cyl) and engine displacement (displ) greater than or equal to 4 litres. We can use the following codes to do the task:
The %in%
operator is used to determine whether the values of the first argument are present in the second argument.
Handling Variable Types
Most common types of variable in R are
- numbers,
- factors and
- strings(or character).
Understanding and manipulating these types are crucial for data analysis.
- Identifying Variable Type
When we analyze the data, we usually just deal with numbers and factors. If there are variables are strings, we could convert them to factors using as.factors(variable.name)
- Converting Characters to Factors
Sometimes, it’s necessary to treat text data as categorical by converting them into factors. as.numeric() converts other types of variables to numbers. For a factor variable, we usually we want to access the categories (or levels) it has. We can use a build-in function to explore: levels(variable.name)
The levels usually will be ordered alphabetically. The first level is called “baseline”. However, the users may/may not want to keep this baseline and want to relevel/change the reference group. We can do it using the relevel
function:
relevel(variable.name, ref=)
The factor
function can also be used to combine multiple factors into one factor.
You can also change the order of all factors using the following code: factor(variable.name, levels = c(“new order”))
In R, the use of factors with multiple levels is primarily a memory optimization strategy. While users may not directly see this, R assigns internal numerical identifiers to each level, which is a more memory-efficient way of handling such data. Unlike some other software packages that generate multiple dummy variables to represent a single variable, R’s approach is generally more resource-efficient.
- Converting back from Factors Characters
You can also convert factor back to character using the as.character function.
Convert continuous variables to categorical variables
ifelse
, cut
, recode
all are helpful functions to convert numerical variables to categorical variables.
Let’s see the summary of the cty variable first.
say, we may want to change continuous ‘cty’ into groups 0-14, 15-18, and 18-40. Below is an example with the cut
function.
]
stands for closed interval, i.e., right = TRUE. On the other hand, )
means open interval. Hence, there will be a huge difference when setting right = TRUE vs. right = FALSE
Missing value
Incomplete datasets can distort analysis. Identifying and managing these missing values is thus crucial.
We can check how many missing values we have by: table(is.na(variable.name))
Let’s us check whether the cty variable contains any missing values:
If you want to return all non-missing values, i.e., complete case values: na.omit(variable.name). For more extensive methods on handling missing values, see subsequent tutorials.