Importing dataset
Introduction to Data Importing
Before analyzing data in R, one of the first steps you’ll typically undertake is importing your dataset. R provides numerous methods to do this, depending on the format of your dataset.
Datasets come in a variety of file formats, with .csv (Comma-Separated Values) and .txt (Text file) being among the most common. While R’s interface offers manual ways to load these datasets, knowing how to code this step ensures better reproducibility and automation.
Importing .txt files
A .txt data file can be imported using the read.table function. As an example, suppose you have a dataset named grade at the specified path.
Let’s briefly glance at the file without concerning ourselves with its formatting.
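One way to glance at a text file without parsing it is readLines(), which returns the raw lines as character strings. The sketch below is only an illustration: the file's real path and contents are not shown here, so it writes a small hypothetical stand-in (the columns name and score are invented) to a temporary file first.

```r
# Hypothetical stand-in for the grade file; the real path and contents are assumptions
grade_path <- tempfile(fileext = ".txt")
writeLines(c("name score", "Ann 90", "Bob 85", "Cara 78"), grade_path)

# Read the first three lines exactly as stored, without interpreting columns
raw_lines <- readLines(grade_path, n = 3)
print(raw_lines)
```

With a real file, you would pass its path to readLines() instead of the temporary one.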
Using the read.table function, you can load this dataset in R properly. It’s important to specify header = TRUE if the first row of your dataset contains variable names.
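A minimal sketch of this step follows. Because the actual grade file and its path are not shown, the code builds a small hypothetical file (with invented columns name and score) in a temporary location and reads it back, once with header = TRUE and once with header = FALSE for comparison.

```r
# Hypothetical stand-in for the grade file; the real path and contents are assumptions
grade_path <- tempfile(fileext = ".txt")
writeLines(c("name score", "Ann 90", "Bob 85", "Cara 78"), grade_path)

# header = TRUE: the first row supplies the variable names
grade <- read.table(grade_path, header = TRUE)
str(grade)

# header = FALSE: the first row is treated as data, so columns get the
# default names V1, V2 and the score column is no longer numeric
grade_no_header <- read.table(grade_path, header = FALSE)
str(grade_no_header)
```

Comparing the two results shows why the header argument matters: with the wrong setting, the variable names end up as a data row and numeric columns are read as text.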
Tip: Always ensure the header argument matches the structure of your dataset. If the first row of your dataset contains variable names, set header = TRUE.
Importing .csv files
Similarly, .csv files can be loaded using the read.csv function. Here’s how you can load a .csv dataset named mpg:
## Read a csv dataset
mpg <- read.csv("Data/wrangling/mpg.csv", header = TRUE)
# Display the first few rows of the dataset
head(mpg)
While we’ve discussed two popular data formats, R can handle a plethora of other formats. For further details, refer to Quick-R (2023). Notably, some datasets come built-in with R packages, like the mpg dataset in the ggplot2 package. To load such a dataset:
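One common way to do this is the data() function with its package argument. The sketch below assumes the ggplot2 package is already installed; the guard around the call is only there so the snippet does not fail on a machine without it.

```r
# Assumes ggplot2 is installed; the guard skips the example otherwise
if (requireNamespace("ggplot2", quietly = TRUE)) {
  # Copy the mpg dataset from ggplot2 into the current workspace
  data(mpg, package = "ggplot2")

  # Display the first few rows of the dataset
  print(head(mpg))
} else {
  message("ggplot2 is not installed; run install.packages(\"ggplot2\") first")
}
```

Alternatively, after library(ggplot2) the dataset is available directly by the name mpg.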
To understand more about the variables and the dataset’s structure, you can consult the documentation:
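As a sketch, again assuming ggplot2 is installed, the help page for mpg can be looked up with help(). In an interactive session the shorthand ?ggplot2::mpg does the same thing.

```r
# Assumes ggplot2 is installed; the guard skips the example otherwise
if (requireNamespace("ggplot2", quietly = TRUE)) {
  # Look up the documentation for the mpg dataset in ggplot2
  mpg_help <- help("mpg", package = "ggplot2")

  # Printing the result displays the help page
  print(mpg_help)
}
```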
Data Screening and Understanding Your Dataset
dim(), nrow(), ncol(), and str() are incredibly handy functions when initially exploring your dataset.
Once your data is in R, the next logical step is to get familiar with it. Knowing the dimensions of your dataset, types of variables, and the first few entries can give you a quick sense of what you’re dealing with.
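The short sketch below shows the dimension functions in action. To keep it runnable without extra packages it uses mtcars, a dataset that ships with base R, as a stand-in; the same calls apply unchanged to mpg or any other data frame.

```r
# mtcars comes with base R's datasets package and stands in for mpg here
dims <- dim(mtcars)      # number of rows, then number of columns
n_rows <- nrow(mtcars)   # rows only
n_cols <- ncol(mtcars)   # columns only

print(dims)
#> [1] 32 11

# Display the first three rows to see what the entries look like
print(head(mtcars, 3))
```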
For instance, str (short for structure) is a concise way to display information about your data. It reveals the type of each variable, the first few entries, and the total number of observations:
str(mpg)
#> tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
#> $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
#> $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
#> $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
#> $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
#> $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
#> $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
#> $ drv : chr [1:234] "f" "f" "f" "f" ...
#> $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
#> $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
#> $ fl : chr [1:234] "p" "p" "p" "p" ...
#> $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
In summary, becoming proficient in data importing and initial screening is a fundamental step in any data analysis process in R. It ensures that subsequent stages of data manipulation and analysis are based on a clear understanding of the dataset at hand.
Video content (optional)
For those who prefer a video walkthrough, feel free to watch the video below, which offers a description of an earlier version of the above content.