Importing dataset

Introduction to Data Importing

Before analyzing data in R, one of the first steps you’ll typically undertake is importing your dataset. R provides numerous methods to do this, depending on the format of your dataset.

Datasets come in a variety of file formats, with .csv (comma-separated values) and .txt (plain text) being among the most common. While R’s interface offers manual ways to load these datasets, writing the import step as code makes your analysis reproducible and easier to automate.

Importing .txt files

A .txt data file can be imported using the read.table function. As an example, suppose you have a dataset named grade stored at Data/wrangling/grade.txt.

Let’s first look at the raw contents of the file, before parsing it into a data frame.

# Read and print the content of the TXT file
content <- readLines("Data/wrangling/grade.txt")
cat(content, sep = "\n")
#> Studyid Grade Sex
#> 1    90   M
#> 2    85   F
#> 10    75   F
#> 15    90   M
#> 50    65   M

Using the read.table function, you can load this dataset into R as a data frame. Set header = TRUE when the first row of the file contains variable names. If header is left at FALSE, read.table treats those names as the first row of data.

Tip: Always make sure the header argument matches the structure of your file.

# Read a text dataset
grade <- read.table("Data/wrangling/grade.txt", header = TRUE)

# Display the first few rows of the dataset
head(grade)
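
To see what the header argument changes, here is a minimal sketch that does not depend on the grade.txt file. It writes a few lines to a temporary file (the contents mirror the preview above; the object names tmp, with_header, and no_header are illustrative) and reads them back both ways.

```r
# Write a small whitespace-delimited file to a temporary location.
# The rows below mirror the grade.txt preview; they are for illustration only.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Studyid Grade Sex",
             "1 90 M",
             "2 85 F",
             "10 75 F"), tmp)

# header = TRUE: the first row supplies the column names
with_header <- read.table(tmp, header = TRUE)
names(with_header)
#> [1] "Studyid" "Grade"   "Sex"
nrow(with_header)
#> [1] 3

# header = FALSE: the first row is read as data and columns get default names
no_header <- read.table(tmp, header = FALSE)
names(no_header)
#> [1] "V1" "V2" "V3"
nrow(no_header)
#> [1] 4
```

With header = FALSE, the text "Grade" ends up in the same column as the numbers, so that column can no longer be read as numeric. This is a common sign that the header argument does not match the file.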

Importing .csv files

Similarly, .csv files can be loaded using the read.csv function. Here’s how you can load a .csv dataset named mpg (header = TRUE is already the default for read.csv and is written out here only for clarity):

# Read a csv dataset
mpg <- read.csv("Data/wrangling/mpg.csv", header = TRUE)
# Display the first few rows of the dataset
head(mpg)
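
As a rough illustration of how the two functions relate, read.csv behaves like read.table with a comma separator and header = TRUE preset. The sketch below uses a small temporary file with made-up rows (tmp_csv, via_csv, and via_table are illustrative names, not part of the mpg example) and checks that both calls give the same data frame.

```r
# Write a tiny comma-separated file to a temporary location.
# These rows are invented for illustration and are not the real mpg.csv.
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("model,cty,hwy",
             "a4,18,29",
             "a4,21,29"), tmp_csv)

# read.csv presets header = TRUE and sep = ","
via_csv <- read.csv(tmp_csv)

# The same file read with read.table, spelling those arguments out
via_table <- read.table(tmp_csv, header = TRUE, sep = ",")

# Compare the two results
all.equal(via_csv, via_table)
```

The two functions differ in a few other defaults (for example, how quotes and comment characters are handled), so the results can diverge on messier files.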

While we’ve discussed two popular data formats, R can read many other formats. For further details, refer to Quick-R (2023). Some datasets also ship with R packages, like the mpg dataset in the ggplot2 package. To load such a dataset (note that this replaces any existing object named mpg in your workspace, such as the one created by read.csv above):

data(mpg, package = "ggplot2")
head(mpg)

To understand more about the variables and the dataset’s structure, you can consult the documentation:

?mpg

Data Screening and Understanding Your Dataset

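dim(), nrow(), ncol(), and str() are handy functions for a first look at a dataset: dim() returns the number of rows and columns, nrow() and ncol() return each count on its own, and str() summarizes the structure.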

Once your data is in R, the next logical step is to get familiar with it. Knowing the dimensions of your dataset, types of variables, and the first few entries can give you a quick sense of what you’re dealing with.

For instance, str (short for structure) is a concise way to display information about your data. It reveals the type of each variable, the first few entries, and the total number of observations:

str(mpg)
#> tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
#>  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
#>  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
#>  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
#>  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
#>  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
#>  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
#>  $ drv         : chr [1:234] "f" "f" "f" "f" ...
#>  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
#>  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
#>  $ fl          : chr [1:234] "p" "p" "p" "p" ...
#>  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...
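
The dimension functions mentioned above can be sketched in the same way. To keep the example runnable without ggplot2, it uses mtcars, a dataset that ships with base R; this is a stand-in chosen for illustration, and the same calls apply to mpg.

```r
# mtcars is a built-in dataset, so no package needs to be loaded
dim(mtcars)    # number of rows, then number of columns
#> [1] 32 11

nrow(mtcars)   # number of rows (observations)
#> [1] 32

ncol(mtcars)   # number of columns (variables)
#> [1] 11
```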

In summary, becoming proficient in data importing and initial screening is a fundamental step in any data analysis process in R. It ensures that subsequent stages of data manipulation and analysis are based on a clear understanding of the dataset at hand.

References

Quick-R. 2023. “Importing Data.” https://www.statmethods.net/input/importingdata.html.