Chapter 3 Importing Data into R with readr

3.1 Instructions

This tutorial will teach you the basics of importing data from your hard drive into R. We will cover how to import a Comma-Separated Values (csv) file into R using the read_csv() function in the readr package. We will also cover the different data types that R can recognize from a csv file, how to control the way the data are presented in R, and how to export the manipulated data back into a csv file.

Accompanying this tutorial is a short Google quiz for your own self-assessment. The instructions of this tutorial will clearly indicate when you should answer which question.

3.2 Learning Objectives

Know how to import a Comma-Separated Values (csv) file from a hard drive into R.
Understand the basics of read_csv(), including how to use it to import data and how to control the presentation of the data in R.
Be familiar with the different data types that R can recognize.
Know how to import other data files such as txt, xlsx, xpt, and sas into R.
Know how to export a csv file from R into a hard drive.

3.3 Set Up

In this tutorial, we will be using the readr package, so we need to install it and attach it to our R session. The readr package is part of the core tidyverse, a collection of R packages whose functions mainly help us organize and work with data. In this tutorial series, we will be covering three packages from the tidyverse: readr (tutorial 2), dplyr (tutorial 4), and ggplot2 (tutorial 5).

#install.packages("readr")
library(readr)

For this tutorial, we will be using the demo_csv.csv file. This data is a subset of the National Health and Nutrition Examination Survey (NHANES) conducted by the National Center for Health Statistics (NCHS). Our demo_csv.csv, in particular, contains a portion of the demographic information about the survey’s participants in the years 2013-2014. We will cover NHANES in more detail in tutorial 3. For now, you can explore NHANES in general by visiting this website.

After attaching the readr package, one other thing that we need to complete this tutorial is a csv file. Note that the csv file should be in your working directory, as this makes our lives much easier when we want to import data from a hard drive into R. Recall that we can use the function dir() to check if all of the files we need are in our working directory.

dir()

##  [1] "_book"                                 
##  [2] "_bookdown.yml"                         
##  [3] "_bookdown_files"                       
##  [4] "_build.sh"                             
##  [5] "_deploy.sh"                            
##  [6] "_output.yml"                           
##  [7] "0-r-and-rstudio-set-up.Rmd"            
##  [8] "1-introduction-to-r.Rmd"               
##  [9] "2-importing-data-into-r-with-readr.Rmd"
## [10] "3-introduction-to-nhanes.Rmd"          
## [11] "4-data-analysis-with-dplyr.Rmd"        
## [12] "5-data-visualization-with-ggplot.Rmd"  
## [13] "6-date-time-data-with-lubridate.Rmd"   
## [14] "7-data-summary-with-tableone.Rmd"      
## [15] "8-Exercise-Solutions.Rmd"              
## [16] "9-references.Rmd"                      
## [17] "book.bib"                              
## [18] "data"                                  
## [19] "DESCRIPTION"                           
## [20] "Dockerfile"                            
## [21] "docs"                                  
## [22] "header.html"                           
## [23] "images"                                
## [24] "index.Rmd"                             
## [25] "intro2R.log"                           
## [26] "intro2R.Rmd"                           
## [27] "intro2R.tex"                           
## [28] "intro2R_cache"                         
## [29] "intro2R_files"                         
## [30] "LICENSE"                               
## [31] "now.json"                              
## [32] "packages.bib"                          
## [33] "preamble.tex"                          
## [34] "R.Rproj"                               
## [35] "README.md"                             
## [36] "style.css"                             
## [37] "toc.css"

You may be a bit confused about the output of the previous code, because demo_csv.csv does not appear in it. This is because the csv file that we will be using is not located directly in our working directory. In the listing above, it sits inside the data folder. If you are working on Kaggle, the csv file is in the input folder, whereas the working directory is the output folder.

getwd()

## [1] "C:/Users/ehsan/Documents/GitHub/intro2R"

There are two ways that we can approach this problem. We will go over how to do both in the next section of this tutorial.
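Before choosing either approach, it can help to confirm where the file actually is. The following is a minimal sketch that builds a hypothetical project folder inside a temporary directory (the folder name import_demo and the file contents are made up for illustration) and then checks for the file with base R's file.exists() and list.files().

```r
# Hypothetical folder layout built in a temporary directory for illustration
project_dir <- file.path(tempdir(), "import_demo")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("id,age", "1,30"), file.path(project_dir, "data", "demo_csv.csv"))

# The file is not directly inside the project folder ...
in_project <- file.exists(file.path(project_dir, "demo_csv.csv"))

# ... but it is inside the data subfolder
in_data <- file.exists(file.path(project_dir, "data", "demo_csv.csv"))

# List every csv file below the project folder, including subfolders
csv_files <- list.files(project_dir, pattern = "\\.csv$", recursive = TRUE)
csv_files
```

In your own session, calling file.exists() with the path you plan to give read_csv() is a quick way to catch a wrong folder or a misspelled file name before importing.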

DO QUESTION 1 OF THE QUIZ NOW

REVIEW: Which of the following functions lets us set a new working directory?

3.4 Basics Of Importing a CSV File Into R

3.4.1 Method 1: Setting a Different Working Directory

As alluded to in tutorial 1, we can set our working directory to the location of the csv file in order to import it into R. After successfully importing the file, we can then set the working directory back to our original folder.

setwd("./data")

Perfect! Now that we know the csv file is in our working directory, we can import it using read_csv(). This function is relatively easy to use. All we need to do is put the name of our csv file, including the .csv extension, in "" inside the parentheses, and we are good to go!

# The working directory is now the data folder, so the file name alone is enough
read_csv("demo_csv.csv")

## Rows: 15 Columns: 5

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): gender, race, edu
## dbl (2): id, age

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 15 x 5
##       id gender   age race                          edu                         
##    <dbl> <chr>  <dbl> <chr>                         <chr>                       
##  1 83717 Female    80 Mexican American              Less than 9th grade         
##  2 83718 Female    60 Non-Hispanic Black            High school graduate/GED or~
##  3 83719 Male       3 Mexican American              <NA>                        
##  4 83720 Male      36 Non-Hispanic Black            Some college or AA degree   
##  5 83721 Male      52 Non-Hispanic White            College graduate or above   
##  6 83722 Male       0 Other Race - Including Multi~ <NA>                        
##  7 83723 Male      61 Mexican American              9-11th grade (Includes 12th~
##  8 83724 Male      80 Non-Hispanic White            High school graduate/GED or~
##  9 83725 Male       7 Mexican American              <NA>                        
## 10 83726 Male      40 Mexican American              Less than 9th grade         
## 11 83727 Male      26 Other Hispanic                College graduate or above   
## 12 83728 Female     2 Mexican American              <NA>                        
## 13 83729 Female    42 Non-Hispanic Black            College graduate or above   
## 14 83730 Male       7 Other Hispanic                <NA>                        
## 15 83731 Male      11 Non-Hispanic Asian            <NA>

You should see a “Column specification” list and the demo_csv.csv file imported into a data frame in R after running the code above. We will go over what “Column specification” means later in this tutorial.

Now that we have successfully imported our csv file into R, it is time for us to set our working directory back to our original directory.

setwd("..")

getwd()

## [1] "C:/Users/ehsan/Documents/GitHub/intro2R"
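One detail that makes Method 1 safer: setwd() invisibly returns the directory we were in before the change, so we can save it and return to it later without having to remember the original path. The sketch below uses tempdir() as a stand-in for the folder that holds the csv file; the object names target_dir and old_wd are only illustrative.

```r
# Stand-in for the folder that holds the csv file
target_dir <- tempdir()

# setwd() invisibly returns the previous working directory, so we save it
old_wd <- setwd(target_dir)

# ... import the file here, e.g. read_csv("demo_csv.csv") ...

# Go back to where we started, whatever that folder was
setwd(old_wd)
getwd()
```

This avoids relying on setwd("..") pointing at the right place, which only holds when the data folder sits exactly one level below the original directory.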

3.4.2 Method 2: Copying the Exact Pathway of the File

Another way for us to import the csv file into R is to copy and paste the exact pathway of the file into read_csv(). You should see the exact same “Column specification” and data frame as before!

read_csv("data/demo_csv.csv")

## Rows: 15 Columns: 5

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): gender, race, edu
## dbl (2): id, age

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 15 x 5
##       id gender   age race                          edu                         
##    <dbl> <chr>  <dbl> <chr>                         <chr>                       
##  1 83717 Female    80 Mexican American              Less than 9th grade         
##  2 83718 Female    60 Non-Hispanic Black            High school graduate/GED or~
##  3 83719 Male       3 Mexican American              <NA>                        
##  4 83720 Male      36 Non-Hispanic Black            Some college or AA degree   
##  5 83721 Male      52 Non-Hispanic White            College graduate or above   
##  6 83722 Male       0 Other Race - Including Multi~ <NA>                        
##  7 83723 Male      61 Mexican American              9-11th grade (Includes 12th~
##  8 83724 Male      80 Non-Hispanic White            High school graduate/GED or~
##  9 83725 Male       7 Mexican American              <NA>                        
## 10 83726 Male      40 Mexican American              Less than 9th grade         
## 11 83727 Male      26 Other Hispanic                College graduate or above   
## 12 83728 Female     2 Mexican American              <NA>                        
## 13 83729 Female    42 Non-Hispanic Black            College graduate or above   
## 14 83730 Male       7 Other Hispanic                <NA>                        
## 15 83731 Male      11 Non-Hispanic Asian            <NA>
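If typing paths by hand feels error-prone, base R can build and inspect them for us. The following sketch assumes a small made-up file (demo_small.csv in a temporary folder, not part of the tutorial data): file.path() joins folder and file names, and normalizePath() shows the full path that R will actually use.

```r
library(readr)

# Hypothetical file created in a temporary folder so the example is self-contained
data_dir <- file.path(tempdir(), "path_demo")
dir.create(data_dir, showWarnings = FALSE)
csv_path <- file.path(data_dir, "demo_small.csv")
writeLines(c("id,gender,age", "83717,Female,80", "83718,Female,60"), csv_path)

# normalizePath() shows the full (absolute) path that R will use
normalizePath(csv_path)

# The full path works no matter what the current working directory is
demo_small <- read_csv(csv_path, show_col_types = FALSE)
demo_small
```

Here show_col_types = FALSE simply silences the “Column specification” message, as suggested in the message itself.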

Note that you can also store this imported data into an object using <-.

DEMO <- read_csv("data/demo_csv.csv")

## Rows: 15 Columns: 5

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): gender, race, edu
## dbl (2): id, age

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Now, we can just type DEMO to see the data frame.

DEMO

## # A tibble: 15 x 5
##       id gender   age race                          edu                         
##    <dbl> <chr>  <dbl> <chr>                         <chr>                       
##  1 83717 Female    80 Mexican American              Less than 9th grade         
##  2 83718 Female    60 Non-Hispanic Black            High school graduate/GED or~
##  3 83719 Male       3 Mexican American              <NA>                        
##  4 83720 Male      36 Non-Hispanic Black            Some college or AA degree   
##  5 83721 Male      52 Non-Hispanic White            College graduate or above   
##  6 83722 Male       0 Other Race - Including Multi~ <NA>                        
##  7 83723 Male      61 Mexican American              9-11th grade (Includes 12th~
##  8 83724 Male      80 Non-Hispanic White            High school graduate/GED or~
##  9 83725 Male       7 Mexican American              <NA>                        
## 10 83726 Male      40 Mexican American              Less than 9th grade         
## 11 83727 Male      26 Other Hispanic                College graduate or above   
## 12 83728 Female     2 Mexican American              <NA>                        
## 13 83729 Female    42 Non-Hispanic Black            College graduate or above   
## 14 83730 Male       7 Other Hispanic                <NA>                        
## 15 83731 Male      11 Non-Hispanic Asian            <NA>

DO QUESTIONS 2-4 OF THE QUIZ NOW

REVIEW: Which of the following codes will print the entire DEMO data frame?
read_csv can also be used to import Excel and txt files. (True or False)
Which R package does the function read_csv() belong to?

Try it yourself 3.1

Can you try importing the bpx.csv file into R using the function read_csv()?

3.4.3 Key Notes About Importing Data into R

There are a few key things that we should note when using read_csv(): 1. The file name or pathway to the file needs to be in "", 2. The file extension, .csv, needs to be present, and 3. The name of the file needs to be exact.

The third point is related to one of the most common mistakes. When importing any data from your hard drive into R, you need to make sure that the file name that you write in R is exactly what is displayed on your hard drive. For instance, take note of spaces, capital letters, spelling of words, as well as the correct extensions. In other words, demo_csv-1.csv or Demo_csv.csv is not the same file name as demo_csv.csv.

Another point to note is that read_csv() automatically assumes that the first row of your csv file is the header. We will learn how to tell R that this assumption is not correct in section 3.5.2 of this tutorial.

DO QUESTION 5 OF THE QUIZ NOW

REVIEW: R is case sensitive. (True or False)

Try it yourself 3.2

Can you identify the mistakes in the following lines of code?

# a. 
# read_csv(../input/import/demo_csv.csv)

# b.
# read_csv("data/DEMO_csv.csv")

# c.
# Read_csv("data/demo_csv.csv")

# d. 
# read_csv(data/"demo_csv.csv")

DO QUESTION 6 OF THE QUIZ NOW

Which of the following statements about read_csv() are correct? (select all that apply)

3.4.4 Column Specification

You may notice that when you import data into R by running read_csv(), a “Column specification” list appears. This list tells us two things: 1. The names of our columns and 2. The type of data that each column contains.

read_csv("data/demo_csv.csv")

## Rows: 15 Columns: 5

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): gender, race, edu
## dbl (2): id, age

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 15 x 5
##       id gender   age race                          edu                         
##    <dbl> <chr>  <dbl> <chr>                         <chr>                       
##  1 83717 Female    80 Mexican American              Less than 9th grade         
##  2 83718 Female    60 Non-Hispanic Black            High school graduate/GED or~
##  3 83719 Male       3 Mexican American              <NA>                        
##  4 83720 Male      36 Non-Hispanic Black            Some college or AA degree   
##  5 83721 Male      52 Non-Hispanic White            College graduate or above   
##  6 83722 Male       0 Other Race - Including Multi~ <NA>                        
##  7 83723 Male      61 Mexican American              9-11th grade (Includes 12th~
##  8 83724 Male      80 Non-Hispanic White            High school graduate/GED or~
##  9 83725 Male       7 Mexican American              <NA>                        
## 10 83726 Male      40 Mexican American              Less than 9th grade         
## 11 83727 Male      26 Other Hispanic                College graduate or above   
## 12 83728 Female     2 Mexican American              <NA>                        
## 13 83729 Female    42 Non-Hispanic Black            College graduate or above   
## 14 83730 Male       7 Other Hispanic                <NA>                        
## 15 83731 Male      11 Non-Hispanic Asian            <NA>

As we can see after running the code above, there are five columns in our data frame: id (the participant’s unique ID number), gender, age, race, and edu (highest level of education).

We can also see that there are two types of data in this data frame: col_double() and col_character().
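The import message mentions spec() and the col_types argument, so here is a minimal sketch of both. It uses a tiny made-up csv file written to a temporary location rather than demo_csv.csv, and the choice to read id as text is only an illustration of overriding readr's guess.

```r
library(readr)

# Small made-up file so the example runs on its own
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,gender,age", "83717,Female,80", "83719,Male,3"), csv_path)

# Let readr guess the column types, then inspect its guess
guessed <- read_csv(csv_path, show_col_types = FALSE)
spec(guessed)

# State the types ourselves instead of relying on the guess
typed <- read_csv(
  csv_path,
  col_types = cols(
    id = col_character(),     # treat the ID as a label rather than a number
    gender = col_character(),
    age = col_double()
  )
)
typed
```

Supplying col_types also suppresses the “Column specification” message, because readr no longer has to guess.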

DO QUESTION 7 OF THE QUIZ NOW

Which of the following is the best example of data that would be classified as col_double()?

Try it yourself 3.3

Just by looking at the actual data frame, can you guess what type of data col_double() and col_character() are?

(HINT: doubles? integers? logical? character?)

3.5 More Arguments Of read_csv

3.5.1 Skip

There is a range of other arguments that we can use with read_csv(). Firstly, we can put skip inside the () of read_csv() to tell R to skip (AKA not import) a certain number of rows at the top of the file when importing our data.

Skip_2 <- read_csv("data/demo_csv.csv", skip = 2)

## Rows: 13 Columns: 5

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): Female, Non-Hispanic Black, High school graduate/GED or equi
## dbl (2): 83718, 60

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Skip_2

## # A tibble: 13 x 5
##    `83718` Female  `60` `Non-Hispanic Black`        `High school graduate/GED o~
##      <dbl> <chr>  <dbl> <chr>                       <chr>                       
##  1   83719 Male       3 Mexican American            <NA>                        
##  2   83720 Male      36 Non-Hispanic Black          Some college or AA degree   
##  3   83721 Male      52 Non-Hispanic White          College graduate or above   
##  4   83722 Male       0 Other Race - Including Mul~ <NA>                        
##  5   83723 Male      61 Mexican American            9-11th grade (Includes 12th~
##  6   83724 Male      80 Non-Hispanic White          High school graduate/GED or~
##  7   83725 Male       7 Mexican American            <NA>                        
##  8   83726 Male      40 Mexican American            Less than 9th grade         
##  9   83727 Male      26 Other Hispanic              College graduate or above   
## 10   83728 Female     2 Mexican American            <NA>                        
## 11   83729 Female    42 Non-Hispanic Black          College graduate or above   
## 12   83730 Male       7 Other Hispanic              <NA>                        
## 13   83731 Male      11 Non-Hispanic Asian          <NA>

DEMO

## # A tibble: 15 x 5
##       id gender   age race                          edu                         
##    <dbl> <chr>  <dbl> <chr>                         <chr>                       
##  1 83717 Female    80 Mexican American              Less than 9th grade         
##  2 83718 Female    60 Non-Hispanic Black            High school graduate/GED or~
##  3 83719 Male       3 Mexican American              <NA>                        
##  4 83720 Male      36 Non-Hispanic Black            Some college or AA degree   
##  5 83721 Male      52 Non-Hispanic White            College graduate or above   
##  6 83722 Male       0 Other Race - Including Multi~ <NA>                        
##  7 83723 Male      61 Mexican American              9-11th grade (Includes 12th~
##  8 83724 Male      80 Non-Hispanic White            High school graduate/GED or~
##  9 83725 Male       7 Mexican American              <NA>                        
## 10 83726 Male      40 Mexican American              Less than 9th grade         
## 11 83727 Male      26 Other Hispanic                College graduate or above   
## 12 83728 Female     2 Mexican American              <NA>                        
## 13 83729 Female    42 Non-Hispanic Black            College graduate or above   
## 14 83730 Male       7 Other Hispanic                <NA>                        
## 15 83731 Male      11 Non-Hispanic Asian            <NA>

When comparing the Skip_2 table with our original DEMO table, we can see that Skip_2 has two fewer rows. This is because the argument skip = 2 has told R not to import the first two rows of our demo_csv.csv file.
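A common reason to use skip is a file that starts with notes above the real table. The sketch below assumes a made-up export with two such note lines (the wording of the notes is invented for illustration); skipping them lets read_csv() find the header on the third line.

```r
library(readr)

# Made-up file: two lines of notes, then the header, then the data
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("Exported from a hypothetical survey tool",
    "Created: 2014",
    "id,gender,age",
    "83717,Female,80",
    "83718,Female,60"),
  csv_path
)

# Skip the two note lines so the third line is used as the header
with_skip <- read_csv(csv_path, skip = 2, show_col_types = FALSE)
with_skip
```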

DO QUESTION 8 OF THE QUIZ NOW

Which of the following statements is true about the argument skip?

Try it yourself 3.4

You may also notice that the header of Skip_2 is incorrect. This is because skip counts the header as the first row of the file, so the header is omitted along with the first row of data when importing demo_csv.csv into R, and the next row is used as the header instead.

Let’s say this is not what we really want. What we actually want to do is to remove the first two rows of actual data while keeping the header. What do you think we have to do to achieve this?

(HINT: Recall what we learned about extracting rows in tutorial 1)

3.5.2 Remove Header & Header Names

Recall how read_csv() assumes that the first row of our data is the header. If this is not true, we can use col_names = FALSE to tell R that the first row of our data does not contain headers and that R should add headers for our data.

No_header <- read_csv("data/demo_csv.csv", 
                      col_names = FALSE)

## Rows: 16 Columns: 5

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (5): X1, X2, X3, X4, X5

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

No_header

## # A tibble: 16 x 5
##    X1    X2     X3    X4                            X5                          
##    <chr> <chr>  <chr> <chr>                         <chr>                       
##  1 id    gender age   race                          edu                         
##  2 83717 Female 80    Mexican American              Less than 9th grade         
##  3 83718 Female 60    Non-Hispanic Black            High school graduate/GED or~
##  4 83719 Male   3     Mexican American              <NA>                        
##  5 83720 Male   36    Non-Hispanic Black            Some college or AA degree   
##  6 83721 Male   52    Non-Hispanic White            College graduate or above   
##  7 83722 Male   0     Other Race - Including Multi~ <NA>                        
##  8 83723 Male   61    Mexican American              9-11th grade (Includes 12th~
##  9 83724 Male   80    Non-Hispanic White            High school graduate/GED or~
## 10 83725 Male   7     Mexican American              <NA>                        
## 11 83726 Male   40    Mexican American              Less than 9th grade         
## 12 83727 Male   26    Other Hispanic                College graduate or above   
## 13 83728 Female 2     Mexican American              <NA>                        
## 14 83729 Female 42    Non-Hispanic Black            College graduate or above   
## 15 83730 Male   7     Other Hispanic                <NA>                        
## 16 83731 Male   11    Non-Hispanic Asian            <NA>

We can also change the names of our headers by using col_names = followed by a vector of names. For example:

Header_names <- read_csv("data/demo_csv.csv",
                      col_names = c("ID", "Gender", "Age", "Race", "Education"))

## Rows: 16 Columns: 5

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (5): ID, Gender, Age, Race, Education

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Header_names

## # A tibble: 16 x 5
##    ID    Gender Age   Race                          Education                   
##    <chr> <chr>  <chr> <chr>                         <chr>                       
##  1 id    gender age   race                          edu                         
##  2 83717 Female 80    Mexican American              Less than 9th grade         
##  3 83718 Female 60    Non-Hispanic Black            High school graduate/GED or~
##  4 83719 Male   3     Mexican American              <NA>                        
##  5 83720 Male   36    Non-Hispanic Black            Some college or AA degree   
##  6 83721 Male   52    Non-Hispanic White            College graduate or above   
##  7 83722 Male   0     Other Race - Including Multi~ <NA>                        
##  8 83723 Male   61    Mexican American              9-11th grade (Includes 12th~
##  9 83724 Male   80    Non-Hispanic White            High school graduate/GED or~
## 10 83725 Male   7     Mexican American              <NA>                        
## 11 83726 Male   40    Mexican American              Less than 9th grade         
## 12 83727 Male   26    Other Hispanic                College graduate or above   
## 13 83728 Female 2     Mexican American              <NA>                        
## 14 83729 Female 42    Non-Hispanic Black            College graduate or above   
## 15 83730 Male   7     Other Hispanic                <NA>                        
## 16 83731 Male   11    Non-Hispanic Asian            <NA>

DO QUESTION 9 OF THE QUIZ NOW

In which of the following scenarios do you think we would NEED to use col_names = FALSE? (select all that apply)

With the added col_names, you may notice that the column specification for our data is now incorrect (everything is recognized as col_character())!

This is because R now reads “id,” “gender,” “age,” “race,” and “edu” as a row of data, and since all of these are text, R recognizes every column as col_character(). This is worth keeping in mind when you are importing data into R.

We can solve this problem with this solution:

(Skip_and_Header_Names <- read_csv("data/demo_csv.csv", 
                                  skip = 1,
                                  col_names = c("ID", "Gender", "Age", "Race", "Education")))

## Rows: 15 Columns: 5

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): Gender, Race, Education
## dbl (2): ID, Age

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 15 x 5
##       ID Gender   Age Race                          Education                   
##    <dbl> <chr>  <dbl> <chr>                         <chr>                       
##  1 83717 Female    80 Mexican American              Less than 9th grade         
##  2 83718 Female    60 Non-Hispanic Black            High school graduate/GED or~
##  3 83719 Male       3 Mexican American              <NA>                        
##  4 83720 Male      36 Non-Hispanic Black            Some college or AA degree   
##  5 83721 Male      52 Non-Hispanic White            College graduate or above   
##  6 83722 Male       0 Other Race - Including Multi~ <NA>                        
##  7 83723 Male      61 Mexican American              9-11th grade (Includes 12th~
##  8 83724 Male      80 Non-Hispanic White            High school graduate/GED or~
##  9 83725 Male       7 Mexican American              <NA>                        
## 10 83726 Male      40 Mexican American              Less than 9th grade         
## 11 83727 Male      26 Other Hispanic                College graduate or above   
## 12 83728 Female     2 Mexican American              <NA>                        
## 13 83729 Female    42 Non-Hispanic Black            College graduate or above   
## 14 83730 Male       7 Other Hispanic                <NA>                        
## 15 83731 Male      11 Non-Hispanic Asian            <NA>
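To see the difference between the two calls side by side, we can check the class of each column. This is a small sketch on a made-up two-column file (the object names names_only and skip_and_names are illustrative), not on demo_csv.csv itself.

```r
library(readr)

# Made-up file with a header line and two rows of numbers
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,age", "83717,80", "83718,60"), csv_path)

new_names <- c("ID", "Age")

# The header line is read as data, so every column becomes character
names_only <- read_csv(csv_path, col_names = new_names,
                       show_col_types = FALSE)

# Skipping the header line first lets readr see only the numbers
skip_and_names <- read_csv(csv_path, skip = 1, col_names = new_names,
                           show_col_types = FALSE)

# Compare the class of each column in the two results
sapply(names_only, class)
sapply(skip_and_names, class)
```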

3.5.3 Missing Values

We can also define missing values by using na =. For example, if we want to assign “Some college or AA degree” under the edu column as NA, we can use the following code:

Missing_values <- read_csv("data/demo_csv.csv",
                           na = "Some college or AA degree")

## Rows: 15 Columns: 5

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): gender, race, edu
## dbl (2): id, age

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Missing_values

## # A tibble: 15 x 5
##       id gender   age race                          edu                         
##    <dbl> <chr>  <dbl> <chr>                         <chr>                       
##  1 83717 Female    80 Mexican American              Less than 9th grade         
##  2 83718 Female    60 Non-Hispanic Black            High school graduate/GED or~
##  3 83719 Male       3 Mexican American              NA                          
##  4 83720 Male      36 Non-Hispanic Black            <NA>                        
##  5 83721 Male      52 Non-Hispanic White            College graduate or above   
##  6 83722 Male       0 Other Race - Including Multi~ NA                          
##  7 83723 Male      61 Mexican American              9-11th grade (Includes 12th~
##  8 83724 Male      80 Non-Hispanic White            High school graduate/GED or~
##  9 83725 Male       7 Mexican American              NA                          
## 10 83726 Male      40 Mexican American              Less than 9th grade         
## 11 83727 Male      26 Other Hispanic                College graduate or above   
## 12 83728 Female     2 Mexican American              NA                          
## 13 83729 Female    42 Non-Hispanic Black            College graduate or above   
## 14 83730 Male       7 Other Hispanic                NA                          
## 15 83731 Male      11 Non-Hispanic Asian            NA
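Notice that the values that were previously shown as <NA> now appear as plain NA text. By default, read_csv() uses na = c("", "NA"), and supplying our own na replaces that default instead of adding to it. The sketch below, on a small made-up file, shows one way to keep the defaults and add an extra value.

```r
library(readr)

# Made-up file where one cell already contains the text NA
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("id,edu",
    "1,Less than 9th grade",
    "2,NA",
    "3,Some college or AA degree"),
  csv_path
)

# Replacing the default: only the listed value is missing, "NA" stays as text
replaced <- read_csv(csv_path, na = "Some college or AA degree",
                     show_col_types = FALSE)

# Extending the default: keep "" and "NA" as missing and add our own value
extended <- read_csv(csv_path, na = c("", "NA", "Some college or AA degree"),
                     show_col_types = FALSE)

# Count the missing values in each version
sum(is.na(replaced$edu))
sum(is.na(extended$edu))
```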

DO QUESTION 10 OF THE QUIZ NOW

Only characters can be assigned a value of NA; there is a different missing-value designation for numeric values. (True or False)

3.6 Importing Other File Types into R

While csv is the most common file type to import into R, we can also import other types of data files into R using different functions. In this section, you will be introduced to the very basics of how to import txt, xlsx, xpt, and sas files into R.

3.6.1 Text file (txt)

The simplest function that we can use to import a txt file is read.table(). This function comes with base R (it lives in the utils package, which is attached by default), so we do not need to install or attach any packages before using it!

The first argument of this function is the file path. What do you think header = TRUE means?

read.table("data/demo_txt.txt", header = TRUE)

##       id gender age
## 1  83717 Female  80
## 2  83718 Female  60
## 3  83719   Male   3
## 4  83720   Male  36
## 5  83721   Male  52
## 6  83722   Male   0
## 7  83723   Male  61
## 8  83724   Male  80
## 9  83725   Male   7
## 10 83726   Male  40
## 11 83727   Male  26
## 12 83728 Female   2
## 13 83729 Female  42
## 14 83730   Male   7
## 15 83731   Male  11
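Text files do not always separate values with spaces. As a sketch, assuming a made-up tab-separated file (not the tutorial's demo_txt.txt), the sep argument of read.table() tells R which character divides the columns.

```r
# Made-up tab-separated file; "\t" is how a tab is written in R
txt_path <- tempfile(fileext = ".txt")
writeLines(
  c("id\tgender\tage",
    "83717\tFemale\t80",
    "83718\tFemale\t60"),
  txt_path
)

# sep = "\t" splits each line at the tabs
tab_data <- read.table(txt_path, header = TRUE, sep = "\t")
str(tab_data)
```

When sep is left out, read.table() splits on any white space, which is why the earlier call worked without it.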

3.6.2 Excel file (xlsx)

To import an xlsx file into R, we use read_excel(). But before we can use this function, we need to install and attach the readxl package. Like read.table(), this function requires a file path.

# install.packages("readxl")
library(readxl)

read_excel("data/demo_xlsx.xlsx")

## # A tibble: 15 x 5
##       id gender   age race                          edu                         
##    <dbl> <chr>  <dbl> <chr>                         <chr>                       
##  1 83717 Female    80 Mexican American              Less than 9th grade         
##  2 83718 Female    60 Non-Hispanic Black            High school graduate/GED or~
##  3 83719 Male       3 Mexican American              NA                          
##  4 83720 Male      36 Non-Hispanic Black            Some college or AA degree   
##  5 83721 Male      52 Non-Hispanic White            College graduate or above   
##  6 83722 Male       0 Other Race - Including Multi~ NA                          
##  7 83723 Male      61 Mexican American              9-11th grade (Includes 12th~
##  8 83724 Male      80 Non-Hispanic White            High school graduate/GED or~
##  9 83725 Male       7 Mexican American              NA                          
## 10 83726 Male      40 Mexican American              Less than 9th grade         
## 11 83727 Male      26 Other Hispanic                College graduate or above   
## 12 83728 Female     2 Mexican American              NA                          
## 13 83729 Female    42 Non-Hispanic Black            College graduate or above   
## 14 83730 Male       7 Other Hispanic                NA                          
## 15 83731 Male      11 Non-Hispanic Asian            NA
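An Excel workbook can hold several sheets, and read_excel() reads the first one unless told otherwise. The sketch below uses an example workbook that ships with the readxl package (located with readxl_example()) instead of the tutorial's demo_xlsx.xlsx, so it should run wherever readxl is installed.

```r
library(readxl)

# Path to an example workbook bundled with readxl
xlsx_path <- readxl_example("datasets.xlsx")

# List the sheet names in the workbook
sheets <- excel_sheets(xlsx_path)
sheets

# Read one sheet by name using the sheet argument
first_sheet <- read_excel(xlsx_path, sheet = sheets[1])
head(first_sheet)
```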

Try it yourself 3.5

Import the bpx.xlsx into R using the read_excel() function.

3.6.3 XPT File Extension

Another file type that you may need to import into R is xpt, the SAS transport format in which NHANES distributes its data files. To import it, we use the function read.xport() from the SASxport package.

# install.packages("SASxport")
library(SASxport)

read.xport("data/demo_xpt.xpt")

##       ID GENDER                             RACE
## 1  83717 Female                 Mexican American
## 2  83718 Female               Non-Hispanic Black
## 3  83719   Male                 Mexican American
## 4  83720   Male               Non-Hispanic Black
## 5  83721   Male               Non-Hispanic White
## 6  83722   Male Other Race - Including Multi-Rac
## 7  83723   Male                 Mexican American
## 8  83724   Male               Non-Hispanic White
## 9  83725   Male                 Mexican American
## 10 83726   Male                 Mexican American
## 11 83727   Male                   Other Hispanic
## 12 83728 Female                 Mexican American
## 13 83729 Female               Non-Hispanic Black
## 14 83730   Male                   Other Hispanic
## 15 83731   Male Other Race - Including Multi-Rac
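The haven package, introduced in the next section for SAS files, also provides read_xpt() for transport files, which may be convenient if SASxport is not available for your version of R. The following is a minimal sketch that does not use demo_xpt.xpt: it writes a tiny made-up data frame to a temporary xpt file with write_xpt() and reads it back, so the values and the object names demo, tmp, and demo_back are illustrative assumptions.

```r
# install.packages("haven")
library(haven)

# Small made-up data frame standing in for real survey data
demo <- data.frame(
  ID = c(83717, 83718),
  GENDER = c("Female", "Female"),
  stringsAsFactors = FALSE
)

# Write it to a temporary SAS transport (xpt) file
tmp <- tempfile(fileext = ".xpt")
write_xpt(demo, tmp)

# Read the xpt file back into R as a tibble
demo_back <- read_xpt(tmp)
print(demo_back)
```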

3.6.4 Statistical Analysis System (SAS)

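We can use the function read_sas() to import SAS data files into R. These files usually carry the .sas7bdat extension; a plain .sas file is normally a SAS program rather than a data set, so check what your file actually contains if the import fails. Before using read_sas(), we need to install and attach the haven package.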

# install.packages("haven")
library(haven)

read_sas("data/demo_sas.sas")

## # A tibble: 15 x 3
##       id gender  race
##    <dbl>  <dbl> <dbl>
##  1 83717      2     1
##  2 83718      2     4
##  3 83719      1     1
##  4 83720      1     4
##  5 83721      1     3
##  6 83722      1     5
##  7 83723      1     1
##  8 83724      1     3
##  9 83725      1     1
## 10 83726      1     1
## 11 83727      1     2
## 12 83728      2     1
## 13 83729      2     4
## 14 83730      1     2
## 15 83731      1     5
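In this output, gender and race appear as numeric codes rather than text. One way to make them readable is to convert the codes into labelled factors with factor(). The sketch below assumes the code-to-label mapping implied by the earlier outputs in this section (for example, id 83717 is Female with code 2); the small demo_sas data frame is a hand-typed copy of the first three rows, not the imported file.

```r
# Hand-typed copy of the first three rows shown above (an illustrative stand-in)
demo_sas <- data.frame(
  id = c(83717, 83718, 83719),
  gender = c(2, 2, 1),
  race = c(1, 4, 1)
)

# Assumed mapping, inferred by comparing the xpt and SAS outputs above
demo_sas$gender <- factor(demo_sas$gender,
                          levels = c(1, 2),
                          labels = c("Male", "Female"))
demo_sas$race <- factor(demo_sas$race,
                        levels = 1:5,
                        labels = c("Mexican American",
                                   "Other Hispanic",
                                   "Non-Hispanic White",
                                   "Non-Hispanic Black",
                                   "Other Race - Including Multi-Racial"))

print(demo_sas)
```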

3.7 Exporting the Data Frame From R

After changing and manipulating our data in R, we can also export it back into a csv file to share it. To do this, we can use the function write_csv(), which takes the data frame as its first argument and the destination file path as its second. For example, let’s say we want to export our “Missing_values” data frame.

write_csv(Missing_values, "data/Missing Values.csv")
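A quick way to check an export is to read the file straight back in and compare it with the original. The sketch below does this with a made-up data frame called toy (a hypothetical stand-in for Missing_values) and writes to a temporary folder rather than data/, so it can be run without touching the tutorial files.

```r
library(readr)

# Hypothetical stand-in for the Missing_values data frame
toy <- data.frame(id = c(83717, 83718), age = c(80, NA))

# Export to a csv file in a temporary folder
out_path <- file.path(tempdir(), "toy_export.csv")
write_csv(toy, out_path)

# Re-import the file; col_types = cols() lets readr guess the column types quietly
toy_back <- read_csv(out_path, col_types = cols())
print(toy_back)
```

By default write_csv() writes missing values as NA, which read_csv() recognizes as missing when the file is imported again.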

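We can also write a csv file that Excel opens cleanly by using write_excel_csv(). This function does not create a true Excel workbook (xlsx); it writes a csv file with an extra encoding marker (a UTF-8 byte order mark) so that Excel displays special characters correctly. For that reason, the file name should still end in .csv.

write_excel_csv(Missing_values, "data/Missing Values Excel.csv")

Now there should be 2 new files, “Missing Values.csv” and “Missing Values Excel.csv”, in the data folder inside our working directory. Calling dir() lists the contents of the working directory itself, where the data folder appears; calling dir("data") would list the files inside it.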

dir()

##  [1] "_book"                                 
##  [2] "_bookdown.yml"                         
##  [3] "_bookdown_files"                       
##  [4] "_build.sh"                             
##  [5] "_deploy.sh"                            
##  [6] "_output.yml"                           
##  [7] "0-r-and-rstudio-set-up.Rmd"            
##  [8] "1-introduction-to-r.Rmd"               
##  [9] "2-importing-data-into-r-with-readr.Rmd"
## [10] "3-introduction-to-nhanes.Rmd"          
## [11] "4-data-analysis-with-dplyr.Rmd"        
## [12] "5-data-visualization-with-ggplot.Rmd"  
## [13] "6-date-time-data-with-lubridate.Rmd"   
## [14] "7-data-summary-with-tableone.Rmd"      
## [15] "8-Exercise-Solutions.Rmd"              
## [16] "9-references.Rmd"                      
## [17] "book.bib"                              
## [18] "data"                                  
## [19] "DESCRIPTION"                           
## [20] "Dockerfile"                            
## [21] "docs"                                  
## [22] "header.html"                           
## [23] "images"                                
## [24] "index.Rmd"                             
## [25] "intro2R.log"                           
## [26] "intro2R.Rmd"                           
## [27] "intro2R.tex"                           
## [28] "intro2R_cache"                         
## [29] "intro2R_files"                         
## [30] "LICENSE"                               
## [31] "now.json"                              
## [32] "packages.bib"                          
## [33] "preamble.tex"                          
## [34] "R.Rproj"                               
## [35] "README.md"                             
## [36] "style.css"                             
## [37] "toc.css"
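Since write_excel_csv() produces csv text rather than a workbook, a true xlsx file needs a different tool. One option is the write_xlsx() function from the writexl package, which is not otherwise used in this tutorial. The sketch below rests on some assumptions: toy is a hypothetical stand-in for Missing_values, and the file goes to a temporary folder rather than data/.

```r
# install.packages("writexl")
library(writexl)
library(readxl)

# Hypothetical stand-in for the Missing_values data frame
toy <- data.frame(
  id = c(83717, 83718),
  gender = c("Female", "Female"),
  stringsAsFactors = FALSE
)

# Write a real xlsx workbook to a temporary folder
xlsx_path <- file.path(tempdir(), "toy_export.xlsx")
write_xlsx(toy, xlsx_path)

# Confirm the file exists, then read it back as a workbook
print(file.exists(xlsx_path))
toy_xlsx <- read_excel(xlsx_path)
print(toy_xlsx)
```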

Congratulations! We have now succeeded in exporting a dataset from R to an external file! This will make our work much easier to share and access!

3.8 Summary and Takeaways

In this tutorial, we have learned how to import csv files from our hard drive into R using read_csv(). This is an essential first step in data analysis, since the data must be in R before we can process or manipulate it.