Chapter 3 Importing Data into R with readr
3.1 Instructions
This tutorial will teach you the basics of importing data from your hard drive into R. We will cover how to import a Comma-Separated Values (csv) file into R using the read_csv() function in the readr package. We will also cover the different data types that R can recognize from a csv file, how to control how the data appear in R, and how to export the manipulated data back into a csv file.
Accompanying this tutorial is a short Google quiz for your own self-assessment. The instructions of this tutorial will clearly indicate when you should answer which question.
3.2 Learning Objectives
- Know how to import a Comma-Separated Values (csv) file from a hard drive into R.
- Understand the basics of read_csv(), including how to use it to import data and how to manipulate the presentation of the data in R.
- Be familiar with the different data types that R can recognize.
- Know how to import other data files such as txt, xlsx, xpt, and sas into R.
- Know how to export a csv file from R into a hard drive.
3.3 Set Up
In this tutorial, we will be using the readr package, so we need to install it and attach it to our R session. The readr package is part of the core tidyverse, a collection of R packages whose functions mainly help us organize and work with data. In this tutorial series, we will cover three packages from the tidyverse: readr (tutorial 2), dplyr (tutorial 4), and ggplot2 (tutorial 5).
#install.packages("readr")
library(readr)
For this tutorial, we will be using the demo_csv.csv file. This data set is a subset of the National Health and Nutrition Examination Survey (NHANES) conducted by the National Center for Health Statistics (NCHS). Our demo_csv.csv, in particular, contains a portion of the demographic information about the survey’s participants in the years 2013-2014. We will cover NHANES in more detail in tutorial 3. For now, you can explore NHANES in general by visiting this website.
After attaching the readr package, the other thing we need to complete this tutorial is a csv file. Note that the csv file should be in your working directory; this makes our lives much easier when we want to import data from a hard drive into R. Recall that we can use the function dir() to check whether all of the files we need are in our working directory.
dir()
## [1] "_book"
## [2] "_bookdown.yml"
## [3] "_bookdown_files"
## [4] "_build.sh"
## [5] "_deploy.sh"
## [6] "_output.yml"
## [7] "0-r-and-rstudio-set-up.Rmd"
## [8] "1-introduction-to-r.Rmd"
## [9] "2-importing-data-into-r-with-readr.Rmd"
## [10] "3-introduction-to-nhanes.Rmd"
## [11] "4-data-analysis-with-dplyr.Rmd"
## [12] "5-data-visualization-with-ggplot.Rmd"
## [13] "6-date-time-data-with-lubridate.Rmd"
## [14] "7-data-summary-with-tableone.Rmd"
## [15] "8-Exercise-Solutions.Rmd"
## [16] "9-references.Rmd"
## [17] "book.bib"
## [18] "data"
## [19] "DESCRIPTION"
## [20] "Dockerfile"
## [21] "docs"
## [22] "header.html"
## [23] "images"
## [24] "index.Rmd"
## [25] "intro2R.log"
## [26] "intro2R.Rmd"
## [27] "intro2R.tex"
## [28] "intro2R_cache"
## [29] "intro2R_files"
## [30] "LICENSE"
## [31] "now.json"
## [32] "packages.bib"
## [33] "preamble.tex"
## [34] "R.Rproj"
## [35] "README.md"
## [36] "style.css"
## [37] "toc.css"
You may be a bit confused about the output of the previous code, because demo_csv.csv is not listed. This is because the csv file that we will be using is not located directly in our working directory. More specifically, the csv file is in the data folder, which is a subfolder of our working directory.
getwd()
## [1] "C:/Users/ehsan/Documents/GitHub/intro2R"
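Before importing, it can help to confirm where a file actually lives. The following is a minimal sketch rather than part of the original tutorial: it builds a throwaway folder under tempdir() (the folder name and the tiny csv are hypothetical stand-ins for our project and demo_csv.csv) and uses the base R functions file.exists() and dir() to check both locations.

```r
# Hypothetical stand-in for a project folder that contains a "data" subfolder
project_dir <- file.path(tempdir(), "import_demo_project")
data_dir <- file.path(project_dir, "data")
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

# A tiny stand-in for demo_csv.csv, saved inside the subfolder
writeLines(c("id,age", "83717,80"), file.path(data_dir, "demo_csv.csv"))

# Is the file directly in the project folder? Is it in the data subfolder?
in_project_dir <- file.exists(file.path(project_dir, "demo_csv.csv"))
in_data_dir <- file.exists(file.path(data_dir, "demo_csv.csv"))

# dir() also accepts a folder, so we can list the contents of the subfolder
files_in_data <- dir(data_dir)

print(in_project_dir)
print(in_data_dir)
print(files_in_data)
```

In our own project, the same idea would be dir("data") or file.exists("data/demo_csv.csv").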
There are two ways that we can approach this problem. We will go over how to do both in the next section of this tutorial.
DO QUESTION 1 OF THE QUIZ NOW
- REVIEW: Which of the following functions lets us set a new working directory?
3.4 Basics Of Importing a CSV File Into R
3.4.1 Method 1: Setting a Different Working Directory
As mentioned in tutorial 1, we can set our working directory to the location of the csv file to import it into R. After successfully importing the file, we can then set the working directory back to our original folder.
setwd("./data")
Perfect! Now that the csv file is in our working directory, we can import it using read_csv(). This function is relatively easy to use. All we need to do is put the name of our csv file, along with the .csv extension, in quotation marks ("") within the parentheses, and we are good to go!
read_csv("demo_csv.csv")
## Rows: 15 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): gender, race, edu
## dbl (2): id, age
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 15 x 5
## id gender age race edu
## <dbl> <chr> <dbl> <chr> <chr>
## 1 83717 Female 80 Mexican American Less than 9th grade
## 2 83718 Female 60 Non-Hispanic Black High school graduate/GED or~
## 3 83719 Male 3 Mexican American <NA>
## 4 83720 Male 36 Non-Hispanic Black Some college or AA degree
## 5 83721 Male 52 Non-Hispanic White College graduate or above
## 6 83722 Male 0 Other Race - Including Multi~ <NA>
## 7 83723 Male 61 Mexican American 9-11th grade (Includes 12th~
## 8 83724 Male 80 Non-Hispanic White High school graduate/GED or~
## 9 83725 Male 7 Mexican American <NA>
## 10 83726 Male 40 Mexican American Less than 9th grade
## 11 83727 Male 26 Other Hispanic College graduate or above
## 12 83728 Female 2 Mexican American <NA>
## 13 83729 Female 42 Non-Hispanic Black College graduate or above
## 14 83730 Male 7 Other Hispanic <NA>
## 15 83731 Male 11 Non-Hispanic Asian <NA>
You should see a “Column specification” list and the demo_csv.csv file imported into R as a data frame after running the code above. We will go over what “Column specification” means later in this tutorial.
Now that we have successfully imported our csv file into R, it is time for us to set our working directory back to our original directory.
setwd("..")
getwd()
## [1] "C:/Users/ehsan/Documents/GitHub/intro2R"
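Method 1 can be made a little safer by remembering the original working directory instead of relying on "..". The sketch below is an illustration under assumptions: the folder and the two-row csv are hypothetical and are created under tempdir() so the code can run anywhere. It relies on the fact that setwd() invisibly returns the previous working directory.

```r
library(readr)

# Hypothetical folder and file so the sketch can run anywhere
csv_dir <- file.path(tempdir(), "setwd_demo")
dir.create(csv_dir, showWarnings = FALSE)
writeLines(c("id,age", "83717,80", "83718,60"),
           file.path(csv_dir, "demo_csv.csv"))

# setwd() invisibly returns the previous working directory, so we can save it
old_wd <- setwd(csv_dir)

# Inside the new working directory, only the file name is needed
demo_small <- read_csv("demo_csv.csv", show_col_types = FALSE)

# Go back to exactly where we started
setwd(old_wd)
```

Saving the old directory this way means we return to the right place even if the csv folder is not a direct subfolder of the original working directory.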
3.4.2 Method 2: Copying the Exact Pathway of the File
Another way for us to import the csv file into R is to copy and paste the exact pathway of the file into read_csv(). You should see the exact same “Column specification” and data frame as before!
read_csv("data/demo_csv.csv")
## Rows: 15 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): gender, race, edu
## dbl (2): id, age
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 15 x 5
## id gender age race edu
## <dbl> <chr> <dbl> <chr> <chr>
## 1 83717 Female 80 Mexican American Less than 9th grade
## 2 83718 Female 60 Non-Hispanic Black High school graduate/GED or~
## 3 83719 Male 3 Mexican American <NA>
## 4 83720 Male 36 Non-Hispanic Black Some college or AA degree
## 5 83721 Male 52 Non-Hispanic White College graduate or above
## 6 83722 Male 0 Other Race - Including Multi~ <NA>
## 7 83723 Male 61 Mexican American 9-11th grade (Includes 12th~
## 8 83724 Male 80 Non-Hispanic White High school graduate/GED or~
## 9 83725 Male 7 Mexican American <NA>
## 10 83726 Male 40 Mexican American Less than 9th grade
## 11 83727 Male 26 Other Hispanic College graduate or above
## 12 83728 Female 2 Mexican American <NA>
## 13 83729 Female 42 Non-Hispanic Black College graduate or above
## 14 83730 Male 7 Other Hispanic <NA>
## 15 83731 Male 11 Non-Hispanic Asian <NA>
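A path can also be built piece by piece rather than typed as one string. The sketch below is an illustration, not the tutorial's own method: it uses the base R functions file.path() and normalizePath() on a hypothetical three-column csv created under tempdir(), then passes the resulting full path to read_csv().

```r
library(readr)

# Hypothetical project folder with a data subfolder
base_dir <- file.path(tempdir(), "path_demo")
dir.create(file.path(base_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# file.path() joins folder and file names with the correct separator
csv_path <- file.path(base_dir, "data", "demo_csv.csv")
writeLines(c("id,gender,age", "83717,Female,80", "83718,Female,60"), csv_path)

# normalizePath() expands the path into its full (absolute) form
full_path <- normalizePath(csv_path)

# read_csv() works the same way with a full path as with a relative one
demo_small <- read_csv(full_path, show_col_types = FALSE)
```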
Note that you can also store this imported data in an object using <-.
DEMO <- read_csv("data/demo_csv.csv")
## Rows: 15 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): gender, race, edu
## dbl (2): id, age
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Now, we can just type DEMO to see the data frame.
DEMO
## # A tibble: 15 x 5
## id gender age race edu
## <dbl> <chr> <dbl> <chr> <chr>
## 1 83717 Female 80 Mexican American Less than 9th grade
## 2 83718 Female 60 Non-Hispanic Black High school graduate/GED or~
## 3 83719 Male 3 Mexican American <NA>
## 4 83720 Male 36 Non-Hispanic Black Some college or AA degree
## 5 83721 Male 52 Non-Hispanic White College graduate or above
## 6 83722 Male 0 Other Race - Including Multi~ <NA>
## 7 83723 Male 61 Mexican American 9-11th grade (Includes 12th~
## 8 83724 Male 80 Non-Hispanic White High school graduate/GED or~
## 9 83725 Male 7 Mexican American <NA>
## 10 83726 Male 40 Mexican American Less than 9th grade
## 11 83727 Male 26 Other Hispanic College graduate or above
## 12 83728 Female 2 Mexican American <NA>
## 13 83729 Female 42 Non-Hispanic Black College graduate or above
## 14 83730 Male 7 Other Hispanic <NA>
## 15 83731 Male 11 Non-Hispanic Asian <NA>
DO QUESTIONS 2-4 OF THE QUIZ NOW
- REVIEW: Which of the following lines of code will print the entire DEMO data frame?
- read_csv() can also be used to import Excel and txt files. (True or False)
- Which R package does the function read_csv() belong to?
Try it yourself 3.1
Can you try importing the bpx.csv file into R using the function read_csv()?
3.4.3 Key Notes About Importing Data into R
There are a few key things that we should note when using read_csv():
1. The file name or pathway to the file needs to be in quotation marks (""),
2. The file extension, .csv, needs to be present, and
3. The name of the file needs to be exact.
The third point is related to one of the most common mistakes. When importing any data from your hard drive into R, you need to make sure that the file name you write in R is exactly what is displayed on your hard drive. For instance, take note of spaces, capital letters, spelling, and the correct extension. In other words, demo_csv-1.csv or Demo_csv.csv is not the same file as demo_csv.csv.
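When an import fails because of a file name, it can be quicker to let R compare names than to compare them by eye. The following is a rough sketch under assumptions: the folder is a hypothetical one under tempdir(), and the misspelled name Demo_csv.csv stands in for whatever we typed. It uses dir() for an exact comparison and list.files() with ignore.case = TRUE to find near matches.

```r
# Hypothetical folder containing one correctly named file
folder <- file.path(tempdir(), "names_demo")
dir.create(folder, showWarnings = FALSE)
file.create(file.path(folder, "demo_csv.csv"))

# The name we (incorrectly) typed, with a capital D
typed_name <- "Demo_csv.csv"

# Exact comparison against the names actually stored in the folder
exact_match <- typed_name %in% dir(folder)

# Case-insensitive search to reveal how the file is really spelled
close_matches <- list.files(folder, pattern = "^demo_csv\\.csv$",
                            ignore.case = TRUE)

print(exact_match)
print(close_matches)
```

Comparing against dir() rather than only calling file.exists() is a deliberate choice here, because some operating systems treat file names as case insensitive and would hide the mismatch.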
Another point to note is that read_csv() automatically assumes that the first row of your csv file is the header. We will learn how to tell R that this assumption is not correct in section 3.5.2 of this tutorial.
DO QUESTION 5 OF THE QUIZ NOW
- REVIEW: R is case sensitive. (True or False)
Try it yourself 3.2
Can you identify the mistakes in the following lines of code?
# a.
# read_csv(../input/import/demo_csv.csv)
# b.
# read_csv("data/DEMO_csv.csv")
# c.
# Read_csv("data/demo_csv.csv")
# d.
# read_csv(data/"demo_csv.csv")
DO QUESTION 6 OF THE QUIZ NOW
- Which of the following statements about read_csv() are correct? (select all that apply)
3.4.4 Column Specification
You may notice that when you import data into R by running read_csv(), a “Column specification” list appears. This list tells us two things:
1. The names of our columns and
2. The type of data that each column contains.
read_csv("data/demo_csv.csv")
## Rows: 15 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): gender, race, edu
## dbl (2): id, age
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 15 x 5
## id gender age race edu
## <dbl> <chr> <dbl> <chr> <chr>
## 1 83717 Female 80 Mexican American Less than 9th grade
## 2 83718 Female 60 Non-Hispanic Black High school graduate/GED or~
## 3 83719 Male 3 Mexican American <NA>
## 4 83720 Male 36 Non-Hispanic Black Some college or AA degree
## 5 83721 Male 52 Non-Hispanic White College graduate or above
## 6 83722 Male 0 Other Race - Including Multi~ <NA>
## 7 83723 Male 61 Mexican American 9-11th grade (Includes 12th~
## 8 83724 Male 80 Non-Hispanic White High school graduate/GED or~
## 9 83725 Male 7 Mexican American <NA>
## 10 83726 Male 40 Mexican American Less than 9th grade
## 11 83727 Male 26 Other Hispanic College graduate or above
## 12 83728 Female 2 Mexican American <NA>
## 13 83729 Female 42 Non-Hispanic Black College graduate or above
## 14 83730 Male 7 Other Hispanic <NA>
## 15 83731 Male 11 Non-Hispanic Asian <NA>
As we can see after running the code above, there are five columns in our data frame: id (the participant’s unique ID number), gender, age, race, and edu (highest level of education).
We can also see that there are two types of data in this data frame: col_double() and col_character().
DO QUESTION 7 OF THE QUIZ NOW
- Which of the following is the best example of data that would be classified as col_double()?
Try it yourself 3.3
Just by looking at the actual data frame, can you guess what type of data col_double() and col_character() are?
(HINT: doubles? integers? logical? character?)
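One way to check a guess is to inspect the column specification directly. The sketch below is an illustration on a hypothetical three-column csv written to a temporary file: it retrieves the guessed specification with spec(), as the import message suggests, and then sets the types by hand with the col_types argument and readr's cols(), col_integer(), col_character(), and col_double() helpers.

```r
library(readr)

# Hypothetical miniature of demo_csv.csv
tiny_path <- tempfile(fileext = ".csv")
writeLines(c("id,gender,age", "83717,Female,80", "83718,Male,60"), tiny_path)

# Let readr guess the column types, then retrieve the specification
guessed <- read_csv(tiny_path, show_col_types = FALSE)
guessed_spec <- spec(guessed)
print(guessed_spec)

# Override the guess: read id as an integer instead of a double
typed <- read_csv(
  tiny_path,
  col_types = cols(
    id = col_integer(),
    gender = col_character(),
    age = col_double()
  )
)
```

Supplying col_types also quiets the “Column specification” message, since R no longer has to guess.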
3.5 More Arguments Of read_csv
3.5.1 Skip
There is a range of other arguments that we can use with read_csv(). Firstly, we can add skip inside the () of read_csv() to tell R to skip (AKA not import) a certain number of rows when importing our data.
Skip_2 <- read_csv("data/demo_csv.csv", skip = 2)
## Rows: 13 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): Female, Non-Hispanic Black, High school graduate/GED or equi
## dbl (2): 83718, 60
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Skip_2
## # A tibble: 13 x 5
## `83718` Female `60` `Non-Hispanic Black` `High school graduate/GED o~
## <dbl> <chr> <dbl> <chr> <chr>
## 1 83719 Male 3 Mexican American <NA>
## 2 83720 Male 36 Non-Hispanic Black Some college or AA degree
## 3 83721 Male 52 Non-Hispanic White College graduate or above
## 4 83722 Male 0 Other Race - Including Mul~ <NA>
## 5 83723 Male 61 Mexican American 9-11th grade (Includes 12th~
## 6 83724 Male 80 Non-Hispanic White High school graduate/GED or~
## 7 83725 Male 7 Mexican American <NA>
## 8 83726 Male 40 Mexican American Less than 9th grade
## 9 83727 Male 26 Other Hispanic College graduate or above
## 10 83728 Female 2 Mexican American <NA>
## 11 83729 Female 42 Non-Hispanic Black College graduate or above
## 12 83730 Male 7 Other Hispanic <NA>
## 13 83731 Male 11 Non-Hispanic Asian <NA>
DEMO
## # A tibble: 15 x 5
## id gender age race edu
## <dbl> <chr> <dbl> <chr> <chr>
## 1 83717 Female 80 Mexican American Less than 9th grade
## 2 83718 Female 60 Non-Hispanic Black High school graduate/GED or~
## 3 83719 Male 3 Mexican American <NA>
## 4 83720 Male 36 Non-Hispanic Black Some college or AA degree
## 5 83721 Male 52 Non-Hispanic White College graduate or above
## 6 83722 Male 0 Other Race - Including Multi~ <NA>
## 7 83723 Male 61 Mexican American 9-11th grade (Includes 12th~
## 8 83724 Male 80 Non-Hispanic White High school graduate/GED or~
## 9 83725 Male 7 Mexican American <NA>
## 10 83726 Male 40 Mexican American Less than 9th grade
## 11 83727 Male 26 Other Hispanic College graduate or above
## 12 83728 Female 2 Mexican American <NA>
## 13 83729 Female 42 Non-Hispanic Black College graduate or above
## 14 83730 Male 7 Other Hispanic <NA>
## 15 83731 Male 11 Non-Hispanic Asian <NA>
When comparing the Skip_2 table with our original DEMO table, we can see that Skip_2 has two fewer rows. This is because the argument skip = 2 has told R not to import the first two rows of our demo_csv.csv.
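The effect of skip is easiest to see on a very small file. The following sketch uses a hypothetical four-line csv written to a temporary file; the file name and values are made up for illustration.

```r
library(readr)

# Hypothetical file: one header line followed by three data lines
tiny_path <- tempfile(fileext = ".csv")
writeLines(c("id,age", "1,80", "2,60", "3,3"), tiny_path)

# Normal import: the first line is used as the header
full <- read_csv(tiny_path, show_col_types = FALSE)

# skip = 2 drops the first two lines of the file before anything else happens
skipped <- read_csv(tiny_path, skip = 2, show_col_types = FALSE)

print(names(full))
print(names(skipped))
print(nrow(skipped))
```

Here the two skipped lines are "id,age" and "1,80", so the line "2,60" is treated as the header and only one data row remains.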
DO QUESTION 8 OF THE QUIZ NOW
- Which of the following statements is true about the argument
skip
?
Try it yourself 3.4
You may also notice that the header of Skip_2 is incorrect. This is because skip counts the header of our data as the first row, so the header is omitted when importing demo_csv.csv into R and the next remaining row is used as the header instead.
Let’s say this is not what we really want. What we actually want to do is to remove the first two rows of actual data while keeping the header. What do you think we have to do to achieve this?
(HINT: Recall what we learned about extracting rows in tutorial 1)
3.5.2 Remove Header & Header Names
Recall how read_csv() assumes that the first row of our data is the header. If this is not true, we can use col_names = FALSE to tell R that the first row of our data does not contain headers and that R should add headers for our data.
No_header <- read_csv("data/demo_csv.csv",
                      col_names = FALSE)
## Rows: 16 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (5): X1, X2, X3, X4, X5
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
No_header
## # A tibble: 16 x 5
## X1 X2 X3 X4 X5
## <chr> <chr> <chr> <chr> <chr>
## 1 id gender age race edu
## 2 83717 Female 80 Mexican American Less than 9th grade
## 3 83718 Female 60 Non-Hispanic Black High school graduate/GED or~
## 4 83719 Male 3 Mexican American <NA>
## 5 83720 Male 36 Non-Hispanic Black Some college or AA degree
## 6 83721 Male 52 Non-Hispanic White College graduate or above
## 7 83722 Male 0 Other Race - Including Multi~ <NA>
## 8 83723 Male 61 Mexican American 9-11th grade (Includes 12th~
## 9 83724 Male 80 Non-Hispanic White High school graduate/GED or~
## 10 83725 Male 7 Mexican American <NA>
## 11 83726 Male 40 Mexican American Less than 9th grade
## 12 83727 Male 26 Other Hispanic College graduate or above
## 13 83728 Female 2 Mexican American <NA>
## 14 83729 Female 42 Non-Hispanic Black College graduate or above
## 15 83730 Male 7 Other Hispanic <NA>
## 16 83731 Male 11 Non-Hispanic Asian <NA>
We can also change the names of our headers by using col_names = followed by a vector of names. For example:
Header_names <- read_csv("data/demo_csv.csv",
                         col_names = c("ID", "Gender", "Age", "Race", "Education"))
## Rows: 16 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (5): ID, Gender, Age, Race, Education
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Header_names
## # A tibble: 16 x 5
## ID Gender Age Race Education
## <chr> <chr> <chr> <chr> <chr>
## 1 id gender age race edu
## 2 83717 Female 80 Mexican American Less than 9th grade
## 3 83718 Female 60 Non-Hispanic Black High school graduate/GED or~
## 4 83719 Male 3 Mexican American <NA>
## 5 83720 Male 36 Non-Hispanic Black Some college or AA degree
## 6 83721 Male 52 Non-Hispanic White College graduate or above
## 7 83722 Male 0 Other Race - Including Multi~ <NA>
## 8 83723 Male 61 Mexican American 9-11th grade (Includes 12th~
## 9 83724 Male 80 Non-Hispanic White High school graduate/GED or~
## 10 83725 Male 7 Mexican American <NA>
## 11 83726 Male 40 Mexican American Less than 9th grade
## 12 83727 Male 26 Other Hispanic College graduate or above
## 13 83728 Female 2 Mexican American <NA>
## 14 83729 Female 42 Non-Hispanic Black College graduate or above
## 15 83730 Male 7 Other Hispanic <NA>
## 16 83731 Male 11 Non-Hispanic Asian <NA>
DO QUESTION 9 OF THE QUIZ NOW
- In which of the following scenarios do you think we would NEED to use
col_names = FALSE
? (select all that apply)
With the added col_names, you may notice that the column specification for our data is now incorrect (everything is recognized as col_character())!
This is because R now reads “id,” “gender,” “age,” “race,” and “edu” as a row of data, and since all of these are text, R recognizes every column as col_character(). This is worth noting when you are importing data into R.
We can solve this problem with this solution:
(Skip_and_Header_Names <- read_csv("data/demo_csv.csv",
                                   skip = 1,
                                   col_names = c("ID", "Gender", "Age", "Race", "Education")))
## Rows: 15 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): Gender, Race, Education
## dbl (2): ID, Age
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 15 x 5
## ID Gender Age Race Education
## <dbl> <chr> <dbl> <chr> <chr>
## 1 83717 Female 80 Mexican American Less than 9th grade
## 2 83718 Female 60 Non-Hispanic Black High school graduate/GED or~
## 3 83719 Male 3 Mexican American <NA>
## 4 83720 Male 36 Non-Hispanic Black Some college or AA degree
## 5 83721 Male 52 Non-Hispanic White College graduate or above
## 6 83722 Male 0 Other Race - Including Multi~ <NA>
## 7 83723 Male 61 Mexican American 9-11th grade (Includes 12th~
## 8 83724 Male 80 Non-Hispanic White High school graduate/GED or~
## 9 83725 Male 7 Mexican American <NA>
## 10 83726 Male 40 Mexican American Less than 9th grade
## 11 83727 Male 26 Other Hispanic College graduate or above
## 12 83728 Female 2 Mexican American <NA>
## 13 83729 Female 42 Non-Hispanic Black College graduate or above
## 14 83730 Male 7 Other Hispanic <NA>
## 15 83731 Male 11 Non-Hispanic Asian <NA>
3.5.3 Missing Values
We can also define missing values by using na =. For example, if we want to assign “Some college or AA degree” under the edu column as NA, we can use the following code:
Missing_values <- read_csv("data/demo_csv.csv",
                           na = "Some college or AA degree")
## Rows: 15 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): gender, race, edu
## dbl (2): id, age
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Missing_values
## # A tibble: 15 x 5
## id gender age race edu
## <dbl> <chr> <dbl> <chr> <chr>
## 1 83717 Female 80 Mexican American Less than 9th grade
## 2 83718 Female 60 Non-Hispanic Black High school graduate/GED or~
## 3 83719 Male 3 Mexican American NA
## 4 83720 Male 36 Non-Hispanic Black <NA>
## 5 83721 Male 52 Non-Hispanic White College graduate or above
## 6 83722 Male 0 Other Race - Including Multi~ NA
## 7 83723 Male 61 Mexican American 9-11th grade (Includes 12th~
## 8 83724 Male 80 Non-Hispanic White High school graduate/GED or~
## 9 83725 Male 7 Mexican American NA
## 10 83726 Male 40 Mexican American Less than 9th grade
## 11 83727 Male 26 Other Hispanic College graduate or above
## 12 83728 Female 2 Mexican American NA
## 13 83729 Female 42 Non-Hispanic Black College graduate or above
## 14 83730 Male 7 Other Hispanic NA
## 15 83731 Male 11 Non-Hispanic Asian NA
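Notice that the table above prints some entries as NA rather than <NA>. By default, read_csv() treats empty cells and the text "NA" as missing (na = c("", "NA")); supplying our own na value replaces that default, so the text "NA" is then kept as an ordinary string. The sketch below illustrates this on a hypothetical three-row csv and shows one way to keep the defaults while adding a new missing-value label.

```r
library(readr)

# Hypothetical file in which one education value is the text "NA"
edu_path <- tempfile(fileext = ".csv")
writeLines(c("id,edu",
             "1,Some college or AA degree",
             "2,NA",
             "3,College graduate or above"), edu_path)

# Only our custom label counts as missing; the text "NA" stays a string
only_custom <- read_csv(edu_path,
                        na = "Some college or AA degree",
                        show_col_types = FALSE)

# Keep the defaults ("" and "NA") and add our custom label to them
with_defaults <- read_csv(edu_path,
                          na = c("", "NA", "Some college or AA degree"),
                          show_col_types = FALSE)

print(is.na(only_custom$edu))
print(is.na(with_defaults$edu))
```

Which version is appropriate depends on whether the original "NA" entries should still count as missing in the analysis.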
DO QUESTION 10 OF THE QUIZ NOW
- Only characters can be assigned a value of NA; there is a different missing-value designation for numeric values. (True or False)
3.6 Importing Other File Types into R
While csv is the most common file type to import into R, we can also import other types of data files into R using different functions. In this section, you will be introduced to the very basics of how to import txt, xlsx, xpt, and sas files into R.
3.6.1 Text file (txt)
The simplest function that we can use to import a txt file is read.table(). This function comes with base R (it lives in the utils package, which is attached by default), so we do not need to install or attach any packages before using it!
The first argument of this function is the file path. What do you think header = TRUE means?
read.table("data/demo_txt.txt", header = TRUE)
## id gender age
## 1 83717 Female 80
## 2 83718 Female 60
## 3 83719 Male 3
## 4 83720 Male 36
## 5 83721 Male 52
## 6 83722 Male 0
## 7 83723 Male 61
## 8 83724 Male 80
## 9 83725 Male 7
## 10 83726 Male 40
## 11 83727 Male 26
## 12 83728 Female 2
## 13 83729 Female 42
## 14 83730 Male 7
## 15 83731 Male 11
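To check your answer about header = TRUE, it can help to import the same file both ways. This is a small sketch on a hypothetical space-separated txt file written to a temporary location, not the tutorial's demo_txt.txt.

```r
# Hypothetical space-separated text file with a header line
txt_path <- tempfile(fileext = ".txt")
writeLines(c("id gender age",
             "83717 Female 80",
             "83718 Male 60"), txt_path)

# header = TRUE: the first line supplies the column names
with_header <- read.table(txt_path, header = TRUE)

# header = FALSE: the first line is treated as data and R invents names
no_header <- read.table(txt_path, header = FALSE)

print(names(with_header))
print(names(no_header))
print(nrow(no_header))
```

Comparing the column names and the number of rows of the two results should make the role of the argument clear.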
3.6.2 Excel file (xlsx)
To import an xlsx file into R, we use read_excel(). But before we can use this function, we need to install and attach the readxl package. Similar to read.table(), this function requires a file path.
# install.packages("readxl")
library(readxl)
read_excel("data/demo_xlsx.xlsx")
## # A tibble: 15 x 5
## id gender age race edu
## <dbl> <chr> <dbl> <chr> <chr>
## 1 83717 Female 80 Mexican American Less than 9th grade
## 2 83718 Female 60 Non-Hispanic Black High school graduate/GED or~
## 3 83719 Male 3 Mexican American NA
## 4 83720 Male 36 Non-Hispanic Black Some college or AA degree
## 5 83721 Male 52 Non-Hispanic White College graduate or above
## 6 83722 Male 0 Other Race - Including Multi~ NA
## 7 83723 Male 61 Mexican American 9-11th grade (Includes 12th~
## 8 83724 Male 80 Non-Hispanic White High school graduate/GED or~
## 9 83725 Male 7 Mexican American NA
## 10 83726 Male 40 Mexican American Less than 9th grade
## 11 83727 Male 26 Other Hispanic College graduate or above
## 12 83728 Female 2 Mexican American NA
## 13 83729 Female 42 Non-Hispanic Black College graduate or above
## 14 83730 Male 7 Other Hispanic NA
## 15 83731 Male 11 Non-Hispanic Asian NA
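Unlike a csv file, an Excel workbook can hold several sheets. The sketch below is an illustration that does not use the tutorial's demo_xlsx.xlsx; instead it assumes the example workbook datasets.xlsx that ships with the readxl package, located with readxl_example(). It lists the sheets with excel_sheets() and imports one of them by name with the sheet argument of read_excel().

```r
library(readxl)

# Path to an example workbook bundled with readxl (not the tutorial's file)
example_path <- readxl_example("datasets.xlsx")

# List the names of the sheets in the workbook
sheet_names <- excel_sheets(example_path)
print(sheet_names)

# Import one particular sheet by name
mtcars_sheet <- read_excel(example_path, sheet = "mtcars")
```

If no sheet is given, read_excel() imports the first sheet of the workbook.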
Try it yourself 3.5
Import the bpx.xlsx file into R using the read_excel() function.
3.6.3 XPT File Extension
Another file type that you may need to import into R is xpt. To do this, we need the function read.xport() that belongs to the SASxport package.
# install.packages("SASxport")
library(SASxport)
read.xport("data/demo_xpt.xpt")
## ID GENDER RACE
## 1 83717 Female Mexican American
## 2 83718 Female Non-Hispanic Black
## 3 83719 Male Mexican American
## 4 83720 Male Non-Hispanic Black
## 5 83721 Male Non-Hispanic White
## 6 83722 Male Other Race - Including Multi-Rac
## 7 83723 Male Mexican American
## 8 83724 Male Non-Hispanic White
## 9 83725 Male Mexican American
## 10 83726 Male Mexican American
## 11 83727 Male Other Hispanic
## 12 83728 Female Mexican American
## 13 83729 Female Non-Hispanic Black
## 14 83730 Male Other Hispanic
## 15 83731 Male Other Race - Including Multi-Rac
3.6.4 Statistical Analysis Software (SAS)
We can use the function read_sas() to import sas files into R. But before we do this, we need to install and attach the haven package.
# install.packages("haven")
library("haven")
read_sas("data/demo_sas.sas")
## # A tibble: 15 x 3
## id gender race
## <dbl> <dbl> <dbl>
## 1 83717 2 1
## 2 83718 2 4
## 3 83719 1 1
## 4 83720 1 4
## 5 83721 1 3
## 6 83722 1 5
## 7 83723 1 1
## 8 83724 1 3
## 9 83725 1 1
## 10 83726 1 1
## 11 83727 1 2
## 12 83728 2 1
## 13 83729 2 4
## 14 83730 1 2
## 15 83731 1 5
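The haven package can also read xpt files through its read_xpt() function, which may be a convenient alternative when haven is already attached. The sketch below is an illustration on made-up data, not the tutorial's demo_xpt.xpt: it writes a small hypothetical data frame to a temporary xpt file with write_xpt() and reads it back.

```r
library(haven)

# Hypothetical two-row data set
small <- data.frame(
  id = c(83717, 83718),
  gender = c("Female", "Male"),
  stringsAsFactors = FALSE
)

# Write it to a temporary xpt file, then import it again
xpt_path <- tempfile(fileext = ".xpt")
write_xpt(small, xpt_path)
xpt_back <- read_xpt(xpt_path)

print(xpt_back)
```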
3.7 Exporting the Data Frame From R
After changing and manipulating our data in R, we can also export it back into a csv file to share it. To do this, we can use the function write_csv(). For example, let’s say we want to export our “Missing_values” data frame.
write_csv(Missing_values, "data/Missing Values.csv")
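We can also export a csv file that is easier to open in Excel by using write_excel_csv(). Note that this function still writes a csv file (with a UTF-8 byte order mark that helps Excel read the text encoding correctly), not a true xlsx workbook, so we keep the .csv extension.
write_excel_csv(Missing_values, "data/Missing Values Excel.csv")
Now if we check the data folder inside our working directory, there should be 2 new files, “Missing Values.csv” and “Missing Values Excel.csv”! Because both files are saved inside the data folder, they will not appear in the top-level listing below; running dir("data") would list them.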
dir()
## [1] "_book"
## [2] "_bookdown.yml"
## [3] "_bookdown_files"
## [4] "_build.sh"
## [5] "_deploy.sh"
## [6] "_output.yml"
## [7] "0-r-and-rstudio-set-up.Rmd"
## [8] "1-introduction-to-r.Rmd"
## [9] "2-importing-data-into-r-with-readr.Rmd"
## [10] "3-introduction-to-nhanes.Rmd"
## [11] "4-data-analysis-with-dplyr.Rmd"
## [12] "5-data-visualization-with-ggplot.Rmd"
## [13] "6-date-time-data-with-lubridate.Rmd"
## [14] "7-data-summary-with-tableone.Rmd"
## [15] "8-Exercise-Solutions.Rmd"
## [16] "9-references.Rmd"
## [17] "book.bib"
## [18] "data"
## [19] "DESCRIPTION"
## [20] "Dockerfile"
## [21] "docs"
## [22] "header.html"
## [23] "images"
## [24] "index.Rmd"
## [25] "intro2R.log"
## [26] "intro2R.Rmd"
## [27] "intro2R.tex"
## [28] "intro2R_cache"
## [29] "intro2R_files"
## [30] "LICENSE"
## [31] "now.json"
## [32] "packages.bib"
## [33] "preamble.tex"
## [34] "R.Rproj"
## [35] "README.md"
## [36] "style.css"
## [37] "toc.css"
Congratulations! We have now succeeded in exporting a dataset from R to an external file! This will make our work much easier to share and access!
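A quick way to confirm that an export worked is to read the file back in and compare it with the original. The following sketch is an illustration under assumptions: the two-row data frame and the temporary file paths are hypothetical. It also inspects the first three bytes of the file written by write_excel_csv() to show the UTF-8 byte order mark that distinguishes it from the output of write_csv().

```r
library(readr)

# Hypothetical data frame with one missing value
original <- data.frame(
  id = c(83717, 83718),
  edu = c("Less than 9th grade", NA),
  stringsAsFactors = FALSE
)

# Export with both functions to temporary files
csv_path <- tempfile(fileext = ".csv")
excel_csv_path <- tempfile(fileext = ".csv")
write_csv(original, csv_path)
write_excel_csv(original, excel_csv_path)

# Read the plain csv back in and compare it with what we exported
back <- read_csv(csv_path, show_col_types = FALSE)
same_ids <- identical(back$id, original$id)
same_missing <- identical(is.na(back$edu), is.na(original$edu))

# The Excel-friendly csv starts with the UTF-8 byte order mark (EF BB BF)
first_bytes <- readBin(excel_csv_path, what = "raw", n = 3)

print(same_ids)
print(same_missing)
print(first_bytes)
```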
3.8 Summary and Takeaways
In this tutorial, we have learned how to import csv files from our hard drive into R using read_csv(). This is an important first step in data analysis or manipulation, since we need to have the data in R in order to process it!