Import external data
When dealing with data analysis in R, it’s common to need to import external data. This tutorial will walk you through importing data in different formats.
CSV format data
CSV stands for “Comma-Separated Values” and it’s a widely used format for data. We’ll be looking at the “Employee Salaries - 2017” dataset, which contains salary information for permanent employees of Montgomery County in 2017.
We’ll be loading the Employee_Salaries_-_2017.csv
dataset into R from its saved location at Data/wrangling/
. Do note, the directory path might vary for you based on where you’ve stored the downloaded data.
Here, the read.csv
function reads the data from the CSV file and stores it in a variable called data.download
.
To understand the structure of our dataset, We can see the number of rows and columns and the names of the columns/variables as follows:
dim(data.download) # check dimension / row / column numbers
#> [1] 9398 12
nrow(data.download) # check row numbers
#> [1] 9398
names(data.download) # check column names
#> [1] "Full.Name" "Gender"
#> [3] "Current.Annual.Salary" "X2017.Gross.Pay.Received"
#> [5] "X2017.Overtime.Pay" "Department"
#> [7] "Department.Name" "Division"
#> [9] "Assignment.Category" "Employee.Position.Title"
#> [11] "Position.Under.Filled" "Date.First.Hired"
head
shows the first 6 elements of an object, giving you a sneak peek into the data you’re dealing with, while tail
shows the last 6 elements.
We can see the first see six rows of the dataset as follows:
Next, for learning purposes, let’s artificially assign all male genders in our dataset as missing:
# Assigning male gender as missing
data.download$Gender[data.download$Gender == "M"] <- NA
head(data.download)
This chunk sets the Gender column’s value to NA (missing) wherever the gender is “M”. This is a form of data manipulation, sometimes used to handle missing or incorrect data. If you want to work with datasets that exclude any missing values:
na.omit
and complete.cases
are useful functions to to create datasets with non-NA values
# deleting/dropping missing components
data.download2 <- na.omit(data.download)
head(data.download2)
Here, na.omit is used to remove rows with any missing values. This can be essential when preparing data for certain analyses.
Alternatively, we could have selected only females to drop all males:
And to check the size of this new dataset:
SAS format data
SAS is another data format, commonly used in professional statistics and analytics.
Let’s explore importing a SAS dataset. We download a SAS formatted dataset from the CDC website.
# Link
x <- "https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT"
# Data
NHANES1516data <- sasxport.get(x)
#> Processing SAS dataset DEMO_I ..
# Check dimension / row / column numbers
dim(NHANES1516data)
#> [1] 9971 47
# Check row numbers
nrow(NHANES1516data)
#> [1] 9971
# Check first 10 names
names(NHANES1516data)[1:10]
#> [1] "seqn" "sddsrvyr" "ridstatr" "riagendr" "ridageyr" "ridagemn"
#> [7] "ridreth1" "ridreth3" "ridexmon" "ridexagm"
The sasxport.get
function retrieves the SAS dataset. The following lines, just like before, help understand its structure.
To analyze some of the data:
Verify these numbers from CDC website
This code creates a frequency table of the riagendr
variable, which represents gender.