Chapter 6 Importing NHANES to R
This is a short instruction document of how to get NHANES dataset from the US CDC site to your RStudio environment. Once we bring the dataset into RStudio, the next step is to think about creating analytic dataset.
6.1 NHANES Dataset
National Center for Health Statistics (NCHS) conducts National Health and Nutrition Examination Survey (NHANES) (CDC,NCHS (2018)). These surveys are designed to evaluate the health and nutritional status of U.S. adults and children. These surveys are being administered in two-year cycles or intervals starting from 1999-2000. Prior to 1999, a number of surveys were conducted (e.g., NHANES III), but in our discussion, we will mostly restrict our discussions to `continuous NHANES’ (e.g., NHANES 1999-2000 to NHANES 2017-2018).
Witin the CDC website, continuous NHANES data are available in 5 categories:
- Demographics
- Dietary
- Examination
- Laboratory
- Questionnaire
6.2 Accessing NHANES Data
In the following example, we will see how to download ‘Demographics’ data, and check associated variable in that data.
6.2.1 Accessing NHANES Data Directly from the CDC website
NHANES 1999-2000 and onward survey datasets are publicly available at wwwn.cdc.gov/nchs/nhanes/.
- Step 1: Say, for example, we are interested about NHANES 2015-2016 surveys. Clicking the associated link in the above Figure gets us to the page for the cirresponding cycle (see below).
- Step 2: There are various types of data available for this survey. Let’s explore the demographic information from this clycle. These data are mostly available in the form of SAS `XPT’ format (see below).
- Step 3: We can download the XPT data in the local PC folder and read the data into R as as follows:
# install.packages("SASxport")
require(SASxport)
library(foreign)
<- read.xport("SurveyData\\DEMO_I.XPT") DEMO
##
## Attaching package: 'foreign'
## The following objects are masked from 'package:SASxport':
##
## lookup.xport, read.xport
- Step 4: Once data is imported in RStudio, we will see the
DEMO
object listed under data window (see below):
- Step 5: We can also check the variable names in this
DEMO
dataset as follows:
names(DEMO)
## [1] "SEQN" "SDDSRVYR" "RIDSTATR" "RIAGENDR" "RIDAGEYR" "RIDAGEMN"
## [7] "RIDRETH1" "RIDRETH3" "RIDEXMON" "RIDEXAGM" "DMQMILIZ" "DMQADFC"
## [13] "DMDBORN4" "DMDCITZN" "DMDYRSUS" "DMDEDUC3" "DMDEDUC2" "DMDMARTL"
## [19] "RIDEXPRG" "SIALANG" "SIAPROXY" "SIAINTRP" "FIALANG" "FIAPROXY"
## [25] "FIAINTRP" "MIALANG" "MIAPROXY" "MIAINTRP" "AIALANGA" "DMDHHSIZ"
## [31] "DMDFMSIZ" "DMDHHSZA" "DMDHHSZB" "DMDHHSZE" "DMDHRGND" "DMDHRAGE"
## [37] "DMDHRBR4" "DMDHREDU" "DMDHRMAR" "DMDHSEDU" "WTINT2YR" "WTMEC2YR"
## [43] "SDMVPSU" "SDMVSTRA" "INDHHIN2" "INDFMIN2" "INDFMPIR"
- Step 6: We can open the data in RStudio in the dataview window (by clicking the
DEMO
data from the data window). The next Figure shows only a few columns and rows from this large dataset. Note that there are some values marked as “NA”, which represents missing values.
- Step 7: There is a column name associated with each column, e.g.,
DMDHSEDU
in the first column in the above Figure. To understand what the column names mean in this Figure, we need to take a look at the codebook. To access codebook, click the'DEMO|Doc'
link (in step 2). This will show the data documentation and associated codebook (see the next Figure).
- Step 8: We can see a link for the column or variable
DMDHSEDU
in the table of content (in the above Figure). Clicking that link will provide us further information about what this variable means (see the next Figure).
- Step 9: We can assess if the numbers reported under count and cumulative (from the above Figure) matches with what we get from the
DEMO
data we just imported (particularly, for theDMDHSEDU
variable):
table(DEMO$DMDHSEDU)
##
## 1 2 3 4 5 7 9
## 619 511 980 1462 1629 2 23
cumsum(table(DEMO$DMDHSEDU))
## 1 2 3 4 5 7 9
## 619 1130 2110 3572 5201 5203 5226
length(is.na(DEMO$DMDHSEDU))
## [1] 9971
6.2.2 Accessing NHANES Data Using R Packages
6.2.2.1 nhanesA
nhanesA
provides a convenient way to download and analyze NHANES survey data.
#install.packages("nhanesA")
library(nhanesA)
- Step 1: Witin the CDC website, NHANES data are available in 5 categories
- Demographics (
DEMO
) - Dietary (
DIET
) - Examination (
EXAM
) - Laboratory (
LAB
) - Questionnaire (
Q
)
- Demographics (
To get a list of available variables within a datafile, we run the following command (e.g., we check variable names within DEMO
data):
library(nhanesA)
nhanesTables(data_group='DEMO', year=2015)
## Data.File.Name Data.File.Description
## 1 DEMO_I Demographic Variables and Sample Weights
- Step 2: We can obtain the summaries of the downloaded data as follows (see below):
<- nhanes('DEMO_I')
demo names(demo)
## [1] "SEQN" "SDDSRVYR" "RIDSTATR" "RIAGENDR" "RIDAGEYR" "RIDAGEMN"
## [7] "RIDRETH1" "RIDRETH3" "RIDEXMON" "RIDEXAGM" "DMQMILIZ" "DMQADFC"
## [13] "DMDBORN4" "DMDCITZN" "DMDYRSUS" "DMDEDUC3" "DMDEDUC2" "DMDMARTL"
## [19] "RIDEXPRG" "SIALANG" "SIAPROXY" "SIAINTRP" "FIALANG" "FIAPROXY"
## [25] "FIAINTRP" "MIALANG" "MIAPROXY" "MIAINTRP" "AIALANGA" "DMDHHSIZ"
## [31] "DMDFMSIZ" "DMDHHSZA" "DMDHHSZB" "DMDHHSZE" "DMDHRGND" "DMDHRAGE"
## [37] "DMDHRBR4" "DMDHREDU" "DMDHRMAR" "DMDHSEDU" "WTINT2YR" "WTMEC2YR"
## [43] "SDMVPSU" "SDMVSTRA" "INDHHIN2" "INDFMIN2" "INDFMPIR"
table(demo$DMDHSEDU)
##
## 1 2 3 4 5 7 9
## 619 511 980 1462 1629 2 23
cumsum(table(demo$DMDHSEDU))
## 1 2 3 4 5 7 9
## 619 1130 2110 3572 5201 5203 5226
length(is.na(demo$DMDHSEDU))
## [1] 9971
6.2.2.2 RNHANES
RNHANES
(Susmann (2016)) is another packages for downloading the NHANES data easily. Try yourself.