Importing NHANES to R
This tutorial provides comprehensive instructions on accessing the National Health and Nutrition Examination Survey (NHANES) dataset from the US Centers for Disease Control and Prevention (CDC) website and importing it into the RStudio environment. It covers two ways of accessing NHANES data:
- Directly from the CDC website: A step-by-step guide with accompanying images, illustrating how to navigate the CDC website, download the data, and interpret the accompanying codebook.
- Using R packages, specifically the nhanesA package: A concise guide on how to download and get summaries of the NHANES data using this R package.
Before installing a package from GitHub, it is a good idea to check that you have the right version of Rtools installed.
Accessing NHANES Data Directly from the CDC website
In the following example, we will see how to download the ‘Demographics’ data and check the associated variables in that dataset.
NHANES 1999-2000 and onward survey datasets are publicly available at wwwn.cdc.gov/nchs/nhanes/
- Step 1: Say, for example, we are interested in the NHANES 2015-2016 survey. Clicking the associated link in the above Figure takes us to the page for the corresponding cycle (see below).
- Step 2: There are various types of data available for this survey. Let’s explore the demographic information from this cycle. These data are mostly available in SAS XPT format (see below).
- Step 3: We can download the XPT file to a local folder and read the data into R as follows:
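A minimal sketch of this step, assuming the downloaded file is named DEMO_I.XPT and was saved in a hypothetical local folder called data. It uses read.xport() from the foreign package; haven::read_xpt() is a common alternative. The path and the choice of package are assumptions, not necessarily what the original tutorial used.

```r
library(foreign)  # recommended package bundled with most R installations

# Hypothetical location: adjust to wherever you saved the downloaded file
xpt_path <- file.path("data", "DEMO_I.XPT")

if (file.exists(xpt_path)) {
  DEMO <- read.xport(xpt_path)  # reads the SAS transport file into a data frame
  print(dim(DEMO))              # number of rows and columns
} else {
  message("File not found: ", xpt_path)
}
```

The file.exists() guard only keeps the sketch from failing when the file is somewhere else; once the path is correct, the single read.xport() call is all that is needed.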
- Step 4: Once the data is imported in RStudio, we will see the DEMO object listed under the data window (see below):
- Step 5: We can also check the variable names in this DEMO dataset as follows:
names(DEMO)
#> [1] "SEQN" "SDDSRVYR" "RIDSTATR" "RIAGENDR" "RIDAGEYR" "RIDAGEMN"
#> [7] "RIDRETH1" "RIDRETH3" "RIDEXMON" "RIDEXAGM" "DMQMILIZ" "DMQADFC"
#> [13] "DMDBORN4" "DMDCITZN" "DMDYRSUS" "DMDEDUC3" "DMDEDUC2" "DMDMARTL"
#> [19] "RIDEXPRG" "SIALANG" "SIAPROXY" "SIAINTRP" "FIALANG" "FIAPROXY"
#> [25] "FIAINTRP" "MIALANG" "MIAPROXY" "MIAINTRP" "AIALANGA" "DMDHHSIZ"
#> [31] "DMDFMSIZ" "DMDHHSZA" "DMDHHSZB" "DMDHHSZE" "DMDHRGND" "DMDHRAGE"
#> [37] "DMDHRBR4" "DMDHREDU" "DMDHRMAR" "DMDHSEDU" "WTINT2YR" "WTMEC2YR"
#> [43] "SDMVPSU" "SDMVSTRA" "INDHHIN2" "INDFMIN2" "INDFMPIR"
- Step 6: We can open the data in RStudio in the data view window (by clicking the DEMO data in the data window). The next Figure shows only a few columns and rows from this large dataset. Note that some values are marked as “NA”, which represents a missing value.
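As a small illustration of how missing values can be counted, the sketch below uses a toy data frame with hypothetical values (not real NHANES records). With the imported data, the same call would be applied to DEMO instead of toy.

```r
# Toy stand-in for a few DEMO columns (hypothetical values)
toy <- data.frame(
  SEQN     = c(83732, 83733, 83734),
  DMDHSEDU = c(4, NA, 3),
  DMDEDUC3 = c(NA, NA, NA)
)

# is.na() flags each missing cell; colSums() counts the flags per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)
```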
- Step 7: There is a column name associated with each column, e.g., DMDHSEDU in the first column in the above Figure. To understand what the column names in this Figure mean, we need to look at the codebook. To access the codebook, click the 'DEMO|Doc' link (in Step 2). This will show the data documentation and associated codebook (see the next Figure).
- Step 8: We can see a link for the column or variable DMDHSEDU in the table of contents (in the above Figure). Clicking that link gives us further information about what this variable means (see the next Figure).
- Step 9: We can check whether the numbers reported under count and cumulative (in the above Figure) match what we get from the DEMO data we just imported (in particular, for the DMDHSEDU variable):
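One way to make this comparison is sketched below. To keep the example self-contained, it uses a short toy vector with hypothetical values in place of DEMO$DMDHSEDU; with the imported data, replace the toy vector with that column.

```r
# Toy stand-in for DEMO$DMDHSEDU (hypothetical values, not real NHANES data)
DMDHSEDU <- c(1, 2, 2, 3, NA, 5, 5, 5, NA, 9)

freq <- table(DMDHSEDU)            # count per code; NA values are dropped by default
cum_freq <- cumsum(freq)           # running total, comparable to the cumulative column
n_missing <- sum(is.na(DMDHSEDU))  # comparable to the codebook's "Missing" row

print(freq)
print(cum_freq)
print(n_missing)
```

Because table() drops NA values by default, the last cumulative value equals the number of non-missing observations, and the missing count has to be checked separately.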
Accessing NHANES Data Using R Packages
nhanesA package
The R package nhanesA provides a convenient way to download and analyze NHANES survey data.
RNHANES (Susmann 2016) is another package for downloading the NHANES data easily.
- Step 1: Within the CDC website, NHANES data are available in 5 categories:
  - Demographics (DEMO)
  - Dietary (DIET)
  - Examination (EXAM)
  - Laboratory (LAB)
  - Questionnaire (Q)
To get a list of available variables within a data file, we run the following command (e.g., we check variable names within DEMO data):
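A minimal sketch of such a lookup, assuming the nhanesA package is already installed (for example from CRAN) and that the CDC server is reachable. It uses nhanesTableVars() with the DEMO data group and the DEMO_I table for the 2015-2016 cycle; the guard and the tryCatch() wrapper are additions so that the sketch degrades gracefully when the package or the network is unavailable.

```r
data_group <- "DEMO"    # one of DEMO, DIET, EXAM, LAB, Q
table_name <- "DEMO_I"  # demographics table for the 2015-2016 cycle

if (requireNamespace("nhanesA", quietly = TRUE)) {
  # Returns variable names and descriptions for the chosen table;
  # any download problem is reported as a message instead of stopping the script
  demo_vars <- tryCatch(
    nhanesA::nhanesTableVars(data_group = data_group, nh_table = table_name),
    error = function(e) {
      message("Could not retrieve variable list: ", conditionMessage(e))
      NULL
    }
  )
  print(head(demo_vars))
} else {
  message("Package 'nhanesA' is not installed.")
}
```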
- Step 2: We can obtain the summaries of the downloaded data as follows (see below):
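library(nhanesA) # Load the package that provides nhanes()
demo <- nhanes('DEMO_I') # Download the 2015-2016 demographics table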
names(demo)
#> [1] "SEQN" "SDDSRVYR" "RIDSTATR" "RIAGENDR" "RIDAGEYR" "RIDAGEMN"
#> [7] "RIDRETH1" "RIDRETH3" "RIDEXMON" "RIDEXAGM" "DMQMILIZ" "DMQADFC"
#> [13] "DMDBORN4" "DMDCITZN" "DMDYRSUS" "DMDEDUC3" "DMDEDUC2" "DMDMARTL"
#> [19] "RIDEXPRG" "SIALANG" "SIAPROXY" "SIAINTRP" "FIALANG" "FIAPROXY"
#> [25] "FIAINTRP" "MIALANG" "MIAPROXY" "MIAINTRP" "AIALANGA" "DMDHHSIZ"
#> [31] "DMDFMSIZ" "DMDHHSZA" "DMDHHSZB" "DMDHHSZE" "DMDHRGND" "DMDHRAGE"
#> [37] "DMDHRBR4" "DMDHREDU" "DMDHRMAR" "DMDHSEDU" "WTINT2YR" "WTMEC2YR"
#> [43] "SDMVPSU" "SDMVSTRA" "INDHHIN2" "INDFMIN2" "INDFMPIR"
table(demo$DMDHSEDU) # Frequency table
#>
#> 1 2 3 4 5 7 9
#> 619 511 980 1462 1629 2 23
cumsum(table(demo$DMDHSEDU)) # Cumulative frequency table
#> 1 2 3 4 5 7 9
#> 619 1130 2110 3572 5201 5203 5226
sum(!is.na(demo$DMDHSEDU)) # Number of non-NA observations
#> [1] 5226
length(demo$DMDHSEDU) # Total number of observations, including NA
#> [1] 9971
Import data issue
Sometimes, you might see a warning message when downloading NHANES data using an R package, for example: simpleWarning in download.file(url, tf, mode = "wb", quiet = TRUE): cannot open URL, or 404 data Not Found.
A possible reason is that the NHANES server was down when you tried to connect. In that case, run the same code again later. Also, check the variable names carefully: the same variable can have different names in different survey cycles, and some variables are not available in all cycles.
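If the failure is temporary, retrying after a short pause can help. The helper below is a hypothetical sketch and not part of nhanesA: with_retry() and its arguments are illustrative names. It treats an error, a warning, or a NULL result as a failed attempt.

```r
# Hypothetical helper (not part of nhanesA): call `fetch` up to `max_tries` times
with_retry <- function(fetch, max_tries = 3, wait_seconds = 5) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(
      fetch(),
      error = function(e) {
        message("Attempt ", attempt, " failed: ", conditionMessage(e))
        NULL
      },
      warning = function(w) {
        message("Attempt ", attempt, " warned: ", conditionMessage(w))
        NULL
      }
    )
    if (!is.null(result)) return(result)            # success: stop retrying
    if (attempt < max_tries) Sys.sleep(wait_seconds)  # pause before the next try
  }
  NULL  # every attempt failed
}

# Usage (requires nhanesA and a network connection):
# demo <- with_retry(function() nhanesA::nhanes("DEMO_I"))
```

Retrying does not help when the table or variable name is wrong for the chosen cycle, so a repeated 404 is a hint to check the name against the codebook.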