Importing NHANES to R

This tutorial explains how to access the National Health and Nutrition Examination Survey (NHANES) data from the US Centers for Disease Control and Prevention (CDC) website and import it into R using RStudio. It covers two routes: downloading the data files directly from the CDC website, and downloading them with R packages.

# Load required packages
#devtools::install_github("warnes/SASxport")
library(SASxport)
library(foreign)
library(nhanesA)
library(knitr)
require(DiagrammeR)
require(DiagrammeRsvg)
require(rsvg)
library(magrittr)
library(svglite)
library(png)
use.saved.chche <- TRUE

Before installing a package from GitHub, it is advisable to check that the version of Rtools installed on your computer matches your version of R (Rtools is needed on Windows only).
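
Packages installed from GitHub are built from source, and on Windows packages that contain compiled code need Rtools for that. The following is a rough check in base R, not a definitive test: it assumes that a usable build toolchain makes the `make` program visible on the search path.

```r
# Report the running R version; on Windows, the Rtools release should match it.
r_version <- paste(R.version$major, R.version$minor, sep = ".")
print(r_version)

# Sys.which() returns "" when a program is not found on the PATH.
# Assumption: a usable build toolchain makes `make` visible to R.
has_make <- unname(nzchar(Sys.which("make")))
print(has_make)
```

If `has_make` is `FALSE` on Windows, installing or reinstalling Rtools is a reasonable first step before trying the GitHub installation again.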

Accessing NHANES Data Directly from the CDC website

In the following example, we will see how to download the ‘Demographics’ data and check the variables in that dataset.

NHANES survey datasets from the 1999-2000 cycle onward are publicly available at wwwn.cdc.gov/nchs/nhanes/.

  • Step 1: Say, for example, we are interested in the NHANES 2015-2016 survey. Clicking the associated link in the above Figure takes us to the page for the corresponding cycle (see below).

  • Step 2: There are various types of data available for this survey. Let’s explore the demographic information from this cycle. These data are mostly available in SAS transport (XPT) format (see below).

  • Step 3: We can download the XPT file to a local folder and read the data into R as follows:
DEMO <- read.xport("Data/accessing/DEMO_I.XPT")
  • Step 4: Once the data are imported in RStudio, we will see the DEMO object listed in the data window (see below):

  • Step 5: We can also check the variable names in this DEMO dataset as follows:
names(DEMO)
#>  [1] "SEQN"     "SDDSRVYR" "RIDSTATR" "RIAGENDR" "RIDAGEYR" "RIDAGEMN"
#>  [7] "RIDRETH1" "RIDRETH3" "RIDEXMON" "RIDEXAGM" "DMQMILIZ" "DMQADFC" 
#> [13] "DMDBORN4" "DMDCITZN" "DMDYRSUS" "DMDEDUC3" "DMDEDUC2" "DMDMARTL"
#> [19] "RIDEXPRG" "SIALANG"  "SIAPROXY" "SIAINTRP" "FIALANG"  "FIAPROXY"
#> [25] "FIAINTRP" "MIALANG"  "MIAPROXY" "MIAINTRP" "AIALANGA" "DMDHHSIZ"
#> [31] "DMDFMSIZ" "DMDHHSZA" "DMDHHSZB" "DMDHHSZE" "DMDHRGND" "DMDHRAGE"
#> [37] "DMDHRBR4" "DMDHREDU" "DMDHRMAR" "DMDHSEDU" "WTINT2YR" "WTMEC2YR"
#> [43] "SDMVPSU"  "SDMVSTRA" "INDHHIN2" "INDFMIN2" "INDFMPIR"
  • Step 6: We can open the data in the RStudio data viewer (by clicking the DEMO object in the data window). The next Figure shows only a few columns and rows from this large dataset. Note that some values are marked as “NA”, which represents a missing value.

  • Step 7: There is a column name associated with each column, e.g., DMDHSEDU in the first column in the above Figure. To understand what the column names mean, we need to look at the codebook. To access the codebook, click the 'DEMO|Doc' link (in step 2). This will show the data documentation and associated codebook (see the next Figure).

  • Step 8: We can see a link for the column or variable DMDHSEDU in the table of contents (in the above Figure). Clicking that link gives us further information about what this variable means (see the next Figure).

  • Step 9: We can check whether the numbers reported under count and cumulative (from the above Figure) match what we get from the DEMO data we just imported (particularly, for the DMDHSEDU variable):
table(DEMO$DMDHSEDU) # Frequency table
#> 
#>    1    2    3    4    5    7    9 
#>  619  511  980 1462 1629    2   23
cumsum(table(DEMO$DMDHSEDU)) # Cumulative frequency table
#>    1    2    3    4    5    7    9 
#>  619 1130 2110 3572 5201 5203 5226
length(DEMO$DMDHSEDU) # Total number of observations, including missing values (NA)
#> [1] 9971
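
The comparison with the codebook in step 9 relies on three different counts: recorded codes, missing values, and all rows. The sketch below rebuilds a codebook-style count table on a small made-up vector (the values in `x` are illustrative, not NHANES data) so that the role of each count is easy to see.

```r
# Toy stand-in for a coded survey variable; NA marks a missing value.
x <- c(1, 2, 2, 3, NA, 5, NA, 3, 3, 9)

# Count and cumulative count per code, laid out like a codebook table.
counts <- table(x)  # table() drops NA by default
codebook_style <- data.frame(
  code       = names(counts),
  count      = as.vector(counts),
  cumulative = cumsum(as.vector(counts))
)
print(codebook_style)

n_total   <- length(x)       # all rows, missing or not
n_missing <- sum(is.na(x))   # rows with NA
n_present <- sum(!is.na(x))  # rows with a recorded code

# The three numbers must be consistent with each other.
stopifnot(n_present + n_missing == n_total)
```

Read the same way, the output in step 9 shows 5226 recorded codes for DMDHSEDU out of 9971 rows in total, which leaves 9971 - 5226 = 4745 missing values.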

Accessing NHANES Data Using R Packages

nhanesA package

library(nhanesA)
Tip

R package nhanesA provides a convenient way to download and analyze NHANES survey data.

RNHANES (Susmann 2016) is another package for downloading NHANES data easily.

  • Step 1: Within the CDC website, NHANES data are available in 5 categories:
    • Demographics (DEMO)
    • Dietary (DIET)
    • Examination (EXAM)
    • Laboratory (LAB)
    • Questionnaire (Q)

To get a list of the data files (tables) available within a category for a given survey cycle, we run the following command (e.g., we list the tables in the DEMO category for the 2015-2016 cycle):

nhanesTables(data_group='DEMO', year=2015)
  • Step 2: We can download a data file by its name and summarize it as follows (see below):
demo <- nhanes('DEMO_I')
names(demo)
#>  [1] "SEQN"     "SDDSRVYR" "RIDSTATR" "RIAGENDR" "RIDAGEYR" "RIDAGEMN"
#>  [7] "RIDRETH1" "RIDRETH3" "RIDEXMON" "RIDEXAGM" "DMQMILIZ" "DMQADFC" 
#> [13] "DMDBORN4" "DMDCITZN" "DMDYRSUS" "DMDEDUC3" "DMDEDUC2" "DMDMARTL"
#> [19] "RIDEXPRG" "SIALANG"  "SIAPROXY" "SIAINTRP" "FIALANG"  "FIAPROXY"
#> [25] "FIAINTRP" "MIALANG"  "MIAPROXY" "MIAINTRP" "AIALANGA" "DMDHHSIZ"
#> [31] "DMDFMSIZ" "DMDHHSZA" "DMDHHSZB" "DMDHHSZE" "DMDHRGND" "DMDHRAGE"
#> [37] "DMDHRBR4" "DMDHREDU" "DMDHRMAR" "DMDHSEDU" "WTINT2YR" "WTMEC2YR"
#> [43] "SDMVPSU"  "SDMVSTRA" "INDHHIN2" "INDFMIN2" "INDFMPIR"
table(demo$DMDHSEDU) # Frequency table
#> 
#>    1    2    3    4    5    7    9 
#>  619  511  980 1462 1629    2   23
cumsum(table(demo$DMDHSEDU)) # Cumulative frequency table
#>    1    2    3    4    5    7    9 
#>  619 1130 2110 3572 5201 5203 5226
length(demo$DMDHSEDU) # Total number of observations, including missing values (NA)
#> [1] 9971
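
To see what the variables in a data file mean without opening the codebook in a browser, nhanesA also offers `nhanesTableVars()`. The sketch below assumes that the installed version of nhanesA exports this function with the argument names `data_group` and `nh_table`; the `run_online` switch is an illustrative addition so that nothing is downloaded unless requested.

```r
# Set to TRUE to query the CDC website (needs nhanesA and an internet connection).
run_online <- FALSE

demo_vars <- NULL
if (run_online && requireNamespace("nhanesA", quietly = TRUE)) {
  # Variable names and descriptions for the 2015-2016 demographics file
  demo_vars <- nhanesA::nhanesTableVars(data_group = "DEMO", nh_table = "DEMO_I")
  print(head(demo_vars))
}
```

With `run_online <- TRUE`, `demo_vars` should hold one row per variable in DEMO_I, which can be searched for a name such as DMDHSEDU.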

Import data issue

Sometimes, you might see a warning or error message when downloading NHANES data using an R package. For example, simpleWarning in download.file(url, tf, mode = "wb", quiet = TRUE): cannot open URL, or 404 Not Found.
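
A failed download of this kind can be caught in a script with `tryCatch()` and attempted again after a pause. The following is a minimal sketch: `with_retry()` and `flaky_download()` are hypothetical helpers written for this example (they are not part of nhanesA), and the sketch assumes that the failing call signals a warning or an error.

```r
# Call `f` until it succeeds or `max_tries` attempts have been used.
# `with_retry` is a hypothetical helper, not part of nhanesA.
with_retry <- function(f, max_tries = 3, wait_seconds = 0) {
  for (i in seq_len(max_tries)) {
    # Turn a warning or an error into a returned condition object
    result <- tryCatch(
      f(),
      error = function(e) e,
      warning = function(w) w
    )
    if (!inherits(result, "condition")) {
      return(result)
    }
    message("Attempt ", i, " failed: ", conditionMessage(result))
    if (i < max_tries) Sys.sleep(wait_seconds)
  }
  stop("All ", max_tries, " attempts failed")
}

# Stand-in for a download that fails twice before succeeding
calls <- 0
flaky_download <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("cannot open URL")
  "downloaded"
}

status <- with_retry(flaky_download, max_tries = 5)
print(status)
```

For a real download, the stand-in could be replaced by something like `with_retry(function() nhanes('DEMO_I'), wait_seconds = 60)`.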

One possible reason is that the NHANES server was down when you tried to connect. In that case, run the same code again later. Also, check the variable names carefully: the same variable can have different names in different survey cycles, and some variables are not available in all cycles.
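
Differences in variable names between cycles can be found with set operations on the column names. The sketch below uses two short name vectors made up for illustration; with real data, they would be replaced by `names()` applied to the data frames of the two cycles.

```r
# Toy stand-ins for the variable names of two survey cycles (illustrative only)
names_cycle_a <- c("SEQN", "RIAGENDR", "RIDAGEYR", "DMDBORN")
names_cycle_b <- c("SEQN", "RIAGENDR", "RIDAGEYR", "DMDBORN4")

common <- intersect(names_cycle_a, names_cycle_b)  # present in both cycles
only_a <- setdiff(names_cycle_a, names_cycle_b)    # present only in cycle A
only_b <- setdiff(names_cycle_b, names_cycle_a)    # present only in cycle B

print(common)
print(only_a)
print(only_b)
```

Names that appear in only one of the two vectors are the ones to look up in the codebooks before combining cycles.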

References

Susmann, Herb. 2016. RNHANES: Facilitates Analysis of CDC NHANES Data. https://CRAN.R-project.org/package=RNHANES.