Predictive question-2a

Working with a Predictive question using NHANES

Part 1: Identify, download and merge necessary data:

The tutorial focuses on building a predictive model for Diastolic blood pressure in the U.S. population for the years 2013-14. It provides a step-by-step guide on how to use R for data manipulation and analysis, covering the initial setup of the R environment, identification of relevant covariates like age, sex, and lifestyle factors, and methods to search and import these variables from the NHANES dataset. Following data importation, subsets of relevant variables are merged into a single analytic dataset, which is then saved for future research. The tutorial also includes an exercise.

Example article

Let us use the article by Li et al. (2020) as our reference. DOI:10.1038/s41371-019-0224-9.

Research question

Building a predictive model for diastolic blood pressure from 2013-14 NHANES data. Here we are interested only in predicting the outcome (diastolic blood pressure).

PICOT element Description
P US
I Not applicable, as we are dealing with a prediction model here
C Not applicable, as we are dealing with a prediction model here
O Diastolic blood pressure
T 2013-14

Covariates under consideration (that are known to influence this outcome) based on the literature (e.g., try this paper; see Table 1):

  • sex
  • age
  • race
  • marital status
  • Systolic blood pressure
  • smoking
  • alcohol

We will also extract additional survey features:

  • weights
  • strata
  • cluster

But how do we know where to search for these variables?

Searching for useful variables and datasets

R can do some preliminary searches:

Find suffix for the year

First, we need to identify the correct suffix for the data year.

The nhanesTables function enables quick display of all available tables in each survey group:

  • DEMOGRAPHICS: DEMO
  • DIETARY: DIET
  • EXAMINATION: EXAM
  • LABORATORY: LAB
  • QUESTIONNAIRE: Q
library(nhanesA) # nhanesTables, nhanesTableVars, nhanes, nhanesTranslate
library(knitr)   # kable
# ?nhanesTables
nhanesTables(data_group='DEMO', year=2013)
# nicer tables
kable(nhanesTables(data_group='DEMO', year=2013))
Data.File.Name Data.File.Description
DEMO_H Demographic Variables and Sample Weights

The suffix H is assigned to the 2013-2014 cycle.
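
The suffix follows the CDC naming convention for continuous NHANES cycles: the 1999-2000 files carry no suffix, and each later two-year cycle takes the next letter (B, C, D, ...). The sketch below encodes that convention for the cycles 1999-2000 through 2017-2018. The function name `cycle_suffix` is hypothetical and is not part of the nhanesA package.

```r
# Hypothetical helper (not part of nhanesA): map a survey year to the
# file-name suffix used for NHANES cycles 1999-2000 through 2017-2018.
cycle_suffix <- function(year) {
  stopifnot(is.numeric(year), all(year >= 1999), all(year <= 2018))
  # Each cycle spans two years and starts in an odd year.
  start_year <- ifelse(year %% 2 == 1, year, year - 1)
  idx <- (start_year - 1999) / 2 + 1
  suffixes <- c("", "_B", "_C", "_D", "_E", "_F", "_G", "_H", "_I", "_J")
  suffixes[idx]
}

cycle_suffix(2013)                  # "_H"
paste0("DEMO", cycle_suffix(2015))  # "DEMO_I"
```

Under this convention, 2013 and 2014 belong to the same cycle, while 2015 starts the next one.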

kable(nhanesTables(data_group='EXAM', year=2013))
Data.File.Name Data.File.Description
BPX_H Blood Pressure
DXXSPN_H Dual-Energy X-ray Absorptiometry - Spine
DXXFEM_H Dual-Energy X-ray Absorptiometry - Femur
DXXVFA_H Dual-Energy X-ray Absorptiometry - Vertebral Fracture Assessment
BMX_H Body Measures
DXXFRX_H Dual-Energy X-ray Absorptiometry - FRAX Score
CSX_H Taste & Smell
MGX_H Muscle Strength - Grip Test
OHXDEN_H Oral Health - Dentition
OHXPER_H Oral Health - Periodontal
OHXREF_H Oral Health - Recommendation of Care
FLXCLN_H Fluorosis - Clinical
DXXAAC_H Dual-Energy X-ray Absorptiometry - Abdominal Aortic Calcification
DXXT4_H Dual-Energy X-ray Absorptiometry - T4 Vertebrae Morphology
DXXT5_H Dual-Energy X-ray Absorptiometry - T5 Vertebrae Morphology
DXXT6_H Dual-Energy X-ray Absorptiometry - T6 Vertebrae Morphology
DXXT7_H Dual-Energy X-ray Absorptiometry - T7 Vertebrae Morphology
DXXT8_H Dual-Energy X-ray Absorptiometry - T8 Vertebrae Morphology
DXXT9_H Dual-Energy X-ray Absorptiometry - T9 Vertebrae Morphology
DXXT10_H Dual-Energy X-ray Absorptiometry - T10 Vertebrae Morphology
DXXT11_H Dual-Energy X-ray Absorptiometry - T11 Vertebrae Morphology
DXXT12_H Dual-Energy X-ray Absorptiometry - T12 Vertebrae Morphology
DXXL1_H Dual-Energy X-ray Absorptiometry - L1 Vertebrae Morphology
DXXL2_H Dual-Energy X-ray Absorptiometry - L2 Vertebrae Morphology
DXXL3_H Dual-Energy X-ray Absorptiometry - L3 Vertebrae Morphology
DXXL4_H Dual-Energy X-ray Absorptiometry - L4 Vertebrae Morphology
DXX_H Dual-Energy X-ray Absorptiometry - Whole Body
PAXDAY_H Physical Activity Monitor - Day
PAXHD_H Physical Activity Monitor - Header
PAXHR_H Physical Activity Monitor - Hour
PAXMIN_H Physical Activity Monitor - Minute
DXXAG_H Dual-Energy X-ray Absorptiometry - Android/Gynoid Measurements
# Also try other datasets
# nhanesTables(data_group='DIET', year=2013)
# nhanesTables(data_group='LAB', year=2013)
# nhanesTables(data_group='Q', year=2013)
# What happens when you change year to 2014 or 2015?
# Try both and compare the output.

Look up variable names

Once we have the table names, we need to find out which variables in those tables are useful for us.

NHANES Variable Keyword Search

nhanesTableVars enables quick display of table variables and their definitions:

# ?nhanesTableVars
nhanesTableVars(data_group='DEMO', nh_table='DEMO_H', 
                namesonly = TRUE)
#>  [1] "AIALANGA" "DMDBORN4" "DMDCITZN" "DMDEDUC2" "DMDEDUC3" "DMDFMSIZ"
#>  [7] "DMDHHSIZ" "DMDHHSZA" "DMDHHSZB" "DMDHHSZE" "DMDHRAGE" "DMDHRBR4"
#> [13] "DMDHREDU" "DMDHRGND" "DMDHRMAR" "DMDHSEDU" "DMDMARTL" "DMDYRSUS"
#> [19] "DMQADFC"  "DMQMILIZ" "FIAINTRP" "FIALANG"  "FIAPROXY" "INDFMIN2"
#> [25] "INDFMPIR" "INDHHIN2" "MIAINTRP" "MIALANG"  "MIAPROXY" "RIAGENDR"
#> [31] "RIDAGEMN" "RIDAGEYR" "RIDEXAGM" "RIDEXMON" "RIDEXPRG" "RIDRETH1"
#> [37] "RIDRETH3" "RIDSTATR" "SDDSRVYR" "SDMVPSU"  "SDMVSTRA" "SEQN"    
#> [43] "SIAINTRP" "SIALANG"  "SIAPROXY" "WTINT2YR" "WTMEC2YR"
kable(nhanesTableVars(data_group='DEMO', nh_table='DEMO_H', 
                      namesonly = FALSE))
Variable.Name Variable.Description
AIALANGA Language of the MEC ACASI Interview Instrument
DMDBORN4 In what country {were you/was SP} born?
DMDCITZN {Are you/Is SP} a citizen of the United States? [Information about citizenship is being collected by
DMDEDUC2 What is the highest grade or level of school {you have/SP has} completed or the highest degree {you
DMDEDUC3 What is the highest grade or level of school {you have/SP has} completed or the highest degree {you
DMDFMSIZ Total number of people in the Family
DMDHHSIZ Total number of people in the Household
DMDHHSZA Number of children aged 5 years or younger in the household
DMDHHSZB Number of children aged 6-17 years old in the household
DMDHHSZE Number of adults aged 60 years or older in the household
DMDHRAGE HH reference person’s age in years
DMDHRBR4 HH reference person’s country of birth
DMDHREDU HH reference person’s education level
DMDHRGND HH reference person’s gender
DMDHRMAR HH reference person’s marital status
DMDHSEDU HH reference person’s spouse’s education level
DMDMARTL Marital status
DMDYRSUS Length of time the participant has been in the US.
DMQADFC Did {you/SP} ever serve in a foreign country during a time of armed conflict or on a humanitarian or
DMQMILIZ {Have you/Has SP} ever served on active duty in the U.S. Armed Forces, military Reserves, or Nationa
FIAINTRP Was an interpreter used to conduct the Family interview?
FIALANG Language of the Family Interview Instrument
FIAPROXY Was a Proxy respondent used in conducting the Family Interview?
INDFMIN2 Total family income (reported as a range value in dollars)
INDFMPIR A ratio of family income to poverty guidelines.
INDHHIN2 Total household income (reported as a range value in dollars)
MIAINTRP Was an interpreter used to conduct the MEC CAPI interview?
MIALANG Language of the MEC CAPI Interview Instrument
MIAPROXY Was a Proxy respondent used in conducting the MEC CAPI Interview?
RIAGENDR Gender of the participant.
RIDAGEMN Age in months of the participant at the time of screening. Reported for persons aged 24 months or yo
RIDAGEYR Age in years of the participant at the time of screening. Individuals 80 and over are topcoded at 80
RIDEXAGM Age in months of the participant at the time of examination. Reported for persons aged 19 years or y
RIDEXMON Six month time period when the examination was performed - two categories: November 1 through April
RIDEXPRG Pregnancy status for females between 20 and 44 years of age at the time of MEC exam.
RIDRETH1 Recode of reported race and Hispanic origin information
RIDRETH3 Recode of reported race and Hispanic origin information, with Non-Hispanic Asian Category
RIDSTATR Interview and examination status of the participant.
SDDSRVYR Data release cycle
SDMVPSU Masked variance unit pseudo-PSU variable for variance estimation
SDMVSTRA Masked variance unit pseudo-stratum variable for variance estimation
SEQN Respondent sequence number.
SIAINTRP Was an interpreter used to conduct the Sample Person (SP) interview?
SIALANG Language of the Sample Person Interview Instrument
SIAPROXY Was a Proxy respondent used in conducting the Sample Person (SP) interview?
WTINT2YR Full sample 2 year interview weight.
WTMEC2YR Full sample 2 year MEC exam weight.
# https://wwwn.cdc.gov/nchs/nhanes/2013-2014/DEMO_H.htm

NHANES 2013-2014 Demographics Data

Displays a list of variables in the specified NHANES table:

nhanesTableVars(data_group='EXAM', nh_table='BPX_H')
# https://wwwn.cdc.gov/nchs/nhanes/2013-2014/BPX_H.htm

NHANES 2013-2014 Examination Data
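
Scanning a long variable listing by eye is error-prone, so a keyword filter on the description column can help. The sketch below works on a small hand-copied excerpt of the DEMO_H listing shown above. The helper `find_vars` is hypothetical, and the column names `Variable.Name` and `Variable.Description` are assumed to match the nhanesTableVars output printed earlier.

```r
# Excerpt of the DEMO_H variable listing, copied from the output above
vars <- data.frame(
  Variable.Name = c("DMDMARTL", "RIAGENDR", "RIDRETH3", "WTMEC2YR"),
  Variable.Description = c(
    "Marital status",
    "Gender of the participant.",
    "Recode of reported race and Hispanic origin information, with Non-Hispanic Asian Category",
    "Full sample 2 year MEC exam weight."
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose description matches a keyword
# (case-insensitive regular expression)
find_vars <- function(tbl, keyword) {
  hit <- grepl(keyword, tbl$Variable.Description, ignore.case = TRUE)
  tbl[hit, , drop = FALSE]
}

find_vars(vars, "marital")$Variable.Name  # "DMDMARTL"
find_vars(vars, "weight")$Variable.Name   # "WTMEC2YR"
```

The same helper could be applied to the full data frame returned with `namesonly = FALSE`, provided the column names are as assumed.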

Importing and Subsetting the dataset

The objective is to retain only the useful variables. We will start by importing only the demographic variables that we need.

Demographics

NHANES 2013-2014 Demographic Variables and Sample Weights (DEMO_H)

Take a look at the Target for each variable.

What is the difference between the following two weights?

  • WTINT2YR: 0 missing
  • WTMEC2YR: 0 missing, 362 not MEC examined
demo <- nhanes('DEMO_H') # Both males and females 0 YEARS - 150 YEARS
names(demo)
#>  [1] "SEQN"     "SDDSRVYR" "RIDSTATR" "RIAGENDR" "RIDAGEYR" "RIDAGEMN"
#>  [7] "RIDRETH1" "RIDRETH3" "RIDEXMON" "RIDEXAGM" "DMQMILIZ" "DMQADFC" 
#> [13] "DMDBORN4" "DMDCITZN" "DMDYRSUS" "DMDEDUC3" "DMDEDUC2" "DMDMARTL"
#> [19] "RIDEXPRG" "SIALANG"  "SIAPROXY" "SIAINTRP" "FIALANG"  "FIAPROXY"
#> [25] "FIAINTRP" "MIALANG"  "MIAPROXY" "MIAINTRP" "AIALANGA" "DMDHHSIZ"
#> [31] "DMDFMSIZ" "DMDHHSZA" "DMDHHSZB" "DMDHHSZE" "DMDHRGND" "DMDHRAGE"
#> [37] "DMDHRBR4" "DMDHREDU" "DMDHRMAR" "DMDHSEDU" "WTINT2YR" "WTMEC2YR"
#> [43] "SDMVPSU"  "SDMVSTRA" "INDHHIN2" "INDFMIN2" "INDFMPIR"
demo1 <- demo[c("SEQN", # Respondent sequence number
                "RIAGENDR", # gender
                "RIDAGEYR", # Age in years at screening
                "RIDRETH3", # Race/Hispanic origin w/ NH Asian
                "DMDMARTL", # Marital status: 20 YEARS - 150 YEARS
                "WTINT2YR", "WTMEC2YR", #  Full sample 2 year weights
                "SDMVPSU", # Masked variance pseudo-PSU
                "SDMVSTRA")] # Masked variance pseudo-stratum
demo_vars <- names(demo1) 
demo2 <- nhanesTranslate('DEMO_H', demo_vars, data=demo1)
#> No translation table is available for SEQN
#> Translated columns: RIAGENDR RIDRETH3 DMDMARTL
head(demo2$SEQN)
#> [1] 73557 73558 73559 73560 73561 73562
head(demo2)
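
On the two weights listed above: WTINT2YR applies to everyone who was interviewed, while WTMEC2YR applies to those who also completed the MEC examination. The sketch below uses made-up rows and assumes that participants who were not examined carry a WTMEC2YR of 0, as the codebook's "Not MEC Examined" row suggests. Check the DEMO_H codebook before relying on that coding.

```r
# Toy rows mimicking DEMO_H; all values are made up for illustration
demo_toy <- data.frame(
  SEQN     = c(1, 2, 3, 4),
  WTINT2YR = c(13281.2, 23682.1, 57214.8, 55201.2),
  WTMEC2YR = c(13481.0, 24471.8, 0, 55766.5)
)

# Assumption: a MEC weight of 0 marks a participant who was interviewed
# but not examined
demo_toy$examined <- demo_toy$WTMEC2YR > 0
table(demo_toy$examined)

# Keep only participants with a positive examination weight
examined_only <- demo_toy[demo_toy$examined, ]
nrow(examined_only)  # 3
```

Because blood pressure is measured during the examination, the MEC weight is usually the relevant one for this outcome.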

Blood pressure

Next, we focus on the blood pressure readings.

NHANES 2013-2014 Blood Pressure (BPX_H)

Take a look at the Target and the number of missing values for each variable. For example,

BPXDI1 - Diastolic: Blood pres (1st rdg) mm Hg - Target: Both males and females 8 YEARS - 150 YEARS - Missing 2641

BPXSY1 - Systolic: Blood pres (1st rdg) mm Hg - Target: Both males and females 8 YEARS - 150 YEARS - Missing 2641

bpx <- nhanes('BPX_H')
names(bpx)
#>  [1] "SEQN"     "PEASCST1" "PEASCTM1" "PEASCCT1" "BPXCHR"   "BPAARM"  
#>  [7] "BPACSZ"   "BPXPLS"   "BPXPULS"  "BPXPTY"   "BPXML1"   "BPXSY1"  
#> [13] "BPXDI1"   "BPAEN1"   "BPXSY2"   "BPXDI2"   "BPAEN2"   "BPXSY3"  
#> [19] "BPXDI3"   "BPAEN3"   "BPXSY4"   "BPXDI4"   "BPAEN4"
bpx1 <- bpx[c("SEQN", # Respondent sequence number
             "BPXDI1", # Diastolic: Blood pres (1st rdg) mm Hg
             "BPXSY1")] # Systolic: Blood pres (1st rdg) mm Hg
bpx_vars <- names(bpx1)
bpx2 <- nhanesTranslate('BPX_H', bpx_vars, data=bpx1)
#> No translation table is available for SEQN
#> Warning in nhanesTranslate("BPX_H", bpx_vars, data = bpx1): No columns were
#> translated
head(bpx2)
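
The "Missing" counts quoted from the codebook can be compared against the downloaded data. Here is a minimal sketch on made-up readings. The toy data frame `bpx_toy` stands in for `bpx1`.

```r
# Toy blood pressure readings; values are made up for illustration
bpx_toy <- data.frame(
  SEQN   = c(1, 2, 3, 4, 5),
  BPXDI1 = c(72, NA, 64, NA, 80),
  BPXSY1 = c(122, NA, 110, NA, 134)
)

# Number of missing values in each column
n_missing <- colSums(is.na(bpx_toy))
n_missing

# Rows where both first readings are available
complete_bp <- bpx_toy[complete.cases(bpx_toy[c("BPXDI1", "BPXSY1")]), ]
nrow(complete_bp)  # 3
```

Running `colSums(is.na(bpx1))` on the real data gives counts that can be set against the codebook figures.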

Smoking

Now, let’s consider smoking data.

NHANES 2013-2014 Smoking - Cigarette Use (SMQ_H)

SMQ040 - Do you now smoke cigarettes - Target: Both males and females 18 YEARS - 150 YEARS - Missing 4589

smq <- nhanes('SMQ_H')
smq1 <- smq[c("SEQN", # Respondent sequence number
             "SMQ040")] # Do you now smoke cigarettes?: 18 YEARS - 150 YEARS
smq_vars <- names(smq1)
smq2 <- nhanesTranslate('SMQ_H', smq_vars, data=smq1) # same cycle (H) as the download
#> No translation table is available for SEQN
#> Translated columns: SMQ040
head(smq2)

Other options for smoking variable candidates could be

  • SMD641 - # days smoked cigs during past 30 days
  • SMD650 - Avg # cigarettes/day during past 30 days
  • SMQ621 - Cigarettes smoked in entire life

Which of these variables best captures what you have in mind as a smoking variable?

Alcohol

Finally, we will import data about alcohol consumption.

NHANES 2013-2014 Alcohol Use (ALQ_H)

ALQ130 - Avg no alcoholic drinks/day - past 12 mos - Target: Both males and females 18 YEARS - 150 YEARS - Missing 2328

alq <- nhanes('ALQ_H')
alq1 <- alq[c("SEQN", # Respondent sequence number
              "ALQ130")] # Avg # alcoholic drinks/day 
alq_vars <- names(alq1)
alq2 <- nhanesTranslate('ALQ_H', alq_vars, data=alq1)
#> No translation table is available for SEQN
#> Warning in nhanesTranslate("ALQ_H", alq_vars, data = alq1): No columns were
#> translated
head(alq2)
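
The four imports above follow the same pattern: download a table, keep a few columns, translate the coded values. One possible way to wrap that pattern is sketched below. The helper `import_nhanes_vars` is hypothetical. Its `fetch` and `translate` arguments default to the nhanesA functions used in this tutorial, and can be replaced by stand-ins for an offline demonstration, as done here.

```r
# Hypothetical helper: fetch one NHANES table, keep the requested
# columns, and translate coded values. The table name is passed once,
# so the download and the translation refer to the same survey cycle.
import_nhanes_vars <- function(table, vars,
                               fetch = nhanesA::nhanes,
                               translate = nhanesA::nhanesTranslate) {
  raw <- fetch(table)
  absent <- setdiff(vars, names(raw))
  if (length(absent) > 0) {
    stop("Not found in ", table, ": ", paste(absent, collapse = ", "))
  }
  translate(table, vars, data = raw[vars])
}

# Offline demonstration with stand-in functions (nothing is downloaded)
fake_fetch <- function(table) {
  data.frame(SEQN = c(1, 2, 3), SMQ040 = c(1, 3, NA), OTHER = c(0, 0, 0))
}
fake_translate <- function(table, vars, data) data

smq_demo <- import_nhanes_vars("SMQ_H", c("SEQN", "SMQ040"),
                               fetch = fake_fetch,
                               translate = fake_translate)
names(smq_demo)  # "SEQN" "SMQ040"
```

With nhanesA installed and an internet connection, a call such as `import_nhanes_vars('ALQ_H', c('SEQN', 'ALQ130'))` would be expected to mirror the alcohol import above.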

Merging all the datasets

One by one

Now, we need to combine all these individual datasets into one for our analysis.

analytic.data0 <- merge(demo2, bpx2, by = c("SEQN"), all=TRUE)
head(analytic.data0)
dim(analytic.data0)
#> [1] 10175    11
analytic.data1 <- merge(analytic.data0, smq2, by = c("SEQN"), 
                        all=TRUE)
head(analytic.data1)
dim(analytic.data1)
#> [1] 10175    12
analytic.data2 <- merge(analytic.data1, alq2, by = c("SEQN"), 
                        all=TRUE)
head(analytic.data2)
dim(analytic.data2)
#> [1] 10175    13
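
The row count stays at 10175 after every merge because `all=TRUE` keeps each SEQN that appears in either table and fills the unmatched columns with NA. The toy example below contrasts this with the default merge, which keeps only matching rows. All data values are made up.

```r
# Toy tables keyed by SEQN; values are made up for illustration
demo_toy <- data.frame(SEQN = c(1, 2, 3, 4), RIDAGEYR = c(69, 54, 72, 9))
bpx_toy  <- data.frame(SEQN = c(1, 2, 3),    BPXDI1   = c(72, 62, 90))

# SEQN should identify exactly one row per participant in each table
stopifnot(!anyDuplicated(demo_toy$SEQN), !anyDuplicated(bpx_toy$SEQN))

full  <- merge(demo_toy, bpx_toy, by = "SEQN", all = TRUE)  # keep all rows
inner <- merge(demo_toy, bpx_toy, by = "SEQN")              # matches only

nrow(full)   # 4
nrow(inner)  # 3
is.na(full$BPXDI1[full$SEQN == 4])  # TRUE: no reading for participant 4
```

Checking for duplicated keys before merging guards against accidentally multiplying rows.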

All at once

Alternatively, you can merge all datasets at once.

require(plyr)
#> Loading required package: plyr
analytic.data <- join_all(list(demo2, bpx2, smq2, alq2), 
                          by = "SEQN", type='full')
head(analytic.data)
dim(analytic.data)
#> [1] 10175    13
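
If you prefer to avoid the extra package, a base R alternative is to fold `merge` over a list of tables with `Reduce`. This is a sketch on toy data, not the approach used above. The wrapper `merge_full` is a hypothetical name.

```r
# Toy tables keyed by SEQN; values are made up for illustration
demo_toy <- data.frame(SEQN = c(1, 2, 3, 4), RIDAGEYR = c(69, 54, 72, 9))
bpx_toy  <- data.frame(SEQN = c(1, 2, 3),    BPXDI1   = c(72, 62, 90))
smq_toy  <- data.frame(SEQN = c(1, 3),
                       SMQ040 = c("Not at all", "Every day"))

# Full (outer) merge of two tables on SEQN
merge_full <- function(x, y) merge(x, y, by = "SEQN", all = TRUE)

# Apply the pairwise merge from left to right across the list
combined <- Reduce(merge_full, list(demo_toy, bpx_toy, smq_toy))
dim(combined)  # 4 rows, 4 columns
```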

Saving data for later use

It’s a good practice to save your data for future reference.

save(analytic.data, file="Data/researchquestion/Analytic2013.RData") 
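
Note that `save` fails if the target folder does not exist. The sketch below creates the folder first and then reloads the file to confirm that the round trip works. It writes to a temporary directory, and the object and file names are illustrative only.

```r
# Create the output folder (and any parent folders) if needed
out_dir <- file.path(tempdir(), "Data", "researchquestion")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Toy stand-in for the analytic data
analytic_toy <- data.frame(SEQN = c(1, 2), BPXDI1 = c(72, 62))
out_file <- file.path(out_dir, "AnalyticToy.RData")
save(analytic_toy, file = out_file)

# Remove the object, then restore it from disk;
# load() returns the names of the objects it restored
rm(analytic_toy)
loaded <- load(out_file)
loaded  # "analytic_toy"
```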

Exercise (try yourself)

Follow the steps in the exercise section to deepen your understanding and broaden the analysis.

  1. The following variables were included in this paper but not in the analysis above. Try including them and then create the new analytic data:
  • education level
  • poverty income ratio
  • Sodium intake (mg)
  • Potassium intake (mg)
  2. Download the NHANES 2015-2016 data and append them to the NHANES 2013-2014 analytic data, keeping the same variables.
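
As a hint for the appending step, the sketch below stacks two toy data frames on their shared columns and records the cycle each row came from. All values and object names are made up. With real data, each cycle would first be built with the steps shown in this tutorial.

```r
# Toy analytic data for two cycles; values are made up for illustration
data_1314 <- data.frame(SEQN = c(1, 2), RIDAGEYR = c(69, 54),
                        BPXDI1 = c(72, 62))
data_1516 <- data.frame(SEQN = c(3, 4), RIDAGEYR = c(62, 53),
                        BPXDI1 = c(70, 88), EXTRA = c(1, 2))

# Record the source cycle before stacking
data_1314$cycle <- "2013-2014"
data_1516$cycle <- "2015-2016"

# rbind() needs identical columns, so keep only the shared ones
common <- intersect(names(data_1314), names(data_1516))
appended <- rbind(data_1314[common], data_1516[common])
table(appended$cycle)
```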

References

Li, Meng, Shoumeng Yan, Xing Li, Shan Jiang, Xiaoyu Ma, Hantong Zhao, Jiagen Li, et al. 2020. “Association Between Blood Pressure and Dietary Intakes of Sodium and Potassium Among US Adults Using Quantile Regression Analysis NHANES 2007–2014.” Journal of Human Hypertension 34 (5): 346–54.