Data.File.Name | Data.File.Description |
---|---|
DEMO_H | Demographic Variables and Sample Weights |
Predictive question-2a
Working with a Predictive question using NHANES
Part 1: Identify, download and merge necessary data:
The tutorial focuses on building a predictive model for Diastolic blood pressure in the U.S. population for the years 2013-14. It provides a step-by-step guide on how to use R for data manipulation and analysis, covering the initial setup of the R environment, identification of relevant covariates like age, sex, and lifestyle factors, and methods to search and import these variables from the NHANES dataset. Following data importation, subsets of relevant variables are merged into a single analytic dataset, which is then saved for future research. The tutorial also includes an exercise.
Example article
Let us use the article by Li et al. (2020) as our reference. DOI:10.1038/s41371-019-0224-9.
Video content (optional)
For those who prefer a video walkthrough, feel free to watch the video below, which offers a description of an earlier version of the content.
Research question
Building a predictive model for Diastolic blood pressure from 2013-14 NHANES data. Here we are only interested in explaining the outcome (Diastolic blood pressure).
PICOT element | Description |
---|---|
P | US population |
I | Not applicable, as we are dealing with a prediction model here |
C | Not applicable, as we are dealing with a prediction model here |
O | Diastolic blood pressure |
T | 2013-14 |
Covariates under consideration (that are known to influence this outcome) based on the literature (e.g., try this paper; see Table 1):
- sex
- age
- race
- marital status
- Systolic blood pressure
- smoking
- alcohol
We will also extract additional survey features:
- weights
- strata
- cluster
But how do we know where to search for these variables?
Searching for useful variables and datasets
R can do some preliminary searches:
Find suffix for the year
First, we need to identify the correct suffix for the data year.
The nhanesTables function enables quick display of all available tables in survey groups:
- DEMOGRAPHICS: DEMO
- DIETARY: DIET
- EXAMINATION: EXAM
- LABORATORY: LAB
- QUESTIONNAIRE: Q
The suffix H is assigned for 2013-2014.
Data.File.Name | Data.File.Description |
---|---|
BPX_H | Blood Pressure |
DXXSPN_H | Dual-Energy X-ray Absorptiometry - Spine |
DXXFEM_H | Dual-Energy X-ray Absorptiometry - Femur |
DXXVFA_H | Dual-Energy X-ray Absorptiometry - Vertebral Fracture Assessment |
BMX_H | Body Measures |
DXXFRX_H | Dual-Energy X-ray Absorptiometry - FRAX Score |
CSX_H | Taste & Smell |
MGX_H | Muscle Strength - Grip Test |
OHXDEN_H | Oral Health - Dentition |
OHXPER_H | Oral Health - Periodontal |
OHXREF_H | Oral Health - Recommendation of Care |
FLXCLN_H | Fluorosis - Clinical |
DXXAAC_H | Dual-Energy X-ray Absorptiometry - Abdominal Aortic Calcification |
DXXT4_H | Dual-Energy X-ray Absorptiometry - T4 Vertebrae Morphology |
DXXT5_H | Dual-Energy X-ray Absorptiometry - T5 Vertebrae Morphology |
DXXT6_H | Dual-Energy X-ray Absorptiometry - T6 Vertebrae Morphology |
DXXT7_H | Dual-Energy X-ray Absorptiometry - T7 Vertebrae Morphology |
DXXT8_H | Dual-Energy X-ray Absorptiometry - T8 Vertebrae Morphology |
DXXT9_H | Dual-Energy X-ray Absorptiometry - T9 Vertebrae Morphology |
DXXT10_H | Dual-Energy X-ray Absorptiometry - T10 Vertebrae Morphology |
DXXT11_H | Dual-Energy X-ray Absorptiometry - T11 Vertebrae Morphology |
DXXT12_H | Dual-Energy X-ray Absorptiometry - T12 Vertebrae Morphology |
DXXL1_H | Dual-Energy X-ray Absorptiometry - L1 Vertebrae Morphology |
DXXL2_H | Dual-Energy X-ray Absorptiometry - L2 Vertebrae Morphology |
DXXL3_H | Dual-Energy X-ray Absorptiometry - L3 Vertebrae Morphology |
DXXL4_H | Dual-Energy X-ray Absorptiometry - L4 Vertebrae Morphology |
DXX_H | Dual-Energy X-ray Absorptiometry - Whole Body |
PAXDAY_H | Physical Activity Monitor - Day |
PAXHD_H | Physical Activity Monitor - Header |
PAXHR_H | Physical Activity Monitor - Hour |
PAXMIN_H | Physical Activity Monitor - Minute |
DXXAG_H | Dual-Energy X-ray Absorptiometry - Android/Gynoid Measurements |
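The suffix convention can be sketched as a small lookup. This is only an illustration: `nhanes_suffix` and `table_name_for` are hypothetical names introduced here, not part of the nhanesA package, and the mapping lists the letter suffixes of the continuous NHANES cycles from 1999-2000 (no suffix) through 2017-2018 (J).

```r
# Suffix used in NHANES file names, keyed by the first year of each 2-year cycle.
# The 1999-2000 files carry no suffix at all.
nhanes_suffix <- c("1999" = "",   "2001" = "_B", "2003" = "_C", "2005" = "_D",
                   "2007" = "_E", "2009" = "_F", "2011" = "_G", "2013" = "_H",
                   "2015" = "_I", "2017" = "_J")

# Hypothetical helper: build a table name from a file prefix and a cycle start year.
table_name_for <- function(prefix, start_year) {
  paste0(prefix, nhanes_suffix[[as.character(start_year)]])
}

table_name_for("DEMO", 2013)
table_name_for("BPX", 2013)

# With nhanesA installed and an internet connection, the examination tables
# for the same cycle can be listed with:
# nhanesA::nhanesTables(data_group = "EXAM", year = 2013)
```

The two calls return the names used later in this tutorial, DEMO_H and BPX_H.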
Look up variable names
Once we have the table names, we need to find out which variables in those tables are useful for us.
NHANES Variable Keyword Search
The nhanesTableVars function enables quick display of table variables and their definitions:
# ?nhanesTableVars
library(nhanesA) # provides nhanesTableVars(), nhanes() and nhanesTranslate()
nhanesTableVars(data_group='DEMO', nh_table='DEMO_H',
                namesonly = TRUE)
#> [1] "AIALANGA" "DMDBORN4" "DMDCITZN" "DMDEDUC2" "DMDEDUC3" "DMDFMSIZ"
#> [7] "DMDHHSIZ" "DMDHHSZA" "DMDHHSZB" "DMDHHSZE" "DMDHRAGE" "DMDHRBR4"
#> [13] "DMDHREDU" "DMDHRGND" "DMDHRMAR" "DMDHSEDU" "DMDMARTL" "DMDYRSUS"
#> [19] "DMQADFC" "DMQMILIZ" "FIAINTRP" "FIALANG" "FIAPROXY" "INDFMIN2"
#> [25] "INDFMPIR" "INDHHIN2" "MIAINTRP" "MIALANG" "MIAPROXY" "RIAGENDR"
#> [31] "RIDAGEMN" "RIDAGEYR" "RIDEXAGM" "RIDEXMON" "RIDEXPRG" "RIDRETH1"
#> [37] "RIDRETH3" "RIDSTATR" "SDDSRVYR" "SDMVPSU" "SDMVSTRA" "SEQN"
#> [43] "SIAINTRP" "SIALANG" "SIAPROXY" "WTINT2YR" "WTMEC2YR"
# kable() comes from the knitr package
knitr::kable(nhanesTableVars(data_group='DEMO', nh_table='DEMO_H',
                             namesonly = FALSE))
Variable.Name | Variable.Description |
---|---|
AIALANGA | Language of the MEC ACASI Interview Instrument |
DMDBORN4 | In what country {were you/was SP} born? |
DMDCITZN | {Are you/Is SP} a citizen of the United States? [Information about citizenship is being collected by |
DMDEDUC2 | What is the highest grade or level of school {you have/SP has} completed or the highest degree {you |
DMDEDUC3 | What is the highest grade or level of school {you have/SP has} completed or the highest degree {you |
DMDFMSIZ | Total number of people in the Family |
DMDHHSIZ | Total number of people in the Household |
DMDHHSZA | Number of children aged 5 years or younger in the household |
DMDHHSZB | Number of children aged 6-17 years old in the household |
DMDHHSZE | Number of adults aged 60 years or older in the household |
DMDHRAGE | HH reference person’s age in years |
DMDHRBR4 | HH reference person’s country of birth |
DMDHREDU | HH reference person’s education level |
DMDHRGND | HH reference person’s gender |
DMDHRMAR | HH reference person’s marital status |
DMDHSEDU | HH reference person’s spouse’s education level |
DMDMARTL | Marital status |
DMDYRSUS | Length of time the participant has been in the US. |
DMQADFC | Did {you/SP} ever serve in a foreign country during a time of armed conflict or on a humanitarian or |
DMQMILIZ | {Have you/Has SP} ever served on active duty in the U.S. Armed Forces, military Reserves, or Nationa |
FIAINTRP | Was an interpreter used to conduct the Family interview? |
FIALANG | Language of the Family Interview Instrument |
FIAPROXY | Was a Proxy respondent used in conducting the Family Interview? |
INDFMIN2 | Total family income (reported as a range value in dollars) |
INDFMPIR | A ratio of family income to poverty guidelines. |
INDHHIN2 | Total household income (reported as a range value in dollars) |
MIAINTRP | Was an interpreter used to conduct the MEC CAPI interview? |
MIALANG | Language of the MEC CAPI Interview Instrument |
MIAPROXY | Was a Proxy respondent used in conducting the MEC CAPI Interview? |
RIAGENDR | Gender of the participant. |
RIDAGEMN | Age in months of the participant at the time of screening. Reported for persons aged 24 months or yo |
RIDAGEYR | Age in years of the participant at the time of screening. Individuals 80 and over are topcoded at 80 |
RIDEXAGM | Age in months of the participant at the time of examination. Reported for persons aged 19 years or y |
RIDEXMON | Six month time period when the examination was performed - two categories: November 1 through April |
RIDEXPRG | Pregnancy status for females between 20 and 44 years of age at the time of MEC exam. |
RIDRETH1 | Recode of reported race and Hispanic origin information |
RIDRETH3 | Recode of reported race and Hispanic origin information, with Non-Hispanic Asian Category |
RIDSTATR | Interview and examination status of the participant. |
SDDSRVYR | Data release cycle |
SDMVPSU | Masked variance unit pseudo-PSU variable for variance estimation |
SDMVSTRA | Masked variance unit pseudo-stratum variable for variance estimation |
SEQN | Respondent sequence number. |
SIAINTRP | Was an interpreter used to conduct the Sample Person (SP) interview? |
SIALANG | Language of the Sample Person Interview Instrument |
SIAPROXY | Was a Proxy respondent used in conducting the Sample Person (SP) interview? |
WTINT2YR | Full sample 2 year interview weight. |
WTMEC2YR | Full sample 2 year MEC exam weight. |
NHANES 2013-2014 Demographics Data
Displays a list of variables in the specified NHANES table:
Importing and Subsetting the dataset
The objective is to retain only the useful variables. We will start by importing only the demographic variables that we need.
Demographics
NHANES 2013-2014 Demographic Variables and Sample Weights (DEMO_H)
Take a look at the Target for each variable.
What is the difference between these two weights?
- WTINT2YR: Missing 0
- WTMEC2YR: Missing 0, Not MEC Examined 362
demo <- nhanes('DEMO_H') # Both males and females 0 YEARS - 150 YEARS
names(demo)
#> [1] "SEQN" "SDDSRVYR" "RIDSTATR" "RIAGENDR" "RIDAGEYR" "RIDAGEMN"
#> [7] "RIDRETH1" "RIDRETH3" "RIDEXMON" "RIDEXAGM" "DMQMILIZ" "DMQADFC"
#> [13] "DMDBORN4" "DMDCITZN" "DMDYRSUS" "DMDEDUC3" "DMDEDUC2" "DMDMARTL"
#> [19] "RIDEXPRG" "SIALANG" "SIAPROXY" "SIAINTRP" "FIALANG" "FIAPROXY"
#> [25] "FIAINTRP" "MIALANG" "MIAPROXY" "MIAINTRP" "AIALANGA" "DMDHHSIZ"
#> [31] "DMDFMSIZ" "DMDHHSZA" "DMDHHSZB" "DMDHHSZE" "DMDHRGND" "DMDHRAGE"
#> [37] "DMDHRBR4" "DMDHREDU" "DMDHRMAR" "DMDHSEDU" "WTINT2YR" "WTMEC2YR"
#> [43] "SDMVPSU" "SDMVSTRA" "INDHHIN2" "INDFMIN2" "INDFMPIR"
demo1 <- demo[c("SEQN",     # Respondent sequence number
                "RIAGENDR", # Gender
                "RIDAGEYR", # Age in years at screening
                "RIDRETH3", # Race/Hispanic origin w/ NH Asian
                "DMDMARTL", # Marital status: 20 YEARS - 150 YEARS
                "WTINT2YR", "WTMEC2YR", # Full sample 2 year weights
                "SDMVPSU",  # Masked variance pseudo-PSU
                "SDMVSTRA")] # Masked variance pseudo-stratum
demo_vars <- names(demo1)
demo2 <- nhanesTranslate('DEMO_H', demo_vars, data=demo1)
#> No translation table is available for SEQN
#> Translated columns: RIAGENDR RIDRETH3 DMDMARTL
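To see what a translated column looks like, here is a minimal sketch that recodes a toy vector by hand. It does not show how nhanesTranslate works internally; it only mimics the effect for RIAGENDR, whose codebook values are 1 = Male and 2 = Female. The vector `riagendr_codes` is made up for illustration.

```r
# Toy vector of raw RIAGENDR codes (made-up values, not NHANES records)
riagendr_codes <- c(1, 2, 2, 1)

# Map each numeric code to its codebook label
riagendr_labels <- factor(riagendr_codes,
                          levels = c(1, 2),
                          labels = c("Male", "Female"))

# Frequency of each label after recoding
table(riagendr_labels)
```

Columns without a code list, such as SEQN or a continuous measurement, have nothing to recode, which is why they are reported as not translated.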
Blood pressure
Next, we focus on the blood pressure readings.
NHANES 2013-2014 Blood Pressure (BPX_H)
Take a look at the Target and the number of missing values for each variable. For example,
- BPXDI1: Diastolic: Blood pres (1st rdg) mm Hg - Target: Both males and females 8 YEARS - 150 YEARS - Missing 2641
- BPXSY1: Systolic: Blood pres (1st rdg) mm Hg - Target: Both males and females 8 YEARS - 150 YEARS - Missing 2641
bpx <- nhanes('BPX_H')
names(bpx)
#> [1] "SEQN" "PEASCST1" "PEASCTM1" "PEASCCT1" "BPXCHR" "BPAARM"
#> [7] "BPACSZ" "BPXPLS" "BPXPULS" "BPXPTY" "BPXML1" "BPXSY1"
#> [13] "BPXDI1" "BPAEN1" "BPXSY2" "BPXDI2" "BPAEN2" "BPXSY3"
#> [19] "BPXDI3" "BPAEN3" "BPXSY4" "BPXDI4" "BPAEN4"
bpx1 <- bpx[c("SEQN",    # Respondent sequence number
              "BPXDI1",  # Diastolic: Blood pres (1st rdg) mm Hg
              "BPXSY1")] # Systolic: Blood pres (1st rdg) mm Hg
bpx_vars <- names(bpx1)
bpx2 <- nhanesTranslate('BPX_H', bpx_vars, data=bpx1)
#> No translation table is available for SEQN
#> Warning in nhanesTranslate("BPX_H", bpx_vars, data = bpx1): No columns were
#> translated
head(bpx2)
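The missing counts quoted from the codebook can also be checked directly in R once a table is loaded. A minimal sketch, using a toy data frame with made-up values in place of bpx2 (the name `bp_toy` is hypothetical):

```r
# Toy stand-in for bpx2; the readings are invented for illustration
bp_toy <- data.frame(SEQN   = 1:5,
                     BPXDI1 = c(70, NA, 64, NA, 82),
                     BPXSY1 = c(118, NA, 110, 121, 130))

# Number of missing values in each column
missing_counts <- colSums(is.na(bp_toy))
missing_counts
```

Running `colSums(is.na(bpx2))` on the real table should give counts that can be compared against the codebook.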
Smoking
Now, let’s consider smoking data.
NHANES 2013-2014 Smoking - Cigarette Use (SMQ_H)
SMQ040: Do you now smoke cigarettes - Target: Both males and females 18 YEARS - 150 YEARS - Missing 4589
smq <- nhanes('SMQ_H')
smq1 <- smq[c("SEQN",    # Respondent sequence number
              "SMQ040")] # Do you now smoke cigarettes?: 18 YEARS - 150 YEARS
smq_vars <- names(smq1)
# Use the 2013-2014 table name (SMQ_H), matching the data imported above
smq2 <- nhanesTranslate('SMQ_H', smq_vars, data=smq1)
#> No translation table is available for SEQN
#> Translated columns: SMQ040
head(smq2)
Other candidate smoking variables could be
- SMD641: # days smoked cigs during past 30 days
- SMD650: Avg # cigarettes/day during past 30 days
- SMQ621: Cigarettes smoked in entire life
Which of these variables best describes what you have in mind as a smoking variable?
Alcohol
Finally, we will import data about alcohol consumption.
NHANES 2013-2014 Alcohol Use (ALQ_H)
ALQ130: Avg no alcoholic drinks/day - past 12 mos - Target: Both males and females 18 YEARS - 150 YEARS - Missing 2328
alq <- nhanes('ALQ_H')
alq1 <- alq[c("SEQN",    # Respondent sequence number
              "ALQ130")] # Avg # alcoholic drinks/day
alq_vars <- names(alq1)
alq2 <- nhanesTranslate('ALQ_H', alq_vars, data=alq1)
#> No translation table is available for SEQN
#> Warning in nhanesTranslate("ALQ_H", alq_vars, data = alq1): No columns were
#> translated
head(alq2)
Merging all the datasets
One-by-one
Now, we need to combine all these individual datasets into one for our analysis.
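A minimal sketch of the one-by-one approach with base R `merge()`, assuming the tables share the respondent identifier SEQN. The toy data frames below stand in for demo2, bpx2, smq2 and alq2; their names and values are made up for illustration.

```r
# Toy stand-ins for the four imported tables (invented values)
demo_toy <- data.frame(SEQN = c(1, 2, 3), RIDAGEYR = c(34, 51, 8))
bpx_toy  <- data.frame(SEQN = c(1, 2),    BPXDI1   = c(72, 80))
smq_toy  <- data.frame(SEQN = c(2, 3),    SMQ040   = c("Every day", "Not at all"))
alq_toy  <- data.frame(SEQN = 1,          ALQ130   = 2)

# all = TRUE keeps every respondent from both sides (a full join);
# cells with no matching record are filled with NA.
step1      <- merge(demo_toy, bpx_toy, by = "SEQN", all = TRUE)
step2      <- merge(step1,    smq_toy, by = "SEQN", all = TRUE)
merged_toy <- merge(step2,    alq_toy, by = "SEQN", all = TRUE)

dim(merged_toy)
merged_toy
```

Checking `dim()` after each step is a cheap way to confirm that no respondents were dropped or duplicated along the way.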
All at once
Alternatively, you can merge all datasets at once.
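One way to do this in base R is to fold `merge()` over a list with `Reduce()`. This is a sketch under the same assumptions as the one-by-one approach (SEQN is the shared key, all respondents are kept); `toy_tables` and `merge_by_seqn` are hypothetical names and the values are made up. Packages such as plyr or dplyr offer their own join helpers as an alternative.

```r
# Toy stand-ins for the imported tables, collected in one list (invented values)
toy_tables <- list(
  data.frame(SEQN = c(1, 2, 3), RIDAGEYR = c(34, 51, 8)),
  data.frame(SEQN = c(1, 2),    BPXDI1   = c(72, 80)),
  data.frame(SEQN = c(2, 3),    SMQ040   = c("Every day", "Not at all")),
  data.frame(SEQN = 1,          ALQ130   = 2)
)

# Full join of two tables on the respondent identifier
merge_by_seqn <- function(x, y) merge(x, y, by = "SEQN", all = TRUE)

# Reduce() applies the join pairwise across the whole list
merged_all <- Reduce(merge_by_seqn, toy_tables)
merged_all
```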
Saving data for later use
It’s a good practice to save your data for future reference.
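A minimal sketch of saving and reloading with `saveRDS()` and `readRDS()`. The data frame `analytic_toy` holds made-up values, and the file is written to a temporary directory only so that the example runs anywhere; in a real project you would choose a path inside your own data folder. `save()` and `load()` are an alternative when several objects need to be stored together.

```r
# Toy analytic data (invented values)
analytic_toy <- data.frame(SEQN = c(1, 2, 3), BPXDI1 = c(72, 80, NA))

# Write to a temporary location (an assumption made for portability)
out_file <- file.path(tempdir(), "analytic_toy.rds")
saveRDS(analytic_toy, file = out_file)

# Read the object back and confirm that it matches the original
reloaded <- readRDS(out_file)
all.equal(analytic_toy, reloaded)
```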
Exercise (try yourself)
Follow the steps in the exercise section to deepen your understanding and broaden the analysis.
- The following variables were included in this paper but not in the analysis above. Try including them, then create the new analytic data:
  - education level
  - poverty income ratio
  - Sodium intake (mg)
  - Potassium intake (mg)
- Download the NHANES 2015-2016 data and append it to the NHANES 2013-2014 analytic data, keeping the same variables.
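As a hint for the second task, appending two cycles amounts to stacking rows after making sure both data frames carry the same columns. The sketch below uses two toy data frames with made-up values; `toy_2013`, `toy_2015` and the `cycle` column are hypothetical names introduced for illustration.

```r
# Toy analytic data for two cycles (invented values)
toy_2013 <- data.frame(SEQN = c(1, 2),     BPXDI1 = c(72, 80), cycle = "2013-2014")
toy_2015 <- data.frame(SEQN = c(101, 102), BPXDI1 = c(68, 75), cycle = "2015-2016")

# Keep only the columns present in both data frames, in the same order
common_vars <- intersect(names(toy_2013), names(toy_2015))

# Stack the rows; the cycle column records where each row came from
stacked <- rbind(toy_2013[common_vars], toy_2015[common_vars])
table(stacked$cycle)
```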