6 Variable Definitions and Cleaning
This section continues the data preparation process using the combined NHANES and mortality dataset created in the previous section, dat.full.with.mortality. Here, we will construct the final variables required for our survival analysis. This includes defining the exposure, outcome, and follow-up time, as well as creating the final analytic dataset by applying inclusion criteria and handling missing data.
- R Code Chunk 1: Load Complete Merged Data
The following code begins by loading the dat.full.with.mortality.RDS file. This dataset is the complete, merged result from the previous chapter, containing the processed demographic, smoking, and mortality data for all participants. The readRDS() function loads the file from the data/ directory, and the colnames() function is used to display the names of all its columns.
Show/Hide Code
# Load previous data file
dat.full.with.mortality <-
readRDS("data/dat.full.with.mortality.RDS")
colnames(dat.full.with.mortality)
#> [1] "id" "age" "sex"
#> [4] "race" "born" "smoking.age"
#> [7] "smoked.while.child" "smoking" "psu"
#> [10] "strata" "year" "survey.weight.new"
#> [13] "mort_eligstat" "mort_stat" "mort_ucod_leading"
#> [16] "mort_diabetes" "mort_hyperten" "mort_permth_int"
#> [19] "mort_permth_exm"6.1 Creating Key Analysis Variables
- R Code Chunk 2: Variable Construction
This section details the construction of the key variables essential for our survival analysis. This includes defining the exposure variable (age at smoking initiation), the survival outcome, and the follow-up time.
1. Exposure Variable: Age at Smoking Initiation Categories (exposure.cat)
The following code creates the primary exposure variable for our analysis, exposure.cat. The numeric smoking.age variable (SMD030 from NHANES) is categorized into six levels based on age ranges used in the original paper. This categorization is based on established age ranges for smoking initiation used in previous research, and aims to capture distinct developmental and addiction risk profiles. We first inspect the raw distribution of smoking.age before using car::recode() to create the new factor and explicitly setting the level order to ensure “Never smoked” is the reference category.
Show/Hide Code
# Display initial distribution of raw smoking.age
table(dat.full.with.mortality$smoking.age, useNA = "always")
#>
#> 0 5 6 7 8 9 10 11 12 13 14 15 16
#> 31915 4 3 106 100 157 261 240 841 1207 1597 2354 2973
#> 17 18 19 20 21 22 23 24 25 26 27 28 29
#> 2399 3591 1510 1755 995 708 415 263 760 166 173 156 63
#> 30 31 32 33 34 35 36 37 38 39 40 41 42
#> 370 26 71 29 22 143 34 24 28 19 98 9 18
#> 43 44 45 46 47 48 49 50 51 52 53 54 55
#> 14 11 44 11 7 6 5 25 4 2 1 4 8
#> 56 57 58 59 60 62 63 64 65 68 70 71 72
#> 2 1 5 2 2 2 1 1 4 4 1 1 4
#> 74 76 77 <NA>
#> 2 1 1 45537
# Recode smoking.age into categorical exposure.cat
dat.full.with.mortality$exposure.cat <- car::recode(
dat.full.with.mortality$smoking.age, "
0 = 'Never smoked'; 1:9 = 'Started before 10';
10:14 = 'Started at 10-14'; 15:17 = 'Started at 15-17';
18:20 = 'Started at 18-20'; 21:80 = 'Started after 20';
else = NA ", as.factor = TRUE)
# Explicitly set the order of factor levels
dat.full.with.mortality$exposure.cat <-
factor(dat.full.with.mortality$exposure.cat,
levels = c("Never smoked", 'Started before 10',
'Started at 10-14', "Started at 15-17",
"Started at 18-20", "Started after 20"))The following tables are used to inspect the distribution of the newly created exposure variable, both overall and broken down by mortality status.
Show/Hide Code
# Display distributions of the newly created exposure variable
table(dat.full.with.mortality$exposure.cat, useNA = "always")
#>
#> Never smoked Started before 10 Started at 10-14 Started at 15-17
#> 31915 370 4146 7726
#> Started at 18-20 Started after 20 <NA>
#> 6856 4766 45537
# Display distributions of exposure.cat level
# broken down by mort_stat variable
table(dat.full.with.mortality$exposure.cat,
dat.full.with.mortality$mort_stat, useNA = "always")
#>
#> 0 1 <NA>
#> Never smoked 27823 3994 98
#> Started before 10 254 116 0
#> Started at 10-14 3217 922 7
#> Started at 15-17 6193 1522 11
#> Started at 18-20 5438 1405 13
#> Started after 20 3618 1140 8
#> <NA> 3272 150 421152. Outcome and Follow-up Time Variables
The following code constructs the variables for the survival analysis. As noted in the original paper, the time of birth is designated as the baseline to prevent differential start times for exposed and unexposed participants. The primary survival outcome is therefore the time from birth to all-cause mortality. Specifically, this is calculated by combining the participant’s age at the interview (from the RIDAGEYR variable) with their mortality follow-up time from the interview date (from the PERMTH_INT variable). We also define the status_all variable from the raw mortality data to indicate the participant’s final mortality status.
Show/Hide Code
# Survival time Person-Months of Follow-up from Interview date
# (PERMTH_INT)
dat.full.with.mortality$stime.since.interview <-
dat.full.with.mortality$mort_permth_int
summary(dat.full.with.mortality$stime.since.interview)
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> 0.0 59.0 113.0 117.9 172.0 250.0 42252
# Age in month at NHANES interview
# (RIDAGEYR)
dat.full.with.mortality$age.month <- dat.full.with.mortality$age * 12
summary(dat.full.with.mortality$age.month)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.0 120.0 288.0 373.5 624.0 1020.0
# Survival time - Person-Months of Follow-up from birth = age at
# screening (months) + person-months of follow-up from interview.
# This combines the age at interview with the time from interview
# to death or end of follow-up.
dat.full.with.mortality$stime.since.birth <-
with(dat.full.with.mortality, age.month + stime.since.interview)
# Convert to years for consistent interpretation with age
dat.full.with.mortality$stime.since.birth <- dat.full.with.mortality$stime.since.birth/12
# Ensure NA values are consistent if stime.since.interview was NA
dat.full.with.mortality$stime.since.birth[
is.na(dat.full.with.mortality$stime.since.interview)] <- NA
summary(dat.full.with.mortality$stime.since.birth)
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> 19.08 41.50 57.17 57.56 72.92 105.67 42252
# All-cause mortality status (PRIMARY outcome)
dat.full.with.mortality$status_all <-
dat.full.with.mortality$mort_stat
table(dat.full.with.mortality$status_all, useNA = "always")
#>
#> 0 1 <NA>
#> 49815 9249 422523. Final Variable Cleaning
The following code drops mortality-related variables not needed for the final analytic dataset.
Show/Hide Code
# Remove unneeded columns
dat.full.with.mortality$mort_diabetes <- NULL
dat.full.with.mortality$mort_hyperten <- NULL
dat.full.with.mortality$mort_permth_int <- NULL
dat.full.with.mortality$mort_permth_exm <- NULL
dat.full.with.mortality$mort_ucod_leading <- NULL
dat.full.with.mortality$mort_eligstat <- NULL4. Recoding Survey Year: year.cat
The year variable, representing the NHANES survey cycle is recoded into a more descriptive categorical variable, year.cat.
Show/Hide Code
# Recode year variable into a factor with descriptive labels
dat.full.with.mortality$year.cat <- dat.full.with.mortality$year
new_levels <- c(
"1999-2000", "2001-2002", "2003-2004",
"2005-2006", "2007-2008", "2009-2010",
"2011-2012", "2013-2014", "2015-2016",
"2017-2018"
)
levels(dat.full.with.mortality$year.cat) <- new_levels
# Display the unique levels of the new year.cat variable
levels(dat.full.with.mortality$year.cat)
#> [1] "1999-2000" "2001-2002" "2003-2004" "2005-2006" "2007-2008" "2009-2010"
#> [7] "2011-2012" "2013-2014" "2015-2016" "2017-2018"6.2 Constructing the Final Analytic Cohort
- R Code Chunk 3: Final Analytic Dataset Construction
The following code creates the final analytic dataset by applying the specific inclusion and exclusion criteria used in the published paper. The two main steps are restricting the participant age range to 20–79 years and performing a complete-case analysis by excluding observations with missing data for key variables.
1. Apply Age and Variable Exclusions
First, we apply the age restriction, keeping only participants between 20 and 79 years old. We also drop several intermediate or raw variables that are no longer needed for the final analysis. We will remove the smoked.while.child variable. While this variable was created as a simple binary indicator for early smoking, the final analysis relies on the more granular, multi-level exposure.cat variable.
The exposure.cat variable (e.g., ‘Started before 10’, ‘Started at 10-14’) is superior for our research question because it allows us to investigate a dose-response relationship—that is, to see if the risk of mortality changes incrementally with different ages of initiation. A simple binary variable like smoked.while.child would only allow us to compare “early starters” to everyone else, masking the important nuances in risk across different age groups. Therefore, it is excluded from the final analytic dataset.
Show/Hide Code
# Drop the following column not being used
dat.full.with.mortality$smoked.while.child <- NULL
### Analytic Dataset - age 20 - 79
# Inclusion criteria: adults aged between 20 and 79 years
dat.analytic <- subset(dat.full.with.mortality, age>=20 & age < 80)
# Dimensions before age filtering
dim(dat.full.with.mortality)
#> [1] 101316 18
# Dimensions after age filtering
dim(dat.analytic)
#> [1] 50824 18
# Calculate number of participants dropped due to age restriction
dim(dat.full.with.mortality)[1] - dim(dat.analytic)[1]
#> [1] 50492Drop other variables that are not being used in the final analysis.
2. Apply Complete-Case Analysis
Next, we exclude participants who have missing data for the primary outcome (survival time) or the primary exposure (smoking initiation category). This is consistent with the original paper, which notes that about 0.5% of the sample (275 observations) was discarded for this reason. Also, there were no missing values in the covariates variables considered in the main analysis.
Show/Hide Code
# Participants with incomplete data on smoking status
# or mortality were excluded
dat.analytic1 <- dat.analytic[
complete.cases(dat.analytic$stime.since.birth),]
dim(dat.analytic1) # Dimension after removing missing survival time
#> [1] 50690 15
dat.analytic2 <- dat.analytic1[
complete.cases(dat.analytic1$exposure.cat),]
dim(dat.analytic2) # Dimension after removing missing exposure var
#> [1] 50549 15
# Complete case analysis: remove missing values in covariates
dat.complete <- na.omit(dat.analytic2) # No missing values
dim(dat.complete)
#> [1] 50549 15Profile of missing data after removing participants with missing outcome or exposure. This table confirms that no missing values remain in the key covariates.
3. Reconcile Sample Sizes The following code calculates the number of participants dropped at each stage of the filtering process. This helps confirm that our data cleaning has resulted in the same final sample size as the original study.
Show/Hide Code
# Participants dropped - overall
nrow(dat.full.with.mortality) - nrow(dat.complete)
#> [1] 50767
# Participants dropped - due to missing exposure or outcome
nrow(dat.analytic) - nrow(dat.analytic2)
#> [1] 275
# Participants dropped - due to missing covariates
nrow(dat.analytic2) - nrow(dat.complete)
#> [1] 06.3 Exporting Final Datasets
- R Code Chunk 4: Save Final Datasets
The final step is to save our processed datasets. We save three different versions of the data to create useful checkpoints for the analysis:
-
dat.full.with.mortality.RData: The updated, merged dataset before any filtering was applied in this chapter. -
dat.analytic2.RData: The dataset after applying age restrictions and removing participants with missing outcome or exposure data. -
dat.complete.RData: The final, clean analytic dataset containing only complete cases.
This completes the variable definitions and cleaning part of the data preparation stage.
6.4 Chapter Summary and Next Steps
In this chapter, we constructed the core variables for the analysis, including the categorized smoking initiation exposure (exposure.cat) and the survival outcome (time from birth to all-cause mortality). We then applied the study’s inclusion and exclusion criteria to create the final, complete-case analytic dataset with 50,549 participants, matching the sample size of the original paper.
With a clean dataset ready, the next chapter, “Cohort Definition,” will formally summarize the criteria used to define our study population.