9  Descriptive Analysis


This chapter presents descriptive statistics for the final analytic cohort. Cohort characteristics are summarized using sample counts and weighted proportions, the latter accounting for the complex survey design of NHANES. The chapter corresponds to the “Results” section of the paper, in particular the material in Appendix Tables 1 and 2.

This chapter generates the following descriptive tables:

Table Description           R Packages Used   Weighting   Purpose
Unweighted Summary          tableone          Unweighted  Initial exploration of raw sample counts.
Detailed Unweighted Tables  table1            Unweighted  Visually appealing, detailed breakdowns of the cohort.
Appendix Table 1            tableone, survey  Weighted    Reproduces paper: Participant characteristics by mortality status.
Appendix Table 2            tableone, survey  Weighted    Reproduces paper: Participant characteristics by smoking exposure.
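
# Load required packages
library(tableone)    # CreateTableOne(), svyCreateTableOne()
library(table1)      # table1(), t1kable()
library(survey)      # survey design objects
options(survey.want.obsolete = TRUE)
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
library(expss)       # xl_write_file()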

We begin by loading the dat.analytic2 and dat.complete datasets and the final survey design object, w.design0. The unweighted tables below use dat.analytic2, while the weighted tables require w.design0, which carries the weights and design information needed for nationally representative estimates.

# Load the complete case analytic dataset
load(file = "data/dat.analytic2.RData")
load(file = "data/dat.complete.RData")

# Load the subsetted survey design object
w.design0 <- readRDS(file = "data/w.design0.rds")
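
The design object itself was built and subset in an earlier step. As a reminder of what such an object encodes, here is a minimal sketch on simulated data. The column names (strata, psu, wt, eligible) are hypothetical stand-ins and are not the variables behind w.design0; the point is only that the full design is declared first and the analytic subpopulation is then taken with subset().

```r
library(survey)

set.seed(1)
# Simulated stand-in for a survey file: 2 strata x 2 PSUs, 5 people per PSU
# (all column names here are hypothetical)
toy <- data.frame(
  strata   = rep(1:2, each = 10),
  psu      = rep(rep(1:2, each = 5), times = 2),
  wt       = runif(20, min = 100, max = 500),
  eligible = rep(c(TRUE, TRUE, TRUE, FALSE), times = 5)
)

# Declare the full design first ...
des_full <- svydesign(ids = ~psu, strata = ~strata, weights = ~wt,
                      data = toy, nest = TRUE)

# ... then restrict to the analytic subpopulation by subsetting the design,
# rather than filtering the data before calling svydesign()
des_sub <- subset(des_full, eligible)

# The weights of the retained rows sum to the estimated subpopulation size
sum(weights(des_sub))
```

Subsetting the design object, instead of the raw data, keeps the information about the original strata and PSUs available for variance estimation.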

9.1 Exploratory Analysis: Unweighted Summaries

  • R Code Chunk 2: Create Unweighted Descriptive Data Table

The following code creates a basic unweighted descriptive table, referred to as “Table 1”, using the CreateTableOne() function from the tableone package. It reports unweighted counts and percentages for the key variables (smoking exposure, race, and sex), stratified by mortality status. The paper’s main results are based on weighted proportions, so this table is intended only for initial exploration and for checking the raw counts in the analytic sample; its p-values likewise ignore the survey design.

# Unweighted Table 1 summarizing exposure, race, 
# and sex, stratified by mortality status
tab1 <- CreateTableOne(vars = c("exposure.cat", "race", "sex"),
                       strata = "status_all",
                       data = dat.analytic2,
                       addOverall = TRUE,
                       test = TRUE)

# View the summarized version of the table
print(tab1$CatTable)
#>                       Stratified by status_all
#>                        Overall       0             1            p      test
#>   n                    50549         44377         6172                    
#>   exposure.cat (%)                                              <0.001     
#>      Never smoked      28593 (56.6)  26235 (59.1)  2358 (38.2)             
#>      Started before 10   337 ( 0.7)    244 ( 0.5)    93 ( 1.5)             
#>      Started at 10-14   3903 ( 7.7)   3145 ( 7.1)   758 (12.3)             
#>      Started at 15-17   7189 (14.2)   6003 (13.5)  1186 (19.2)             
#>      Started at 18-20   6254 (12.4)   5250 (11.8)  1004 (16.3)             
#>      Started after 20   4273 ( 8.5)   3500 ( 7.9)   773 (12.5)             
#>   race (%)                                                      <0.001     
#>      White             21069 (41.7)  17889 (40.3)  3180 (51.5)             
#>      Black             10977 (21.7)   9471 (21.3)  1506 (24.4)             
#>      Hispanic          13592 (26.9)  12342 (27.8)  1250 (20.3)             
#>      Others             4911 ( 9.7)   4675 (10.5)   236 ( 3.8)             
#>   sex = Female (%)     26158 (51.7)  23544 (53.1)  2614 (42.4)  <0.001
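
To make the printed percentages concrete: each one is a column percentage, that is, a cell count divided by its stratum total. The sketch below rebuilds the exposure.cat rows from the counts printed above using base R only. By default CreateTableOne() compares categorical variables with a chi-squared test, which chisq.test() mirrors here; the object names (counts, pct, test) are ours.

```r
# Counts copied from the exposure.cat rows of the table above
counts <- matrix(
  c(26235, 244, 3145, 6003, 5250, 3500,   # status_all = 0
     2358,  93,  758, 1186, 1004,  773),  # status_all = 1
  ncol = 2,
  dimnames = list(
    exposure.cat = c("Never smoked", "Started before 10", "Started at 10-14",
                     "Started at 15-17", "Started at 18-20", "Started after 20"),
    status_all   = c("0", "1")
  )
)

# Column totals reproduce the n row (44377 and 6172)
colSums(counts)

# Column percentages: each cell divided by its stratum total
pct <- round(100 * prop.table(counts, margin = 2), 1)
pct

# Chi-squared test of independence between exposure category and status
test <- chisq.test(counts)
test$p.value < 0.001
```

For example, 2358 / 6172 gives the 38.2% shown for never-smokers among those with status_all = 1.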

  • R Code Chunk 3: More Detailed Unweighted Tables

The following code uses the table1 package to generate more visually appealing unweighted descriptive tables. These tables offer flexibility in stratifying variables and are useful for detailed unweighted breakdowns of your cohort’s characteristics, though they do not account for survey weights.

1. Create and Display Tables

The following code chunk demonstrates this by creating four different tables, each exploring a different combination of demographic variables, smoking exposure categories, and mortality status. Each table is then rendered as a clean HTML table using table1::t1kable().

# Table of exposure.cat stratified by race and mortality status
tab11 <- table1::table1(~ exposure.cat | race * status_all , 
                        data = dat.analytic2)
# Display as a formatted HTML table in Quarto
table1::t1kable(tab11)
                     White                         Black                         Hispanic                      Others                        Overall
                     0              1              0              1              0              1              0              1              0              1
                     (N=17889)      (N=3180)       (N=9471)       (N=1506)       (N=12342)      (N=1250)       (N=4675)       (N=236)        (N=44377)      (N=6172)
exposure.cat
  Never smoked       9153 (51.2%)   1041 (32.7%)   5729 (60.5%)   605 (40.2%)    8054 (65.3%)   600 (48.0%)    3299 (70.6%)   112 (47.5%)    26235 (59.1%)  2358 (38.2%)
  Started before 10  134 (0.7%)     58 (1.8%)      31 (0.3%)      13 (0.9%)      55 (0.4%)      18 (1.4%)      24 (0.5%)      4 (1.7%)       244 (0.5%)     93 (1.5%)
  Started at 10-14   1685 (9.4%)    430 (13.5%)    503 (5.3%)     166 (11.0%)    788 (6.4%)     135 (10.8%)    169 (3.6%)     27 (11.4%)     3145 (7.1%)    758 (12.3%)
  Started at 15-17   3198 (17.9%)   725 (22.8%)    1082 (11.4%)   267 (17.7%)    1348 (10.9%)   163 (13.0%)    375 (8.0%)     31 (13.1%)     6003 (13.5%)   1186 (19.2%)
  Started at 18-20   2458 (13.7%)   561 (17.6%)    1116 (11.8%)   240 (15.9%)    1249 (10.1%)   174 (13.9%)    427 (9.1%)     29 (12.3%)     5250 (11.8%)   1004 (16.3%)
  Started after 20   1261 (7.0%)    365 (11.5%)    1010 (10.7%)   215 (14.3%)    848 (6.9%)     160 (12.8%)    381 (8.1%)     33 (14.0%)     3500 (7.9%)    773 (12.5%)

Unweighted: Smoking initiation categories stratified by race and mortality status.

# Table of exposure.cat stratified by sex and mortality status
tab12 <- table1::table1(~ exposure.cat | sex * status_all,
                        data = dat.analytic2)
# Display as a formatted HTML table in Quarto
table1::t1kable(tab12)
                     Male                          Female                        Overall
                     0              1              0              1              0              1
                     (N=20833)      (N=3558)       (N=23544)      (N=2614)       (N=44377)      (N=6172)
exposure.cat
  Never smoked       10504 (50.4%)  1030 (28.9%)   15731 (66.8%)  1328 (50.8%)   26235 (59.1%)  2358 (38.2%)
  Started before 10  177 (0.8%)     80 (2.2%)      67 (0.3%)      13 (0.5%)      244 (0.5%)     93 (1.5%)
  Started at 10-14   1923 (9.2%)    572 (16.1%)    1222 (5.2%)    186 (7.1%)     3145 (7.1%)    758 (12.3%)
  Started at 15-17   3414 (16.4%)   823 (23.1%)    2589 (11.0%)   363 (13.9%)    6003 (13.5%)   1186 (19.2%)
  Started at 18-20   3005 (14.4%)   656 (18.4%)    2245 (9.5%)    348 (13.3%)    5250 (11.8%)   1004 (16.3%)
  Started after 20   1810 (8.7%)    397 (11.2%)    1690 (7.2%)    376 (14.4%)    3500 (7.9%)    773 (12.5%)

Unweighted: Smoking initiation categories stratified by sex and mortality status.

# Table of demographics and survey year stratified by mortality status
tab13 <- table1::table1(~ exposure.cat + race + sex + year.cat |
                          status_all , data = dat.analytic2)
# Display as a formatted HTML table in Quarto
table1::t1kable(tab13)
                     0              1              Overall
                     (N=44377)      (N=6172)       (N=50549)
exposure.cat
  Never smoked       26235 (59.1%)  2358 (38.2%)   28593 (56.6%)
  Started before 10  244 (0.5%)     93 (1.5%)      337 (0.7%)
  Started at 10-14   3145 (7.1%)    758 (12.3%)    3903 (7.7%)
  Started at 15-17   6003 (13.5%)   1186 (19.2%)   7189 (14.2%)
  Started at 18-20   5250 (11.8%)   1004 (16.3%)   6254 (12.4%)
  Started after 20   3500 (7.9%)    773 (12.5%)    4273 (8.5%)
race
  White              17889 (40.3%)  3180 (51.5%)   21069 (41.7%)
  Black              9471 (21.3%)   1506 (24.4%)   10977 (21.7%)
  Hispanic           12342 (27.8%)  1250 (20.3%)   13592 (26.9%)
  Others             4675 (10.5%)   236 (3.8%)     4911 (9.7%)
sex
  Male               20833 (46.9%)  3558 (57.6%)   24391 (48.3%)
  Female             23544 (53.1%)  2614 (42.4%)   26158 (51.7%)
year.cat
  1999-2000          3188 (7.2%)    1247 (20.2%)   4435 (8.8%)
  2001-2002          3772 (8.5%)    1073 (17.4%)   4845 (9.6%)
  2003-2004          3583 (8.1%)    921 (14.9%)    4504 (8.9%)
  2005-2006          3907 (8.8%)    667 (10.8%)    4574 (9.0%)
  2007-2008          4705 (10.6%)   769 (12.5%)    5474 (10.8%)
  2009-2010          5204 (11.7%)   563 (9.1%)     5767 (11.4%)
  2011-2012          4770 (10.7%)   392 (6.4%)     5162 (10.2%)
  2013-2014          5111 (11.5%)   285 (4.6%)     5396 (10.7%)
  2015-2016          5132 (11.6%)   170 (2.8%)     5302 (10.5%)
  2017-2018          5005 (11.3%)   85 (1.4%)      5090 (10.1%)

Unweighted: Full cohort characteristics stratified by mortality status.

# Table of race and sex stratified by exposure.cat
tab14 <- table1::table1(~ race + sex | exposure.cat , 
                        data = dat.analytic2)
# Display as a formatted HTML table in Quarto
table1::t1kable(tab14)
            Never smoked       Started before 10  Started at 10-14   Started at 15-17   Started at 18-20   Started after 20   Overall
            (N=28593)          (N=337)            (N=3903)           (N=7189)           (N=6254)           (N=4273)           (N=50549)
race
  White     10194 (35.7%)      192 (57.0%)        2115 (54.2%)       3923 (54.6%)       3019 (48.3%)       1626 (38.1%)       21069 (41.7%)
  Black     6334 (22.2%)       44 (13.1%)         669 (17.1%)        1349 (18.8%)       1356 (21.7%)       1225 (28.7%)       10977 (21.7%)
  Hispanic  8654 (30.3%)       73 (21.7%)         923 (23.6%)        1511 (21.0%)       1423 (22.8%)       1008 (23.6%)       13592 (26.9%)
  Others    3411 (11.9%)       28 (8.3%)          196 (5.0%)         406 (5.6%)         456 (7.3%)         414 (9.7%)         4911 (9.7%)
sex
  Male      11534 (40.3%)      257 (76.3%)        2495 (63.9%)       4237 (58.9%)       3661 (58.5%)       2207 (51.6%)       24391 (48.3%)
  Female    17059 (59.7%)      80 (23.7%)         1408 (36.1%)       2952 (41.1%)       2593 (41.5%)       2066 (48.4%)       26158 (51.7%)

Unweighted: Demographic characteristics stratified by smoking initiation category.
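
The column headers 0 and 1 in the tables above come straight from the coding of status_all. One way to get more readable headers, sketched here on toy data, is to recode the indicator as a labelled factor and attach a display label with table1::label(). The toy data frame and the status_lab name are illustrative only; the Alive/Dead mapping follows the relabelling applied to the weighted tables later in this chapter.

```r
library(table1)

# Toy stand-in for the analytic data; values are illustrative only
toy <- data.frame(
  status_all = c(0, 0, 0, 1, 1, 0, 1, 0),
  sex        = factor(c("Male", "Female", "Female", "Male",
                        "Male", "Female", "Female", "Male"))
)

# Recode the 0/1 indicator as a labelled factor for readable column headers
toy$status_lab <- factor(toy$status_all, levels = c(0, 1),
                         labels = c("Alive", "Dead"))

# Attach a display label that table1() uses as the row heading
label(toy$sex) <- "Sex"

tab_toy <- table1(~ sex | status_lab, data = toy)
tab_toy
```

The same two steps could be applied to dat.analytic2 before calling table1(), leaving the original status_all column untouched.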

2. Save Tables to Excel (Optional)

The tables created with the table1 package can also be saved to an external file, such as an Excel spreadsheet, for further review outside of this book. The following code demonstrates how to save two of these tables using the expss package.

# Full cohort characteristics by mortality status (tab13)
t13 <- as.data.frame(tab13)
expss::xl_write_file(t13, filename = "data/t13.xlsx")

# Demographics by smoking initiation category (tab14)
t14 <- as.data.frame(tab14)
expss::xl_write_file(t14, filename = "data/t14.xlsx")
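
If an Excel writer is not available on a given machine, a plain CSV is a workable fallback. The sketch below uses base R only; t_demo is a hand-typed stand-in for as.data.frame(tab13), filled with the sex rows shown earlier, and the file is written to a temporary directory rather than data/.

```r
# Stand-in for as.data.frame(tab13); the real object comes from table1
t_demo <- data.frame(
  level = c("Male", "Female"),
  Alive = c("20833 (46.9%)", "23544 (53.1%)"),
  Dead  = c("3558 (57.6%)", "2614 (42.4%)")
)

# Write to a temporary file here; in the book this would sit under data/
out_file <- file.path(tempdir(), "t_demo.csv")
write.csv(t_demo, file = out_file, row.names = FALSE)

# Read it back to confirm the round trip preserved the cells
t_back <- read.csv(out_file, stringsAsFactors = FALSE)
identical(t_back$Alive, t_demo$Alive)
```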

9.2 Weighted Descriptive Statistics (Paper Reproduction)

  • R Code Chunk 4: Weighted Descriptive Tables

The following code generates the primary descriptive tables for the analysis. Unlike the previous examples, these tables correctly account for the complex survey design by using the svyCreateTableOne() function on our survey design object (w.design0). This produces nationally representative, weighted percentages.
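
To see why weighting changes the percentages, consider a minimal sketch on toy data in which one group is under-sampled and therefore carries larger weights. The data frame, the weights, and the simplified design (no strata or PSUs) are assumptions made purely for illustration and are unrelated to w.design0.

```r
library(survey)

# Toy data: group B is under-sampled, so its members carry larger weights
toy <- data.frame(
  id  = 1:6,
  grp = factor(c("A", "A", "A", "A", "B", "B")),
  wt  = c(10, 10, 10, 10, 40, 40)
)

# Unweighted proportions: share of sampled rows in each group
p_unw <- prop.table(table(toy$grp))

# Weighted proportions by hand: share of total weight in each group
p_w <- tapply(toy$wt, toy$grp, sum) / sum(toy$wt)

# The same point estimates from a (simplified, no strata/PSU) survey design
des <- svydesign(ids = ~1, weights = ~wt, data = toy)
p_svy <- coef(svymean(~grp, des))

round(rbind(unweighted = as.numeric(p_unw),
            by_hand    = as.numeric(p_w),
            svymean    = as.numeric(p_svy)), 3)
```

Group A makes up two thirds of the sampled rows but only one third of the weighted population, which is the same kind of shift seen between the unweighted and weighted race distributions in this chapter.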

9.2.1 Appendix Table 1: Characteristics by Mortality Status

This first table summarizes participant characteristics (smoking exposure, race, sex, and survey year), stratified by mortality status. The output is designed to reproduce Appendix Table 1 of the supplementary material, where the percentages shown in brackets are weighted by the sampling (interview) weights. Note that the n row of the weighted output is the sum of the weights in each column (an estimated population size), not the number of sampled participants.

# Create weighted Table 1 
# stratified by outcome status (all-cause mortality)
tab13_weighted <- svyCreateTableOne(vars = c("exposure.cat", "race", 
                                             "sex", "year.cat"),
                                    strata = "status_all",
                                    data = w.design0, # CRITICAL: the survey design object, not a data frame
                                    addOverall = TRUE, 
                                    test = TRUE)

# Print the table with weighted proportions,
# specified decimal places, and all factor levels
tab13p_weighted <- print(tab13_weighted,
                         format = "p",         
                         catDigits = 2,        
                         showAllLevels = TRUE, 
                         smd = TRUE)     
# Re-label
colnames(tab13p_weighted)[colnames(tab13p_weighted) 
                          == "0"] <- "Alive"
colnames(tab13p_weighted)[colnames(tab13p_weighted) 
                          == "1"] <- "Dead"
# Order
new_order <- c("level", "Alive", "Dead", 
               "Overall", "p", "test", "SMD")
# Apply the new order to the table object
tab13p_weighted <- tab13p_weighted[, new_order]

# Save the weighted table to CSV 
write.csv(tab13p_weighted, file = "data/Table_App_1_Weighted_Mortality.csv")
# Display the formatted table using kable for a clean Quarto output
kable(tab13p_weighted,
      caption = paste("Weighted Characteristics by All-Cause Mortality",
                      "Status (Analogous to Appendix Table 1)")) %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Weighted Characteristics by All-Cause Mortality Status (Analogous to Appendix Table 1)
                  level              Alive         Dead         Overall       p       test  SMD
n                                    187497900.24  18761712.99  206259613.23
exposure.cat (%)  Never smoked       57.99         36.57        56.04         <0.001        0.450
                  Started before 10  0.49          1.48         0.58
                  Started at 10-14   7.17          12.54        7.66
                  Started at 15-17   15.02         21.01        15.57
                  Started at 18-20   12.24         16.78        12.65
                  Started after 20   7.09          11.61        7.50
race (%)          White              66.42         74.81        67.18         <0.001        0.266
                  Black              11.35         12.90        11.50
                  Hispanic           14.81         7.87         14.18
                  Others             7.42          4.42         7.15
sex (%)           Male               47.88         55.16        48.54         <0.001        0.146
                  Female             52.12         44.84        51.46
year.cat (%)      1999-2000          7.83          19.75        8.92          <0.001        0.793
                  2001-2002          8.42          17.12        9.21
                  2003-2004          8.91          15.36        9.49
                  2005-2006          9.46          12.42        9.73
                  2007-2008          9.84          10.81        9.93
                  2009-2010          10.28         8.31         10.10
                  2011-2012          10.65         6.73         10.29
                  2013-2014          11.11         5.36         10.59
                  2015-2016          11.57         2.79         10.78
                  2017-2018          11.92         1.34         10.96
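
The SMD column reports standardized mean differences between the strata. For a two-level variable this reduces to the absolute difference in proportions divided by a pooled standard deviation, which can be checked by hand from the sex rows above. The sketch below does only that; variables with more than two levels, such as race, are handled by tableone through a multivariate generalization that is not attempted here, and the variable names are ours.

```r
# Weighted proportion male among the Alive and Dead columns, from the table above
p_alive <- 47.88 / 100
p_dead  <- 55.16 / 100

# Standardized mean difference for a binary variable:
# absolute difference over the pooled standard deviation
smd_sex <- abs(p_dead - p_alive) /
  sqrt((p_alive * (1 - p_alive) + p_dead * (1 - p_dead)) / 2)

round(smd_sex, 3)
```

The result rounds to 0.146, in agreement with the SMD printed for sex.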

9.2.2 Appendix Table 2: Characteristics by Smoking Initiation Exposure

This second table reproduces the results from Appendix Table 2 of the supplementary material. It summarizes the demographic characteristics (race and sex) of the cohort, stratified by the exposure.cat variable (smoking initiation categories). This provides the nationally representative, weighted demographic composition within each smoking exposure group as it accounts for the complex survey design.

# Create weighted Table 2 stratified by smoking initiation categories
tab14_weighted <- svyCreateTableOne(vars = c("race", "sex"),
                                    strata = "exposure.cat",
                                    data = w.design0, # CRITICAL: the survey design object, not a data frame
                                    addOverall = TRUE,
                                    test = TRUE)

# Print the table with weighted proportions 
# and specified decimal places
tab14p_weighted <- print(tab14_weighted,
                         format = "p",
                         catDigits = 2,
                         showAllLevels = TRUE,
                         smd = TRUE)

# Define the desired column order
new_order_t2 <- c("level", "Never smoked", "Started before 10", 
                  "Started at 10-14", "Started at 15-17", 
                  "Started at 18-20", "Started after 20", 
                  "Overall", "p", "test", "SMD")
# Apply the new order to the table object
tab14p_weighted <- tab14p_weighted[, new_order_t2]

# Save the weighted table to CSV
write.csv(tab14p_weighted, file = "data/Table_App_2_Weighted_Exposure.csv")
# Display the formatted table using kable for a clean Quarto output
kable(tab14p_weighted,
      caption = paste("Weighted Characteristics by Smoking Initiation",
                      "Categories (Analogous to Appendix Table 2)")) %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Weighted Characteristics by Smoking Initiation Categories (Analogous to Appendix Table 2)
          level     Never smoked       Started before 10  Started at 10-14   Started at 15-17   Started at 18-20   Started after 20   Overall       p       test  SMD
n                   115595521.42       1188555.71         15799589.69        32111621.33        26098412.76        15465912.33        206259613.23
race (%)  White     62.54              74.05              75.24              76.89              72.65              63.69              67.18         <0.001        0.215
          Black     12.53              6.92               8.07               8.47               10.39              15.77              11.50
          Hispanic  16.50              10.22              12.37              10.31              11.09              12.20              14.18
          Others    8.42               8.81               4.32               4.34               5.87               8.34               7.15
sex (%)   Male      42.98              73.85              58.70              56.37              55.14              50.42              48.54         <0.001        0.251
          Female    57.02              26.15              41.30              43.63              44.86              49.58              51.46

This completes the descriptive analysis section of the statistical analysis stage.


9.3 Chapter Summary and Next Steps

In this chapter, we generated the primary descriptive statistics for the study cohort. By using the survey design object, we successfully reproduced the weighted characteristics of the participants, creating tables analogous to Appendix Tables 1 and 2 from the paper’s supplementary material. This gives us a clear, nationally representative picture of our study population.

Now that we understand the characteristics of our cohort, we will move on to the core analysis in the next chapter, “Survival Analysis,” where we will investigate the primary relationship between smoking initiation and mortality.