9 Descriptive Analysis
This chapter presents descriptive statistics for the final analytic cohort, giving an overview of participant characteristics. Cohort characteristics are summarized using sample counts and weighted proportions, the latter accounting for the complex survey design of NHANES. The chapter corresponds to the “Results” section of the paper, in particular Appendix Tables 1 and 2.
This chapter will generate the following key descriptive tables:
| Table Description | R Packages Used | Weighting | Purpose |
|---|---|---|---|
| Unweighted Summary | tableone | Unweighted | Initial exploration of raw sample counts. |
| Detailed Unweighted Tables | table1 | Unweighted | Visually appealing, detailed breakdowns of the cohort. |
| Appendix Table 1 | survey, tableone | Weighted | Reproduces paper: Participant characteristics by mortality status. |
| Appendix Table 2 | survey, tableone | Weighted | Reproduces paper: Participant characteristics by smoking exposure. |
- R Code Chunk 1: Load Analytic Data and Survey Design
We begin by loading the dat.analytic2 and dat.complete datasets, and the final survey design object, w.design0. These are essential for generating accurate weighted descriptive statistics, as required for nationally representative results.
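For readers who want to see what a survey design object looks like, the sketch below builds a small one from scratch. It is an illustration under stated assumptions only: the toy data frame, its column names (psu, strata, weight, eligible), and the object names design.full and design.analytic are hypothetical, and this is not necessarily how w.design0 itself was constructed.

```r
library(survey)

# Hypothetical toy data standing in for the full survey sample
set.seed(1)
dat.full.toy <- data.frame(
  id       = 1:12,
  psu      = rep(1:2, times = 6),          # primary sampling unit within stratum
  strata   = rep(1:3, each = 4),           # sampling stratum
  weight   = runif(12, 500, 5000),         # interview weight
  eligible = rep(c(TRUE, TRUE, FALSE), 4)  # analytic-cohort indicator
)

# Declare the design on the full sample first
design.full <- svydesign(ids = ~psu, strata = ~strata, weights = ~weight,
                         data = dat.full.toy, nest = TRUE)

# Then restrict to the analytic cohort by subsetting the design object
design.analytic <- subset(design.full, eligible)
```

Subsetting the design object, rather than filtering the data frame before calling svydesign(), keeps the sampling structure of the full sample available for variance estimation.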
9.1 Exploratory Analysis: Unweighted Summaries
- R Code Chunk 2: Create Unweighted Descriptive Data Table
The following code creates a basic unweighted descriptive data table, referred to as “Table 1”, using the CreateTableOne() function from the tableone package. This table provides unweighted counts and percentages for our key variables, stratified by mortality status. While the paper’s main results are based on weighted proportions, this unweighted table can be useful for initial data exploration and understanding the raw counts of variables within your specific analytic sample.
library(tableone)  # provides CreateTableOne(); harmless if already loaded

# Unweighted Table 1 summarizing exposure, race,
# and sex, stratified by mortality status
tab1 <- CreateTableOne(vars = c("exposure.cat", "race", "sex"),
                       strata = "status_all",
                       data = dat.analytic2,
                       addOverall = TRUE,
                       test = TRUE)
# Print only the categorical summary (counts and percentages)
print(tab1$CatTable)
#>                      Stratified by status_all
#>                      Overall       0             1            p       test
#> n                    50549         44377         6172
#> exposure.cat (%)                                              <0.001
#>    Never smoked      28593 (56.6)  26235 (59.1)  2358 (38.2)
#>    Started before 10   337 ( 0.7)    244 ( 0.5)    93 ( 1.5)
#>    Started at 10-14   3903 ( 7.7)   3145 ( 7.1)   758 (12.3)
#>    Started at 15-17   7189 (14.2)   6003 (13.5)  1186 (19.2)
#>    Started at 18-20   6254 (12.4)   5250 (11.8)  1004 (16.3)
#>    Started after 20   4273 ( 8.5)   3500 ( 7.9)   773 (12.5)
#> race (%)                                                      <0.001
#>    White             21069 (41.7)  17889 (40.3)  3180 (51.5)
#>    Black             10977 (21.7)   9471 (21.3)  1506 (24.4)
#>    Hispanic          13592 (26.9)  12342 (27.8)  1250 (20.3)
#>    Others             4911 ( 9.7)   4675 (10.5)   236 ( 3.8)
#> sex = Female (%)     26158 (51.7)  23544 (53.1)  2614 (42.4)  <0.001
- R Code Chunk 3: More Detailed Unweighted Tables
The following code uses the table1 package to generate more visually appealing unweighted descriptive tables. These tables offer flexibility in stratifying variables and are useful for detailed unweighted breakdowns of your cohort’s characteristics, though they do not account for survey weights.
1. Create and Display Tables
The following code chunk demonstrates this by creating four different tables, each exploring a different combination of demographic variables, smoking exposure categories, and mortality status. Each table is then rendered as a clean HTML table using table1::t1kable().
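A minimal sketch of how tables like these can be produced with table1 follows. It assumes toy data that reuses the chapter's variable names; the data values and the object names (toy, tab_race, tab_sex, tab_status, tab_exposure) are hypothetical, and the formulas are inferred from the table captions rather than taken from the original code.

```r
library(table1)

# Hypothetical toy data with the same variable names as the analytic cohort
set.seed(42)
n <- 200
toy <- data.frame(
  exposure.cat = factor(sample(c("Never smoked", "Started at 15-17",
                                 "Started after 20"), n, replace = TRUE)),
  race         = factor(sample(c("White", "Black", "Hispanic", "Others"),
                               n, replace = TRUE)),
  sex          = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  status_all   = factor(sample(c(0, 1), n, replace = TRUE,
                               prob = c(0.85, 0.15)))
)

# Variables to summarize go left of `|`, stratifying variables to the right.
# `race * status_all` nests mortality status within each race group.
tab_race     <- table1(~ exposure.cat | race * status_all, data = toy)
tab_sex      <- table1(~ exposure.cat | sex * status_all, data = toy)
tab_status   <- table1(~ exposure.cat + race + sex | status_all, data = toy)
tab_exposure <- table1(~ race + sex | exposure.cat, data = toy)

# Render one table via kable; this step needs the kableExtra package
if (requireNamespace("kableExtra", quietly = TRUE)) {
  t1kable(tab_race)
}
```

None of these calls use survey weights, so the resulting counts and percentages describe the sample rather than the population.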
| | White: 0 | White: 1 | Black: 0 | Black: 1 | Hispanic: 0 | Hispanic: 1 | Others: 0 | Others: 1 | Overall: 0 | Overall: 1 |
|---|---|---|---|---|---|---|---|---|---|---|
| | (N=17889) | (N=3180) | (N=9471) | (N=1506) | (N=12342) | (N=1250) | (N=4675) | (N=236) | (N=44377) | (N=6172) |
| exposure.cat | ||||||||||
| Never smoked | 9153 (51.2%) | 1041 (32.7%) | 5729 (60.5%) | 605 (40.2%) | 8054 (65.3%) | 600 (48.0%) | 3299 (70.6%) | 112 (47.5%) | 26235 (59.1%) | 2358 (38.2%) |
| Started before 10 | 134 (0.7%) | 58 (1.8%) | 31 (0.3%) | 13 (0.9%) | 55 (0.4%) | 18 (1.4%) | 24 (0.5%) | 4 (1.7%) | 244 (0.5%) | 93 (1.5%) |
| Started at 10-14 | 1685 (9.4%) | 430 (13.5%) | 503 (5.3%) | 166 (11.0%) | 788 (6.4%) | 135 (10.8%) | 169 (3.6%) | 27 (11.4%) | 3145 (7.1%) | 758 (12.3%) |
| Started at 15-17 | 3198 (17.9%) | 725 (22.8%) | 1082 (11.4%) | 267 (17.7%) | 1348 (10.9%) | 163 (13.0%) | 375 (8.0%) | 31 (13.1%) | 6003 (13.5%) | 1186 (19.2%) |
| Started at 18-20 | 2458 (13.7%) | 561 (17.6%) | 1116 (11.8%) | 240 (15.9%) | 1249 (10.1%) | 174 (13.9%) | 427 (9.1%) | 29 (12.3%) | 5250 (11.8%) | 1004 (16.3%) |
| Started after 20 | 1261 (7.0%) | 365 (11.5%) | 1010 (10.7%) | 215 (14.3%) | 848 (6.9%) | 160 (12.8%) | 381 (8.1%) | 33 (14.0%) | 3500 (7.9%) | 773 (12.5%) |
Unweighted: Smoking initiation categories stratified by race and mortality status.
| | Male: 0 | Male: 1 | Female: 0 | Female: 1 | Overall: 0 | Overall: 1 |
|---|---|---|---|---|---|---|
| | (N=20833) | (N=3558) | (N=23544) | (N=2614) | (N=44377) | (N=6172) |
| exposure.cat | ||||||
| Never smoked | 10504 (50.4%) | 1030 (28.9%) | 15731 (66.8%) | 1328 (50.8%) | 26235 (59.1%) | 2358 (38.2%) |
| Started before 10 | 177 (0.8%) | 80 (2.2%) | 67 (0.3%) | 13 (0.5%) | 244 (0.5%) | 93 (1.5%) |
| Started at 10-14 | 1923 (9.2%) | 572 (16.1%) | 1222 (5.2%) | 186 (7.1%) | 3145 (7.1%) | 758 (12.3%) |
| Started at 15-17 | 3414 (16.4%) | 823 (23.1%) | 2589 (11.0%) | 363 (13.9%) | 6003 (13.5%) | 1186 (19.2%) |
| Started at 18-20 | 3005 (14.4%) | 656 (18.4%) | 2245 (9.5%) | 348 (13.3%) | 5250 (11.8%) | 1004 (16.3%) |
| Started after 20 | 1810 (8.7%) | 397 (11.2%) | 1690 (7.2%) | 376 (14.4%) | 3500 (7.9%) | 773 (12.5%) |
Unweighted: Smoking initiation categories stratified by sex and mortality status.
| | 0 | 1 | Overall |
|---|---|---|---|
| | (N=44377) | (N=6172) | (N=50549) |
| exposure.cat | |||
| Never smoked | 26235 (59.1%) | 2358 (38.2%) | 28593 (56.6%) |
| Started before 10 | 244 (0.5%) | 93 (1.5%) | 337 (0.7%) |
| Started at 10-14 | 3145 (7.1%) | 758 (12.3%) | 3903 (7.7%) |
| Started at 15-17 | 6003 (13.5%) | 1186 (19.2%) | 7189 (14.2%) |
| Started at 18-20 | 5250 (11.8%) | 1004 (16.3%) | 6254 (12.4%) |
| Started after 20 | 3500 (7.9%) | 773 (12.5%) | 4273 (8.5%) |
| race | |||
| White | 17889 (40.3%) | 3180 (51.5%) | 21069 (41.7%) |
| Black | 9471 (21.3%) | 1506 (24.4%) | 10977 (21.7%) |
| Hispanic | 12342 (27.8%) | 1250 (20.3%) | 13592 (26.9%) |
| Others | 4675 (10.5%) | 236 (3.8%) | 4911 (9.7%) |
| sex | |||
| Male | 20833 (46.9%) | 3558 (57.6%) | 24391 (48.3%) |
| Female | 23544 (53.1%) | 2614 (42.4%) | 26158 (51.7%) |
| year.cat | |||
| 1999-2000 | 3188 (7.2%) | 1247 (20.2%) | 4435 (8.8%) |
| 2001-2002 | 3772 (8.5%) | 1073 (17.4%) | 4845 (9.6%) |
| 2003-2004 | 3583 (8.1%) | 921 (14.9%) | 4504 (8.9%) |
| 2005-2006 | 3907 (8.8%) | 667 (10.8%) | 4574 (9.0%) |
| 2007-2008 | 4705 (10.6%) | 769 (12.5%) | 5474 (10.8%) |
| 2009-2010 | 5204 (11.7%) | 563 (9.1%) | 5767 (11.4%) |
| 2011-2012 | 4770 (10.7%) | 392 (6.4%) | 5162 (10.2%) |
| 2013-2014 | 5111 (11.5%) | 285 (4.6%) | 5396 (10.7%) |
| 2015-2016 | 5132 (11.6%) | 170 (2.8%) | 5302 (10.5%) |
| 2017-2018 | 5005 (11.3%) | 85 (1.4%) | 5090 (10.1%) |
Unweighted: Full cohort characteristics stratified by mortality status.
| | Never smoked | Started before 10 | Started at 10-14 | Started at 15-17 | Started at 18-20 | Started after 20 | Overall |
|---|---|---|---|---|---|---|---|
| | (N=28593) | (N=337) | (N=3903) | (N=7189) | (N=6254) | (N=4273) | (N=50549) |
| race | |||||||
| White | 10194 (35.7%) | 192 (57.0%) | 2115 (54.2%) | 3923 (54.6%) | 3019 (48.3%) | 1626 (38.1%) | 21069 (41.7%) |
| Black | 6334 (22.2%) | 44 (13.1%) | 669 (17.1%) | 1349 (18.8%) | 1356 (21.7%) | 1225 (28.7%) | 10977 (21.7%) |
| Hispanic | 8654 (30.3%) | 73 (21.7%) | 923 (23.6%) | 1511 (21.0%) | 1423 (22.8%) | 1008 (23.6%) | 13592 (26.9%) |
| Others | 3411 (11.9%) | 28 (8.3%) | 196 (5.0%) | 406 (5.6%) | 456 (7.3%) | 414 (9.7%) | 4911 (9.7%) |
| sex | |||||||
| Male | 11534 (40.3%) | 257 (76.3%) | 2495 (63.9%) | 4237 (58.9%) | 3661 (58.5%) | 2207 (51.6%) | 24391 (48.3%) |
| Female | 17059 (59.7%) | 80 (23.7%) | 1408 (36.1%) | 2952 (41.1%) | 2593 (41.5%) | 2066 (48.4%) | 26158 (51.7%) |
Unweighted: Demographic characteristics stratified by smoking initiation category.
2. Save Tables to Excel (Optional)
The tables created with the table1 package can also be saved to an external file, such as an Excel spreadsheet, for further review outside of this book. The following code demonstrates how to save two of these tables using the expss package.
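One way this export could look is sketched below. It is a hedged illustration, not the original code: the toy data, object names, sheet name, and output path are hypothetical, the table is converted with as.data.frame() before writing (an assumption about how table1 objects are best passed on), and the workbook itself is created and saved with openxlsx, which expss::xl_write() writes into.

```r
library(table1)
library(openxlsx)
library(expss)

# Hypothetical toy data reusing the chapter's variable names
toy <- data.frame(
  exposure.cat = factor(c("Never smoked", "Started at 15-17", "Never smoked",
                          "Started after 20", "Never smoked",
                          "Started at 15-17")),
  status_all   = factor(c(0, 0, 1, 0, 1, 1))
)
tab_status <- table1(~ exposure.cat | status_all, data = toy)

# Create a workbook with one sheet and write the table into it
wb <- createWorkbook()
sh <- addWorksheet(wb, "By mortality status")
xl_write(as.data.frame(tab_status), wb, sh)

# Save to a temporary location (replace with a project path as needed)
out_file <- file.path(tempdir(), "unweighted_tables.xlsx")
saveWorkbook(wb, out_file, overwrite = TRUE)
```

A second table can be written to the same workbook by adding another worksheet before saving.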
9.2 Weighted Descriptive Statistics (Paper Reproduction)
- R Code Chunk 4: Weighted Descriptive Tables
The following code generates the primary descriptive tables for the analysis. Unlike the previous examples, these tables correctly account for the complex survey design by using the svyCreateTableOne() function on our survey design object (w.design0). This produces nationally representative, weighted percentages.
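To make the difference between weighted and unweighted summaries concrete, here is a small worked example in base R. The four respondents and their weights are invented for illustration; only the arithmetic carries over to the real data.

```r
# Toy sample: each respondent represents a different number of people
smoker <- c(1, 0, 0, 1)            # 1 = ever smoked (hypothetical)
weight <- c(100, 300, 500, 100)    # hypothetical sampling weights

# Unweighted proportion: every respondent counts once
unweighted_prop <- mean(smoker)                      # 2 / 4

# Weighted proportion: every respondent counts `weight` times
weighted_prop <- sum(weight * smoker) / sum(weight)  # 200 / 1000

# Weighted "n": the population size the sample represents
weighted_n <- sum(weight)

print(c(unweighted = unweighted_prop, weighted = weighted_prop))
```

The same logic explains two features of the tables in this chapter: the n row of a weighted table is a sum of weights (about 206 million) rather than a count of respondents, and a group's share can shift considerably once weights are applied (White participants make up 41.7% of the unweighted sample but 67.18% after weighting).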
9.2.1 Appendix Table 1: Characteristics by Mortality Status
This first table summarizes participant characteristics (smoking exposure, race, sex, and survey year), stratified by mortality status. The output is designed to reproduce Appendix Table 1 of the supplementary material, where the bracketed percentages are weighted by the sampling (interview) weights.
# Create weighted Table 1
# stratified by outcome status (all-cause mortality)
tab13_weighted <- svyCreateTableOne(vars = c("exposure.cat", "race",
                                             "sex", "year.cat"),
                                    strata = "status_all",
                                    data = w.design0, # CRITICAL: survey design object, not a data frame
                                    addOverall = TRUE,
                                    test = TRUE)
# Print the table with weighted proportions,
# specified decimal places, and all factor levels
tab13p_weighted <- print(tab13_weighted,
                         format = "p",
                         catDigits = 2,
                         showAllLevels = TRUE,
                         smd = TRUE)
# Re-label the status columns
colnames(tab13p_weighted)[colnames(tab13p_weighted) == "0"] <- "Alive"
colnames(tab13p_weighted)[colnames(tab13p_weighted) == "1"] <- "Dead"
# Define the desired column order
new_order <- c("level", "Alive", "Dead",
               "Overall", "p", "test", "SMD")
# Apply the new order to the table object
tab13p_weighted <- tab13p_weighted[, new_order]
# Save the weighted table to CSV
write.csv(tab13p_weighted, file = "data/Table_App_1_Weighted_Mortality.csv")
| | level | Alive | Dead | Overall | p | test | SMD |
|---|---|---|---|---|---|---|---|
| n | | 187497900.24 | 18761712.99 | 206259613.23 | | | |
| exposure.cat (%) | Never smoked | 57.99 | 36.57 | 56.04 | <0.001 | | 0.450 |
| | Started before 10 | 0.49 | 1.48 | 0.58 | | | |
| | Started at 10-14 | 7.17 | 12.54 | 7.66 | | | |
| | Started at 15-17 | 15.02 | 21.01 | 15.57 | | | |
| | Started at 18-20 | 12.24 | 16.78 | 12.65 | | | |
| | Started after 20 | 7.09 | 11.61 | 7.50 | | | |
| race (%) | White | 66.42 | 74.81 | 67.18 | <0.001 | | 0.266 |
| | Black | 11.35 | 12.90 | 11.50 | | | |
| | Hispanic | 14.81 | 7.87 | 14.18 | | | |
| | Others | 7.42 | 4.42 | 7.15 | | | |
| sex (%) | Male | 47.88 | 55.16 | 48.54 | <0.001 | | 0.146 |
| | Female | 52.12 | 44.84 | 51.46 | | | |
| year.cat (%) | 1999-2000 | 7.83 | 19.75 | 8.92 | <0.001 | | 0.793 |
| | 2001-2002 | 8.42 | 17.12 | 9.21 | | | |
| | 2003-2004 | 8.91 | 15.36 | 9.49 | | | |
| | 2005-2006 | 9.46 | 12.42 | 9.73 | | | |
| | 2007-2008 | 9.84 | 10.81 | 9.93 | | | |
| | 2009-2010 | 10.28 | 8.31 | 10.10 | | | |
| | 2011-2012 | 10.65 | 6.73 | 10.29 | | | |
| | 2013-2014 | 11.11 | 5.36 | 10.59 | | | |
| | 2015-2016 | 11.57 | 2.79 | 10.78 | | | |
| | 2017-2018 | 11.92 | 1.34 | 10.96 | | | |
9.2.2 Appendix Table 2: Characteristics by Smoking Initiation Exposure
This second table reproduces the results from Appendix Table 2 of the supplementary material. It summarizes the demographic characteristics (race and sex) of the cohort, stratified by the exposure.cat variable (smoking initiation categories). This provides the nationally representative, weighted demographic composition within each smoking exposure group as it accounts for the complex survey design.
# Create weighted Table 2 stratified by smoking initiation categories
tab14_weighted <- svyCreateTableOne(vars = c("race", "sex"),
                                    strata = "exposure.cat",
                                    data = w.design0, # CRITICAL
                                    addOverall = TRUE,
                                    test = TRUE)
# Print the table with weighted proportions
# and specified decimal places
tab14p_weighted <- print(tab14_weighted,
                         format = "p",
                         catDigits = 2,
                         showAllLevels = TRUE,
                         smd = TRUE)
# Define the desired column order
new_order_t2 <- c("level", "Never smoked", "Started before 10",
                  "Started at 10-14", "Started at 15-17",
                  "Started at 18-20", "Started after 20",
                  "Overall", "p", "test", "SMD")
# Apply the new order to the table object
tab14p_weighted <- tab14p_weighted[, new_order_t2]
# Save the weighted table to CSV
write.csv(tab14p_weighted, file = "data/Table_App_2_Weighted_Exposure.csv")
| | level | Never smoked | Started before 10 | Started at 10-14 | Started at 15-17 | Started at 18-20 | Started after 20 | Overall | p | test | SMD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| n | | 115595521.42 | 1188555.71 | 15799589.69 | 32111621.33 | 26098412.76 | 15465912.33 | 206259613.23 | | | |
| race (%) | White | 62.54 | 74.05 | 75.24 | 76.89 | 72.65 | 63.69 | 67.18 | <0.001 | | 0.215 |
| | Black | 12.53 | 6.92 | 8.07 | 8.47 | 10.39 | 15.77 | 11.50 | | | |
| | Hispanic | 16.50 | 10.22 | 12.37 | 10.31 | 11.09 | 12.20 | 14.18 | | | |
| | Others | 8.42 | 8.81 | 4.32 | 4.34 | 5.87 | 8.34 | 7.15 | | | |
| sex (%) | Male | 42.98 | 73.85 | 58.70 | 56.37 | 55.14 | 50.42 | 48.54 | <0.001 | | 0.251 |
| | Female | 57.02 | 26.15 | 41.30 | 43.63 | 44.86 | 49.58 | 51.46 | | | |
This completes the descriptive analysis section of the statistical analysis stage.
9.3 Chapter Summary and Next Steps
In this chapter, we generated the primary descriptive statistics for the study cohort. By using the survey design object, we successfully reproduced the weighted characteristics of the participants, creating tables analogous to Appendix Tables 1 and 2 from the paper’s supplementary material. This gives us a clear, nationally representative picture of our study population.
Now that we understand the characteristics of our cohort, we will move on to the core analysis in the next chapter, “Survival Analysis,” where we will investigate the primary relationship between smoking initiation and mortality.