9 Descriptive Analysis

This chapter presents descriptive statistics for the final analytic cohort, giving an overview of participant characteristics. Cohort characteristics are summarized using weighted proportions and sample counts, accounting for the complex survey design of NHANES. The chapter corresponds to the “Results” section of the paper, in particular the content behind Appendix Tables 1 and 2.

This chapter will generate the following key descriptive tables:

Table Description	R Packages Used	Weighting	Purpose
Unweighted Summary	`tableone`	Unweighted	Initial exploration of raw sample counts.
Detailed Unweighted Tables	`table1`	Unweighted	Visually appealing, detailed breakdowns of the cohort.
Appendix Table 1	`tableone`, `survey`	Weighted	Reproduces paper: Participant characteristics by mortality status.
Appendix Table 2	`tableone`, `survey`	Weighted	Reproduces paper: Participant characteristics by smoking exposure.

# Load required packages
library(tableone)
library(table1)
library(survey)
options(survey.want.obsolete = TRUE)
library(knitr)
library(kableExtra)
library(expss)

R Code Chunk 1: Load Analytic Data and Survey Design

We begin by loading the dat.analytic2 and dat.complete datasets, and the final survey design object, w.design0. These are essential for generating accurate weighted descriptive statistics, as required for nationally representative results.

# Load the analytic datasets (load() restores the objects saved in each file)
load(file = "data/dat.analytic2.RData")
load(file = "data/dat.complete.RData")

# Load the subsetted survey design object
w.design0 <- readRDS(file = "data/w.design0.rds")
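
For readers who have not seen how an object like w.design0 comes about, the following is a minimal sketch and not the code used to build it. It assumes the usual survey-package workflow of declaring the design on the full sample with svydesign() and then restricting it with subset(). The data and every column name (psu, stratum, wt, eligible, sex) are invented for illustration.

```r
library(survey)

# Toy data: every value and column name here is hypothetical (not NHANES)
toy <- data.frame(
  psu      = c(1, 1, 2, 2, 1, 1, 2, 2),
  stratum  = c(1, 1, 1, 1, 2, 2, 2, 2),
  wt       = c(1000, 3000, 2000, 2000, 500, 1500, 4000, 1000),
  eligible = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE),
  sex      = factor(c("Male", "Female", "Female", "Male",
                      "Male", "Female", "Female", "Male"))
)

# Declare the design on the FULL sample first
full_design <- svydesign(ids = ~psu, strata = ~stratum, weights = ~wt,
                         nest = TRUE, data = toy)

# Restrict to the analytic cohort with subset(), which keeps the
# design information needed for variance estimation
sub_design <- subset(full_design, eligible)

# Weighted proportions of sex within the subpopulation
est <- svymean(~sex, sub_design)
print(est)
```

Subsetting the design object, rather than filtering the data frame before calling svydesign(), is the commonly recommended way to analyse a subpopulation of a complex survey.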

9.1 Exploratory Analysis: Unweighted Summaries

R Code Chunk 2: Create Unweighted Descriptive Data Table

The following code creates a basic unweighted descriptive table, referred to as “Table 1”, using the CreateTableOne() function from the tableone package. It reports unweighted counts and percentages for the key variables (smoking exposure category, race, and sex), overall and stratified by mortality status. Although the paper’s main results are based on weighted proportions, the unweighted table is useful for initial data exploration and for checking the raw counts in the analytic sample.

# Unweighted Table 1 summarizing exposure, race,
# and sex, stratified by mortality status
tab1 <- CreateTableOne(vars = c("exposure.cat", "race", "sex"),
                       strata = "status_all",
                       data = dat.analytic2,
                       addOverall = TRUE,
                       test = TRUE)

# View the summarized version of the table
print(tab1$CatTable)
#>                       Stratified by status_all
#>                        Overall       0             1            p      test
#>   n                    50549         44377         6172                    
#>   exposure.cat (%)                                              <0.001     
#>      Never smoked      28593 (56.6)  26235 (59.1)  2358 (38.2)             
#>      Started before 10   337 ( 0.7)    244 ( 0.5)    93 ( 1.5)             
#>      Started at 10-14   3903 ( 7.7)   3145 ( 7.1)   758 (12.3)             
#>      Started at 15-17   7189 (14.2)   6003 (13.5)  1186 (19.2)             
#>      Started at 18-20   6254 (12.4)   5250 (11.8)  1004 (16.3)             
#>      Started after 20   4273 ( 8.5)   3500 ( 7.9)   773 (12.5)             
#>   race (%)                                                      <0.001     
#>      White             21069 (41.7)  17889 (40.3)  3180 (51.5)             
#>      Black             10977 (21.7)   9471 (21.3)  1506 (24.4)             
#>      Hispanic          13592 (26.9)  12342 (27.8)  1250 (20.3)             
#>      Others             4911 ( 9.7)   4675 (10.5)   236 ( 3.8)             
#>   sex = Female (%)     26158 (51.7)  23544 (53.1)  2614 (42.4)  <0.001
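
As a rough cross-check of what each cell in such a table represents, the sketch below rebuilds one stratified row by hand in base R. The data are invented, and the use of chisq.test() reflects my understanding of the default test tableone applies to categorical variables, so treat that as an assumption rather than a statement about the package internals.

```r
# Toy data: invented values, not the analytic cohort
toy <- data.frame(
  status = factor(rep(c(0, 1), times = c(60, 40))),
  sex    = factor(c(rep("Male", 20), rep("Female", 40),
                    rep("Male", 30), rep("Female", 10)))
)

# Raw counts: rows are sex, columns are mortality status
counts <- table(toy$sex, toy$status)

# Column percentages: share of each sex within each status group
pct <- prop.table(counts, margin = 2) * 100

# Chi-squared test of independence between sex and status
res <- chisq.test(counts)

print(counts)
print(round(pct, 1))
print(res$p.value)
```

The percentages are column percentages, which is why each stratum in the table above sums to 100 within a variable.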

R Code Chunk 3: More Detailed Unweighted Tables

The following code uses the table1 package to generate more visually appealing unweighted descriptive tables. These tables offer flexibility in stratifying variables and are useful for detailed unweighted breakdowns of your cohort’s characteristics, though they do not account for survey weights.

1. Create and Display Tables

The following code chunk demonstrates this by creating four different tables, each exploring a different combination of demographic variables, smoking exposure categories, and mortality status. Each table is then rendered as a clean HTML table using table1::t1kable().

# Table of exposure.cat stratified by race and mortality status
tab11 <- table1::table1(~ exposure.cat | race * status_all,
                        data = dat.analytic2)
# Display as a formatted HTML table in Quarto
table1::t1kable(tab11)

	White		Black		Hispanic		Others		Overall
	0	1	0	1	0	1	0	1	0	1
	(N=17889)	(N=3180)	(N=9471)	(N=1506)	(N=12342)	(N=1250)	(N=4675)	(N=236)	(N=44377)	(N=6172)
exposure.cat
Never smoked	9153 (51.2%)	1041 (32.7%)	5729 (60.5%)	605 (40.2%)	8054 (65.3%)	600 (48.0%)	3299 (70.6%)	112 (47.5%)	26235 (59.1%)	2358 (38.2%)
Started before 10	134 (0.7%)	58 (1.8%)	31 (0.3%)	13 (0.9%)	55 (0.4%)	18 (1.4%)	24 (0.5%)	4 (1.7%)	244 (0.5%)	93 (1.5%)
Started at 10-14	1685 (9.4%)	430 (13.5%)	503 (5.3%)	166 (11.0%)	788 (6.4%)	135 (10.8%)	169 (3.6%)	27 (11.4%)	3145 (7.1%)	758 (12.3%)
Started at 15-17	3198 (17.9%)	725 (22.8%)	1082 (11.4%)	267 (17.7%)	1348 (10.9%)	163 (13.0%)	375 (8.0%)	31 (13.1%)	6003 (13.5%)	1186 (19.2%)
Started at 18-20	2458 (13.7%)	561 (17.6%)	1116 (11.8%)	240 (15.9%)	1249 (10.1%)	174 (13.9%)	427 (9.1%)	29 (12.3%)	5250 (11.8%)	1004 (16.3%)
Started after 20	1261 (7.0%)	365 (11.5%)	1010 (10.7%)	215 (14.3%)	848 (6.9%)	160 (12.8%)	381 (8.1%)	33 (14.0%)	3500 (7.9%)	773 (12.5%)

Unweighted: Smoking initiation categories stratified by race and mortality status.

# Table of exposure.cat stratified by sex and mortality status
tab12 <- table1::table1(~ exposure.cat | sex * status_all,
                        data = dat.analytic2)
# Display as a formatted HTML table in Quarto
table1::t1kable(tab12)

	Male		Female		Overall
	0	1	0	1	0	1
	(N=20833)	(N=3558)	(N=23544)	(N=2614)	(N=44377)	(N=6172)
exposure.cat
Never smoked	10504 (50.4%)	1030 (28.9%)	15731 (66.8%)	1328 (50.8%)	26235 (59.1%)	2358 (38.2%)
Started before 10	177 (0.8%)	80 (2.2%)	67 (0.3%)	13 (0.5%)	244 (0.5%)	93 (1.5%)
Started at 10-14	1923 (9.2%)	572 (16.1%)	1222 (5.2%)	186 (7.1%)	3145 (7.1%)	758 (12.3%)
Started at 15-17	3414 (16.4%)	823 (23.1%)	2589 (11.0%)	363 (13.9%)	6003 (13.5%)	1186 (19.2%)
Started at 18-20	3005 (14.4%)	656 (18.4%)	2245 (9.5%)	348 (13.3%)	5250 (11.8%)	1004 (16.3%)
Started after 20	1810 (8.7%)	397 (11.2%)	1690 (7.2%)	376 (14.4%)	3500 (7.9%)	773 (12.5%)

Unweighted: Smoking initiation categories stratified by sex and mortality status.

# Table of demographics and survey year stratified by mortality status
tab13 <- table1::table1(~ exposure.cat + race + sex + year.cat | status_all,
                        data = dat.analytic2)
# Display as a formatted HTML table in Quarto
table1::t1kable(tab13)

	0	1	Overall
	(N=44377)	(N=6172)	(N=50549)
exposure.cat
Never smoked	26235 (59.1%)	2358 (38.2%)	28593 (56.6%)
Started before 10	244 (0.5%)	93 (1.5%)	337 (0.7%)
Started at 10-14	3145 (7.1%)	758 (12.3%)	3903 (7.7%)
Started at 15-17	6003 (13.5%)	1186 (19.2%)	7189 (14.2%)
Started at 18-20	5250 (11.8%)	1004 (16.3%)	6254 (12.4%)
Started after 20	3500 (7.9%)	773 (12.5%)	4273 (8.5%)
race
White	17889 (40.3%)	3180 (51.5%)	21069 (41.7%)
Black	9471 (21.3%)	1506 (24.4%)	10977 (21.7%)
Hispanic	12342 (27.8%)	1250 (20.3%)	13592 (26.9%)
Others	4675 (10.5%)	236 (3.8%)	4911 (9.7%)
sex
Male	20833 (46.9%)	3558 (57.6%)	24391 (48.3%)
Female	23544 (53.1%)	2614 (42.4%)	26158 (51.7%)
year.cat
1999-2000	3188 (7.2%)	1247 (20.2%)	4435 (8.8%)
2001-2002	3772 (8.5%)	1073 (17.4%)	4845 (9.6%)
2003-2004	3583 (8.1%)	921 (14.9%)	4504 (8.9%)
2005-2006	3907 (8.8%)	667 (10.8%)	4574 (9.0%)
2007-2008	4705 (10.6%)	769 (12.5%)	5474 (10.8%)
2009-2010	5204 (11.7%)	563 (9.1%)	5767 (11.4%)
2011-2012	4770 (10.7%)	392 (6.4%)	5162 (10.2%)
2013-2014	5111 (11.5%)	285 (4.6%)	5396 (10.7%)
2015-2016	5132 (11.6%)	170 (2.8%)	5302 (10.5%)
2017-2018	5005 (11.3%)	85 (1.4%)	5090 (10.1%)

Unweighted: Full cohort characteristics stratified by mortality status.

# Table of race and sex stratified by exposure.cat
tab14 <- table1::table1(~ race + sex | exposure.cat,
                        data = dat.analytic2)
# Display as a formatted HTML table in Quarto
table1::t1kable(tab14)

	Never smoked	Started before 10	Started at 10-14	Started at 15-17	Started at 18-20	Started after 20	Overall
	(N=28593)	(N=337)	(N=3903)	(N=7189)	(N=6254)	(N=4273)	(N=50549)
race
White	10194 (35.7%)	192 (57.0%)	2115 (54.2%)	3923 (54.6%)	3019 (48.3%)	1626 (38.1%)	21069 (41.7%)
Black	6334 (22.2%)	44 (13.1%)	669 (17.1%)	1349 (18.8%)	1356 (21.7%)	1225 (28.7%)	10977 (21.7%)
Hispanic	8654 (30.3%)	73 (21.7%)	923 (23.6%)	1511 (21.0%)	1423 (22.8%)	1008 (23.6%)	13592 (26.9%)
Others	3411 (11.9%)	28 (8.3%)	196 (5.0%)	406 (5.6%)	456 (7.3%)	414 (9.7%)	4911 (9.7%)
sex
Male	11534 (40.3%)	257 (76.3%)	2495 (63.9%)	4237 (58.9%)	3661 (58.5%)	2207 (51.6%)	24391 (48.3%)
Female	17059 (59.7%)	80 (23.7%)	1408 (36.1%)	2952 (41.1%)	2593 (41.5%)	2066 (48.4%)	26158 (51.7%)

Unweighted: Demographic characteristics stratified by smoking initiation category.

2. Save Tables to Excel (Optional)

The tables created with the table1 package can also be saved to an external file, such as an Excel spreadsheet, for further review outside of this book. The following code demonstrates how to save two of these tables using the expss package.

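# Third table (tab13): full cohort characteristics by mortality status
t13 <- as.data.frame(tab13)
expss::xl_write_file(t13, filename = "data/t13.xlsx")

# Fourth table (tab14): demographics by smoking initiation category
t14 <- as.data.frame(tab14)
expss::xl_write_file(t14, filename = "data/t14.xlsx")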

9.2 Weighted Descriptive Statistics (Paper Reproduction)

R Code Chunk 4: Weighted Descriptive Tables

The following code generates the primary descriptive tables for the analysis. Unlike the previous examples, these tables correctly account for the complex survey design by using the svyCreateTableOne() function on our survey design object (w.design0). This produces nationally representative, weighted percentages.
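
To make the difference between unweighted and weighted percentages concrete, here is a minimal base-R sketch on invented data. The column names (status, sex, wt) and all values are hypothetical. It only illustrates the point estimates: a weighted column percentage is the sum of weights in a cell divided by the sum of weights in its column. It does not reproduce the design-based tests or standard errors.

```r
# Toy data: all values are invented for illustration (not NHANES)
toy <- data.frame(
  status = c(0, 0, 0, 1, 1, 0),
  sex    = c("Male", "Female", "Female", "Male", "Male", "Female"),
  wt     = c(1000, 3000, 2000, 500, 1500, 2000)  # hypothetical sampling weights
)

# Unweighted column percentages: each row counts once
unw <- prop.table(table(toy$sex, toy$status), margin = 2) * 100

# Weighted column percentages: each row counts `wt` times
wt_tab <- xtabs(wt ~ sex + status, data = toy)
wtd <- prop.table(wt_tab, margin = 2) * 100

# Weighted "n" per stratum: the sum of the weights, i.e. the
# estimated number of people in the population that group represents
wt_n <- colSums(wt_tab)

print(round(unw, 1))
print(round(wtd, 1))
print(wt_n)
```

In this toy example, males make up 25% of the status 0 rows but only 12.5% of the weighted status 0 population, because the male respondent carries a small weight. This is also why the "n" row of a weighted table shows population-scale totals rather than sample counts.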

9.2.1 Appendix Table 1: Characteristics by Mortality Status

This first table summarizes participant characteristics (smoking exposure, race, sex, and survey year), stratified by mortality status. The output is designed to reproduce Appendix Table 1 of the supplementary material, where percentages are shown in parentheses. Those percentages account for the sampling (interview) weights.

# Create weighted Table 1
# stratified by outcome status (all-cause mortality)
tab13_weighted <- svyCreateTableOne(vars = c("exposure.cat", "race",
                                             "sex", "year.cat"),
                                    strata = "status_all",
                                    # CRITICAL: pass the survey design object,
                                    # not a plain data frame
                                    data = w.design0,
                                    addOverall = TRUE,
                                    test = TRUE)

# Print the table with weighted proportions,
# specified decimal places, and all factor levels
tab13p_weighted <- print(tab13_weighted,
                         format = "p",         
                         catDigits = 2,        
                         showAllLevels = TRUE, 
                         smd = TRUE)     
# Re-label
colnames(tab13p_weighted)[colnames(tab13p_weighted) 
                          == "0"] <- "Alive"
colnames(tab13p_weighted)[colnames(tab13p_weighted) 
                          == "1"] <- "Dead"
# Order
new_order <- c("level", "Alive", "Dead", 
               "Overall", "p", "test", "SMD")
# Apply the new order to the table object
tab13p_weighted <- tab13p_weighted[, new_order]

# Save the weighted table to CSV 
write.csv(tab13p_weighted, file = "data/Table_App_1_Weighted_Mortality.csv")
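
The relabel and reorder steps above index columns by name, which fails with a "subscript out of bounds" error if an expected column is absent (for example, if the table were printed without tests). The sketch below shows a more defensive variant on a toy character matrix. The matrix contents and the assumed column layout are invented stand-ins for the object returned by print().

```r
# Toy stand-in for the character matrix returned by print() on a
# tableone object; values are invented, column names are assumed
m <- matrix(c("", "100", "80", "20", "<0.001", "", "0.450"), nrow = 1,
            dimnames = list("n", c("level", "Overall", "0", "1",
                                   "p", "test", "SMD")))

# Rename columns through a lookup vector instead of one line per label
relabel <- c("0" = "Alive", "1" = "Dead")
hit <- colnames(m) %in% names(relabel)
colnames(m)[hit] <- unname(relabel[colnames(m)[hit]])

# Reorder, keeping only the requested columns that actually exist;
# drop = FALSE keeps the result a matrix even with a single row
wanted <- c("level", "Alive", "Dead", "Overall", "p", "test", "SMD")
m <- m[, intersect(wanted, colnames(m)), drop = FALSE]

print(colnames(m))
```

Using intersect() means a missing column is silently skipped rather than stopping the script, which may or may not be what you want; for a strict reproduction, the direct indexing used above has the advantage of failing loudly.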

# Display the formatted table using kable for a clean Quarto output
kable(tab13p_weighted,
      caption = paste("Weighted Characteristics by All-Cause Mortality",
                      "Status (Analogous to Appendix Table 1)")) %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Weighted Characteristics by All-Cause Mortality Status (Analogous to Appendix Table 1)
	level	Alive	Dead	Overall	p	SMD
n		187497900.24	18761712.99	206259613.23
exposure.cat (%)	Never smoked	57.99	36.57	56.04	<0.001	0.450
	Started before 10	0.49	1.48	0.58
	Started at 10-14	7.17	12.54	7.66
	Started at 15-17	15.02	21.01	15.57
	Started at 18-20	12.24	16.78	12.65
	Started after 20	7.09	11.61	7.50
race (%)	White	66.42	74.81	67.18	<0.001	0.266
	Black	11.35	12.90	11.50
	Hispanic	14.81	7.87	14.18
	Others	7.42	4.42	7.15
sex (%)	Male	47.88	55.16	48.54	<0.001	0.146
	Female	52.12	44.84	51.46
year.cat (%)	1999-2000	7.83	19.75	8.92	<0.001	0.793
	2001-2002	8.42	17.12	9.21
	2003-2004	8.91	15.36	9.49
	2005-2006	9.46	12.42	9.73
	2007-2008	9.84	10.81	9.93
	2009-2010	10.28	8.31	10.10
	2011-2012	10.65	6.73	10.29
	2013-2014	11.11	5.36	10.59
	2015-2016	11.57	2.79	10.78
	2017-2018	11.92	1.34	10.96

9.2.2 Appendix Table 2: Characteristics by Smoking Initiation Exposure

This second table reproduces the results from Appendix Table 2 of the supplementary material. It summarizes the demographic characteristics (race and sex) of the cohort, stratified by the exposure.cat variable (smoking initiation categories). This provides the nationally representative, weighted demographic composition within each smoking exposure group as it accounts for the complex survey design.

# Create weighted Table 2 stratified by smoking initiation categories
tab14_weighted <- svyCreateTableOne(vars = c("race", "sex"),
                                    strata = "exposure.cat",
                                    # CRITICAL: pass the survey design object,
                                    # not a plain data frame
                                    data = w.design0,
                                    addOverall = TRUE,
                                    test = TRUE)

# Print the table with weighted proportions 
# and specified decimal places
tab14p_weighted <- print(tab14_weighted,
                         format = "p",
                         catDigits = 2,
                         showAllLevels = TRUE,
                         smd = TRUE)

# Define the desired column order
new_order_t2 <- c("level", "Never smoked", "Started before 10", 
                  "Started at 10-14", "Started at 15-17", 
                  "Started at 18-20", "Started after 20", 
                  "Overall", "p", "test", "SMD")
# Apply the new order to the table object
tab14p_weighted <- tab14p_weighted[, new_order_t2]

# Save the weighted table to CSV
write.csv(tab14p_weighted, file = "data/Table_App_2_Weighted_Exposure.csv")

# Display the formatted table using kable for a clean Quarto output
kable(tab14p_weighted,
      caption = paste("Weighted Characteristics by Smoking Initiation",
                      "Categories (Analogous to Appendix Table 2)")) %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Weighted Characteristics by Smoking Initiation Categories (Analogous to Appendix Table 2)
	level	Never smoked	Started before 10	Started at 10-14	Started at 15-17	Started at 18-20	Started after 20	Overall	p	SMD
n		115595521.42	1188555.71	15799589.69	32111621.33	26098412.76	15465912.33	206259613.23
race (%)	White	62.54	74.05	75.24	76.89	72.65	63.69	67.18	<0.001	0.215
	Black	12.53	6.92	8.07	8.47	10.39	15.77	11.50
	Hispanic	16.50	10.22	12.37	10.31	11.09	12.20	14.18
	Others	8.42	8.81	4.32	4.34	5.87	8.34	7.15
sex (%)	Male	42.98	73.85	58.70	56.37	55.14	50.42	48.54	<0.001	0.251
	Female	57.02	26.15	41.30	43.63	44.86	49.58	51.46

This completes the descriptive analysis section of the statistical analysis stage.

9.3 Chapter Summary and Next Steps

In this chapter, we generated the primary descriptive statistics for the study cohort. By using the survey design object, we successfully reproduced the weighted characteristics of the participants, creating tables analogous to Appendix Tables 1 and 2 from the paper’s supplementary material. This gives us a clear, nationally representative picture of our study population.

Now that we understand the characteristics of our cohort, we will move on to the core analysis in the next chapter, “Survival Analysis,” where we will investigate the primary relationship between smoking initiation and mortality.