Analyzing Early Smoking Initiation and Mortality

A Complete Walkthrough with 20 Years of NHANES Data

Authors
Affiliations

School of Population and Public Health, The University of British Columbia

Sadia Khan Durani

Department of Statistics, The University of British Columbia

Published

August 22, 2025

1 Reproducing the Analysis: Early Smoking Initiation and Mortality


Welcome! This guide documents, step by step, how to reproduce the analysis from the published paper:

Karim, M. E., Hossain, M. B., & Zheng, C. (2025). Examining the Role of Race/Ethnicity and Sex in Modifying the Association Between Early Smoking Initiation and Mortality: A 20-Year NHANES Analysis. AJPM Focus, 4(2), 100282. https://doi.org/10.1016/j.focus.2024.100282

The guide covers the complete analytical pipeline, from raw data processing to the final statistical models, so that researchers can understand and replicate the study or adapt the framework to new research questions.

1.1 About This Guide

This guide is designed for researchers, students, and public health analysts interested in analyzing NHANES data linked to mortality follow-up. It demonstrates a complete workflow, including:

  • Data Acquisition: Programmatically downloading and merging multiple cycles of NHANES data.
  • Data Cleaning: Harmonizing variables that change across survey years.
  • Complex Survey Analysis: Correctly applying survey weights, strata, and clusters for nationally representative estimates.
  • Survival Modeling: Implementing Kaplan-Meier curves, Cox proportional hazards models, and effect modification analysis.
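
To make the last two workflow items concrete, here is a minimal sketch of a survey-weighted Kaplan-Meier point estimate. It is an illustration only, not the code used in this guide or in the paper: the function name `weighted_kaplan_meier` and the five-person dataset are hypothetical, and Python is used purely for the example. The sketch produces the weighted survival curve only; design-based standard errors would also need the strata and cluster (PSU) information, which a dedicated survey package handles.

```python
from collections import defaultdict


def weighted_kaplan_meier(times, events, weights):
    """Return [(time, survival)] at each distinct event time.

    times   : follow-up time for each participant (e.g. months)
    events  : 1 if the participant died, 0 if censored
    weights : survey weight for each participant
    """
    deaths = defaultdict(float)   # weighted deaths at each time
    leaving = defaultdict(float)  # weighted count leaving the risk set at each time
    for t, d, w in zip(times, events, weights):
        leaving[t] += w
        if d:
            deaths[t] += w

    at_risk = sum(weights)  # weighted size of the initial risk set
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        if deaths[t] > 0:
            # Product-limit step: multiply by the weighted share surviving past t.
            survival *= 1.0 - deaths[t] / at_risk
            curve.append((t, survival))
        # Everyone who died or was censored at t leaves the risk set.
        at_risk -= leaving[t]
    return curve


# Hypothetical toy data: five participants with made-up survey weights.
times = [12, 30, 30, 45, 60]
events = [1, 0, 1, 1, 0]
weights = [1000.0, 2500.0, 1500.0, 2000.0, 3000.0]

curve = weighted_kaplan_meier(times, events, weights)
for t, s in curve:
    print(f"month {t}: weighted survival = {s:.3f}")
```

With all weights equal, the function reduces to the ordinary Kaplan-Meier estimator, which is a convenient sanity check before applying real survey weights.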

1.2 Book Structure

This book is structured to guide you through each stage of the analysis:

  • Part I: Introduction and Background
    • Sets the context for the study, provides an overview of the NHANES data, and outlines the research objectives.
  • Part II: Data Preparation
    • Details the steps for downloading, cleaning, and merging the raw NHANES and mortality linkage files to create the final analytic dataset.
  • Part III: Statistical Analysis
    • Explains how the complex survey design is specified and describes the statistical models used, including survival analysis, effect modification, and sensitivity analyses.
  • Part IV: Discussion
    • Presents the key findings of the analysis, discusses their implications in the context of the original paper, and outlines limitations and future directions.

Let’s get started!