Concepts (D)
Survey Data Analysis
Design-based analysis differs from model-based analysis in its approach to handling survey data. Design-based analysis emphasizes the importance of the survey’s sampling method and structure, focusing on representativeness and accurate variance estimation according to how the data were collected. It accounts for the complexities of the sampling design, e.g., stratification and clustering, to ensure that results are representative of the entire population. Model-based analysis, on the other hand, uses statistical models to understand relationships and patterns, assuming the data come from a specific distribution, often as if they had been collected by simple random sampling.
Understanding survey features such as weights, strata, and clusters is crucial in complex survey data analysis. Survey weights adjust for unequal probabilities of selection and nonresponse, ensuring that the sample represents the population accurately. Stratification improves precision and the representation of subgroups, while clustering, often used for practicality and cost considerations, must be accounted for to avoid underestimating standard errors. These features are vital in design-based analysis to provide unbiased, reliable estimates and are what fundamentally distinguish it from model-based approaches, which may not reflect the intricacies of complex survey structures. NHANES is used as an example to explain these ideas.
Reading list
Key reference: (Steven G. Heeringa, West, and Berglund 2017) (chapters 2 and 3)
Optional reading: (Steven G. Heeringa, West, and Berglund 2014)
Theoretical references (optional):
- F/chi-squared statistic with the Rao-Scott second-order correction (Rao and Scott 1984; Koch, Freeman Jr, and Freeman 1975; Thomas and Rao 1987)
- AIC and BIC for modeling with complex survey data (Lumley and Scott 2015)
- Pseudo-R2 statistics under complex sampling (Lumley 2017)
- Tests for regression models fitted to survey data (Lumley and Scott 2014)
- Goodness-of-fit test for a logistic regression model fitted using survey sample data (Archer and Lemeshow 2006)
Video Lessons
Large-scale health and social surveys are invaluable resources for research, providing nationally representative data that informs public policy and scientific discovery. However, their sophisticated, multi-stage sampling designs render them unsuitable for analysis with standard statistical methods that assume a simple random sample (SRS). Treating complex survey data as an SRS is a critical error that can lead to profoundly incorrect conclusions, such as underestimated standard errors and deceptively small p-values, producing a distorted and unreliable picture of the population under study.
This guide serves as a comprehensive, theoretical roadmap for students and researchers navigating the unique landscape of complex survey data analysis. It is designed as a self-contained tutorial, focusing on the conceptual underpinnings required for valid analysis, with a special emphasis on the U.S. National Health and Nutrition Examination Survey (NHANES) as a practical example. The material is structured to build knowledge progressively, from the foundational theory of survey design to the application of advanced statistical models. By understanding that the design of the survey must be integrated into every step of the analysis, the reader will be equipped to produce scientifically rigorous and defensible conclusions.
The Landscape of Survey Data
1.1. Examples of Major Complex Surveys
In public health, epidemiology, and the social sciences, our understanding of population-level phenomena is often built upon data from large-scale surveys. Prominent examples include:
- National Health and Nutrition Examination Survey (NHANES): A program of studies to assess the health and nutritional status of adults and children in the United States, unique for combining interviews with physical examinations.
- Canadian Community Health Survey (CCHS): A cross-sectional survey that collects information related to health status, health care utilization, and health determinants for the Canadian population.
- Behavioral Risk Factor Surveillance System (BRFSS): The world’s largest, continuously conducted telephone health survey system, tracking health conditions and risk behaviors in the United States.
- Health and Retirement Study (HRS): A longitudinal panel study that surveys a representative sample of Americans over the age of 50 to inform research on aging.
- European Social Survey (ESS): A cross-national survey conducted across Europe that measures attitudes, beliefs, and behavior patterns of diverse populations.
Each of these surveys employs a complex design, making the principles discussed in this guide broadly applicable. We will focus on NHANES to illustrate these concepts in detail.
1.2. The NHANES Sampling Procedure
The National Health and Nutrition Examination Survey (NHANES) does not use a simple random sample. Instead, it employs a complex, multistage, probability sampling design to select participants who are representative of the civilian, non-institutionalized U.S. population. This multi-stage process is a practical necessity to balance logistical efficiency with statistical rigor. The procedure involves four main stages:
- Stage 1: Primary Sampling Units (PSUs): The U.S. is divided into geographically defined PSUs, which are typically counties or groups of contiguous counties. These PSUs are grouped into strata based on characteristics like geography and minority population density. From each stratum, a sample of PSUs is selected, usually with a probability proportional to its population size.
- Stage 2: Segments: Each selected PSU is further divided into smaller geographic areas called segments (e.g., city blocks). A sample of these segments is then drawn, again often with probability proportional to size.
- Stage 3: Households: Within each selected segment, a list of all housing units is compiled, and a sample of households is randomly selected.
- Stage 4: Individuals: Finally, within each selected household, individuals are randomly chosen from a list of all household members based on specific age, sex, and race/ethnicity screening criteria.
This hierarchical process means that data collection is concentrated in a limited number of geographic areas (the sampled PSUs), which is far more cost-effective than traveling to households scattered randomly across the entire nation.
1.3. The Three Pillars of Complex Sampling
As discussed previously, complex survey designs are built upon three core methodological pillars: stratification, clustering, and weighting. Each serves a distinct purpose, and together they form the intricate architecture that must be understood and accounted for in any valid analysis.
- Stratification: This is the process of partitioning the target population into mutually exclusive subgroups (strata) before sampling begins. Sampling is then performed independently within each stratum. The primary purpose of stratification is to increase the statistical precision of survey estimates and to ensure that key subgroups of interest are adequately represented in the sample.
- Application in NHANES: NHANES stratifies its primary sampling units (PSUs) based on geography and the proportions of minority populations to ensure that the sample accurately reflects the nation’s demographic diversity and allows for reliable subgroup analyses.
- Clustering: This is a sampling technique used primarily to improve logistical efficiency and reduce data collection costs. Instead of sampling individuals directly, the process involves sampling natural groupings (clusters) first, such as counties or city blocks.
- Application in NHANES: The first two stages of NHANES sampling (selecting counties and then segments within those counties) are examples of clustering. This design makes it feasible to deploy the Mobile Examination Centers (MECs) where physical examinations are conducted.
- Weighting: In a complex survey, individuals have unequal probabilities of selection. Survey weights are created to compensate for this. A respondent’s final survey weight can be conceptually understood as the number of people in the target population that he or she represents.
- Application in NHANES: NHANES weights are crucial for generating unbiased national estimates. They are constructed in a multi-step process to account for the complex design:
- Base Weight: This is the inverse of the individual’s probability of selection, accounting for all stages of the sampling design. This step is necessary because of the deliberate oversampling of certain subgroups to increase the reliability of estimates for these groups. For example, NHANES often oversamples older adults, low-income individuals, and specific racial/ethnic minorities like African Americans and Mexican Americans.
- Nonresponse Adjustment: The weights are adjusted to account for individuals who were selected but did not participate in the survey (either the interview or the exam). This reduces potential bias if non-respondents differ systematically from respondents.
- Post-stratification: The weights are further adjusted so that the weighted sample’s demographic totals (e.g., by age, sex, and race/ethnicity) match known population totals from an independent source like the U.S. Census Bureau. This final calibration helps correct for any remaining discrepancies between the sample and the target population.
1.4. The Design Effect (DEFF): Why It Still Matters
The cumulative statistical impact of stratification, clustering, and unequal weighting is captured by the Design Effect (DEFF). It is formally defined as the ratio of the variance of an estimate obtained from the complex survey to the variance of the same estimate that would have been obtained from a simple random sample (SRS) of the exact same size.
\[DEFF = \frac{Var(\hat{\theta})_{complex}}{Var(\hat{\theta})_{srs}}\]
Even with modern software that directly calculates correct variances, the DEFF remains a valuable concept for several reasons:
- Interpretation: The DEFF provides a simple, intuitive metric for understanding the efficiency of the survey design for a particular estimate. A DEFF of 2.0, for instance, implies that the variance of the estimate is twice as large as it would have been under SRS, meaning the confidence interval is about 41% wider (\(\sqrt{2} \approx 1.41\)).
- Effective Sample Size: The DEFF allows you to calculate the “effective sample size” (\(n_{eff} = n / DEFF\)), which is the size of an SRS that would yield the same level of precision. This is a powerful tool for communicating the statistical power of a study.
- Survey Planning: In the design phase of a new survey, estimated DEFFs from previous, similar surveys are crucial for calculating the required sample size to achieve a desired level of precision.
In essence, while software performs the complex calculations, the DEFF provides the conceptual understanding and interpretive context for the results.
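The DEFF arithmetic above can be sketched in a few lines of code. This is a minimal illustration with hypothetical variance values, not NHANES figures:

```python
import math

def design_effect(var_complex, var_srs):
    """Ratio of the design-based variance to the SRS variance (DEFF)."""
    return var_complex / var_srs

def effective_sample_size(n, deff):
    """Size of an SRS that would yield the same precision as the complex sample."""
    return n / deff

# Hypothetical example: the design-based variance is twice the SRS variance.
deff = design_effect(var_complex=0.50, var_srs=0.25)   # 2.0
n_eff = effective_sample_size(n=10_000, deff=deff)     # 5000.0
ci_inflation = math.sqrt(deff)                         # ~1.41: CI about 41% wider

print(deff, n_eff, round(ci_inflation, 2))             # 2.0 5000.0 1.41
```

A sample of 10,000 with a DEFF of 2.0 thus carries the precision of an SRS of only 5,000, which is exactly the interpretive point made above.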
TLDR
Complex surveys like NHANES do not use simple random sampling. They use a multi-stage design involving stratification (dividing the population into groups) and clustering (sampling geographic areas) for efficiency. This creates unequal selection probabilities, which are corrected by survey weights. The Design Effect (DEFF) is a key metric that quantifies how much the complex design increases the variance of an estimate compared to a simple random sample.
The Perils of Naive Analysis
2.1. The Breakdown of the I.I.D. Assumption
Standard statistical methods assume that observations are Independent and Identically Distributed (I.I.D.). Complex survey designs systematically violate both components of this assumption.
- Violation of Independence: Cluster sampling directly violates independence. People living in the same neighborhood (cluster) are more similar to each other than randomly selected individuals, a phenomenon known as intra-cluster correlation. Standard formulas fail to account for this redundancy and therefore underestimate the true variability in the population.
- Violation of Identical Distribution: Stratification and the deliberate oversampling of certain groups mean that individuals have unequal probabilities of selection. Therefore, observations are not “identically distributed.” Survey weights are introduced to correct for this, but their very existence signals a violation of the I.I.D. assumption.
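The cost of ignoring intra-cluster correlation can be quantified with the standard approximation for equal-sized clusters, DEFF ≈ 1 + (m − 1)ρ, where m is the cluster size and ρ is the intra-cluster correlation. A small sketch with hypothetical numbers:

```python
def cluster_deff(m, icc):
    """Approximate design effect from clustering alone, assuming equal
    cluster sizes: 1 + (m - 1) * icc, where m is the average cluster size
    and icc is the intra-cluster correlation."""
    return 1 + (m - 1) * icc

# Even a modest ICC inflates variance substantially once clusters are large:
# about 2.45 here, so naive SEs would be roughly 36% too small.
print(cluster_deff(m=30, icc=0.05))
```

This is why standard formulas, which implicitly assume ρ = 0, understate the true variability.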
2.2. Inappropriate Analysis in the Literature
The availability of public-use survey datasets to researchers who may lack adequate training contributes to methodological issues and analytic errors in published literature. A 2013 study reviewing articles that used the Korean National Health and Nutrition Examination Survey (KNHANES) found that a substantial portion of publications failed to properly account for the complex survey design, potentially leading to biased results and incorrect conclusions. This problem is not unique to one survey. The issue is widespread enough that checklists, such as the proposed Preferred Reporting Items for Complex Sample Survey Analysis (PRICSSA) (Seidenberg, Moser, and West 2023), have been developed to guide researchers and improve the quality of reporting. The root cause is often simple ignorance of the correct methods, perpetuated by the publication of flawed examples.
Common Pitfall: Ignoring the Survey Design
The most frequent and dangerous error is to analyze complex survey data as if it were from a simple random sample. This ignores the clustering, stratification, and weighting, leading to incorrect point estimates and, most critically, underestimated standard errors. The result is an analysis prone to overstated significance and an increased risk of Type I errors (false positives). Researchers may report significant associations that are, in reality, merely artifacts of incorrect statistical analysis.
TLDR
Standard statistical tests assume data are I.I.D. (independent and identically distributed). Complex surveys violate this assumption through clustering (violating independence) and unequal selection probabilities (violating identical distribution). Ignoring the survey design leads to biased results and underestimated standard errors, a common problem in published research that can produce false-positive findings.
The Toolkit for Valid Inference
3.1. The Theory of Variance Estimation for Complex Surveys
Because standard formulas for variance are invalid for complex survey data, analysts must use specialized techniques. There are two major families of methods for this purpose: Taylor Series Linearization and Replication Methods.
- Taylor Series Linearization (TSL): Also known as the delta method, TSL is an analytical approach that approximates a complex, non-linear statistic (like a regression coefficient) with a simpler linear function. The variance of this linear approximation, which correctly accounts for stratification and clustering, serves as an estimate for the variance of the original statistic. This is the default method in many software packages and requires the strata and cluster (PSU) identifiers for each observation to be specified.
- Application in NHANES: To use TSL with NHANES data, an analyst must specify the masked variance pseudo-stratum (SDMVSTRA) and pseudo-PSU (SDMVPSU) variables provided in the public-use data files.
- Replication Methods: These methods take an empirical approach. The idea is to repeatedly draw subsamples (“replicates”) from the full sample in a way that mimics the original design. The statistic of interest is calculated for each replicate, and the variance of the full-sample estimate is determined by the variability of the estimates across these replicates. This approach requires a set of pre-calculated “replicate weights” instead of the strata and PSU variables. Common replication techniques include:
- Balanced Repeated Replication (BRR): Typically used for designs with two PSUs per stratum. It creates replicates by selecting one of the two PSUs from each stratum according to a balanced scheme.
- Jackknife Repeated Replication (JRR): Creates replicates by successively deleting one PSU at a time from the sample and up-weighting the remaining PSUs in that stratum.
- Fay’s BRR: A modification of BRR that retains all cases in each replicate but perturbs the weights by a factor k (where 0 < k < 1) instead of dropping cases entirely. This method can be more stable, especially for statistics like quantiles.
- Bootstrap Methods: Survey-specific bootstrap methods, such as the Rao-Wu bootstrap, create replicates by resampling PSUs within each stratum.
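To make the replication idea concrete, here is a minimal pure-Python sketch of the delete-one-PSU jackknife (JKn) for a weighted mean. The data layout is hypothetical and a real analysis would use dedicated survey software, but the mechanics (drop one PSU, up-weight the rest of its stratum, compare replicate estimates to the full-sample estimate) are the same:

```python
from collections import defaultdict

def weighted_mean(rows):
    """rows: list of (stratum, psu, weight, y) tuples."""
    num = sum(w * y for _, _, w, y in rows)
    den = sum(w for _, _, w, _ in rows)
    return num / den

def jackknife_variance(rows):
    """Delete-one-PSU jackknife (JKn) variance of the weighted mean.
    Dropping PSU i in stratum h up-weights the remaining PSUs of h by
    n_h / (n_h - 1); the variance is a weighted sum of squared deviations
    of the replicate estimates from the full-sample estimate."""
    theta_hat = weighted_mean(rows)
    psus = defaultdict(set)
    for s, p, _, _ in rows:
        psus[s].add(p)
    var = 0.0
    for h, psu_set in psus.items():
        n_h = len(psu_set)
        for dropped in psu_set:
            replicate = []
            for s, p, w, y in rows:
                if s == h and p == dropped:
                    continue                  # delete this PSU entirely
                if s == h:
                    w = w * n_h / (n_h - 1)   # up-weight the rest of stratum h
                replicate.append((s, p, w, y))
            var += (n_h - 1) / n_h * (weighted_mean(replicate) - theta_hat) ** 2
    return var

# Toy data: two strata, two PSUs each, one observation per PSU.
rows = [("A", 1, 1.0, 2.0), ("A", 2, 1.0, 4.0),
        ("B", 1, 1.0, 1.0), ("B", 2, 1.0, 3.0)]
print(weighted_mean(rows), jackknife_variance(rows))
```

Production replicate weights (as shipped with some surveys) are simply the per-replicate weight columns this loop constructs on the fly.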
3.2. Hypothesis Testing in the Design-Based Framework
Hypothesis testing procedures must also be adjusted for complex survey data.
Design-Adjusted t-tests: The standard t-statistic formula is used, but the “Estimate” is calculated using survey weights, and the “Standard Error” in the denominator is a design-based SE from TSL or replication. Furthermore, the degrees of freedom are based on the number of PSUs and strata, not the number of individuals, which appropriately reflects the reduced amount of independent information in a clustered sample.
The Rao-Scott Chi-Square Test: The standard Pearson’s chi-squared test is invalid for complex survey data because, under a complex design, the statistic no longer follows its nominal chi-squared distribution. The Rao-Scott chi-square test corrects this problem by adjusting the standard Pearson’s statistic to account for the survey’s design effect. An F-test version of this correction is often preferred as it provides a more accurate adjustment, especially when design effects vary across the cells of a contingency table.
TLDR
To get correct standard errors, use specialized variance estimation methods. Taylor Series Linearization (TSL) is an analytical approach requiring strata and PSU variables. Replication Methods (like BRR, JRR, and Bootstrap) are an empirical approach requiring pre-calculated replicate weights. Standard hypothesis tests like the t-test and chi-square test must be replaced with their design-adjusted counterparts (e.g., Rao-Scott test).
Multivariable Regression with Survey Data
4.1. Principles of Fitting Regression Models
When fitting regression models to complex survey data, the estimation procedures must be adapted. This is typically achieved through pseudo-maximum likelihood estimation (PMLE), which incorporates the survey weights into the likelihood function. This ensures that the resulting coefficient estimates are representative of the target population.
While weights are used to obtain unbiased point estimates, the most critical adjustment is in the estimation of their variance. After the coefficients are estimated, a design-based method (TSL or replication) is used to calculate the variance-covariance matrix. From this correctly calculated matrix, valid standard errors, confidence intervals, and hypothesis tests (e.g., Wald tests) for each predictor are derived.
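For a linear model, the survey-weighted point estimates reduce to weighted least squares; with a single predictor the slope has a closed form. The sketch below uses hypothetical data and shows only the point-estimation step — the standard errors would still require a design-based method (TSL or replication), as stressed above:

```python
def weighted_ols(x, y, w):
    """Weighted least-squares slope and intercept for one predictor.
    With survey weights, these are the pseudo-maximum-likelihood point
    estimates for a linear model; valid SEs need a design-based method."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw          # weighted mean of x
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw          # weighted mean of y
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return slope, intercept

# Hypothetical data lying on y = 2x + 1; unequal weights leave an exact fit exact.
slope, intercept = weighted_ols([0, 1, 2, 3], [1, 3, 5, 7], [1, 2, 1, 2])
print(slope, intercept)
```

In practice the same weighting principle carries over to logistic and other generalized linear models via PMLE, where the weights multiply each observation's contribution to the score equations.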
4.2. A Word of Caution About “Weights”!
While there is universal agreement that weights are necessary for descriptive statistics (e.g., population means or prevalences), their use in regression models is more nuanced and has been a subject of debate.
- The Argument for Using Weights: The primary goal of most public health research using national survey data is to make inferences about the target population. In this case, weights must be used to obtain unbiased estimates of the population-level associations. Failing to do so can lead to biased coefficients if the sampling probabilities are related to the outcome variable.
- The Argument Against Using Weights: Some argue that using weights can reduce precision (i.e., inflate standard errors), especially if the weights are highly variable. They contend that if a regression model is perfectly specified (includes all relevant predictors), the unweighted estimates may be more efficient.
- Practical Recommendation: In practice, one can never be certain that a model is perfectly specified. Therefore, the widely accepted best practice is to use the survey weights in regression models as the default approach. This prioritizes the avoidance of bias. A prudent strategy is to fit both weighted and unweighted models as a sensitivity analysis. If the coefficients differ substantially, it suggests the weights are correcting for significant bias, and the weighted results should be considered primary.
4.3. Assessing Model Fit and Performance
Tools for assessing model performance must also be adapted for complex survey data.
- Goodness-of-Fit: For logistic regression, the standard Hosmer-Lemeshow test is invalid. The Archer-Lemeshow goodness-of-fit test is the design-based analogue that correctly accounts for the survey design. A non-significant p-value from this test suggests the model fits the data adequately.
- Predictive Performance: Measures like pseudo-R² should be calculated using survey-weighted versions. For assessing a model’s ability to discriminate between outcomes, the Area Under the ROC Curve (AUC) should be calculated as a weighted AUC to ensure the measure is representative of the model’s performance in the target population.
TLDR
When running regressions, use pseudo-maximum likelihood estimation (PMLE) to incorporate weights for unbiased coefficients, and use design-based methods for valid standard errors. While there is some debate, the best practice is to always use weights in regression to avoid bias. Model fit should be assessed with specialized tools like the Archer-Lemeshow test and weighted AUC.
Practical Considerations for NHANES Data
5.1. Analyzing Subpopulations: The Correct Approach
A frequent task in survey analysis is to focus on a specific subgroup of the population, such as analyzing health outcomes only among women.
Common Pitfall: Incorrectly Subsetting the Data
A critical mistake is to simply filter the dataset to keep only the individuals in the subpopulation of interest before specifying the survey design. This approach is incorrect because it provides the software with incomplete information about the total number of strata and PSUs in the original sample, leading to biased standard errors and invalid inference.
The correct procedure involves two steps:
1. Define the full survey design object first. The analyst must specify the survey design to the statistical software using the entire sample dataset, including all strata, PSUs, and weights.
2. Use a subpopulation or subset command. Once the full design object is created, the analyst should use a specific command within the survey analysis software to restrict subsequent analyses to the desired subpopulation. This command performs calculations only on the subgroup but does so while using the variance estimation structure of the full sample design.
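The logic behind subpopulation commands can be sketched with the domain-indicator formulation: out-of-domain cases contribute zero to the point estimate, but every stratum and PSU stays available for variance estimation. This is a hypothetical illustration, not a replacement for the subpopulation facilities of real survey software:

```python
def domain_mean(rows, in_domain):
    """rows: list of (stratum, psu, weight, y); in_domain: parallel bools.
    The point estimate uses only in-domain cases, but the full design
    information (every stratum/PSU pair) is retained for variance
    estimation -- which is exactly what naive subsetting before declaring
    the design would throw away."""
    num = sum(w * y * d for (_, _, w, y), d in zip(rows, in_domain))
    den = sum(w * d for (_, _, w, _), d in zip(rows, in_domain))
    design_psus = {(s, p) for s, p, _, _ in rows}   # full design, not the subset
    return num / den, design_psus

# Hypothetical sample: three PSUs, but the domain appears in only two of them.
rows = [("A", 1, 2.0, 10.0), ("A", 2, 1.0, 20.0), ("B", 1, 1.0, 30.0)]
mean, psus = domain_mean(rows, in_domain=[True, False, True])
print(mean, sorted(psus))
```

Note that PSU ("A", 2) contributes nothing to the estimate yet remains in the design set; filtering it out first would misstate the number of PSUs per stratum and bias the standard errors.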
5.2. Selecting and Combining NHANES Weights
The practical application of survey weights requires careful attention to selecting the correct weight for a given analysis and adjusting weights when combining survey cycles.
Selecting the Correct Weight
NHANES data files often provide multiple survey weights because different components of the survey are administered to different subsets of the sample. For example, NHANES provides an interview weight (WTINT2YR), a Mobile Examination Center (MEC) exam weight (WTMEC2YR), and even more specific subsample weights, such as a fasting subsample weight.
The analyst must choose the correct weight based on the variables included in the analysis. A good rule of thumb is to use the weight corresponding to the “least common denominator”—that is, the weight that applies to the smallest, most specific subsample required for the analysis. For example, if an analysis includes variables from both the interview and the MEC exam, the MEC exam weight must be used, and the analysis must be restricted to only those participants who have a MEC weight.
Combining Survey Cycles
Researchers often combine data from multiple two-year cycles of NHANES to increase sample size. When doing so, the survey weights must be adjusted.
- Standard Cycles (2001 onward): The standard rule is to divide the original two-year sample weight by the number of cycles being combined. For example, if combining three two-year cycles (a total of six years of data), the new multi-year weight would be the original two-year weight divided by three.
- Pandemic-Era Data (2017–March 2020): Due to the COVID-19 pandemic, data collection for the 2019–2020 cycle was halted prematurely. To create a nationally representative dataset, the partial 2019–March 2020 data were combined with the full 2017–2018 data. This created a 3.2-year file, and combining it with other 2-year cycles requires special weighting formulas provided by NCHS. For example, to combine 2015–2016 (2 years) and 2017–March 2020 (3.2 years), the new weight (MEC52Y) would be calculated as:
  - MEC52Y = (2 / 5.2) * WTMEC2YR for the 2015–2016 respondents.
  - MEC52Y = (3.2 / 5.2) * WTMECPRP for the 2017–March 2020 respondents.
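Both the standard rule and the pandemic-era formulas are the same multiplicative adjustment: scale each cycle's weight by the fraction of the combined period that the cycle covers. A sketch with a hypothetical weight value:

```python
def combined_weight(base_weight, cycle_years, total_years):
    """Rescale a cycle's sample weight when pooling NHANES cycles:
    multiply by the share of the combined period the cycle covers."""
    return base_weight * (cycle_years / total_years)

# Standard case: three 2-year cycles (6 years total) -> divide each
# 2-year weight by 3.
print(combined_weight(30_000, cycle_years=2, total_years=6))

# Pandemic case: 2015-2016 (2 y) pooled with 2017-March 2020 (3.2 y).
print(combined_weight(30_000, cycle_years=2, total_years=5.2))    # (2/5.2) * WTMEC2YR
print(combined_weight(30_000, cycle_years=3.2, total_years=5.2))  # (3.2/5.2) * WTMECPRP
```

Dividing by the number of cycles is simply the special case where every cycle spans the same two years.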
5.3. Preferred reporting items for complex sample survey analysis (PRICSSA)
We apply the preferred reporting items for complex sample survey analysis (PRICSSA) (Seidenberg, Moser, and West 2023) [link] to an example article (Karim, Hossain, and Zheng 2025) [link].
PRICSSA Item | Description | Example Text from/for the Article |
---|---|---|
1.1 Data collection dates | State the specific start and end dates of the survey to provide historical context. | “For this study, the authors utilized data from 10 aggregated NHANES cycles spanning from 1999–2000 to 2017–2018.” |
1.2 Data collection mode(s) | Describe how the data was gathered (e.g., in-person interview, phone, web), as this can influence responses. | “Data collection was primarily done through interviews.” |
1.3 Target population | Clearly define the population the survey is intended to represent. | The analysis focused on the “noninstitutionalized U.S. civilian population” aged “between 20 and 79 years.” |
1.4 Sample design | Explain the survey’s sampling methodology, such as stratification and clustering. | “The survey employs a multistage, stratified cluster sampling design.” |
1.5 Survey response rate(s) | Report the survey’s response rate and the calculation method to inform potential nonresponse bias. | ✍️ This was not reported. An example would be: “The NHANES response rates vary by cycle. For the cycles included, the unweighted interview response rates ranged from approximately 84% (1999-2000) to 52% (2017-2018).” [link] |
2.1 Missingness rates | Report the rate of missing data for key variables and describe how it was handled (e.g., complete case analysis, imputation). | “In total, 275 observations (about 0.5% of the entire sample size) were discarded owing to having missing exposure or outcome.” “There were no missing values for the covariates considered in the main analysis.” |
2.2 Observation deletion | State if any observations were deleted and provide a justification. Best practice is to use subpopulation commands instead of deleting. | The authors state they “discarded” 275 observations due to missing data and “excluded participants who were aged either below 20 or above 79 years.” |
2.3 Sample sizes | Include unweighted counts (n) for all weighted estimates to show the actual number of participants underlying the results. | “The analytic dataset comprised a sample size of 50,549.” The appendix tables provide detailed unweighted counts for all subgroups, such as the N=80 females who started smoking before age 10 (Appendix Table 2). |
2.4 Confidence intervals/standard errors | Report 95% confidence intervals or standard errors for all estimates to convey their precision. | “…estimated hazard ratios (HRs) along with their associated 95% CIs…” All results in Figure 2 include 95% CIs. |
2.5 Weighting | State which analyses were weighted and specify the exact weight variables used. | “The design was created on the entire data using the design features: interview weights, clusters, and strata.” |
2.6 Variance estimation | Describe the method used to calculate design-adjusted variances and specify the design variables (e.g., PSU, strata). | The authors estimated “variances using the Taylor series linearization method.” |
2.7 Subpopulation analysis | Explain the correct statistical procedure for analyzing subgroups (e.g., using a subpop or domain command). | “Subsequently, the authors subset the design to focus on eligible patients…” This describes the correct procedural step for subpopulation analysis. |
2.8 Suppression rules | State whether a rule was followed to suppress unreliable estimates (e.g., based on small sample size or large relative standard error). | ✍️ This was not reported as suppression was not done. An example would be: “Estimates based on an unweighted sample size of fewer than 30 participants were considered potentially unstable and are noted in the text.” |
2.9 Software and code | State the statistical software and version used, and make the analysis code available for reproducibility. | “All analyses were conducted using R, Version 4.2.2.” The code is “available from the corresponding author upon reasonable request.” |
2.10 Singleton problem (as needed) | If using Taylor Series Linearization, describe how any strata with only one PSU were handled during analysis. | ✍️ This was not mentioned. As this is an “as needed” item, it is appropriate to omit if the problem was not encountered during the analysis. |
2.11 Public/restricted data (as needed) | Specify whether the public-use or a restricted-use version of the dataset was analyzed. | The authors state that “NHANES data are publicly accessible” and that they used “public-use linked mortality files”, indicating public-use data was analyzed. |
2.12 Embedded experiments (as needed) | If the survey included an experiment, describe it and how it was handled in the analysis. | This item is not applicable to this study, as the analysis did not involve an embedded experiment within the NHANES data. |
Note
- Reporting of Tables:
Expanding on item 2.3 “Sample sizes”, when presenting a descriptive statistics table (a “Table 1”), the recommended best practice is to report two pieces of information for each variable:
- The unweighted sample size (n) for each category.
- The weighted percentage (%) or mean for each category.
This dual presentation provides a complete picture of both the sample that was actually collected (the ‘n’) and the population it is intended to represent (the weighted estimate). This transparency allows readers to immediately see the effects of the survey design, such as the oversampling of certain groups.
To apply these principles in R, the svyTable1 package can be used to generate a “Table 1” from complex survey data. This package is provided “as is” and used at the reader’s discretion.
- Reporting of lonely/singleton PSU:
The “lonely PSU” problem, also known as the “singleton PSU” problem, is a technical issue that can arise when analyzing complex survey data. The Taylor Series Linearization (TSL) method for variance estimation works by measuring the variability between Primary Sampling Units (PSUs) within each stratum, and calculating a variance (a measure of spread) requires at least two points to compare. This problem most often occurs during subpopulation analysis. While the full NHANES sample is designed to have at least two PSUs in every stratum, when you analyze a very specific subgroup (e.g., non-Hispanic Black participants who started smoking before age 10), it is possible that your subgroup of interest exists in only one PSU within a particular stratum. Common solutions are to [i] center at the grand mean (a conservative approach), [ii] merge the stratum containing the lonely PSU with another, similar stratum, or [iii] use replication methods.
- Reporting of reliability of estimates:
- design effect (deff): A value >1 indicates the complex design increases variance (e.g., 1.234 means ~23% inflation vs. SRS). Report it in footnotes or Methods for transparency.
- Relative standard error (RSE) or %RSE = (Standard error of estimate / Estimate) * 100: Should be <30% for “stable” estimates per CDC guidelines; suppress or flag unstable ones (e.g., wide CIs).
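Both diagnostics are simple ratios; the sketch below (Python, with made-up numbers) computes DEFF and %RSE and applies the <30% stability flag described above:

```python
def design_effect(var_complex, var_srs):
    """DEFF: variance of the estimate under the complex design divided by
    the variance a simple random sample of the same size would give."""
    return var_complex / var_srs

def pct_rse(estimate, se):
    """%RSE = (standard error / estimate) * 100; values >= 30% are
    commonly flagged as unstable under CDC/NCHS reporting guidance."""
    return 100 * se / estimate

# Made-up inputs for illustration
deff = design_effect(var_complex=0.00617, var_srs=0.005)  # ~23% inflation vs. SRS
rse = pct_rse(estimate=12.5, se=4.0)
flag = "unstable" if rse >= 30 else "stable"
```

An estimate with %RSE of 32 would be suppressed or flagged under the <30% rule, whatever its point value.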
TLDR
For subpopulation analysis, always define the survey design on the full dataset first, then use a subset command. When using NHANES, select the correct weight based on the “least common denominator” of your variables (e.g., interview vs. exam weight). When combining survey cycles, divide the 2-year weights by the number of cycles, but use special formulas for the pandemic-era data. For descriptive tables, always report both unweighted counts (n) and weighted percentages (%).
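The weight-combining rule for pre-pandemic cycles is plain division; a sketch with made-up 2-year weights is below. The pandemic-era cycles require the special NCHS formulas instead, which are not shown here.

```python
import pandas as pd

# Hypothetical participants from two 2-year NHANES cycles
cycle_a = pd.DataFrame({"id": [1, 2], "wt_2yr": [30000.0, 45000.0]})
cycle_b = pd.DataFrame({"id": [3, 4], "wt_2yr": [20000.0, 60000.0]})

# Combining k pre-pandemic cycles: divide each 2-year weight by k
n_cycles = 2
combined = pd.concat([cycle_a, cycle_b], ignore_index=True)
combined["wt_combined"] = combined["wt_2yr"] / n_cycles
```

Dividing by the number of cycles keeps the combined weights summing to (approximately) one population total rather than k of them.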
5.4 The Sex and Gender Equity in Research (SAGER) guidelines
For reporting sex- and gender-based analyses (not necessarily part of survey data analysis), we recommend the SAGER guidelines (Heidari et al. 2016). We apply the SAGER checklist (Van Epps et al. 2022) [link] to an example article (Karim, Hossain, and Zheng 2025) [link].
SAGER Item | What the Manuscript Said | What Should Have Been Said (to fully comply) |
---|---|---|
1. General: Use terms sex/gender appropriately. | The article consistently uses “sex” when referring to the demographic variable from NHANES and discussing biological differences. | This is correct. The article appropriately uses “sex” when referring to the demographic variable from NHANES, which is consistent with the data source. |
3b. Abstract: Describe study population with sex/gender breakdown. | The abstract states the analysis explored “effect modification by race/ethnicity and sex” but does not provide the numerical breakdown of participants. | The abstract should have included the numbers: “Results: The analysis included 50,549 participants (48.5% male). The authors found that early smoking initiation…” |
4a. Introduction: Cite previous studies on sex/gender differences. | The introduction cites literature on how “biological differences in nicotine metabolism… vary across sexes” and discusses disparities related to sex. | This is correct. The introduction properly cites existing literature to establish the rationale for investigating sex as a variable. |
5a. Methods: State the method used to define sex/gender. | The Methods section lists “sex (male, female)” as a variable but does not specify how it was collected or defined by the survey. | The Methods section should have specified the data collection method: “Sex (male, female) was based on participant self-report as recorded in the NHANES demographic files.” |
6a. Results: Provide a complete sex/gender breakdown. | Appendix Table 1 provides the complete unweighted and weighted breakdown: “Male 24,391 (48.54%)” and “Female 26,158 (51.46%)”. | This is correct. The article provides a full and appropriate breakdown of the study population by sex in an appendix table. |
6b. Results: Present data disaggregated by sex/gender. | Figure 2 presents hazard ratios stratified by “Male” and “Female”. Appendix Figure 2 shows smoking duration disaggregated by sex. | This is correct. The results are clearly and appropriately disaggregated by sex throughout the article and appendix. |
7a. Discussion: Discuss the implications of sex/gender on the results. | The Discussion section analyzes the findings: “Effect modification by sex resulted in slightly higher HR estimates for the female subpopulation…”. | This is correct. The manuscript properly discusses and interprets the sex-specific findings. |
Glossary of Terms
- Archer-Lemeshow Test: A goodness-of-fit test for logistic regression models that has been adapted for use with complex survey data.
- Balanced Repeated Replication (BRR): A replication method for variance estimation, typically used for designs with two PSUs per stratum.
- Bootstrap: A replication method for variance estimation that involves resampling PSUs within each stratum.
- Clustering: A sampling technique where natural groups (e.g., counties, schools) are sampled first, followed by sampling of units within the selected groups. Clustering generally increases variance.
- Design Effect (DEFF): The ratio of the variance of an estimate from a complex survey to the variance of the same estimate from a simple random sample of the same size. It measures the impact of the design on precision.
- Design-Based Inference: A statistical framework where inference is based on the known random process of sample selection from a fixed, finite population.
- Fay’s BRR: A modification of BRR that perturbs weights rather than deleting cases, which can improve stability for certain estimates.
- I.I.D. (Independent and Identically Distributed): A core assumption of classical statistics that observations are independent of one another and all drawn from the same distribution. This is violated by complex surveys.
- Jackknife Repeated Replication (JRR): A replication method for variance estimation that involves creating replicates by successively deleting one PSU at a time.
- Model-Based Inference: A statistical framework where inference is based on an assumed statistical model that generates the data.
- Primary Sampling Unit (PSU): The first-stage sampling unit in a multi-stage design, often a geographic area like a county.
- Pseudo-Maximum Likelihood Estimation (PMLE): An estimation method for regression models that incorporates survey weights into the likelihood function.
- Rao-Scott Chi-Square Test: A design-adjusted version of the Pearson chi-square test used to assess association between categorical variables in complex survey data.
- Replication Methods: A family of variance estimation techniques that use the variability across multiple subsamples (replicates) to estimate the variance of an estimate.
- Simple Random Sample (SRS): A basic sampling method where every individual in the population has an equal and independent chance of being selected.
- Strata: Mutually exclusive subgroups of a population from which independent samples are drawn. Stratification generally decreases variance.
- Taylor Series Linearization (TSL): An analytical method for variance estimation that uses a linear approximation of a statistic. It is the most common default method.
- Weighting: The process of assigning a weight to each respondent to adjust for unequal probabilities of selection, nonresponse, and deviations from population totals.
What is included in this Video Lesson:
- reference 00:38
- design-based 1:28
- examples 3:33
- NHANES and sampling 4:54
- weights and other survey features 9:05
- estimate of interest 12:55
- design effect 15:52
- variance estimation 18:13
- design-based analysis 25:11
- how to make inference 29:33
- inappropriate analysis 32:08
- how useful are sampling weights 36:15
- how useful are psu/cluster info 37:42
- subpopulation / subsetting 38:57
- missingness related to weights? 40:45
- dealing with subpopulation 41:38
The timestamps are also included in the YouTube video description.