Concepts (D)

Survey Data Analysis

Design-based analysis differs from model-based analysis in how it handles survey data. Design-based analysis treats the survey's sampling method and structure as the basis for inference, focusing on representativeness and on variance estimation that reflects how the data were collected. It accounts for the complexities of the sampling design, e.g., stratification and clustering, so that results are representative of the entire population. Model-based analysis, on the other hand, uses statistical models to understand relationships and patterns, assuming the data come from a specific distribution and often treating the observations as independent draws, as under simple random sampling.
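As a small illustration of why the sampling design matters, the sketch below compares a design-based stratified estimate of a population mean with a naive pooled mean that ignores the design. The stratum names, population sizes, and observed values are invented for illustration and are not taken from NHANES or any real survey; the function names are likewise hypothetical.

```python
def stratified_mean(samples, pop_sizes):
    """Design-based estimate of the population mean under stratified sampling.

    samples:   dict mapping stratum -> list of observed values
    pop_sizes: dict mapping stratum -> population size N_h (assumed known)
    """
    total_pop = sum(pop_sizes.values())
    # Each stratum mean is weighted by its share of the population, N_h / N.
    return sum(
        pop_sizes[h] / total_pop * (sum(values) / len(values))
        for h, values in samples.items()
    )


def naive_mean(samples):
    """Pooled sample mean that ignores how the sample was drawn."""
    pooled = [y for values in samples.values() for y in values]
    return sum(pooled) / len(pooled)


# Hypothetical toy data: the small "rural" stratum is deliberately oversampled.
samples = {"urban": [10.0, 12.0], "rural": [20.0, 22.0, 24.0, 26.0]}
pop_sizes = {"urban": 900, "rural": 100}

print(stratified_mean(samples, pop_sizes))
print(naive_mean(samples))
```

In this toy setting the pooled mean is pulled toward the oversampled stratum, while the stratified estimate gives each stratum its population share.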

Understanding survey features such as weights, strata, and clusters is crucial in complex survey data analysis. Survey weights adjust for unequal probabilities of selection and for nonresponse, so that the sample represents the population accurately. Stratification improves precision and the representation of subgroups, while clustering, often used for practicality and cost considerations, must be accounted for to avoid underestimating standard errors. These features are vital in design-based analysis to provide unbiased, reliable estimates, and they are what fundamentally distinguish it from model-based approaches, which may not reflect the complexities of survey structures. NHANES is used as an example to explain these ideas.
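To make the roles of weights, strata, and clusters concrete, here is a minimal sketch of a weighted mean and a design-based standard error. It assumes the common Taylor linearization approach with primary sampling units (PSUs) treated as sampled with replacement within strata; this is one standard approximation, not necessarily the exact method used in the video lesson. The toy data and the function names are hypothetical, and real analyses should rely on dedicated survey software.

```python
from collections import defaultdict
from math import sqrt


def weighted_mean(y, w):
    """Weighted mean: sum(w_i * y_i) / sum(w_i)."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


def design_based_se(y, w, strata, psu):
    """Linearization standard error of the weighted mean.

    Assumes PSUs are sampled with replacement within each stratum and
    that every stratum contains at least two PSUs.
    """
    total_w = sum(w)
    ybar = weighted_mean(y, w)

    # Sum the linearized values w_i * (y_i - ybar) / sum(w) within each PSU.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for yi, wi, h, c in zip(y, w, strata, psu):
        psu_totals[h][c] += wi * (yi - ybar) / total_w

    variance = 0.0
    for h, clusters in psu_totals.items():
        n_h = len(clusters)
        if n_h < 2:
            raise ValueError(f"stratum {h!r} has fewer than two PSUs")
        z = list(clusters.values())
        zbar = sum(z) / n_h
        # Between-PSU variability within the stratum drives the variance.
        variance += n_h / (n_h - 1) * sum((v - zbar) ** 2 for v in z)
    return sqrt(variance)


# Hypothetical toy data: two strata, two PSUs each, unequal weights.
y = [1, 2, 3, 4, 5, 6, 7, 8]
w = [1, 1, 1, 1, 3, 3, 3, 3]
strata = [1, 1, 1, 1, 2, 2, 2, 2]
psu = [1, 1, 2, 2, 1, 1, 2, 2]

print(weighted_mean(y, w))
print(design_based_se(y, w, strata, psu))
```

Note that the variance is built from PSU-level totals within strata rather than from individual observations, which is how the clustering enters the standard error.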

Reading list

Key reference: (Heeringa, West, and Berglund 2017) (chapters 2 and 3)

Optional reading: (Heeringa, West, and Berglund 2014)

Theoretical references (optional):

Video Lessons

Survey Data Analysis

What is included in this Video Lesson:

  • reference 0:38
  • design-based 1:28
  • examples 3:33
  • NHANES and sampling 4:54
  • weights and other survey features 9:05
  • estimate of interest 12:55
  • design effect 15:52
  • variance estimation 18:13
  • design-based analysis 25:11
  • how to make inference 29:33
  • inappropriate analysis 32:08
  • how useful are sampling weights 36:15
  • how useful are PSU/cluster info 37:42
  • subpopulation / subsetting 38:57
  • missingness connected to weights? 40:45
  • dealing with subpopulation 41:38

The timestamps are also included in the YouTube video description.

Video Lesson Slides

References

Archer, K. J., and S. Lemeshow. 2006. “Goodness-of-Fit Test for a Logistic Regression Model Fitted Using Survey Sample Data.” The Stata Journal 6 (1): 97–105.
Heeringa, Steven G., Brady T. West, and Patricia A. Berglund. 2014. “Regression with Complex Samples.” In The SAGE Handbook of Regression Analysis and Causal Inference, edited by Henning Best and Christof Wolf. SAGE Publications.
Heeringa, Steven G., Brady T. West, and Patricia A. Berglund. 2017. Applied Survey Data Analysis. Chapman and Hall/CRC.
Koch, G. G., D. H. Freeman Jr, and J. L. Freeman. 1975. “Strategies in the Multivariate Analysis of Data from Complex Surveys.” International Statistical Review/Revue Internationale de Statistique, 59–78.
Lumley, Thomas. 2017. “Pseudo-R2 Statistics Under Complex Sampling.” Australian & New Zealand Journal of Statistics 59 (2): 187–94.
Lumley, Thomas, and Alan Scott. 2014. “Tests for Regression Models Fitted to Survey Data.” Australian & New Zealand Journal of Statistics 56 (1): 1–14.
———. 2015. “AIC and BIC for Modeling with Complex Survey Data.” Journal of Survey Statistics and Methodology 3 (1): 1–18.
Rao, J. N. K., and A. J. Scott. 1984. “On Chi-Squared Tests for Multiway Contingency Tables with Cell Proportions Estimated from Survey Data.” The Annals of Statistics, 46–60.
Thomas, D. R., and J. N. K. Rao. 1987. “Small-Sample Comparisons of Level and Power for Simple Goodness-of-Fit Statistics Under Cluster Sampling.” Journal of the American Statistical Association 82 (398): 630–36.