Survey data analysis

Background

The chapter consists of a series of tutorials focused on conducting rigorous analyses of complex survey data, mainly using Canadian Community Health Survey (CCHS) and National Health and Nutrition Examination Survey (NHANES) datasets. The tutorials guide users through various stages of survey data analysis: formulating research questions via the PICOT framework, data preparation, quality assessment, and handling missing data. They cover both bivariate and multivariable statistical methods, such as logistic and linear regressions, emphasizing the need to account for complex survey design elements like weights, strata, and clusters to avoid biased estimates. Advanced statistical techniques like backward elimination and interaction effect assessments are also discussed. Predictive model performance is evaluated using metrics like AIC, pseudo R-squared, and ROC curves, along with specialized tests like Archer and Lemeshow Goodness of Fit. The tutorials serve as a comprehensive guide for anyone looking to delve deep into the intricacies of analyzing complex survey data effectively.

The foundation we’ve built in understanding various research questions, especially the distinction between causal and predictive inquiries, will be instrumental in our next phase. Survey data, with its rich and diverse information, often presents opportunities to address both causal and predictive questions. By leveraging the knowledge we’ve garnered about the intricacies of causality and the methodologies of prediction, we’ll be better equipped to extract meaningful insights from nationally representative survey data. This holistic approach ensures that we not only comprehend the underlying theories but also effectively apply them in real-world contexts, making our analysis robust, relevant, and impactful.

Important

Datasets:

All of the datasets used in this tutorial can be accessed from this GitHub repository folder

Overview of tutorials

CCHS: Revisiting PICOT

The tutorial focuses on revisiting a research question concerning the relationship between osteoarthritis (OA) and cardiovascular disease (CVD) in Canadian adults, utilizing data from the Canadian Community Health Survey (CCHS) spanning from 2001 to 2005. The approach follows the PICOT framework, which specifies the target population, outcome, exposure and control groups, and timeline. The tutorial provides detailed steps for data preparation and analysis, from loading the necessary R packages to subsetting data based on a comprehensive set of variables like age, sex, marital status, and income among others. The variables are recoded into broader categories for easier analysis. The tutorial then combines different cycles of CCHS data into one comprehensive dataset. Potential confounders are also identified to better understand the relationship between OA and CVD.

CCHS: Assessing data

This tutorial provides a comprehensive guide to data preparation and quality assessment. It emphasizes the importance of checking for missing data and visualizing it, creating summary tables to look for zero-cells in variables, and generating frequency tables for various variables to examine data distribution. Specific attention is given to the presence of problematic variables. Data dictionaries from different cycles are also consulted to ensure variable compatibility. After identifying and modifying problematic data, the tutorial also explains how to set appropriate reference levels for factors in the dataset and offers an option to create a new dataset that excludes missing values (although this is not generally recommended).

CCHS: Bivariate analysis

This tutorial outlines how to examine relationships between two variables using R. The tutorial covers data preparation steps such as accumulating survey weights across cycles. It also highlights the handling of missing data and survey design specifications for weighted analyses. Descriptive weighted statistics are generated in tables, stratified by exposure and outcome, to provide insights for survey weighted logistic regression analysis. Additionally, proportions and design effects are calculated to account for the complex survey design’s impact on statistical estimates. The tutorial employs specialized chi-square tests, such as, Rao-Scott and Thomas-Rao modifications, to assess associations between variables, accounting for the survey’s complex design.

CCHS: Regression analysis

This tutorial offers a comprehensive guide on conducting complex regression analyses on survey data using R. The tutorial starts by conducting basic data checks. It then performs both simple and multivariable logistic regression to explore the relationship between cardiovascular disease and osteoarthritis. Model fit is assessed using Akaike Information Criterion (AIC) and pseudo R-squared metrics. Variable selection techniques such as backward elimination and stepwise regression guided by AIC are applied to hone the model. The tutorial also delves into assessing interaction effects among variables like age, sex, and diabetes, incorporating significant interactions into the final model.

CCHS: Model performance

The tutorial guides users through the process of evaluating logistic regression models fitted to complex survey data in R, focusing primarily on the Receiver Operating Characteristic (ROC) curves and Archer and Lemeshow Goodness of Fit tests. It introduces a specialized function for plotting ROC curves and calculating the Area Under the Curve (AUC) to gauge the model’s predictive accuracy, while taking survey weights into account. Grading guidelines for AUC values are provided for model discrimination quality. For model fit, the Archer and Lemeshow test is used. The tutorial also covers additional functionalities for dealing with strata and clusters in the survey data.

NHANES: Predicting blood pressure

The tutorial provides a comprehensive guide for analyzing health survey data with a focus on how demographic factors like race, age, gender, and marital status relate to blood pressure levels using NHANES dataset. The tutorial constructs both bivariate and multivariate regression models. Additionally, the tutorial incorporates complex survey designs by creating a new survey design object that factors in sampling weight, strata, and clusters. It also generates box plots and summary statistics to visualize variations in blood pressure across different demographic groups, considering survey design. The tutorial emphasizes the importance of accounting for survey design features to avoid biased estimates and discusses the challenges of model overfitting and optimism when shifting from inference to prediction, recommending optimism-correction techniques.

NHANES: Predicting cholesterol level

In the study using NHANES data, the goal was to predict cholesterol levels in adults based on various predictors such as gender, country of birth, race, education, marital status, income, BMI, and diabetes. The data was filtered to include only adults 18 years and older, and multiple statistical tests were conducted. Linear regression and logistic regression models were fitted, with results suggesting an association between gender and cholesterol level. Various statistical tests, including Wald tests and backward elimination, were employed to optimize the model. The study found that income was not a significant predictor for cholesterol levels, and interaction terms did not improve the model. Despite utilizing survey design features, the model had poor discriminatory power. However, Archer-Lemeshow Goodness of Fit test showed that the model fits the data well. The inclusion of age as an additional predictor led to different odds ratios, and the AIC value suggested that adding age improved the model.

NHANES: Properly subsetting a design object

The tutorial provides a comprehensive guide on how to handle and analyze a subset of complex survey data from the NHANES study using R. It begins by checking for missing data. The focus is on subsetting data based on complete information, emphasizing the importance of accounting for the full complex survey design to obtain unbiased variance estimates. Logistic regression is then run on this subset, with the tutorial explicitly differentiating between correct and incorrect approaches to consider the survey design. Finally, variable selection methods like backward elimination are discussed to determine significant predictors, emphasizing the retention of variables deemed important based on prior research.

Tip

Optional Content:

You’ll find that some sections conclude with an optional video walkthrough that demonstrates the code. Keep in mind that the content might have been updated since these videos were recorded. Watching these videos is optional.

Warning

Bug Report:

Fill out this form to report any issues with the tutorial.