Accessing data

Background

Surveys serve as a pivotal tool for collecting and evaluating health-related information on a national scale. More often than not, it’s the governmental bodies that take the lead in gathering this data. Recognizing the value of this information, many governments not only compile and analyze these datasets but also ensure they are accessible to the public, especially for research purposes. In this guide, we will delve into an array of survey methodologies, provide illustrative examples, and guide you through the process of downloading pertinent data from both Canadian and American repositories. To make this more tangible, we will conclude with a hands-on example, showcasing how to replicate findings from an academic paper that leveraged one of these publicly available datasets.

Now that we are familiar with R basics and data wrangling, we are now dedicating a chapter to accessing and downloading nationally representative survey datasets is pivotal for a hands-on epidemiological tutorial. These datasets, often rich in information and reflective of diverse populations, serve as the backbone for real-world analysis. By guiding learners on how to obtain these datasets, the book ensures that they not only grasp theoretical concepts but also gain practical experience working with authentic, large-scale data. This approach bridges the gap between theory and practice, allowing readers to apply learned techniques on datasets that mirror real-world complexities, thereby enhancing the relevance and applicability of their analytical skills.

Important

Datasets:

All of the datasets used in this tutorial can be accessed from this GitHub repository folder

Overview of tutorials

Survey data sources

The tutorial lists primary complex survey data sources, including the Canadian Community Health Survey and National Health and Nutrition Examination Survey, with several offering dedicated R packages for data access.

Descriptions of data sources

This tutorial provides comprehensive instructions on how to import and process health survey datasets, specifically focusing on the Canadian Community Health Survey (CCHS), National Health and Nutrition Examination Survey (NHANES), and National Health Interview Survey (NHIS).

Importing CCHS to R

The section provides detailed steps for importing the Canadian Community Health Survey dataset from the UBC library into RStudio, with processing options using SAS, the free software PSPP, and directly in R.

Importing NHANES to R

The tutorial guides users on how to access and import the NHANES dataset from the CDC website into RStudio, detailing the dataset’s structure and providing methods both manually and using an R package.

Reproducing results

The tutorial guides users through accessing, processing, and analyzing NHANES data to reproduce the results from a referenced article using R code.

Importing NHIS to R

This chapter serves as a tutorial on accessing and importing the National Health Interview Survey (NHIS) dataset from the US Centers for Disease Control and Prevention (CDC) website into RStudio. The NHIS is an annual cross-sectional survey managed by the CDC, offering insights into population disease prevalence, disability extent, and utilization of health care services. The data files are available in various formats, including ASCII, CSV, and SAS. Users can combine datasets from different years; however, tracing the same individual across cycles is not feasible. The chapter provides step-by-step guidance on downloading the NHIS dataset, particularly the ‘Adult’ data from 2021, verifying the imported data, and merging datasets within the same survey cycle using the unique household identifier.

Linking mortality data

This tutorial provides instructions on how to link public-use US mortality data with NHANES dataset, with the option to do the same with NHIS. The process involves downloading the mortality data in dat format from the CDC website, loading it into the R environment, and then merging it with the NHANES data using a unique identifier. Various variables related to mortality status, underlying causes of death, and other health-related factors are explained, and Table 1 is created with unweighted and weighted statistics.

Note

What is Coming Next:

The subsequent chapter on Research Questions serves as a valuable guide for constructing an analytics-driven data set tailored to your specific research queries. It will cover crucial aspects such as the types of variables to collect and how to set eligibility criteria, followed by approaches to data analysis based on your research questions. It’s important to note that research questions can fall into two main categories: predictive or causal. For a deeper understanding of variable selection and analytical tools suited to these types of questions, the chapters on the Roles of Variables and Predictive Models offer insightful guidance.

Warning

Bug Report:

Fill out this form to report any issues with the tutorial.