Exploratory Data Analysis

Background

Now that we have accessed our data, it is essential to spend some time familiarizing ourselves with it. Conducting an exploratory data analysis (EDA) helps us understand the data’s characteristics by visualizing it or using quantitative measures to summarize variables. These steps can reveal outliers and uncover interesting or unexpected patterns in the data. The results of the exploratory analysis will guide our decisions in subsequent analyses.

In this chapter, we demonstrate a range of exploratory methods for different variable types, each with hands-on examples. Together they form a practical toolbox of EDA strategies.

In the previous chapter, we learned how to access external survey datasets. This chapter covers the first step of working with those data: exploratory data analysis. The next chapter introduces two major types of research questions and works through examples of each, leading into more detailed chapters on the analytic approaches suited to each type.

Important

Datasets:

All the datasets used in this tutorial can be accessed from this GitHub repository folder.

Overview of Tutorials

Exploring Individual Variables

This tutorial introduces basic methods for summarizing and visualizing continuous and categorical variables. These methods provide an overview of the types of variables in the dataset and how they behave.
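As a rough illustration of what summarizing individual variables can look like, the sketch below computes basic summaries for one continuous and one categorical variable using only the Python standard library. The records, the variable names (`age`, `smoker`), and the helper functions are hypothetical and chosen for illustration only; the tutorial itself may use different tools and datasets.

```python
from collections import Counter
import statistics

# Hypothetical toy records; names and values are illustrative assumptions,
# not taken from any of the survey datasets used in the tutorials.
records = [
    {"age": 34, "smoker": "never"},
    {"age": 51, "smoker": "former"},
    {"age": 29, "smoker": "never"},
    {"age": 62, "smoker": "current"},
    {"age": 45, "smoker": "never"},
    {"age": 38, "smoker": "former"},
]


def summarize_continuous(values):
    """Return basic location and spread measures for a numeric variable."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),  # sample standard deviation
        "q1": q1,
        "q3": q3,
        "min": min(values),
        "max": max(values),
    }


def summarize_categorical(values):
    """Return (count, proportion) for each level, most frequent first."""
    counts = Counter(values)
    total = len(values)
    return {level: (count, count / total) for level, count in counts.most_common()}


age_summary = summarize_continuous([r["age"] for r in records])
smoker_summary = summarize_categorical([r["smoker"] for r in records])

print(age_summary)
print(smoker_summary)
```

Numeric summaries such as the mean and quartiles suit continuous variables, while counts and proportions suit categorical ones. Checking which kind of summary applies is itself a quick way to confirm each variable's type.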

Exploring Pairwise Variable Relationships

This tutorial introduces methods to explore and visualize pairwise variable relationships. Examining these relationships allows us to identify correlations and discover potentially interesting patterns for further investigation.
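To make the idea of a pairwise relationship concrete, here is a minimal sketch using only the Python standard library: a Pearson correlation for two continuous variables and a simple cross-tabulation for two categorical ones. The data, the variable names (`age`, `sbp`, `sex`, `smoker`), and both helper functions are hypothetical assumptions for illustration, not the tutorial's own code.

```python
from collections import Counter
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


def cross_tab(a, b):
    """Count how often each pair of levels occurs together."""
    return Counter(zip(a, b))


# Hypothetical toy data; values are illustrative only.
age = [34, 51, 29, 62, 45, 38]
sbp = [118, 131, 112, 140, 127, 121]  # systolic blood pressure
sex = ["F", "M", "F", "M", "F", "M"]
smoker = ["never", "former", "never", "current", "never", "former"]

r = pearson_r(age, sbp)
table = cross_tab(sex, smoker)

print(f"Pearson r between age and sbp: {r:.3f}")
for (sex_level, smoker_level), count in sorted(table.items()):
    print(sex_level, smoker_level, count)
```

A correlation coefficient captures only linear association, so it is usually worth pairing it with a scatter plot before drawing conclusions.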

Useful Packages for Exploratory Data Analysis

This tutorial provides detailed instructions on how to import and process health survey datasets, specifically focusing on the Canadian Community Health Survey (CCHS), National Health and Nutrition Examination Survey (NHANES), and National Health Interview Survey (NHIS).

Note

What is Coming Next:

The upcoming chapter on Research Questions is a guide to constructing an analytic dataset tailored to a specific research question. It covers selecting relevant variables and setting eligibility criteria, followed by approaches to data analysis suited to the question. Research questions generally fall into two categories: predictive and causal. For a deeper understanding of variable selection and the analytical tools suited to each category, refer to the chapters on Roles of Variables and Predictive Models.

Warning

Bug Report:

Please fill out this form to report any issues with the tutorial.