Data wrangling

Background

The realm of data science is vast, and one of its foundational pillars is data wrangling. Data wrangling, often known as data munging, is the process of transforming raw data into a more digestible and usable format for analysis. In the context of R, a powerful statistical programming language, data wrangling becomes an essential skill for any data enthusiast. This chapter is dedicated to imparting practical knowledge on various data manipulation, import, and summarization techniques in R. Through a series of meticulously crafted tutorials, you will be equipped with the tools and techniques to handle, transform, and visualize data efficiently.

Introducing R Basics at the outset of an epidemiological methods tutorial book is akin to laying the foundation before building a house. It ensures that all readers, regardless of their prior experience, start on the same page, understanding the fundamental tools and language of R. This foundational knowledge not only smoothens the learning curve but also boosts confidence, allowing learners to focus on complex epidemiological techniques without being bogged down by the intricacies of the R language. In essence, mastering the basics first ensures a more cohesive and effective learning experience as the material advances.

In this chapter, we embark on a structured journey through the intricate world of data wrangling in R. We begin by laying a solid foundation with R Basics, ensuring you grasp the essential elements of R programming. Once grounded in the basics, we progress to understanding the core R Data Types, diving deep into matrices, lists, and data frames. With a firm grasp of these structures, we introduce Automating Tasks to empower you with techniques that streamline the handling of vast datasets. Following this, we delve into the practical aspects of Importing Datasets, showcasing various methods to bring data from different formats into R. Building on this, Data Manipulation comes next, where we explore the myriad ways to modify and reshape your datasets to suit analytical needs. We then turn our attention to Importing External Data, offering a hands-on demonstration of how to integrate specific external datasets into your R environment. As we approach the chapter’s culmination, we emphasize the importance of Summary Tables in medical research, teaching you the art and science of data summarization. Finally, we wrap up with R Markdown, providing a comprehensive guide on how to seamlessly document your R code and analytical findings, ensuring your work is both reproducible and presentable.

Important

Datasets:

All of the datasets used in this tutorial can be accessed from this GitHub repository folder

Overview of tutorials

R Basics

This tutorial introduces the basics of R programming. It covers topics such as setting up R and RStudio, using R as a calculator, creating variables, working with vectors, plotting data, and accessing help resources.

R Data Types

This tutorial covers three primary data structures in R: matrices, lists, and data frames. Matrices are two-dimensional arrays with elements of the same type, and their manipulation includes reshaping and combining. Lists in R are versatile collections that can store various R objects, including matrices. Data frames, on the other hand, are akin to matrices but permit columns of diverse data types. The tutorial offers guidance on creating, modifying, and merging data frames and checking their dimensions.

Automating Tasks

Medical data analysis often grapples with vast and intricate data sets. Manual handling isn’t just tedious; it’s error-prone, especially given the critical decisions hinging on the results. This tutorial introduces automation techniques in R, a leading language for statistical analysis. By learning to use loops and functions, you can automate repetitive tasks, minimize errors, and conduct analyses more efficiently. Dive in to enhance your data handling skills.

Importing Dataset from the Local Computer

This tutorial focuses on importing data into R. It demonstrates how to import data from CSV and SAS formats using functions like read.csv and sasxport.get. It also includes examples of loading specific variables, dropping variables, subsetting observations based on certain criteria, and handling missing values.

Data Manipulation

This tutorial explores various data manipulation techniques in R. It covers topics such as dropping variables from a dataset, keeping specific variables, subsetting observations based on specific criteria, converting variable types (e.g., factors, strings), and handling missing values.

Import External Data Over the Internet

This tutorial provides examples of importing external data into R. It includes specific examples of importing a CSV file (Employee Salaries - 2017 data) and a SAS file (NHANES 2015-2016 data). It also demonstrates how to save a working dataset in different formats, such as CSV and RData.

Summary Tables

This tutorial emphasizes the importance of data summarization in medical research and epidemiology, specifically how to summarize medical data using R. It demonstrates creating “Table 1”, a typical descriptive statistics table in research papers, with examples that use the built-in R functions and specialized packages to efficiently summarize and stratify data.

R Markdown

This beginner-friendly tutorial guides you through working with R Markdown (RMD) files in RStudio, a popular IDE for R. The tutorial covers installing prerequisites, creating a new RMD file, and the basics of “knitting” to compile the document into various formats like HTML, PDF, or Word. It delves into embedding R code chunks and plain text within the RMD file, using the knitr package for document rendering. Tips for troubleshooting common issues and additional resources for further learning are also provided.

Tip

Optional Content:

You’ll find that some sections conclude with an optional video walkthrough that demonstrates the code. Keep in mind that the content might have been updated since these videos were recorded. Watching these videos is optional.

Warning

Bug Report:

Fill out this form to report any issues with the tutorial.