Exercise 1 (A) Vibe Part 2
In this tutorial, we will be going through the following exercise (found here).
This exercise replicates the analysis done in this article:
Palis, Marchand & Oviedo-Joekes (2020). The relationship between sense of community belonging and self-rated mental health among Canadians with mental or substance use disorders. Journal of Mental Health, 29(2): 168–175. DOI: 10.1080/09638237.2018.1437602
We will go through the exercise questions and evaluate how well Gemini performs on each task. These tasks include applying eligibility criteria, subsetting the data to retain the variables needed for analysis, creating an analytic dataset, creating a Table 1, building a logistic regression model, and presenting odds ratios with 95% confidence intervals. Additionally, we will use what we learned from the previous tutorial (found here) to streamline these exercises.
In this part of the tutorial, we will go through the second set of tasks: creating a Table 1, performing logistic regression, and producing odds ratios and confidence intervals. These tutorials were completed in RStudio (Posit team 2023) using R version 4.3.2 (R Core Team 2023), with the following packages: dplyr (Wickham et al. 2023), car (Fox and Weisberg 2019), tableone (Yoshida and Bartel 2022), and Publish (Gerds and Ozenne 2023).
For this tutorial, I used Gemini’s free 2.5 Flash model.
Load data and required packages
Load the data saved in the previous tutorial. If you just completed the last tutorial and still have the data loaded in your environment, you can skip this step.
packages <- c("dplyr", "car", "tableone", "Publish")
lapply(packages, function(pkg) {
  if (!require(pkg, character.only = TRUE)) install.packages(pkg, dependencies = TRUE)
  library(pkg, character.only = TRUE)
})
directory <- "Data/accessing" # Enter the path to your working directory here
load(paste0(directory, "/analytic_data.RData"))
Exercise 2a: Recode
For this first exercise, we need to recreate the Table 1 from the article. However, we need to omit the ‘Main source of income’ variable from the table.
Let’s send Gemini the following prompt:
Prompt Sent to Gemini:
Reproduce Table 1 from the article using the tableone package in RStudio, based on the cleaned analytic dataset.
- Stratify the table by the variable srmh (this will be used for the columns).
- Include the following variables in the table: community, sex, age, race.
- Exclude the variable ‘Main source of income’ (income) from the table.
- Display counts and percentages for categorical variables, and means with standard deviations for continuous variables.
- Ensure that the frequencies and percentages match those reported in the published Table 1.
- Use the kableone function to print the table.
Gemini’s response:
Here is the code to create a Table 1 using the tableone package:
# Define the variables to be included in Table 1
myVars <- c("community", "sex", "age", "race")

# Define the stratifying variable (self-rated mental health)
strata <- "srmh"

# Create Table 1 object with specified options
tableOne <- CreateTableOne(
  vars = myVars,
  data = analytic_data,
  strata = strata,
  includeNA = FALSE, # Do not include NA counts/percentages
  test = FALSE,      # Do not perform or display statistical tests (p-values)
  addOverall = TRUE  # Add an overall column
)

# Print the table using kableone with desired formatting options
print(
  kableone(
    tableOne,
    showAllLevels = TRUE, # Show all levels for categorical variables
    padColnames = TRUE,   # Pad column names for alignment
    insertLevel = TRUE    # Insert a level column for hierarchical variables
  )
)
#>
#>
#> | | level | Overall |Poor or Fair | Good |Very good or excellent |
#> |:-------------|:-----------------|:-----------|:------------|:----------|:----------------------|
#> |n | |2628 |1002 |885 |741 |
#> |community (%) |Very weak |480 (18.3) |282 (28.1) |118 (13.3) |80 (10.8) |
#> | |Somewhat weak |857 (32.6) |358 (35.7) |309 (34.9) |190 (25.6) |
#> | |Somewhat strong |1005 (38.2) |288 (28.7) |362 (40.9) |355 (47.9) |
#> | |Very strong |286 (10.9) |74 ( 7.4) |96 (10.8) |116 (15.7) |
#> |sex (%) |Females |1407 (53.5) |616 (61.5) |487 (55.0) |304 (41.0) |
#> | |Males |1221 (46.5) |386 (38.5) |398 (45.0) |437 (59.0) |
#> |age (%) |15 to 24 years |740 (28.2) |191 (19.1) |264 (29.8) |285 (38.5) |
#> | |25 to 34 years |475 (18.1) |141 (14.1) |167 (18.9) |167 (22.5) |
#> | |35 to 44 years |393 (15.0) |185 (18.5) |119 (13.4) |89 (12.0) |
#> | |45 to 54 years |438 (16.7) |220 (22.0) |139 (15.7) |79 (10.7) |
#> | |55 to 64 years |379 (14.4) |198 (19.8) |113 (12.8) |68 ( 9.2) |
#> | |65 years or older |203 ( 7.7) |67 ( 6.7) |83 ( 9.4) |53 ( 7.2) |
#> |race (%) |Non-white |458 (17.4) |184 (18.4) |140 (15.8) |134 (18.1) |
#> | |White |2170 (82.6) |818 (81.6) |745 (84.2) |607 (81.9) |
That matches the table presented in the paper exactly! Gemini was able to create the table using the tableone package and printed it cleanly with the kableone function.
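Before trusting a generated Table 1, it helps to spot-check a few cells yourself. The sketch below uses a small hypothetical data frame in place of analytic_data, and shows how base R's table() and prop.table() reproduce the kind of counts and within-column percentages that tableone reports:

```r
# Hypothetical toy data standing in for analytic_data, just to illustrate the check
set.seed(1)
toy <- data.frame(
  srmh      = sample(c("Poor or Fair", "Good"), 100, replace = TRUE),
  community = sample(c("Very weak", "Very strong"), 100, replace = TRUE)
)

# Cross-tabulate community belonging within each srmh column
counts <- table(toy$community, toy$srmh)

# Within-column percentages, as reported in Table 1 (margin = 2 = by column)
pcts <- round(prop.table(counts, margin = 2) * 100, 1)

counts
pcts
```

Comparing a couple of these cells against the published table is usually enough to catch a mislabelled stratum or a swapped margin.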
Exercise 3a: Regression
For this next exercise, we need to subset the dataset to exclude individuals who responded either “Very good” or “Excellent” on the self-rated mental health variable. We then need to recode this variable into a binary ‘Good’ vs. ‘Poor’ variable and convert it to a factor with ‘Poor’ as the reference level.
Let’s send Gemini the following prompt:
Prompt Sent to Gemini:
Subset the dataset in RStudio, excluding individuals who responded “Very good” or “Excellent” on the self-rated mental health variable.
Gemini’s response:
# Subset the dataset to exclude individuals with "Very good or excellent" self-rated mental health
subset_data <- analytic_data %>%
filter(srmh != "Very good or excellent")
# You can check the table of srmh in the new dataset
table(subset_data$srmh)
#>
#> Poor or Fair Good Very good or excellent
#> 1002 885 0
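One detail worth noticing in the output above: filtering removes the rows for “Very good or excellent” but not the factor level itself, which is why that level still appears with a count of 0. If the empty level would interfere with later steps, droplevels() discards it. A minimal sketch with a toy factor standing in for subset_data$srmh:

```r
# Toy factor mimicking srmh after filtering: one level left with no observations
srmh <- factor(
  c("Poor or Fair", "Good", "Good"),
  levels = c("Poor or Fair", "Good", "Very good or excellent")
)

table(srmh)              # the excluded level still shows up with a count of 0
srmh <- droplevels(srmh) # drop unused levels left over from the subset
table(srmh)              # only the remaining levels are tabulated
```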
That worked as intended. Now let’s send the prompt to recode the variable as a binary factor.
Prompt Sent to Gemini:
Recode self-rated mental health variable and make it a binary variable: ‘Good’ vs. ‘Poor’ (simplifying category labels only). Convert that variable to a factor variable with ‘Poor’ being the reference level.
Gemini’s response:
Here’s the R code to recode the srmh variable in your subset_data into a binary factor variable with ‘Good’ and ‘Poor’ categories, setting ‘Poor’ as the reference level.
# Recode the 'srmh' variable into a binary 'Good' vs. 'Poor' factor
subset_data <- subset_data %>%
  mutate(
    srmh_binary = car::recode(
      srmh,
      "'Good' = 'Good';
       'Poor or Fair' = 'Poor'"
      # Any other values (like NA from previous steps) would remain NA
    ) %>%
      factor(levels = c("Poor", "Good")) # Set 'Poor' as the reference level
  )
# You can now check the levels and a table of the new binary variable
levels(subset_data$srmh_binary)
#> [1] "Poor" "Good"
table(subset_data$srmh_binary, useNA = "always")
#>
#> Poor Good <NA>
#> 1002 885 0
That worked great. The self-rated mental health variable was correctly recoded into a binary factor with ‘Poor’ as the reference level. The code Gemini provided also included checks to confirm the recoding was done correctly.
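For reference, the same result can be reached without car::recode: base R's relevel() (or, as Gemini did, factor() with an explicit levels order) sets the reference level directly. A small sketch with a toy factor standing in for srmh_binary:

```r
# Toy binary factor where "Good" happens to come first
srmh_binary <- factor(c("Good", "Poor", "Good"), levels = c("Good", "Poor"))

# relevel() moves the chosen level to the front, making it the reference
# level used by modelling functions such as glm()
srmh_binary <- relevel(srmh_binary, ref = "Poor")

levels(srmh_binary) # "Poor" now comes first
```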
Exercise 3b: Reporting odds ratio
For this next exercise, we need to construct a logistic regression model to examine the relationship between community belonging and self-rated mental health among respondents with mental or substance use disorders. The model should be adjusted for three confounders: sex, age, and race/ethnicity.
Once the model is constructed, report the odds ratios and their associated 95% confidence intervals. To do so, we will use the Publish package.
Prompt Sent to Gemini:
Construct a logistic regression model to examine the relationship between community belonging (explanatory variable) and self-rated mental health (outcome variable) among respondents with mental or substance use disorders.
Adjust the model for three confounders:
Sex
Age
Race/ethnicity
After fitting the model:
Report the odds ratios and their 95% confidence intervals
Use the publish package to present the results in a clear and interpretable format
Provide the R code to complete this task
Gemini’s response:
model <- glm(srmh_binary ~ community + sex + age + race,
             data = subset_data,
             family = binomial(link = "logit"))
results_table <- publish(model, level = 0.95)
#> Variable Units OddsRatio CI.95 p-value
#> community Very weak Ref
#> Somewhat weak 1.93 [1.48;2.53] < 1e-04
#> Somewhat strong 2.90 [2.22;3.80] < 1e-04
#> Very strong 3.32 [2.27;4.85] < 1e-04
#> sex Females Ref
#> Males 1.32 [1.09;1.60] 0.003993
#> age 15 to 24 years Ref
#> 25 to 34 years 0.85 [0.63;1.15] 0.292243
#> 35 to 44 years 0.45 [0.33;0.61] < 1e-04
#> 45 to 54 years 0.45 [0.34;0.61] < 1e-04
#> 55 to 64 years 0.41 [0.30;0.56] < 1e-04
#> 65 years or older 0.87 [0.59;1.27] 0.468623
#> race Non-white Ref
#> White 1.32 [1.03;1.71] 0.030025
print(results_table)
#> Variable Units OddsRatio CI.95 p-value
#> 1 community Very weak Ref
#> 2 Somewhat weak 1.93 [1.48;2.53] < 1e-04
#> 3 Somewhat strong 2.90 [2.22;3.80] < 1e-04
#> 4 Very strong 3.32 [2.27;4.85] < 1e-04
#> 5 sex Females Ref
#> 6 Males 1.32 [1.09;1.60] 0.003993
#> 7 age 15 to 24 years Ref
#> 8 25 to 34 years 0.85 [0.63;1.15] 0.292243
#> 9 35 to 44 years 0.45 [0.33;0.61] < 1e-04
#> 10 45 to 54 years 0.45 [0.34;0.61] < 1e-04
#> 11 55 to 64 years 0.41 [0.30;0.56] < 1e-04
#> 12 65 years or older 0.87 [0.59;1.27] 0.468623
#> 13 race Non-white Ref
#> 14 White 1.32 [1.03;1.71] 0.030025
That worked well. Gemini successfully built the logistic regression model, adjusted for the correct confounders, and presented the odds ratios with confidence intervals using the Publish package. The code ran smoothly and the output was clear and interpretable.
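It is also easy to cross-check a table like this by computing the odds ratios by hand: exponentiating the glm coefficients gives the ORs, and exponentiating Wald confidence limits from confint.default() gives intervals of the kind such tables typically report (confint() would instead give profile-likelihood intervals, which can differ slightly). A sketch using a toy dataset in place of subset_data:

```r
# Toy data standing in for subset_data, just to illustrate the cross-check
set.seed(42)
toy <- data.frame(
  outcome  = rbinom(200, 1, 0.5),
  exposure = factor(sample(c("Low", "High"), 200, replace = TRUE))
)

fit <- glm(outcome ~ exposure, data = toy, family = binomial)

# Odds ratios are exponentiated coefficients; Wald 95% CIs come from
# exponentiating the limits returned by confint.default()
or <- exp(coef(fit))
ci <- exp(confint.default(fit, level = 0.95))

round(cbind(OddsRatio = or, ci), 2)
```

If the hand-computed ORs and intervals match the packaged table, you can be confident the formatting layer has not silently changed the model.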
Summary
Using Gemini, we were able to complete the second set of tasks. In all instances, Gemini completed the tasks without error, including identifying the correct variables to use when creating the Table 1, subsetting the data, and constructing the logistic regression model. It was also successful in printing the table and regression results in a markdown-friendly manner.
In this set of exercises, Gemini again performed well and returned accurate results. Still, it’s important to approach each step with care. Even when outputs look correct, reviewing them closely ensures the work aligns with the intended analysis and avoids potential oversights.
Step-by-Step Plan for Using Generative AI in Data Analysis
1. Be clear and specific: Define the task directly and avoid vague instructions.
2. Provide starter code or examples: Help the model understand your expectations and structure.
3. Include variable structure: Share the output of str() or a data dictionary so the model knows variable types.
4. State factor levels explicitly: List how variables should be recoded or grouped rather than relying on inference.
5. Reference sources directly: If you’re asking the model to replicate something (e.g., a published Table 1), summarize it clearly instead of assuming it can interpret the reference alone.
6. Review the output: Check that the generated code and results match your goals and are statistically valid.
7. Be prepared to troubleshoot: If the model’s response is incomplete or incorrect, use your understanding and available documentation to fix it.
8. Use AI as a guide, not a replacement: Generative AI can streamline your work, but knowledge and software familiarity are still essential.
9. (NEW) Double check any work produced by AI: While some AI models may perform well and produce exactly what you ask for, it’s still important to carry out thorough checks to ensure the output works as intended.
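As a small illustration of that final point, even a couple of stopifnot() assertions can catch a silent recoding error. The counts below (1002 and 885) are the group sizes shown in the recoding output earlier in this tutorial:

```r
# Group sizes before recoding (from table(subset_data$srmh) earlier)
before <- c("Poor or Fair" = 1002, "Good" = 885)

# Group sizes after recoding (from table(subset_data$srmh_binary) earlier)
after <- c("Poor" = 1002, "Good" = 885)

# A pure relabelling must preserve the total and each group's size;
# stopifnot() errors loudly if any check fails, and stays silent otherwise
stopifnot(sum(before) == sum(after))
stopifnot(unname(before["Poor or Fair"]) == unname(after["Poor"]))
stopifnot(unname(before["Good"]) == unname(after["Good"]))
```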