Exercise 1 (A) Vibe Part 2

In this tutorial, we will be going through the following exercise (found here).

This exercise replicates the analysis done in this article:

Palis, Marchand & Oviedo-Joekes (2020). The relationship between sense of community belonging and self-rated mental health among Canadians with mental or substance use disorders. Journal of Mental Health, 29(2): 168–175. DOI: 10.1080/09638237.2018.1437602

We will go through the exercise questions and evaluate how well Gemini performs on each task. These tasks include applying eligibility criteria, subsetting the data to retain only the variables necessary for analysis, creating an analytic dataset, creating a Table 1, building a logistic regression model, and presenting odds ratios with 95% confidence intervals. Additionally, we will use what we learned from the previous tutorial (found here) to streamline these exercises.

In this part of the tutorial, we will go through the second set of tasks, which include creating a Table 1, performing logistic regression, and producing odds ratios and confidence intervals. These tutorials were completed in RStudio (Posit team 2023) using R version 4.3.2 (R Core Team 2023), with the following packages: dplyr (Wickham et al. 2023), car (Fox and Weisberg 2019), tableone (Yoshida and Bartel 2022), and Publish (Gerds and Ozenne 2023).

For this tutorial, I used Gemini’s free 2.5 Flash model.


Load data and required packages

Load the data saved in the previous tutorial. If you just completed the last tutorial and still have the data loaded in your environment, you can skip this step.

packages <- c("dplyr", "car", "tableone", "Publish")

lapply(packages, function(pkg) {
  if (!require(pkg, character.only = TRUE)) install.packages(pkg, dependencies = TRUE)
  library(pkg, character.only = TRUE)
})


directory <- "Data/accessing"  # Enter the path to your working directory here

load(paste0(directory, "/analytic_data.RData"))

Exercise 2a: Recode

For this first exercise, we need to recreate the Table 1 from the article. However, we need to omit the ‘Main source of income’ variable from the table.

Let’s send Gemini the following prompt:


Prompt Sent to Gemini:

Note

Reproduce Table 1 from the article using the tableone package in RStudio, based on the cleaned analytic dataset.

  • Stratify the table by the variable srmh (this will be used for the columns).

  • Include the following variables in the table:

    • community

    • sex

    • age

    • race

  • Exclude the variable ‘Main source of income’ (income) from the table.

  • Display counts and percentages for categorical variables, and means with standard deviations for continuous variables.

  • Ensure that the frequencies and percentages match those reported in the published Table 1.

  • Use the kableone function to print the table.


Gemini’s response:

Here is the code to create a Table 1 using the tableone package:

Tip
# Define the variables to be included in Table 1
myVars <- c("community", "sex", "age", "race")

# Define the stratifying variable (self-rated mental health)
strata <- "srmh"

# Create Table 1 object with specified options
tableOne <- CreateTableOne(
  vars = myVars,
  data = analytic_data,
  strata = strata,
  includeNA = FALSE, # Do not include NA counts/percentages
  test = FALSE,      # Do not perform or display statistical tests (p-values)
  addOverall = TRUE  # Add an overall column
)

# Print the table using kableone with desired formatting options
print(
  kableone(
    tableOne,
    showAllLevels = TRUE,  # Show all levels for categorical variables
    padColnames = TRUE,    # Pad column names for alignment
    insertLevel = TRUE     # Insert a level column for hierarchical variables
  )
)
#> 
#> 
#> |              |      level       |   Overall  |Poor or Fair |    Good   |Very good or excellent |
#> |:-------------|:-----------------|:-----------|:------------|:----------|:----------------------|
#> |n             |                  |2628        |1002         |885        |741                    |
#> |community (%) |Very weak         |480 (18.3)  |282 (28.1)   |118 (13.3) |80 (10.8)              |
#> |              |Somewhat weak     |857 (32.6)  |358 (35.7)   |309 (34.9) |190 (25.6)             |
#> |              |Somewhat strong   |1005 (38.2) |288 (28.7)   |362 (40.9) |355 (47.9)             |
#> |              |Very strong       |286 (10.9)  |74 ( 7.4)    |96 (10.8)  |116 (15.7)             |
#> |sex (%)       |Females           |1407 (53.5) |616 (61.5)   |487 (55.0) |304 (41.0)             |
#> |              |Males             |1221 (46.5) |386 (38.5)   |398 (45.0) |437 (59.0)             |
#> |age (%)       |15 to 24 years    |740 (28.2)  |191 (19.1)   |264 (29.8) |285 (38.5)             |
#> |              |25 to 34 years    |475 (18.1)  |141 (14.1)   |167 (18.9) |167 (22.5)             |
#> |              |35 to 44 years    |393 (15.0)  |185 (18.5)   |119 (13.4) |89 (12.0)              |
#> |              |45 to 54 years    |438 (16.7)  |220 (22.0)   |139 (15.7) |79 (10.7)              |
#> |              |55 to 64 years    |379 (14.4)  |198 (19.8)   |113 (12.8) |68 ( 9.2)              |
#> |              |65 years or older |203 ( 7.7)  |67 ( 6.7)    |83 ( 9.4)  |53 ( 7.2)              |
#> |race (%)      |Non-white         |458 (17.4)  |184 (18.4)   |140 (15.8) |134 (18.1)             |
#> |              |White             |2170 (82.6) |818 (81.6)   |745 (84.2) |607 (81.9)             |


That matches the table presented in the paper exactly! Gemini was able to create the table using the tableone package and printed it cleanly with the kableone function.



Exercise 3a: Regression

For this next exercise, we need to subset the dataset to exclude individuals who responded either “Very good” or “Excellent” on the self-rated mental health variable. Following this, we need to recode the variable to make it binary, ‘Good’ vs. ‘Poor’, and then convert it to a factor with ‘Poor’ as the reference level.

Let’s send Gemini the following prompt:


Prompt Sent to Gemini:

Note

Subset the dataset in RStudio, excluding individuals who responded “Very good” or “Excellent” on the self-rated mental health variable.


Gemini’s response:

Tip
# Subset the dataset to exclude individuals with "Very good or excellent" self-rated mental health
subset_data <- analytic_data %>%
  filter(srmh != "Very good or excellent")

# You can check the table of srmh in the new dataset
table(subset_data$srmh)
#> 
#>           Poor or Fair                   Good Very good or excellent 
#>                   1002                    885                      0


That worked as intended. Now let’s send the prompt to recode the variable as a binary factor.


Prompt Sent to Gemini:

Note

Recode self-rated mental health variable and make it a binary variable: ‘Good’ vs. ‘Poor’ (simplifying category labels only). Convert that variable to a factor variable with ‘Poor’ being the reference level.


Gemini’s response:

Tip

Here’s the R code to recode the srmh variable in your subset_data into a binary factor variable with ‘Good’ and ‘Poor’ categories, setting ‘Poor’ as the reference level.

# Recode the 'srmh' variable into a binary 'Good' vs. 'Poor' factor
subset_data <- subset_data %>%
  mutate(
    srmh_binary = car::recode(srmh,
                              "'Good' = 'Good';
                               'Poor or Fair' = 'Poor'"
                              # Any other values (like NA from previous steps) would remain NA
    ) %>%
      factor(levels = c("Poor", "Good")) # Set 'Poor' as the reference level
  )

# You can now check the levels and a table of the new binary variable
levels(subset_data$srmh_binary)
#> [1] "Poor" "Good"
table(subset_data$srmh_binary, useNA = "always")
#> 
#> Poor Good <NA> 
#> 1002  885    0


That worked great. The self-rated mental health variable was correctly recoded into a binary factor with ‘Poor’ as the reference level. The code Gemini provided also included checks to confirm the recoding was done correctly.



Exercise 3b: Reporting odds ratio

For this next exercise, we need to construct a logistic regression model to examine the relationship between community belonging and self-rated mental health among respondents with mental or substance use disorders. The model should be adjusted for three confounders: sex, age, and race/ethnicity.

Once the model is constructed, we will report the odds ratios and their associated 95% confidence intervals. To do so, we will use the Publish package.


Prompt Sent to Gemini:

Note

Construct a logistic regression model to examine the relationship between community belonging (explanatory variable) and self-rated mental health (outcome variable) among respondents with mental or substance use disorders.

Adjust the model for three confounders:

  • Sex

  • Age

  • Race/ethnicity

After fitting the model:

  • Report the odds ratios and their 95% confidence intervals

  • Use the publish package to present the results in a clear and interpretable format

  • Provide the R code to complete this task


Gemini’s response:

Tip
model <- glm(srmh_binary ~ community + sex + age + race,
             data = subset_data,
             family = binomial(link = "logit"))

results_table <- publish(model, level = 0.95)
#>   Variable             Units OddsRatio       CI.95    p-value 
#>  community         Very weak       Ref                        
#>                Somewhat weak      1.93 [1.48;2.53]    < 1e-04 
#>              Somewhat strong      2.90 [2.22;3.80]    < 1e-04 
#>                  Very strong      3.32 [2.27;4.85]    < 1e-04 
#>        sex           Females       Ref                        
#>                        Males      1.32 [1.09;1.60]   0.003993 
#>        age    15 to 24 years       Ref                        
#>               25 to 34 years      0.85 [0.63;1.15]   0.292243 
#>               35 to 44 years      0.45 [0.33;0.61]    < 1e-04 
#>               45 to 54 years      0.45 [0.34;0.61]    < 1e-04 
#>               55 to 64 years      0.41 [0.30;0.56]    < 1e-04 
#>            65 years or older      0.87 [0.59;1.27]   0.468623 
#>       race         Non-white       Ref                        
#>                        White      1.32 [1.03;1.71]   0.030025

print(results_table)
#>     Variable             Units OddsRatio       CI.95    p-value
#> 1  community         Very weak       Ref                       
#> 2                Somewhat weak      1.93 [1.48;2.53]    < 1e-04
#> 3              Somewhat strong      2.90 [2.22;3.80]    < 1e-04
#> 4                  Very strong      3.32 [2.27;4.85]    < 1e-04
#> 5        sex           Females       Ref                       
#> 6                        Males      1.32 [1.09;1.60]   0.003993
#> 7        age    15 to 24 years       Ref                       
#> 8               25 to 34 years      0.85 [0.63;1.15]   0.292243
#> 9               35 to 44 years      0.45 [0.33;0.61]    < 1e-04
#> 10              45 to 54 years      0.45 [0.34;0.61]    < 1e-04
#> 11              55 to 64 years      0.41 [0.30;0.56]    < 1e-04
#> 12           65 years or older      0.87 [0.59;1.27]   0.468623
#> 13      race         Non-white       Ref                       
#> 14                       White      1.32 [1.03;1.71]   0.030025

That worked well. Gemini successfully built the logistic regression model, adjusted for the correct confounders, and presented the odds ratios with confidence intervals using the Publish package. The code ran smoothly, and the output was clear and interpretable.
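In the spirit of double-checking AI output, odds ratios like those above can be verified by exponentiating a model’s coefficients directly. The sketch below does this on a small simulated dataset (illustrative only, not the survey data); `exp(coef())` recovers the odds ratios, and `exp(confint.default())` gives Wald-style 95% intervals, the kind typically reported in formatted regression output.

```r
# Self-contained sketch: verifying odds ratios by hand on a toy logistic model
# (simulated data, not the survey dataset)
set.seed(1)
toy <- data.frame(
  y = rbinom(200, 1, 0.5),
  x = factor(sample(c("weak", "strong"), 200, replace = TRUE))
)
fit <- glm(y ~ x, data = toy, family = binomial)

# Odds ratios are the exponentiated coefficients...
exp(coef(fit))

# ...and Wald 95% CIs come from exponentiating coefficient +/- 1.96 * SE
exp(confint.default(fit))
```

If the hand-computed values match the formatted table, you can be confident the reporting step did not alter the model results.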



Summary

Using Gemini, we were able to complete the second set of tasks. In all instances, Gemini completed the tasks without error, including identifying the correct variables to use when creating the Table 1, subsetting the data, and constructing the logistic regression model. It was also successful at printing the table and regression results in a markdown-friendly manner.

In this set of exercises, Gemini again performed well and returned accurate results. Still, it’s important to approach each step with care. Even when outputs look correct, reviewing them closely ensures the work aligns with the intended analysis and avoids potential oversights.



Step-by-Step Plan for Using Generative AI in Data Analysis

  1. Be clear and specific
    Define the task directly and avoid vague instructions.

  2. Provide starter code or examples
    Help the model understand your expectations and structure.

  3. Include variable structure
    Share the output of str() or a data dictionary so the model knows variable types.

  4. State factor levels explicitly
    List how variables should be recoded or grouped rather than relying on inference.

  5. Reference sources directly
    If you’re asking the model to replicate something (e.g., a published Table 1), summarize it clearly instead of assuming it can interpret the reference alone.

  6. Review the output
    Check that the generated code and results match your goals and are statistically valid.

  7. Be prepared to troubleshoot
    If the model’s response is incomplete or incorrect, use your understanding and available documentation to fix it.

  8. Use AI as a guide, not a replacement
    Generative AI can streamline your work, but knowledge and software familiarity are still essential.

  9. (NEW) Double Check any work produced by AI
    While some AI models may perform well and produce exactly what you ask for, it’s still important to carry out thorough checks to ensure the output works as intended.
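As a minimal illustration of steps 3 and 4 above, here is what sharing variable structure might look like in practice (using a toy data frame, not the actual survey data). The output of str() and levels() is compact enough to paste directly into a prompt, and it tells the model each variable’s type, its categories, and which level is the reference.

```r
# Toy data frame standing in for an analytic dataset (illustrative only)
toy <- data.frame(
  srmh = factor(c("Poor", "Good", "Good"), levels = c("Poor", "Good")),
  age  = c(23, 41, 67)
)

# str() shows variable types and factor levels at a glance --
# exactly the context a model needs to write correct recoding code
str(toy)

# Listing levels explicitly removes any ambiguity about the reference category
levels(toy$srmh)
```

Pasting this kind of summary into a prompt is far more reliable than hoping the model infers the coding scheme from variable names alone.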

References

Fox, John, and Sanford Weisberg. 2019. An R Companion to Applied Regression. Third. Thousand Oaks CA: Sage. https://socialsciences.mcmaster.ca/jfox/Books/Companion/.
Gerds, Thomas A., and Brice Ozenne. 2023. Publish: Format Output of Various Routines in a Suitable Way for Reports and Publication. https://CRAN.R-project.org/package=Publish.
Posit team. 2023. RStudio: Integrated Development Environment for R. Boston, MA: Posit Software, PBC. http://www.posit.co/.
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Wickham, Hadley, Romain François, Lionel Henry, Kirill Müller, and Davis Vaughan. 2023. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Yoshida, Kazuki, and Alexander Bartel. 2022. Tableone: Create ’Table 1’ to Describe Baseline Characteristics with or Without Propensity Score Weights. https://CRAN.R-project.org/package=tableone.