# --- Load Libraries ---
library(survey)
library(dplyr)
library(knitr)
library(car)
library(tidyr)
library(Publish)
library(stringr)
library(kableExtra)
# install.packages("devtools")
# devtools::install_github("ehsanx/svyTable1",
# build_vignettes = TRUE)
library(svyTable1)
# --- Load Data ---
load("Data/surveydata/Flegal2016.RData")
# --- Create Analytic Sample (N=5,455) ---
dat.analytic <- subset(dat.full, RIDAGEYR >= 20)
dat.analytic <- subset(dat.analytic, !is.na(BMXBMI))
dat.analytic <- subset(dat.analytic, is.na(RIDEXPRG) | RIDEXPRG != "Yes, positive lab pregnancy test")
# --- Create ALL Analysis Variables in the Full Dataset ---
dat.full <- dat.full %>%
mutate(
Age = cut(RIDAGEYR,
breaks = c(20, 40, 60, Inf),
right = FALSE,
labels = c("20-39", "40-59", ">=60")),
race = case_when(
RIDRETH3 == "Non-Hispanic White" ~ "Non-Hispanic white",
RIDRETH3 == "Non-Hispanic Black" ~ "Non-Hispanic black",
RIDRETH3 == "Non-Hispanic Asian" ~ "Non-Hispanic Asian",
RIDRETH3 %in% c("Mexican American", "Other Hispanic") ~ "Hispanic",
RIDRETH3 == "Other Race - Including Multi-Rac" ~ "Other",
TRUE ~ NA_character_
),
race = factor(race, levels = c("Non-Hispanic white", "Non-Hispanic black", "Non-Hispanic Asian", "Hispanic", "Other")),
gender = factor(RIAGENDR, labels = c("Male", "Female")),
obese = factor(ifelse(BMXBMI >= 30, "Yes", "No"), levels = c("No", "Yes")),
obese_class3 = factor(ifelse(BMXBMI >= 40, "Yes", "No"), levels = c("No", "Yes")),
smoking = case_when(
SMQ020 == "No" ~ "Never smoker",
SMQ020 == "Yes" & SMQ040 == "Not at all" ~ "Former smoker",
SMQ020 == "Yes" ~ "Current smoker",
TRUE ~ NA_character_
),
smoking = factor(smoking, levels = c("Never smoker", "Former smoker", "Current smoker")),
education = case_when(
DMDEDUC2 %in% c("Less than 9th grade", "9-11th grade (Includes 12th grad)") ~ "<High school",
DMDEDUC2 == "High school graduate/GED or equi" ~ "High school",
DMDEDUC2 %in% c("Some college or AA degree", "College graduate or above") ~ ">High school",
TRUE ~ NA_character_
),
education = factor(education, levels = c("High school", "<High school", ">High school"))
)
# --- Create Survey Design Object ---
options(survey.lonely.psu = "adjust")
svy.design.full <- svydesign(id = ~SDMVPSU,
strata = ~SDMVSTRA,
weights = ~WTINT2YR,
nest = TRUE,
data = dat.full)
analytic_design <- subset(svy.design.full, SEQN %in% dat.analytic$SEQN)
NHANES: Reliability Standards
Introduction
This tutorial reproduces the key tables from the Flegal et al. (2016) article. The analysis uses the same NHANES data and aims to replicate the unweighted sample size counts from Table 1 and the weighted logistic regression models from Table 3. We incorporate NCHS/CDC reliability standards to ensure estimates are statistically defensible.
Setup and Data Preparation
This first section prepares the data for analysis. The key steps are:
-
Load Data: The
Flegal2016.RData
file containing the necessary NHANES data is loaded. -
Define Analytic Sample: An
analytic_design
object is created. This sample of N=5,455 matches the total number of participants reported in the paper for the 2013–2014 cycle. - Recode Variables: All variables needed for the analysis (e.g., Age, race, obese, smoking, education) are created and categorized to match the definitions used by Flegal et al.
-
Create Survey Design Object: A
svydesign
object is created to account for survey design (weights, strata, clusters).
Reproducing Table 1
This section reproduces the unweighted sample sizes shown in Flegal et al.’s Table 1. The code first generates separate summary tables for all participants, men, and women. It then performs several formatting steps to combine these into a single table.
What svytable1
Does
The svytable1
function creates a descriptive summary table—commonly referred to as a “Table 1”—from complex survey data. It is specifically designed to produce publication-ready results that align with NCHS Data Presentation Standards for reliability.
Key svytable1
Operations
When you call svytable1
(link), it performs the following steps for each analysis (for example, for all participants, men, and women):
Calculates Proportions
It summarizes categorical variables (likeAge
) by calculating the proportion of participants in each category (e.g., 20–39, 40–59, ≥60).Stratifies by a Grouping Variable
Results are calculated separately by levels of thestrata_var
(for example,race
), creating side-by-side columns.-
Displays Mixed Mode Results
Becausemode = "mixed"
, each cell shows both:- The unweighted sample count (N)
- The weighted percentage (%), which represents the population estimate.
- The unweighted sample count (N)
Performs Reliability Checks
Whenreliability_checks = TRUE
, the function evaluates each estimate against NCHS Data Presentation Standards. These checks prevent publication of unstable or statistically unreliable estimates.
What the Asterisk (*) Means
An asterisk (*
) in these tables output indicate suppression: the estimate was determined to be statistically unreliable. The function hides the unreliable value to avoid misinterpretation.
NCHS reliability rules (for proportions)
Estimates are suppressed if they fail one or more NCHS reliability rules (for proportions):
-
fail_n_30: The unweighted sample size (n) is fewer than 30 participants.
-
fail_eff_n_30: The effective sample size (adjusted for design effects) is less than 30.
-
fail_df_8: The design degrees of freedom (df) are fewer than 8.
-
fail_ciw_30: The absolute confidence interval width is ≥ 30 percentage points.
-
fail_rciw_130: The relative CI width is greater than 130%.
Each of these flags indicates limited precision or instability in the estimate.
In the output, the asterisks appear in the “Other” race column for certain age groups (such as “40–59” and “≥60”).
This happens because the number of participants in those cells is very small, producing unstable or wide confidence intervals. Thus, the function correctly replaces the unreliable estimates with *
, ensuring your published results remain statistically defensible and transparent.
Reliability Metrics Table
In addition to the detailed checks for proportions, the svytable1
function also assesses the reliability of means for numeric variables. For these estimates, it applies the standard NCHS recommendation, which uses the Relative Standard Error (RSE). If a mean’s RSE is 30% or greater, it is considered statistically unreliable and will be suppressed with an asterisk (*) in the formatted table.
The $reliability_metrics
table will be printed with the output if you select return_metrics = TRUE
which will include rows for each mean, reporting the calculated RSE and the outcome of this check in the fail_rse_30
column.
# View reliability_metrics
table1_svy <- svytable1(
design = analytic_design, strata_var = "race",
table_vars = "Age", mode = "mixed",
reliability_checks = TRUE,
return_metrics = TRUE
)
table1_svy$reliability_metrics[table1_svy$reliability_metrics$suppressed == TRUE, ]
Summary tables for each group
# --- Create the summary tables for each group ---
table1_svy_all <- svytable1(
design = analytic_design, strata_var = "race",
table_vars = "Age", mode = "mixed",
reliability_checks = TRUE
) %>%
mutate(Variable = dplyr::recode(Variable, "Age" = "Age Groups"))
kable(table1_svy_all)
Variable | Level | Overall | Non-Hispanic white | Non-Hispanic black | Non-Hispanic Asian | Hispanic | Other |
---|---|---|---|---|---|---|---|
n | 5,455 | 2,343 | 1,115 | 623 | 1,214 | 160 | |
Age Groups | |||||||
20-39 | 1,810 (35.5%) | 734 (30.8%) | 362 (38.9%) | 216 (40.2%) | 412 (49.6%) | 86 (49.2%) | |
40-59 | 1,896 (37.5%) | 759 (37.5%) | 383 (39.3%) | 251 (38.5%) | 449 (35.6%) | * | |
>=60 | 1,749 (27.0%) | 850 (31.7%) | 370 (21.8%) | 156 (21.2%) | 353 (14.8%) | * |
male_design <- subset(analytic_design, gender == "Male")
table1_svy_men <- svytable1(
design = male_design, strata_var = "race",
table_vars = "Age", mode = "mixed",
reliability_checks = TRUE
) %>%
mutate(Variable = dplyr::recode(Variable, "Age" = "Age group"))
kable(table1_svy_men)
Variable | Level | Overall | Non-Hispanic white | Non-Hispanic black | Non-Hispanic Asian | Hispanic | Other |
---|---|---|---|---|---|---|---|
n | 2,638 | 1,130 | 556 | 300 | 573 | 79 | |
Age group | |||||||
20-39 | 909 (37.2%) | 386 (32.3%) | 182 (41.3%) | 106 (41.7%) | 189 (51.6%) | 46 (52.9%) | |
40-59 | 897 (37.6%) | 360 (37.9%) | 179 (39.0%) | 120 (38.5%) | 215 (35.1%) | * | |
>=60 | 832 (25.1%) | 384 (29.8%) | 195 (19.7%) | 74 (19.8%) | 169 (13.2%) | * |
female_design <- subset(analytic_design, gender == "Female")
table1_svy_women <- svytable1(
design = female_design, strata_var = "race",
table_vars = "Age", mode = "mixed",
reliability_checks = TRUE
) %>%
mutate(Variable = dplyr::recode(Variable, "Age" = "Age group"))
kable(table1_svy_women)
Variable | Level | Overall | Non-Hispanic white | Non-Hispanic black | Non-Hispanic Asian | Hispanic | Other |
---|---|---|---|---|---|---|---|
n | 2,817 | 1,213 | 559 | 323 | 641 | 81 | |
Age group | |||||||
20-39 | 901 (33.8%) | 348 (29.4%) | 180 (36.9%) | 110 (39.0%) | 223 (47.6%) | * | |
40-59 | 999 (37.4%) | 399 (37.1%) | 204 (39.5%) | 131 (38.6%) | 234 (36.0%) | * | |
>=60 | 917 (28.8%) | 466 (33.5%) | 175 (23.6%) | 82 (22.5%) | 184 (16.4%) | * |
Format and combine the tables
# --- Format and combine the tables ---
table_all_formatted <- table1_svy_all %>%
select(
`Age Groups,y` = Level, `All Groups` = Overall,
`White` = `Non-Hispanic white`, `Black` = `Non-Hispanic black`,
`Asian` = `Non-Hispanic Asian`, `Hispanic`
) %>%
slice(3:5)
table_men_formatted <- table1_svy_men %>%
select(
`Age Groups,y` = Level, `All Groups` = Overall,
`White` = `Non-Hispanic white`, `Black` = `Non-Hispanic black`,
`Asian` = `Non-Hispanic Asian`, `Hispanic`
) %>%
slice(3:5)
table_women_formatted <- table1_svy_women %>%
select(
`Age Groups,y` = Level, `All Groups` = Overall,
`White` = `Non-Hispanic white`, `Black` = `Non-Hispanic black`,
`Asian` = `Non-Hispanic Asian`, `Hispanic`
) %>%
slice(3:5)
final_table_data <- bind_rows(
table_all_formatted,
table_men_formatted,
table_women_formatted
)
# --- Render the final, publication-quality table ---
final_table_data %>%
kable(caption = "Characteristics by Group and Race/Hispanic Origin", align = "lrrrrr") %>%
kable_styling(bootstrap_options = c("striped", "bordered"), full_width = FALSE) %>%
add_header_above(c(" " = 1, "All Groups" = 1, "Race/Ethnicity" = 4)) %>%
kableExtra::group_rows(index = c("All participants" = 3,
"Men" = 3,
"Women" = 3))
Age Groups,y | All Groups | White | Black | Asian | Hispanic |
---|---|---|---|---|---|
All participants | |||||
20-39 | 1,810 (35.5%) | 734 (30.8%) | 362 (38.9%) | 216 (40.2%) | 412 (49.6%) |
40-59 | 1,896 (37.5%) | 759 (37.5%) | 383 (39.3%) | 251 (38.5%) | 449 (35.6%) |
>=60 | 1,749 (27.0%) | 850 (31.7%) | 370 (21.8%) | 156 (21.2%) | 353 (14.8%) |
Men | |||||
20-39 | 909 (37.2%) | 386 (32.3%) | 182 (41.3%) | 106 (41.7%) | 189 (51.6%) |
40-59 | 897 (37.6%) | 360 (37.9%) | 179 (39.0%) | 120 (38.5%) | 215 (35.1%) |
>=60 | 832 (25.1%) | 384 (29.8%) | 195 (19.7%) | 74 (19.8%) | 169 (13.2%) |
Women | |||||
20-39 | 901 (33.8%) | 348 (29.4%) | 180 (36.9%) | 110 (39.0%) | 223 (47.6%) |
40-59 | 999 (37.4%) | 399 (37.1%) | 204 (39.5%) | 131 (38.6%) | 234 (36.0%) |
>=60 | 917 (28.8%) | 466 (33.5%) | 175 (23.6%) | 82 (22.5%) | 184 (16.4%) |
Reproducing Table 3
This section reproduces the weighted logistic regression models from Flegal et al.’s Table 3. Four separate models are fit: one for obesity and one for class 3 obesity, each stratified by gender. We extend the replication by incorporating NCHS/CDC reliability checks using relative standard error (RSE) for regression coefficients. Per NCHS practices, coefficients with RSE ≥ 30% are considered unreliable and flagged with an asterisk (*).
# --- Helper function to format the output of publish() ---
format_publish_output <- function(publish_object, new_col_name) {
# Extract the clean regression table from the publish object
df <- publish_object$regressionTable
# Combine OR and CI into one string, handling reference groups
df$result <- ifelse(
df$OddsRatio == "Ref",
"1 [Reference]",
# Combine the OR and CI, removing the brackets from the CI string and replacing the semicolon
paste0(df$OddsRatio, " (", str_replace(str_remove_all(df$CI.95, "\\[|\\]"), ";", "-"), ")")
)
# Return a clean tibble with just the variable level and the formatted result
tibble(term = df$Units, !!new_col_name := df$result)
}
# --- Create complete-case survey designs for each gender ---
male_design_complete <- subset(analytic_design, gender == "Male" & !is.na(smoking) & !is.na(education))
female_design_complete <- subset(analytic_design, gender == "Female" & !is.na(smoking) & !is.na(education))
Fit the Models
# --- Fit all four models ---
fit_men_obese <- svyglm(I(obese == "Yes") ~ Age + race + smoking + education, design = male_design_complete, family = binomial())
# print(summary(fit_men_obese))
fit_men_obese3 <- svyglm(I(obese_class3 == "Yes") ~ Age + race + smoking + education, design = male_design_complete, family = binomial())
fit_women_obese <- svyglm(I(obese == "Yes") ~ Age + race + smoking + education, design = female_design_complete, family = binomial())
fit_women_obese3 <- svyglm(I(obese_class3 == "Yes") ~ Age + race + smoking + education, design = female_design_complete, family = binomial())
Reliability check for regression coefficients
The National Center for Health Statistics (NCHS) and the Centers for Disease Control and Prevention (CDC) do not recommend using the relative standard error (RSE) as a primary reliability check for regression coefficients. Their guidelines focus on other types of estimates.
- For means and totals, the RSE is a commonly used metric.
- For proportions, while the RSE was used in the past, the NCHS has transitioned to a more sophisticated, multi-step approach. This newer method considers a minimum number of events or a minimum denominator sample size, and the width of the confidence interval around the proportion. This change was made because the RSE can be misleading for proportions, especially those close to 0% or 100%.
- There are no well-established NCHS or CDC guidelines that advocate for the use of RSE to determine the reliability of regression coefficients. The statistical properties of regression coefficients are more complex than those of means or proportions. The reliability of a regression coefficient is influenced by a variety of factors, including the sample size, the variance of the independent variable, and the overall fit of the model. Given this complexity, a simple RSE threshold is not considered an adequate measure of reliability for regression coefficients. Instead, we assess reliability by examining several key metrics that, taken together, give us a complete picture of a coefficient’s stability and precision. The main tool for this is the
svyglmdiag()
functions insvyTable1
package, which we’ll use to look at:
The Standard Error (SE): A direct measure of the coefficient’s precision. A smaller SE relative to its coefficient suggests a more reliable estimate.
The p-value: Tells you if the coefficient is statistically distinguishable from zero. A non-significant p-value (e.g., p > 0.05) means we cannot be confident the predictor has any association with the outcome.
The Confidence Interval (CI): Provides a plausible range for the true value of the coefficient. A very wide CI indicates a high degree of uncertainty and, therefore, low reliability. For logistic regression, if the CI for the odds ratio contains 1.0, the result is not statistically significant.
We will also calculate the RSE to demonstrate why it can be misleading. Finally, we’ll run a quick check for multicollinearity using the Variance Inflation Factor (VIF), as this is a common cause of unstable (unreliable) coefficients.
# Let's examine one model in detail: fit_men_obese
# The same process applies to the other three models.
# Get the standard model summary and confidence intervals
diagnostics_table <- svyglmdiag(fit_men_obese)
knitr::kable(
diagnostics_table,
caption = "Table 3: Reliability Diagnostics for Obesity Model Coefficients",
digits = 3
)
Term | Estimate | SE | p.value | is_significant | CI_Lower | CI_Upper | CI_Width | RSE_percent | is_rse_high |
---|---|---|---|---|---|---|---|---|---|
(Intercept) | -0.745 | 0.167 | 0.007 | TRUE | -1.175 | -0.315 | 0.860 | 22.451 | FALSE |
Age40-59 | 0.204 | 0.174 | 0.294 | FALSE | -0.243 | 0.651 | 0.895 | 85.311 | TRUE |
Age>=60 | 0.159 | 0.218 | 0.498 | FALSE | -0.401 | 0.719 | 1.119 | 136.779 | TRUE |
raceNon-Hispanic black | 0.279 | 0.136 | 0.096 | FALSE | -0.071 | 0.630 | 0.701 | 48.796 | TRUE |
raceNon-Hispanic Asian | -1.277 | 0.164 | 0.001 | TRUE | -1.697 | -0.856 | 0.841 | 12.811 | FALSE |
raceHispanic | 0.264 | 0.190 | 0.224 | FALSE | -0.225 | 0.752 | 0.977 | 72.060 | TRUE |
raceOther | 0.158 | 0.322 | 0.644 | FALSE | -0.670 | 0.987 | 1.657 | 203.734 | TRUE |
smokingFormer smoker | 0.207 | 0.140 | 0.198 | FALSE | -0.152 | 0.567 | 0.719 | 67.436 | TRUE |
smokingCurrent smoker | -0.250 | 0.149 | 0.153 | FALSE | -0.632 | 0.132 | 0.764 | 59.346 | TRUE |
education<High school | -0.528 | 0.154 | 0.019 | TRUE | -0.924 | -0.132 | 0.792 | 29.164 | FALSE |
education>High school | -0.013 | 0.147 | 0.932 | FALSE | -0.390 | 0.364 | 0.754 | 1109.146 | TRUE |
Generally, the model shows limited reliability and predictive power. Most of the predictor variables, such as Age
and smoking status
, are not statistically significant (their p.value
is high). This indicates that, for men in this dataset, these factors don’t have a clear, reliable association with obesity.
The few significant predictors are raceNon-Hispanic Asian
and education\<High school
. These coefficients are considered stable and reliable. The unreliability of the other terms is not caused by the variables being correlated with each other, as the multicollinearity check shows.
The RSE Can Be Misleading for Regression: Notice that some statistically insignificant coefficients (like Age40-59
and raceHispanic
) have high RSEs, which is expected. However, the education\>High school
coefficient is highly insignificant: p-value of 0.932 correctly tells you that this coefficient is not statistically significant and is not reliably different from zero. However its RSE is flagged as “TRUE” for being unreliable. The RSE is calculated as (0.147 / -0.013) * 100 = 1109%. Here, the extremely high RSE here is not a result of a large standard error, but of the coefficient estimate being very close to zero. An inflated RSE doesn’t provide any new or more accurate information than the p-value; it simply reflects that the coefficient itself is minuscule. This is a great example of why RSE isn’t a primary tool for regression coefficients: it can be inflated by estimates close to zero, regardless of their precision.
Check for Multicollinearity
Multicollinearity occurs when predictor variables in a model are highly correlated with each other. This can inflate the standard errors and make your coefficient estimates unstable. The VIF is used to detect this issue.
GVIF^(1/(2*Df))
is a scaled version of the Generalized Variance Inflation Factor (GVIF) used to assess multicollinearity in regression models with categorical predictors. It adjusts for the number of dummy variables created for a categorical variable, making its value directly comparable to the traditional VIF used for continuous predictors within the same model.
Why GVIF1/(2×Df) is Necessary for Categorical Variables
A categorical variable with (k) levels (e.g., race) is typically represented in a regression model by (k - 1) dummy variables. Dummy variables are inherently correlated because they all describe the same categorical feature. This intrinsic relationship would lead to very high — but misleading — GVIF scores if the overall GVIF were interpreted directly.
The adjustment GVIF1/(2×Df) standardizes the GVIF value. It reduces the measure from a hypervolume of confidence for multiple coefficients down to a linear measure, making it comparable to the single VIF value used for continuous predictors. Here, Df is the degrees of freedom for the categorical term, which equals (k - 1), the number of dummy variables.
Acceptable Ranges and Interpretation
The interpretation of GVIF1/(2×Df) follows the same guidelines as the standard VIF:
GVIF1/(2×Df) Range | Interpretation |
---|---|
1 | No correlation among predictors. |
1 – 2.5 | Low to moderate correlation — generally acceptable (typical for most well-specified models). |
2.5 – 5 | Moderate to high correlation — may warrant further investigation. |
> 5 | Potentially severe multicollinearity. The predictor may have a strong overlap with others, obscuring its effect on the outcome. A more conservative cutoff of 4 is sometimes used. |
The key values in the GVIF\^(1/(2\*Df))
column are all low (below 2.5). This confirms that your predictor variables are independent enough from one another and are not artificially inflating each other’s standard errors. The lack of precision in the model comes from other sources, not from multicollinearity.
Formatting the Table
# --- Use the helper function to format results from each model ---
men_obese_res <- format_publish_output(publish(fit_men_obese),
"Men, Obese, All Grades")
#> Variable Units OddsRatio CI.95 p-value
#> Age 20-39 Ref
#> 40-59 1.23 [0.87;1.72] 0.293928
#> >=60 1.17 [0.77;1.80] 0.497521
#> race Non-Hispanic white Ref
#> Non-Hispanic black 1.32 [1.01;1.73] 0.095724
#> Non-Hispanic Asian 0.28 [0.20;0.38] 0.000553
#> Hispanic 1.30 [0.90;1.89] 0.223882
#> Other 1.17 [0.62;2.20] 0.644324
#> smoking Never smoker Ref
#> Former smoker 1.23 [0.94;1.62] 0.198208
#> Current smoker 0.78 [0.58;1.04] 0.152799
#> education High school Ref
#> <High school 0.59 [0.44;0.80] 0.018658
#> >High school 0.99 [0.74;1.32] 0.931661
men_obese3_res <- format_publish_output(publish(fit_men_obese3),
"Men, Class 3 Obesity")
#> Variable Units OddsRatio CI.95 p-value
#> Age 20-39 Ref
#> 40-59 0.85 [0.54;1.35] 0.52065
#> >=60 0.88 [0.33;2.31] 0.80102
#> race Non-Hispanic white Ref
#> Non-Hispanic black 1.52 [0.85;2.72] 0.21824
#> Non-Hispanic Asian 0.08 [0.01;0.53] 0.04797
#> Hispanic 1.32 [0.57;3.06] 0.54578
#> Other 0.72 [0.28;1.82] 0.51598
#> smoking Never smoker Ref
#> Former smoker 1.08 [0.63;1.85] 0.78547
#> Current smoker 0.92 [0.47;1.79] 0.82227
#> education High school Ref
#> <High school 0.52 [0.27;1.03] 0.11800
#> >High school 0.93 [0.65;1.33] 0.70887
women_obese_res <- format_publish_output(publish(fit_women_obese),
"Women, Obese, All Grades")
#> Variable Units OddsRatio CI.95 p-value
#> Age 20-39 Ref
#> 40-59 1.39 [1.08;1.78] 0.0486143
#> >=60 1.08 [0.86;1.36] 0.5252155
#> race Non-Hispanic white Ref
#> Non-Hispanic black 2.17 [1.88;2.50] 0.0001332
#> Non-Hispanic Asian 0.23 [0.16;0.35] 0.0008139
#> Hispanic 1.30 [0.92;1.84] 0.2016538
#> Other 0.97 [0.57;1.65] 0.9169417
#> smoking Never smoker Ref
#> Former smoker 1.33 [1.00;1.76] 0.1067193
#> Current smoker 0.94 [0.59;1.49] 0.7987621
#> education High school Ref
#> <High school 0.96 [0.61;1.51] 0.8623856
#> >High school 0.68 [0.55;0.86] 0.0215212
women_obese3_res <- format_publish_output(publish(fit_women_obese3),
"Women, Class 3 Obesity")
#> Variable Units OddsRatio CI.95 p-value
#> Age 20-39 Ref
#> 40-59 1.23 [0.80;1.91] 0.38944
#> >=60 0.57 [0.39;0.85] 0.03880
#> race Non-Hispanic white Ref
#> Non-Hispanic black 1.92 [1.39;2.64] 0.01054
#> Non-Hispanic Asian 0.02 [0.00;0.16] 0.01235
#> Hispanic 1.01 [0.56;1.82] 0.98433
#> Other 1.65 [0.89;3.05] 0.17137
#> smoking Never smoker Ref
#> Former smoker 1.57 [1.05;2.35] 0.07838
#> Current smoker 1.02 [0.78;1.33] 0.88571
#> education High school Ref
#> <High school 1.25 [0.62;2.50] 0.55965
#> >High school 0.89 [0.59;1.35] 0.61174
# --- Create a template for the final table structure ---
final_table_structure <- tibble(
Group = c(
"Age, y", "", "",
"Race", "", "", "", "",
"Smoking", "", "",
"Education", "", ""
),
term = c(
"20-39", "40-59", ">=60",
"Non-Hispanic white", "Non-Hispanic black", "Non-Hispanic Asian", "Hispanic", "Other",
"Never smoker", "Former smoker", "Current smoker",
"High school", "<High school", ">High school"
)
)
# --- Join all formatted results onto the template ---
final_table <- final_table_structure %>%
left_join(men_obese_res, by = "term") %>%
left_join(men_obese3_res, by = "term") %>%
left_join(women_obese_res, by = "term") %>%
left_join(women_obese3_res, by = "term")
# --- Display the final, publication-quality table ---
kable(final_table, caption = "Weighted Logistic Regression Models for Obesity",
col.names = c("", "",
"Men, Obese, All Grades",
"Men, Class 3 Obesity",
"Women, Obese, All Grades",
"Women, Class 3 Obesity"))
Men, Obese, All Grades | Men, Class 3 Obesity | Women, Obese, All Grades | Women, Class 3 Obesity | ||
---|---|---|---|---|---|
Age, y | 20-39 | 1 [Reference] | 1 [Reference] | 1 [Reference] | 1 [Reference] |
40-59 | 1.23 (0.87-1.72) | 0.85 (0.54-1.35) | 1.39 (1.08-1.78) | 1.23 (0.80-1.91) | |
>=60 | 1.17 (0.77-1.80) | 0.88 (0.33-2.31) | 1.08 (0.86-1.36) | 0.57 (0.39-0.85) | |
Race | Non-Hispanic white | 1 [Reference] | 1 [Reference] | 1 [Reference] | 1 [Reference] |
Non-Hispanic black | 1.32 (1.01-1.73) | 1.52 (0.85-2.72) | 2.17 (1.88-2.50) | 1.92 (1.39-2.64) | |
Non-Hispanic Asian | 0.28 (0.20-0.38) | 0.08 (0.01-0.53) | 0.23 (0.16-0.35) | 0.02 (0.00-0.16) | |
Hispanic | 1.30 (0.90-1.89) | 1.32 (0.57-3.06) | 1.30 (0.92-1.84) | 1.01 (0.56-1.82) | |
Other | 1.17 (0.62-2.20) | 0.72 (0.28-1.82) | 0.97 (0.57-1.65) | 1.65 (0.89-3.05) | |
Smoking | Never smoker | 1 [Reference] | 1 [Reference] | 1 [Reference] | 1 [Reference] |
Former smoker | 1.23 (0.94-1.62) | 1.08 (0.63-1.85) | 1.33 (1.00-1.76) | 1.57 (1.05-2.35) | |
Current smoker | 0.78 (0.58-1.04) | 0.92 (0.47-1.79) | 0.94 (0.59-1.49) | 1.02 (0.78-1.33) | |
Education | High school | 1 [Reference] | 1 [Reference] | 1 [Reference] | 1 [Reference] |
<High school | 0.59 (0.44-0.80) | 0.52 (0.27-1.03) | 0.96 (0.61-1.51) | 1.25 (0.62-2.50) | |
>High school | 0.99 (0.74-1.32) | 0.93 (0.65-1.33) | 0.68 (0.55-0.86) | 0.89 (0.59-1.35) |
Differences from the Original Paper
While the ‘Statistical Analyses’ section of Flegal et al. (2016) details their models, it does not explicitly state the method used to handle missing data for covariates. Our replication employs a complete-case analysis, which excludes participants with missing smoking or education data from the models. This difference is the most likely reason for the minor discrepancies between our results and those published in the original paper.