NHANES: Blood Pressure

The tutorial aims to guide the user through the process of analyzing health survey data, focusing specifically on the relationship between various demographic factors and blood pressure levels.

Required packages are imported for data manipulation and statistical analysis.

# Load required packages
library(survey)

Load data

NHANES survey data is loaded into the workspace.

load(file = "Data/surveydata/NHANESsample.Rdata")
ls()
#> [1] "analytic.data"

Regression summary

Bivariate Regression

Four separate models are constructed to explore the relationship between blood pressure and individual demographic variables like race, age, gender, and marital status.

summary(fit1r <- glm(blood.pressure ~ race, data=analytic.data))
#> 
#> Call:
#> glm(formula = blood.pressure ~ race, data = analytic.data)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -66.501   -8.437    0.217    8.400   53.563  
#> 
#> Coefficients:
#>                        Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)             65.4408     0.3566 183.528  < 2e-16 ***
#> raceNon-Hispanic Black   2.1594     0.4891   4.415 1.02e-05 ***
#> raceNon-Hispanic White   0.9957     0.4498   2.214   0.0269 *  
#> raceOther Hispanic       0.3422     0.5553   0.616   0.5378    
#> raceOther race           3.0601     0.5326   5.746 9.54e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 168.7193)
#> 
#>     Null deviance: 1202625  on 7086  degrees of freedom
#> Residual deviance: 1194870  on 7082  degrees of freedom
#>   (2884 observations deleted due to missingness)
#> AIC: 56463
#> 
#> Number of Fisher Scoring iterations: 2
summary(fit1a <- glm(blood.pressure ~ age.centred, data=analytic.data))
#> 
#> Call:
#> glm(formula = blood.pressure ~ age.centred, data = analytic.data)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -59.275   -8.046    0.366    8.083   54.057  
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 68.297917   0.158325   431.4   <2e-16 ***
#> age.centred  0.179501   0.006623    27.1   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 153.7967)
#> 
#>     Null deviance: 1202625  on 7086  degrees of freedom
#> Residual deviance: 1089649  on 7085  degrees of freedom
#>   (2884 observations deleted due to missingness)
#> AIC: 55804
#> 
#> Number of Fisher Scoring iterations: 2
summary(fit1g <- glm(blood.pressure ~ gender, data=analytic.data))
#> 
#> Call:
#> glm(formula = blood.pressure ~ gender, data = analytic.data)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -63.998   -7.998    0.002    8.514   54.002  
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   67.4863     0.2210 305.409  < 2e-16 ***
#> genderFemale  -1.4885     0.3091  -4.816  1.5e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 169.1886)
#> 
#>     Null deviance: 1202625  on 7086  degrees of freedom
#> Residual deviance: 1198702  on 7085  degrees of freedom
#>   (2884 observations deleted due to missingness)
#> AIC: 56480
#> 
#> Number of Fisher Scoring iterations: 2
summary(fit1m <- glm(blood.pressure ~ marital, data=analytic.data))
#> 
#> Call:
#> glm(formula = blood.pressure ~ marital, data = analytic.data)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -55.664   -7.625   -0.593    7.407   50.336  
#> 
#> Coefficients:
#>                           Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                70.5934     0.2129 331.648   <2e-16 ***
#> maritalNever married       -0.9293     0.4430  -2.098   0.0360 *  
#> maritalPreviously married  -0.9689     0.4200  -2.307   0.0211 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 141.0891)
#> 
#>     Null deviance: 723763  on 5124  degrees of freedom
#> Residual deviance: 722658  on 5122  degrees of freedom
#>   (4846 observations deleted due to missingness)
#> AIC: 39915
#> 
#> Number of Fisher Scoring iterations: 2

Multivariate Regression

A more comprehensive model is built, combining all the aforementioned demographic factors to predict blood pressure.

summary(fit1 <- glm(blood.pressure ~ race + age.centred + gender + marital, data=analytic.data))
#> 
#> Call:
#> glm(formula = blood.pressure ~ race + age.centred + gender + 
#>     marital, data = analytic.data)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -58.794   -7.601   -0.159    7.409   52.141  
#> 
#> Coefficients:
#>                            Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)               70.879238   0.436140 162.515  < 2e-16 ***
#> raceNon-Hispanic Black     2.632298   0.539123   4.883 1.08e-06 ***
#> raceNon-Hispanic White     0.432023   0.486314   0.888 0.374389    
#> raceOther Hispanic         0.042382   0.596004   0.071 0.943313    
#> raceOther race             2.854707   0.576021   4.956 7.43e-07 ***
#> age.centred               -0.039408   0.010467  -3.765 0.000168 ***
#> genderFemale              -2.728473   0.332506  -8.206 2.87e-16 ***
#> maritalNever married      -1.786003   0.468287  -3.814 0.000138 ***
#> maritalPreviously married  0.001345   0.439870   0.003 0.997560    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 137.6355)
#> 
#>     Null deviance: 723763  on 5124  degrees of freedom
#> Residual deviance: 704143  on 5116  degrees of freedom
#>   (4846 observations deleted due to missingness)
#> AIC: 39794
#> 
#> Number of Fisher Scoring iterations: 2

Including survey features

The tutorial takes into account the complex survey design by creating a new survey design object (based on survey features such as, sampling weight, strata and cluster).

require(survey)
analytic.design <- svydesign(strata=~SDMVSTRA, id=~SDMVPSU, weights=~WTMEC2YR, data=analytic.data, nest=TRUE)

Box plots and summary statistics are generated to visualize how blood pressure varies across different demographic groups, considering the survey weights and design.

svyboxplot(blood.pressure~race, design = analytic.design, col="grey", 
           ylab="Blood Pressure", 
           xlab ="Race")

svyby(~blood.pressure, by=~race, design = analytic.design, na.rm=TRUE, ci=TRUE, svymean)
svyboxplot(blood.pressure~gender, design = analytic.design, col="grey", 
           ylab="Blood Pressure", 
           xlab ="Gender")

svyby (~blood.pressure, by=~gender, design = analytic.design, na.rm=TRUE, ci=TRUE, svymean)
svyboxplot(blood.pressure~marital, design = analytic.design, col="grey", 
           ylab="Blood Pressure", 
           xlab ="Marital Status")

svyby (~blood.pressure, by=~marital, design = analytic.design, na.rm=TRUE, ci=TRUE, svymean)

Bivariate Regression

New regression models are constructed using the survey design object to understand how each demographic variable impacts blood pressure while taking survey features into consideration.

require(survey)
summary(fit1r <- svyglm(blood.pressure ~ race, design=analytic.design))
#> 
#> Call:
#> svyglm(formula = blood.pressure ~ race, design = analytic.design)
#> 
#> Survey design:
#> svydesign(strata = ~SDMVSTRA, id = ~SDMVPSU, weights = ~WTMEC2YR, 
#>     data = analytic.data, nest = TRUE)
#> 
#> Coefficients:
#>                        Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)             66.6150     0.2756 241.671  < 2e-16 ***
#> raceNon-Hispanic Black   2.0491     0.6144   3.335 0.006650 ** 
#> raceNon-Hispanic White   1.9182     0.4269   4.493 0.000912 ***
#> raceOther Hispanic       0.2021     0.7825   0.258 0.800928    
#> raceOther race           2.6091     0.3998   6.527 4.27e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 179.8689)
#> 
#> Number of Fisher Scoring iterations: 2
summary(fit1a <- svyglm(blood.pressure ~ age.centred, design=analytic.design))
#> 
#> Call:
#> svyglm(formula = blood.pressure ~ age.centred, design = analytic.design)
#> 
#> Survey design:
#> svydesign(strata = ~SDMVSTRA, id = ~SDMVPSU, weights = ~WTMEC2YR, 
#>     data = analytic.data, nest = TRUE)
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 69.25489    0.47091  147.07  < 2e-16 ***
#> age.centred  0.15142    0.01503   10.08  8.5e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 169.12)
#> 
#> Number of Fisher Scoring iterations: 2
summary(fit1g <- svyglm(blood.pressure ~ gender, design=analytic.design))
#> 
#> Call:
#> svyglm(formula = blood.pressure ~ gender, design = analytic.design)
#> 
#> Survey design:
#> svydesign(strata = ~SDMVSTRA, id = ~SDMVPSU, weights = ~WTMEC2YR, 
#>     data = analytic.data, nest = TRUE)
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   69.2898     0.4552 152.208  < 2e-16 ***
#> genderFemale  -1.9096     0.4233  -4.511 0.000488 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 179.4491)
#> 
#> Number of Fisher Scoring iterations: 2
summary(fit1m <- svyglm(blood.pressure ~ marital, design=analytic.design))
#> 
#> Call:
#> svyglm(formula = blood.pressure ~ marital, design = analytic.design)
#> 
#> Survey design:
#> svydesign(strata = ~SDMVSTRA, id = ~SDMVPSU, weights = ~WTMEC2YR, 
#>     data = analytic.data, nest = TRUE)
#> 
#> Coefficients:
#>                           Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                70.8262     0.4978 142.265   <2e-16 ***
#> maritalNever married       -1.1963     0.5482  -2.182   0.0481 *  
#> maritalPreviously married  -0.5038     0.4258  -1.183   0.2579    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 175.0038)
#> 
#> Number of Fisher Scoring iterations: 2

Multivariate Regression

A final, more complex model is fitted, which includes multiple demographic variables and also considers the survey design.

analytic.design2 <- svydesign(strata=~SDMVSTRA, id=~SDMVPSU, weights=~WTMEC2YR, data=analytic.data, nest=TRUE)
summary(fit1 <- svyglm(blood.pressure ~ race + age.centred + gender + marital, design=analytic.design2))
#> 
#> Call:
#> svyglm(formula = blood.pressure ~ race + age.centred + gender + 
#>     marital, design = analytic.design2)
#> 
#> Survey design:
#> svydesign(strata = ~SDMVSTRA, id = ~SDMVPSU, weights = ~WTMEC2YR, 
#>     data = analytic.data, nest = TRUE)
#> 
#> Coefficients:
#>                           Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)               71.14686    0.37976 187.348 3.26e-14 ***
#> raceNon-Hispanic Black     2.30248    0.84786   2.716 0.029953 *  
#> raceNon-Hispanic White     1.08919    0.35921   3.032 0.019055 *  
#> raceOther Hispanic         0.07339    1.20016   0.061 0.952947    
#> raceOther race             2.11590    0.45579   4.642 0.002363 ** 
#> age.centred               -0.02223    0.01566  -1.419 0.198767    
#> genderFemale              -2.91901    0.43979  -6.637 0.000294 ***
#> maritalNever married      -1.74002    0.48086  -3.619 0.008526 ** 
#> maritalPreviously married  0.22443    0.45139   0.497 0.634278    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 171.5455)
#> 
#> Number of Fisher Scoring iterations: 2

A note about Predictive models

In statistical analyses involving survey data, it’s crucial to account for the survey’s design features. These features can include sampling weights, stratification, and clustering, among others. Ignoring these could lead to biased estimates and incorrect conclusions. Considering such survey design features make the analysis more robust and reliable in terms of inference.

However, when the goal shifts from inference to prediction, additional challenges come into play. Specifically, the model may perform well on the data used to fit it (the “training” data) but not generalize well to new, unseen data. This discrepancy between training performance and generalization to new data is often referred to as “overfitting,” and the optimism of the model refers to the extent to which it overestimates its predictive performance on new data based on its performance on the training data.

Optimism-correction techniques are methodologies designed to address this issue. They allow you to evaluate how well your model is likely to perform on new data, not just the data you used to build it. Methods for optimism correction often involve techniques like cross-validation, bootstrapping, or specialized types of model validation that help in estimating the ‘true’ predictive performance of the model. Some of these techniques were discussed in the predictive modelling chapter.

Video content (optional)

Tip

For those who prefer a video walkthrough, feel free to watch the video below, which offers a description of an earlier version of the above content.