34 Merge three cycles – hdPS and its machine learning extensions in residual confounding control

34.1 Analytic dataset

34.1.1 Load 2013-18 datasets

load("data/analytic13recoded.RData")
load("data/analytic15recoded.RData")
load("data/analytic17recoded.RData")

34.1.2 Merge 2013-18 datasets

# adults aged 20 years or more
data.merged0 <- rbind(analytic13, analytic15, analytic17)
dim(data.merged0)
#> [1] 17057    34
data.merged <- droplevels(data.merged0)
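
Stacking with rbind() only works cleanly when the cycle-specific data frames share the same column names. The following is a minimal sketch of a pre-merge check on toy data; the objects cycle.a, cycle.b, and cycle.c are hypothetical stand-ins, not the NHANES analytic files.

```r
# Toy stand-ins for three cycle-specific data frames (hypothetical values)
cycle.a <- data.frame(id = 1:2, sex = factor(c("Male", "Female")))
cycle.b <- data.frame(id = 3:4, sex = factor(c("Female", "Female")))
cycle.c <- data.frame(id = 5:6, sex = factor(c("Male", "Male")))

cycles <- list(cycle.a, cycle.b, cycle.c)

# rbind() on data frames matches columns by name, so confirm the names agree
same.names <- all(sapply(cycles, function(d) identical(names(d), names(cycle.a))))
stopifnot(same.names)

# Stack the cycles and confirm that no rows were lost or duplicated
stacked <- do.call(rbind, cycles)
stopifnot(nrow(stacked) == sum(sapply(cycles, nrow)))
dim(stacked)
#> [1] 6 2
```

The same check could be run on analytic13, analytic15, and analytic17 before the merge, assuming they are already loaded.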

34.1.3 Check missingness

# plot_missing() and profile_missing() come from the DataExplorer package
library(DataExplorer)
plot_missing(data.merged)

# profile_missing(data.merged)
dim(data.merged)
#> [1] 17057    34

The data contain variables with some missing values.

data.complete <- na.omit(data.merged)
dim(data.complete)
#> [1] 6850   34

Only complete cases are retained (6,850 of 17,057 participants), and survey features and weights are ignored for simplicity.
In a realistic analysis, we would examine the missingness pattern before deleting or imputing incomplete records.
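
One way to look at the missingness pattern before applying na.omit() is sketched below in base R. The data frame toy is a made-up example, not the NHANES data, and the pattern coding (1 = missing, 0 = observed) is an illustrative choice.

```r
# Toy data with missing values (hypothetical; not the NHANES data)
toy <- data.frame(
  age    = c(25, 40, NA, 61, 33),
  income = c(NA, "low", "high", NA, "low"),
  sleep  = c(7, 8, 6, 7, 7)
)

# Proportion of missing values in each variable
miss.prop <- colMeans(is.na(toy))
miss.prop

# Number of rows that na.omit() would keep
n.complete <- sum(complete.cases(toy))
n.complete

# Distinct missingness patterns across rows (1 = missing, 0 = observed),
# with one digit per column in the order age, income, sleep
pattern <- apply(is.na(toy), 1, function(r) paste(as.integer(r), collapse = ""))
table(pattern)
```

Variables with a high proportion of missing values, or patterns that cluster in one cycle, would be candidates for closer inspection or imputation rather than listwise deletion.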

34.2 Summary statistics

	No (N=4291)	Yes (N=2559)	Overall (N=6850)
age.cat
20-49	2208 (51.5%)	1227 (47.9%)	3435 (50.1%)
50-64	1085 (25.3%)	767 (30.0%)	1852 (27.0%)
65+	998 (23.3%)	565 (22.1%)	1563 (22.8%)
sex
Male	2086 (48.6%)	1106 (43.2%)	3192 (46.6%)
Female	2205 (51.4%)	1453 (56.8%)	3658 (53.4%)
education
Less than high school	597 (13.9%)	419 (16.4%)	1016 (14.8%)
High school	1809 (42.2%)	1375 (53.7%)	3184 (46.5%)
College graduate or above	1885 (43.9%)	765 (29.9%)	2650 (38.7%)
race
White	1496 (34.9%)	932 (36.4%)	2428 (35.4%)
Black	583 (13.6%)	581 (22.7%)	1164 (17.0%)
Hispanic	955 (22.3%)	763 (29.8%)	1718 (25.1%)
Others	1257 (29.3%)	283 (11.1%)	1540 (22.5%)
marital
Never married	757 (17.6%)	408 (15.9%)	1165 (17.0%)
Married/with partner	2756 (64.2%)	1533 (59.9%)	4289 (62.6%)
Other	778 (18.1%)	618 (24.2%)	1396 (20.4%)
income
less than $20,000	668 (15.6%)	443 (17.3%)	1111 (16.2%)
$20,000 to $74,999	1955 (45.6%)	1353 (52.9%)	3308 (48.3%)
$75,000 and Over	1668 (38.9%)	763 (29.8%)	2431 (35.5%)
born
Born in US	2269 (52.9%)	1745 (68.2%)	4014 (58.6%)
Other place	2022 (47.1%)	814 (31.8%)	2836 (41.4%)
year
NHANES 2013-2014 public release	1976 (46.0%)	1100 (43.0%)	3076 (44.9%)
NHANES 2015-2016 public release	740 (17.2%)	337 (13.2%)	1077 (15.7%)
NHANES 2017-2018 public release	1575 (36.7%)	1122 (43.8%)	2697 (39.4%)
diabetes.family.history
No	3656 (85.2%)	1971 (77.0%)	5627 (82.1%)
Yes	635 (14.8%)	588 (23.0%)	1223 (17.9%)
smoking
Never smoker	2760 (64.3%)	1591 (62.2%)	4351 (63.5%)
Previous smoker	917 (21.4%)	636 (24.9%)	1553 (22.7%)
Current smoker	614 (14.3%)	332 (13.0%)	946 (13.8%)
diet.healthy
Poor or fair	876 (20.4%)	1006 (39.3%)	1882 (27.5%)
Good	1747 (40.7%)	1039 (40.6%)	2786 (40.7%)
Very good or excellent	1668 (38.9%)	514 (20.1%)	2182 (31.9%)
physical.activity
No	3590 (83.7%)	2007 (78.4%)	5597 (81.7%)
Yes	701 (16.3%)	552 (21.6%)	1253 (18.3%)
medical.access
No	767 (17.9%)	319 (12.5%)	1086 (15.9%)
Yes	3524 (82.1%)	2240 (87.5%)	5764 (84.1%)
sleep
Mean (SD)	7.32 (1.42)	7.21 (1.54)	7.28 (1.47)
Median [Min, Max]	7.00 [2.00, 14.0]	7.00 [2.00, 14.0]	7.00 [2.00, 14.0]
systolicBP
Mean (SD)	122 (18.2)	127 (17.4)	124 (18.1)
Median [Min, Max]	118 [64.7, 229]	125 [74.0, 212]	121 [64.7, 229]
diastolicBP
Mean (SD)	70.2 (11.1)	72.8 (11.5)	71.2 (11.3)
Median [Min, Max]	70.7 [12.0, 123]	72.7 [26.0, 124]	71.3 [12.0, 124]
uric.acid
Mean (SD)	5.19 (1.36)	5.74 (1.48)	5.39 (1.43)
Median [Min, Max]	5.10 [1.10, 12.3]	5.60 [2.10, 13.3]	5.30 [1.10, 13.3]
protein.total
Mean (SD)	7.14 (0.454)	7.10 (0.443)	7.12 (0.450)
Median [Min, Max]	7.10 [4.70, 10.2]	7.10 [5.40, 9.10]	7.10 [4.70, 10.2]
bilirubin.total
Mean (SD)	0.594 (0.307)	0.513 (0.304)	0.564 (0.308)
Median [Min, Max]	0.500 [0, 3.30]	0.500 [0, 7.10]	0.500 [0, 7.10]
phosphorus
Mean (SD)	3.73 (0.545)	3.66 (0.575)	3.70 (0.557)
Median [Min, Max]	3.70 [2.00, 6.10]	3.60 [1.80, 8.90]	3.70 [1.80, 8.90]
sodium
Mean (SD)	140 (2.45)	140 (2.58)	140 (2.50)
Median [Min, Max]	140 [124, 150]	140 [121, 154]	140 [121, 154]
potassium
Mean (SD)	4.01 (0.358)	4.04 (0.363)	4.02 (0.360)
Median [Min, Max]	4.00 [2.80, 6.00]	4.00 [2.80, 6.60]	4.00 [2.80, 6.60]
globulin
Mean (SD)	2.88 (0.438)	3.02 (0.450)	2.93 (0.448)
Median [Min, Max]	2.80 [1.60, 6.50]	3.00 [1.40, 5.20]	2.90 [1.40, 6.50]
calcium.total
Mean (SD)	9.39 (0.364)	9.32 (0.381)	9.36 (0.371)
Median [Min, Max]	9.40 [6.40, 14.8]	9.30 [6.60, 12.0]	9.40 [6.40, 14.8]
high.cholesterol
No	2833 (66.0%)	1504 (58.8%)	4337 (63.3%)
Yes	1458 (34.0%)	1055 (41.2%)	2513 (36.7%)

Investigator-specified covariates stratified by the exposure (obesity).
This table includes participants with and without ICD-10-CM proxy information. Therefore, the sample is larger than in the original analysis.

34.3 Proxy data from ICD-10 codes

dat.proxy.long <- rbind(rx2013, rx2015, rx2017) 
# Drop the original icd10 column
dat.proxy.long$icd10 <- NULL
# Rename the 3-digit ICD-10 codes (icd10.new) as icd10
names(dat.proxy.long)[names(dat.proxy.long) == "icd10.new"] <- "icd10"

We combine the ICD-10-CM information from all three cycles.
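
The drop-and-rename step can be illustrated on a small example. This is a sketch on toy data; the IDs and codes in toy.long are hypothetical, and only the column names icd10 and icd10.new are taken from the code above.

```r
# Toy long-format proxy data (hypothetical IDs and codes)
toy.long <- data.frame(
  id        = c(1, 1, 2),
  icd10     = c("E11.9", "I10", "E11.65"),
  icd10.new = c("E11", "I10", "E11"),
  stringsAsFactors = FALSE
)

# Drop the full-length code, then rename the 3-digit column to icd10
toy.long$icd10 <- NULL
names(toy.long)[names(toy.long) == "icd10.new"] <- "icd10"

# The data now carry a single code column holding the shortened codes
names(toy.long)
toy.long$icd10
```

Dropping the original column first avoids ending up with two columns named icd10 after the rename.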

34.4 Save dataset for later use

save(data.merged, 
     data.complete, 
     dat.proxy.long, 
     file = "data/analytic3cycles.RData")
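
A saved .RData file can be checked by loading it into a separate environment and listing what it contains. The sketch below uses toy objects and a temporary file; the names toy.merged, toy.complete, and check.env are hypothetical and are not part of the analysis.

```r
# Toy objects standing in for the saved datasets (hypothetical)
toy.merged   <- data.frame(id = 1:3, x = c(1, NA, 3))
toy.complete <- na.omit(toy.merged)

# Save to a temporary file rather than the project's data/ folder
tmp.file <- tempfile(fileext = ".RData")
save(toy.merged, toy.complete, file = tmp.file)

# Load into a separate environment so existing objects are not overwritten;
# load() returns the names of the restored objects
check.env <- new.env()
loaded <- load(tmp.file, envir = check.env)
loaded

# Confirm that the expected objects came back unchanged
stopifnot(setequal(loaded, c("toy.merged", "toy.complete")))
stopifnot(isTRUE(all.equal(check.env$toy.complete, toy.complete)))
```

Loading into a dedicated environment is a design choice that keeps the check from silently replacing objects of the same name in the current session.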