4 Step 1: Proxy sources

Load saved data

We saved analytic3cycles.RData in the Appendix. The file contains the following datasets:

data.merged = merged data from 3 cycles,
data.complete = complete case merged data,
dat.proxy.long = only proxy information

load("data/analytic3cycles.RData")
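As a quick sanity check, one can load the file into a separate environment and confirm that the expected objects were restored. The sketch below is self-contained: the toy data frames and the temporary file are illustrative stand-ins, not the real NHANES data.

```r
# Toy stand-ins for the three saved datasets (illustrative values only)
data.merged    <- data.frame(id = 1:3)
data.complete  <- data.frame(id = 1:2)
dat.proxy.long <- data.frame(id = c(1, 1, 2),
                             icd10 = c("I10", "F32", "K21"))

# Save to a temporary file, mimicking analytic3cycles.RData
tmp <- tempfile(fileext = ".RData")
save(data.merged, data.complete, dat.proxy.long, file = tmp)

# Load into a fresh environment so existing objects are not overwritten;
# load() returns the names of the restored objects
env <- new.env()
loaded <- load(tmp, envir = env)

# Confirm that all expected objects are present
expected <- c("data.merged", "data.complete", "dat.proxy.long")
all.present <- all(expected %in% loaded)
print(all.present)
```

With the real file, replacing tmp by "data/analytic3cycles.RData" should give the same kind of check.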

4.1 Data with investigator-specified variables

Data: part 1

We will use data.complete for the investigator-specified information.

analytic <- data.complete
idx <- analytic$id
outcome <- as.numeric(analytic$diabetes == "Yes") 
exposure <- as.numeric(analytic$obese == "Yes")
domain <- "dx"
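# Note: cbind() with the character value in `domain` coerces every
# column to character (factor in R < 4.0) before the conversion
analytic.dfx <- as.data.frame(cbind(idx, exposure, outcome, domain))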

We prepare a minimal analytic dataset with only the following four pieces of information:

identifying information (idx)
exposure (obese)
outcome (diabetes)
domain of the codes (dx). In this example we have only one domain, the prescription domain, labelled dx
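The four columns listed above can be sketched on toy data as follows. This is a minimal illustration, not the tutorial's own code: toy.complete and its values are made up, and data.frame() is used here instead of cbind() so that each column keeps its type.

```r
# Toy stand-in for data.complete (values are illustrative only)
toy.complete <- data.frame(
  id       = c(100001, 100002, 100003),
  diabetes = c("Yes", "No", "No"),
  obese    = c("Yes", "Yes", "No")
)

# data.frame() keeps each column's type, whereas cbind() with a
# character value would coerce every column to character
toy.dfx <- data.frame(
  idx      = toy.complete$id,
  exposure = as.numeric(toy.complete$obese == "Yes"),
  outcome  = as.numeric(toy.complete$diabetes == "Yes"),
  domain   = "dx"
)

# Inspect the column types and the exposure-outcome cross-tabulation
str(toy.dfx)
table(exposure = toy.dfx$exposure, outcome = toy.dfx$outcome)
```

Keeping exposure and outcome numeric avoids a later conversion step if downstream code expects 0/1 indicators.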

4.2 Proxy data

4.2.1 Identify the data dimensions (proxy sources)

In this example we have only one data dimension: the prescription domain (dx), recorded as ICD-10-CM codes. Hence \(p = 1\) in this exercise.

The NHANES questionnaire collects information on: (a) dietary supplements, (b) nonprescription antacids, (c) prescription medications, and (d) preventive aspirin use.

4.2.2 Define a covariate assessment period (CAP)

(Connolly et al. 2019; Schneeweiss et al. 2009)

We collect proxy information only from a well-defined CAP. In our case, the CAP was \(30\) days.

NHANES asked “In the past 30 days, have you used or taken medication for which a prescription is needed? Do not include prescription vitamins or minerals you may have already told me about.”

Data: part 2

We will work with the merged proxy data (ICD-10 codes) from 3 cycles: dat.proxy.long.

4.2.3 Omit duplicated information

We need to delete codes that could be close proxies of the exposure and/or outcome, or of other investigator-specified covariates that we already selected in Step 0.

dat.proxy.long <- subset(dat.proxy.long, 
                         icd10 != "E66") # Overweight and obesity
dat.proxy.long <- subset(dat.proxy.long, 
                         icd10 != "O24") # Gestational diabetes mellitus
dat.proxy.long <- subset(dat.proxy.long, 
                         icd10 != "E10") # Type 1 diabetes mellitus
dat.proxy.long <- subset(dat.proxy.long, 
                         icd10 != "E11") # Type 2 diabetes mellitus

We delete the codes associated with the exposure and the outcome.
The same should be done for any other proxies that duplicate information already captured by the investigator-specified covariates.
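When several codes have to be dropped, the same exclusion can be written against a single vector of codes. The sketch below runs on made-up data; toy.proxy and omit.codes are illustrative names that do not appear in the tutorial.

```r
# Toy long-format proxy data (illustrative values only)
toy.proxy <- data.frame(
  idx   = c(1, 1, 1, 2, 2, 3),
  icd10 = c("E11", "I10", "F32", "E66", "K21", "I10")
)

# Codes that duplicate the exposure (obesity) or outcome (diabetes)
omit.codes <- c("E66", "O24", "E10", "E11")

# Keep only rows whose code is not in the exclusion list
toy.kept <- subset(toy.proxy, !(icd10 %in% omit.codes))

# Check that no excluded code remains and count the removed rows
any.left  <- any(toy.kept$icd10 %in% omit.codes)
n.removed <- nrow(toy.proxy) - nrow(toy.kept)
print(any.left)
print(n.removed)
```

One difference to keep in mind: a row with a missing code is kept by the %in% form, whereas subset() with != drops it, because the comparison evaluates to NA.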

4.2.4 Long format proxy data

Here is an example of the 3-digit codes for one patient, with subject ID “100001”. We create the same for all patients.

ID	ICD-10 code (3-digit)	Description
100001	F33	Major depressive disorder, recurrent
100001	I10	Hypertension
100001	M62	Muscle spasm
100001	F32	Major depressive disorder, single episode
100001	M25	Joint disorder/pain
100001	K21	Gastro-esophageal reflux disease
100001	M79	Musculoskeletal pain conditions
100001	R12	Heartburn
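The table above shows codes at the 3-character level. The text does not say how that level was reached, so the sketch below is only an assumption: it supposes that longer ICD-10-CM codes (for example "F33.9") are available and truncates them with substr(). The data frame toy.codes and its values are hypothetical.

```r
# Hypothetical long-format data with full-length ICD-10-CM codes
toy.codes <- data.frame(
  idx   = c(100001, 100001, 100001, 100002),
  icd10 = c("F33.9", "F33.1", "I10", "K21.9")
)

# Keep the first three characters (the ICD-10 category)
toy.codes$icd10 <- substr(toy.codes$icd10, 1, 3)

# One row per recorded code; repeated codes are kept because the
# number of occurrences per subject may matter in later steps
print(toy.codes)

# Occurrences of each 3-character code per subject
code.counts <- table(toy.codes$idx, toy.codes$icd10)
print(code.counts)
```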

4.3 Merge Proxy data with Analytic data

Merged Data: parts 1 and 2

We now merge the proxy data with the analytic data.
This gives us the IDs (idx) of the subjects who have proxy (ICD-10) information associated with them.

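library(dplyr)
# proxy.var.long is assumed to be the long-format proxy data prepared
# above, with the subject ID stored in a column named idx
dfx <- merge(analytic.dfx, proxy.var.long, by = "idx")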
head(dfx)

basetable <- dfx %>% select(idx, exposure, outcome) %>% distinct()
patientIds <- basetable$idx
length(patientIds)
#> [1] 7585
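To see why the merge yields only subjects with proxy information, here is a small base-R sketch on made-up data (the toy.* objects are illustrative and unrelated to the NHANES counts above). By default merge() performs an inner join, so subjects without any proxy code drop out.

```r
# Toy analytic data: four subjects (illustrative values only)
toy.dfx <- data.frame(
  idx      = c(1, 2, 3, 4),
  exposure = c(1, 0, 1, 0),
  outcome  = c(0, 0, 1, 1),
  domain   = "dx"
)

# Toy proxy data: subject 3 has no recorded codes
toy.proxy <- data.frame(
  idx   = c(1, 1, 2, 4, 4, 4),
  icd10 = c("I10", "F32", "K21", "I10", "M79", "R12")
)

# Inner join: one row per subject-code pair, matched on idx
toy.merged <- merge(toy.dfx, toy.proxy, by = "idx")

# One row per subject, equivalent to select() followed by distinct()
toy.base <- unique(toy.merged[, c("idx", "exposure", "outcome")])
toy.ids  <- toy.base$idx

# Number of subjects retained, and those lost for lack of proxy data
n.kept  <- length(toy.ids)
dropped <- setdiff(toy.dfx$idx, toy.ids)
print(n.kept)
print(dropped)
```

Comparing the number of retained subjects with the number of rows in the analytic data shows how many subjects have no proxy information at all.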