5 Step 2: Empirical – hdPS and its machine learning extensions in residual confounding control

5.1 Sort by prevalence

Check the frequency of each code:

library(dplyr)
# Tabulate the codes once, sorted from most to least frequent
freq <- sort(table(dfx$icd10), decreasing = TRUE)
df <- data.frame(
  icd10 = names(freq),
  count = as.numeric(freq)  # plain numeric column instead of a nested table
)

ICD10 Code Frequencies
ICD10 Code	Count
I10	2775
E78	1517
F32	536
F41	524
K21	441
M79	401
E03	397
M54	314
G47	307
J45	301

Only the 10 most prevalent codes are shown.

However, some codes have much lower counts (e.g., fewer than 20).

Restrictions

The list of candidate empirical covariates is constrained by the following:

1. Prevalence of the codes: only the top n covariates with the highest prevalence are chosen.
2. Zero variance: analysts need to remove codes that have zero variance (e.g., everyone has the code, or nobody has it).
3. Very low prevalence: codes with very low prevalence are also numerically problematic for further analyses.

We choose n = 200 [for (1)], as proposed in the original algorithm (Schneeweiss et al. 2009). In practice, it is not necessary to be so restrictive (Schuster, Pang, and Platt 2015). Problems (2) and (3) are more likely to arise, and they are addressed by the following restriction: at least min_num_patients patients must have a code for it to be selected in the list.
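To make the two restrictions concrete, here is a minimal base R sketch on toy data. It is an illustration only, not the internal implementation of any package: the data frame `toy`, the object names, and the small thresholds (standing in for n = 200 and min_num_patients = 20) are all assumptions made for this example.

```r
# Toy long-format data: one row per patient-code record (hypothetical values)
toy <- data.frame(
  idx   = c(1, 1, 2, 2, 3, 3, 4, 5, 5),
  icd10 = c("I10", "E78", "I10", "F32", "I10", "E78", "K21", "I10", "I10"),
  stringsAsFactors = FALSE
)

n_keep           <- 2  # stands in for n = 200
min_num_patients <- 2  # stands in for min_num_patients = 20

# Count distinct patients per code (a code repeated within a patient counts once)
pairs      <- unique(toy[, c("idx", "icd10")])
n_patients <- table(pairs$icd10)

# Drop codes recorded for fewer than min_num_patients patients
eligible <- n_patients[n_patients >= min_num_patients]

# Keep the n most prevalent of the remaining codes
ranked    <- names(sort(eligible, decreasing = TRUE))
top_codes <- ranked[seq_len(min(n_keep, length(ranked)))]
top_codes
```

Here F32 and K21 are dropped because only one patient has each of them, and the remaining codes are ranked by the number of patients who have them.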

If there were more data dimensions, a separate list of candidate empirical covariates would be identified for each dimension.
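As a rough sketch of what "a separate list per dimension" could look like, the example below splits toy data by a domain column and applies the same prevalence filter within each domain. The domain labels (`dx`, `rx`), the codes, and the helper `top_codes_one_domain()` are hypothetical and exist only for this illustration.

```r
# Toy data with two hypothetical dimensions: diagnoses (dx) and drugs (rx)
toy <- data.frame(
  idx    = c(1, 2, 3, 1, 2, 3, 4),
  domain = c("dx", "dx", "dx", "rx", "rx", "rx", "rx"),
  code   = c("I10", "I10", "E78", "A10", "A10", "C09", "A10"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: top-n codes within one domain, subject to a minimum
# number of distinct patients per code
top_codes_one_domain <- function(d, n, min_num_patients) {
  pairs      <- unique(d[, c("idx", "code")])
  n_patients <- table(pairs$code)
  eligible   <- n_patients[n_patients >= min_num_patients]
  ranked     <- names(sort(eligible, decreasing = TRUE))
  ranked[seq_len(min(n, length(ranked)))]
}

# One candidate list per domain
candidates <- lapply(split(toy, toy$domain), top_codes_one_domain,
                     n = 1, min_num_patients = 2)
candidates
```

Ranking within each domain, rather than across the pooled data, keeps a very dense dimension from crowding out the codes of a sparser one.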

5.2 Choose Granularity

One important point here is that we have chosen the granularity to be the first 3 characters of the ICD-10 code (e.g., I10).

We have already truncated the codes at this 3-digit level while preparing the data.
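The truncation itself is simple. The sketch below shows one way it could be done in base R; the raw codes are made up for illustration and this is not necessarily how the data for this chapter were prepared.

```r
# Hypothetical raw ICD-10 codes recorded at mixed levels of detail
raw_codes <- c("I10", "E78.5", "F32.1", " K21.9", "e03.9")

# Trim whitespace, normalise case, then keep the first 3 characters
icd10_3 <- substr(toupper(trimws(raw_codes)), 1, 3)
icd10_3
```

Coarser codes pool related diagnoses and so raise the prevalence of each candidate covariate, at the cost of some clinical detail.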

5.3 Retain top n empirical covariates

library(autoCovariateSelection)
# Keep the n most prevalent codes per domain, each recorded for
# at least min_num_patients patients
step1 <- get_candidate_covariates(df = dfx,
                                  domainVarname = "domain",
                                  eventCodeVarname = "icd10",
                                  patientIdVarname = "idx",
                                  patientIdVector = patientIds,
                                  n = 200,
                                  min_num_patients = 20)

You can use the autoCovariateSelection package to implement these restrictions (Robert 2020).

5.3.1 Long format data

out1 <- step1$covars_data
head(out1)

5.3.2 Updated frequency data

df2 <- data.frame(
  icd10 = names(table(out1$icd10)),
  count = as.numeric(table(out1$icd10))
)

ICD10 Code	Count
dx_A49	28
dx_B00	20
dx_B35	22
dx_C50	31
dx_D75	136
dx_E03	397

Only the first few code frequencies are shown (in alphabetical order); these codes were selected based on the restrictions n = 200 and min_num_patients = 20.

	ICD10 Code	Count
77	dx_R52	40
78	dx_R60	187
79	dx_R73	202
80	dx_T14	82
81	dx_T78	96
82	dx_Z79	277

Only the last few code frequencies are shown (in alphabetical order).

5.3.3 Total number of codes retained

nrow(df2)
#> [1] 82