Rethinking Residual Confounding Bias Reduction: Why Vanilla hdPS Alone is No Longer Enough

5.1 Sort by prevalence

Check the frequency of each code. Here is the list of the top 10:

ICD10 Code Frequencies
ICD10 Code	Count
I10	5742
E78	2965
F32	1135
F41	1090
K21	911
M79	870
E03	807
M54	772
G47	697
J45	626

However, some codes may be associated with much lower counts (e.g., fewer than 20).
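A frequency table like the one above can be produced with base R. The sketch below is illustrative only: `toy` is a hypothetical long-format data frame (one row per patient-code record) whose column names `idx` and `icd10` mirror those used later in this section. It counts distinct patients per code rather than raw records, which is an assumption about how the counts were derived.

```r
# Hypothetical toy data: one row per (patient, code) record
toy <- data.frame(
  idx   = c(1, 1, 2, 2, 3, 3, 3, 4),
  icd10 = c("I10", "E78", "I10", "I10", "I10", "F32", "E78", "K21"),
  stringsAsFactors = FALSE
)

# Keep one row per patient-code pair so repeated records count once
toy_unique <- unique(toy[, c("idx", "icd10")])

# Number of distinct patients per code, most prevalent first
code_freq <- sort(table(toy_unique$icd10), decreasing = TRUE)

# Show (up to) the 10 most prevalent codes
head(code_freq, 10)
```

If raw record counts are wanted instead, the `unique()` step can be skipped.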

Restrictions

The list of candidate empirical covariates is constrained as follows:

(1) Prevalence of codes: only the top n covariates with the highest prevalence are chosen.
(2) Codes with zero variance (e.g., everyone has the code, or nobody has it) must be removed.
(3) Codes with very low prevalence are also numerically problematic for further analyses.

We choose n = 200 [for (1)], as proposed in the original algorithm (Schneeweiss et al. 2009). In practice, n does not need to be this restrictive (Schuster, Pang, and Platt 2015). Items (2) and (3) are more likely to arise and are addressed by the following restriction: a code is selected for the list only if at least min_num_patients patients have that code.
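To make the top-n and min_num_patients restrictions concrete, here is a minimal base R sketch of the filtering logic. It is not the implementation used by any package: `select_top_codes` is a hypothetical helper, and `patient_counts` is a made-up named vector giving the number of patients per code.

```r
# Hypothetical helper: keep codes carried by at least `min_num_patients`
# patients, then retain the `n` most prevalent of those.
select_top_codes <- function(patient_counts, n = 200, min_num_patients = 20) {
  eligible <- patient_counts[patient_counts >= min_num_patients]
  ranked   <- sort(eligible, decreasing = TRUE)
  names(head(ranked, n))
}

# Made-up counts of patients per code (not taken from the data above)
patient_counts <- c(A01 = 500, B02 = 120, C03 = 19, D04 = 45, E05 = 0)

# With n = 2, only the two most prevalent eligible codes are kept
select_top_codes(patient_counts, n = 2, min_num_patients = 20)
```

In this toy input, C03 and E05 fall below the threshold of 20 patients and are dropped before the ranking is applied.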

If there were more data dimensions, a separate list of candidate empirical covariates would be identified for each dimension.

5.2 Choose Granularity

One important point here is that we have chosen a granularity of 3 digits for the ICD-10 codes.

The codes were already truncated at the 3-digit level while preparing the data.
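For reference, truncation to the 3-digit level could look like the sketch below. The vector `raw_codes` is hypothetical, and upper-casing and removing the dot before truncating are assumptions about how the raw codes are formatted; the actual data preparation step is not shown in this section.

```r
# Hypothetical raw ICD-10 codes recorded at varying levels of detail
raw_codes <- c("I10", "E78.5", "F32.1", "M54.50", "k21.9")

# Upper-case, drop the dot, and keep only the first 3 characters
icd10_3 <- substr(gsub(".", "", toupper(raw_codes), fixed = TRUE), 1, 3)

icd10_3
```

A coarser granularity merges related codes into one candidate covariate, so the choice affects both how many candidates exist and how prevalent each one is.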

5.3 Retain top n empirical covariates

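library(autoCovariateSelection)

# Keep the 200 most prevalent codes per domain, considering only codes
# recorded for at least 20 patients
step1 <- suppressMessages(
  get_candidate_covariates(df = dfx,
                           domainVarname = "domain",
                           eventCodeVarname = "icd10",
                           patientIdVarname = "idx",
                           patientIdVector = patientIds,
                           n = 200,
                           min_num_patients = 20))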

You can use the autoCovariateSelection package to implement these restrictions (Robert 2020).