5  Step 2: Empirical

From the merged dataset, we identify the patients who were linked in both databases. Using their IDs, we then sort the list of candidate empirical covariates.

5.1 Sort by prevalence

Check the frequency of each code. Here are the 10 most frequent codes:

ICD10 Code Frequencies

ICD10 Code    Count
I10            5742
E78            2965
F32            1135
F41            1090
K21             911
M79             870
E03             807
M54             772
G47             697
J45             626

However, some codes have much lower counts (e.g., fewer than 20).
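The frequency check above can be sketched in base R as follows. This is a minimal illustration on made-up data, not the code used to produce the table: the data frame `toy`, its values, and the threshold of 2 are assumptions, and it assumes that a code's count means the number of distinct patients with that code.

```r
# Toy long-format data: one row per recorded code (hypothetical values)
toy <- data.frame(
  idx   = c(1, 1, 2, 2, 3, 3, 4),
  icd10 = c("I10", "I10", "I10", "E78", "E78", "F32", "I10"),
  stringsAsFactors = FALSE
)

# Count each patient at most once per code
toy_unique <- unique(toy[, c("idx", "icd10")])

# Number of distinct patients per code, from most to least prevalent
code_counts <- sort(table(toy_unique$icd10), decreasing = TRUE)
print(code_counts)

# Codes held by fewer patients than a chosen threshold (2 here, for illustration)
rare_codes <- names(code_counts)[as.vector(code_counts) < 2]
print(rare_codes)
```

With real data, the same idea applies with the threshold set to the minimum number of patients required.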

Restrictions

The list of candidate empirical covariates is constrained as follows:

  1. Prevalence: only the top n codes with the highest prevalence are chosen.
  2. Zero variance: codes with zero variance (e.g., everyone has the code, or nobody has it) must be removed.
  3. Very low prevalence: codes with very low prevalence are also numerically problematic for further analyses.

We choose n = 200 for (1), as proposed in the original algorithm (Schneeweiss et al. 2009). In practice, it is not necessary to be so restrictive (Schuster, Pang, and Platt 2015). Issues (2) and (3) are more likely to arise and are addressed by the following restriction: at least min_num_patients patients must have a code for that code to be included in the list.
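To make the three restrictions concrete, here is a rough base R sketch of how they could be applied by hand. It is not the implementation of any package: the function `select_top_codes()`, its argument `n_total`, and the toy data are hypothetical names introduced only for illustration, and the explicit check for codes that every patient has is an assumption about how zero variance might be handled.

```r
# Hypothetical helper: rank codes by prevalence and apply the restrictions
select_top_codes <- function(long_df, n = 200, min_num_patients = 20,
                             n_total = length(unique(long_df$idx))) {
  # Count each patient at most once per code
  pairs <- unique(long_df[, c("idx", "icd10")])
  prevalence <- table(pairs$icd10)
  counts <- setNames(as.vector(prevalence), names(prevalence))

  # Restrictions (2) and (3): drop codes held by too few patients,
  # and drop codes that every patient has (no variation)
  eligible <- counts[counts >= min_num_patients & counts < n_total]

  # Restriction (1): keep at most the n most prevalent codes
  ranked <- sort(eligible, decreasing = TRUE)
  names(ranked)[seq_len(min(n, length(ranked)))]
}

# Toy data with 5 patients (hypothetical values)
toy <- data.frame(
  idx   = c(1, 2, 3, 4, 5,  1, 2, 3,  1,  1, 2, 3, 4),
  icd10 = c(rep("Z00", 5), rep("I10", 3), "F32", rep("E78", 4)),
  stringsAsFactors = FALSE
)

selected <- select_top_codes(toy, n = 2, min_num_patients = 2)
print(selected)
```

In this toy example, Z00 is dropped because all 5 patients have it and F32 is dropped because only 1 patient has it, so the remaining codes are ranked by prevalence.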

If there were more data dimensions, a separate list of candidate empirical covariates would be identified for each dimension.
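As a hedged illustration of what a separate list per dimension could look like, the sketch below splits made-up data by a domain column and ranks codes within each domain. The domain labels "dx" and "rx", the codes, and the thresholds are all hypothetical.

```r
# Toy data with two hypothetical data dimensions ("dx" and "rx")
toy <- data.frame(
  idx    = c(1, 2, 3, 1, 2, 1, 2, 3),
  domain = c("dx", "dx", "dx", "dx", "dx", "rx", "rx", "rx"),
  code   = c("I10", "I10", "I10", "E78", "E78", "A10", "A10", "C09"),
  stringsAsFactors = FALSE
)

# Count each patient at most once per domain and code
toy_unique <- unique(toy)

# One ranked candidate list per domain: at least 2 patients, at most 2 codes
candidates <- lapply(split(toy_unique, toy_unique$domain), function(d) {
  counts <- sort(table(d$code), decreasing = TRUE)
  counts <- counts[as.vector(counts) >= 2]
  names(counts)[seq_len(min(2, length(counts)))]
})
print(candidates)
```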

5.2 Choose Granularity

An important point is that we have chosen a granularity of 3 digits for the ICD-10 codes.

The codes were already truncated to the 3-digit level while preparing the data.
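For readers who have not yet prepared their data, truncation to the 3-digit level can be sketched in base R as below. The vector `raw_codes` and its values are made up, and the whitespace and case cleanup is an assumption about what raw codes might need, not a description of the data preparation used here.

```r
# Hypothetical raw ICD-10 codes at mixed granularity
raw_codes <- c("I10", "E78.5", "F32.1", "m54.5", " K21.9")

# Normalize whitespace and case, then keep the first 3 characters
clean_codes <- toupper(trimws(raw_codes))
icd10_3 <- substr(clean_codes, 1, 3)
print(icd10_3)
```

A coarser granularity pools related diagnoses into one code, which raises each code's prevalence at the cost of clinical detail.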

5.3 Retain top n empirical covariates

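library(autoCovariateSelection)  # provides get_candidate_covariates()

# Keep up to the n = 200 most prevalent codes, each held by at least
# 20 of the patients listed in patientIds
step1 <- suppressMessages(get_candidate_covariates(
  df = dfx,
  domainVarname = "domain",
  eventCodeVarname = "icd10",
  patientIdVarname = "idx",
  patientIdVector = patientIds,
  n = 200,
  min_num_patients = 20
))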

You can use the autoCovariateSelection package to implement these restrictions (Robert 2020).