5  Step 2: Empirical

Based on the merged dataset, we identify which patients were linked in both databases. Using those IDs, we want to sort the list of candidate empirical covariates.

5.1 Sort by prevalence

Check the frequency of each code:

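library(dplyr)
# Tabulate each 3-digit ICD-10 code once, most frequent first
code_freq <- sort(table(dfx$icd10), decreasing = TRUE)
df <- data.frame(
  icd10 = names(code_freq),
  count = as.numeric(code_freq)  # plain numeric column instead of a nested table
)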
ICD10 Code Frequencies

ICD10 Code  Count
I10          2775
E78          1517
F32           536
F41           524
K21           441
M79           401
E03           397
M54           314
G47           307
J45           301

Only the 10 most prevalent codes are shown.

However, some codes have much lower counts (e.g., fewer than 20).

Restrictions

The list of candidate empirical covariates is constrained in three ways:

  1. Prevalence: only the top n codes with the highest prevalence are chosen.
  2. Zero variance: codes with zero variance (e.g., everyone has the code, or nobody has it) must be removed.
  3. Very low prevalence: codes with very low prevalence are also numerically problematic for further analyses.

For (1), we choose n = 200, as proposed in the original algorithm (Schneeweiss et al. 2009). In practice, the restriction does not need to be this tight (Schuster, Pang, and Platt 2015). Issues (2) and (3) are more likely to arise and are addressed by the following restriction: at least min_num_patients patients must have a code for it to be selected in the list.

If there were more data dimensions, a separate list of candidate empirical covariates would be identified for each dimension.
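
The restrictions above can be sketched in a few lines of base R. This is only an illustration on made-up data, not the internal implementation of any package: the data frame toy, its values, and the small thresholds are assumptions chosen so the result is easy to follow. The column names idx and icd10 mirror those used in this section.

```r
# Toy long-format data: one row per recorded code (hypothetical values)
toy <- data.frame(
  idx   = c(1, 1, 2, 2, 3, 3, 4, 5, 5),
  icd10 = c("I10", "E78", "I10", "E78", "I10", "F32", "E78", "I10", "I10")
)

n <- 2                  # keep at most the top n codes
min_num_patients <- 2   # a code needs at least this many distinct patients

# Count distinct patients per code (repeat records for a patient count once)
pairs <- unique(toy[, c("idx", "icd10")])
n_patients <- sort(lengths(split(pairs$idx, pairs$icd10)), decreasing = TRUE)

# Drop rare codes, then keep the n most prevalent of those that remain
eligible <- n_patients[n_patients >= min_num_patients]
selected <- names(head(eligible, n))
selected
```

Here F32 is dropped because only one patient has it, and the two remaining codes fit within n. With real data, n = 200 and min_num_patients = 20 play the same roles.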

5.2 Choose Granularity

An important choice here is the granularity of the codes: we use the 3-digit level of the ICD-10 code.

The codes were already truncated to the 3-digit level while preparing the data.
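
For readers who need to do the truncation themselves, the following is a minimal sketch of one way to reduce codes to the 3-digit level. The helper truncate_icd10 and the example codes are hypothetical and are not taken from the data preparation step; the sketch assumes codes may contain a decimal point and mixed case.

```r
# Hypothetical raw codes recorded at mixed granularity
raw_codes <- c("I10", "E78.5", "F32.1", "M54.50", "e03.9")

# Remove the decimal point, upper-case, and keep the first `digits` characters
truncate_icd10 <- function(x, digits = 3) {
  substr(toupper(gsub(".", "", x, fixed = TRUE)), 1, digits)
}

truncated <- truncate_icd10(raw_codes)
truncated
```

A coarser level pools related diagnoses into one covariate, which raises prevalence per code; a finer level keeps more clinical detail but leaves more codes below the min_num_patients threshold.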

5.3 Retain top n empirical covariates

library(autoCovariateSelection)
# Keep at most the top n codes, each held by at least min_num_patients patients
step1 <- get_candidate_covariates(df = dfx,
                                  domainVarname = "domain",
                                  eventCodeVarname = "icd10",
                                  patientIdVarname = "idx",
                                  patientIdVector = patientIds,
                                  n = 200,
                                  min_num_patients = 20)

You can use the autoCovariateSelection package to implement these restrictions (Robert 2020).

5.3.1 Long format data

out1 <- step1$covars_data
head(out1)

5.3.2 Updated frequency data

# Tabulate the retained codes once (alphabetical order)
retained_freq <- table(out1$icd10)
df2 <- data.frame(
  icd10 = names(retained_freq),
  count = as.numeric(retained_freq)
)
ICD10 Code  Count
dx_A49         28
dx_B00         20
dx_B35         22
dx_C50         31
dx_D75        136
dx_E03        397

Only the first few code frequencies (in alphabetical order) are shown. These codes were selected under the restrictions n = 200 and min_num_patients = 20.

Row  ICD10 Code  Count
77   dx_R52         40
78   dx_R60        187
79   dx_R73        202
80   dx_T14         82
81   dx_T78         96
82   dx_Z79        277

Only the last few code frequencies (in alphabetical order) are shown.

5.3.3 Total number of codes retained

nrow(df2)
#> [1] 82
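
Since 82 is well below n = 200, the min_num_patients threshold, rather than n, appears to be the restriction that limits the list here. A quick sanity check along the following lines can confirm that a retained list respects both restrictions. This is a sketch only: df2_toy is a hypothetical stand-in built from a few of the counts shown above, and it assumes each count reflects the number of patients with that code.

```r
# Hypothetical stand-in for df2, using a few of the counts shown above
df2_toy <- data.frame(
  icd10 = c("dx_A49", "dx_B00", "dx_E03", "dx_Z79"),
  count = c(28, 20, 397, 277)
)

n <- 200
min_num_patients <- 20

# Both restrictions should hold for every retained code
checks <- c(
  within_n      = nrow(df2_toy) <= n,
  above_minimum = all(df2_toy$count >= min_num_patients)
)
checks
```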