5 Step 2: Empirical
Based on the merged dataset, we identify which patients were linked in both databases. Using those IDs, we want to sort the list of candidate empirical covariates.
5.1 Sort by prevalence
Check the frequency of each code. Here are the 10 most frequent codes:
ICD10 Code | Count |
---|---|
I10 | 5742 |
E78 | 2965 |
F32 | 1135 |
F41 | 1090 |
K21 | 911 |
M79 | 870 |
E03 | 807 |
M54 | 772 |
G47 | 697 |
J45 | 626 |
However, some codes may have much lower counts (e.g., fewer than 20).
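A frequency table like the one above can be sketched in base R. This is a minimal illustration on made-up data, assuming a long-format data frame with one row per recorded code and assuming that "frequency" means the number of distinct patients with the code; the toy values and the object names `toy`, `patient_code`, and `code_counts` are hypothetical and not part of the original analysis.

```r
# Toy long-format data (hypothetical values): one row per recorded code.
toy <- data.frame(
  idx   = c(1, 1, 2, 2, 3, 3, 3, 4),
  icd10 = c("I10", "I10", "I10", "E78", "I10", "E78", "F32", "K21"),
  stringsAsFactors = FALSE
)

# Count each patient at most once per code, so repeated records
# for the same patient do not inflate the prevalence.
patient_code <- unique(toy[, c("idx", "icd10")])

# Number of distinct patients per code, most prevalent first.
code_counts <- sort(table(patient_code$icd10), decreasing = TRUE)
print(code_counts)
```

If the counts are meant to be per record rather than per patient, the `unique()` step would be dropped.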
We choose n = 200 [for (1)] as proposed in the original algorithm (Schneeweiss et al. 2009). In practice, the cutoff does not need to be so restrictive (Schuster, Pang, and Platt 2015). Parts (2) and (3) are more likely and are addressed by the following restriction: at least min_num_patients patients must have a given code for it to be selected in the list.
If there were more dimensions, a separate list of candidate empirical covariates would be identified for each dimension.
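The two restrictions, applied separately within each dimension, can be sketched as follows. This is a hedged base R illustration, not the package's implementation: the helper `top_codes()`, the toy data, and the small cutoffs (n = 2, min_num_patients = 2) are all assumptions chosen so the result is easy to check by hand.

```r
# Toy data (hypothetical): two dimensions, "dx" and "px".
toy <- data.frame(
  idx    = c(1, 2, 3, 1, 2, 3, 1, 2),
  domain = c("dx", "dx", "dx", "dx", "dx", "dx", "px", "px"),
  code   = c("I10", "I10", "I10", "E78", "E78", "F32", "P01", "P01"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep codes seen in at least `min_num_patients`
# distinct patients, then retain the `n` most prevalent of those.
top_codes <- function(d, n, min_num_patients) {
  d <- unique(d[, c("idx", "code")])                        # one row per patient-code pair
  counts <- vapply(split(d$idx, d$code), length, integer(1)) # patients per code
  counts <- sort(counts[counts >= min_num_patients], decreasing = TRUE)
  names(counts)[seq_len(min(n, length(counts)))]
}

# One candidate list per dimension.
candidates <- lapply(split(toy, toy$domain), top_codes,
                     n = 2, min_num_patients = 2)
print(candidates)
```

Here F32 is dropped from the "dx" list because only one patient has it, which mirrors the role of min_num_patients described above.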
5.2 Choose Granularity
One important point here is that we have set the granularity to the first 3 digits of the ICD-10 code (e.g., I10).
The codes were already truncated at the 3-digit level while preparing the data.
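For illustration, truncation to the 3-digit level could look like the sketch below. It assumes the raw codes are character strings that may or may not contain a decimal point; the example codes and the names `raw_codes` and `icd3` are hypothetical, and the actual data preparation step may have been done differently.

```r
# Hypothetical raw ICD-10 codes at mixed levels of detail.
raw_codes <- c("I10", "E78.5", "F32.1", "K219")

# Remove any decimal point, upper-case the code, and keep the first 3 characters.
icd3 <- substr(toupper(gsub(".", "", raw_codes, fixed = TRUE)), 1, 3)
print(icd3)
```

Coarser granularity pools related diagnoses into one covariate, which raises each code's prevalence at the cost of clinical detail.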
5.3 Retain top n empirical covariates
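# Load the package that provides get_candidate_covariates()
library(autoCovariateSelection)
# Keep the top n = 200 codes per domain, each present in at least 20 patients
step1 <- suppressMessages(get_candidate_covariates(df = dfx,
                                                   domainVarname = "domain",
                                                   eventCodeVarname = "icd10",
                                                   patientIdVarname = "idx",
                                                   patientIdVector = patientIds,
                                                   n = 200,
                                                   min_num_patients = 20))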
You can use the autoCovariateSelection package to implement these restrictions (Robert 2020).