# SuperLearner

## Choosing Learners
SuperLearner is a type 2 ensemble method, meaning it combines many methods of different types into one predictive model. SuperLearner uses cross-validation to find the best weighted combination of algorithms according to a specified measure of predictive performance (the default in the SuperLearner package is non-negative least squares based on the Lawson-Hanson algorithm (Mullen and Stokkum 2023), but measures such as AUC can also be used). To run SuperLearner, the user needs to specify a library consisting of all the methods SuperLearner should consider for the final model, as well as the number of cross-validation folds. See the previous chapter for other types of ensemble learning methods.
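As an illustration, a minimal call might look like the sketch below. The data are simulated, the variable names are hypothetical, and the library shown is just an example; `cvControl` sets the number of cross-validation folds and `method = "method.NNLS"` spells out the non-negative least squares default.

```r
library(SuperLearner)

# Simulated example data (hypothetical names)
set.seed(123)
n <- 500
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
Y <- rbinom(n, 1, plogis(0.4 * X$x1 - 0.3 * X$x2))

sl <- SuperLearner(
  Y = Y, X = X, family = binomial(),
  SL.library = c("SL.mean", "SL.glm", "SL.glmnet"),  # an illustrative library
  cvControl  = list(V = 10),                         # 10 cross-validation folds
  method     = "method.NNLS"                         # the package default
)
sl$coef  # the weight given to each learner in the final combination
```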
SuperLearner will perform as well as possible given the library of algorithms considered. Phillips et al. (2023) provide concrete guidelines for determining the number of cross-validation folds and selecting the algorithms to include. Overall, we want to make sure the set of algorithms provided is:

- Diverse: a rich library of algorithms allows the SuperLearner to adapt to a range of underlying data structures. Diverse libraries include:
  - parametric learners such as generalized linear models (GLMs)
  - highly data-adaptive learners
  - multiple variants of the same learner with different parameter specifications (see the sketch after this list)
- Computationally feasible: many machine learning algorithms take a long time to run, and including several computationally intensive algorithms in the library will make the SuperLearner as a whole far too slow.
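The `create.Learner()` helper in the SuperLearner package makes it easy to generate several variants of one learner. A minimal sketch, assuming the ranger package is installed and using `mtry` values chosen purely for illustration:

```r
library(SuperLearner)

# Generate three random-forest wrappers that differ only in mtry
ranger_variants <- create.Learner(
  "SL.ranger",
  tune = list(mtry = c(2, 4, 8))
)

# The generated wrapper names (e.g. "SL.ranger_1") can be passed to
# SL.library alongside any other learners
ranger_variants$names
```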
Some of the more specific guidelines depend on our effective sample size. For binary outcomes, this can be calculated as:

\[ n_{eff} = \min\left(n,\ 5\big(n \cdot \min(\bar{p},\, 1-\bar{p})\big)\right) \]

where \(\bar{p}\) is the prevalence of the outcome. For continuous outcomes, the effective sample size is simply the sample size (\(n_{eff} = n\)).
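This formula is straightforward to compute directly. A minimal sketch, where `y` is a hypothetical 0/1 outcome vector:

```r
# Effective sample size for a binary outcome:
# n_eff = min(n, 5 * n * min(p_bar, 1 - p_bar))
n_eff_binary <- function(y) {
  n <- length(y)
  p_bar <- mean(y)  # prevalence of the outcome
  min(n, 5 * (n * min(p_bar, 1 - p_bar)))
}

# A rare outcome shrinks the effective sample size well below n
n_eff_binary(rbinom(1000, 1, 0.05))
```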
We also want to consider the characteristics of our particular sample:

- If there are continuous covariates: we should include learners that do not force relationships to be linear or monotonic, for example regression splines, support vector machines, and tree-based methods such as regression trees.
- If the data are high-dimensional (a large number of covariates, e.g. more than \(n_{eff}/20\)): we should include some learners from the class of screeners, i.e. learners that incorporate dimension reduction, such as LASSO and random forests.
- If the sample size is very large (i.e. \(n_{eff} > 500\)): we should include as many learners as is computationally feasible.
- If the sample size is small (i.e. \(n_{eff} \leq 500\)): we should include fewer learners (e.g. up to \(n_{eff}/5\)) and favour less flexible learners.
Some examples of learners that could be included are given in the table below (Polley 2021); the wrapper names in parentheses come from the package's `listWrappers()` output shown later in this chapter:

| Type of learner | Examples |
|---|---|
| Parametric | GLMs (`SL.glm`), Bayesian GLMs (`SL.bayesglm`), GLMs with interactions (`SL.glm.interaction`) |
| Highly data-adaptive | random forests (`SL.randomForest`, `SL.ranger`), gradient boosting (`SL.gbm`, `SL.xgboost`) |
| Allowing non-linear/monotonic relationships | regression splines (`SL.earth`), support vector machines (`SL.svm`), regression trees (`SL.rpart`) |
| Screeners | LASSO (`screen.glmnet`), random forests (`screen.randomForest`), univariate correlation (`screen.corP`) |
There is also a useful tool implemented in the SuperLearner package, `listWrappers()`, which allows us to easily see a list of all available learners:
```r
SuperLearner::listWrappers()
#> All prediction algorithm wrappers in SuperLearner:
#> [1] "SL.bartMachine" "SL.bayesglm" "SL.biglasso"
#> [4] "SL.caret" "SL.caret.rpart" "SL.cforest"
#> [7] "SL.earth" "SL.extraTrees" "SL.gam"
#> [10] "SL.gbm" "SL.glm" "SL.glm.interaction"
#> [13] "SL.glmnet" "SL.ipredbagg" "SL.kernelKnn"
#> [16] "SL.knn" "SL.ksvm" "SL.lda"
#> [19] "SL.leekasso" "SL.lm" "SL.loess"
#> [22] "SL.logreg" "SL.mean" "SL.nnet"
#> [25] "SL.nnls" "SL.polymars" "SL.qda"
#> [28] "SL.randomForest" "SL.ranger" "SL.ridge"
#> [31] "SL.rpart" "SL.rpartPrune" "SL.speedglm"
#> [34] "SL.speedlm" "SL.step" "SL.step.forward"
#> [37] "SL.step.interaction" "SL.stepAIC" "SL.svm"
#> [40] "SL.template" "SL.xgboost"
#>
#> All screening algorithm wrappers in SuperLearner:
#> [1] "All"
#> [1] "screen.corP" "screen.corRank" "screen.glmnet"
#> [4] "screen.randomForest" "screen.SIS" "screen.template"
#> [7] "screen.ttest" "write.screen.template"
```
## SuperLearner in TMLE
- The default SuperLearner library for estimating the outcome includes (Gruber, Van Der Laan, and Kennedy 2020):
  - `SL.glm`: generalized linear models (GLMs)
  - `SL.glmnet`: least absolute shrinkage and selection operator (LASSO)
  - `tmle.SL.dbarts2`: modeling and prediction using Bayesian additive regression trees (BART)
- The default library for estimating the propensity scores includes:
  - `SL.glm`: generalized linear models (GLMs)
  - `tmle.SL.dbarts.k.5`: an SL wrapper for modeling and prediction using BART
  - `SL.gam`: generalized additive models (GAMs)
- It is certainly possible to use a different set of learners: more methods can be added by specifying lists of learners in the `Q.SL.library` (for the outcome model) and `g.SL.library` (for the propensity score model) arguments.
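A minimal sketch of overriding these defaults with the tmle package; the data are simulated and the libraries shown are purely illustrative:

```r
library(tmle)

# Simulated data: W = covariates, A = binary treatment, Y = binary outcome
set.seed(123)
n <- 500
W <- data.frame(W1 = rnorm(n), W2 = rbinom(n, 1, 0.5))
A <- rbinom(n, 1, plogis(0.3 * W$W1))
Y <- rbinom(n, 1, plogis(0.5 * A + 0.4 * W$W1 - 0.2 * W$W2))

# Override the default libraries for the outcome (Q) and propensity (g) models
fit <- tmle(
  Y = Y, A = A, W = W, family = "binomial",
  Q.SL.library = c("SL.glm", "SL.glmnet", "SL.gam"),
  g.SL.library = c("SL.glm", "SL.gam")
)
fit$estimates$ATE$psi  # estimated average treatment effect
```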