13

I am trying to apply the AdaBoost.M1 algorithm (with trees as base learners) to a data set with a large feature space (~20,000 features) and ~100 samples in R. There is a variety of packages for this purpose: adabag, ada, and gbm. gbm() (from the gbm package) appears to be my only viable option, since stack overflow errors are a problem in the others, and though it works, it is very time-consuming.

Questions:

  1. Is there any way to overcome the stack-overflow problem in the other packages, or to make gbm() run faster? I have tried converting the data.frame into a matrix, without success.
  2. When performing AdaBoost in gbm() (with distribution set to "adaboost"), An Introduction to Statistical Learning (Hastie et al.) mentions the following parameters as needing tuning:

  1. The total number of trees to fit.
  2. The shrinkage parameter, denoted lambda (the learning rate).
  3. The number of splits in each tree, controlling the complexity of the boosted ensemble.

As the algorithm is very time-consuming to run in R, I need to find literature on what values of these tuning parameters lie in a suitable range for this kind of large-feature-space data, before performing cross-validation over that range to estimate the test error rate.

Any suggestions?

Osama Rizwan (201)
AfBM (131)

1 Answer

2

Your problem with AdaBoost on a very high-dimensional dataset (20,000 features, ~100 samples) is a classic and difficult scenario, especially in R where memory and recursion limits can cause stack overflow errors with some packages.

1. Stack overflow issues

Stack overflow errors in the adabag or ada packages likely arise from deep recursion or inefficient memory management in their implementations. Unfortunately, there is no easy fix apart from:

Increasing R’s stack size or memory limits (at the OS level), though the gains may be limited.

Switching to packages that handle boosting iteratively rather than recursively.

Using gradient boosting implementations (e.g., gbm, xgboost, or lightgbm) which tend to be more optimized and scalable.
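For the first option, a quick sketch of where R's limits live (the `ulimit` line assumes a Unix-like shell; exact limits are system-dependent):

```r
# Inspect R's current C stack limit -- the one that deep recursive
# tree-building code can exhaust.
Cstack_info()

# Raise the cap on nested expressions R will evaluate (default 5000).
options(expressions = 5e5)

# The C stack itself is set by the OS *before* R starts, e.g. on Linux:
#   ulimit -s unlimited    # run in the shell, then launch R
```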

2. Speeding up gbm()

Convert data frames to matrices, as you did. Use fewer or shallower trees (reduce n.trees and interaction.depth). Use early stopping with validation sets to avoid unnecessary computation.

Parallelize if possible (gbm can run its cross-validation folds in parallel via the n.cores argument).

Consider reducing your feature space prior to training (see below).
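Putting these together, a minimal sketch (assuming a data frame df with a 0/1 outcome column y; all names are placeholders):

```r
library(gbm)

# Sketch: fit a capped AdaBoost ensemble and let gbm's built-in
# 5-fold cross-validation pick the stopping point.
fit <- gbm(y ~ ., data = df,
           distribution = "adaboost",
           n.trees = 500,           # upper cap on ensemble size
           interaction.depth = 2,   # shallow base trees
           shrinkage = 0.01,
           cv.folds = 5,
           n.cores = 2)             # parallelize the CV folds

# Best iteration by CV error -- predict with this many trees
# instead of all 500.
best <- gbm.perf(fit, method = "cv")
```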

3. Dimensionality reduction

With 20,000 features but only ~100 samples, your model risks severe overfitting and long training times. Consider:

Applying feature selection methods or domain-driven pruning.

Using PCA or other dimensionality reduction techniques to reduce features to a manageable size (e.g., 50-200 components).

Using regularization or sparse methods that handle high-dimensional data better.
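A simple filter of this kind takes a few lines of base R (sketch; X is assumed to be a numeric samples-by-features matrix, and 200 is an arbitrary cutoff to tune):

```r
# Keep only the 200 highest-variance features before boosting.
vars <- apply(X, 2, var)
keep <- order(vars, decreasing = TRUE)[1:200]
X_small <- X[, keep]

# Alternatively, project onto the first 50 principal components:
pc <- prcomp(X, rank. = 50)
X_pca <- pc$x
```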

4. Tuning parameters (based on Hastie et al.)

Number of trees (n.trees): start with 100-500 trees and increase if needed, but be mindful of training time.

Shrinkage (lambda): usually small values such as 0.01 or 0.001 lead to better generalization, at the cost of needing more trees.

Tree complexity (interaction.depth or max.depth): For AdaBoost, shallow trees (depth=1 or 2) often work best as base learners.

Use cross-validation to find an optimal balance, but narrow the search space using these heuristics first.
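A small grid guided by those heuristics might look like this (sketch; df with a 0/1 outcome y is assumed, and the grid values are illustrative):

```r
library(gbm)

grid <- expand.grid(shrinkage = c(0.001, 0.01, 0.1),
                    depth     = c(1, 2))

# For each (shrinkage, depth) pair, record the best 5-fold CV error
# over up to 500 boosting iterations.
grid$cv_error <- apply(grid, 1, function(p) {
  fit <- gbm(y ~ ., data = df, distribution = "adaboost",
             n.trees = 500,
             shrinkage = p["shrinkage"],
             interaction.depth = p["depth"],
             cv.folds = 5)
  min(fit$cv.error)
})

grid[which.min(grid$cv_error), ]   # winning combination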

AdaBoost via gbm() is workable but not ideal. You may find better performance and speed with more modern gradient-boosting frameworks such as xgboost or lightgbm, which handle sparse, high-dimensional data efficiently.
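For comparison, an equivalent model in xgboost is a one-liner, and it takes a (sparse) matrix directly (sketch; X is a numeric feature matrix and y a 0/1 vector, both assumed names, with illustrative parameter values):

```r
library(xgboost)

fit <- xgboost(data = as.matrix(X), label = y,
               nrounds = 200,                 # boosting iterations
               max_depth = 2,                 # shallow trees, as above
               eta = 0.05,                    # shrinkage / learning rate
               objective = "binary:logistic",
               verbose = 0)
```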

If you continue with gbm(), focus on reducing feature dimensions first and tuning trees and shrinkage conservatively.

Tipy (51)