
The AdaBoost algorithm is:

[image: the AdaBoost algorithm (pseudocode)]

My trouble is understanding how the classifier $G_m(x)$ is trained. What does it mean for a classifier to be trained using weights $w_i$? Does it mean fitting the classifier to $\{w_i,y_i\}_{i=1}^{N}$?

3 Answers


AdaBoost is an ensemble model that starts with high bias but low variance, in contrast with bagging ensembles, which are models with high variance but low bias (see figure 1).

[figure 1: bias and variance of boosting vs. bagging ensembles]

Although the original paper makes use of decision tree stumps, you could theoretically use any other classifier, more precisely any unstable (weak) classifier.
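
As a quick illustration (my own sketch using scikit-learn, not from the original paper), the snippet below boosts the default depth-1 stumps and then swaps in a slightly deeper tree as the base learner; any classifier that accepts sample weights could be substituted. The base-learner argument is named `estimator` in recent scikit-learn versions (`base_estimator` in older ones).

```python
# Sketch: AdaBoost with the default stumps vs. a different weak learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Default base learner: depth-1 decision trees (stumps).
stump_boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

# Any weak learner accepting sample_weight works, e.g. depth-3 trees.
deeper_boost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),
    n_estimators=50,
    random_state=0,
).fit(X, y)

print(stump_boost.score(X, y), deeper_boost.score(X, y))
```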

In AdaBoost, each fitting step typically trains a depth-1 decision tree (a stump) on the data. Given the training set and a weight associated with each observation, fitting a stump involves finding the "best" variable $x$ and threshold $s$, where the best variable and split threshold are defined as the pair that minimizes some measure of node impurity, like the Gini index, computed with the sample weights.

So, given a set of candidate variables to split on and a set of training data, there will be a unique solution (a single variable and threshold) that is the best depth-1 decision tree for the current boosting stage. The set of variables $\{1, 2, \ldots\}$ from which we pick our split variable may be either the entire set of features we have, or a (random) subset. Many implementations of decision tree classifiers let the fitting algorithm randomly pick a subset of variables at each split.
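
To make this concrete, here is a minimal scikit-learn sketch (my own illustration): a depth-1 tree is fit once with uniform weights and once with the misclassified points up-weighted; the chosen split variable and threshold can change between the two fits.

```python
# Sketch: the "best" stump depends on the sample weights.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Uniform weights: the ordinary best stump.
w_uniform = np.full(len(y), 1.0 / len(y))
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X, y, sample_weight=w_uniform)
print("split feature:", stump.tree_.feature[0], "threshold:", stump.tree_.threshold[0])

# Up-weight the points the first stump got wrong; the best split may move.
w_boosted = np.where(stump.predict(X) != y, 5.0, 1.0)
stump2 = DecisionTreeClassifier(max_depth=1, random_state=0)
stump2.fit(X, y, sample_weight=w_boosted)
print("split feature:", stump2.tree_.feature[0], "threshold:", stump2.tree_.threshold[0])
```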

The reason that we weigh the misclassified points more is that those are the ones we want to correct. A particular variable may not be the best variable to split on given equally weighted data, but it may become the best variable to split on once the weights become unevenly distributed. It could be good at correcting the mistakes of a previous learner.
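
Below is a minimal, self-contained sketch of the AdaBoost.M1 loop (labels assumed to be in $\{-1, +1\}$, stumps from scikit-learn); it is meant to illustrate the reweighting idea, not to replace a library implementation.

```python
# Sketch of AdaBoost.M1: misclassified points get their weights multiplied
# by exp(alpha_m), so the next stump is pushed to correct those mistakes.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, n_rounds=10):
    n = len(y)
    w = np.full(n, 1.0 / n)                            # start with uniform weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y                   # indicator of mistakes
        err = np.sum(w * miss) / np.sum(w)             # weighted error rate
        alpha = np.log((1 - err) / max(err, 1e-12))    # stump's vote weight
        w = w * np.exp(alpha * miss)                   # up-weight the mistakes
        w = w / w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):
    # Sign of the alpha-weighted vote of all stumps.
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(votes)
```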

It is also worth mentioning that there is no guarantee that each of your features will appear in the final result. Several features may be repeated, others may be ignored altogether, and a variable may well appear in more than one stump during the AdaBoost procedure.
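
One quick way to see this (an illustrative snippet assuming scikit-learn's `AdaBoostClassifier` with its default stump base learner) is to count which feature index each fitted stump splits on:

```python
# Sketch: some features are reused by several stumps, others never appear.
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
boost = AdaBoostClassifier(n_estimators=25, random_state=0).fit(X, y)

# For a depth-1 tree, tree_.feature[0] is the index of the root's split variable.
split_features = [est.tree_.feature[0] for est in boost.estimators_]
print(Counter(split_features))  # counts of how often each feature is chosen
```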

Finally, you may find these resources useful:

  1. https://www.youtube.com/watch?v=thR9ncsyMBE&list=PL05umP7R6ij35ShKLDqccJSDntugY4FQT&index=8

The weights refer to the weights assigned to the training samples, in particular to the misclassified ones. AdaBoost assigns a weight to each misclassified observation based on the calculation shown above. When the next boosting model is trained, these weights help it give more importance to the misclassified observations.
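
As a tiny numeric sketch of that calculation (the AdaBoost.M1 update, with made-up values purely for illustration):

```python
# One reweighting step: the two misclassified observations gain weight.
import numpy as np

y_true = np.array([ 1, -1,  1,  1, -1])
y_pred = np.array([ 1,  1,  1, -1, -1])   # current weak learner's output
w      = np.full(5, 0.2)                  # current (uniform) sample weights

miss  = y_pred != y_true                  # observations 2 and 4 are wrong
err   = np.sum(w * miss) / np.sum(w)      # weighted error = 0.4
alpha = np.log((1 - err) / err)           # learner's weight ≈ 0.405
w_new = w * np.exp(alpha * miss)          # misclassified weights grow
w_new /= w_new.sum()                      # renormalize

print(np.round(w_new, 3))                 # [0.167 0.25  0.167 0.25  0.167]
```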


In the context of AdaBoost.M1 (also known as Discrete AdaBoost), the base classifier returns a discrete class label, usually $1$ or $-1$.

When the base classifier is a classification tree, the CART algorithm can handle sample weights after minor modifications described next.

For the sake of generality and consistency with the source code of scikit-learn, let us consider multiclass classification with labels $1,\ldots,C$ (instead of just binary classification). Assume weights $w_1,\ldots,w_n$ are assigned to each observation in the training sample.
Recall that the optimal split of a node $A$ is found by minimizing the weighted sum of the impurities of the left and right children $$\frac{n_{A_{\text{left}}}}{n_A} H(A_{\text{left}}) + \frac{n_{A_{\text{right}}}}{n_A} H(A_{\text{right}}).$$ Let $\hat \eta_{1},\ldots,\hat \eta_{C}$ denote the class-probability estimates for node $A$. Usual choices for the node impurity are $$H(A) = \begin{cases} 1-\max_{1\leq c\leq C} \hat \eta_{c} \quad &\text{Classification error rate} \\ \sum_{c=1}^{C} \hat \eta_{c}(1-\hat \eta_{c}) \quad &\text{Gini index} \\ -\sum_{c=1}^{C} \hat \eta_{c}\ln \hat \eta_{c} \quad &\text{Entropy} \end{cases}$$ With sample weights, these formulas for the impurity remain the same; only the way $\hat\eta_c$ is computed changes.

  • Without sample weights, the definition of $\hat\eta_c$ is $$\hat\eta_c = \frac{|\{i:x_i\in A \text{ and } y_i=c\}|}{n_A}.$$
  • With sample weights, define $S=\sum\limits_{\substack{1\leqslant i\leqslant n,\\x_i\in A}} w_i$ and $\hat\eta_c$ is now $$\hat\eta_c = \frac 1S \sum\limits_{\substack{1\leqslant i\leqslant n,\\x_i\in A \text{ and } y_i=c }} w_i = \frac 1S\sum_{i=1}^n w_i 1_{x_i\in A \text{ and } y_i=c}.$$
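
For concreteness, here is a small NumPy sketch (my own, not taken from scikit-learn's source) of the weighted estimate $\hat\eta_c$ and the three impurities above; classes are indexed $0,\ldots,C-1$ for convenience.

```python
# Sketch: weighted class probabilities and node impurities for one node.
import numpy as np

def weighted_node_stats(y_node, w_node, n_classes):
    """Weighted class-probability estimates and impurities for a node."""
    S = w_node.sum()
    eta = np.array([w_node[y_node == c].sum() for c in range(n_classes)]) / S
    error_rate = 1.0 - eta.max()                             # classification error rate
    gini = np.sum(eta * (1.0 - eta))                         # Gini index
    entropy = -np.sum(eta[eta > 0] * np.log(eta[eta > 0]))   # entropy
    return eta, error_rate, gini, entropy

# Toy node with three classes; the last two samples carry more weight.
y_node = np.array([0, 0, 1, 2, 2])
w_node = np.array([1.0, 1.0, 1.0, 3.0, 3.0])
print(weighted_node_stats(y_node, w_node, n_classes=3))
```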

See for yourself in scikit-learn's source code.

This change in $\hat \eta_{c}$ also affects the predictions returned by the decision tree (see scikit-learn's implementation).