How is a splitting point chosen for continuous variables in decision trees?

Question

I have two questions related to decision trees:

If we have a continuous attribute, how do we choose the splitting value?

Example: Age=(20,29,50,40....)
Imagine that we have a continuous attribute $f$ that takes values in $\mathbb{R}$. How can I write an algorithm that finds the split point $v$ such that, when we split $f$ at $v$, we have a minimum gain for $f>v$?

score 34 · Accepted Answer · answered Nov 03 '17 at 22:18

To find a split point, the values are sorted and the midpoints between adjacent values are evaluated with some metric, usually information gain or Gini impurity. For your example, let's say we have four examples and the values of the age variable are $(20, 29, 40, 50)$. The midpoints between adjacent values, $(24.5, 34.5, 45)$, are evaluated, and whichever split gives the best information gain (or whatever metric you're using) on the training data is used.
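As a rough illustration of the procedure above, here is a minimal Python sketch that sorts the values, tries each midpoint, and keeps the one with the highest information gain. The function names `entropy` and `best_split` and the class labels are assumptions made up for the example (the question gives only the ages), and real libraries use more efficient incremental updates than this.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, information_gain) of the best midpoint split."""
    # Sort the examples by attribute value, keeping labels aligned.
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    n = len(ys)
    parent = entropy(ys)
    best_v, best_gain = None, -1.0
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # equal values: no threshold can separate them
        v = (xs[i - 1] + xs[i]) / 2  # midpoint between adjacent values
        left, right = ys[:i], ys[i:]  # examples with f <= v and f > v
        gain = (parent
                - (len(left) / n) * entropy(left)
                - (len(right) / n) * entropy(right))
        if gain > best_gain:
            best_v, best_gain = v, gain
    return best_v, best_gain

# Hypothetical labels, invented for illustration only.
ages = [20, 29, 50, 40]
labels = ["no", "no", "yes", "yes"]
print(best_split(ages, labels))
```

With these made-up labels the midpoint 34.5 separates the two classes perfectly, so it is the split that gets chosen.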

You can save some computation time by only checking split points that lie between adjacent (in sorted order) examples of different classes, because only these splits can be optimal for information gain.
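A small sketch of that shortcut, again with a hypothetical function name (`boundary_candidates`) and invented labels: it returns only the midpoints where the class changes between neighbouring values. Examples sharing the same attribute value are grouped first, which is an assumption about how to handle ties that the answer does not spell out.

```python
from itertools import groupby

def boundary_candidates(values, labels):
    """Midpoints between adjacent distinct values whose classes differ."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    # One entry per distinct value: (value, set of classes seen at that value).
    groups = [(x, {y for _, y in grp})
              for x, grp in groupby(pairs, key=lambda p: p[0])]
    candidates = []
    for (x0, c0), (x1, c1) in zip(groups, groups[1:]):
        # Skip the midpoint only when both neighbours are pure
        # and belong to the same class.
        if len(c0) == 1 and len(c1) == 1 and c0 == c1:
            continue
        candidates.append((x0 + x1) / 2)
    return candidates

# Hypothetical labels, invented for illustration only.
print(boundary_candidates([20, 29, 40, 50], ["no", "no", "yes", "yes"]))
```

For this toy data only the midpoint between 29 and 40 survives, so a single candidate is scored instead of three.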
