I'm following this example on the scikit-learn website to perform a multioutput classification with a Random Forest model.

from sklearn.datasets import make_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle
import numpy as np

X, y1 = make_classification(n_samples=5, n_features=5, n_informative=2, n_classes=2, random_state=1)
y2 = shuffle(y1, random_state=1)
Y = np.vstack((y1, y2)).T

forest = RandomForestClassifier(n_estimators=10, random_state=1)
multi_target_forest = MultiOutputClassifier(forest, n_jobs=-1)
multi_target_forest.fit(X, Y).predict(X)

print(multi_target_forest.predict_proba(X))

From this predict_proba call I get two 5x2 arrays:

[array([[ 0.8,  0.2],
       [ 0.4,  0.6],
       [ 0.8,  0.2],
       [ 0.9,  0.1],
       [ 0.4,  0.6]]), array([[ 0.6,  0.4],
       [ 0.1,  0.9],
       [ 0.2,  0.8],
       [ 0.9,  0.1],
       [ 0.9,  0.1]])]

I was really expecting an n_samples by n_classes matrix. I'm struggling to understand how this relates to the probability of the classes present.

The docs for predict_proba state:

array of shape = [n_samples, n_classes], or a list of n_outputs such arrays if n_outputs > 1.

The class probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.

I'm guessing I have the latter in the description, but I'm still struggling to understand how this relates to my class probabilities.

Furthermore, when I attempt to access the classes_ attribute on the forest model I get an AttributeError, and the attribute does not exist on the MultiOutputClassifier either. How can I relate the classes to the output?

print(forest.classes_)

AttributeError: 'RandomForestClassifier' object has no attribute 'classes_'

Harpal

3 Answers

Assuming your target takes the values (0, 1), the classifier outputs a probability matrix of shape (N, 2). The first column is the probability that each sample belongs to class 0, and the second column is the probability that it belongs to class 1.

These two would sum to 1.

You can then output the result by:

probability_class_1 = model.predict_proba(X)[:, 1]

If you have k classes, the output has shape (N, k), and you have to select the column of the class whose probability you want.
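As a minimal sketch of the k-class case (the toy dataset, the variable names and the choice of class 1 are illustrative assumptions, not taken from the question), the columns of predict_proba line up with the fitted model's classes_ attribute:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative 3-class toy problem (not the data from the question).
X, y = make_classification(n_samples=50, n_features=5, n_informative=3,
                           n_classes=3, n_clusters_per_class=1,
                           random_state=0)

model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
proba = model.predict_proba(X)

print(model.classes_)     # column j of proba belongs to model.classes_[j]
print(proba.shape)        # (n_samples, n_classes)
print(proba.sum(axis=1))  # each row sums to 1

# Look up the column for the class you care about instead of hard-coding it.
class_index = np.where(model.classes_ == 1)[0][0]
probability_class_1 = proba[:, class_index]
```

Looking the column up through classes_ avoids relying on the labels happening to be 0, 1, 2, ... in that order.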

chrisckwong821

In the MultiOutputClassifier, you're treating the two outputs as separate classification tasks; from the docs you linked:

This strategy consists of fitting one classifier per target.

So the two arrays in the resulting list correspond to the two classifiers, one per dependent variable. Each array is the binary classification output that @chrisckwong821 described (columns are the probability of class 0 and the probability of class 1), with one array per problem.

In other words, the return value of predict_proba will be a list whose length is equal to the width of your y, i.e. n_outputs, in your case 2. Your quote from the predict_proba documentation references n_outputs, which is introduced in the documentation for fit:

fit(self, X, y[, sample_weight])

y : (sparse) array-like, shape (n_samples, n_outputs)
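A short sketch of how this ties back to the classes, reusing the data from the question (n_jobs is left out here for simplicity). MultiOutputClassifier fits clones of the estimator you pass in, so the original forest object is never fitted, which is why forest.classes_ raises AttributeError. The fitted per-target classifiers are stored in estimators_, and each one has its own classes_:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.utils import shuffle

X, y1 = make_classification(n_samples=5, n_features=5, n_informative=2,
                            n_classes=2, random_state=1)
y2 = shuffle(y1, random_state=1)
Y = np.vstack((y1, y2)).T  # shape (n_samples, n_outputs)

forest = RandomForestClassifier(n_estimators=10, random_state=1)
multi_target_forest = MultiOutputClassifier(forest).fit(X, Y)

probas = multi_target_forest.predict_proba(X)
print(len(probas))  # one array per column of Y

# The i-th fitted estimator explains the columns of the i-th array.
for i, (est, p) in enumerate(zip(multi_target_forest.estimators_, probas)):
    print("output", i, "classes:", est.classes_, "proba shape:", p.shape)

# The estimator passed in was cloned, so it stays unfitted.
print(hasattr(forest, "classes_"))
```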

Ben Reiniger

For the first question:

The first 5x2 array gives the probabilities of the 5 testing samples for the first output (the first column of Y). The first column of this array is "the probability that the testing sample is not classified as that label" (class 0), and the second column is "the probability that the testing sample is classified as that label" (class 1).

Similarly, the second 5x2 array gives the classification probabilities of the testing samples for the second output.

If you want to check this, you can compare the values in those arrays with the results from predict.
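One way to run that check, as a sketch that rebuilds the question's data (variable names are illustrative): for each output, take the most probable column, map it through that estimator's classes_, and compare it with the matching column of predict.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.utils import shuffle

X, y1 = make_classification(n_samples=5, n_features=5, n_informative=2,
                            n_classes=2, random_state=1)
y2 = shuffle(y1, random_state=1)
Y = np.vstack((y1, y2)).T

multi = MultiOutputClassifier(
    RandomForestClassifier(n_estimators=10, random_state=1)
).fit(X, Y)

pred = multi.predict(X)          # shape (n_samples, n_outputs)
probas = multi.predict_proba(X)  # list of n_outputs arrays

matches = []
for i, (est, p) in enumerate(zip(multi.estimators_, probas)):
    # Most probable class per sample for output i.
    most_likely = est.classes_[np.argmax(p, axis=1)]
    matches.append(np.array_equal(most_likely, pred[:, i]))

print(matches)
```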

Sometimes predict_proba returns a list that mixes Nx1 arrays and Nx2 arrays. An Nx1 array means the classifier for that output only ever saw a single class in its training data, so there is only one class to assign a probability to.
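A small sketch of the Nx1 case, using made-up data in which the second target column is constant (an assumption chosen purely to trigger the behaviour):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.RandomState(0)
X = rng.rand(6, 3)

# First target has both classes; second target is always 1.
Y = np.array([[0, 1], [1, 1], [0, 1], [1, 1], [0, 1], [1, 1]])

multi = MultiOutputClassifier(
    RandomForestClassifier(n_estimators=10, random_state=0)
).fit(X, Y)

probas = multi.predict_proba(X)
for est, p in zip(multi.estimators_, probas):
    # The number of columns equals the number of classes seen for that target.
    print(est.classes_, p.shape)
```

Checking estimators_[i].classes_ before indexing a column guards against assuming every array has two columns.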