How to correctly reshape the multiclass output of predict_proba of a sklearn classifier?

Question

I have a multiclass problem with 10 classes. Using any of the sklearn classifiers with predict_proba I get an output of

(n_classes, n_samples, n_classes_probability_1_or_0)

in my case (10, 4789, 2)

With binary classification I would just do

model.predict_proba(X)[:, 1]

I had assumed that:

pred = np.array(model.predict_proba(X))
pred = pred.reshape(-1, 10, 2)[:, :, 1]

would do the same, but the ordering is completely off.

Now y[:, class] corresponds to pred[class, :, 1]

I know I'm thinking about the shape the wrong way, but unfortunately I can't see where.

How do I reshape it correctly? The goal is to pass it to the roc_auc_score metric, so I want an array of shape (instances, classes) holding the probability of label 1 for each class.

Could you help please? Thank you in advance!

Answer 1

It would be useful to mention that you are using MultiOutputClassifier, because most multiclass classifiers in scikit-learn don't return output shaped like yours. Using an example dataset:

import numpy as np
from sklearn import preprocessing
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# 500 samples, 10 classes; y starts out as integer labels 0..9
X, y = make_classification(n_samples=500, n_classes=10, n_informative=10, n_clusters_per_class=1)

# One-hot encode the labels so y has shape (500, 10), one binary column per class
lb = preprocessing.LabelBinarizer()
y = lb.fit_transform(y)

Set up the classifier:

forest = RandomForestClassifier(n_estimators=10, random_state=1)
model = MultiOutputClassifier(forest, n_jobs=-1)
model.fit(X, y)

You don't need to think about reshaping it; simply pull out the values:

pred = np.array(model.predict_proba(X))

As in your own attempt, this stacks the per-class outputs into one array. In pred[:, :, 1], every row is a class and every column is an observation:

pred[:,:,1].shape
(10, 500)

To get your probabilities with one row per observation and one column per class, just transpose:

prob1 = pred[:,:, 1].T

prob1[:2]
array([[0.9, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
       [0.1, 0. , 0.1, 0. , 0.7, 0. , 0.1, 0. , 0.1, 0. ]])
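The reason the reshape in the question scrambles the ordering can be sketched with a small synthetic array. This is only an illustration: the sizes (3 classes, 4 samples) and the way each probability encodes its (class, sample) position are made up here and stand in for np.array(model.predict_proba(X)).

```python
import numpy as np

n_classes, n_samples = 3, 4  # hypothetical small sizes for illustration

# Synthetic stand-in for np.array(model.predict_proba(X)):
# axis 0 = class, axis 1 = sample, axis 2 = [P(label=0), P(label=1)].
pred = np.zeros((n_classes, n_samples, 2))
for c in range(n_classes):
    for s in range(n_samples):
        p1 = (10 * c + s) / 100.0  # value encodes its (class, sample) position
        pred[c, s] = [1.0 - p1, p1]

# Transposing swaps the class and sample axes, so row s holds sample s.
by_transpose = pred[:, :, 1].T  # shape (n_samples, n_classes)

# Reshaping only re-chunks the flat buffer; it never moves any values,
# so row 0 ends up holding the first three samples of class 0.
by_reshape = pred.reshape(-1, n_classes, 2)[:, :, 1]  # same shape, wrong order

print(by_transpose[0])  # sample 0 across the three classes
print(by_reshape[0])    # class 0 across the first three samples
```

Both results have shape (4, 3), which is why the reshape looks plausible at first; only the transpose keeps each row tied to a single observation.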

For comparison, extracting column 1 from each per-class array and stacking them side by side gives the same result:

prob2 = np.hstack([i[:,1].reshape(-1,1) for i in model.predict_proba(X)])

prob2[:2]
array([[0.9, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
       [0.1, 0. , 0.1, 0. , 0.7, 0. , 0.1, 0. , 0.1, 0. ]])
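Since the stated goal is roc_auc_score, here is a minimal sketch of feeding the transposed array into it. The train/test split, the random_state values and the variable names are assumptions added for illustration and are not part of the original question or answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.preprocessing import LabelBinarizer

# Same kind of toy data as above; random_state is fixed only for repeatability.
X, y = make_classification(n_samples=500, n_classes=10, n_informative=10,
                           n_clusters_per_class=1, random_state=0)
Y = LabelBinarizer().fit_transform(y)  # shape (500, 10), one column per class

# Hold out a test set (an assumption here) so the score is not computed
# on the training data; stratify keeps every class present in both parts.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0, stratify=y)

model = MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=1))
model.fit(X_train, Y_train)

# (n_classes, n_samples, 2) -> (n_samples, n_classes): probability of label 1
prob = np.array(model.predict_proba(X_test))[:, :, 1].T

auc_per_class = roc_auc_score(Y_test, prob, average=None)   # one AUC per class
auc_macro = roc_auc_score(Y_test, prob, average="macro")    # unweighted mean
print(auc_per_class.round(3), round(auc_macro, 3))
```

Because Y_test is a binary indicator matrix with the same (instances, classes) layout as prob, roc_auc_score treats each column as its own binary problem.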
