
How to correctly reshape the multiclass output of predict_proba of a sklearn classifier?

I have a multiclass problem with 10 classes. Using any of the sklearn classifiers with predict_proba, I get an output of shape

(n_classes, n_samples, n_classes_probability_1_or_0)

In my case that is (10, 4789, 2).

With binary classification I would just do

model.predict_proba(X)[:, 1]

I had assumed that:

pred = np.array(model.predict_proba(X))
pred = pred.reshape(-1, 10, 2)[:, :, 1]

would do the same, but the ordering is completely off.

Now y[:, class] corresponds to pred[class, :, 1]

I know I'm thinking about the shape the wrong way, but I can't see where I go wrong.

How do I reshape it correctly? The goal is to use it with the roc_auc_score metric, so I want an array of shape (n_samples, n_classes) that holds, for each class, the probability of the label being 1.

Could you help please? Thank you in advance!

It would help to mention that you are using MultiOutputClassifier, because most multiclass classifiers in scikit-learn do not return output shaped like yours. Here is an example dataset:

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

lb = preprocessing.LabelBinarizer()

X, y = make_classification(n_samples=500, n_classes=10, n_informative=10, n_clusters_per_class=1)
# One-hot encode the labels: y becomes an array of shape (500, 10)
y = lb.fit_transform(y)

Set up the classifier:

forest = RandomForestClassifier(n_estimators=10, random_state=1)
model = MultiOutputClassifier(forest, n_jobs=-1)
model.fit(X, y)

You don't need to reshape anything; simply pull out the values:

pred = np.array(model.predict_proba(X))

As in your attempt, this array has shape (n_classes, n_samples, 2). Taking index 1 on the last axis leaves one row per class and one column per observation:

pred[:,:,1].shape
(10, 500)

To get your probabilities in shape (n_samples, n_classes), just transpose:

prob1 = pred[:,:, 1].T

prob1[:2]
array([[0.9, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
       [0.1, 0. , 0.1, 0. , 0.7, 0. , 0.1, 0. , 0.1, 0. ]])
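
A toy array shows why the reshape in the question scrambles the values while a transpose does not. This is a minimal sketch using only NumPy. The array toy and its encoding (the value 10*c + s for class c and sample s) are made up for illustration and stand in for np.array(model.predict_proba(X)).

```python
import numpy as np

n_classes, n_samples = 3, 4

# Hypothetical stand-in for np.array(model.predict_proba(X)):
# toy[c, s, 1] stores 10*c + s so every entry reveals its class and sample.
toy = np.zeros((n_classes, n_samples, 2))
for c in range(n_classes):
    for s in range(n_samples):
        toy[c, s, 1] = 10 * c + s

# reshape keeps the flat memory order and only regroups it,
# so a "row" ends up mixing samples instead of holding one sample.
wrong = toy.reshape(-1, n_classes, 2)[:, :, 1]

# Transposing swaps the class and sample axes, which is what is wanted.
right = toy[:, :, 1].T

print(wrong)  # first row holds samples 0, 1, 2 of class 0
print(right)  # first row holds sample 0 for classes 0, 1, 2
```

Both results have shape (n_samples, n_classes), which is why the reshape version looks plausible at first glance even though its entries are misplaced.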

Compare this with explicitly extracting column 1 from each per-class array and stacking the results:

prob2 = np.hstack([i[:,1].reshape(-1,1) for i in model.predict_proba(X)])

prob2[:2]
array([[0.9, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
       [0.1, 0. , 0.1, 0. , 0.7, 0. , 0.1, 0. , 0.1, 0. ]])
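
Since the stated goal is roc_auc_score, here is a rough sketch of how the transposed array could be passed to it. The train/test split, the random_state values, and the default n_jobs are assumptions added for illustration and are not part of the original setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.preprocessing import LabelBinarizer

X, labels = make_classification(n_samples=500, n_classes=10, n_informative=10,
                                n_clusters_per_class=1, random_state=0)

# Assumption: hold out a stratified test set so every class appears in it.
X_train, X_test, l_train, l_test = train_test_split(
    X, labels, test_size=0.3, stratify=labels, random_state=0)

lb = LabelBinarizer()
y_train = lb.fit_transform(l_train)  # shape (350, 10), one-hot
y_test = lb.transform(l_test)        # shape (150, 10), one-hot

forest = RandomForestClassifier(n_estimators=10, random_state=1)
model = MultiOutputClassifier(forest).fit(X_train, y_train)

# List of 10 arrays of shape (n_samples, 2) -> array of shape (n_samples, 10)
prob = np.array(model.predict_proba(X_test))[:, :, 1].T

# y_test is in multilabel indicator format, so the score is computed
# per column and then averaged (or returned per class with average=None).
macro_auc = roc_auc_score(y_test, prob, average="macro")
per_class_auc = roc_auc_score(y_test, prob, average=None)
print(macro_auc)
print(per_class_auc)
```

Column j of prob lines up with column j of y_test, which is the property the transpose is meant to guarantee.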
