
How to correctly reshape the multiclass output of predict_proba of a sklearn classifier?

I have a multiclass problem with 10 classes. Using any of the sklearn classifiers with predict_proba, I get an output of shape

(n_classes, n_samples, n_classes_probability_1_or_0)

in my case (10, 4789, 2).

With binary classification I would just do

model.predict_proba(X)[:, 1]

I had assumed that:

pred = np.array(model.predict_proba(X))
pred = pred.reshape(-1, 10, 2)[:, :, 1]

would do the same, but the ordering is completely off.

Currently y[:, class] corresponds to pred[class, :, 1].

I know I'm thinking about the shape the wrong way, but unfortunately I can't see where.

How do I reshape it correctly? The goal is to use it with the roc_auc_score metric, and I want a shape of (instances, classes_probabilities = 1), that is, one row per instance holding the probability of label 1 for each class.

Could you help please? Thank you in advance!

It would be useful to mention that you are using MultiOutputClassifier, because most multiclass classifiers in scikit-learn don't return output shaped like yours. Using an example dataset:

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# 10-class toy dataset
X, y = make_classification(n_samples=500, n_classes=10, n_informative=10, n_clusters_per_class=1)

# One-hot encode the labels: y becomes a (500, 10) indicator matrix
lb = preprocessing.LabelBinarizer()
y = lb.fit_transform(y)

Set up the classifier:

forest = RandomForestClassifier(n_estimators=10, random_state=1)
model = MultiOutputClassifier(forest, n_jobs=-1)
model.fit(X, y)
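
To make the difference in output shapes concrete, here is a minimal sketch contrasting a plain multiclass forest with the MultiOutputClassifier setup on one-hot labels. The random_state values and the variable names (plain, multi, y_labels, y_onehot) are assumptions added for illustration; they are not part of the original post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.preprocessing import LabelBinarizer

# random_state is an assumption, added only so the sketch is repeatable
X, y_labels = make_classification(n_samples=500, n_classes=10, n_informative=10,
                                  n_clusters_per_class=1, random_state=0)
y_onehot = LabelBinarizer().fit_transform(y_labels)  # indicator matrix, shape (500, 10)

# Plain multiclass fit on integer labels: predict_proba returns one 2-D array
plain = RandomForestClassifier(n_estimators=10, random_state=1).fit(X, y_labels)
plain_proba = plain.predict_proba(X)
print(plain_proba.shape)  # (500, 10): one column per class

# Multi-output fit on the one-hot matrix: predict_proba returns a list with
# one (n_samples, 2) array per label column
multi = MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=1))
multi.fit(X, y_onehot)
multi_proba = multi.predict_proba(X)
print(type(multi_proba).__name__, len(multi_proba), multi_proba[0].shape)
```

Wrapping that list in np.array is what produces the (n_classes, n_samples, 2) shape from the question. This only works when every label column contains both 0 and 1 in the training data, so that each per-label array has two columns.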

You don't need to think about reshaping it; simply pull out the values:

pred = np.array(model.predict_proba(X))

As in your own attempt, this converts the list into a single array. In the slice pred[:, :, 1], every row is a class and every column is an observation:

pred[:,:,1].shape
(10, 500)

To get your probabilities, just transpose:

prob1 = pred[:,:, 1].T

prob1[:2]
array([[0.9, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
       [0.1, 0. , 0.1, 0. , 0.7, 0. , 0.1, 0. , 0.1, 0. ]])
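
Why the transpose works where reshape(-1, 10, 2) does not can be seen on a toy array. The sketch below is a stand-in for np.array(model.predict_proba(X)), not real classifier output: each "probability of label 1" entry encodes its own position as 10 * class + sample, so any scrambling is easy to spot. The sizes and the encoding are assumptions chosen for illustration.

```python
import numpy as np

n_classes, n_samples = 3, 4

# Toy stand-in with shape (n_classes, n_samples, 2); the last axis mimics
# [P(label == 0), P(label == 1)], and the second entry stores 10*class + sample
pred = np.zeros((n_classes, n_samples, 2))
for c in range(n_classes):
    for s in range(n_samples):
        pred[c, s, 1] = 10 * c + s

# Transposing swaps the class and sample axes: row = sample, column = class
transposed = pred[:, :, 1].T

# Reshaping keeps the flat memory order and only re-cuts it into new rows,
# so entries from different samples end up in the same row
reshaped = pred.reshape(-1, n_classes, 2)[:, :, 1]

print(transposed[1])  # sample 1 for classes 0, 1, 2: values 1, 11, 21
print(reshaped[1])    # mixed samples and classes: values 3, 10, 11
```

Both results have shape (4, 3), which is why the reshape looks plausible at first, but only the transposed version lines up with y row by row.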

Compare this with extracting the second column of each array explicitly and stacking the results:

prob2 = np.hstack([i[:,1].reshape(-1,1) for i in model.predict_proba(X)])

prob2[:2]
array([[0.9, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
       [0.1, 0. , 0.1, 0. , 0.7, 0. , 0.1, 0. , 0.1, 0. ]])
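
Since the stated goal is roc_auc_score, here is a hedged sketch of feeding the (n_samples, n_classes) matrix into that metric. The train/test split, the random_state values, and the names scores, per_class_auc and macro_auc are assumptions added for illustration; the original post scores on the training data and does not show the metric call.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.preprocessing import LabelBinarizer

# random_state values are assumptions, added only for repeatability
X, labels = make_classification(n_samples=500, n_classes=10, n_informative=10,
                                n_clusters_per_class=1, random_state=0)
y = LabelBinarizer().fit_transform(labels)

# Stratify on the integer labels so every class appears in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=labels, random_state=0)

model = MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=1))
model.fit(X_train, y_train)

proba_list = model.predict_proba(X_test)  # list of 10 arrays, each (n_samples, 2)

# Column 1 of each array is P(label == 1); stacking gives (n_samples, n_classes)
scores = np.column_stack([p[:, 1] for p in proba_list])

# y_test is an indicator matrix, so roc_auc_score scores each column separately
per_class_auc = roc_auc_score(y_test, scores, average=None)
macro_auc = roc_auc_score(y_test, scores, average="macro")
print(per_class_auc.round(3))
print(round(macro_auc, 3))
```

np.column_stack is used here as a slightly shorter equivalent of the hstack and reshape combination; the transpose of np.array(proba_list)[:, :, 1] gives the same matrix.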
