
Predicting multilabel data with sklearn

According to the docs, the OneVsRest classifier supports multilabel classification: http://scikit-learn.org/stable/modules/multiclass.html#multilabel-learning

Here's the code I'm trying to run:

from sklearn import metrics
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.cross_validation import train_test_split
from sklearn.svm import SVC

x = [[1,2,3],[3,3,2],[8,8,7],[3,7,1],[4,5,6]]
y = [['bar','foo'],['bar'],['foo'],['foo','jump'],['bar','fox','jump']]

y_enc = MultiLabelBinarizer().fit_transform(y)

train_x, train_y, test_x, test_y = train_test_split(x, y_enc, test_size=0.33)

clf = OneVsRestClassifier(SVC())
clf.fit(train_x, train_y)
predictions = clf.predict_proba(test_x)

my_metrics = metrics.classification_report( test_y, predictions)
print my_metrics

I get the following error:

Traceback (most recent call last):
  File "multilabel.py", line 178, in <module>
    clf.fit(train_x, train_y)
  File "/sklearn/lib/python2.6/site-packages/sklearn/multiclass.py", line 277, in fit
    Y = self.label_binarizer_.fit_transform(y)
  File "/sklearn/lib/python2.6/site-packages/sklearn/base.py", line 455, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/sklearn/lib/python2.6/site-packages/sklearn/preprocessing/label.py", line 302, in fit
    raise ValueError("Multioutput target data is not supported with "
ValueError: Multioutput target data is not supported with label binarization

Not using the MultiLabelBinarizer gives the same error, so I'm assuming that's not the problem. Does anyone know how to use this classifier for multilabel data?

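You are unpacking the output of train_test_split() in the wrong order. It returns the train and test parts of the first array, followed by the train and test parts of the second array. Change this line: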

train_x, train_y, test_x, test_y = train_test_split(x, y_enc, test_size=0.33)

To this:

train_x, test_x, train_y, test_y = train_test_split(x, y_enc, test_size=0.33)
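To see why the order matters, here is a minimal sketch, assuming a scikit-learn release that provides sklearn.model_selection and sklearn.utils.multiclass.type_of_target. The random_state argument and the bad_* variable names are my own additions for illustration; y_enc is the indicator matrix for the question's labels written out by hand.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils.multiclass import type_of_target

x = np.array([[1, 2, 3], [3, 3, 2], [8, 8, 7], [3, 7, 1], [4, 5, 6]])
# Indicator matrix for the question's labels; columns are bar, foo, fox, jump.
y_enc = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 1],
])

# For each input array in turn, train_test_split returns its train part
# followed by its test part: (x_train, x_test, y_train, y_test).
parts = train_test_split(x, y_enc, test_size=0.33, random_state=0)

# Correct unpacking.
train_x, test_x, train_y, test_y = parts

# Unpacking order from the question: bad_train_y is really the test features.
bad_train_x, bad_train_y, bad_test_x, bad_test_y = parts

print(train_x.shape, train_y.shape, type_of_target(train_y))
print(bad_train_x.shape, bad_train_y.shape, type_of_target(bad_train_y))
```

With the question's unpacking, the array passed to fit() as the target is a slice of the features, so it has a different number of rows than train_x and contains values other than 0 and 1. scikit-learn then no longer treats it as a multilabel indicator matrix, which appears to be what triggers the "Multioutput target data" error.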

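Also, classification_report() compares predicted labels with the true labels, so pass it the output of clf.predict rather than clf.predict_proba. If you want probabilities as well, change SVC() to SVC(probability=True); without that flag, SVC does not provide predict_proba.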

Putting it all together:

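from sklearn import metrics
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# train_test_split moved to sklearn.model_selection in scikit-learn 0.18;
# fall back to the old module on earlier releases.
try:
    from sklearn.model_selection import train_test_split
except ImportError:
    from sklearn.cross_validation import train_test_split

x = [[1,2,3],[3,3,2],[8,8,7],[3,7,1],[4,5,6]]
y = [['bar','foo'],['bar'],['foo'],['foo','jump'],['bar','fox','jump']]

# Encode the label lists as a binary indicator matrix (one column per label).
mlb = MultiLabelBinarizer()
y_enc = mlb.fit_transform(y)

# Returned order: train/test features first, then train/test labels.
train_x, test_x, train_y, test_y = train_test_split(x, y_enc, test_size=0.33)

# probability=True is only needed if you also call clf.predict_proba.
clf = OneVsRestClassifier(SVC(probability=True))
clf.fit(train_x, train_y)
predictions = clf.predict(test_x)

my_metrics = metrics.classification_report(test_y, predictions)
print(my_metrics)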

This gives me no errors when I run it.
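If you do want per-label probabilities, something along these lines might work. This is only a sketch: the synthetic data, the 0.5 threshold and the random_state values are assumptions of mine, not part of the question. It uses more samples than the five in the question so that every label has both positive and negative examples.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

# Hypothetical toy data: a sample is tagged "foo" when its first feature is
# positive and "bar" when its second feature is positive.
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 2))
labels = [
    [name for name, on in (("foo", a > 0), ("bar", b > 0)) if on]
    for a, b in X
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)  # one 0/1 column per label, in mlb.classes_ order

# probability=True makes predict_proba available on the wrapped SVC.
clf = OneVsRestClassifier(SVC(probability=True, random_state=0))
clf.fit(X, Y)

hard = clf.predict(X)         # 0/1 indicator matrix, usable in classification_report
proba = clf.predict_proba(X)  # one probability per sample and label

# Turn probabilities into labels with an (assumed) 0.5 cut-off, then map the
# indicator rows back to label names.
thresholded = (proba >= 0.5).astype(int)
decoded = mlb.inverse_transform(thresholded)

print(mlb.classes_)
print(proba[:3])
print(decoded[:3])
```

Note that the thresholded probabilities are not guaranteed to match clf.predict exactly, because SVC estimates its probabilities in a separate calibration step.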

I also ran into "ValueError: Multioutput target data is not supported with label binarization" with OneVsRestClassifier. In my case the cause was that the training data was a plain Python list; after converting it with np.array(), it worked.

For me, wrapping train_x, train_y, test_x and test_y in np.array() solved the problem.
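A minimal sketch of that conversion, assuming the data starts out as nested Python lists (the values below are made up for illustration). Whether the conversion is actually required probably depends on the scikit-learn version in use.

```python
import numpy as np

# Hypothetical training data held as plain nested lists.
train_x = [[1, 2, 3], [3, 3, 2], [8, 8, 7]]
train_y = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0]]

# Convert to 2-D arrays before passing them to clf.fit().
train_x_arr = np.array(train_x)
train_y_arr = np.array(train_y)

print(train_x_arr.shape, train_y_arr.shape)
```

The same conversion would apply to test_x and test_y before calling clf.predict() and classification_report().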
