
Classifying with scikit-learn for multi-valued output

Let us say I have chosen a single training document from a training set. I have put it into a feature vector X for my chosen features.

I am trying to do:

from sklearn.linear_model import LogisticRegression

self.clf = LogisticRegression()
self.clf.fit(X, Y)

My Y would be something like: [0 0 0 1 1 0 1 0 0 1 0]

I would like to train one single model so that it best fits each of the 11 output values simultaneously. This doesn't seem to work with fit: I get an unhashable type: 'list' error, because fit expects a single target value per sample (either binary or multi-class) and does not allow more than one value.

Is there any way to do this with scikit-learn?

Multi-label classification has a somewhat different API than ordinary classification. Your Y should be a sequence of sequences, e.g. a list of lists, like

Y = [["foo", "bar"],          # the first sample is a foo and a bar
     ["foo"],                 # the second is only a foo
     ["bar", "baz"]]          # the third is a bar and a baz

Such a Y can then be fed to an estimator that handles multiple classifications. You can construct such an estimator using the OneVsRestClassifier wrapper:

from sklearn.multiclass import OneVsRestClassifier
clf = OneVsRestClassifier(LogisticRegression())

then train with clf.fit(X, Y). clf.predict will now produce sequences of sequences as well.

UPDATE: as of scikit-learn 0.15, this API is deprecated because its input is ambiguous. You should convert the Y I gave above to a matrix with a MultiLabelBinarizer:

>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> mlb = MultiLabelBinarizer()
>>> mlb.fit_transform(Y)
array([[1, 0, 1],
       [0, 0, 1],
       [1, 1, 0]])

Then feed this to an estimator's fit method. Converting back is done with inverse_transform on the same binarizer:

>>> mlb.inverse_transform(mlb.transform(Y))
[('bar', 'foo'), ('foo',), ('bar', 'baz')]
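
To make the whole workflow concrete, here is a minimal self-contained sketch combining the binarizer with the OneVsRestClassifier wrapper; the small X below is made-up toy data, purely for illustration:

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy feature matrix: 3 samples, 2 features (invented values).
X = [[1.0, 0.0],
     [0.5, 0.5],
     [0.0, 1.0]]

# Label sets per sample, as in the example above.
Y = [["foo", "bar"],
     ["foo"],
     ["bar", "baz"]]

# Turn the label sets into a binary indicator matrix.
mlb = MultiLabelBinarizer()
Y_bin = mlb.fit_transform(Y)            # columns ordered bar, baz, foo

# One logistic regression per label column.
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, Y_bin)

# Predictions come back as an indicator matrix; map them back to label sets.
pred = clf.predict(X)
print(mlb.inverse_transform(pred))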

Could you please be more specific about what your task is? Is a label a fixed-length vector of binary variables? Then this would be called multi-label classification (i.e. multiple labels are either on or off). If each label can have more than two values, it is called "multi output" in scikit-learn and can only be done by trees and ensembles.

PS: if you use a linear classifier such as logistic regression, the output variables will be treated independently anyway.
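
For the multi-output case mentioned above (several output variables, each allowed to take more than two values), a minimal sketch with a tree ensemble could look like the following; the data is invented purely for illustration:

from sklearn.ensemble import RandomForestClassifier

# Toy data: 4 samples, 2 features (invented values).
X = [[0, 0], [0, 1], [1, 0], [1, 1]]

# Two output columns, each with more than two possible values.
Y = [[0, 2],
     [1, 2],
     [2, 0],
     [1, 1]]

# Tree-based estimators in scikit-learn accept a 2-D multi-output target directly.
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, Y)
print(clf.predict([[0, 1]]))   # one predicted value per output column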
