scikit-learn: is multilabel classification with the svm.SVC classifier possible without probability=True?

Question

I am trying to do multilabel classification with a Pipeline and OneVsRestClassifier in scikit-learn. Let me mention first that I construct my multilabel examples from a pandas DataFrame.

Code is below:

import random

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# fileIn is the path to the input CSV (defined elsewhere)
df = pd.read_csv(fileIn, header=0, encoding='utf-8-sig')

# Random 90/10 train/test split on the row index
rows = random.sample(list(df.index), int(len(df) * 0.9))
work = df.loc[rows]
work_test = df.drop(rows)

# Feature columns start with 'Change', label columns start with 'Corax'
change_cols = [c for c in df.columns if c.startswith('Change')]
corax_cols = [c for c in df.columns if c.startswith('Corax')]

# Each example is the comma-joined text of its 'Change' columns
X_train = np.array([','.join(row.tolist()) for row in work[change_cols].values])
X_test = np.array([','.join(row.tolist()) for row in work_test[change_cols].values])

# Each example's label set is the list of values in its 'Corax' columns
y_train = [list(row) for row in work[corax_cols].values]
y_test = [list(row) for row in work_test[corax_cols].values]

lb = preprocessing.MultiLabelBinarizer()
Y = lb.fit_transform(y_train)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(SVC(kernel='rbf'))),
])

classifier.fit(X_train, Y)

predicted = classifier.predict(X_test)

But the issue is that this chain of transformations, CountVectorizer -> TfidfTransformer, produces a sparse matrix. When OneVsRestClassifier predicts labels, it looks for a decision_function or predict_proba method on the underlying estimator. predict_proba is not available on svm.SVC unless you specify probability=True. On the other hand, as far as I can see in the code, decision_function is not implemented for sparse matrices. So my code fails, since neither of these two required methods is available.

But maybe I am doing something wrong? Is it possible to achieve multilabel classification with svm.SVC without specifying probability=True (which adds significant overhead to classifier training), maybe by somehow forcing TfidfTransformer to output a dense matrix instead of a sparse one?

Answer 1

This is a well-known issue, and as of now no easy solution exists.

You can use Pipeline to "densify" your sparse data (by calling .toarray), but this can blow up memory consumption. You can also apply TruncatedSVD (AFAIK, it's the only dimensionality reduction method that works with sparse data), but it can distort your data so that the SVM's performance decreases.
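
The two workarounds could be sketched roughly as follows. This is only an illustration under stated assumptions: the toy documents and labels are invented, the `to_dense` helper is a hypothetical name, and `n_components=3` is an arbitrary choice that would need tuning on real data.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MultiLabelBinarizer
from sklearn.svm import SVC

# Toy data, invented for illustration only
docs = [
    "fix login bug",
    "add login page",
    "fix payment bug",
    "add payment page",
    "update docs",
    "fix docs typo",
]
labels = [
    ["bug", "auth"],
    ["auth"],
    ["bug", "billing"],
    ["billing"],
    ["docs"],
    ["bug", "docs"],
]

Y = MultiLabelBinarizer().fit_transform(labels)


def to_dense(X):
    """Hypothetical helper: turn a SciPy sparse matrix into a dense array."""
    return X.toarray()


# Option 1: densify the TF-IDF output before it reaches the SVM.
# Memory grows with n_samples * n_features.
dense_clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('densify', FunctionTransformer(to_dense, accept_sparse=True)),
    ('clf', OneVsRestClassifier(SVC(kernel='rbf'))),
])
dense_clf.fit(docs, Y)
dense_pred = dense_clf.predict(docs)

# Option 2: reduce dimensionality with TruncatedSVD, which accepts sparse
# input and returns a dense array with n_components columns.
svd_clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('svd', TruncatedSVD(n_components=3, random_state=0)),
    ('clf', OneVsRestClassifier(SVC(kernel='rbf'))),
])
svd_clf.fit(docs, Y)
svd_pred = svd_clf.predict(docs)
```

Both pipelines return one row per document and one column per label. The first keeps every feature at the cost of memory, while the second trades some information for a much smaller dense matrix.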

Solution 1
1 Accepted 2014-12-10 09:43:42
