Multilabel text classification with Sklearn

Question

I have already tried everything I can think of to solve my multilabel text classification problem in Python, and I would really appreciate any help. I based my approach on the answer here that uses MultiLabelBinarizer, and on this web page.

I am trying to predict categories in a dataset written in Spanish with 7 different labels; my dataset is shown here. Each row has a written message and its labels. Each message has either one or two labels, depending on the message.

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split

# Keep only the label columns as the target
df2 = df.copy()
df2.drop(["mensaje", "pregunta_parseada", "tags_totales"], axis=1, inplace=True)

# Divide into train and test
X_train, X_test, y_train, y_test = train_test_split(df['pregunta_parseada'],
                                                    df2,
                                                    test_size=0.15,
                                                    random_state=42)

# The vectorizer settings are not shown in the post; defaults are assumed here
tfidf = TfidfVectorizer()

features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train

features_test = tfidf.transform(X_test).toarray()
labels_test = y_test


from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier


lr = LogisticRegression(solver='sag', n_jobs=1)
clf = OneVsRestClassifier(lr)

# fit model on train data
clf.fit(features_train, labels_train)

# make predictions for validation set
y_pred = clf.predict(features_test)

So far, so good, but when I try to validate the model, almost every message seems to be classified as "None" (no label at all):

y_pred[2]
accuracy_score(y_test,y_pred)

Output

array([0, 0, 0, 0, 0, 0, 0])
0.2574626865671642

I also tried MultiLabelBinarizer and had the same problem. What am I doing wrong? Trying with MultiLabelBinarizer gave the following results:

z=[["Generico"],["Mantenimiento"],["Motor"],["Generico"],["Motor"], 
["Generico"],["Motor"],["Generico","Configuracion"],["Generico"], 
["Motor"],["Consumo"],...,["Consumo"]]

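from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Turn the lists of tags into a binary indicator matrix (one column per tag)
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(z)

message = df["pregunta_parseada"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(message,
                                                    y,
                                                    test_size=0.15,
                                                    random_state=42)
classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)
accuracy_score(y_test, predicted)
#predicted[150]
# Map the predicted indicator rows back to tuples of tag names
all_labels = mlb.inverse_transform(predicted)
all_labels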

With the following output:

 (),
 (),
 (),
 (),
 ('Generico',),
 (),
 (),
 (),
 (),
 ('Compra',),
 ('Motor', 'extras'),

Thank you so much for your help.

Answer 1

I think the problem is with your data. It could be too sparse.

I see you're using OneVsRestClassifier, which builds one binary classifier per tag to decide which tags apply.
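To make the "one binary classifier per tag" point concrete, here is a minimal sketch. It assumes synthetic data from make_multilabel_classification as a stand-in for the TF-IDF features and the 7 label columns in the question (the real dataset is not shown), and the variable names are only illustrative.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in (assumption): 200 samples, 7 label columns like the question.
X, Y = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=7, random_state=0
)

# Each column of Y becomes its own "this tag vs. everything else" problem.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, Y)

# One fitted binary estimator per label column.
n_binary_models = len(ovr.estimators_)
print(n_binary_models)
```

Every one of those estimators is trained on all rows, with the rows that lack its tag acting as negatives.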

I don't think there's a straightforward bug in your code; the choice of model is just not right for the task.

The problem with these binary classifiers is data imbalance. Even if you have exactly the same number of samples (n) for each of the c classes, each binary classifier splits the data into n positive samples versus (c-1) x n negative samples.

So, obviously, every classifier sees more data in the negative class than in the positive class. The classifiers are biased towards the negative class, and as a result each binary classifier tends to predict the negative class (the "all" side in the one-vs-all scenario) for most cases.
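As a quick back-of-the-envelope check of that imbalance, here is a small sketch. The numbers are hypothetical (100 single-label samples for each of 7 classes), and one_vs_rest_counts is just an illustrative helper, not anything from sklearn.

```python
def one_vs_rest_counts(n_per_class, n_classes):
    """Return (positives, negatives) seen by each one-vs-rest binary classifier,
    assuming a perfectly balanced, single-label dataset."""
    positives = n_per_class
    negatives = (n_classes - 1) * n_per_class
    return positives, negatives

# Hypothetical numbers: 7 classes with 100 samples each.
pos, neg = one_vs_rest_counts(n_per_class=100, n_classes=7)
positive_share = pos / (pos + neg)
print(pos, neg, round(positive_share, 3))  # 100 600 0.143
```

So even in the best case each binary model sees roughly one positive for every six negatives, and real tag distributions are usually more skewed than that.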

If you don't want to change your setup, then one thing you can do is:

Instead of predict, use predict_proba to get the probability per class and set a lower threshold (<0.5) to decide which set of classes to choose.
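A rough sketch of what that could look like, under some assumptions: the data is synthetic (make_multilabel_classification standing in for the TF-IDF features and the 7 label columns), and the threshold of 0.3 is an arbitrary illustrative value, not a recommendation. Note that this works with LogisticRegression; LinearSVC does not expose predict_proba, so the second pipeline in the question would need decision_function or a probabilistic estimator instead.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in (assumption) for the features and the 7 binary label columns.
X, Y = make_multilabel_classification(
    n_samples=400, n_features=30, n_classes=7, n_labels=1,
    allow_unlabeled=False, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.15, random_state=42
)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Per-label probabilities, shape (n_samples, n_labels); rows need not sum to 1.
proba = clf.predict_proba(X_test)

threshold = 0.3  # hypothetical value; pick it on held-out data
y_pred_default = (proba >= 0.5).astype(int)
y_pred_lowered = (proba >= threshold).astype(int)

# Count the rows that end up with no label at all under each threshold.
empty_default = int((y_pred_default.sum(axis=1) == 0).sum())
empty_lowered = int((y_pred_lowered.sum(axis=1) == 0).sum())
print(empty_default, empty_lowered)
```

Lowering the threshold can only add predicted labels, so the number of empty predictions cannot go up; the trade-off is more false positives.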

Your test accuracy is pretty low; maybe re-adjust the threshold to get better accuracy.
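One way to re-adjust the threshold is to sweep a few values on a validation split and compare scores. The sketch below again assumes synthetic data and an arbitrary grid of thresholds. Keep in mind that accuracy_score on multilabel targets is subset accuracy (every label of a row must match), so it is a harsh metric; micro-averaged F1 is shown next to it as one possible alternative, not as the metric the question must use.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in (assumption) for the real features and labels.
X, Y = make_multilabel_classification(
    n_samples=400, n_features=30, n_classes=7, n_labels=1,
    allow_unlabeled=False, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, Y, test_size=0.25, random_state=42
)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_val)

# Score each candidate threshold on the validation split.
results = {}
for threshold in (0.1, 0.2, 0.3, 0.4, 0.5):
    y_pred = (proba >= threshold).astype(int)
    subset_acc = accuracy_score(y_val, y_pred)  # exact-match accuracy per row
    micro_f1 = f1_score(y_val, y_pred, average="micro", zero_division=0)
    results[threshold] = (subset_acc, micro_f1)

# Keep the threshold with the best micro-F1 (swap the index to use accuracy).
best_threshold = max(results, key=lambda t: results[t][1])
print(best_threshold, results[best_threshold])
```

Tune on a validation split rather than on the test set, otherwise the reported test score will be optimistic.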

If possible, use a deep-learning-based approach such as BERT, which should give much better performance.

Solution 1
1 2020-05-17 09:14:38
