
Multilabel text classification with Sklearn

I have already tried everything I can think of to solve my multilabel text classification problem in Python, and I would really appreciate any help. I based my approach on the answer here, which uses MultiLabelBinarizer, and on this web page.

I am trying to predict categories for a dataset written in Spanish with 7 different labels; the dataset is shown here. Each row contains a text message and its labels. Each message has either one or two labels, depending on the message.

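from sklearn.model_selection import train_test_split

# Drop the text columns so that only the label columns remain as targets
df2 = df.copy()
df2.drop(["mensaje", "pregunta_parseada", "tags_totales"], axis=1, inplace=True)

# Divide into train and test
X_train, X_test, y_train, y_test = train_test_split(
    df["pregunta_parseada"], df2, test_size=0.15, random_state=42
)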

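from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# NOTE: `tfidf` was not defined in the original post; default settings are assumed here
tfidf = TfidfVectorizer()

features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train

features_test = tfidf.transform(X_test).toarray()
labels_test = y_test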


from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier


lr = LogisticRegression(solver='sag', n_jobs=1)
clf = OneVsRestClassifier(lr)

# fit model on train data
clf.fit(features_train, labels_train)

# make predictions for validation set
y_pred = clf.predict(features_test)

So far, so good, but when I try to validate the model, it seems that almost every message is classified as "None" (no category at all):

y_pred[2]
accuracy_score(y_test,y_pred)

Output:

array([0, 0, 0, 0, 0, 0, 0])
0.2574626865671642
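
One point that may help when reading the score above: for multilabel targets, scikit-learn's `accuracy_score` computes subset accuracy, so a row only counts as correct when every one of its labels matches. The sketch below uses small made-up arrays (not the dataset from this question) to compare it with two per-label metrics, `hamming_loss` and micro-averaged `f1_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Hypothetical indicator matrices: 4 samples, 3 labels (illustration only)
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 0, 0],   # missed the only label
                   [1, 0, 0],   # found one of two labels
                   [0, 0, 1]])

# Subset accuracy: only rows 0 and 3 match exactly
subset_acc = accuracy_score(y_true, y_pred)          # 0.5

# Fraction of individual label entries that are wrong (2 of 12 here)
ham = hamming_loss(y_true, y_pred)

# Micro-averaged F1 pools true/false positives over all labels
micro_f1 = f1_score(y_true, y_pred, average="micro")  # 0.75

print(subset_acc, ham, micro_f1)
```

A partially correct prediction, such as the third row, counts as fully wrong under subset accuracy, which is one reason the score can look low even when many individual labels are right.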

I also tried MultiLabelBinarizer and had the same problem. What am I doing wrong? Using MultiLabelBinarizer gave the following results:

z=[["Generico"],["Mantenimiento"],["Motor"],["Generico"],["Motor"], 
["Generico"],["Motor"],["Generico","Configuracion"],["Generico"], 
["Motor"],["Consumo"],...,["Consumo"]]

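from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Encode the list of tag lists as a binary indicator matrix
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(z)

message = df["pregunta_parseada"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(
    message, y, test_size=0.15, random_state=42
)

# Bag-of-words counts -> TF-IDF weights -> one LinearSVC per label
classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)
accuracy_score(y_test, predicted)
#predicted[150]

# Map the binary predictions back to tag names
all_labels = mlb.inverse_transform(predicted)
all_labels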

With the following output:

 (),
 (),
 (),
 (),
 ('Generico',),
 (),
 (),
 (),
 (),
 ('Compra',),
 ('Motor', 'extras'),

Thank you so much for your help.

I think the problem is with your data. It could be too sparse.

I see you're using OneVsRestClassifier, which builds one binary classifier per tag to decide which tags apply.

I think there's no straightforward bug in your code, but the choice of model is just not right for the task.

The problem with these binary classifiers is data imbalance. Even if you have exactly the same number of samples (n) for each of the c classes, each binary classifier splits the data into n positive samples versus n × (c − 1) negative samples.

So, for every classifier, there is obviously more data in the negative class than in the positive class. The classifiers are biased towards the negative class, and as a result each binary classifier tends to predict the negative class (the "all" side of the one-vs-all scenario) in most cases.
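
The imbalance described above can be checked with a few lines of NumPy. This is only an illustrative sketch: the values `n = 100` and `c = 7`, the helper `one_vs_rest_balance`, and the perfectly balanced single-label data are assumptions made for the example, not properties of the dataset in the question.

```python
import numpy as np

def one_vs_rest_balance(y):
    """Count positive and negative samples seen by each per-label binary classifier."""
    y = np.asarray(y)
    positives = y.sum(axis=0)            # samples carrying the label
    negatives = y.shape[0] - positives   # all remaining samples
    return positives, negatives

# Hypothetical balanced data: c = 7 classes, n = 100 single-label samples each
n, c = 100, 7
labels = np.repeat(np.arange(c), n)
y = np.eye(c, dtype=int)[labels]         # indicator matrix, shape (n * c, c)

positives, negatives = one_vs_rest_balance(y)
for k in range(c):
    print(f"label {k}: {positives[k]} positive vs {negatives[k]} negative")
```

With these numbers every binary classifier sees 100 positives against 600 negatives, a 1:6 ratio, even though the classes themselves are perfectly balanced.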

If you don't want to change your setup, one thing you can do is:

  1. Instead of predict, use predict_proba to get the probability per class, and set a lower threshold (< 0.5) to decide which classes to choose.

Your test accuracy is pretty low, so consider re-adjusting the threshold to get better accuracy.

  2. If possible, use a deep-learning-based approach such as BERT, which will likely give much better performance.
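
The thresholding idea from the first option can be sketched as follows. This is a minimal example under stated assumptions: the data is synthetic (generated with `make_multilabel_classification` as a stand-in for the TF-IDF features), the helper `predict_with_threshold` is hypothetical, and the threshold `0.3` is an arbitrary illustrative value that would need tuning on validation data. It uses `LogisticRegression`, as in the first snippet of the question, because `LinearSVC` does not provide `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the vectorized messages and the 7-label indicator matrix
X, y = make_multilabel_classification(
    n_samples=500, n_features=50, n_classes=7, n_labels=1, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# One probability per (sample, label) pair, shape (n_samples, n_labels)
proba = clf.predict_proba(X_test)

def predict_with_threshold(proba, threshold):
    """Assign every label whose probability reaches the threshold (hypothetical helper)."""
    return (proba >= threshold).astype(int)

default_pred = predict_with_threshold(proba, 0.5)   # the usual 0.5 cut-off
lowered_pred = predict_with_threshold(proba, 0.3)   # assumed value; tune on validation data

print("labels assigned at 0.5:", default_pred.sum())
print("labels assigned at 0.3:", lowered_pred.sum())
print("rows with no label at 0.5:", (default_pred.sum(axis=1) == 0).sum())
print("rows with no label at 0.3:", (lowered_pred.sum(axis=1) == 0).sum())
```

Lowering the threshold can only add labels, so it reduces the number of empty predictions at the cost of more false positives; the right trade-off depends on the data.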

