
Multi-label text classification with a non-uniform distribution of class labels for every training example

Original post: 2019-12-17 10:19:56 | Tags: python, classification, sentiment-analysis, text-classification, multilabel-classification

I have a multi-label classification problem: I want to classify texts with six labels. Each text can have one to six labels, but the label distribution is not uniform. For example, 10 people annotated sentence1 as below:

These labels are the number of votes for each class. I can normalize them, e.g. sad 0.7, anger 0.2, fear 0.1, happy 0.0, ...

What is the best classifier for this problem? What is the best representation for the labels, i.e. should I normalize them or not?

What keywords should I search for to find this kind of multi-label classification problem, where the label probabilities are not equal?

1 answer

Well, first, let me check that I understand your problem correctly. You have sentences = [sent1, sent2, ..., sentn] and you want to classify them into six labels, labels = [l1, l2, ..., l6]. Your data isn't the labels themselves, but the probability of each label applying to the text. You also mentioned that the six labels come from human annotation (I don't know what you mean by "10 people commented"; I'll guess it means annotation).

If this is the case, you can treat the problem from either a multi-label classification or a multi-target regression perspective. Here is what you can do with your data in both cases:

Multilabel Classification: In this case, you need to define the classes for each sentence so that you can train your model. Right now you have only the probabilities. You can define the classes by choosing a threshold: every label whose probability is above the threshold is considered a label for the sentence. You can read more about the evaluation metrics here.
Multi-target Regression: In this case, you don't need to define the classes; you just use the training input to predict the probability of each label directly. I think it is a better and easier problem, given how your data was collected. If you want to know more about multi-target regression, you can read more about it here, but be aware that the models used in that tutorial are not state of the art.
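
The two ways of preparing targets described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the vote counts for sentence1 are inferred from the normalized example in the question (sad 0.7, anger 0.2, fear 0.1, happy 0.0), the last two label names are placeholders because the question names only four, and the threshold of 0.15 is an arbitrary choice that would need tuning on real data.

```python
# Label order is an assumption; "label5" and "label6" are placeholders
# because the question names only four of the six labels.
LABELS = ["sad", "anger", "fear", "happy", "label5", "label6"]


def votes_to_probabilities(votes):
    """Normalize one sentence's vote counts so that they sum to 1."""
    total = sum(votes)
    if total == 0:
        raise ValueError("a sentence needs at least one vote")
    return [count / total for count in votes]


def probabilities_to_labels(probabilities, threshold=0.15):
    """Binarize: 1 where the probability reaches the (arbitrary) threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical vote counts from 10 annotators for sentence1.
sentence1_votes = [7, 2, 1, 0, 0, 0]

# Target for multi-target regression: the probabilities themselves.
regression_target = votes_to_probabilities(sentence1_votes)

# Target for multi-label classification: thresholded 0/1 indicators.
classification_target = probabilities_to_labels(regression_target)

print(dict(zip(LABELS, regression_target)))
print(dict(zip(LABELS, classification_target)))
```

Note that thresholding throws away the difference between a label chosen by 7 annotators and one chosen by 2, which is one reason the regression view may fit this kind of data better.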

Training Models: You can use both shallow and deep models for this task. You need a model that receives a sentence as input and predicts six labels or six probabilities. I suggest you take a look at this example; it can be a very good starting point for your work. The author provides a tutorial on how to build a multi-label text classifier using deep neural networks. He basically built an LSTM with a feed-forward layer at the end to classify the labels. If you decide to use regression instead of classification, you can just drop the activation at the end.
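
As a rough illustration of the shallow option, here is a minimal multi-target regression baseline using scikit-learn. It is not the LSTM from the linked tutorial: it swaps in TF-IDF features with ridge regression, which handles several targets at once. The sentences, target rows, and `alpha` value are invented for the sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Toy data invented for illustration: each target row holds six
# label probabilities (normalized annotator votes) for one sentence.
sentences = [
    "I lost my keys and missed the train",
    "What a wonderful sunny day",
    "That noise in the dark scared me",
    "I am so happy with the results",
]
targets = [
    [0.7, 0.2, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.9, 0.1, 0.0],
    [0.1, 0.0, 0.8, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
]

# TF-IDF turns each sentence into a sparse vector; Ridge fits one
# linear regressor per target column in a single call.
model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
model.fit(sentences, targets)

# One row of six scores per input sentence.
predictions = model.predict(["a happy sunny day"])
print(predictions.shape)
```

The raw outputs of a linear regressor are not guaranteed to lie in [0, 1] or to sum to 1, so they may need clipping or renormalizing before being read as probabilities.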

The best results are likely to be obtained with deep neural networks, so the article I linked can work very well. I also suggest you take a look at state-of-the-art methods for text classification, such as BERT or XLNet. I implemented a multi-label classification method using BERT; maybe it can be helpful to you.