How to encode data with multiple class labels?

Question

I have a classification problem with multiple classes, say A, B, C and D. My data has the following y labels:

y0 = [['A'], ['B'], ['A','D'], ['A'], ['A','C','D'], ['D'], ..., ['C'], ['A','B','C','D'] , ['B']]

I want to train a Random Forest classifier on these labels. First I need to encode the labels, so I tried LabelEncoder:

from sklearn.preprocessing import OneHotEncoder, LabelEncoder
le = LabelEncoder()
le.fit_transform(y0)
# encoded labels: array([0, 1, 2, 0, 3, 4, ... 5, 6, 1], dtype=int64)

I also tried OneHotEncoder, but neither LabelEncoder nor OneHotEncoder works here: I cannot encode samples that carry multiple class labels (e.g. ['A','B','C']). I guess these simple encoders are not the way to go, so what is the best way to encode my class labels? To clarify, I don't want to treat e.g. ['A','B'] as a completely different class from ['A'] or ['B']. I want it to be a different class that still inherits features from both the A and B classes.

Answer 1

This kind of problem is called multilabel (as opposed to multiclass, where each sample has exactly one class label), and sklearn expects multilabel problems to have the target encoded as a binary array of shape (n_samples, n_labels). You can encode your data in that format using MultiLabelBinarizer.
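As a minimal sketch of that approach: the sample y0 below is a shortened, made-up version of the list in the question, and the feature matrix X is random placeholder data (an assumption, since the question does not show its features). RandomForestClassifier accepts the binary indicator array directly as a multilabel target.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Shortened, illustrative version of y0 from the question.
y0 = [['A'], ['B'], ['A', 'D'], ['A'], ['A', 'C', 'D'],
      ['D'], ['C'], ['A', 'B', 'C', 'D'], ['B']]

# One column per label; a row has a 1 wherever the sample carries that label.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y0)
print(mlb.classes_)  # column order of Y
print(Y[:3])

# Placeholder features (assumption): replace with the real feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(len(y0), 3))

# The forest is fitted on the (n_samples, n_labels) indicator target.
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, Y)
pred = clf.predict(X)

# Turn indicator rows back into tuples of label names.
decoded = mlb.inverse_transform(pred)
```

With this encoding ['A','D'] shares its A column with ['A'] and its D column with ['D'], which is the "inherits from both" behaviour the question asks for.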

Answer 2

Instead of using OneHotEncoder or LabelEncoder, you can use OrdinalEncoder, which encodes categorical features as an integer array.

The resulting classes will be on an ordinal scale, for example in alphabetical order: A, AB, AD, etc.

The question is then whether AB is more similar to AC or to AD. Alphabetical order may not reflect real similarity the way a true ordinal scale such as 'cold', 'warm', 'hot' does, so manual encoding and reordering should be used. These details require some domain knowledge.
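A rough sketch of this idea, under the assumption that each label set is first joined into a single string such as 'AD' (that joining step and the variable names are mine, not part of the answer). Note that this treats every combination as its own class, which the question says it wants to avoid.

```python
from sklearn.preprocessing import OrdinalEncoder

# Shortened, illustrative version of y0 from the question.
y0 = [['A'], ['B'], ['A', 'D'], ['A'], ['A', 'C', 'D'],
      ['D'], ['C'], ['A', 'B', 'C', 'D'], ['B']]

# Assumption: join each label set into one string so it is a single category.
# OrdinalEncoder expects 2D input, hence one single-element row per sample.
combos = [[''.join(sorted(labels))] for labels in y0]

enc = OrdinalEncoder()
codes = enc.fit_transform(combos)  # float array of shape (n_samples, 1)
print(enc.categories_[0])          # default order is sorted (alphabetical)

# Manual ordering: pass the categories explicitly in the order you want.
temp_enc = OrdinalEncoder(categories=[['cold', 'warm', 'hot']])
temp_codes = temp_enc.fit_transform([['hot'], ['cold'], ['warm']])
```

The explicit categories argument is where the domain knowledge goes: the position of each value in that list becomes its code.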

Answer 1
1 (accepted) 2020-07-10 00:20:51

Answer 2
0 2020-07-09 22:56:36
