How to encode data with multiple class labels?
I have a classification problem with multiple classes, say A, B, C and D. My data has the following y labels:
y0 = [['A'], ['B'], ['A','D'], ['A'], ['A','C','D'], ['D'], ..., ['C'], ['A','B','C','D'] , ['B']]
I want to train a Random Forest classifier on these labels. First I need to encode the labels. I first tried LabelEncoder:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
le = LabelEncoder()
le.fit_transform(y0)
# encoded labels: array([0, 1, 2, 0, 3, 4, ... 5, 6, 1], dtype=int64)
I also tried OneHotEncoder, but obviously, neither LabelEncoder nor OneHotEncoder would work here. The thing is that I cannot encode data with multiple class labels (e.g. ['A','B','C']). I guess these trivial encoding methods are not the way to go here, so what is the best way to encode my class labels? To clarify, I don't want to treat e.g. ['A','B'] as a completely different class from ['A'] or ['B']. I want it to be a different class but at the same time still inherit features from both A and B classes.
This kind of problem is called multilabel (as opposed to multiclass, where each sample has exactly one class label), and sklearn expects multilabel problems to have the target encoded as a binary array of shape (n_samples, n_labels). You can encode your data in that format using MultiLabelBinarizer.
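A minimal sketch of that encoding, assuming a shortened copy of y0 from the question (the elided samples are left out) and a made-up feature matrix X of random numbers, used only so the example runs end to end. It is not the asker's actual data or pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Shortened stand-in for y0 from the question; the "..." samples are omitted.
y0 = [['A'], ['B'], ['A', 'D'], ['A'], ['A', 'C', 'D'],
      ['D'], ['C'], ['A', 'B', 'C', 'D'], ['B']]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y0)      # binary indicator array, shape (n_samples, n_labels)

print(mlb.classes_)            # ['A' 'B' 'C' 'D']
print(Y[2])                    # [1 0 0 1], i.e. ['A', 'D']

# Hypothetical features: random values, an assumption for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(len(y0), 3))

# RandomForestClassifier accepts a 2-D indicator target directly.
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, Y)

pred = clf.predict(X[:2])                  # shape (2, n_labels), entries 0 or 1
decoded = mlb.inverse_transform(pred)      # back to tuples of label names
```

Each column of Y is one label, so a sample tagged ['A', 'D'] shares the A column with samples tagged ['A'] alone, which is the "inherit from both classes" behaviour the question asks for.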
Instead of using OneHotEncoder or LabelEncoder, you can use OrdinalEncoder, which encodes categorical features as an integer array.
The resulting classes will be on an ordinal scale, for example in alphabetical order: A, AB, AD, etc.
The question may be whether AB is more similar to AC or to AD. Alphabetical order may not reflect real similarity the way an ordinal scale such as 'cold','warm','hot' does, so manual encoding and reordering should be used. But these details require some domain knowledge.
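A rough sketch of what this could look like, assuming each label set is first joined into a single string (so ['A','D'] becomes 'AD') and using a shortened copy of y0. The manual_order list is purely hypothetical and stands in for whatever ordering domain knowledge would suggest. Note that this treats every combination as its own class, which the question says it would rather avoid.

```python
from sklearn.preprocessing import OrdinalEncoder

# Shortened stand-in for y0 from the question; the "..." samples are omitted.
y0 = [['A'], ['B'], ['A', 'D'], ['A'], ['A', 'C', 'D'],
      ['D'], ['C'], ['A', 'B', 'C', 'D'], ['B']]

# Join each label set into one string so every sample has a single category.
combos = [[''.join(sorted(labels))] for labels in y0]   # shape (n_samples, 1)

# Default behaviour: categories are sorted, so codes follow alphabetical order.
default_enc = OrdinalEncoder()
default_codes = default_enc.fit_transform(combos).ravel()
print(default_enc.categories_[0])   # ['A' 'ABCD' 'ACD' 'AD' 'B' 'C' 'D']

# Hypothetical manual ordering (an assumption, not taken from the thread).
manual_order = ['A', 'B', 'C', 'D', 'AD', 'ACD', 'ABCD']
manual_enc = OrdinalEncoder(categories=[manual_order])
manual_codes = manual_enc.fit_transform(combos).ravel()
print(manual_codes)                 # [0. 1. 4. 0. 5. 3. 2. 6. 1.]
```

With the default ordering 'ABCD' lands next to 'A' and far from 'D', which shows why alphabetical codes may not reflect real similarity between classes.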