
How to encode data with multiple class labels?

I have a classification problem with multiple classes, say A, B, C and D. My data has the following y labels:

y0 = [['A'], ['B'], ['A','D'], ['A'], ['A','C','D'], ['D'], ..., ['C'], ['A','B','C','D'] , ['B']]

I want to train a Random Forest classifier on these labels, so I first need to encode them. I started with LabelEncoder:

from sklearn.preprocessing import OneHotEncoder, LabelEncoder
le = LabelEncoder()
le.fit_transform(y0)
# encoded labels: array([0, 1, 2, 0, 3, 4, ... 5, 6, 1], dtype=int64)

I also tried OneHotEncoder, but neither LabelEncoder nor OneHotEncoder works here. The problem is that I cannot encode samples that carry several class labels at once (e.g. ['A','B','C']). I guess these simple encoders are not the way to go, so what is the best way to encode my class labels? To clarify, I don't want to treat e.g. ['A','B'] as a completely different class from ['A'] or ['B']. I want it to be a distinct class that still inherits features from both the A and B classes.

This kind of problem is called multilabel classification (as opposed to multiclass, where each sample has exactly one class label). sklearn expects the target of a multilabel problem to be encoded as a binary array of shape (n_samples, n_labels), where entry (i, j) is 1 if sample i carries label j and 0 otherwise. You can encode your data in that format using MultiLabelBinarizer.
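As a minimal sketch of that encoding, the snippet below binarizes only the label sets that are written out in the question (the elided ones are left out) and fits a RandomForestClassifier on the result. The feature matrix X is a random placeholder invented for illustration, since the question does not show any features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Only the label sets spelled out in the question; the elided part is omitted.
y0 = [['A'], ['B'], ['A', 'D'], ['A'], ['A', 'C', 'D'], ['D'],
      ['C'], ['A', 'B', 'C', 'D'], ['B']]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y0)  # binary indicator matrix, shape (n_samples, n_labels)

print(mlb.classes_)  # ['A' 'B' 'C' 'D']
print(Y[:3])
# [[1 0 0 0]
#  [0 1 0 0]
#  [1 0 0 1]]

# Placeholder features (assumption): 5 random columns, one row per sample.
rng = np.random.default_rng(0)
X = rng.normal(size=(len(y0), 5))

# RandomForestClassifier accepts a 2-D indicator matrix as a multilabel target.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, Y)

pred = clf.predict(X)                     # same indicator layout as Y
pred_labels = mlb.inverse_transform(pred)  # back to tuples such as ('A', 'D')
```

With this layout ['A','D'] shares the A column with ['A'] and the D column with ['D'], which is the "inherits from both classes" behaviour the question asks for.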

Instead of using OneHotEncoder or LabelEncoder, you can use OrdinalEncoder, which encodes categorical features as an integer array.

The resulting classes will be on an ordinal scale, by default in alphabetical order: A, AB, AD, etc.

The open question is whether AB is more similar to AC or to AD. Alphabetical order may not reflect real similarity the way a true ordinal scale such as 'cold', 'warm', 'hot' does, so you may need to define the encoding and ordering manually. Those details require some domain knowledge.
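A rough sketch of this approach follows, assuming each label set is first joined into a single string such as 'AD' (a step the answer does not spell out). It shows the default alphabetical ordering and then a manual ordering; the list manual_order is purely hypothetical and would have to come from domain knowledge. Note that every combination gets its own code here, so ['A','D'] is treated as a single class of its own.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Only the label sets spelled out in the question; the elided part is omitted.
y0 = [['A'], ['B'], ['A', 'D'], ['A'], ['A', 'C', 'D'], ['D'],
      ['C'], ['A', 'B', 'C', 'D'], ['B']]

# Assumption: collapse each label set into one string, e.g. ['A', 'D'] -> 'AD'.
# OrdinalEncoder expects a 2-D input, hence one column per sample row.
combos = np.array([[''.join(sorted(labels))] for labels in y0], dtype=object)

# Default behaviour: categories are sorted alphabetically.
default_enc = OrdinalEncoder()
default_codes = default_enc.fit_transform(combos).ravel()
print(default_enc.categories_[0])
# ['A' 'ABCD' 'ACD' 'AD' 'B' 'C' 'D']

# Hypothetical manual ordering (illustration only, not from the source).
manual_order = ['A', 'B', 'C', 'D', 'AD', 'ACD', 'ABCD']
manual_enc = OrdinalEncoder(categories=[manual_order])
manual_codes = manual_enc.fit_transform(combos).ravel()
print(manual_codes)
# [0. 1. 4. 0. 5. 3. 2. 6. 1.]
```

Any combination missing from manual_order raises an error at fit time with the default handle_unknown setting, so the list has to cover every label set that occurs in the data.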
