
Why shouldn't the sklearn LabelEncoder be used to encode input data?

The docs for sklearn.LabelEncoder start with:

This transformer should be used to encode target values, i.e. y, and not the input X.

Why is this?

I'll post just one example of this recommendation being ignored in practice, although there seem to be plenty more. https://www.kaggle.com/matleonard/feature-generation contains:

# (ks is the input data)
from sklearn.preprocessing import LabelEncoder

# Label encoding
cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()
encoded = ks[cat_features].apply(encoder.fit_transform)
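One practical wrinkle with the snippet above: `.apply(encoder.fit_transform)` re-fits the same `LabelEncoder` on every column, so afterwards `encoder` only remembers the last column's mapping and nothing can be reliably inverse-transformed or reused at prediction time. A minimal sketch of keeping one fitted encoder per column (the `ks` data here is a made-up stand-in for the Kaggle dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the Kaggle `ks` data.
ks = pd.DataFrame({
    'category': ['music', 'film', 'music', 'games'],
    'currency': ['USD', 'EUR', 'USD', 'GBP'],
    'country': ['US', 'DE', 'US', 'GB'],
})

cat_features = ['category', 'currency', 'country']

# Fit one encoder per column and keep them, so each mapping survives
# and can be applied to new data or inverted later.
encoders = {col: LabelEncoder().fit(ks[col]) for col in cat_features}
encoded = pd.DataFrame({col: enc.transform(ks[col])
                        for col, enc in encoders.items()})

print(encoded)
# Each column's mapping can still be inverted:
print(encoders['currency'].inverse_transform(encoded['currency']))
```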

Changing the target values y is not that big of a deal, because the model simply learns from whatever encoding y ends up in (for a regression, the error is computed against those values either way).

The problem is when it changes the input values X: an arbitrary integer encoding there distorts the weights the model assigns to feature values and makes correct predictions impossible.

You can get away with it on X if there are only a few options; for example, two categories, two currencies, or two cities encoded as ints does not change the game too much.

Maybe because:

  1. It doesn't naturally work on multiple columns at once.
  2. It doesn't support ordering. I.e. if your categories are ordinal, such as:

Awful, Bad, Average, Good, Excellent

LabelEncoder would give them an arbitrary order (alphabetical, in fact, since it sorts the labels), which will not help your classifier.
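That arbitrary order is easy to see: sorting the labels alphabetically scrambles the intended Awful-to-Excellent scale. A quick sketch:

```python
from sklearn.preprocessing import LabelEncoder

quality = ['Awful', 'Bad', 'Average', 'Good', 'Excellent']
codes = LabelEncoder().fit_transform(quality)

# Alphabetical order: Average=0, Awful=1, Bad=2, Excellent=3, Good=4,
# which has nothing to do with the intended ranking.
print(dict(zip(quality, codes.tolist())))
```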

In this case you could use either an OrdinalEncoder or a manual replacement.

1. OrdinalEncoder:

Encode categorical features as an integer array.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame(data=[['Bad', 200], ['Awful', 100], ['Good', 350], ['Average', 300], ['Excellent', 1000]], columns=['Quality', 'Label'])
# Use the 'categories' parameter to specify the desired order.
# Otherwise the order is inferred from the data (sorted alphabetically).
enc = OrdinalEncoder(categories=[['Awful', 'Bad', 'Average', 'Good', 'Excellent']])
enc.fit_transform(df[['Quality']])  # Can fit on one feature, or multiple features at once.

Output:

array([[1.],
       [0.],
       [3.],
       [2.],
       [4.]])

Notice the logical order in the output.

2. Manual replacement:

scale_mapper = {'Awful': 0, 'Bad': 1, 'Average': 2, 'Good': 3, 'Excellent': 4}
df['Quality'].replace(scale_mapper)

Output:

0    1
1    0
2    3
3    2
4    4
Name: Quality, dtype: int64

I think they warn against using it for X (input data) because:

  • Categorical input data are better encoded as one-hot vectors rather than integers in most cases, since you usually have non-sortable categories.

  • Second, a more technical problem is that LabelEncoder is not built to handle tables (column-wise/feature-wise encoding would be necessary for X); it assumes the data is just a flat, one-dimensional list. That will be the problem.
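The first point can be sketched with scikit-learn's OneHotEncoder, which gives each category its own binary column so no artificial ordering is introduced (the `country` column here is invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'country': ['US', 'DE', 'US', 'GB']})

ohe = OneHotEncoder()
# fit_transform returns a sparse matrix by default; densify for display.
onehot = ohe.fit_transform(X).toarray()

print(ohe.categories_)  # one column per distinct country
print(onehot)           # each row is a binary indicator vector
```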

from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()

categories = [x for x in 'abcdabaccba']
categories
## ['a', 'b', 'c', 'd', 'a', 'b', 'a', 'c', 'c', 'b', 'a']

categories_numerical = enc.fit_transform(categories)

categories_numerical
# array([0, 1, 2, 3, 0, 1, 0, 2, 2, 1, 0])

# so it maps the categories to numbers
# and can transform them back

enc.inverse_transform(categories_numerical)
# array(['a', 'b', 'c', 'd', 'a', 'b', 'a', 'c', 'c', 'b', 'a'], dtype='<U1')
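And the flat-list assumption means a 2D table is rejected outright. A short sketch of the failure:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Two categorical columns, as they would appear in an input table X.
table = np.array([['a', 'x'], ['b', 'y'], ['a', 'x']])

try:
    LabelEncoder().fit_transform(table)
except ValueError as err:
    # LabelEncoder expects a single 1D column of labels.
    print('LabelEncoder refused 2D input:', err)
```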
