
Why should LabelEncoder from sklearn be used only for the target variable?

I was trying to create a pipeline with a LabelEncoder to transform categorical values.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

cat_variable = Pipeline(steps = [
    ('imputer', SimpleImputer(strategy = 'most_frequent')),
    ('lencoder', LabelEncoder())
])

num_variable = SimpleImputer(strategy = 'mean')

preprocess = ColumnTransformer(transformers = [
    ('categorical', cat_variable, cat_columns),
    ('numerical', num_variable, num_columns)
])

model = RandomForestRegressor(n_estimators = 100, random_state = 0)

final_pipe = Pipeline(steps = [
    ('preprocessor', preprocess),
    ('model', model)
])

scores = -1 * cross_val_score(final_pipe, X_train, y, cv = 5, scoring = 'neg_mean_absolute_error')

But this throws a TypeError:


TypeError: fit_transform() takes 2 positional arguments but 3 were given

On further reading, I found out that transformers like LabelEncoder are not supposed to be used on features and should only be used on the prediction target.

From the documentation:

class sklearn.preprocessing.LabelEncoder

Encode target labels with value between 0 and n_classes-1.

This transformer should be used to encode target values, i.e. y, and not the input X.

My question is: why can we not use LabelEncoder on feature variables, and are there any other transformers with a restriction like this?

LabelEncoder can be used to normalize labels or to transform non-numerical labels into numerical ones. For categorical input features you should use OneHotEncoder instead.

The difference:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit_transform([1, 2, 2, 6])
array([0, 1, 1, 2])

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit_transform([[1], [2], [2], [6]]).toarray()
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

You can use OrdinalEncoder for categorical feature variables.
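As a quick sketch on the same toy values, OrdinalEncoder produces one integer-coded column per feature; unlike LabelEncoder, its fit/transform methods expect a 2-D X:

```python
from sklearn.preprocessing import OrdinalEncoder

# OrdinalEncoder expects a 2-D array of features, one column per feature.
enc = OrdinalEncoder()
print(enc.fit_transform([[1], [2], [2], [6]]))
# [[0.]
#  [1.]
#  [1.]
#  [2.]]
```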

LabelEncoder, by design, has to be used on the target variable and not on feature variables. This implies that the signatures of the .fit(), .transform() and .fit_transform() methods of the LabelEncoder class differ from those of the transformers that are meant to be applied on features.

fit(y) vs fit(X[, y])
transform(y) vs transform(X)
fit_transform(y) vs fit_transform(X[, y])

or, spelled out in full:

fit(self, y) vs fit(self, X, y=None)
transform(self, y) vs transform(self, X)
fit_transform(self, y) vs fit_transform(self, X, y=None)

respectively for LabelEncoder-like transformers (i.e. transformers to be applied on the target) and for transformers to be applied on features.

This same design also holds for LabelBinarizer and MultiLabelBinarizer. I would suggest reading the Transforming the prediction target (y) paragraph of the User Guide.

That said, here are a couple of considerations describing what happens when you try to use LabelEncoder in a Pipeline or in a ColumnTransformer:

  • Pipelines and ColumnTransformers are about transforming and fitting data, not targets. They somehow "assume" the target is already in a state that the estimator can use.

  • Within this github issue and the ones referenced in it you can follow the long-standing discussion about making it possible for pipelines to transform the target, too. This is also summarized in this sklearn FAQ.
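    As a side note, sklearn does provide a dedicated wrapper for transforming a regression target outside of Pipeline: sklearn.compose.TransformedTargetRegressor. A minimal sketch (the data here is made up for illustration):

    ```python
    import numpy as np
    from sklearn.compose import TransformedTargetRegressor
    from sklearn.linear_model import LinearRegression

    # The wrapper applies func to y before fitting and inverse_func
    # to the predictions, keeping target transformation out of Pipeline.
    reg = TransformedTargetRegressor(regressor=LinearRegression(),
                                     func=np.log1p, inverse_func=np.expm1)

    X = np.arange(8, dtype=float).reshape(-1, 1)
    y = np.expm1(0.5 * X.ravel())  # toy target, exactly log-linear in X

    reg.fit(X, y)
    print(reg.predict(X[:2]).shape)  # (2,)
    ```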

  • The specific reason you're getting TypeError: fit_transform() takes 2 positional arguments but 3 were given is the following (seen here from the perspective of a ColumnTransformer): when calling either .fit_transform() or .fit() on the ColumnTransformer instance, method ._fit_transform() is called in turn on X and y, and it triggers the call of ._fit_transform_one(), which is where the error arises. Indeed, it calls .fit_transform() on the transformer instance (your LabelEncoder); here the different method signature comes into play:

     with _print_elapsed_time(message_clsname, message):
         if hasattr(transformer, "fit_transform"):
             res = transformer.fit_transform(X, y, **fit_params)
         else:
             res = transformer.fit(X, y, **fit_params).transform(X)

    Indeed, .fit_transform() is called with (self, X, y) ("[...] 3 were given") while only (self, y) is expected ("[...] takes 2 positional arguments"). Following the code within the Pipeline class, it can be seen that the same happens there.
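    As an illustrative sketch, the same TypeError can be reproduced without any pipeline by calling LabelEncoder's fit_transform the way a pipeline would, passing both X and y:

    ```python
    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()
    try:
        # A Pipeline/ColumnTransformer calls fit_transform(X, y); LabelEncoder
        # only accepts fit_transform(y), so the extra argument raises TypeError.
        le.fit_transform(["a", "b", "a"], [0, 1, 0])
    except TypeError as err:
        print(type(err).__name__)  # TypeError
    ```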

  • As already mentioned, an alternative to label-encoding that is applicable to feature variables (and therefore usable in pipelines and column transformers) is the OrdinalEncoder (available from version 0.20). On this point, I would suggest reading Difference between OrdinalEncoder and LabelEncoder.
