
Why should LabelEncoder from sklearn be used only for the target variable?

I was trying to create a pipeline with a LabelEncoder to transform categorical values.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

cat_variable = Pipeline(steps = [
    ('imputer', SimpleImputer(strategy = 'most_frequent')),
    ('lencoder', LabelEncoder())
])

num_variable = SimpleImputer(strategy = 'mean')

preprocess = ColumnTransformer(transformers = [
    ('categorical', cat_variable, cat_columns),
    ('numerical', num_variable, num_columns)
])

model = RandomForestRegressor(n_estimators = 100, random_state = 0)

final_pipe = Pipeline(steps = [
    ('preprocessor', preprocess),
    ('model', model)
])

scores = -1 * cross_val_score(final_pipe, X_train, y, cv = 5, scoring = 'neg_mean_absolute_error')

But this throws a TypeError:


TypeError: fit_transform() takes 2 positional arguments but 3 were given

On further reading, I found out that transformers like LabelEncoder are not supposed to be used on features and should only be used on the prediction target.

From the documentation:

class sklearn.preprocessing.LabelEncoder

Encode target labels with value between 0 and n_classes-1.

This transformer should be used to encode target values, i.e. y, and not the input X.

My question is: why can we not use LabelEncoder on feature variables, and are there any other transformers with a restriction like this?

LabelEncoder can be used to normalize labels or to transform non-numerical labels into numerical ones. For categorical input features you should use OneHotEncoder instead.

The difference:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit_transform([1, 2, 2, 6])
array([0, 1, 1, 2])

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit_transform([[1], [2], [2], [6]]).toarray()
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

You can use OrdinalEncoder for categorical feature variables.
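As a quick sketch on the same toy values, OrdinalEncoder produces one integer-coded column per feature; unlike LabelEncoder, its fit/transform methods expect a 2-D X:

```python
from sklearn.preprocessing import OrdinalEncoder

# OrdinalEncoder expects a 2-D array of features, one column per feature.
enc = OrdinalEncoder()
print(enc.fit_transform([[1], [2], [2], [6]]))
# [[0.]
#  [1.]
#  [1.]
#  [2.]]
```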

LabelEncoder, by design, has to be used on the target variable and not on feature variables. This implies that the signatures of the .fit(), .transform() and .fit_transform() methods of the LabelEncoder class differ from those of the transformers that are meant to be applied on features.

fit(y) vs fit(X[, y])
transform(y) vs transform(X)
fit_transform(y) vs fit_transform(X[, y])

or, spelled out in full:

fit(self, y) vs fit(self, X, y=None)
transform(self, y) vs transform(self, X)
fit_transform(self, y) vs fit_transform(self, X, y=None)

respectively for LabelEncoder-like transformers (i.e. transformers to be applied on the target) and for transformers to be applied on features.

This same design also holds for LabelBinarizer and MultiLabelBinarizer. I would suggest reading the Transforming the prediction target (y) paragraph of the User Guide.

That said, here are a couple of considerations describing what happens when you try to use LabelEncoder in a Pipeline or in a ColumnTransformer:

  • Pipelines and ColumnTransformers are about transforming and fitting data, not targets. They somehow "assume" the target is already in a state that the estimator can use.

  • Within this github issue and the ones referenced in it you can follow the long-standing discussion about making it possible for pipelines to transform the target, too. This is also summarized in this sklearn FAQ.
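    As a side note, sklearn does provide a dedicated wrapper for transforming a regression target outside of Pipeline: sklearn.compose.TransformedTargetRegressor. A minimal sketch (the data here is made up for illustration):

    ```python
    import numpy as np
    from sklearn.compose import TransformedTargetRegressor
    from sklearn.linear_model import LinearRegression

    # The wrapper applies func to y before fitting and inverse_func
    # to the predictions, keeping target transformation out of Pipeline.
    reg = TransformedTargetRegressor(regressor=LinearRegression(),
                                     func=np.log1p, inverse_func=np.expm1)

    X = np.arange(8, dtype=float).reshape(-1, 1)
    y = np.expm1(0.5 * X.ravel())  # toy target, exactly log-linear in X

    reg.fit(X, y)
    print(reg.predict(X[:2]).shape)  # (2,)
    ```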

  • The specific reason you're getting TypeError: fit_transform() takes 2 positional arguments but 3 were given is the following (seen here from the perspective of a ColumnTransformer): when calling either .fit_transform() or .fit() on the ColumnTransformer instance, method ._fit_transform() is called in turn on X and y, and it triggers the call of ._fit_transform_one(), which is where the error arises. Indeed, it calls .fit_transform() on the transformer instance (your LabelEncoder); here the different method signature comes into play:

     with _print_elapsed_time(message_clsname, message):
         if hasattr(transformer, "fit_transform"):
             res = transformer.fit_transform(X, y, **fit_params)
         else:
             res = transformer.fit(X, y, **fit_params).transform(X)

    Indeed, .fit_transform() is called with (self, X, y) ("[...] 3 were given") while only (self, y) is expected ("[...] takes 2 positional arguments"). Following the code within the Pipeline class, it can be seen that the same happens there.
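    As an illustrative sketch, the same TypeError can be reproduced without any pipeline by calling LabelEncoder's fit_transform the way a pipeline would, passing both X and y:

    ```python
    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()
    try:
        # A Pipeline/ColumnTransformer calls fit_transform(X, y); LabelEncoder
        # only accepts fit_transform(y), so the extra argument raises TypeError.
        le.fit_transform(["a", "b", "a"], [0, 1, 0])
    except TypeError as err:
        print(type(err).__name__)  # TypeError
    ```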

  • As already mentioned, an alternative to label-encoding that is applicable to feature variables (and therefore usable in pipelines and column transformers) is the OrdinalEncoder (available from version 0.20). On this point, I would suggest reading Difference between OrdinalEncoder and LabelEncoder.
