Using a LabelEncoder in sklearn's Pipeline gives: fit_transform takes 2 positional arguments but 3 were given
I've been trying to run some ML code but I keep faltering at the fitting stage after running my pipeline. I've looked around on various forums to not much avail. What I've discovered is that some people say you can't use LabelEncoder within a pipeline. I'm not sure how true that is. If anyone has any insights on the matter I'd be very happy to hear them.
I keep getting this error:
TypeError: fit_transform() takes 2 positional arguments but 3 were given
And so I'm not sure if the problem is from me or from python. Here's my code:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from category_encoders import TargetEncoder, CatBoostEncoder

data = pd.read_csv("ks-projects-201801.csv",
                   index_col="ID",
                   parse_dates=["deadline","launched"],
                   infer_datetime_format=True)
var = list(data)
data = data.drop(labels=[1014746686, 1245461087, 1384087152, 1480763647,
                         330942060, 462917959, 69489148])
missing = [i for i in var if data[i].isnull().any()]
data = data.dropna(subset=missing, axis=0)
le = LabelEncoder()
oe = OrdinalEncoder()
oh = OneHotEncoder()
y = [i for i in var if i == "state"]
y = data[var.pop(8)]
p, p.index = pd.Series(le.fit_transform(y)), y.index
q = pd.read_csv("y.csv", index_col="ID")["0"]
label_y = le.fit_transform(y)
x = data[var]
obj_feat = x.select_dtypes(include="object")
dat_feat = x.select_dtypes(include="datetime64[ns]")
dat_feat = dat_feat.assign(dmonth=dat_feat.deadline.dt.month.astype("int64"),
                           dyear=dat_feat.deadline.dt.year.astype("int64"),
                           lmonth=dat_feat.launched.dt.month.astype("int64"),
                           lyear=dat_feat.launched.dt.year.astype("int64"))
dat_feat = dat_feat.drop(labels=["deadline","launched"], axis=1)
num_feat = x.select_dtypes(include=["int64","float64"])
u = dict(zip(list(obj_feat), [len(obj_feat[i].unique()) for i in obj_feat]))
le_obj = [i for i in u if u[i] < 10]
oh_obj = [i for i in u if u[i] < 20 and u[i] > 10]
te_obj = [i for i in u if u[i] > 20 and u[i] < 25]
cb_obj = [i for i in u if u[i] > 100]
# Pipeline time
# Impute and encode
strat = ["constant","most_frequent","mean","median"]
sc = StandardScaler()
oh_unk = "ignore"
encoders = [LabelEncoder(),
            OneHotEncoder(handle_unknown=oh_unk),
            TargetEncoder(),
            CatBoostEncoder()]
#num_trans = Pipeline(steps=[("imp",SimpleImputer(strategy=strat[2])),
num_trans = Pipeline(steps=[("sc",sc)])
#obj_imp = Pipeline(steps=[("imp",SimpleImputer(strategy=strat[1]))])
oh_enc = Pipeline(steps=[("oh_enc",encoders[1])])
te_enc = Pipeline(steps=[("te_enc",encoders[2])])
cb_enc = Pipeline(steps=[("cb_enc",encoders[0])])  # encoders[0] is the LabelEncoder
trans = ColumnTransformer(transformers=[
    ("num", num_trans, list(num_feat)+list(dat_feat)),
    #("obj",obj_imp,list(obj_feat)),
    ("onehot", oh_enc, oh_obj),
    ("target", te_enc, te_obj),
    ("catboost", cb_enc, cb_obj)
    ])
models = [RandomForestClassifier(random_state=0),
          KNeighborsClassifier(),
          DecisionTreeClassifier(random_state=0)]
model = models[2]
print("Check 4")
# Chaining it all together
run = Pipeline(steps=[("Transformation",trans),("Model",model)])
x = pd.concat([obj_feat,dat_feat,num_feat], axis=1)
print("Check 5")
run.fit(x, p)
It runs fine until run.fit where it throws an error. I'd love to hear any advice anyone might have, and any possible ways to resolve this problem would also be greatly appreciated! Thank you.
The problem is the same as the one spotted in this answer, but with a LabelEncoder in your case. The LabelEncoder's fit_transform method takes:
def fit_transform(self, y):
"""Fit label encoder and return encoded labels
...
whereas Pipeline expects all of its transformers to take three positional arguments: fit_transform(self, X, y).
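To see the mismatch concretely, here's a minimal sketch (the toy data and two-step pipeline are invented for illustration) that reproduces the error by placing a LabelEncoder in a Pipeline as if it were a feature transformer:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

X = [["a"], ["b"], ["a"], ["c"]]
y = [0, 1, 0, 1]

# Pipeline calls fit_transform(X, y) on every intermediate step,
# but LabelEncoder.fit_transform only accepts a single argument (y).
pipe = Pipeline(steps=[("enc", LabelEncoder()),
                       ("model", DecisionTreeClassifier(random_state=0))])
err = None
try:
    pipe.fit(X, y)
except TypeError as e:
    err = e
print(type(err).__name__, err)  # TypeError; exact wording varies by version
```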
You could make a custom transformer as in the aforementioned answer; however, a LabelEncoder should not be used as a feature transformer. An extensive explanation of why can be found in LabelEncoder for categorical features?. So I'd recommend not using a LabelEncoder, and instead using other Bayesian encoders if the number of categories gets too high, such as the TargetEncoder, which you already have in your list of encoders.
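For the feature matrix, OrdinalEncoder is the drop-in counterpart of LabelEncoder: its fit_transform(X, y=None) signature matches what ColumnTransformer and Pipeline expect. Here's a minimal sketch with made-up toy data (the handle_unknown="use_encoded_value" option assumes scikit-learn >= 0.24):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

X = pd.DataFrame({"cat": ["a", "b", "a", "c"], "num": [1.0, 2.0, 3.0, 4.0]})
y = [0, 1, 0, 1]

# OrdinalEncoder operates on feature columns, so it composes cleanly
# inside a ColumnTransformer, unlike LabelEncoder (which is for targets).
trans = ColumnTransformer(transformers=[
    ("ord", OrdinalEncoder(handle_unknown="use_encoded_value",
                           unknown_value=-1), ["cat"]),
], remainder="passthrough")

run = Pipeline(steps=[("Transformation", trans),
                      ("Model", DecisionTreeClassifier(random_state=0))])
run.fit(X, y)  # no TypeError this time
print(run.score(X, y))  # → 1.0
```

The same structure applies to your cb_enc pipeline: swapping the LabelEncoder there for an OrdinalEncoder (or the CatBoostEncoder you presumably intended at encoders[3]) removes the signature mismatch.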