Using a LabelEncoder in sklearn's Pipeline gives: fit_transform takes 2 positional arguments but 3 were given
I've been trying to run some ML code but I keep faltering at the fitting stage after running my pipeline. I've looked around on various forums to not much avail. What I've discovered is that some people say you can't use LabelEncoder within a pipeline. I'm not sure how true that is. If anyone has any insights on the matter I'd be very happy to hear them.
I keep getting this error:
TypeError: fit_transform() takes 2 positional arguments but 3 were given
And so I'm not sure if the problem is from me or from python. Here's my code:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from category_encoders import TargetEncoder, CatBoostEncoder

data = pd.read_csv("ks-projects-201801.csv",
                   index_col="ID",
                   parse_dates=["deadline","launched"],
                   infer_datetime_format=True)
var = list(data)
data = data.drop(labels=[1014746686, 1245461087, 1384087152, 1480763647,
                         330942060, 462917959, 69489148])
missing = [i for i in var if data[i].isnull().any()]
data = data.dropna(subset=missing, axis=0)
le = LabelEncoder()
oe = OrdinalEncoder()
oh = OneHotEncoder()
y = [i for i in var if i == "state"]
y = data[var.pop(8)]
p, p.index = pd.Series(le.fit_transform(y)), y.index
q = pd.read_csv("y.csv", index_col="ID")["0"]
label_y = le.fit_transform(y)
x = data[var]
obj_feat = x.select_dtypes(include="object")
dat_feat = x.select_dtypes(include="datetime64[ns]")
dat_feat = dat_feat.assign(dmonth=dat_feat.deadline.dt.month.astype("int64"),
                           dyear=dat_feat.deadline.dt.year.astype("int64"),
                           lmonth=dat_feat.launched.dt.month.astype("int64"),
                           lyear=dat_feat.launched.dt.year.astype("int64"))
dat_feat = dat_feat.drop(labels=["deadline","launched"], axis=1)
num_feat = x.select_dtypes(include=["int64","float64"])
u = dict(zip(list(obj_feat), [len(obj_feat[i].unique()) for i in obj_feat]))
le_obj = [i for i in u if u[i] < 10]
oh_obj = [i for i in u if u[i] < 20 and u[i] > 10]
te_obj = [i for i in u if u[i] > 20 and u[i] < 25]
cb_obj = [i for i in u if u[i] > 100]
# Pipeline time
# Impute and encode
strat = ["constant","most_frequent","mean","median"]
sc = StandardScaler()
oh_unk = "ignore"
encoders = [LabelEncoder(),
            OneHotEncoder(handle_unknown=oh_unk),
            TargetEncoder(),
            CatBoostEncoder()]
#num_trans = Pipeline(steps=[("imp",SimpleImputer(strategy=strat[2])),
num_trans = Pipeline(steps=[("sc",sc)])
#obj_imp = Pipeline(steps=[("imp",SimpleImputer(strategy=strat[1]))])
oh_enc = Pipeline(steps=[("oh_enc",encoders[1])])
te_enc = Pipeline(steps=[("te_enc",encoders[2])])
cb_enc = Pipeline(steps=[("cb_enc",encoders[0])])  # encoders[0] is the LabelEncoder
trans = ColumnTransformer(transformers=[
    ("num", num_trans, list(num_feat)+list(dat_feat)),
    #("obj",obj_imp,list(obj_feat)),
    ("onehot", oh_enc, oh_obj),
    ("target", te_enc, te_obj),
    ("catboost", cb_enc, cb_obj)
    ])
models = [RandomForestClassifier(random_state=0),
          KNeighborsClassifier(),
          DecisionTreeClassifier(random_state=0)]
model = models[2]
print("Check 4")
# Chaining it all together
run = Pipeline(steps=[("Transformation",trans),("Model",model)])
x = pd.concat([obj_feat,dat_feat,num_feat], axis=1)
print("Check 5")
run.fit(x, p)
It runs fine until run.fit where it throws an error. I'd love to hear any advice anyone might have, and any possible ways to resolve this problem would also be greatly appreciated! Thank you.
The problem is the same as the one spotted in this answer, but with a LabelEncoder in your case. The LabelEncoder's fit_transform method takes:
def fit_transform(self, y):
"""Fit label encoder and return encoded labels
...
whereas Pipeline expects all of its transformers to take three positional arguments: fit_transform(self, X, y).
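To see the mismatch concretely, here's a minimal sketch (the toy data and two-step pipeline are invented for illustration) that reproduces the error by placing a LabelEncoder in a Pipeline as if it were a feature transformer:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

X = [["a"], ["b"], ["a"], ["c"]]
y = [0, 1, 0, 1]

# Pipeline calls fit_transform(X, y) on every intermediate step,
# but LabelEncoder.fit_transform only accepts a single argument (y).
pipe = Pipeline(steps=[("enc", LabelEncoder()),
                       ("model", DecisionTreeClassifier(random_state=0))])
err = None
try:
    pipe.fit(X, y)
except TypeError as e:
    err = e
print(type(err).__name__, err)  # TypeError; exact wording varies by version
```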
You could make a custom transformer as in the aforementioned answer; however, a LabelEncoder should not be used as a feature transformer. An extensive explanation of why can be found in LabelEncoder for categorical features?. So I'd recommend not using a LabelEncoder, and instead using other Bayesian encoders if the number of categories gets too high, such as the TargetEncoder, which you already have in your list of encoders.
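For the feature matrix, OrdinalEncoder is the drop-in counterpart of LabelEncoder: its fit_transform(X, y=None) signature matches what ColumnTransformer and Pipeline expect. Here's a minimal sketch with made-up toy data (the handle_unknown="use_encoded_value" option assumes scikit-learn >= 0.24):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

X = pd.DataFrame({"cat": ["a", "b", "a", "c"], "num": [1.0, 2.0, 3.0, 4.0]})
y = [0, 1, 0, 1]

# OrdinalEncoder operates on feature columns, so it composes cleanly
# inside a ColumnTransformer, unlike LabelEncoder (which is for targets).
trans = ColumnTransformer(transformers=[
    ("ord", OrdinalEncoder(handle_unknown="use_encoded_value",
                           unknown_value=-1), ["cat"]),
], remainder="passthrough")

run = Pipeline(steps=[("Transformation", trans),
                      ("Model", DecisionTreeClassifier(random_state=0))])
run.fit(X, y)  # no TypeError this time
print(run.score(X, y))  # → 1.0
```

The same structure applies to your cb_enc pipeline: swapping the LabelEncoder there for an OrdinalEncoder (or the CatBoostEncoder you presumably intended at encoders[3]) removes the signature mismatch.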