How can I make my pipeline execute the imputation stage?

I'm trying to run a basic model, but the imputation stage of my pipeline seems to be failing, and I don't understand why.

Here's a minimal reproducible example.

If you'd like, you can find the data for x and y. Originally they were in a public file, but I transformed them a little, so I'll use the edited output to cut down on the code you have to read. I can link to the original code and dataset if need be.

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier,AdaBoostRegressor,AdaBoostClassifier,RandomForestRegressor
from category_encoders import CatBoostEncoder,CountEncoder,TargetEncoder,SumEncoder
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
import datetime as dt

x = pd.read_csv("/home/user/Python Practice/Working/Playstore/x.csv",index_col=("Unnamed: 0"))
y = pd.read_csv("/home/user/Python Practice/Working/Playstore/y.csv",index_col=("Unnamed: 0"))

# Set up Imputers
strat = ["mean","median","most_frequent","constant"]
num_imp = SimpleImputer(strategy=strat[0])
obj_imp = SimpleImputer(strategy=strat[2])

# Set up the scaler
sc = StandardScaler()

# Set up Encoders
cb = CatBoostEncoder()
oh = OneHotEncoder()

# Set up columns
obj = list(x.select_dtypes(include="object"))
num = list(x.select_dtypes(exclude="object"))
cb_col = [i for i in obj if len(x[i].unique())>30]
oh_col = [i for i in obj if len(x[i].unique())<10]

# Col Transformation
col = make_column_transformer((cb,cb_col),
                              (obj_imp,cb_col),
                              (oh,oh_col),
                              (obj_imp,oh_col),
                              (num_imp,num),
                              (sc,num))

model = AdaBoostRegressor(random_state=(0))

#Second Pipeline
run = make_pipeline((col),(model))
run.fit(x,y)
print("The score is",run.score(x,y))

The model crashes at the .fit stage with the error message: ValueError: Input contains NaN. Why would it do this when I've imputed? And how can I resolve it?

I am using pandas v1.1.3 and sklearn v0.23.2.

I guess the main problem is caused by CatBoostEncoder. It requires y as an input, so it may not work with make_column_transformer(), at least not according to what the manual describes. Its output format is also different from that of the other transformers, as shown in the fixed code.

Fix

First, your index has gaps (note the missing label 10472 below) and must be reset after loading.

x.index[10470:10475]
Out[34]: Int64Index([10470, 10471, 10473, 10474, 10475], dtype='int64')

# fix
x.reset_index(drop=True, inplace=True)
y.reset_index(drop=True, inplace=True)
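A likely reason the gap matters: pd.concat with axis=1 aligns rows by index label, and any array wrapped in a fresh pd.DataFrame gets a new 0..n-1 index. The sketch below uses a small made-up frame (the column names are hypothetical) to show how a skipped label produces extra rows and NaN values, and how resetting the index avoids that.

```python
import pandas as pd

# A toy frame whose index skips a label, similar to the gap shown above.
x_toy = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=[0, 1, 3])

# Wrapping a NumPy array in a new DataFrame gives it a fresh 0..n-1 index.
scaled = pd.DataFrame(x_toy[["a"]].to_numpy() * 10, columns=["a_scaled"])

# concat(axis=1) aligns on index labels, so mismatched labels create NaN.
misaligned = pd.concat([x_toy, scaled], axis=1)
print(misaligned.shape)               # one row more than the input
print(misaligned.isna().sum().sum())  # number of NaN cells introduced

# After resetting the index both frames share the same labels and line up.
aligned = pd.concat([x_toy.reset_index(drop=True), scaled], axis=1)
print(aligned.isna().sum().sum())
```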

Second, make the OneHotEncoder output a dense array.

oh = OneHotEncoder(sparse=False)
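For context, OneHotEncoder returns a SciPy sparse matrix by default, which is why the dense option is needed before wrapping the result in a DataFrame. Note that in scikit-learn 1.2 and later the sparse parameter is named sparse_output. The sketch below sidesteps the parameter name by calling .toarray() on the default output instead; the toy column is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column, stored as object dtype.
df = pd.DataFrame({"type": ["Free", "Paid", "Free"]}, dtype=object)

# Default output is a SciPy sparse matrix.
encoded = OneHotEncoder().fit_transform(df)

# Convert to a dense NumPy array before building a DataFrame from it.
dense = encoded.toarray()
frame = pd.DataFrame(dense)
print(frame)
```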

Third, break the pipeline down into separate steps.

# 1. Impute
x[num] = num_imp.fit_transform(x[num])
x[obj] = obj_imp.fit_transform(x[obj])
assert x.isnull().sum().sum() == 0  # make sure no missing value exists

# 2. Transform
x = pd.concat([pd.DataFrame(sc.fit_transform(x[num])),
               cb.fit_transform(x[cb_col], y),
               pd.DataFrame(oh.fit_transform(x[oh_col]))
               ], axis=1)

Finally, train and evaluate the model directly. Reshaping y to one dimension suppresses a warning.

model = AdaBoostRegressor(random_state=0)
model.fit(x.values, y.values.reshape(-1))
print("The score is", model.score(x, y.values.reshape(-1)))

Result:

The score is 0.6329093797171869
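The reshape used in the fit call is needed because y was loaded as a one-column DataFrame, so y.values is two-dimensional, while the estimator expects a one-dimensional target. A minimal illustration with a made-up target column (the name "target" is hypothetical):

```python
import pandas as pd

# A one-column DataFrame, as produced by read_csv on a single-column file.
y_toy = pd.DataFrame({"target": [1.0, 2.0, 3.0]})

two_d = y_toy.values              # column vector
one_d = y_toy.values.reshape(-1)  # flat vector, as used in the fit call
print(two_d.shape, one_d.shape)
```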

Additional Info

I have tried to ignore the third-party CatBoostEncoder and just use OneHotEncoder on all object columns.

col = make_column_transformer(
    (num_imp, num),
    (obj_imp, obj),
    (sc, num),
    (oh, obj),
)

However, the attempt failed in several ways I don't understand.

  • oh failed with ValueError: Input contains NaN.
  • sc failed with ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). This happened even when only x[num] was passed into the pipeline and obj_imp and oh were turned off.

This is the main reason I decided to give up on the pipeline: the behavior of the transformers inside it deviates greatly from what I observed in the fixed code.
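One possible explanation for these failures, offered as a sketch rather than a confirmed diagnosis: a column transformer applies each of its transformers to the original columns independently and concatenates the results, so the scaler and the encoder never see the imputed values. Chaining the imputer and the scaler or encoder in a nested pipeline per column group makes them run in sequence. The toy data and column names below are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "rating": [4.0, np.nan, 3.0, 5.0],
    "category": pd.Series(["GAME", "TOOLS", np.nan, "GAME"], dtype=object),
})

# Parallel layout, as in the question: the scaler receives the raw column,
# so its output still contains the NaN.
parallel = make_column_transformer(
    (SimpleImputer(strategy="mean"), ["rating"]),
    (StandardScaler(), ["rating"]),
)
out_parallel = parallel.fit_transform(toy)
print(np.isnan(out_parallel).any())

# Sequential layout: impute first, then scale or encode, per column group.
num_steps = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
cat_steps = make_pipeline(SimpleImputer(strategy="most_frequent"),
                          OneHotEncoder(handle_unknown="ignore"))
sequential = make_column_transformer(
    (num_steps, ["rating"]),
    (cat_steps, ["category"]),
)
out_sequential = sequential.fit_transform(toy)

# The combined output may be sparse or dense; normalize before checking.
dense = (out_sequential.toarray()
         if hasattr(out_sequential, "toarray") else out_sequential)
print(np.isnan(dense).any())
```

With this layout the column transformer can be passed to make_pipeline together with the model, as in the original question. An encoder that needs y, such as CatBoostEncoder, would need to be checked separately.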
