
How can I make my pipeline execute the imputation stage?

I'm trying to run a basic model but it seems as though the imputation stage of my pipeline is failing, and I don't really understand why.

Here's the minimal reproducible code:

The data for x and y originally came from a public file, but I transformed it a little, so I'll use the edited output here to cut down on the code you have to read. I can link to the original code and dataset if need be.

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier, AdaBoostRegressor, AdaBoostClassifier, RandomForestRegressor
from category_encoders import CatBoostEncoder, CountEncoder, TargetEncoder, SumEncoder
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
import datetime as dt

x = pd.read_csv("/home/user/Python Practice/Working/Playstore/x.csv",index_col=("Unnamed: 0"))
y = pd.read_csv("/home/user/Python Practice/Working/Playstore/y.csv",index_col=("Unnamed: 0"))

# Set up Imputers
strat = ["mean","median","most_frequent","constant"]
num_imp = SimpleImputer(strategy=strat[0])
obj_imp = SimpleImputer(strategy=strat[2])

# Set up the scaler
sc = StandardScaler()

# Set up Encoders
cb = CatBoostEncoder()
oh = OneHotEncoder()

# Set up columns
obj = list(x.select_dtypes(include="object"))
num = list(x.select_dtypes(exclude="object"))
cb_col = [i for i in obj if len(x[i].unique())>30]
oh_col = [i for i in obj if len(x[i].unique())<10]

# Col Transformation
col = make_column_transformer((cb,cb_col),
                              (obj_imp,cb_col),
                              (oh,oh_col),
                              (obj_imp,oh_col),
                              (num_imp,num),
                              (sc,num))

model = AdaBoostRegressor(random_state=(0))

#Second Pipeline
run = make_pipeline((col),(model))
run.fit(x,y)
print("The score is",run.score(x,y))

The model crashes at the .fit stage with the error message ValueError: Input contains NaN. Why would it do this when I've imputed? And how can I resolve it?

I am using pandas v1.1.3 and sklearn v0.23.2.

I guess the main problem is caused by CatBoostEncoder. It requires the target column y as input, so it may not work with make_column_transformer(), at least not according to what the manual describes. Its output format also differs from that of the other transformers, as shown in the fixed code.
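
To illustrate the interface mismatch, here is a minimal sketch with a tiny made-up frame (the data is purely for demonstration):

import pandas as pd
from category_encoders import CatBoostEncoder
from sklearn.preprocessing import OneHotEncoder

X_demo = pd.DataFrame({"cat": ["a", "b", "a", "c"]})
y_demo = pd.Series([1.0, 0.0, 1.0, 0.0])

# CatBoostEncoder is supervised: fit_transform requires y,
# and it returns a pandas DataFrame by default (return_df=True).
print(type(CatBoostEncoder().fit_transform(X_demo, y_demo)))
# <class 'pandas.core.frame.DataFrame'>

# OneHotEncoder is unsupervised: it fits on X alone and returns an array.
print(type(OneHotEncoder(sparse=False).fit_transform(X_demo)))
# <class 'numpy.ndarray'>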

Fix

First, your index has gaps (note the missing 10472 below) and must be reset after loading.

x.index[10470:10475]
Out[34]: Int64Index([10470, 10471, 10473, 10474, 10475], dtype='int64')

# fix
x.reset_index(drop=True, inplace=True)
y.reset_index(drop=True, inplace=True)

Second, make the OneHotEncoder output a dense array. (With the sklearn 0.23 you are using, the parameter is sparse; in scikit-learn 1.2+ it was renamed to sparse_output.)

oh = OneHotEncoder(sparse=False)

Third, break down the pipeline.

# 1. Impute
x[num] = num_imp.fit_transform(x[num])
x[obj] = obj_imp.fit_transform(x[obj])
assert x.isnull().sum().sum() == 0  # make sure no missing value exists

# 2. Transform
x = pd.concat([pd.DataFrame(sc.fit_transform(x[num])),
               cb.fit_transform(x[cb_col], y),
               pd.DataFrame(oh.fit_transform(x[oh_col]))
               ], axis=1)

Finally, train and evaluate the model directly. Reshaping y to a 1-D array suppresses sklearn's DataConversionWarning about column-vector targets.

model = AdaBoostRegressor(random_state=0)
model.fit(x.values, y.values.reshape(-1))
print("The score is", model.score(x, y.values.reshape(-1)))

Result:

The score is 0.6329093797171869

Additional Info

I also tried dropping the third-party CatBoostEncoder and just using OneHotEncoder on all object columns.

col = make_column_transformer(
    (num_imp, num),
    (obj_imp, obj),
    (sc, num),
    (oh, obj),
)

However, the attempt failed in several strange ways I didn't understand:

  • oh failed with ValueError: Input contains NaN.
  • sc failed with ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). This happened even when only x[num] was passed into the pipeline and obj_imp and oh were turned off.

This is the main reason I decided to give up on the pipeline: the behavior of transformers inside it deviates greatly from what I observed in the fixed code. A likely explanation and a possible pipeline-based fix are sketched below.
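
The likely cause of both errors is that make_column_transformer() applies every (transformer, columns) pair in parallel to the raw input and concatenates the results, so oh and sc received the original NaN-containing columns rather than the imputers' output. Chaining each imputer with its encoder or scaler via make_pipeline inside the column transformer should avoid this; here is an untested sketch reusing the num and oh_col lists from above:

from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Chain impute -> scale/encode per column group, so each downstream step
# sees the imputer's output instead of the raw NaN-containing columns.
num_pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
oh_pipe = make_pipeline(SimpleImputer(strategy="most_frequent"),
                        OneHotEncoder(sparse=False, handle_unknown="ignore"))

col = make_column_transformer((num_pipe, num), (oh_pipe, oh_col))

The cb_col columns could still be encoded separately with CatBoostEncoder, as in the fixed code above.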
