I have a simple model with a pipeline that uses ColumnTransformer.
I am able to train the model and save it as a pickle.
When I load the pickle and predict on real-time data, I get the following error from ColumnTransformer:
Column ordering must be equal for fit and for transform when using the remainder keyword
The training data and the data used for prediction have exactly the same number of columns, e.g. 50. I am not sure how the "ordering" of the columns could have changed.
Why is column ordering important for ColumnTransformer? How can I fix this? Is there a way to ensure the "ordering" after running a column transformer?
Thanks.
pipeline = Pipeline([
    ('RepalceInf', ReplaceInf()),
    ('impute_30_100', ColumnTransformer(
        [
            ('oneStdNorm', OneStdImputer(), self.cont_feature_strategy_dict['FEATS_30_100']),
        ],
        remainder='passthrough'
    )),
    ('regress_impute', IterativeImputer(random_state=0, estimator=self.cont_estimator)),
    ('replace_outlier', OutlierReplacer(quantile_range=(1, 99))),
    ('scaler', StandardScaler(with_mean=True))
])
class OneStdImputer(TransformerMixin, BaseEstimator):
    def __init__(self):
        """
        Impute the missing data with random value in the range of mean +/- one standard deviation
        This is a simplified implementation without sparse/dense fit and check.
        """
        self.mean = None
        self.std = None

    def fit(self, X, y=None):
        self.mean = X.mean()
        self.std = X.std()
        return self

    def transform(self, X):
        # X_imp = X.fillna(np.random.randint()*2*self.std+self.mean-self.std)
        for col in X:
            self._fill_randnorm(X[col], col)
        return X

    def _fill_randnorm(self, df, col):
        val = df.values
        mask = np.isnan(df)
        mu, sigma = self.mean[col], self.std[col]
        val[mask] = np.random.normal(mu, sigma, size=mask.sum())
        return df
You can use df_new = pd.DataFrame(df_origin, columns=df_train.columns)
to make sure the data to predict has the same columns, in the same order, as the training data.
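As a minimal sketch of that realignment step: the frames and column names below are hypothetical stand-ins (only df_train and df_origin come from the line above), and the missing-column check is an extra precaution I am assuming you would want, since reindexing to a column that does not exist would otherwise fill it with NaN.

```python
import pandas as pd

# Hypothetical training frame; only its column order matters here.
df_train = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]})

# Hypothetical prediction-time frame: same columns, different order.
df_origin = pd.DataFrame({"c": [50.0], "a": [10.0], "b": [30.0]})

# Fail early if a training column is absent from the new data.
missing = set(df_train.columns) - set(df_origin.columns)
if missing:
    raise ValueError(f"missing columns: {sorted(missing)}")

# Select the training columns, in the training order.
df_new = df_origin[list(df_train.columns)]

print(list(df_new.columns))
```

In practice you would save the list of training columns next to the pickled pipeline, so that the order is still available when the model is loaded for prediction.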
The documentation example below shows that ColumnTransformer can select the columns to process by their position.
(You can also select a column by its exact name, but I think the name is converted to a position internally.) That would explain why the column order has to be the same for fit and for transform.
>>> import numpy as np
>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.preprocessing import Normalizer
>>> ct = ColumnTransformer(
...     [("norm1", Normalizer(norm='l1'), [0, 1]),
...      ("norm2", Normalizer(norm='l1'), slice(2, 4))])
>>> X = np.array([[0., 1., 2., 2.],
...               [1., 1., 0., 1.]])
>>> # Normalizer scales each row of X to unit norm. A separate scaling
>>> # is applied for the two first and two last elements of each
>>> # row independently.
>>> ct.fit_transform(X)
array([[0. , 1. , 0.5, 0.5],
       [0.5, 0.5, 0. , 1. ]])
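To see why position matters, here is a small sketch built on the same kind of example. It is my own variation, not taken from the documentation: one transformer on positions [0, 1] plus remainder='passthrough', applied first to the original array and then to a copy with two columns swapped. With plain NumPy arrays there are no column names to check, so the transformer simply processes whatever sits at those positions.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer

ct = ColumnTransformer(
    [("norm1", Normalizer(norm="l1"), [0, 1])],
    remainder="passthrough",  # columns 2 and 3 are passed through unchanged
)

X = np.array([[0., 1., 2., 2.],
              [1., 1., 0., 1.]])
ct.fit(X)

# Transform with the same column order as during fit.
out_same = ct.transform(X)

# Same values, but columns 1 and 2 have traded places.
X_swapped = X[:, [0, 2, 1, 3]]
out_swapped = ct.transform(X_swapped)

# The two results differ because the wrong column was normalized.
print(np.allclose(out_same, out_swapped))
```

The shapes match and no error is raised here, yet the output is different, which is the kind of silent mistake the column-ordering check on DataFrames is meant to prevent.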