简体   繁体   中英

Scikitlearn Column Transformer Error: Column ordering must be equal for fit and for transform when using the remainder keyword

I have a simple model with a pipelie using ColumnTransformer

I am able to train the model and save the model as pickle

When I load the pickle and predict on the real-time data, I received the following error regarding ColumnTransformer

Column ordering must be equal for fit and for transform when using the remainder keyword

The training data and the data used for prediction has exact the same number of column, eg, 50. I am not sure how the "ordering" of the column could have changed.

Why ordering of the column is important for columntransformer? How to fix this? Is there a way to ensure the "ordering" after running a column transformer?

Thanks.

   pipeline = Pipeline([
        ('RepalceInf', ReplaceInf()),
        ('impute_30_100', ColumnTransformer(
            [
                ('oneStdNorm', OneStdImputer(), self.cont_feature_strategy_dict['FEATS_30_100']),
            ],
            remainder='passthrough'
        )),
        ('regress_impute', IterativeImputer(random_state=0, estimator=self.cont_estimator)),
        ('replace_outlier', OutlierReplacer(quantile_range=(1, 99))),
        ('scaler', StandardScaler(with_mean=True))
    ])



class OneStdImputer(TransformerMixin, BaseEstimator):
def __init__(self):
    """
    Impute the missing data with random value in the range of mean +/- one standard deviation
    This is a simplified implementation without sparse/dense fit and check.
    """
    self.mean = None
    self.std = None

def fit(self, X, y=None):
    self.mean = X.mean()
    self.std = X.std()
    return self

def transform(self, X):
    # X_imp = X.fillna(np.random.randint()*2*self.std+self.mean-self.std)
    for col in X:
        self._fill_randnorm(X[col], col)
    return X

def _fill_randnorm(self, df, col):
    val = df.values
    mask = np.isnan(df)
    mu, sigma = self.mean[col], self.std[col]
    val[mask] = np.random.normal(mu, sigma, size=mask.sum())
    return df

You can use df_new =pd.DataFrame(df_origin, columns=df_train.columns to make sure the data to predict have same columns with training data.

And from the given example, it's obviously that ColumnTransformer will take the order number of a chosen column as a mark to process.(Although you can use exactly name to choose a column, but I think it will transform to number too)

>>> import numpy as np
>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.preprocessing import Normalizer
>>> ct = ColumnTransformer(
...     [("norm1", Normalizer(norm='l1'), [0, 1]),
...      ("norm2", Normalizer(norm='l1'), slice(2, 4))])
>>> X = np.array([[0., 1., 2., 2.],
...               [1., 1., 0., 1.]])
>>> # Normalizer scales each row of X to unit norm. A separate scaling
>>> # is applied for the two first and two last elements of each
>>> # row independently.
>>> ct.fit_transform(X)
array([[0. , 1. , 0.5, 0.5],
       [0.5, 0.5, 0. , 1. ]])

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM