Scikit-learn ColumnTransformer error: Column ordering must be equal for fit and for transform when using the remainder keyword

Question

I have a simple model with a pipeline that uses ColumnTransformer.

I am able to train the model and save it as a pickle.

When I load the pickle and predict on real-time data, I get the following error from ColumnTransformer:

Column ordering must be equal for fit and for transform when using the remainder keyword

The training data and the data used for prediction have exactly the same number of columns, e.g. 50. I am not sure how the "ordering" of the columns could have changed.

Why is the ordering of the columns important for ColumnTransformer? How can I fix this? Is there a way to ensure the "ordering" after running a ColumnTransformer?

Thanks.

pipeline = Pipeline([
    ('RepalceInf', ReplaceInf()),
    ('impute_30_100', ColumnTransformer(
        [
            ('oneStdNorm', OneStdImputer(), self.cont_feature_strategy_dict['FEATS_30_100']),
        ],
        remainder='passthrough'
    )),
    ('regress_impute', IterativeImputer(random_state=0, estimator=self.cont_estimator)),
    ('replace_outlier', OutlierReplacer(quantile_range=(1, 99))),
    ('scaler', StandardScaler(with_mean=True))
])



import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class OneStdImputer(TransformerMixin, BaseEstimator):
    def __init__(self):
        """
        Impute missing data with random values drawn from a normal distribution
        that uses each column's mean and standard deviation.
        This is a simplified implementation without sparse/dense fit and check.
        """
        self.mean = None
        self.std = None

    def fit(self, X, y=None):
        # X is expected to be a pandas DataFrame; store per-column statistics.
        self.mean = X.mean()
        self.std = X.std()
        return self

    def transform(self, X):
        # Work on a copy so the caller's DataFrame is not modified in place.
        X = X.copy()
        for col in X:
            X[col] = self._fill_randnorm(X[col], col)
        return X

    def _fill_randnorm(self, series, col):
        # Replace NaNs in one column with draws from N(mean, std) for that column.
        mask = series.isna()
        mu, sigma = self.mean[col], self.std[col]
        filled = series.copy()
        filled[mask] = np.random.normal(mu, sigma, size=mask.sum())
        return filled

Answer 1

You can use df_new = pd.DataFrame(df_origin, columns=df_train.columns) to make sure the data to predict has the same columns, in the same order, as the training data.
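A minimal sketch of that reordering, assuming both frames are pandas DataFrames. The column names and values are made up for illustration, and align_to_training is a hypothetical helper, not part of pandas or scikit-learn. It only adds a check for missing columns, because pd.DataFrame(df, columns=...) fills absent columns with NaN instead of raising.

```python
import pandas as pd


def align_to_training(df, train_columns):
    """Hypothetical helper: return df with its columns in the training order."""
    missing = [c for c in train_columns if c not in df.columns]
    if missing:
        raise ValueError(f"Columns missing from prediction data: {missing}")
    # Selecting by the training column list reorders and drops extra columns.
    return df[list(train_columns)]


# Illustrative training frame: its column order is what the fitted pipeline saw.
df_train = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]})

# Illustrative real-time data: same columns, different order.
df_origin = pd.DataFrame({"c": [7.0], "a": [8.0], "b": [9.0]})

# The one-liner from the answer, and the stricter helper.
df_new = pd.DataFrame(df_origin, columns=df_train.columns)
df_checked = align_to_training(df_origin, df_train.columns)

print(list(df_new.columns))
print(df_new.equals(df_checked))
```

Since the model is loaded from a pickle, the training column list would have to be saved somewhere as well, for example next to the model in the same pickle.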

The documentation example below shows that ColumnTransformer can select columns by position. Even if you select columns by exact name, I think the columns are converted to positions internally, at least for the remainder, so the column order at transform time has to match the order seen during fit.

>>> import numpy as np
>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.preprocessing import Normalizer
>>> ct = ColumnTransformer(
...     [("norm1", Normalizer(norm='l1'), [0, 1]),
...      ("norm2", Normalizer(norm='l1'), slice(2, 4))])
>>> X = np.array([[0., 1., 2., 2.],
...               [1., 1., 0., 1.]])
>>> # Normalizer scales each row of X to unit norm. A separate scaling
>>> # is applied for the two first and two last elements of each
>>> # row independently.
>>> ct.fit_transform(X)
array([[0. , 1. , 0.5, 0.5],
       [0.5, 0.5, 0. , 1. ]])
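To make the effect of positional selection concrete, here is a small sketch on NumPy arrays, where there are no column names to check and no error is raised. The arrays and the choice of StandardScaler are assumptions for illustration only, not taken from the question's pipeline.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Illustrative training data: column 0 is scaled, column 1 is passed through.
X_train = np.array([[1.0, 10.0],
                    [3.0, 30.0]])

ct = ColumnTransformer(
    [("scale", StandardScaler(), [0])],  # column selected by position
    remainder="passthrough",
)
ct.fit(X_train)

X_same = np.array([[1.0, 10.0]])
X_swapped = X_same[:, ::-1]  # same values, columns in the opposite order

out_same = ct.transform(X_same)
out_swapped = ct.transform(X_swapped)

print(out_same)
print(out_swapped)
```

With the columns swapped, the scaler is applied to the wrong feature and the result is silently different. The column ordering check on DataFrames exists to catch this kind of mismatch.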

solution1
3 ACCEPTED 2020-01-02 10:18:15