
Shape gets changed when preprocessing with column transformer and predicting the testing data

The data looks like this:

df_train.head()

ID  y   X0  X1  X2  X3  X4  X5  X6  X8  ... X375    X376    X377    X378    X379    X380    X382    X383    X384    X385
0   0   130.81  k   v   at  a   d   u   j   o   ... 0   0   1   0   0   0   0   0   0   0
1   6   88.53   k   t   av  e   d   y   l   o   ... 1   0   0   0   0   0   0   0   0   0
2   7   76.26   az  w   n   c   d   x   j   x   ... 0   0   0   0   0   0   1   0   0   0
3   9   80.62   az  t   n   f   d   x   l   e   ... 0   0   0   0   0   0   0   0   0   0
4   13  78.02   az  v   n   f   d   h   d   n   ... 0   0   0   0   0   0   0   0   0   0

df_train.shape
(4209, 378)

df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4209 entries, 0 to 4208
Columns: 378 entries, ID to X385
dtypes: float64(1), int64(369), object(8)
memory usage: 12.1+ MB

cat_cols=df_train.select_dtypes(include="object").columns

y=df_train['y']

y.shape
(4209,)
y.head()
0    130.81
1     88.53
2     76.26
3     80.62
4     78.02
Name: y, dtype: float64

X=df_train.drop(['y','ID'],axis=1)
X.shape
(4209, 376)

X.head()
X0  X1  X2  X3  X4  X5  X6  X8  X10 X11 ... X375    X376    X377    X378    X379    X380    X382    X383    X384    X385
0   k   v   at  a   d   u   j   o   0   0   ... 0   0   1   0   0   0   0   0   0   0
1   k   t   av  e   d   y   l   o   0   0   ... 1   0   0   0   0   0   0   0   0   0
2   az  w   n   c   d   x   j   x   0   0   ... 0   0   0   0   0   0   1   0   0   0
3   az  t   n   f   d   x   l   e   0   0   ... 0   0   0   0   0   0   0   0   0   0
4   az  v   n   f   d   h   d   n   0   0   ... 0   0   0   0   0   0   0   0   0   0
5 rows × 376 columns

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(2946, 376)
(1263, 376)
(2946,)
(1263,)


ct=make_column_transformer((OneHotEncoder(),cat_cols),remainder='passthrough')
ct
ColumnTransformer(remainder='passthrough',
                  transformers=[('onehotencoder', OneHotEncoder(),
                                 Index(['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8'], dtype='object'))])
X_train.columns
Index(['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8', 'X10', 'X11',
       ...
       'X375', 'X376', 'X377', 'X378', 'X379', 'X380', 'X382', 'X383', 'X384',
       'X385'],
      dtype='object', length=376)

type(X_train)
pandas.core.frame.DataFrame

X_train_transformed=ct.fit_transform(X_train)
X_train_transformed.shape
(2946, 558)
type(X_train_transformed)
numpy.ndarray

linereg=LinearRegression()
linereg.fit(X_train_transformed,y_train)
X_test_transformed=ct.fit_transform(X_test)
X_test.shape
(1263, 376)
X_test_transformed.shape
(1263, 544)
linereg.predict(X_test_transformed)

The error occurs at this step (only the last part of the traceback is shown):

ValueError                                Traceback (most recent call last)
<ipython-input-126-9d1b72421dd0> in <module>

D:\Anaconda\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     71                           FutureWarning)
     72         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 73         return f(**kwargs)
     74     return inner_f
     75 

D:\Anaconda\lib\site-packages\sklearn\utils\extmath.py in safe_sparse_dot(a, b, dense_output)
    151             ret = np.dot(a, b)
    152     else:
--> 153         ret = a @ b
    154 
    155     if (sparse.issparse(a) and sparse.issparse(b)

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 558 is different from 544)

The shape changes when the data set is transformed. I am not sure whether there is a better way to preprocess the data in this case, since all columns are categorical. There are 8 columns of nominal categorical values stored as strings, and the remaining columns contain only binary values. The column transformer applies OneHotEncoder to the 8 string columns and passes the remaining columns to the predictor unchanged. I would appreciate your help in resolving this.

I tried to build a minimal reproducible example of your problem, and I do not run into any errors myself. Can you run it on your side and check whether there are any important differences between the dataframe created here and yours?

Note that:

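  • Transform the test data with the already fitted ColumnTransformer (transform); do not fit it again (fit_transform)
  • The OneHotEncoder is initialized with handle_unknown='ignore', so a category that appears only in the test data is encoded as all zeros instead of raising an error
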
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Parameters to tweak
n_categories = 10 # Number of categorical columns
groups_by_cat = [3 , 10] # Number of groups which a category will have, to be chosen 
                        # randomly between these two numbers
n_rows = 20
n_binary_cols = 10

# code
list_alpha = list('abcdefghijklmnopqrstuvwxyz')
np.random.seed(42)
groups = []

# names of the columns of the dataframe
col_names = ['X'+str(i) for i in range(n_categories + n_binary_cols)]

# first we generate randomly a set of groups that each category can have
for i in range(n_categories):
    np.random.randn()
    temp_groups = []
    temp_n_groups = np.random.randint(*groups_by_cat)
    for k in range(temp_n_groups):
        group = "".join(np.random.choice(list_alpha,2, replace = True))
        temp_groups.append(group)
    groups.append(temp_groups)

# then we generate n_rows taking samples from the groups generated previously
array_categories = np.random.choice(groups[0],(n_rows,1), replace = True)
for i in range(1,n_categories):
    temp_column = np.random.choice(groups[i],(n_rows,1), replace = True)
    array_categories = np.hstack((array_categories, temp_column))
    

# we generate an array containing the binary columns
array_binaries = np.random.randint(0, 2, (n_rows, n_binary_cols))


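# we create the dataframe by concatenating the two arrays column-wise;
# building it from two DataFrames keeps the binary columns numeric
# (np.hstack would cast them to strings, so they would be one-hot encoded too)
df = pd.concat([pd.DataFrame(array_categories), pd.DataFrame(array_binaries)], axis=1)
df.columns = col_names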

y = np.random.random_sample((n_rows,1))

# split
X_train, X_test, y_train, y_test = train_test_split(df, y)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

# create column transformer
cat_cols = df.select_dtypes(include="object").columns
ct = make_column_transformer((OneHotEncoder(handle_unknown='ignore'),cat_cols),
                             remainder='passthrough')

# fit transform the ColumnTransformer
X_train_transformed = ct.fit_transform(X_train)

# fit linearRegression and predict
linereg = LinearRegression()
linereg.fit(X_train_transformed,y_train)
X_test_transformed = ct.transform(X_test)

print("\nSizes of transformed arrays")
print(X_train_transformed.shape)
print(X_test_transformed.shape)

linereg.predict(X_test_transformed)

Note that the test data is only transformed with the ColumnTransformer, not fitted:

X_test_transformed = ct.transform(X_test)

Otherwise the OneHotEncoder will recompute the output columns from the test data, and these may not match the columns produced for the training data (for example, if the test data lacks some of the categories found in the training data). That is what happened in the question: 558 columns for the training set against 544 for the test set. For more detail, look up the differences between fit, fit_transform and transform.
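
To make the mismatch concrete, here is a minimal sketch on a toy dataframe. The column names cat and flag and the values in them are made up for illustration and do not come from the question's data. It compares reusing the fitted ColumnTransformer against fitting a fresh one on the test data.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the test set lacks category "c" and adds an unseen "d".
train = pd.DataFrame({"cat": ["a", "b", "c", "a"], "flag": [0, 1, 0, 1]})
test = pd.DataFrame({"cat": ["a", "d"], "flag": [1, 0]})


def build_ct():
    # Same construction as in the answer: one-hot encode the string column,
    # pass the remaining columns through unchanged.
    return make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), ["cat"]),
        remainder="passthrough",
    )


ct = build_ct()

# Fit on the training data only: 3 one-hot columns + 1 passthrough column.
train_t = ct.fit_transform(train)

# Correct: reuse the fitted transformer, so the test set gets the same 4 columns.
test_ok = ct.transform(test)

# Incorrect: a transformer fitted on the test data learns its own categories
# ("a" and "d"), giving 2 one-hot columns + 1 passthrough column.
test_refit = build_ct().fit_transform(test)

print(train_t.shape, test_ok.shape, test_refit.shape)
```

This prints (4, 4) (2, 4) (2, 3). Only test_ok has the column count that a model fitted on train_t expects; test_refit reproduces the kind of mismatch reported in the traceback.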
