I'm fitting a step function in statsmodels, using cross-validation first to determine the ideal number of cuts. However, I've run into an error I can't figure out how to fix.
After I added a cross-validation loop using scikit-learn's KFold, I began receiving this error:
ValueError: shapes (480,2) and (1,) not aligned: 2 (dim 1) != 1 (dim 0)
I'm not sure why this happens, because the code worked fine before I added the cross-validation loop.
If someone could take a look at my code and point out where the issue stems from, I'd really appreciate it.
Shape of X_train and y_train before going in:
X_train: (2400,) y_train: (2400,)
Code:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=1)

cuts = []
RMSE = []
for i in range(1, 11):
    cuts.append(i)
    cross_val_rms = []
    for train_index, test_index in kf.split(X_train):
        train_x, test_x = X_train.iloc[train_index], X_train.iloc[test_index]
        train_y, test_y = y_train.iloc[train_index], y_train.iloc[test_index]
        # Bin the training fold and build one dummy column per bin
        df_cut, bins = pd.cut(train_x, i, retbins=True, right=True)
        df_steps = pd.concat([train_x, df_cut, train_y],
                             keys=['age', 'age_cuts', 'wage'], axis=1)
        df_steps_dummies = pd.get_dummies(df_cut)
        GLM_fitted = sm.GLM(df_steps.wage, df_steps_dummies).fit()
        # Map the test fold onto the training bins and predict
        bin_mapping = np.digitize(test_x, bins)
        X_valid = pd.get_dummies(bin_mapping)
        pred = GLM_fitted.predict(X_valid)
        rms = np.sqrt(mean_squared_error(test_y, pred))
        cross_val_rms.append(rms)
    mean_rms = sum(cross_val_rms) / len(cross_val_rms)
    RMSE.append(mean_rms)

cuts_df = pd.DataFrame()
cuts_df['Cuts'] = cuts
cuts_df['RMSE'] = RMSE
print('Cuts with lowest Root Mean Squared Error:', cuts_df.loc[cuts_df['RMSE'].idxmin()], sep='\n')
Error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-166-a9794538c3e5> in <module>()
21 bin_mapping = np.digitize(test_x, bins)
22 X_valid = pd.get_dummies(bin_mapping)
---> 23 pred = GLM_fitted.predict(X_valid)
24 rms = np.sqrt(mean_squared_error(test_y, pred))
25 cross_val_rms.append(rms)
1 frames
/usr/local/lib/python3.7/dist-packages/statsmodels/genmod/generalized_linear_model.py in predict(self, params, exog, exposure, offset, linear)
870 exog = self.exog
871
--> 872 linpred = np.dot(exog, params) + offset + exposure
873 if linear:
874 return linpred
<__array_function__ internals> in dot(*args, **kwargs)
ValueError: shapes (480,2) and (1,) not aligned: 2 (dim 1) != 1 (dim 0)
It would help if you explained what you are trying to do in the regression. You get the error because the number of bins seen in the training fold need not match the number seen in the test fold: if pd.cut produces 3 bins on the training fold, np.digitize on the test fold may populate only 2 of them (some bin may contain no test values), so the dummy matrix passed to predict has a different number of columns than the fitted parameters.
From what I can see, you can simply discretize the values first and then run the cross-validation, so that every fold shares the same bins. Using some example data:
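To see the mismatch concretely, here is a minimal sketch with made-up values: a training fold whose values span three bins yields three dummy columns, while a test fold that lands in only one bin yields a single column, so the shapes can't line up at predict time.

```python
import numpy as np
import pandas as pd

# Made-up data: the training fold spans three bins, but both test values
# fall into the first bin.
train_fold = pd.Series([1.0, 2.0, 3.0, 9.0])
test_fold = pd.Series([1.5, 2.5])

# Bin the training fold, as the question does per fold
df_cut, bins = pd.cut(train_fold, 3, retbins=True, right=True)
train_dummies = pd.get_dummies(df_cut)  # one column per bin category
test_dummies = pd.get_dummies(np.digitize(test_fold, bins))

print(train_dummies.shape)  # 3 dummy columns
print(test_dummies.shape)   # only 1 dummy column -> mismatch at predict()
```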
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
X_train = pd.Series(np.random.uniform(0,1,2400))
y_train = pd.Series(np.random.uniform(0,1,2400))
Then
kf = KFold(n_splits=5, shuffle=True, random_state=1)
RMSE = []
for i in range(2, 11):
    cross_val_rms = []
    # Discretize the full series once, so train and test folds share the same bins
    df_steps_dummies = pd.get_dummies(pd.cut(X_train, i))
    for train_index, test_index in kf.split(X_train):
        train_x, test_x = df_steps_dummies.iloc[train_index, :], df_steps_dummies.iloc[test_index, :]
        train_y, test_y = y_train[train_index], y_train[test_index]
        GLM_fitted = sm.GLM(train_y, train_x).fit()
        pred = GLM_fitted.predict(test_x)
        rms = np.sqrt(mean_squared_error(test_y, pred))
        cross_val_rms.append(rms)
    RMSE.append(np.array(cross_val_rms).mean())