
10-fold cross-validation and obtaining RMSE

I'm trying to compare the RMSE I get from performing multiple linear regression on the full data set with the RMSE from 10-fold cross-validation, using the KFold class in scikit-learn. I found some code that I tried to adapt, but I can't get it to work (and I suspect it never worked in the first place).

TIA for any help!

Here's my linear regression function

  def standRegres(xArr,yArr):
      xMat = np.mat(xArr); yMat = np.mat(yArr).T
      xTx = xMat.T*xMat
      if np.linalg.det(xTx) == 0.0:
          print("This matrix is singular, cannot do inverse")
          return
      ws = xTx.I * (xMat.T*yMat)
      return ws

  ##  I run it on my matrix ("comm_df") and my dependent var (comm_target)

  ##  Calculate RMSE (omitted some code)

  initial_regress_RMSE = np.sqrt(np.mean((yHat_array - comm_target_array)**2))

  ##  Now trying to get RMSE after training model through 10-fold cross validation

  from sklearn.model_selection import KFold
  from sklearn.linear_model import LinearRegression

  kf = KFold(n_splits=10)
  xval_err = 0
  for train, test in kf:
      linreg.fit(comm_df,comm_target)
      p = linreg.predict(comm_df)
      e = p-comm_target
      xval_err += np.sqrt(np.dot(e,e)/len(comm_df))

  rmse_10cv = xval_err/10

I get an error saying that the KFold object is not iterable.

There are several things you need to correct in this code.

  • You cannot iterate over kf directly. You can only iterate over kf.split(comm_df).

  • You need to use the train/test split that KFold provides, which your code currently ignores. The goal of KFold is to fit your regression on the train observations and to evaluate it (i.e. compute the RMSE in your case) on the test observations.
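To make both points concrete, here is a minimal sketch on a made-up 20-row array (the names X_demo, kf_demo and fold_sizes are illustrative and not from the question). It shows that kf.split yields pairs of integer index arrays, which is why they have to be used to index the data:

```python
import numpy as np
from sklearn.model_selection import KFold

# Made-up data: 20 rows, 3 features (illustrative only)
X_demo = np.arange(60).reshape(20, 3)

kf_demo = KFold(n_splits=10)
fold_sizes = []
for train, test in kf_demo.split(X_demo):
    # train and test are arrays of row positions, not the rows themselves
    fold_sizes.append((len(train), len(test)))

print(fold_sizes[0])
```

Each fold holds out a different tenth of the rows, so every row is used for testing exactly once.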

With this in mind, here is how I would correct your code (it assumes your data is in NumPy arrays, but you can easily switch to pandas):

linreg = LinearRegression()
kf = KFold(n_splits=10)
xval_err = 0
for train, test in kf.split(comm_df):
    # Fit on the training rows only
    linreg.fit(comm_df[train], comm_target[train])
    # Evaluate on the held-out rows
    p = linreg.predict(comm_df[test])
    e = p - comm_target[test]
    xval_err += np.sqrt(np.dot(e, e) / len(comm_target[test]))

rmse_10cv = xval_err / 10
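If comm_df and comm_target are pandas objects rather than NumPy arrays, indexing with comm_df[train] looks up column labels instead of row positions, so the rows have to be selected with .iloc. A minimal sketch of the pandas variant follows; the data here is synthetic, and the column names and coefficients are assumptions made only so that the example runs:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic stand-ins for comm_df / comm_target (assumed to be pandas objects)
rng = np.random.default_rng(0)
comm_df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
comm_target = pd.Series(
    comm_df.to_numpy() @ np.array([1.0, -2.0, 0.5])
    + rng.normal(scale=0.1, size=100)
)

linreg = LinearRegression()
kf = KFold(n_splits=10)
fold_rmse = []
for train, test in kf.split(comm_df):
    # .iloc selects rows by position, which is what KFold returns
    linreg.fit(comm_df.iloc[train], comm_target.iloc[train])
    p = linreg.predict(comm_df.iloc[test])
    e = p - comm_target.iloc[test].to_numpy()
    fold_rmse.append(np.sqrt(np.mean(e ** 2)))

# Average of the per-fold RMSE values
rmse_10cv = np.mean(fold_rmse)
print(rmse_10cv)
```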

So the code you provided still threw an error. I abandoned what I had above in favor of the following, which works:

## KFold cross-validation

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

## Define variables for the for loop

kf = KFold(n_splits=10)
RMSE_sum=0
RMSE_length=10
X = np.array(comm_df)
y = np.array(comm_target)

for loop_number, (train, test) in enumerate(kf.split(X)):

    ## Get Training Matrix and Vector

    training_X_array = X[train]
    training_y_array = y[train].reshape(-1, 1)

    ## Get Testing Matrix Values

    X_test_array = X[test]
    y_actual_values = y[test]

    ## Fit the Linear Regression Model

    lr_model = LinearRegression().fit(training_X_array, training_y_array)

    ## Compute the predictions for the test data

    prediction = lr_model.predict(X_test_array)      
    crime_probabilites = np.array(prediction)   

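    ## Calculate the RMSE (flatten both arrays so an (n, 1) prediction
    ## does not broadcast against a 1-D target)

    RMSE_cross_fold = RMSEcalc(crime_probabilites.ravel(), y_actual_values.ravel())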

    ## Add each RMSE_cross_fold value to the sum

    RMSE_sum=RMSE_cross_fold+RMSE_sum

## Calculate the average and print    

RMSE_cross_fold_avg=RMSE_sum/RMSE_length

print('The Mean RMSE across all folds is',RMSE_cross_fold_avg)
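The RMSEcalc helper is not shown in the post. The sketch below assumes it is a plain root-mean-squared-error function (a hypothetical stand-in, not the author's code) and, on synthetic data, cross-checks the manual loop against scikit-learn's cross_val_score with the "neg_root_mean_squared_error" scorer, which is available in recent scikit-learn versions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score


def RMSEcalc(predicted, actual):
    """Hypothetical stand-in for the helper used above: root mean squared error."""
    predicted = np.asarray(predicted).ravel()
    actual = np.asarray(actual).ravel()
    return np.sqrt(np.mean((predicted - actual) ** 2))


# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=10)

# Manual loop: fit on the train rows, score on the held-out rows
manual = [
    RMSEcalc(LinearRegression().fit(X[train], y[train]).predict(X[test]), y[test])
    for train, test in kf.split(X)
]

# Same folds through scikit-learn's scorer (returns the negated RMSE per fold)
scores = cross_val_score(
    LinearRegression(), X, y, cv=kf, scoring="neg_root_mean_squared_error"
)

print(np.mean(manual), -scores.mean())
```

Passing the same KFold object as cv means both approaches evaluate on identical folds, so the two averages should agree.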
