10-fold cross-validation and obtaining RMSE

I'm trying to compare the RMSE I get from performing multiple linear regression on the full data set to the RMSE from 10-fold cross-validation, using the KFold module in scikit-learn. I found some code that I tried to adapt, but I can't get it to work (and I suspect it never worked in the first place).

TIA for any help!

Here's my linear regression function:

  def standRegres(xArr,yArr):
      xMat = np.mat(xArr); yMat = np.mat(yArr).T
      xTx = xMat.T*xMat
      if np.linalg.det(xTx) == 0.0:
          print("This matrix is singular, cannot do inverse")
          return
      ws = xTx.I * (xMat.T*yMat)
      return ws

  ##  I run it on my matrix ("comm_df") and my dependent var (comm_target)

  ##  Calculate RMSE (omitted some code)

  initial_regress_RMSE = np.sqrt(np.mean((yHat_array - comm_target_array)**2))

  ##  Now trying to get RMSE after training model through 10-fold cross validation

  from sklearn.model_selection import KFold
  from sklearn.linear_model import LinearRegression

  kf = KFold(n_splits=10)
  xval_err = 0
  for train, test in kf:
      linreg.fit(comm_df,comm_target)
      p = linreg.predict(comm_df)
      e = p-comm_target
      xval_err += np.sqrt(np.dot(e,e)/len(comm_df))

  rmse_10cv = xval_err/10

I get an error saying that the KFold object is not iterable.

There are several things you need to correct in this code.

  • You cannot iterate over kf. You can only iterate over kf.split(comm_df).

  • You need to use the train/test split that KFold provides, which your code currently ignores. The goal of KFold is to fit the regression on the train observations and to evaluate it (i.e. compute the RMSE in your case) on the test observations.
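To make the two points above concrete, here is a minimal sketch on a toy array. The array `X_demo`, the name `kf_demo`, and the choice of 5 splits are illustrative assumptions, not part of the original question. It shows that the `KFold` object only holds the splitting configuration, while `split(...)` yields pairs of integer index arrays.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data, purely illustrative: 10 samples, 2 features.
X_demo = np.arange(20).reshape(10, 2)

kf_demo = KFold(n_splits=5)

# Iterating over kf_demo itself raises TypeError; iterate over split() instead.
folds = list(kf_demo.split(X_demo))

for fold_number, (train_idx, test_idx) in enumerate(folds):
    # train_idx and test_idx hold row positions, not the rows themselves.
    print(fold_number, train_idx, test_idx)
```

Each row position appears in exactly one test fold, so every observation is used for evaluation once and for fitting in the remaining folds.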

With this in mind, here is how I would correct your code (it is assumed here that your data is in NumPy arrays, but you can easily switch to pandas):

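import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

linreg = LinearRegression()
kf = KFold(n_splits=10)
xval_err = 0
for train, test in kf.split(comm_df):
    # Fit on the training rows only
    linreg.fit(comm_df[train],comm_target[train])
    # Predict and compute the error on the held-out rows
    p = linreg.predict(comm_df[test])
    e = p-comm_target[test]
    xval_err += np.sqrt(np.dot(e,e)/len(comm_target[test]))

# Average the per-fold RMSE values
rmse_10cv = xval_err/10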
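If the data is held in pandas objects rather than NumPy arrays, indexing a DataFrame with `comm_df[train]` is treated as a column lookup rather than a row selection, so positional indexing with `.iloc` is needed. A minimal sketch of that variant follows. The names `demo_df` and `demo_target` and the generated values are synthetic stand-ins assumed for illustration, since the real data is not shown.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic stand-ins for comm_df / comm_target (illustrative only).
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
demo_target = pd.Series(
    demo_df.to_numpy() @ np.array([1.0, -2.0, 0.5])
    + rng.normal(scale=0.1, size=100)
)

kf = KFold(n_splits=10)
linreg = LinearRegression()
fold_rmses = []
for train, test in kf.split(demo_df):
    # .iloc selects rows by position, which is what KFold's indices are.
    linreg.fit(demo_df.iloc[train], demo_target.iloc[train])
    p = linreg.predict(demo_df.iloc[test])
    e = p - demo_target.iloc[test].to_numpy()
    fold_rmses.append(np.sqrt(np.mean(e ** 2)))

rmse_10cv = np.mean(fold_rmses)
print("Mean RMSE across folds:", rmse_10cv)
```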

So the code you provided still threw an error. I abandoned what I had above in favor of the following, which works:

## KFold cross-validation

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

## Define variables for the for loop

kf = KFold(n_splits=10)
RMSE_sum=0
RMSE_length=10
X = np.array(comm_df)
y = np.array(comm_target)

for loop_number, (train, test) in enumerate(kf.split(X)):

    ## Get Training Matrix and Vector

    training_X_array = X[train]
    training_y_array = y[train].reshape(-1, 1)

    ## Get Testing Matrix Values

    X_test_array = X[test]
    y_actual_values = y[test]

    ## Fit the Linear Regression Model

    lr_model = LinearRegression().fit(training_X_array, training_y_array)

    ## Compute the predictions for the test data

    prediction = lr_model.predict(X_test_array)      
    crime_probabilites = np.array(prediction)   

    ## Calculate the RMSE (RMSEcalc is a helper function not shown in this post)

    RMSE_cross_fold = RMSEcalc(crime_probabilites, y_actual_values)

    ## Add each RMSE_cross_fold value to the sum

    RMSE_sum=RMSE_cross_fold+RMSE_sum

## Calculate the average and print    

RMSE_cross_fold_avg=RMSE_sum/RMSE_length

print('The Mean RMSE across all folds is',RMSE_cross_fold_avg)
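For comparison, scikit-learn can run the same per-fold loop through `cross_val_score`. The sketch below uses synthetic data in place of `comm_df` and `comm_target`, and defines a hypothetical `rmse` helper standing in for `RMSEcalc`, whose definition is not shown in the post. The scoring string `"neg_root_mean_squared_error"` requires scikit-learn 0.22 or later.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score


def rmse(predicted, actual):
    """Hypothetical stand-in for RMSEcalc: root mean squared error."""
    # ravel() flattens both inputs, so an (n, 1) prediction and an (n,)
    # target cannot silently broadcast into an (n, n) difference.
    diff = np.ravel(predicted) - np.ravel(actual)
    return np.sqrt(np.mean(diff ** 2))


# Synthetic regression data (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, -1.0, 0.0, 3.0]) + rng.normal(scale=0.5, size=200)

kf = KFold(n_splits=10)

# Manual loop: fit on the training rows, score on the held-out rows.
manual_rmses = []
for train, test in kf.split(X):
    model = LinearRegression().fit(X[train], y[train])
    manual_rmses.append(rmse(model.predict(X[test]), y[test]))

# Same folds via cross_val_score; scores are negated RMSE values.
scores = cross_val_score(
    LinearRegression(), X, y, cv=kf, scoring="neg_root_mean_squared_error"
)
auto_rmses = -scores

print("Manual mean RMSE:", np.mean(manual_rmses))
print("cross_val_score mean RMSE:", np.mean(auto_rmses))
```

Note that averaging per-fold RMSE values is not the same quantity as the RMSE of all held-out predictions pooled together, so the same convention should be used on both sides of any comparison.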

 