10-fold cross-validation and obtaining RMSE
I'm trying to compare the RMSE I get from performing multiple linear regression on the full data set to that of 10-fold cross-validation, using the KFold class in scikit-learn. I found some code that I tried to adapt, but I can't get it to work (and I suspect it never worked in the first place).
TIA for any help!
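import numpy as np

def standRegres(xArr, yArr):
    # Ordinary least squares via the normal equations: ws = (X^T X)^-1 X^T y
    # np.asmatrix is used because the np.mat alias was removed in NumPy 2.0
    xMat = np.asmatrix(xArr); yMat = np.asmatrix(yArr).T
    xTx = xMat.T*xMat
    if np.linalg.det(xTx) == 0.0:
        print("This matrix is singular, cannot do inverse")
        return
    ws = xTx.I * (xMat.T*yMat)
    return ws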
## I run it on my matrix ("comm_df") and my dependent var (comm_target)
## Calculate RMSE (omitted some code)
initial_regress_RMSE = np.sqrt(np.mean((yHat_array - comm_target_array)**2))
## Now trying to get RMSE after training model through 10-fold cross validation
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
kf = KFold(n_splits=10)
xval_err = 0
for train, test in kf:
    linreg.fit(comm_df,comm_target)
    p = linreg.predict(comm_df)
    e = p-comm_target
    xval_err += np.sqrt(np.dot(e,e)/len(comm_df))
rmse_10cv = xval_err/10
I get an error saying that the KFold object is not iterable.
There are several things you need to correct in this code.
You cannot iterate over kf. You can only iterate over kf.split(comm_df).
You need to use the train/test split that KFold provides, which your code currently ignores. The goal of KFold is to fit your regression on the training observations and to evaluate it (i.e. compute the RMSE in your case) on the test observations.
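To make the role of the split concrete, here is a minimal sketch of what kf.split yields. The names X_demo, kf_demo and fold_sizes are made up for illustration, and the data is an arbitrary placeholder rather than the poster's matrix.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 20 rows, 3 features (values are arbitrary placeholders).
X_demo = np.arange(60).reshape(20, 3)

kf_demo = KFold(n_splits=10)
fold_sizes = []
for train_idx, test_idx in kf_demo.split(X_demo):
    # train_idx and test_idx are arrays of integer row positions,
    # not the data rows themselves.
    fold_sizes.append((len(train_idx), len(test_idx)))

print(fold_sizes[0])  # (18, 2): 18 training rows, 2 held-out rows
```

Each iteration hands back two index arrays, so the model has to be fit on the rows selected by the first and scored on the rows selected by the second.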
With this in mind, here is how I would correct your code (it is assumed here that your data is in NumPy arrays, but you can easily switch to pandas):
linreg = LinearRegression()
kf = KFold(n_splits=10)
xval_err = 0
for train, test in kf.split(comm_df):
    # Fit on the training rows only
    linreg.fit(comm_df[train], comm_target[train])
    # Evaluate on the held-out rows
    p = linreg.predict(comm_df[test])
    e = p - comm_target[test]
    xval_err += np.sqrt(np.dot(e, e)/len(comm_target[test]))
rmse_10cv = xval_err/10
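If the data lives in pandas objects rather than NumPy arrays, plain bracket indexing such as comm_df[train] looks up column labels instead of row positions, so the loop needs positional indexing. Below is a rough sketch of that variant, assuming a DataFrame of features and a Series target. The names features, target and fold_rmse, and the synthetic data, are stand-ins and not the poster's actual data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic stand-ins for comm_df / comm_target (not the poster's data).
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
target = pd.Series(
    features.to_numpy() @ np.array([1.0, -2.0, 0.5])
    + rng.normal(scale=0.1, size=100)
)

linreg = LinearRegression()
kf = KFold(n_splits=10)
fold_rmse = []
for train, test in kf.split(features):
    # .iloc selects rows by integer position, which is what KFold returns.
    linreg.fit(features.iloc[train], target.iloc[train])
    residuals = linreg.predict(features.iloc[test]) - target.iloc[test].to_numpy()
    fold_rmse.append(np.sqrt(np.mean(residuals ** 2)))

# Average of the per-fold RMSE values.
rmse_10cv = float(np.mean(fold_rmse))
print(rmse_10cv)
```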
So the code you provided still threw an error. I abandoned what I had above in favor of the following, which works:
## KFold cross-validation
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
## Define variables for the for loop
kf = KFold(n_splits=10)
RMSE_sum=0
RMSE_length=10
X = np.array(comm_df)
y = np.array(comm_target)
for loop_number, (train, test) in enumerate(kf.split(X)):
    ## Get Training Matrix and Vector
    training_X_array = X[train]
    training_y_array = y[train].reshape(-1, 1)
    ## Get Testing Matrix Values
    X_test_array = X[test]
    y_actual_values = y[test]
    ## Fit the Linear Regression Model
    lr_model = LinearRegression().fit(training_X_array, training_y_array)
    ## Compute the predictions for the test data
    prediction = lr_model.predict(X_test_array)
    crime_probabilites = np.array(prediction)
    ## Calculate the RMSE (RMSEcalc is a helper not shown in this post)
    RMSE_cross_fold = RMSEcalc(crime_probabilites, y_actual_values)
    ## Add each RMSE_cross_fold value to the sum
    RMSE_sum=RMSE_cross_fold+RMSE_sum
## Calculate the average and print
RMSE_cross_fold_avg=RMSE_sum/RMSE_length
print('The Mean RMSE across all folds is',RMSE_cross_fold_avg)
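For reference, the same comparison can be sketched in a self-contained way with scikit-learn's cross_val_score. This is one possible version under stated assumptions, not the poster's method: the data is synthetic, and the rmse helper is a hypothetical stand-in for the RMSEcalc function that is not shown above. It flattens both inputs because a fit on an (n, 1) target returns (n, 1) predictions, and subtracting an (n,) vector from those would broadcast to an (n, n) matrix rather than compare element by element.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

def rmse(predicted, actual):
    # Flatten both inputs so an (n, 1) prediction and an (n,) target
    # are compared element-wise instead of being broadcast to (n, n).
    predicted = np.asarray(predicted).ravel()
    actual = np.asarray(actual).ravel()
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

# Synthetic placeholder data (not the poster's data set).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=200)

# RMSE when fitting and evaluating on the full data set.
full_model = LinearRegression().fit(X, y)
full_rmse = rmse(full_model.predict(X), y)

# scikit-learn reports negated MSE per fold, so flip the sign
# before taking the square root.
neg_mse = cross_val_score(
    LinearRegression(), X, y,
    cv=KFold(n_splits=10),
    scoring="neg_mean_squared_error",
)
cv_rmse = float(np.mean(np.sqrt(-neg_mse)))

print("Full-data RMSE:", full_rmse)
print("Mean 10-fold RMSE:", cv_rmse)
```

The full-data RMSE is measured on the same rows the model was fit on, while each cross-validated value is measured on rows the model did not see, which is the point of comparing the two.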