scikit-learn scores are different when using cross_val_predict vs cross_val_score
I expected both methods to return rather similar errors; can someone point me to the mistake, please?
Calculating RMSE...
rf = RandomForestRegressor(random_state=555, n_estimators=100, max_depth=8)
rf_preds = cross_val_predict(rf, train_, targets, cv=7, n_jobs=7)
print("RMSE Score using cv preds: {:0.5f}".format(metrics.mean_squared_error(targets, rf_preds, squared=False)))
scores = cross_val_score(rf, train_, targets, cv=7, scoring='neg_root_mean_squared_error', n_jobs=7)
print("RMSE Score using cv_score: {:0.5f}".format(scores.mean() * -1))
RMSE Score using cv preds: 0.01658
RMSE Score using cv_score: 0.01073
There are two issues here, both of which are mentioned in the documentation of cross_val_predict:
Results can differ from cross_validate and cross_val_score unless all tests sets have equal size and the metric decomposes over samples.
The first is to make all sets (training and test) the same in both cases, which is not the case in your example. To do so, we need to use the KFold class to define our CV folds, and then pass these same folds to both functions. Here is an example with dummy data:
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score, cross_val_predict
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
X, y = make_regression(n_samples=2000, n_features=4, n_informative=2,
                       random_state=42, shuffle=False)
rf = RandomForestRegressor(max_depth=2, random_state=0)
kf = KFold(n_splits=5)
rf_preds = cross_val_predict(rf, X, y, cv=kf, n_jobs=5)
print("RMSE Score using cv preds: {:0.5f}".format(mean_squared_error(y, rf_preds, squared=False)))
scores = cross_val_score(rf, X, y, cv=kf, scoring='neg_root_mean_squared_error', n_jobs=5)
print("RMSE Score using cv_score: {:0.5f}".format(scores.mean() * -1))
The result of the above code snippet (fully reproducible, since we have explicitly set all the necessary random seeds) is:
RMSE Score using cv preds: 15.16839
RMSE Score using cv_score: 15.16031
So, we can see that the two scores are indeed similar, but still not identical.
Why is that? The answer lies in the rather cryptic second part of the quoted sentence above, i.e. the RMSE score does not decompose over samples (to be honest, I don't know of any ML score that does).
In simple words, computing the RMSE over the predictions returned by cross_val_predict follows its definition strictly, i.e. (pseudocode):
RMSE = square_root([(y[1] - y_pred[1])^2 + (y[2] - y_pred[2])^2 + ... + (y[n] - y_pred[n])^2]/n)
where n is the number of samples. The cross_val_score function does not do exactly that; instead, it computes the RMSE for each one of the k CV folds, and then averages these k values, i.e. (pseudocode again):
RMSE = (RMSE[1] + RMSE[2] + ... + RMSE[k])/k
And exactly because the RMSE is not decomposable over the samples, these two values, although close, are not identical.
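To make the difference between the two formulas concrete, here is a minimal sketch in plain Python. The squared errors are made-up illustrative values (not taken from the question or from the example above), arranged as two equal-sized folds:

```python
import math

# Hypothetical squared errors for two equal-sized CV folds (illustrative values only).
fold_sq_errors = [
    [1.0, 1.0, 1.0, 1.0],   # fold 1: small errors
    [9.0, 9.0, 9.0, 9.0],   # fold 2: large errors
]

# Pooled RMSE: a single square root over all samples, as in the first pseudocode formula.
all_sq_errors = [e for fold in fold_sq_errors for e in fold]
pooled_rmse = math.sqrt(sum(all_sq_errors) / len(all_sq_errors))

# Mean of per-fold RMSEs, as in the second pseudocode formula.
fold_rmses = [math.sqrt(sum(fold) / len(fold)) for fold in fold_sq_errors]
mean_fold_rmse = sum(fold_rmses) / len(fold_rmses)

print("pooled RMSE:       {:0.5f}".format(pooled_rmse))     # 2.23607
print("mean of fold RMSE: {:0.5f}".format(mean_fold_rmse))  # 2.00000
```

Because the square root is concave, the mean of the per-fold RMSEs cannot exceed the pooled RMSE when the folds have the same size, which is consistent with the ordering of the two scores printed earlier.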
We can actually demonstrate that this is indeed the case by doing the CV procedure manually and emulating the RMSE calculation as done by cross_val_score and described above, i.e.:
import numpy as np
RMSE__cv_score = []
# Fit on each training split and score the held-out fold separately
for train_index, val_index in kf.split(X):
    rf.fit(X[train_index], y[train_index])
    pred = rf.predict(X[val_index])
    err = mean_squared_error(y[val_index], pred, squared=False)
    RMSE__cv_score.append(err)
# Average the per-fold RMSE values, as cross_val_score does
print("RMSE Score using manual cv_score: {:0.5f}".format(np.mean(RMSE__cv_score)))
The result being:
RMSE Score using manual cv_score: 15.16031
i.e. identical to the one returned by cross_val_score above.
So, if we want to be very precise, the truth is that the correct RMSE (i.e. calculated exactly according to its definition) is the one obtained from the cross_val_predict output; cross_val_score returns an approximation of it. But in practice, we often find that the difference is not that significant, so we can also use cross_val_score if it is more convenient.
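If the exact, definition-based RMSE is needed but cross_val_score is still preferred, one possible workaround is to score each fold with the MSE and take the square root once, after averaging. The sketch below is an illustration rather than part of the original answer. It assumes equal-sized folds (500 samples in 5 folds here) and a fixed random_state, and the data set and forest sizes are arbitrary choices made to keep it fast:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

# Small dummy data set: 500 samples split into 5 folds of exactly 100 each.
X, y = make_regression(n_samples=500, n_features=4, n_informative=2,
                       random_state=42, shuffle=False)
rf = RandomForestRegressor(n_estimators=20, max_depth=2, random_state=0)
kf = KFold(n_splits=5)

# Pooled RMSE computed from the out-of-fold predictions.
preds = cross_val_predict(rf, X, y, cv=kf)
pooled_rmse = np.sqrt(mean_squared_error(y, preds))

# Per-fold MSE; the sign is flipped because the scorer returns negated errors.
fold_mse = -cross_val_score(rf, X, y, cv=kf, scoring="neg_mean_squared_error")
rmse_from_fold_mse = np.sqrt(fold_mse.mean())  # root taken once, after averaging
mean_of_fold_rmse = np.sqrt(fold_mse).mean()   # root taken per fold, then averaged

print("pooled RMSE (cross_val_predict): {:0.5f}".format(pooled_rmse))
print("sqrt of mean fold MSE:           {:0.5f}".format(rmse_from_fold_mse))
print("mean of fold RMSE:               {:0.5f}".format(mean_of_fold_rmse))
```

With equal-sized folds the first two numbers should agree up to floating-point error, while the third is the value that scoring='neg_root_mean_squared_error' would report. If the folds differ in size, the first two are no longer guaranteed to match.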