
Regression scoring results dramatically different for cross_val_score vs .score

I'm running RandomForestRegressor() and using R-squared for scoring. Why do I get dramatically different results with .score versus cross_val_score? Here is the relevant code:

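from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split

# Features and target; df is the DataFrame holding the data
X = df.drop(['y_var'], axis=1)
y = df['y_var']

# Hold out a third of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# Random Forest Regression
rfr = RandomForestRegressor()
model_rfr = rfr.fit(X_train, y_train)
pred_rfr = rfr.predict(X_test)
# R-squared on the held-out test set
result_rfr = model_rfr.score(X_test, y_test)

# cross-validation: one R-squared value per fold
rfr_cv_r2 = cross_val_score(rfr, X, y, cv=5, scoring='r2')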

I understand that cross-validation scores the model several times while .score does so only once, but the results are so radically different that something is clearly wrong. Here are the results:

R2-dot-score: .99072
R2-cross-val: [0.5349302  0.65832268 0.52918704 0.74957719 0.45649582]

What am I doing wrong? Or what might explain this discrepancy?

EDIT:

OK, I may have solved this. It seems that cross_val_score does not shuffle the data, which can lead to worse predictions when similar rows are grouped together. The easiest solution I found (via this answer) was to simply shuffle the DataFrame before running the model:

shuffled_df = df.reindex(np.random.permutation(df.index))
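
For the shuffle to have any effect, X and y presumably need to be rebuilt from shuffled_df before calling cross_val_score. A minimal sketch of that step, assuming a small hypothetical DataFrame in place of the original df and using pandas' sample as an alternative way to shuffle rows (the random_state value is only an illustrative choice for reproducibility):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the original df: rows sorted by the target.
df = pd.DataFrame({"x1": np.arange(10), "y_var": np.arange(10) * 2.0})

# Shuffle all rows reproducibly; frac=1 keeps every row.
shuffled_df = df.sample(frac=1, random_state=0)

# Rebuild X and y from the shuffled frame so both share the new row order.
X = shuffled_df.drop(["y_var"], axis=1)
y = shuffled_df["y_var"]
```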

After I did that, I started getting similar results from .score and cross_val_score:

R2-dot-score: 0.9910715555903232
R2-cross-val: [0.99265184 0.9923142  0.9922923  0.99259524 0.99195022]
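
An alternative to shuffling the DataFrame is to let the cross-validation splitter do the shuffling. With cv=5 and a regressor, cross_val_score uses KFold without shuffling, whereas train_test_split shuffles by default, which would be consistent with the gap seen above. The sketch below uses synthetic data sorted by the target as a stand-in for the original df (an assumption, since the real data is not shown) and passes a KFold with shuffle=True; the n_estimators and random_state values are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the original data: rows are ordered by the
# target, so neighbouring rows are similar (assumption for illustration).
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 3.0 * X.ravel() + rng.normal(scale=0.1, size=200)

rfr = RandomForestRegressor(n_estimators=50, random_state=0)

# cv=5 on a regressor means KFold without shuffling: each test fold is a
# contiguous block of rows that the training folds do not cover.
unshuffled = cross_val_score(rfr, X, y, cv=5, scoring="r2")

# A KFold with shuffle=True randomizes which rows land in each fold.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled = cross_val_score(rfr, X, y, cv=kf, scoring="r2")

print("unshuffled:", unshuffled)
print("shuffled:  ", shuffled)
```

Whether shuffling is appropriate depends on the data. If the rows are ordered in time or belong to groups, the shuffled score may be optimistic, and the unshuffled folds may be closer to how the model would perform on genuinely new data.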

