Why is my KNeighborsRegressor training accuracy decreasing while the test accuracy increases?

Question

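Problem Summary

I am working with the 1.88 Million US Wildfires dataset and using scikit-learn's KNeighborsRegressor to regress on "FIRE_SIZE". I am getting the output below and am somewhat confused about why my training accuracy decreases while my test accuracy increases. I'm looking for some insight into what might be happening behind the scenes.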

Output Snippet


1-Nearest Neighbor(s) Results:

Test RMSE:  7495.765269614677
Train Accuracy:  0.9995951877448755
Test Accuracy:  0.04561166544992734 

--x--

3-Nearest Neighbor(s) Results:

Test RMSE:  5798.419599886992
Train Accuracy:  0.5157901853607345
Test Accuracy:  0.4288996249038137 

--x--

5-Nearest Neighbor(s) Results:

Test RMSE:  4370.705370544834
Train Accuracy:  0.3818744943896586
Test Accuracy:  0.6755138015850977 

--x--

7-Nearest Neighbor(s) Results:

Test RMSE:  5234.077626536805
Train Accuracy:  0.32715455088444
Test Accuracy:  0.5346566791409124 

--x--

9-Nearest Neighbor(s) Results:

Test RMSE:  4833.210891971975
Train Accuracy:  0.2925369697746403
Test Accuracy:  0.603206401422826 

--x--

11-Nearest Neighbor(s) Results:

Test RMSE:  4662.668487875189
Train Accuracy:  0.27812301457721345
Test Accuracy:  0.6307145104081042 

--x--

13-Nearest Neighbor(s) Results:

Test RMSE:  4475.217632469529
Train Accuracy:  0.2623128334766227
Test Accuracy:  0.659810044524328 

--x--

Code for the Regression

from math import sqrt

from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor


def k_nearest_neighbors(X, y, n):
  # Get training and testing splits (99% train, 1% test).
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01, random_state=42)

  # Initialize a KNeighborsRegressor with n neighbors and fit it on the training split.
  regressor = KNeighborsRegressor(n_neighbors=n, n_jobs=-1)
  regressor.fit(X_train, y_train)

  # Predict once on the test split and reuse the predictions for both test metrics.
  test_predictions = regressor.predict(X_test)
  mse_test = mean_squared_error(y_test, test_predictions)  # Mean squared error, test
  test_score = r2_score(y_test, test_predictions)  # R^2 score, test

  # R^2 score on the training split.
  train_predictions = regressor.predict(X_train)
  train_score = r2_score(y_train, train_predictions)

  return {'rmse': sqrt(mse_test), 'train': train_score, 'test': test_score}

Code That Produces the Output Snippet

for i in range(1, 15, 2):
  print(f'{i}-Nearest Neighbor(s) Results:\n')
  
  X, y = get_prediction_df(conn, cols_with_log, 'FIRE_SIZE', 700000, geohash_precision=2)
  result = k_nearest_neighbors(X, y, i)

  print('Test RMSE: ', result['rmse'])
  print('Train Accuracy: ', result['train'])
  print('Test Accuracy: ', result['test'], '\n')
  print('--x--\n')

Answer 1

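This is discussed in detail in the following thread: https://stats.stackexchange.com/questions/59630/test-accuracy-higher-than-training-how-to-interpret

In your case the train-test split is 99:1, which is not a recommended split and may be one of the reasons for the strange results. Go for a 90-10 or 80-20 split and use K-fold cross-validation (with K = 10 or 20), then evaluate your results again.

Cross-validation is explained well here: https://towardsdatascience.com/building-ak-nearest-neighbors-k-nn-model-with-scikit-learn-51209555453a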
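As a rough illustration of the K-fold suggestion above, here is a minimal sketch using scikit-learn's KFold and cross_val_score. The data is synthetic and only stands in for the wildfire features and FIRE_SIZE; the helper name cross_validated_r2 is made up for this example and is not part of the original code.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the real features and target (assumption).
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 10.0, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0.0, 0.3, size=500)


def cross_validated_r2(X, y, n_neighbors, n_splits=10):
    """Return the per-fold R^2 scores of a KNeighborsRegressor."""
    model = KNeighborsRegressor(n_neighbors=n_neighbors)
    # Shuffle before splitting so each fold is a random sample of the rows.
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    return cross_val_score(model, X, y, cv=folds, scoring="r2")


for k in (1, 3, 5, 7, 9, 11, 13):
    scores = cross_validated_r2(X, y, k)
    print(f"k={k:2d}  mean R^2={scores.mean():.3f}  std={scores.std():.3f}")
```

Averaging over ten folds means every row is used for testing exactly once, so the score for each k depends much less on which 1% of rows happened to land in a single test split.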

Answer 2

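It looks as though your model is overfitting: as you incorporate more neighbors into the model, you give it the opportunity to learn a more and more complex function, but when you check the model against your test set, it doesn't perform as well. This is because your model has started to learn relationships that are not present in the data it wasn't trained on (and that may not exist at all). From your results, it looks like overfitting starts somewhere between the 3-neighbor and 5-neighbor versions. Maybe try a 4-neighbor version and see whether it gives the best test accuracy?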

=====

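Edit: Based on the conversation in the comments, I wonder whether there is more going on here than just overfitting. Per the advice in this answer:

I think the first step is to check whether the reported training and test performance is actually correct.

In a case like this, I would look at a few examples of predictions that were classed as accurate, and would often realize that I had made a mistake in the evaluation code that made accurate results look inaccurate, or vice versa.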
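One way to start that check: the values labelled "Accuracy" in the question come from r2_score, so they are R^2 scores rather than classification accuracy. The sketch below recomputes R^2 by hand on made-up numbers (not the wildfire data) to show how the metric behaves when the target is heavy-tailed. Whether FIRE_SIZE behaves this way in the 1% test split is an assumption worth verifying, not something the question confirms.

```python
def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed without any library."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up, heavy-tailed targets: one very large value among small ones.
y_true = [1.0, 2.0, 3.0, 4.0, 1000.0]

# Predicting every value exactly.
perfect = r2_by_hand(y_true, y_true)
# Predicting the mean of the targets for every row.
mean_only = r2_by_hand(y_true, [sum(y_true) / len(y_true)] * len(y_true))
# Predicting the four small values exactly but missing the single large one.
miss_big = r2_by_hand(y_true, [1.0, 2.0, 3.0, 4.0, 4.0])

print("perfect:  ", perfect)
print("mean only:", mean_only)
print("miss big: ", miss_big)
```

In this toy case a single badly predicted large value pushes R^2 below zero even though four of the five predictions are exact, so with a small test set the score can swing a lot depending on how a few extreme rows are predicted.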

Answer 3

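Take a look at the image below (taken from here): [image not preserved: training and test error plotted against model complexity]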

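This is a generic representation of the bias-variance tradeoff in machine learning. The lower of the two curves represents your training error, and the upper curve represents the test (or validation) error.

When your model has low complexity, e.g. a small number of predictors, both errors are high, but as you add more data they both start to decrease, though only up to a point. The training error will keep decreasing as the model becomes more complex, and it can do so indefinitely. Simply put, as you add lots of data to the model, the algorithm can now better "memorize" all of the training data and predict it accurately.

At the same time, however, the validation error starts to increase because of overfitting: your model can now "memorize" the training data well, but this impairs its ability to make predictions on new data.

Usually, the best model is the one where the test error curve is at its minimum. At that point you have enough data to explain most of the variance, but not so much that the bias is high.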
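To connect the curve described above to k-nearest neighbors: in KNN the most flexible model is k = 1, where every training point is its own nearest neighbor, and the model gets smoother (less complex) as k grows. The sketch below sweeps k on synthetic data (an assumption; it is not the wildfire dataset) and records train and test R^2. The helper train_test_r2 is a name made up for this example.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic noisy 1-D regression problem (assumption, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(0.0, 0.5, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def train_test_r2(n_neighbors):
    """Fit a KNN regressor and return (train R^2, test R^2)."""
    model = KNeighborsRegressor(n_neighbors=n_neighbors).fit(X_train, y_train)
    train_r2 = r2_score(y_train, model.predict(X_train))
    test_r2 = r2_score(y_test, model.predict(X_test))
    return train_r2, test_r2


results = {k: train_test_r2(k) for k in (1, 3, 5, 7, 9, 11, 13)}
for k, (train_r2, test_r2) in results.items():
    print(f"k={k:2d}  train R^2={train_r2:.3f}  test R^2={test_r2:.3f}")
```

With k = 1 the training score is essentially perfect because each training row is predicted from itself, which matches the 0.9996 training score reported for the 1-neighbor model in the question. A falling training score as k increases is therefore expected behaviour for KNN rather than a sign of a bug on its own.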

3 Solutions

Solution 1
3 Accepted 2020-12-15 02:26:49

Solution 2
2 2020-12-15 01:57:37

Solution 3
1 2020-12-15 02:10:10
