How do I test on held-out data after training with k-fold cross-validation?

Question

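In my code, I have:

Split the dataset into two parts: a training set and a test set (7:3). The dataset consists of 200 rows and 9394 columns.
Defined the model.
Used cross-validation: 10-fold on the training set.
Obtained the accuracy for each fold.
Obtained the mean accuracy: 94.29%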

My points of confusion are:

Is this the right way to do it?
Am I using cross_val_predict() correctly to predict on the test data x?

Remaining tasks:

Plot the accuracy of the model.
Plot the loss of the model.

Could anyone advise on this? Sorry for the long note!

The dataset looks like this (the values are the tf-idf of each word in the news headline and body):

    Unnamed: 0  Unnamed: 0.1    Label   Cosine_Similarity   c0  c1  c2  c3  c4  c5  ... c9386   c9387   c9388   c9389   c9390   c9391   c9392   c9393   c9394   c9395
0   0   0   Real    0.180319    0.000000    0.0 0.000000    0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1   1   1   Real    0.224159    0.166667    0.0 0.000000    0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2   2   2   Real    0.233877    0.142857    0.0 0.000000    0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3   3   3   Real    0.155789    0.111111    0.0 0.000000    0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4   4   4   Real    0.225480    0.000000    0.0 0.111111    0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

The code and output are:

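import pandas as pd
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Dense

df_all = pd.read_csv("C:/Users/shiva/Desktop/allinone200.csv")

dataset = df_all.values
x = dataset[:, 3:]   # features: Cosine_Similarity plus the tf-idf columns
Y = dataset[:, 2]    # labels: the "Label" column

# Encode the string labels as integers, then one-hot encode them
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
y = np_utils.to_categorical(encoded_Y)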

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=15,shuffle=True)
x_train.shape,y_train.shape

def baseline_model():
    model = Sequential()
    model.add(Dense(512, activation='relu', input_dim=x_train.shape[1]))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(2, activation='softmax'))

    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

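Code for fitting the model:

from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import KFold

estimator = KerasClassifier(build_fn=baseline_model, epochs=5, batch_size=4, verbose=1)
kf = KFold(n_splits=10, shuffle=True, random_state=15)

# Print the row indices that go into each fold
for train_index, test_index in kf.split(x_train, y_train):
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)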

Code for getting the results:

from sklearn.model_selection import cross_val_score

results = cross_val_score(estimator, x_train, y_train, cv=kf)
print(results)

Output:

[0.9285714  1.         0.9285714  1.         0.78571427 0.85714287
 1.         1.         0.9285714  1.        ]

Mean accuracy:

print("Accuracy: %0.2f (+/-%0.2f)" % (results.mean()*100, results.std()*2))

Output:

Accuracy: 94.29 (+/-0.14)
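
Note that in the print statement above the mean is multiplied by 100 but the standard deviation is not, so the two numbers are on different scales. A minimal sketch, reusing the fold scores printed earlier and putting both values in percent (the variable names mean_pct and two_std_pct are introduced here for illustration):

```python
import numpy as np

# Fold accuracies copied from the cross_val_score output above
results = np.array([0.9285714, 1.0, 0.9285714, 1.0, 0.78571427,
                    0.85714287, 1.0, 1.0, 0.9285714, 1.0])

mean_pct = results.mean() * 100          # mean accuracy in percent
two_std_pct = results.std() * 2 * 100    # two standard deviations, also in percent

print("Accuracy: %0.2f%% (+/- %0.2f%%)" % (mean_pct, two_std_pct))
```

On that scale the spread across folds is roughly 14 percentage points, which is large for a 94% mean and reflects how few rows (about 14) land in each validation fold.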

Prediction code:

from sklearn.model_selection import cross_val_predict
y_pred = cross_val_predict(estimator, x_test, y_test,cv=kf)
print(y_test[0])
print(y_pred[0])

Output (after processing):

[1. 0.]
0

The prediction seems fine here, because 1 is REAL and 0 is FALSE. y_test is 0 (the one-hot vector [1. 0.]) and y_pred is also 0.

Confusion matrix:

import numpy as np
y_test=np.argmax(y_test, axis=1)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

Output:

array([[32,  0],
       [ 1, 27]], dtype=int64)

Answer 1

Following Andreas's comment about your number of observations, does this help you at all: Keras - Plot training, validation and test set accuracy

Best
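
For the two remaining plotting tasks, here is a rough sketch of what that could look like. It assumes the per-epoch metrics are available as a Keras-style history dict with the keys "accuracy", "val_accuracy", "loss" and "val_loss" (older Keras versions use "acc" and "val_acc" instead). The helper plot_history and the numbers in demo_history are placeholders made up for illustration, not results from the model in the question.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt


def plot_history(history, path):
    """Plot accuracy and loss per epoch from a Keras-style history dict."""
    epochs = range(1, len(history["loss"]) + 1)
    fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(10, 4))

    ax_acc.plot(epochs, history["accuracy"], label="train")
    ax_acc.plot(epochs, history["val_accuracy"], label="validation")
    ax_acc.set_xlabel("epoch")
    ax_acc.set_ylabel("accuracy")
    ax_acc.legend()

    ax_loss.plot(epochs, history["loss"], label="train")
    ax_loss.plot(epochs, history["val_loss"], label="validation")
    ax_loss.set_xlabel("epoch")
    ax_loss.set_ylabel("loss")
    ax_loss.legend()

    fig.savefig(path)
    plt.close(fig)
    return path


# Placeholder values only (hypothetical, NOT measured on the question's data)
demo_history = {
    "accuracy": [0.60, 0.75, 0.85, 0.90, 0.93],
    "val_accuracy": [0.58, 0.70, 0.78, 0.80, 0.81],
    "loss": [0.70, 0.55, 0.40, 0.30, 0.22],
    "val_loss": [0.72, 0.60, 0.52, 0.50, 0.49],
}

out_path = plot_history(demo_history, os.path.join(tempfile.gettempdir(), "history.png"))
```

With a plain Keras model (not the scikit-learn wrapper), the dict would come from something like model.fit(x_train, y_train, validation_split=0.2, epochs=5, batch_size=4).history.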

Answer 2

Unfortunately my comment got too long, so I'll try it here:

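Please have a look at: https://medium.com/mini-distill/effect-of-batch-size-on-training-dynamics-21c14f7a716e

In short, larger batch sizes often produce worse results but train faster, which probably doesn't matter in your case (200 rows). Second, you don't have a (reusable) hold-out set, which can give you a wrong impression of your true accuracy. Getting over 90% accuracy on the first try can mean overfitting, leakage, or imbalanced data (e.g. https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html), or you were simply very lucky. K-fold combined with a small sample size can lead to wrong conclusions: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0224365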
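
On the cross_val_predict question: as far as I understand it, cross_val_predict(estimator, x_test, y_test, cv=kf) clones the estimator and refits it on folds of the test set, so it never evaluates the model that was trained on x_train. A minimal sketch of the usual alternative (cross-validate on the training set, then fit once on the full training set and predict the hold-out once) is below. It assumes synthetic data from make_classification and a LogisticRegression as stand-ins for the real dataset and the Keras model, and swaps in StratifiedKFold to keep the class ratio stable across folds.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in for the 200-row dataset (assumption: not the real data)
X, y = make_classification(n_samples=200, n_features=50, n_informative=10,
                           random_state=15)

# Hold out 30% once; it is only touched in the final step
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=15, stratify=y
)

# Stand-in estimator (assumption); any scikit-learn compatible classifier fits here
estimator = LogisticRegression(max_iter=1000)

# Step 1: cross-validate on the training set only
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=15)
cv_scores = cross_val_score(estimator, X_train, y_train, cv=cv)
print("CV accuracy: %.3f" % cv_scores.mean())

# Step 2: refit on the whole training set, then predict the hold-out once
estimator.fit(X_train, y_train)
y_pred = estimator.predict(X_test)

print("Test accuracy: %.3f" % accuracy_score(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

The cross-validation score is then an estimate used for model selection, and the hold-out score is the one-off check of how the final model behaves on data it has never seen.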

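Some rules of thumb:
1. You want the number of data points (rows) to be 2x the number of features (columns).
2. If you still get a good result without that, it can mean several things. Most likely it is an error in the code or in the methodology.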

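Imagine you have to predict fraud risk for a bank. If the probability of fraud is 1%, I can build you a model that is 99% accurate simply by saying there is never any fraud...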
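
To put numbers on that, here is a small sketch with made-up labels (1000 transactions, 10 of them fraud, chosen only to match the 1% in the example) and a "model" that always predicts "no fraud":

```python
# Hypothetical labels: 1 = fraud, 0 = no fraud (1% fraud rate, as in the example)
y_true = [1] * 10 + [0] * 990

# A "model" that always answers "no fraud"
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the fraud class: how many of the real fraud cases were caught
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / sum(y_true)

print("accuracy:", accuracy)
print("fraud recall:", recall)
```

The accuracy is 0.99 while the recall on fraud is 0.0, which is why a confusion matrix or per-class metrics say more than accuracy alone when the classes are imbalanced.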

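Neural networks are very powerful, which is both good and bad. The bad part is that they will almost always find some pattern, even when there is none. If you give them 2000 columns, it is a bit like the number pi: if you search the digits after the decimal point for long enough, you will find any combination of digits you want. There is a more detailed explanation here: https://medium.com/@jennifer.zzz/more-features-than-data-points-in-linear-regression-5bcabba6883e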
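
The same effect shows up even with a plain linear model. The sketch below uses pure random noise as features and random labels (so there is nothing to learn by construction) with many more columns than rows; the sizes 50, 200 and 500 are arbitrary choices for illustration, and a least-squares fit stands in for the neural network.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_features = 50, 200, 500  # far more columns than training rows

# Pure noise: the labels are independent of the features
X_train = rng.normal(size=(n_train, n_features))
y_train = rng.choice([-1.0, 1.0], size=n_train)
X_test = rng.normal(size=(n_test, n_features))
y_test = rng.choice([-1.0, 1.0], size=n_test)

# Minimum-norm least-squares fit; with more features than rows it can
# reproduce the training labels exactly
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_acc = np.mean(np.sign(X_train @ w) == y_train)
test_acc = np.mean(np.sign(X_test @ w) == y_test)

print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```

The training accuracy is perfect even though the data is noise, while the accuracy on fresh rows stays near chance level. That is the pattern to rule out when 200 rows are paired with thousands of tf-idf columns.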
