
How to incorporate the validation set in machine learning?

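I'm trying to learn machine learning, but I can't work out when and how to use the validation set. I understand that it is used to evaluate candidate models before checking against the test set, but I don't understand how to write that correctly in code. Take this code I'm working on as an example: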

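import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Split the set into train, validation, and test set (70:15:15 for train:valid:test)
X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=0.7)          # Split the data into training and remaining set
X_valid, X_test, y_valid, y_test = train_test_split(X_rem, y_rem, test_size=0.5) # Split the remaining data 50/50 into validation and test set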

print("Properties (shapes):\nTraining set: {}\nValidation set: {}\nTest set: {}".format(X_train.shape, X_valid.shape, X_test.shape))

import warnings # suppress warnings
warnings.filterwarnings('ignore')

# SCALING
std = StandardScaler()
minmax = MinMaxScaler()
rob = RobustScaler()

# Fit each scaler on the TRAINING set and transform it
X_train_Standard = std.fit_transform(X_train)   # Standardization: each feature gets mean 0 and std 1
X_train_MinMax = minmax.fit_transform(X_train)  # Normalization: each feature is rescaled to [0, 1]
X_train_Robust = rob.fit_transform(X_train)     # Robust scaling: centers on the median, scales by the IQR (less sensitive to outliers)

# Transform the TEST set with the scalers fitted on the training set (do not refit)
X_test_Standard = std.transform(X_test)
X_test_MinMax = minmax.transform(X_test)
X_test_Robust = rob.transform(X_test)

# Test scalers for decision tree regressor
treeStd = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train_Standard, y_train)
treeMinMax = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train_MinMax, y_train)
treeRobust = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train_Robust, y_train)
print("Decision tree with standard scaler:\nTraining set score: {:.4f}\nTest set score: {:.4f}\n".format(treeStd.score(X_train_Standard, y_train), treeStd.score(X_test_Standard, y_test)))
print("Decision tree with min/max scaler:\nTraining set score: {:.4f}\nTest set score: {:.4f}\n".format(treeMinMax.score(X_train_MinMax, y_train), treeMinMax.score(X_test_MinMax, y_test)))
print("Decision tree with robust scaler:\nTraining set score: {:.4f}\nTest set score: {:.4f}\n".format(treeRobust.score(X_train_Robust, y_train), treeRobust.score(X_test_Robust, y_test)))

# Now we train our model for different values of `max_depth`, ranging from 1 to 29.

max_depths = range(1, 30)
training_error = []

for max_depth in max_depths:
    model_1 = DecisionTreeRegressor(max_depth=max_depth)
    model_1.fit(X,y)
    training_error.append(mean_squared_error(y, model_1.predict(X)))


testing_error = []
for max_depth in max_depths:
    model_2 = DecisionTreeRegressor(max_depth=max_depth)
    model_2.fit(X, y)
    testing_error.append(mean_squared_error(y_test, model_2.predict(X_test)))

plt.plot(max_depths, training_error, color='blue', label='Training error')
plt.plot(max_depths, testing_error, color='green', label='Testing error')
plt.xlabel('Tree depth')
plt.axvline(x=25, color='orange', linestyle='--')
plt.annotate('optimum = 25', xy=(20, 20), color='red')
plt.ylabel('Mean squared error')
plt.title('Hyperparameters tuning', pad=20, size=30)
plt.legend()

Where would I run the evaluation on the validation set, and how do I incorporate it into the code?

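First, make sure you create only one model and keep using that model; otherwise your model will never improve. At the moment you create a new model at every training step and overwrite the old one.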

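Second, the idea behind the validation set is to evaluate your training progress and see how your model handles data it has not seen before. You therefore need to incorporate it into your training process.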
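To make the "data it has not seen before" point concrete, here is a minimal sketch on synthetic data. The noisy sine curve and the names `X_demo`, `y_demo`, `X_fit`, and `X_unseen` are assumptions made for illustration and are not part of the code in the question. It fits an unrestricted tree and compares the error on the rows it was fitted on with the error on held-out rows:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy sine curve.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 6, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=300)

# Hold out 30% of the rows as data the model never sees while fitting.
X_fit, X_unseen, y_fit, y_unseen = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=0
)

# A tree with no depth limit can memorise the rows it was fitted on.
deep_tree = DecisionTreeRegressor(random_state=0).fit(X_fit, y_fit)

seen_mse = mean_squared_error(y_fit, deep_tree.predict(X_fit))
unseen_mse = mean_squared_error(y_unseen, deep_tree.predict(X_unseen))
print(f"MSE on fitted rows:   {seen_mse:.4f}")
print(f"MSE on held-out rows: {unseen_mse:.4f}")
```

The error on the fitted rows alone would suggest a near-perfect model. Only the held-out error shows how the tree behaves on new data, and that is the role the validation set plays while you are still choosing between candidate models.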

So in your case it would look something like this:

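training_error, val_error = [], []  # one entry per tree depth
model = DecisionTreeRegressor()  # here we create the model we want to use
for max_depth in max_depths:
    model.set_params(max_depth=max_depth)  # change only the depth of the existing model
    model.fit(X_train, y_train)  # here we train the model (training set only)
    training_error.append(mean_squared_error(y_train, model.predict(X_train)))  # here we calculate the training error
    val_error.append(mean_squared_error(y_valid, model.predict(X_valid)))  # here we calculate the validation error
best_depth = max_depths[val_error.index(min(val_error))]  # depth with the lowest validation error
model.set_params(max_depth=best_depth).fit(X_train, y_train)  # refit at the chosen depth
test_error = mean_squared_error(y_test, model.predict(X_test))  # here we calculate the test error, once, at the end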

Make sure you train only on your training data, not on your validation or test data.
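The same rule applies to preprocessing: a scaler is also fitted, so it should see only the training set. Below is a rough end-to-end sketch of one way to arrange this, not the only one. The synthetic `X` and `y` stand in for your data and are an assumption, as are the names `X_train_s`, `X_valid_s`, `X_test_s`, and `best_depth`. It picks `max_depth` on the validation set and touches the test set only once, at the end:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for X and y (an assumption; use your own data instead).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = X[:, 0] ** 2 + np.sin(3 * X[:, 1]) + rng.normal(scale=0.2, size=400)

# 70:15:15 split, as in the question.
X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=0.7, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rem, y_rem, test_size=0.5, random_state=0)

# Fit the scaler on the training set only, then reuse it for the other sets.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_valid_s = scaler.transform(X_valid)
X_test_s = scaler.transform(X_test)

# Model selection: compare depths on the validation set, never on the test set.
max_depths = range(1, 30)
val_error = []
model = DecisionTreeRegressor(random_state=0)
for max_depth in max_depths:
    model.set_params(max_depth=max_depth)
    model.fit(X_train_s, y_train)
    val_error.append(mean_squared_error(y_valid, model.predict(X_valid_s)))

best_depth = max_depths[int(np.argmin(val_error))]

# Final check: refit at the chosen depth and evaluate on the test set once.
model.set_params(max_depth=best_depth).fit(X_train_s, y_train)
test_error = mean_squared_error(y_test, model.predict(X_test_s))
print(f"best max_depth = {best_depth}, test MSE = {test_error:.4f}")
```

Because the test set plays no part in choosing `best_depth` or in fitting the scaler, `test_error` stays an honest estimate of how the selected model does on new data.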

