[英]How to incorporate the validation set in machine learning?
我正在嘗試學習機器學習,但我無法理解何時以及如何使用驗證集。 我知道它用於在檢查測試集之前評估候選模型,但我不明白如何在代碼中正確編寫它。 以我正在處理的這段代碼為例:
# Split the set into train, validation, and test set (70:15:15 for train:valid:test)
X_train, X_rem, y_train, y_rem = train_test_split(X,y, train_size=0.7) # Split the data in training and remaining set
X_valid, X_test, y_valid, y_test = train_test_split(X_rem,y_rem, test_size=0.5) # Split the remaining data 50/50 into validation and test set
print("Properties (shapes):\nTraining set: {}\nValidation set: {}\nTest set: {}".format(X_train.shape, X_valid.shape, X_test.shape))
import warnings # supress warnings
warnings.filterwarnings('ignore')
# SCALING
std = StandardScaler()
minmax = MinMaxScaler()
rob = RobustScaler()
# Transforming the TRAINING set
X_train_Standard = std.fit_transform(X_train) # Standardization: each value has mean = 0 and std = 1
X_train_MinMax = minmax.fit_transform(X_train) # Normalization: each value is between 0 and 1
X_train_Robust = rob.fit_transform(X_train) # Robust scales each values variance and quartiles (ignores outliers)
# Transforming the TEST set
X_test_Standard = std.fit_transform(X_test)
X_test_MinMax = minmax.fit_transform(X_test)
X_test_Robust = rob.fit_transform(X_test)
# Test scalers for decision tree classifier
treeStd = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train_Standard, y_train)
treeMinMax = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train_MinMax, y_train)
treeRobust = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train_Robust, y_train)
print("Decision tree with standard scaler:\nTraining set score: {:.4f}\nTest set score: {:.4f}\n".format(treeStd.score(X_train_Standard, y_train), treeStd.score(X_test_Standard, y_test)))
print("Decision tree with min/max scaler:\nTraining set score: {:.4f}\nTest set score: {:.4f}\n".format(treeMinMax.score(X_train_MinMax, y_train), treeMinMax.score(X_test_MinMax, y_test)))
print("Decision tree with robust scaler:\nTraining set score: {:.4f}\nTest set score: {:.4f}\n".format(treeRobust.score(X_train_Robust, y_train), treeRobust.score(X_test_Robust, y_test)))
# Now we train our model for different values of `max_depth`, ranging from 1 to 20.
max_depths = range(1, 30)
training_error = []
for max_depth in max_depths:
model_1 = DecisionTreeRegressor(max_depth=max_depth)
model_1.fit(X,y)
training_error.append(mean_squared_error(y, model_1.predict(X)))
testing_error = []
for max_depth in max_depths:
model_2 = DecisionTreeRegressor(max_depth=max_depth)
model_2.fit(X, y)
testing_error.append(mean_squared_error(y_test, model_2.predict(X_test)))
plt.plot(max_depths, training_error, color='blue', label='Training error')
plt.plot(max_depths, testing_error, color='green', label='Testing error')
plt.xlabel('Tree depth')
plt.axvline(x=25, color='orange', linestyle='--')
plt.annotate('optimum = 25', xy=(20, 20), color='red')
plt.ylabel('Mean squared error')
plt.title('Hyperparameters tuning', pad=20, size=30)
plt.legend()
我將在哪里運行驗證集上的測試? 如何將其合並到代碼中?
首先確保只創建一個 model 繼續使用這個 model。目前你在每個訓練步驟中創建一個 model 並覆蓋舊的。 否則你的 model 永遠不會提高。
其次:驗證集背后的想法是評估你的訓練進度,看看你的 model 如何處理它以前沒有見過的數據。 因此,您需要將其納入您的培訓過程。
所以在你的情況下它看起來像那樣。
model = DecisionTreeRegressor(max_depth=max_depth) # here we create the model we want to use
for max_depth in max_depths:
model.fit(X_train,y_train) # here we train the model
training_error.append(mean_squared_error(y_train, model.predict(X_train))) # here we calculate the training error
val_error.append(mean_squared_error(y_val, model.predict(X_val))) # here we calculate the validation error
test_error = mean_squared_error(y_test, model.predict(X_test)) # here we calculate the test error
確保你只訓練你的訓練數據,而不是你的驗證或測試數據。
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.