How do I incorporate a validation set into machine learning?

Question

I am trying to learn about machine learning, and I am having trouble understanding when and how to use the validation set. I understand that it is used to evaluate candidate models before checking against the test set, but I don't understand how to write that properly in code. Take for example this code I am working on:

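import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Split the set into train, validation, and test set (70:15:15 for train:valid:test)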
X_train, X_rem, y_train, y_rem = train_test_split(X,y, train_size=0.7)          # Split the data in training and remaining set
X_valid, X_test, y_valid, y_test = train_test_split(X_rem,y_rem, test_size=0.5) # Split the remaining data 50/50 into validation and test set

print("Properties (shapes):\nTraining set: {}\nValidation set: {}\nTest set: {}".format(X_train.shape, X_valid.shape, X_test.shape))

import warnings # suppress warnings
warnings.filterwarnings('ignore')

# SCALING
std = StandardScaler()
minmax = MinMaxScaler()
rob = RobustScaler()

# Transforming the TRAINING set
X_train_Standard = std.fit_transform(X_train)   # Standardization: each feature has mean = 0 and std = 1
X_train_MinMax = minmax.fit_transform(X_train)  # Normalization: each feature is scaled to the range 0 to 1
X_train_Robust = rob.fit_transform(X_train)     # Robust scaling: centers on the median and scales by the interquartile range (less sensitive to outliers)

# Transforming the TEST set
X_test_Standard = std.fit_transform(X_test)
X_test_MinMax = minmax.fit_transform(X_test)
X_test_Robust = rob.fit_transform(X_test)

# Test scalers for decision tree regressor
treeStd = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train_Standard, y_train)
treeMinMax = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train_MinMax, y_train)
treeRobust = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train_Robust, y_train)
print("Decision tree with standard scaler:\nTraining set score: {:.4f}\nTest set score: {:.4f}\n".format(treeStd.score(X_train_Standard, y_train), treeStd.score(X_test_Standard, y_test)))
print("Decision tree with min/max scaler:\nTraining set score: {:.4f}\nTest set score: {:.4f}\n".format(treeMinMax.score(X_train_MinMax, y_train), treeMinMax.score(X_test_MinMax, y_test)))
print("Decision tree with robust scaler:\nTraining set score: {:.4f}\nTest set score: {:.4f}\n".format(treeRobust.score(X_train_Robust, y_train), treeRobust.score(X_test_Robust, y_test)))

# Now we train our model for different values of `max_depth`, ranging from 1 to 29.

max_depths = range(1, 30)
training_error = []

for max_depth in max_depths:
    model_1 = DecisionTreeRegressor(max_depth=max_depth)
    model_1.fit(X,y)
    training_error.append(mean_squared_error(y, model_1.predict(X)))


testing_error = []
for max_depth in max_depths:
    model_2 = DecisionTreeRegressor(max_depth=max_depth)
    model_2.fit(X, y)
    testing_error.append(mean_squared_error(y_test, model_2.predict(X_test)))

plt.plot(max_depths, training_error, color='blue', label='Training error')
plt.plot(max_depths, testing_error, color='green', label='Testing error')
plt.xlabel('Tree depth')
plt.axvline(x=25, color='orange', linestyle='--')
plt.annotate('optimum = 25', xy=(20, 20), color='red')
plt.ylabel('Mean squared error')
plt.title('Hyperparameters tuning', pad=20, size=30)
plt.legend()

Where would I run the tests on the validation set? How do I incorporate it into the code?

Answer 1

First of all, create one model object and keep reusing it. Currently you create a new model on every loop iteration and overwrite the old one. With scikit-learn this is mostly a matter of tidiness, since fit retrains the estimator from scratch on every call; it matters for models that are trained incrementally, which would otherwise never improve.

Second, the idea behind the validation set is to evaluate the progress of your training, to see how your model performs on data it hasn't seen before. Therefore you need to incorporate it into your training process.

So in your case it would look like this:

model = DecisionTreeRegressor() # here we create the model we want to use
training_error, val_error = [], [] # one entry per max_depth
for max_depth in max_depths:
    model.set_params(max_depth=max_depth) # here we set the depth to try in this iteration
    model.fit(X_train, y_train) # here we train the model (on the training set only)
    training_error.append(mean_squared_error(y_train, model.predict(X_train))) # here we calculate the training error
    val_error.append(mean_squared_error(y_valid, model.predict(X_valid))) # here we calculate the validation error
test_error = mean_squared_error(y_test, model.predict(X_test)) # here we calculate the test error (of the last fitted model)
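
To make the selection step explicit, here is a minimal, self-contained sketch of how the validation errors could be used to pick max_depth before the test set is touched. The synthetic data from make_regression and the name best_depth are assumptions for illustration; substitute your own X and y. It is one reasonable way to do this, not the only one.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the asker's X and y (assumption, for illustration only).
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

# 70:15:15 split, as in the question.
X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=0.7, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rem, y_rem, test_size=0.5, random_state=0)

max_depths = range(1, 30)
training_error, val_error = [], []

model = DecisionTreeRegressor(random_state=0)
for max_depth in max_depths:
    model.set_params(max_depth=max_depth)
    model.fit(X_train, y_train)  # fit on the training set only
    training_error.append(mean_squared_error(y_train, model.predict(X_train)))
    val_error.append(mean_squared_error(y_valid, model.predict(X_valid)))

# Choose the depth with the lowest validation error.
best_depth = max_depths[int(np.argmin(val_error))]

# Refit with the chosen depth, then evaluate on the test set exactly once.
model.set_params(max_depth=best_depth)
model.fit(X_train, y_train)
test_error = mean_squared_error(y_test, model.predict(X_test))
print(f"best max_depth = {best_depth}, test MSE = {test_error:.2f}")
```

The test error is computed only after the depth has been fixed, so it stays an honest estimate of performance on unseen data. Plotting val_error instead of the testing error in your figure would show where the optimum lies without hardcoding it.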

Make sure that you only train on your training data, never on your validation or test data.
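
The same rule applies to preprocessing. The code in the question calls fit_transform on the test set, which re-fits each scaler on test data. A minimal sketch of the usual alternative, assuming synthetic data in place of your X and y and showing only StandardScaler (the variable names with the _std suffix are made up for this example):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the asker's X and y (assumption, for illustration only).
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=0.7, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rem, y_rem, test_size=0.5, random_state=0)

std = StandardScaler()
X_train_std = std.fit_transform(X_train)  # learn mean and std from the training set only
X_valid_std = std.transform(X_valid)      # reuse the training statistics, no re-fitting
X_test_std = std.transform(X_test)        # same for the test set

# Training features are centered exactly; validation features only approximately.
print(X_train_std.mean(axis=0).round(3))
print(X_valid_std.mean(axis=0).round(3))
```

MinMaxScaler and RobustScaler follow the same fit_transform / transform pattern.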
