How to incorporate the validation set in machine learning?
I am trying to learn about machine learning, and I am having trouble understanding when and how to use the validation set. I understand that it is used to evaluate candidate models before checking them against the test set, but I don't understand how to write that properly in code. Take, for example, this code I am working on:
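import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler
from sklearn.tree import DecisionTreeRegressor
# X (features) and y (target) are assumed to be loaded already
# Split the set into train, validation, and test set (70:15:15 for train:valid:test)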
X_train, X_rem, y_train, y_rem = train_test_split(X,y, train_size=0.7) # Split the data in training and remaining set
X_valid, X_test, y_valid, y_test = train_test_split(X_rem,y_rem, test_size=0.5) # Split the remaining data 50/50 into validation and test set
print("Properties (shapes):\nTraining set: {}\nValidation set: {}\nTest set: {}".format(X_train.shape, X_valid.shape, X_test.shape))
import warnings # suppress warnings
warnings.filterwarnings('ignore')
# SCALING
std = StandardScaler()
minmax = MinMaxScaler()
rob = RobustScaler()
# Transforming the TRAINING set
X_train_Standard = std.fit_transform(X_train) # Standardization: each feature has mean = 0 and std = 1
X_train_MinMax = minmax.fit_transform(X_train) # Normalization: each feature is rescaled to the range 0 to 1
X_train_Robust = rob.fit_transform(X_train) # Robust scaling: centers each feature on its median and scales by the interquartile range (less sensitive to outliers)
# Transforming the TEST set (reuse the scalers fitted on the training set; do not refit them here)
X_test_Standard = std.transform(X_test)
X_test_MinMax = minmax.transform(X_test)
X_test_Robust = rob.transform(X_test)
# Test scalers for decision tree regressor
treeStd = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train_Standard, y_train)
treeMinMax = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train_MinMax, y_train)
treeRobust = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train_Robust, y_train)
print("Decision tree with standard scaler:\nTraining set score: {:.4f}\nTest set score: {:.4f}\n".format(treeStd.score(X_train_Standard, y_train), treeStd.score(X_test_Standard, y_test)))
print("Decision tree with min/max scaler:\nTraining set score: {:.4f}\nTest set score: {:.4f}\n".format(treeMinMax.score(X_train_MinMax, y_train), treeMinMax.score(X_test_MinMax, y_test)))
print("Decision tree with robust scaler:\nTraining set score: {:.4f}\nTest set score: {:.4f}\n".format(treeRobust.score(X_train_Robust, y_train), treeRobust.score(X_test_Robust, y_test)))
# Now we train our model for different values of `max_depth`, ranging from 1 to 29.
max_depths = range(1, 30)
training_error = []
for max_depth in max_depths:
    model_1 = DecisionTreeRegressor(max_depth=max_depth)
    model_1.fit(X, y)
    training_error.append(mean_squared_error(y, model_1.predict(X)))
testing_error = []
for max_depth in max_depths:
    model_2 = DecisionTreeRegressor(max_depth=max_depth)
    model_2.fit(X, y)
    testing_error.append(mean_squared_error(y_test, model_2.predict(X_test)))
plt.plot(max_depths, training_error, color='blue', label='Training error')
plt.plot(max_depths, testing_error, color='green', label='Testing error')
plt.xlabel('Tree depth')
plt.axvline(x=25, color='orange', linestyle='--')
plt.annotate('optimum = 25', xy=(20, 20), color='red')
plt.ylabel('Mean squared error')
plt.title('Hyperparameters tuning', pad=20, size=30)
plt.legend()
Where would I run the tests on the validation set? How do I incorporate it into the code?
First, fit your models only on the training split. Your two loops call fit(X, y) on the full dataset, so the test rows leak into training and the "testing error" is not a held-out estimate. Creating a new DecisionTreeRegressor for each value of max_depth is fine: calling fit on a scikit-learn tree retrains it from scratch, so each depth is a separate candidate model.
Second, the idea behind the validation set is to compare those candidates on data they have not seen during training. You therefore compute the validation error inside the tuning loop, choose the max_depth with the lowest validation error, and only then look at the test set.
In your case it would look like this:
training_error = []
val_error = []
for max_depth in max_depths:
    model = DecisionTreeRegressor(max_depth=max_depth, random_state=0)  # one candidate model per depth
    model.fit(X_train, y_train)  # train on the training split only
    training_error.append(mean_squared_error(y_train, model.predict(X_train)))  # training error
    val_error.append(mean_squared_error(y_valid, model.predict(X_valid)))  # validation error
best_depth = max_depths[val_error.index(min(val_error))]  # depth with the lowest validation error
best_model = DecisionTreeRegressor(max_depth=best_depth, random_state=0).fit(X_train, y_train)
test_error = mean_squared_error(y_test, best_model.predict(X_test))  # test error, computed once at the end
Make sure that you only train on your training data, never on your validation or test data.
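For a version that runs end to end, here is a minimal sketch of the train/validation/test workflow. It makes several assumptions that are not in the question: the data is synthetic (`make_regression` stands in for the real `X` and `y`), the scaler is wrapped in a scikit-learn pipeline so that it is fitted only on the data passed to `fit`, `make_model` is a hypothetical helper name, and refitting on train plus validation before the final test is one common choice rather than a requirement.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the asker's X and y (an assumption for this sketch).
X, y = make_regression(n_samples=600, n_features=8, noise=15.0, random_state=0)

# 70:15:15 split, same scheme as in the question, with a fixed seed.
X_train, X_rem, y_train, y_rem = train_test_split(
    X, y, train_size=0.7, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rem, y_rem, test_size=0.5, random_state=0)


def make_model(max_depth):
    """Hypothetical helper: scaler + tree in one pipeline.

    The scaler is fitted only on the data given to fit() and is merely
    applied (not refitted) when predicting on validation or test data.
    """
    return make_pipeline(
        StandardScaler(),
        DecisionTreeRegressor(max_depth=max_depth, random_state=0),
    )


max_depths = range(1, 21)
train_errors, valid_errors = [], []
for max_depth in max_depths:
    # Each depth is a separate candidate, trained on the training split only.
    model = make_model(max_depth).fit(X_train, y_train)
    train_errors.append(mean_squared_error(y_train, model.predict(X_train)))
    valid_errors.append(mean_squared_error(y_valid, model.predict(X_valid)))

# Model selection: keep the depth with the lowest validation error.
best_depth = max_depths[int(np.argmin(valid_errors))]

# Optional step: refit on train + validation with the chosen depth,
# then evaluate on the test set exactly once.
final_model = make_model(best_depth).fit(
    np.vstack([X_train, X_valid]), np.concatenate([y_train, y_valid]))
test_error = mean_squared_error(y_test, final_model.predict(X_test))

print(f"best max_depth: {best_depth}")
print(f"validation MSE at best depth: {min(valid_errors):.2f}")
print(f"test MSE: {test_error:.2f}")
```

The test set plays no part in choosing `best_depth`, so the final test error stays an honest estimate. The same loop could compare scalers as well: build one pipeline per scaler and pick the combination with the lowest validation error.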
Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you republish this post, please credit this site's URL or the original address.