I am looking at implementing hyper-parameter tuning for a feed-forward neural network (FNN) built with PyTorch. My original FNN (the model is named net) is trained with a mini-batch learning approach over a number of epochs:
# Imports needed by this snippet
import torch
import torch.nn as nn
from torch.autograd import Variable  # Variable is deprecated; plain tensors now carry autograd
from sklearn.utils import shuffle    # assuming scikit-learn's shuffle utility

# Parameters
batch_size = 50        # larger batch sizes led to overfitting
num_epochs = 1000
learning_rate = 0.01   # step size: the amount the weights are updated during training
batch_no = len(x_train) // batch_size

criterion = nn.CrossEntropyLoss()  # classification loss; expects raw class scores (logits) and integer labels
optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate)

for epoch in range(num_epochs):
    if epoch % 20 == 0:
        print('Epoch {}'.format(epoch + 1))
    x_train, y_train = shuffle(x_train, y_train)
    # Mini-batch learning: batch size < n (batch gradient descent) but > 1 (stochastic gradient descent)
    for i in range(batch_no):
        start = i * batch_size
        end = start + batch_size
        x_var = Variable(torch.FloatTensor(x_train[start:end]))
        y_var = Variable(torch.LongTensor(y_train[start:end]))
        # Forward + backward + optimize
        optimizer.zero_grad()
        ypred_var = net(x_var)
        loss = criterion(ypred_var, y_var)
        loss.backward()
        optimizer.step()
Lastly, I test the model on a separate test set.
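For reference, the evaluation step looks roughly like this (a minimal sketch; x_test and y_test are assumed to be held-out numpy arrays of the same form as x_train and y_train):

# Minimal evaluation sketch (x_test / y_test assumed to exist alongside x_train / y_train)
net.eval()
with torch.no_grad():
    test_logits = net(torch.FloatTensor(x_test))
    test_preds = test_logits.argmax(dim=1)
    test_acc = (test_preds == torch.LongTensor(y_test)).float().mean().item()
print('Test accuracy: {:.3f}'.format(test_acc))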
I came across an approach that uses randomised search to tune the hyper-parameters and also implements k-fold cross-validation (RandomizedSearchCV).
My question is two-fold (no pun intended!). The first part is theoretical: is k-fold cross-validation necessary, or does it add any benefit, for a mini-batch feed-forward neural network? From what I can see, the mini-batch approach should do roughly the same job in preventing over-fitting.
I also found a good answer here, but I'm not sure it addresses a mini-batch approach specifically.
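For concreteness, this is roughly what I imagine k-fold cross-validation would look like around the training loop above, using scikit-learn's KFold on x_train/y_train (a sketch only; make_net, the layer sizes, and n_classes are placeholders, and x_train/y_train are assumed to be numpy arrays):

import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import KFold

# Hypothetical factory for a fresh network per fold; layer sizes and n_classes are placeholders.
def make_net():
    return nn.Sequential(nn.Linear(x_train.shape[1], 64), nn.ReLU(), nn.Linear(64, n_classes))

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in kf.split(x_train):
    fold_net = make_net()
    fold_opt = torch.optim.Adam(fold_net.parameters(), lr=learning_rate)
    fold_loss = nn.CrossEntropyLoss()
    for epoch in range(num_epochs):
        np.random.shuffle(train_idx)  # reshuffle the training fold each epoch
        for i in range(len(train_idx) // batch_size):
            idx = train_idx[i * batch_size:(i + 1) * batch_size]
            fold_opt.zero_grad()
            loss = fold_loss(fold_net(torch.FloatTensor(x_train[idx])),
                             torch.LongTensor(y_train[idx]))
            loss.backward()
            fold_opt.step()
    # Score the held-out fold
    with torch.no_grad():
        preds = fold_net(torch.FloatTensor(x_train[val_idx])).argmax(dim=1)
        fold_scores.append((preds == torch.LongTensor(y_train[val_idx])).float().mean().item())
print('Mean validation accuracy across folds: {:.3f}'.format(np.mean(fold_scores)))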
Secondly, if k-fold cross-validation is not necessary, is there an existing hyper-parameter tuning function or library for PyTorch, so that I can avoid writing one manually?
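From what I've read, one option is skorch, which wraps a PyTorch module as a scikit-learn estimator so that RandomizedSearchCV can drive both the hyper-parameter search and the k-fold splits. A rough sketch of what I imagine that looks like (Net, hidden_dim and n_classes are placeholders, not my actual module):

import torch
import torch.nn as nn
from skorch import NeuralNetClassifier
from sklearn.model_selection import RandomizedSearchCV

# Placeholder module standing in for my actual FNN
class Net(nn.Module):
    def __init__(self, hidden_dim=64):
        super().__init__()
        self.fc1 = nn.Linear(x_train.shape[1], hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, n_classes)
    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

estimator = NeuralNetClassifier(
    Net,
    criterion=nn.CrossEntropyLoss,   # module outputs raw logits
    optimizer=torch.optim.Adam,
    max_epochs=20,
    verbose=0,
)

param_distributions = {
    'lr': [1e-3, 3e-3, 1e-2, 3e-2],
    'batch_size': [16, 32, 50, 128],
    'max_epochs': [20, 50, 100],
    'module__hidden_dim': [32, 64, 128],   # double underscore routes to Net.__init__
}

search = RandomizedSearchCV(estimator, param_distributions, n_iter=10, cv=3,
                            scoring='accuracy', random_state=0)
search.fit(x_train.astype('float32'), y_train.astype('int64'))
print(search.best_params_)
print(search.best_score_)

Setting cv=3 in RandomizedSearchCV already performs k-fold cross-validation internally, which is partly why I'm unsure whether I need it at all for a mini-batch network.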
Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process. Generalization error is often best for a batch size of 1. Training with such a small batch size might require a small learning rate to maintain stability because of the high variance in the estimate of the gradient. The total runtime can be very high as a result of the need to make more steps, both because of the reduced learning rate and because it takes more steps to observe the entire training set.
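A practical implication for tuning (a heuristic, not something the passage above states) is that batch size and learning rate are worth searching together rather than independently; one common rule of thumb scales the learning rate roughly in proportion to the batch size:

# Illustrative heuristic only: smaller batches get proportionally smaller steps
base_lr = 0.01     # learning rate known to be stable at the reference batch size
base_batch = 50    # reference batch size from the training loop above
for bs in [1, 10, 50, 200]:
    lr = base_lr * bs / base_batch
    print('batch_size={:>3}  lr={:.4f}'.format(bs, lr))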