
Training & Validation loss and dataset size

2021-01-03 17:54:35 | python, machine-learning, neural-network, pytorch

I'm new to neural networks and I'm doing a project in which I have to define a NN and train it. I've defined a NN with 2 hidden layers, each with 17 inputs and 17 outputs. The NN as a whole has 21 inputs and 3 outputs.

I have a dataset of 10 million samples and a matching set of 10 million labels. My first issue is about the size of the validation set and the training set. I'm using PyTorch with batches, and from what I've read, the batches shouldn't be too large. But I don't know roughly how large the training and validation sets should be.

I've tried both larger and smaller sizes, but I cannot find any pattern that tells me whether choosing a large or a small set for either of them is the right call (apart from the time it takes to process a very large set).

My second issue is about the training and validation loss, which, from what I've read, can tell me whether I'm overfitting or underfitting depending on which of the two is larger. Ideally both should have the same value, and that also depends on the number of epochs. But I am not able to tune the network parameters such as batch size and learning rate, or to decide how much data I should use for training and validation. If I use 80% of the set (8 million) for training, it takes hours to finish, and I'm afraid that if I choose a smaller dataset, the network won't learn.

If anything is badly explained, please feel free to ask me for more information. As I said, the data is given, and I only have to define the network and train it with PyTorch.

Thanks!

1 answer

For your first question about batch size, there is no fixed rule for what value it should have. You have to try several values and see which one works best. Once your NN starts performing noticeably worse at some batch size, treat that value as a limit and don't go any further in that direction.
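One way to "try and see" is a small sweep: train a fresh copy of the network for a few epochs at each candidate batch size and compare the validation loss. The sketch below is only an illustration under several assumptions that are not in the question: the data is synthetic, the task is treated as regression (`nn.MSELoss`), and the optimizer, learning rate, epoch count, candidate batch sizes and the helper names `make_model` and `val_loss_for` are all made up for the example.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, random_split

torch.manual_seed(0)

# Synthetic stand-in for the real data (assumption): 21 inputs, 3 targets.
X = torch.randn(2000, 21)
y = X @ torch.randn(21, 3) + 0.1 * torch.randn(2000, 3)
train_set, val_set = random_split(TensorDataset(X, y), [1600, 400])


def make_model():
    # 21 inputs -> two hidden layers of 17 units -> 3 outputs, as in the question.
    # The ReLU activations are an assumption.
    return nn.Sequential(
        nn.Linear(21, 17), nn.ReLU(),
        nn.Linear(17, 17), nn.ReLU(),
        nn.Linear(17, 3),
    )


def val_loss_for(batch_size, epochs=3, lr=1e-2):
    """Train a fresh model with the given batch size; return the mean validation loss."""
    model = make_model()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()  # assumption: regression targets
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)

    for _ in range(epochs):
        model.train()
        for xb, yb in loader:
            optimizer.zero_grad()
            loss_fn(model(xb), yb).backward()
            optimizer.step()

    # Average the per-sample loss over the whole validation set.
    model.eval()
    total = 0.0
    with torch.no_grad():
        for xb, yb in DataLoader(val_set, batch_size=512):
            total += loss_fn(model(xb), yb).item() * len(xb)
    return total / len(val_set)


results = {bs: val_loss_for(bs) for bs in (32, 128, 512)}
for bs, loss in results.items():
    print(f"batch_size={bs:4d}  val_loss={loss:.4f}")
```

With a fixed number of epochs, smaller batches perform more optimizer steps, so a fairer comparison would probably also match wall-clock time or the number of updates.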

For your second question, first of all, having the same training and validation loss doesn't by itself mean your NN is performing well. It is only an indication that its performance on a test set is likely to be similar, and even that depends heavily on other things, such as whether your training and test sets come from the same distribution.

And with NNs you need to try as many things as you can. Try different parameter values, different train and validation split sizes, and so on. You cannot just assume that something won't work.
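Since a full pass over 8 million samples takes hours, one cheap way to experiment is to draw a random subsample, split it into training and validation parts, and record both losses after every epoch. The following is a minimal sketch, not a recommended configuration: the data is synthetic, and the 20% subsample, 80/20 split, `nn.MSELoss`, Adam optimizer, learning rate and the helper `mean_loss` are all assumptions chosen for illustration.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Subset, TensorDataset, random_split

torch.manual_seed(0)

# Synthetic stand-in for the 10 million samples and labels (assumption).
X = torch.randn(5000, 21)
y = X @ torch.randn(21, 3) + 0.1 * torch.randn(5000, 3)
full = TensorDataset(X, y)

# Work on a random 20% subsample first, then split it 80/20.
subset_size = len(full) // 5
subset = Subset(full, torch.randperm(len(full))[:subset_size].tolist())
n_val = subset_size // 5
train_set, val_set = random_split(subset, [subset_size - n_val, n_val])

# 21 inputs -> two hidden layers of 17 units -> 3 outputs, as in the question.
model = nn.Sequential(
    nn.Linear(21, 17), nn.ReLU(),
    nn.Linear(17, 17), nn.ReLU(),
    nn.Linear(17, 3),
)
loss_fn = nn.MSELoss()  # assumption: regression targets
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)


def mean_loss(dataset):
    """Mean per-sample loss of the current model over a dataset."""
    model.eval()
    total = 0.0
    with torch.no_grad():
        for xb, yb in DataLoader(dataset, batch_size=512):
            total += loss_fn(model(xb), yb).item() * len(xb)
    return total / len(dataset)


history = []  # one (train_loss, val_loss) pair per epoch
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
for epoch in range(5):
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss_fn(model(xb), yb).backward()
        optimizer.step()
    history.append((mean_loss(train_set), mean_loss(val_set)))
    print(f"epoch {epoch + 1}: train={history[-1][0]:.4f}  val={history[-1][1]:.4f}")
```

As a rough guide, a training loss that keeps falling while the validation loss rises suggests overfitting, and both losses staying high suggests underfitting. If the curves look similar when the subsample is doubled, the smaller subsample is probably large enough for tuning purposes.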