Loss is not converging in PyTorch but does in TensorFlow

Epoch: 1    Training Loss: 0.816370     Validation Loss: 0.696534
Validation loss decreased (inf --> 0.696534).  Saving model ...
Epoch: 2    Training Loss: 0.507756     Validation Loss: 0.594713
Validation loss decreased (0.696534 --> 0.594713).  Saving model ...
Epoch: 3    Training Loss: 0.216438     Validation Loss: 1.119294
Epoch: 4    Training Loss: 0.191799     Validation Loss: 0.801231
Epoch: 5    Training Loss: 0.111334     Validation Loss: 1.753786
Epoch: 6    Training Loss: 0.064309     Validation Loss: 1.348847
Epoch: 7    Training Loss: 0.058158     Validation Loss: 1.839139
Epoch: 8    Training Loss: 0.015489     Validation Loss: 1.370469
Epoch: 9    Training Loss: 0.082856     Validation Loss: 1.701200
Epoch: 10   Training Loss: 0.003859     Validation Loss: 2.657933
Epoch: 11   Training Loss: 0.018133     Validation Loss: 0.593986
Validation loss decreased (0.594713 --> 0.593986).  Saving model ...
Epoch: 12   Training Loss: 0.160197     Validation Loss: 1.499911
Epoch: 13   Training Loss: 0.012942     Validation Loss: 1.879732
Epoch: 14   Training Loss: 0.002037     Validation Loss: 2.399405
Epoch: 15   Training Loss: 0.035908     Validation Loss: 1.960887
Epoch: 16   Training Loss: 0.051137     Validation Loss: 2.226335
Epoch: 17   Training Loss: 0.003953     Validation Loss: 2.619108
Epoch: 18   Training Loss: 0.000381     Validation Loss: 2.746541
Epoch: 19   Training Loss: 0.094646     Validation Loss: 3.555713
Epoch: 20   Training Loss: 0.022620     Validation Loss: 2.833098
Epoch: 21   Training Loss: 0.004800     Validation Loss: 4.181845
Epoch: 22   Training Loss: 0.014128     Validation Loss: 1.933705
Epoch: 23   Training Loss: 0.026109     Validation Loss: 2.888344
Epoch: 24   Training Loss: 0.000768     Validation Loss: 3.029443
Epoch: 25   Training Loss: 0.000327     Validation Loss: 3.079959
Epoch: 26   Training Loss: 0.000121     Validation Loss: 3.578420
Epoch: 27   Training Loss: 0.148478     Validation Loss: 3.297387
Epoch: 28   Training Loss: 0.030328     Validation Loss: 2.218535
Epoch: 29   Training Loss: 0.001673     Validation Loss: 2.934132
Epoch: 30   Training Loss: 0.000253     Validation Loss: 3.215722

My loss is not converging. I am working on the Horses vs Humans dataset. There is an official TensorFlow notebook for it, and it works like a charm. When I try to replicate the same thing in PyTorch, the loss does not converge. Could you please take a look?

I am using criterion = nn.BCEWithLogitsLoss() and optimizer = optim.RMSprop(model.parameters(), lr=0.001). This seems to have some effect on the training loss, but the validation losses look like random numbers and do not form any pattern. What could be the reasons for the loss not converging?
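For reference, here is a minimal sketch of the input contract of nn.BCEWithLogitsLoss, which is the criterion named above. The logits and labels are made-up values, not taken from the notebook, and this is not meant as a diagnosis of the problem. It only shows that the criterion takes raw logits (no sigmoid) and float targets with the same shape as the logits.

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()

# Made-up raw model outputs (logits) for a batch of 4 images, shape (4, 1).
logits = torch.tensor([[2.0], [-1.0], [0.5], [-3.0]])
# Made-up integer class labels, shape (4,), as an image dataset might return them.
labels = torch.tensor([1, 0, 1, 0])

# The criterion needs float targets with the same shape as the logits.
targets = labels.float().unsqueeze(1)
loss = criterion(logits, targets)

# The same value computed in two steps: sigmoid, then binary cross-entropy.
manual = nn.functional.binary_cross_entropy(torch.sigmoid(logits), targets)
```

Because the sigmoid is applied inside the criterion, the model's last layer should return the raw output of the final linear layer.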

This is my CNN architecture:

import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # convolutional layer (3 -> 16 channels; 298x298 output before pooling)
        self.conv1 = nn.Conv2d(3, 16, 3)
        # convolutional layer (16 -> 32 channels; 147x147 output before pooling)
        self.conv2 = nn.Conv2d(16, 32, 3)
        # convolutional layer (32 -> 64 channels; 71x71 output before pooling)
        self.conv3 = nn.Conv2d(32, 64, 3)
        # convolutional layer (64 -> 64 channels; 33x33 output before pooling)
        self.conv4 = nn.Conv2d(64, 64, 3)
        # convolutional layer (64 -> 64 channels; 14x14 output before pooling)
        self.conv5 = nn.Conv2d(64, 64, 3)
        # max pooling layer (halves height and width)
        self.pool = nn.MaxPool2d(2, 2)
        # linear layer (64 * 7 * 7 = 3136 -> 512)
        self.fc1 = nn.Linear(3136, 512)
        # linear layer (512 -> 1)
        self.fc2 = nn.Linear(512, 1)
        # dropout layer (p=0.25)
        self.dropout = nn.Dropout(0.25)

    def forward(self, x):
        # add sequence of convolutional and max pooling layers
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        x = self.pool(F.relu(self.conv4(x)))
        x = self.pool(F.relu(self.conv5(x)))

        # flatten image input
        x = x.view(-1, 64 * 7 * 7)
        # add dropout layer
        x = self.dropout(x)
        # add 1st hidden layer, with relu activation function
        x = F.relu(self.fc1(x))
        # add dropout layer
        x = self.dropout(x)
        # output layer: returns a raw logit (no sigmoid), as nn.BCEWithLogitsLoss expects
        x = self.fc2(x)
        return x
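As a sanity check on the 64 * 7 * 7 flatten size used above, the spatial sizes can be traced by hand. This is a minimal sketch that assumes a 300x300 input image (the input size is not stated in the question; it is inferred from the 298x298 figure in the comments). The helper names conv_out, pool_out and trace_sizes are hypothetical.

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    """Output width of a convolution, matching nn.Conv2d's size formula."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Output width of nn.MaxPool2d(2, 2), which floors odd sizes."""
    return (size - kernel) // stride + 1


def trace_sizes(size, num_blocks=5):
    """Return (after_conv, after_pool) widths for each conv + pool block."""
    sizes = []
    for _ in range(num_blocks):
        after_conv = conv_out(size)
        size = pool_out(after_conv)
        sizes.append((after_conv, size))
    return sizes


# Assumed input width and height of 300 pixels.
sizes = trace_sizes(300)
# 64 output channels from conv5 times the final spatial area.
flat_features = 64 * sizes[-1][1] ** 2
```

Under that assumption the final feature map is 7x7, so the flattened size matches the 3136 inputs of fc1.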

This is the complete Jupyter notebook. Apologies for not being able to create a minimal reproducible example.

I think the problem is in the dataloaders. I noticed that you are not passing the samplers to the loaders here:

# define samplers for obtaining training and validation batches
train_sampler = SubsetRandomSampler(train_idx)
valid_sampler = SubsetRandomSampler(valid_idx)

train_loader = torch.utils.data.DataLoader(

test_loader = torch.utils.data.DataLoader(

I have never used samplers, so I don't know how to use them correctly, but I suppose you wanted to do something like this:

# define samplers for obtaining training and validation batches
train_sampler = SubsetRandomSampler(train_idx)
valid_sampler = SubsetRandomSampler(valid_idx)

train_loader = torch.utils.data.DataLoader(

test_loader = torch.utils.data.DataLoader(

And according to the docs:

sampler (Sampler, optional) – defines the strategy to draw samples from the dataset. If specified, shuffle must be False.

So if you are using samplers, you should turn off shuffle.
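A minimal sketch of that suggestion, assuming a toy TensorDataset in place of the Horses vs Humans images and a hypothetical 16/4 index split. The variable names other than train_sampler, valid_sampler and train_loader are illustrative, and the batch size is arbitrary. Each loader receives its sampler through the sampler argument, and shuffle is left at its default of False.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.sampler import SubsetRandomSampler

# Toy stand-in for the image dataset: 20 samples whose only feature is the sample index.
features = torch.arange(20, dtype=torch.float32).unsqueeze(1)
labels = (torch.arange(20) % 2).float()
dataset = TensorDataset(features, labels)

# Hypothetical split of shuffled indices into training and validation subsets.
indices = torch.randperm(len(dataset)).tolist()
train_idx, valid_idx = indices[:16], indices[16:]

# define samplers for obtaining training and validation batches
train_sampler = SubsetRandomSampler(train_idx)
valid_sampler = SubsetRandomSampler(valid_idx)

# Pass the samplers explicitly; shuffle stays at its default (False).
train_loader = DataLoader(dataset, batch_size=4, sampler=train_sampler)
valid_loader = DataLoader(dataset, batch_size=4, sampler=valid_sampler)

# Collect which sample indices each loader actually yields.
train_seen = sorted(int(x) for xb, _ in train_loader for x in xb.flatten())
valid_seen = sorted(int(x) for xb, _ in valid_loader for x in xb.flatten())

# Combining a sampler with shuffle=True is rejected by DataLoader.
try:
    DataLoader(dataset, batch_size=4, sampler=train_sampler, shuffle=True)
    shuffle_conflict = False
except ValueError:
    shuffle_conflict = True
```

With the samplers attached, the training and validation loaders draw from disjoint subsets of the same dataset. SubsetRandomSampler already randomizes the order within each subset, so no separate shuffle is needed.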
