
Loss is not converging in PyTorch but does in TensorFlow

Epoch: 1    Training Loss: 0.816370     Validation Loss: 0.696534
Validation loss decreased (inf --> 0.696534).  Saving model ...
Epoch: 2    Training Loss: 0.507756     Validation Loss: 0.594713
Validation loss decreased (0.696534 --> 0.594713).  Saving model ...
Epoch: 3    Training Loss: 0.216438     Validation Loss: 1.119294
Epoch: 4    Training Loss: 0.191799     Validation Loss: 0.801231
Epoch: 5    Training Loss: 0.111334     Validation Loss: 1.753786
Epoch: 6    Training Loss: 0.064309     Validation Loss: 1.348847
Epoch: 7    Training Loss: 0.058158     Validation Loss: 1.839139
Epoch: 8    Training Loss: 0.015489     Validation Loss: 1.370469
Epoch: 9    Training Loss: 0.082856     Validation Loss: 1.701200
Epoch: 10   Training Loss: 0.003859     Validation Loss: 2.657933
Epoch: 11   Training Loss: 0.018133     Validation Loss: 0.593986
Validation loss decreased (0.594713 --> 0.593986).  Saving model ...
Epoch: 12   Training Loss: 0.160197     Validation Loss: 1.499911
Epoch: 13   Training Loss: 0.012942     Validation Loss: 1.879732
Epoch: 14   Training Loss: 0.002037     Validation Loss: 2.399405
Epoch: 15   Training Loss: 0.035908     Validation Loss: 1.960887
Epoch: 16   Training Loss: 0.051137     Validation Loss: 2.226335
Epoch: 17   Training Loss: 0.003953     Validation Loss: 2.619108
Epoch: 18   Training Loss: 0.000381     Validation Loss: 2.746541
Epoch: 19   Training Loss: 0.094646     Validation Loss: 3.555713
Epoch: 20   Training Loss: 0.022620     Validation Loss: 2.833098
Epoch: 21   Training Loss: 0.004800     Validation Loss: 4.181845
Epoch: 22   Training Loss: 0.014128     Validation Loss: 1.933705
Epoch: 23   Training Loss: 0.026109     Validation Loss: 2.888344
Epoch: 24   Training Loss: 0.000768     Validation Loss: 3.029443
Epoch: 25   Training Loss: 0.000327     Validation Loss: 3.079959
Epoch: 26   Training Loss: 0.000121     Validation Loss: 3.578420
Epoch: 27   Training Loss: 0.148478     Validation Loss: 3.297387
Epoch: 28   Training Loss: 0.030328     Validation Loss: 2.218535
Epoch: 29   Training Loss: 0.001673     Validation Loss: 2.934132
Epoch: 30   Training Loss: 0.000253     Validation Loss: 3.215722

My loss is not converging. I am working on the Horses vs Humans dataset. There is an official TensorFlow notebook for it, and it worked like a charm. When I try to replicate the same model in PyTorch, the loss does not converge. Can you please have a look?

I am using criterion = nn.BCEWithLogitsLoss() and optimizer = optim.RMSprop(model.parameters(), lr=0.001). The training loss does go down, but the validation losses look like random numbers and do not form any pattern. What could be the possible reasons for the loss not converging?
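To double-check my setup, here is a minimal sketch of how I call the criterion, with hypothetical stand-in tensors (not my real data): BCEWithLogitsLoss expects raw logits and float targets of the same shape, so a model output of shape [N, 1] needs the [N] label vector reshaped to match.

```python
import torch
import torch.nn as nn

# Stand-in batch: raw (unnormalized) logits straight from the model, no sigmoid.
logits = torch.randn(16, 1)
# 0/1 labels as floats, shape [16]; must be aligned to the [16, 1] output.
targets = torch.randint(0, 2, (16,)).float()

criterion = nn.BCEWithLogitsLoss()
loss = criterion(logits, targets.unsqueeze(1))  # [16] -> [16, 1]
print(loss.item())
```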

This is my CNN architecture:

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # convolutional layer (sees the 298x298x3 image tensor)
        self.conv1 = nn.Conv2d(3, 16, 3)
        # convolutional layer (sees the 148x148x16 tensor)
        self.conv2 = nn.Conv2d(16, 32, 3)
        # convolutional layer (sees the 73x73x32 tensor)
        self.conv3 = nn.Conv2d(32, 64, 3)
        # convolutional layer (sees the 35x35x64 tensor)
        self.conv4 = nn.Conv2d(64, 64, 3)
        # convolutional layer (sees the 16x16x64 tensor)
        self.conv5 = nn.Conv2d(64, 64, 3)
        # max pooling layer
        self.pool = nn.MaxPool2d(2, 2)
        # linear layer (64 * 7 * 7 = 3136 -> 512)
        self.fc1 = nn.Linear(3136, 512)
        # linear layer (512 -> 1)
        self.fc2 = nn.Linear(512, 1)
        # dropout layer (p=0.25)
        self.dropout = nn.Dropout(0.25)

    def forward(self, x):
        # add sequence of convolutional and max pooling layers
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        x = self.pool(F.relu(self.conv4(x)))
        x = self.pool(F.relu(self.conv5(x)))

        # flatten image input
        x = x.view(-1, 64 * 7 * 7)
        # add dropout layer
        x = self.dropout(x)
        # add 1st hidden layer, with relu activation function
        x = F.relu(self.fc1(x))
        # add dropout layer
        x = self.dropout(x)
        # output layer (returns a raw logit; BCEWithLogitsLoss applies the sigmoid)
        x = self.fc2(x)
        return x
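To double-check the flattened size that fc1 assumes, a dummy forward pass through just the conv/pool stack can be run (assuming a 298x298 input, as in the comment above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# The same conv stack as in Net, rebuilt standalone for the size check.
convs = nn.ModuleList([
    nn.Conv2d(3, 16, 3), nn.Conv2d(16, 32, 3), nn.Conv2d(32, 64, 3),
    nn.Conv2d(64, 64, 3), nn.Conv2d(64, 64, 3),
])
pool = nn.MaxPool2d(2, 2)

x = torch.zeros(1, 3, 298, 298)  # dummy batch of one image
for conv in convs:
    x = pool(F.relu(conv(x)))

print(x.shape)  # torch.Size([1, 64, 7, 7])
assert x.flatten(1).shape[1] == 64 * 7 * 7  # 3136, matching nn.Linear(3136, 512)
```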

Here is the complete Jupyter notebook. Apologies for not being able to create a minimal reproducible example.

I think the problem is in the dataloaders. Here I noticed that you are not passing the samplers to the loaders:

# define samplers for obtaining training and validation batches
train_sampler = SubsetRandomSampler(train_idx)
valid_sampler = SubsetRandomSampler(valid_idx)

train_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=16,
        num_workers=0,
        shuffle=True
    )

test_loader = torch.utils.data.DataLoader(
        test_dataset,
        batch_size=16,
        num_workers=0,
        shuffle=True
    )

I have never used Samplers, so I don't know how to use them correctly, but I suppose you wanted to do something like this:

# define samplers for obtaining training and validation batches
train_sampler = SubsetRandomSampler(train_idx)
valid_sampler = SubsetRandomSampler(valid_idx)

train_loader = torch.utils.data.DataLoader(
        train_dataset,
        sampler=train_sampler,
        batch_size=16,
        num_workers=0
    )

test_loader = torch.utils.data.DataLoader(
        test_dataset,
        sampler=valid_sampler,
        batch_size=16,
        num_workers=0
    )

And according to the docs:

sampler (Sampler, optional) – defines the strategy to draw samples from the dataset. If specified, shuffle must be False.

So when you use a sampler, you have to leave shuffle off (its default is False); passing both sampler and shuffle=True raises an error, because the sampler alone decides the iteration order.
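Here is a minimal toy sketch (with made-up data, not your dataset) showing that the sampler alone restricts which items a loader yields, with shuffle left at its default False:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, SubsetRandomSampler

# Ten toy samples; the first eight are "training", the last two "validation".
dataset = TensorDataset(torch.arange(10).float())
train_sampler = SubsetRandomSampler([0, 1, 2, 3, 4, 5, 6, 7])
valid_sampler = SubsetRandomSampler([8, 9])

# No shuffle argument: the sampler already randomizes within its index set.
train_loader = DataLoader(dataset, batch_size=4, sampler=train_sampler)
valid_loader = DataLoader(dataset, batch_size=4, sampler=valid_sampler)

seen = sorted(int(v) for (batch,) in train_loader for v in batch)
print(seen)  # [0, 1, 2, 3, 4, 5, 6, 7] -- only the training indices appear
```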
