Pytorch Binary Classification RNN Model not Learning

I'm working on a binary classification task with PyTorch, and my model is failing to learn. I can't figure out whether the problem is with the model or with the data.

Here is my model:

from torch import nn

class RNN(nn.Module):
    def __init__(self, input_dim):
        super(RNN, self).__init__()
        
        self.rnn = nn.RNN(input_size=input_dim, hidden_size=64,
                          num_layers=2,
                          batch_first=True, bidirectional=True)
        
        self.norm = nn.BatchNorm1d(128)
        
        self.rnn2 = nn.RNN(input_size=128, hidden_size=64,
                          num_layers=2,
                          batch_first=True, bidirectional=False)
        
        self.drop = nn.Dropout(0.5)
        
        self.fc7 = nn.Linear(64, 2)  
        self.sigmoid2 = nn.Softmax(dim=2)

    def forward(self, x):       
        out, h_n = self.rnn(x)
        out = out.permute(0, 2, 1)
        out = self.norm(out)
        out = out.permute(0, 2, 1)
        
        out, h_n = self.rnn2(out)
        out = self.drop(out)
        
        out = self.fc7(out)
        out = self.sigmoid2(out)
        return out.squeeze()

The model consists of two RNN layers with a BatchNorm in between, followed by a Dropout and the final linear layer. I use a Softmax over two classes instead of a Sigmoid for evaluation purposes.

Then I create and train the model:

model = RNN(2476)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_function = nn.CrossEntropyLoss()
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1) 


model.train()

EPOCHS = 25
BATCH_SIZE = 64

epoch_loss = []
for ii in range(EPOCHS):
    for i in range(1, X_train.size()[0]//BATCH_SIZE+1):
        x_train = X_train[(i-1)*BATCH_SIZE:i*BATCH_SIZE]
        labels = y_train[(i-1)*BATCH_SIZE:i*BATCH_SIZE]
        
        optimizer.zero_grad()

        y_pred = model(x_train)
        y_pred = y_pred.round()
            
        single_loss = loss_function(y_pred, labels.long().squeeze())

        single_loss.backward()
        optimizer.step()
        lr_scheduler.step()

        print(f"\rBatch {i}/{X_train.size()[0]//BATCH_SIZE+1} Trained: {i*BATCH_SIZE}/{X_train.size()[0]} Loss: {single_loss.item():10.8f} Step: {lr_scheduler.get_lr()}", end="")
    
    epoch_loss.append(single_loss.item())
    print(f'\nepoch: {ii:3} loss: {single_loss.item():10.8f}')

This is the output when training the model:

Batch 353/354 Trained: 22592/22644 Loss: 0.86013699 Step: [1.0000000000000007e-21]
epoch:   0 loss: 0.86013699
Batch 353/354 Trained: 22592/22644 Loss: 0.81326193 Step: [1.0000000000000014e-33]
epoch:   1 loss: 0.81326193
Batch 353/354 Trained: 22592/22644 Loss: 0.87576205 Step: [1.0000000000000022e-45]
epoch:   2 loss: 0.87576205
Batch 353/354 Trained: 22592/22644 Loss: 0.92263710 Step: [1.0000000000000026e-57]
epoch:   3 loss: 0.92263710
Batch 353/354 Trained: 22592/22644 Loss: 0.90701210 Step: [1.0000000000000034e-68]
epoch:   4 loss: 0.90701210
Batch 353/354 Trained: 22592/22644 Loss: 0.92263699 Step: [1.0000000000000039e-80]
epoch:   5 loss: 0.92263699
Batch 353/354 Trained: 22592/22644 Loss: 0.82888693 Step: [1.0000000000000044e-92]
epoch:   6 loss: 0.82888693
Batch 353/354 Trained: 22592/22644 Loss: 0.81326193 Step: [1.000000000000005e-104]
epoch:   7 loss: 0.81326193
Batch 353/354 Trained: 22592/22644 Loss: 0.87576205 Step: [1.0000000000000055e-115]
epoch:   8 loss: 0.87576205
Batch 353/354 Trained: 22592/22644 Loss: 0.82888693 Step: [1.0000000000000062e-127]
epoch:   9 loss: 0.82888693
Batch 353/354 Trained: 22592/22644 Loss: 0.81326199 Step: [1.0000000000000067e-139]
epoch:  10 loss: 0.81326199
Batch 353/354 Trained: 22592/22644 Loss: 0.82888693 Step: [1.0000000000000072e-151]
epoch:  11 loss: 0.82888693
Batch 353/354 Trained: 22592/22644 Loss: 0.89138699 Step: [1.0000000000000076e-162]
epoch:  12 loss: 0.89138699
Batch 353/354 Trained: 22592/22644 Loss: 0.82888699 Step: [1.000000000000008e-174]
epoch:  13 loss: 0.82888699
Batch 353/354 Trained: 22592/22644 Loss: 0.82888687 Step: [1.0000000000000089e-186]
epoch:  14 loss: 0.82888687
Batch 353/354 Trained: 22592/22644 Loss: 0.82888693 Step: [1.0000000000000096e-198]
epoch:  15 loss: 0.82888693
Batch 353/354 Trained: 22592/22644 Loss: 0.84451199 Step: [1.0000000000000103e-210]
epoch:  16 loss: 0.84451199
Batch 353/354 Trained: 22592/22644 Loss: 0.96951205 Step: [1.0000000000000111e-221]
epoch:  17 loss: 0.96951205
Batch 353/354 Trained: 22592/22644 Loss: 0.87576205 Step: [1.0000000000000117e-233]
epoch:  18 loss: 0.87576205
Batch 353/354 Trained: 22592/22644 Loss: 0.89138705 Step: [1.0000000000000125e-245]
epoch:  19 loss: 0.89138705
Batch 353/354 Trained: 22592/22644 Loss: 0.79763699 Step: [1.0000000000000133e-257]
epoch:  20 loss: 0.79763699
Batch 353/354 Trained: 22592/22644 Loss: 0.84451199 Step: [1.0000000000000138e-268]
epoch:  21 loss: 0.84451199
Batch 353/354 Trained: 22592/22644 Loss: 0.84451205 Step: [1.0000000000000146e-280]
epoch:  22 loss: 0.84451205
Batch 353/354 Trained: 22592/22644 Loss: 0.79763693 Step: [1.0000000000000153e-292]
epoch:  23 loss: 0.79763693
Batch 353/354 Trained: 22592/22644 Loss: 0.87576205 Step: [1.000000000000016e-304]
epoch:  24 loss: 0.87576205

And this is the loss per epoch:

[image: plot of training loss per epoch]

For the data, each of the feature vectors in the input data has a dimension of (2474,), and the targets have one dimension (either [1] or [0]). I then add the sequence-length dimension (1) to the input data for the RNN layers:

X_train.size(), X_test.size(), y_train.size(), y_test.size()

(torch.Size([22644, 1, 2474]),
 torch.Size([5661, 1, 2474]),
 torch.Size([22644, 1]),
 torch.Size([5661, 1]))
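
For reference, adding that sequence-length dimension can be done with unsqueeze. This is a sketch, assuming the tensors start as (N, 2474) and (N, 1); the exact call isn't shown in the question:

# Sketch: insert a sequence-length dimension of 1 at dim 1 so the
# batch_first RNN sees inputs of shape (batch, seq_len, features).
X_train = X_train.unsqueeze(1)  # (22644, 2474) -> (22644, 1, 2474)
X_test = X_test.unsqueeze(1)    # (5661, 2474)  -> (5661, 1, 2474)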

Distribution of the target classes:

[image: bar chart of the target class distribution]

I can't figure out why my model is not learning; the classes are balanced and I haven't noticed anything wrong with the data. Any suggestions?

Increase the number of epochs (let it train for longer; NNs take time), lower the batch size, and explore other hyperparameters.

I was wondering if you're calling optimizer.zero_grad() too often? More on that here: https://stackoverflow.com/a/67819799/7420967
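
A minimal sketch of the pattern discussed there (gradient accumulation, where zero_grad() is cleared only once every few batches; the toy model, data, and interval below are arbitrary placeholders):

import torch
from torch import nn

model = nn.Linear(10, 2)  # toy model purely for illustration
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_function = nn.CrossEntropyLoss()
ACCUM_STEPS = 4  # arbitrary accumulation interval

for i in range(16):
    x = torch.randn(8, 10)                            # dummy batch
    y = torch.randint(0, 2, (8,))                     # dummy labels
    loss = loss_function(model(x), y) / ACCUM_STEPS   # scale so accumulated grads average out
    loss.backward()                                   # gradients accumulate across calls
    if (i + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()                         # clear only after an optimizer step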

This is not a direct solution to your problem, but what was the process that led to this architecture? I've found it helpful to build up complexity iteratively, if only to make identifying issues easier (what did I add just before the issue arose?).

To save time on constructing your RNN iteratively, you can try single-batch training, in which you build a network that can overfit a single training batch. If your network can overfit a single training batch, it should be complex enough to learn the features in the training data.
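
A sketch of that test, reusing the names from the question (the input dimension, batch slice, learning rate, and step count here are assumptions):

# Single-batch overfitting test: train repeatedly on one fixed batch.
# If the loss does not drop toward zero, the problem is in the model or
# the training loop rather than in the amount of data.
x_single = X_train[:64]                    # one fixed batch
y_single = y_train[:64].long().squeeze()

model = RNN(2474)                          # assumed to match the feature dimension above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_function = nn.CrossEntropyLoss()

model.train()
for step in range(500):
    optimizer.zero_grad()
    loss = loss_function(model(x_single), y_single)
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        print(f"step {step} loss {loss.item():.4f}")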

Once you have an architecture that can easily overfit a single training batch, you can then train with the entire training set and explore additional strategies to account for overfitting through regularization.

Your model doesn't seem overly complex, but this may mean starting with a single RNN layer and a single linear layer to see if your loss will budge on a single batch, for example:
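
Here is a sketch of that stripped-down starting point (the class name and layer sizes are placeholders):

from torch import nn

# Minimal variant: one RNN layer, one linear layer, raw logits out
# (nn.CrossEntropyLoss applies the softmax internally).
class SimpleRNN(nn.Module):  # hypothetical name
    def __init__(self, input_dim):
        super().__init__()
        self.rnn = nn.RNN(input_size=input_dim, hidden_size=64, batch_first=True)
        self.fc = nn.Linear(64, 2)

    def forward(self, x):
        out, _ = self.rnn(x)             # out: (batch, seq_len, 64)
        return self.fc(out[:, -1, :])    # logits from the last time step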
