PyTorch: Finding variable needed for gradient computation that has been modified by inplace operation - Multitask Learning

I recently did a massive refactor of my PyTorch LSTM code in order to support multitask learning. I created an MTLWrapper, which holds a BaseModel (which can be one of several variations on a regular LSTM network). The BaseModel remained the same as it was before the refactor, minus a linear hidden2tag layer (which takes the hidden sequence and converts it to tag space); that layer now sits in the wrapper. The reason for this is that for multitask learning all the parameters are shared except the final linear layer, of which I have one per task. These are stored in an nn.ModuleList, not just a regular Python list.

What happens now is that my forward pass returns a list of tag-score tensors (one per task), rather than a single tensor of tag scores for a single task. I compute the loss for each of these tasks and then try to backpropagate with the average of these losses (technically also averaged over all the sentences of a batch, but this was true before the refactor too). I call model.zero_grad() before running the forward pass on each sentence in a batch.

I don't know exactly where it happened, but after this refactor, I started getting this error (on the second batch):

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

Following the advice, I added the retain_graph=True flag, but now I get the following error instead (also on the second backward step):

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [100, 400]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

The hint in the backtrace is not actually helpful, because I have no idea where a tensor of shape [100, 400] even came from - I don't have any parameters of size 400. I have a sneaking suspicion that the problem is actually that I shouldn't need retain_graph=True at all, but I have no way to confirm that versus finding the mystery variable that is being changed according to the second error. Either way, I'm at a complete loss as to how to solve this issue. Any help is appreciated!
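(Aside, in case it helps someone with a similarly opaque backtrace: I believe PyTorch's anomaly detection can attach a forward-pass traceback to this kind of failure, pointing at the operation that created the offending tensor. A minimal sketch, separate from my actual code:)

import torch

# Sketch only: with anomaly detection enabled, autograd records a traceback
# for every forward op, so the inplace-modification error also reports where
# the offending tensor was produced. Expect a noticeable slowdown.
torch.autograd.set_detect_anomaly(True)

# ...then run the usual forward pass and loss.backward() as below...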

Code snippets:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MTLWrapper(nn.Module):
    
    def __init__(self, embedding_dim, hidden_dim, dropout,..., directions=1, device='cpu', model_type):
        super(MTLWrapper, self).__init__()
        self.base_model = model_type(embedding_dim, hidden_dim, dropout, ..., directions, device)
        self.linear_taggers = []
        for tagset_size in tagset_sizes:
            self.linear_taggers.append(nn.Linear(hidden_dim*directions, tagset_size))
        self.linear_taggers = nn.ModuleList(self.linear_taggers)

    def init_hidden(self, hidden_dim):
        return self.base_model.init_hidden(hidden_dim)

    def forward(self, sentence):
        lstm_out = self.base_model.forward(sentence)
        tag_scores = []
        for linear_tagger in self.linear_taggers:
            tag_space = linear_tagger(lstm_out.view(len(sentence), -1))
            tag_scores.append(F.log_softmax(tag_space))
        tag_scores = torch.stack(tag_scores)
        return tag_scores
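(A small aside on the snippet above: newer PyTorch versions warn about calling F.log_softmax without an explicit dim. Since tag_space here has shape (len(sentence), tagset_size), I believe passing dim=1 keeps the same behavior and silences the warning; this is an optional tweak, unrelated to the error:)

# possible tweak, not the code I actually ran:
# tag_scores.append(F.log_softmax(tag_space, dim=1))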

Inside the train function:

for i in range(math.ceil(len(train_sents)/batch_size)):
    batch = r[i*batch_size:(i+1)*batch_size]
    losses = []
    for j in batch:

        sentence = train_sents[j]
        tags = train_tags[j]

        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Also, we need to clear out the hidden state of the LSTM,
        # detaching it from its history on the last instance.
        model.hidden = model.init_hidden(hidden_dim)

        sentence_in = sentence
        targets = tags

        # Step 3. Run our forward pass.
        tag_scores = model(sentence_in)

        loss = [loss_function(tag_scores[i], targets[i]) for i in range(len(tag_scores))]
        loss = torch.stack(loss)
        avg_loss = sum(loss)/len(loss)
        losses.append(avg_loss)
    losses = torch.stack(losses)
    total_loss = sum(losses)/len(losses)  # average over all sentences in batch
    total_loss.backward(retain_graph=True)
    running_loss += total_loss.item()
    optimizer.step()
    count += 1
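(For context on the first error: if an LSTM's hidden state is carried over between sentences without being re-created or detached, the second call to backward() has to walk back through the previous sentence's graph, whose buffers have already been freed. Below is a generic sketch of the usual detach pattern, assuming one wanted to keep the hidden values rather than zero them; repackage_hidden is my own name, not a PyTorch API:)

import torch

def repackage_hidden(hidden):
    # Sketch only: return hidden/cell tensors that hold the same values but
    # are detached from the previous sentence's graph, so backward() never
    # reaches into already-freed history.
    if isinstance(hidden, torch.Tensor):
        return hidden.detach()
    return tuple(repackage_hidden(h) for h in hidden)

# e.g., applied to whichever hidden-state attribute the LSTM actually reads:
# model.hidden = repackage_hidden(model.hidden)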

And code for one possible BaseModel (the others are practically identical):

class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, dropout, vocab_size, alphabet_size,
                 directions=1, device='cpu'):

        super(LSTMTagger, self).__init__()
        self.device = device

        self.hidden_dim = hidden_dim
        self.directions = directions
        self.dropout = nn.Dropout(dropout)

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, dropout=dropout, bidirectional=directions == 2)

        # Initialize the hidden and cell state
        self.hidden = self.init_hidden(hidden_dim)

    def init_hidden(self, dim):
        # Before we've done anything, we don't have any hidden state.
        # Refer to the PyTorch documentation to see exactly
        # why they have this dimensionality.
        # The axes semantics are (num_layers, minibatch_size, hidden_dim)
        return (torch.zeros(self.directions, 1, dim).to(device=self.device),
                torch.zeros(self.directions, 1, dim).to(device=self.device))

    def forward(self, sentence):
        word_idxs = []
        for word in sentence:
            word_idxs.append(word[0])

        embeds = self.word_embeddings(torch.LongTensor(word_idxs).to(device=self.device))

        lstm_out, self.hidden = self.lstm(
            embeds.view(len(sentence), 1, -1), self.hidden)
        lstm_out = self.dropout(lstm_out)
        return lstm_out

The problem is that when I was resetting the hidden states of the model (model.hidden = model.init_hidden(hidden_dim)), I didn't actually reassign the reinitialized hidden state to the BaseModel, but only on the MTLWrapper (which technically doesn't even use the hidden state itself). I amended my MTLWrapper's init_hidden() function as follows:

class MTLWrapper(nn.Module):

    def init_hidden(self, hidden_dim):
        self.base_model.hidden = self.base_model.init_hidden(hidden_dim)
        return self.base_model.init_hidden(hidden_dim)

This resolved the first error, and my code runs without the retain_graph=True flag.
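(For completeness, my best guess at where the second error came from once retain_graph=True was in play: optimizer.step() updates the LSTM weights in place, so calling backward a second time through a graph built before that step trips the version check on a saved, transposed weight tensor; that would explain the [100, 400] shape if hidden_dim is 100, since weight_hh_l0 has shape (4*hidden_dim, hidden_dim). A minimal sketch of that general mechanism, hypothetical and not my project code:)

import torch
import torch.nn as nn

# Sketch of the general pattern only:
lstm = nn.LSTM(10, 100)
optimizer = torch.optim.SGD(lstm.parameters(), lr=0.1)

out, _ = lstm(torch.randn(5, 1, 10))
loss = out.sum()
loss.backward(retain_graph=True)
optimizer.step()   # in-place update of the LSTM weights (bumps their version)
loss.backward()    # raises the "modified by an inplace operation" error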
