
Why does a model with one GRU layer return zero gradients?

I am trying to compare two models in order to understand the behaviour of the gradients.

import torch
import torch.nn as nn
import torchinfo

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()        
        self.Identity = nn.Identity()
        self.GRU      = nn.GRU(input_size=3, hidden_size=32, num_layers=2, batch_first=True)
        self.fc       = nn.Linear(32, 5)
        
    def forward(self, input_series):
                
        self.Identity(input_series)
        
        output, h = self.GRU(input_series)                
        output    = output[:,  -1, :]       # get last state                        
        output    = self.fc(output) 
        output    = output.view(-1, 5, 1)   # reorganize output
                        
        return output
    
    
class SecondModel(nn.Module):
    def __init__(self):
        super(SecondModel, self).__init__()        
        self.GRU      = nn.GRU(input_size=3, hidden_size=32, num_layers=2, batch_first=True)        
        
    def forward(self, input_series):
                
        output, h = self.GRU(input_series)                                        
        return output

Checking the gradients of the first model gives True (zero gradients):

model = MyModel()
x     = torch.rand([2, 10, 3])
y     = model(x)
y.retain_grad()  
y[:, -1].sum().backward()
print(torch.allclose(y.grad[:, :-1], torch.tensor(0.)))  # gradients w.r.t previous outputs are zeroes

Checking the gradients of the second model also gives True (zero gradients):

model = SecondModel()
x     = torch.rand([2, 10, 3])
y     = model(x)
y.retain_grad()  
y[:, -1].sum().backward()
print(torch.allclose(y.grad[:, :-1], torch.tensor(0.)))  # gradients w.r.t previous outputs are zeroes

According to the answer here:

Linear layer after GRU preserving the sequence output order?

the second model (with only a GRU layer) should give non-zero gradients.

  1. What am I missing?
  2. When do we get zero or non-zero gradients?

In theory the values of y.grad[:, :-1] should not be zero, but here they are, because y[:, :-1] does not seem to refer to the same tensors that were used to compute y[:, -1] inside the GRU implementation. For example, a simple single-layer GRU implementation looks like this:

import torch
import torch.nn as nn

class GRU(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.lin_r = nn.Linear(input_size + hidden_size, hidden_size)
        self.lin_z = nn.Linear(input_size + hidden_size, hidden_size)
        self.lin_in = nn.Linear(input_size, hidden_size)
        self.lin_hn = nn.Linear(hidden_size, hidden_size)
        self.hidden_size = hidden_size

    def forward(self, x):
        bsz, len_, in_ = x.shape
        h = torch.zeros([bsz, self.hidden_size])   # initial hidden state
        hs = []
        for i in range(len_):
            # reset gate, update gate and candidate hidden state
            r = self.lin_r(torch.cat([x[:, i], h], dim=-1)).sigmoid()
            z = self.lin_z(torch.cat([x[:, i], h], dim=-1)).sigmoid()
            n = (self.lin_in(x[:, i]) + r * self.lin_hn(h)).tanh()
            # interpolate between the candidate and the previous hidden state
            h = (1.-z)*n + z*h
            hs.append(h)

        # Return the output both as a single tensor and as a list of
        # tensors actually used in computing the hidden vectors
        return torch.stack(hs, dim=1), hs

然后,我們有

model = GRU(input_size=3, hidden_size=32)
x = torch.rand([2, 10, 3])
y, hs = model(x)
y.retain_grad()
for h in hs:
    h.retain_grad()
y[:, -1].sum().backward()
print(torch.allclose(y.grad[:, -1], torch.tensor(0.)))  # False, as expected (sanity check)
print(torch.allclose(y.grad[:, :-1], torch.tensor(0.)))  # True, unexpected
print(any(torch.allclose(h.grad, torch.tensor(0.)) for h in hs))  # False, as expected

It looks like PyTorch computes the gradients w.r.t. all of the tensors in hs, but not w.r.t. y itself.
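
Here is a minimal sketch of the mechanism (not the actual nn.GRU internals): torch.stack creates a new tensor, and a slice of that tensor only receives a gradient if the loss is actually computed from that slice, even though the source tensors still receive their gradients.

import torch

a = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(2.0, requires_grad=True)
c = a * b                      # c depends on a and b, like h_t depends on h_{t-1}

y = torch.stack([a, b, c])     # analogous to torch.stack(hs, dim=1) above
y.retain_grad()

y[-1].backward()               # the "loss" uses only the last slice, like y[:, -1]
print(y.grad)                  # tensor([0., 0., 1.]) -> earlier slices get zero gradient
print(a.grad, b.grad)          # tensor(2.), tensor(1.) -> the source tensors do get gradients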

So, to answer your questions:

  1. I don't think you are missing anything. The linked answer is not quite correct, because it wrongly assumes that PyTorch computes y.grad as expected.
  2. The theory given as a comment in the linked answer is still correct, but not quite complete: the gradient is always zero if the input does not matter for the output (see the sketch below).
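
As a small illustration of point 2 (a sketch reusing the SecondModel defined above; the exact values depend on the random initialization): the earlier inputs do matter for the last output, so their gradients are non-zero, while the earlier outputs are never used to compute the last output, so their gradients stay zero.

model = SecondModel()
x     = torch.rand([2, 10, 3], requires_grad=True)        # track gradients w.r.t. the input
y     = model(x)
y.retain_grad()
y[:, -1].sum().backward()
print(torch.allclose(y.grad[:, :-1], torch.tensor(0.)))   # True: earlier outputs are not used
print(torch.allclose(x.grad[:, :-1], torch.tensor(0.)))   # False (generically): earlier inputs do matter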
