
LSTM cell implementation in PyTorch: design choices

I was looking for an implementation of an LSTM cell in PyTorch that I could extend, and I found an implementation of it in the accepted answer here. I will post it here because I'd like to refer to it. There are quite a few implementation details that I do not understand, and I was wondering if someone could clarify them.

import math
import torch as th
import torch.nn as nn

class LSTM(nn.Module):

    def __init__(self, input_size, hidden_size, bias=True):
        super(LSTM, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.bias = bias
        self.i2h = nn.Linear(input_size, 4 * hidden_size, bias=bias)
        self.h2h = nn.Linear(hidden_size, 4 * hidden_size, bias=bias)
        self.reset_parameters()

    def reset_parameters(self):
        std = 1.0 / math.sqrt(self.hidden_size)
        for w in self.parameters():
            w.data.uniform_(-std, std)

    def forward(self, x, hidden):
        h, c = hidden
        h = h.view(h.size(1), -1)
        c = c.view(c.size(1), -1)
        x = x.view(x.size(1), -1)

        # Linear mappings
        preact = self.i2h(x) + self.h2h(h)

        # activations
        gates = preact[:, :3 * self.hidden_size].sigmoid()
        g_t = preact[:, 3 * self.hidden_size:].tanh()
        i_t = gates[:, :self.hidden_size]
        f_t = gates[:, self.hidden_size:2 * self.hidden_size]
        o_t = gates[:, -self.hidden_size:]

        c_t = th.mul(c, f_t) + th.mul(i_t, g_t)

        h_t = th.mul(o_t, c_t.tanh())

        h_t = h_t.view(1, h_t.size(0), -1)
        c_t = c_t.view(1, c_t.size(0), -1)
        return h_t, (h_t, c_t)

1- Why multiply the hidden size by 4 for both self.i2h and self.h2h (in the init method)

2- I don't understand the reset method for the parameters. In particular, why do we reset parameters in this way?

3- Why do we use view for h, c, and x in the forward method?

4- I'm also confused about the column bounds in the activations part of the forward method. As an example, why do we upper bound with 3 * self.hidden_size for gates?

5- Where are all the parameters of the LSTM? I'm talking about the Us and Ws here: [image: the LSTM cell equations]
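
For reference, the standard LSTM cell equations (presumably what the image shows) are:

i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
g_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot g_t
h_t = o_t \odot \tanh(c_t)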

1- Why multiply the hidden size by 4 for both self.i2h and self.h2h (in the init method)

In the equations you have included, the input x and the hidden state h are used in four calculations, each of which is a matrix multiplication with its own weight. Whether you do four separate matrix multiplications, or concatenate the weights and do one bigger matrix multiplication and split the result afterwards, the result is the same.

import torch

input_size = 5
hidden_size = 10

input = torch.randn((2, input_size))

# Two different weights
w_c = torch.randn((hidden_size, input_size))
w_i = torch.randn((hidden_size, input_size))

# Concatenated weights into one tensor
# with size:[2 * hidden_size, input_size]
w_combined = torch.cat((w_c, w_i), dim=0)

# Output calculated by using separate matrix multiplications
out_c = torch.matmul(w_c, input.transpose(0, 1))
out_i = torch.matmul(w_i, input.transpose(0, 1))

# One bigger matrix multiplication with the combined weights
out_combined = torch.matmul(w_combined, input.transpose(0, 1))
# The first hidden_size number of rows belong to w_c
out_combined_c = out_combined[:hidden_size]
# The second hidden_size number of rows belong to w_i
out_combined_i = out_combined[hidden_size:]

# Using torch.allclose because they are equal besides floating point errors.
torch.allclose(out_c, out_combined_c) # => True
torch.allclose(out_i, out_combined_i) # => True

By setting the output size of the linear layer to 4 * hidden_size there are four weights with size hidden_size, so only one layer is needed instead of four. There is not really an advantage to doing this, except maybe a minor performance improvement, mostly for smaller inputs that would not fully exhaust the parallelisation capabilities if the multiplications were done individually.
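
As a quick check with example sizes (chosen arbitrarily here), the single combined layer produces one output of size 4 * hidden_size per sample, which is later split into the four gate pre-activations:

lstm = LSTM(input_size=5, hidden_size=10)
x = torch.randn(2, 5)   # [batch_size, input_size]
h = torch.randn(2, 10)  # [batch_size, hidden_size]
# One linear mapping per input instead of four, as in the forward method above
preact = lstm.i2h(x) + lstm.h2h(h)
print(preact.shape)     # torch.Size([2, 40]) == [batch_size, 4 * hidden_size]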

4- I'm also confused about the column bounds in the activations part of the forward method. As an example, why do we upper bound with 3 * self.hidden_size for gates?

That's where the output is separated to correspond to the outputs of the four individual calculations. The output is the concatenation of [i_t; f_t; o_t; g_t] (before the sigmoid is applied to the first three and the tanh to the last).

You can get the same separation by splitting the output into four chunks with torch.chunk :

i_t, f_t, o_t, g_t = torch.chunk(preact, 4, dim=1)

But after the separation you would have to apply torch.sigmoid to i_t , f_t and o_t , and torch.tanh to g_t .
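
Putting those two steps together, a minimal sketch (assuming preact has the shape [batch_size, 4 * hidden_size] as in the forward method above):

i_t, f_t, o_t, g_t = torch.chunk(preact, 4, dim=1)
i_t = torch.sigmoid(i_t)  # input gate
f_t = torch.sigmoid(f_t)  # forget gate
o_t = torch.sigmoid(o_t)  # output gate
g_t = torch.tanh(g_t)     # candidate cell state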

5- Where are all the parameters of the LSTM? I'm talking about the Us and Ws here:

The W parameters are the weights of the linear layer self.i2h and the U parameters are the weights of the linear layer self.h2h, but concatenated along the output dimension.

W_i, W_f, W_o, W_c = torch.chunk(self.i2h.weight, 4, dim=0)
U_i, U_f, U_o, U_c = torch.chunk(self.h2h.weight, 4, dim=0)
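
Each chunk then has the size of one individual weight matrix from the equations. For example, with the (arbitrary) sizes input_size=5 and hidden_size=10:

lstm = LSTM(input_size=5, hidden_size=10)
W_i, W_f, W_o, W_c = torch.chunk(lstm.i2h.weight, 4, dim=0)
U_i, U_f, U_o, U_c = torch.chunk(lstm.h2h.weight, 4, dim=0)
print(W_i.shape)  # torch.Size([10, 5])  i.e. [hidden_size, input_size]
print(U_i.shape)  # torch.Size([10, 10]) i.e. [hidden_size, hidden_size]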

3- Why do we use view for h, c, and x in the forward method?

Based on h_t = h_t.view(1, h_t.size(0), -1) towards the end, the hidden states have the size [1, batch_size, hidden_size]. With h = h.view(h.size(1), -1) the first singleton dimension is removed to get the size [batch_size, hidden_size]. The same could be achieved with h.squeeze(0).
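
For example (sizes chosen arbitrarily):

h = torch.randn(1, 3, 10)   # [1, batch_size, hidden_size]
a = h.view(h.size(1), -1)   # [batch_size, hidden_size]
b = h.squeeze(0)            # same result
print(torch.equal(a, b))    # True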

2- I don't understand the reset method for the parameters. In particular, why do we reset parameters in this way?

Parameter initialisation can have a big impact on the model's learning capability. The general rule for the initialisation is to have values close to zero without being too small. A common initialisation is to draw from a normal distribution with mean 0 and variance of 1 / n , where n is the number of neurons, which in turn means a standard deviation of 1 / sqrt(n) .

In this case a uniform distribution is used instead of a normal distribution, but the general idea is similar: the minimum/maximum value is determined by the number of neurons while avoiding values that are too small. If the bound were 1 / n the values would get very small, so using 1 / sqrt(n) is more appropriate, e.g. for 256 neurons: 1 / 256 = 0.0039 whereas 1 / sqrt(256) = 0.0625.
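
For completeness, the same initialisation could also be written with torch.nn.init (a sketch that is functionally equivalent to reset_parameters above; the helper name is just for illustration):

import math
import torch.nn as nn

def reset_parameters_alt(module, hidden_size):
    # Uniform initialisation bounded by 1 / sqrt(hidden_size), as in reset_parameters
    std = 1.0 / math.sqrt(hidden_size)
    for w in module.parameters():
        nn.init.uniform_(w, -std, std)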

The article Initializing neural networks provides some explanations of different initialisations with interactive visualisations.
