I was looking for an implementation of an LSTM cell in PyTorch that I could extend, and I found one in the accepted answer here. I will post it below because I'd like to refer to it. There are quite a few implementation details that I do not understand, and I was wondering if someone could clarify them.
import math
import torch as th
import torch.nn as nn

class LSTM(nn.Module):

    def __init__(self, input_size, hidden_size, bias=True):
        super(LSTM, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.bias = bias
        self.i2h = nn.Linear(input_size, 4 * hidden_size, bias=bias)
        self.h2h = nn.Linear(hidden_size, 4 * hidden_size, bias=bias)
        self.reset_parameters()

    def reset_parameters(self):
        std = 1.0 / math.sqrt(self.hidden_size)
        for w in self.parameters():
            w.data.uniform_(-std, std)

    def forward(self, x, hidden):
        h, c = hidden
        h = h.view(h.size(1), -1)
        c = c.view(c.size(1), -1)
        x = x.view(x.size(1), -1)

        # Linear mappings
        preact = self.i2h(x) + self.h2h(h)

        # activations
        gates = preact[:, :3 * self.hidden_size].sigmoid()
        g_t = preact[:, 3 * self.hidden_size:].tanh()
        i_t = gates[:, :self.hidden_size]
        f_t = gates[:, self.hidden_size:2 * self.hidden_size]
        o_t = gates[:, -self.hidden_size:]

        c_t = th.mul(c, f_t) + th.mul(i_t, g_t)
        h_t = th.mul(o_t, c_t.tanh())

        h_t = h_t.view(1, h_t.size(0), -1)
        c_t = c_t.view(1, c_t.size(0), -1)
        return h_t, (h_t, c_t)
1- Why multiply the hidden size by 4 for both self.i2h and self.h2h (in the init method)?
2- I don't understand the reset method for the parameters. In particular, why do we reset parameters in this way?
3- Why do we use view for h, c, and x in the forward method?
4- I'm also confused about the column bounds in the activations part of the forward method. As an example, why do we upper bound with 3 * self.hidden_size for gates?
5- Where are all the parameters of the LSTM? I'm talking about the Us and Ws here:
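(I mean the standard LSTM cell equations, which the code above implements:)

i_t = sigmoid(W_i x_t + U_i h_{t-1} + b_i)
f_t = sigmoid(W_f x_t + U_f h_{t-1} + b_f)
o_t = sigmoid(W_o x_t + U_o h_{t-1} + b_o)
g_t = tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t * c_{t-1} + i_t * g_t
h_t = o_t * tanh(c_t)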
1- Why multiply the hidden size by 4 for both self.i2h and self.h2h (in the init method)?
In the equations you have included, the input x and the hidden state h are each used in four calculations, where each calculation is a matrix multiplication with its own weight. It makes no difference whether you do four separate matrix multiplications, or concatenate the weights, do one bigger matrix multiplication, and separate the results afterwards; the result is the same.
import torch

input_size = 5
hidden_size = 10
input = torch.randn((2, input_size))

# Two different weights
w_c = torch.randn((hidden_size, input_size))
w_i = torch.randn((hidden_size, input_size))

# Concatenated weights into one tensor
# with size: [2 * hidden_size, input_size]
w_combined = torch.cat((w_c, w_i), dim=0)

# Output calculated by using separate matrix multiplications
out_c = torch.matmul(w_c, input.transpose(0, 1))
out_i = torch.matmul(w_i, input.transpose(0, 1))

# One bigger matrix multiplication with the combined weights
out_combined = torch.matmul(w_combined, input.transpose(0, 1))
# The first hidden_size number of rows belong to w_c
out_combined_c = out_combined[:hidden_size]
# The second hidden_size number of rows belong to w_i
out_combined_i = out_combined[hidden_size:]

# Using torch.allclose because they are equal besides floating point errors.
torch.allclose(out_c, out_combined_c)  # => True
torch.allclose(out_i, out_combined_i)  # => True
By setting the output size of the linear layer to 4 * hidden_size, the single layer holds four sets of weights, one per gate, so only one layer is needed instead of four. There is not really an advantage to doing this, except maybe a minor performance improvement, mostly for smaller inputs that would not fully exhaust the parallelisation capabilities if done individually.
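To make that concrete, here is a minimal sketch (the sizes and the names gate_layers / combined are just examples, not from the original code) showing that one nn.Linear with 4 * hidden_size outputs behaves like four separate nn.Linear(input_size, hidden_size) layers whose weights and biases have been stacked:

import torch
import torch.nn as nn

input_size, hidden_size, batch_size = 5, 10, 2
x = torch.randn(batch_size, input_size)

# Four separate layers, one per gate
gate_layers = [nn.Linear(input_size, hidden_size) for _ in range(4)]

# One combined layer whose weight and bias are the four stacked together
combined = nn.Linear(input_size, 4 * hidden_size)
with torch.no_grad():
    combined.weight.copy_(torch.cat([l.weight for l in gate_layers], dim=0))
    combined.bias.copy_(torch.cat([l.bias for l in gate_layers], dim=0))

# The combined output, split back into four chunks, matches the separate layers
out_chunks = combined(x).chunk(4, dim=1)
for layer, chunk in zip(gate_layers, out_chunks):
    assert torch.allclose(layer(x), chunk, atol=1e-6)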
4- I'm also confused about the column bounds in the activations part of the forward method. As an example, why do we upper bound with 3 * self.hidden_size for gates?
That's where the outputs are separated to correspond to the outputs of the four individual calculations. The output is the concatenation of [i_t; f_t; o_t; g_t] (before the sigmoid and tanh have been applied, respectively).
You can get the same separation by splitting the output into four chunks with torch.chunk:

i_t, f_t, o_t, g_t = torch.chunk(preact, 4, dim=1)
But after the separation you would have to apply torch.sigmoid to i_t, f_t and o_t, and torch.tanh to g_t.
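As a small sketch (an equivalent reformulation for illustration, not the original author's code, with made-up sizes), the gate computation with torch.chunk would look like this and gives the same values as the slicing version:

import torch

batch_size, hidden_size = 2, 10
# preact stands in for self.i2h(x) + self.h2h(h) from the original forward
preact = torch.randn(batch_size, 4 * hidden_size)
c = torch.randn(batch_size, hidden_size)

# Split into the four gate pre-activations, then apply the non-linearities
i_t, f_t, o_t, g_t = torch.chunk(preact, 4, dim=1)
i_t, f_t, o_t = i_t.sigmoid(), f_t.sigmoid(), o_t.sigmoid()
g_t = g_t.tanh()

c_t = f_t * c + i_t * g_t
h_t = o_t * c_t.tanh()

# Same result as the slicing version from the original code
gates = preact[:, :3 * hidden_size].sigmoid()
assert torch.allclose(i_t, gates[:, :hidden_size])
assert torch.allclose(g_t, preact[:, 3 * hidden_size:].tanh())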
5- Where are all the parameters of the LSTM? I'm talking about the Us and Ws here:
The parameters W are the weights in the linear layer self.i2h, and the parameters U are the weights in the linear layer self.h2h, but concatenated.
W_i, W_f, W_o, W_c = torch.chunk(self.i2h.weight, 4, dim=0)
U_i, U_f, U_o, U_c = torch.chunk(self.h2h.weight, 4, dim=0)
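As a quick illustration (shapes only, using example sizes): each W chunk has size [hidden_size, input_size] and each U chunk has size [hidden_size, hidden_size], exactly as in the separate-weights formulation.

import torch
import torch.nn as nn

input_size, hidden_size = 5, 10
i2h = nn.Linear(input_size, 4 * hidden_size)
h2h = nn.Linear(hidden_size, 4 * hidden_size)

W_i, W_f, W_o, W_c = torch.chunk(i2h.weight, 4, dim=0)
U_i, U_f, U_o, U_c = torch.chunk(h2h.weight, 4, dim=0)

print(W_i.shape)  # torch.Size([10, 5])   -> [hidden_size, input_size]
print(U_i.shape)  # torch.Size([10, 10])  -> [hidden_size, hidden_size]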
3- Why do we use view for h, c, and x in the forward method?
Based on h_t = h_t.view(1, h_t.size(0), -1) towards the end, the hidden states have the size [1, batch_size, hidden_size]. With h = h.view(h.size(1), -1) that gets rid of the first singleton dimension, giving size [batch_size, hidden_size]. The same could be achieved with h.squeeze(0).
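A small sketch (with made-up sizes) showing that the view and squeeze(0) produce the same tensor for a hidden state of size [1, batch_size, hidden_size]:

import torch

batch_size, hidden_size = 2, 10
h = torch.randn(1, batch_size, hidden_size)

h_view = h.view(h.size(1), -1)   # as in the forward method
h_squeeze = h.squeeze(0)         # equivalent alternative

print(h_view.shape)                     # torch.Size([2, 10])
print(torch.equal(h_view, h_squeeze))   # True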
2- I don't understand the reset method for the parameters. In particular, why do we reset parameters in this way?
Parameter initialisation can have a big impact on the model's learning capability. The general rule for the initialisation is to have values close to zero without being too small. A common initialisation is to draw from a normal distribution with mean 0 and variance of 1 / n , where n is the number of neurons, which in turn means a standard deviation of 1 / sqrt(n) .
In this case it uses a uniform distribution instead of a normal distribution, but the general idea is similar: determine the minimum/maximum value based on the number of neurons while avoiding making them too small. If the minimum/maximum value were 1 / n the values would get very small, so using 1 / sqrt(n) is more appropriate, e.g. for 256 neurons: 1 / 256 = 0.0039 whereas 1 / sqrt(256) = 0.0625.
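For reference, a minimal sketch of what that initialisation amounts to for a single layer (nn.init.uniform_ is just another way to write the w.data.uniform_ call from reset_parameters; the layer here is only an example):

import math
import torch.nn as nn

hidden_size = 256
std = 1.0 / math.sqrt(hidden_size)   # 0.0625 for 256 neurons

layer = nn.Linear(hidden_size, 4 * hidden_size)
for w in layer.parameters():
    nn.init.uniform_(w, -std, std)   # every weight/bias drawn from U(-std, std)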
Initializing neural networks provides some explanations of different initialisations with interactive visualisations.