
LSTM in PyTorch Classifying Names

I am trying the example presented in https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html, but I am using an LSTM model instead of an RNN. The dataset is composed of different names (of different sizes) and their corresponding languages (18 languages in total), and the objective is to train a model that, given a certain name, outputs the language it belongs to.

My problems right now are:

  • How to deal with variable-size names, e.g. Hector and Kim, in the LSTM
  • A whole name (a sequence of characters) is processed each time in the LSTM, so the output of the softmax function has shape (#characters of name, #target classes), but I would like to obtain just (1, #target classes) in order to decide which class each name corresponds to. I have tried to just take the last row, but the results are very bad.
import torch
import torch.nn as nn
from torch.autograd import Variable

class LSTM(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTM, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        self.softmax = nn.LogSoftmax(dim = 1)


    def forward(self, word):
        embeds = self.word_embeddings(word)
        lstm_out, _ = self.lstm(embeds.view(len(word), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(word), -1))
        tag_scores = self.softmax(tag_space)
        return tag_scores

    def initHidden(self):
        return Variable(torch.zeros(1, self.hidden_dim))
lstm = LSTM(n_embedding_dim, n_hidden, n_characters, n_categories)
optimizer = torch.optim.SGD(lstm.parameters(), lr=learning_rate)
criterion = nn.NLLLoss()

def train(category_tensor, line_tensor):
    # i.e. line_tensor = tensor([37,  4, 14, 13, 19,  0, 17,  0, 10,  8, 18]) and category_tensor = tensor([11])
    optimizer.zero_grad()
    output = lstm(line_tensor)

    loss = criterion(output[-1:], category_tensor) # VERY BAD
    loss.backward()

    optimizer.step()

    return output, loss.data.item()

Where line_tensor is of variable size (depending on the size of each name) and is a mapping between each character and its index in the dictionary.
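For reference, a minimal sketch of how such a line_tensor can be built, assuming a vocabulary string all_letters similar to the tutorial's character set (the helper name line_to_tensor is an assumption, not part of the original code):

import string
import torch

# hypothetical vocabulary: all ASCII letters plus a few punctuation characters
all_letters = string.ascii_letters + " .,;'"

def line_to_tensor(line):
    # map each character of the name to its index in all_letters
    return torch.tensor([all_letters.find(c) for c in line], dtype=torch.long)

print(line_to_tensor("Kim"))   # tensor([36,  8, 12])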

Let's dig into the solution step by step.

Frame the problem

Given your problem statement, you will have to use the LSTM for classification rather than its typical use for tagging. The LSTM is unrolled for a certain number of timesteps, and this is the reason why the input and output dimensions of a recurrent model are:

  • Input: batch size X time steps X input size
  • Output: batch size X time steps X hidden size
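As a quick sanity check of these dimensions (a standalone sketch, not part of the model below): PyTorch's nn.LSTM defaults to a time steps x batch size x features layout, so batch_first=True is needed to get the layout described above.

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=1, hidden_size=32, batch_first=True)
x = torch.randn(5, 10, 1)        # 5 names, 10 timesteps (characters), 1 feature per character
out, (h_n, c_n) = lstm(x)
print(out.shape)                 # torch.Size([5, 10, 32]) -> batch x time steps x hidden size
print(h_n.shape)                 # torch.Size([1, 5, 32])  -> final hidden state per name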

Now since you want to use it for classification, you have two options:

  1. Put a dense layer over the outputs of all the timesteps/unrollings [my example below uses this]
  2. Ignore all the timestep outputs except the last and put a dense layer over the last timestep [a sketch of this option follows the list]
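For completeness, a minimal sketch of option 2, classifying from the LSTM's final hidden state (the class name LastStepClassifier and the sizes are assumptions; 18 classes matches the question's 18 languages):

import torch
import torch.nn as nn

class LastStepClassifier(nn.Module):
    def __init__(self, input_size=1, hidden_size=32, num_classes=18):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):              # x: batch x time steps x input size
        _, (h_n, _) = self.lstm(x)     # h_n: 1 x batch x hidden size, the last timestep's state
        return self.fc(h_n[-1])        # batch x num_classes, one row per name

logits = LastStepClassifier()(torch.randn(4, 10, 1))
print(logits.shape)                    # torch.Size([4, 18])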

So the input to our LSTM model is a name fed as one character per LSTM timestep, and the output will be the class corresponding to its language.

How to handle variable-length inputs/names

Again, we have two options here.

  1. Batch names of the same length together. This is called bucketing [a small sketch of this follows the list]
  2. Fix a max length based on the average size of the names you have. Pad the shorter names and truncate the longer ones [my example below uses a max length of 10]
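A minimal sketch of option 1, bucketing (the helper name make_buckets and the sample data are assumptions):

from collections import defaultdict

def make_buckets(names, labels):
    # group (name, label) pairs by name length so each batch needs no padding
    buckets = defaultdict(list)
    for name, label in zip(names, labels):
        buckets[len(name)].append((name, label))
    return dict(buckets)

print(make_buckets(['Hector', 'Kim', 'Lee', 'Garcia'], [0, 1, 2, 0]))
# {6: [('Hector', 0), ('Garcia', 0)], 3: [('Kim', 1), ('Lee', 2)]}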

Do we need an Embedding layer?

No. Embedding layers are typically used to learn a good vector representation of words. But in the case of a character model, the input is a character, not a word, so adding an embedding layer does not help. A character can be directly encoded to a number, and an embedding layer does very little to capture relationships between different characters. You can still use an embedding layer, but I strongly believe it will not help.

Toy character LSTM model code

import numpy as np
import torch
import torch.nn as nn

# Model architecture 
class Recurrent_Model(nn.Module):
    def __init__(self, output_size, time_steps=10):
        super(Recurrent_Model, self).__init__()
        self.time_steps = time_steps
        # batch_first=True so the layout is batch x time steps x features,
        # matching the dimensions described above
        self.lstm = nn.LSTM(1, 32, bidirectional=True, num_layers=2, batch_first=True)
        self.linear = nn.Linear(32*2*time_steps, output_size)

    def forward(self, x):        
        lstm_out, _ = self.lstm(x)     # batch x time steps x (32*2)
        # flatten all timestep outputs for each name and classify
        return self.linear(lstm_out.reshape(-1, 32*2*self.time_steps))

# Sample input and output
names = ['apple', 'dog', 'donkey', "elephant", "hippopotamus"]
lang = [0,1,2,1,0]

def pad_sequence(names, max_len=10):
    # encode each name as a row of character codes, padded with zeros
    # and truncated to max_len
    x = np.zeros((len(names), max_len))
    for i, name in enumerate(names):
        for j, c in enumerate(name):
            if j >= max_len:
                break
            x[i, j] = ord(c)
    return torch.FloatTensor(x)

x = pad_sequence(names)
x = torch.unsqueeze(x, dim=2)
y = torch.LongTensor(lang)

model = Recurrent_Model(3)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), 0.01)

for epoch in range(500):
    model.train()
    output = model(x)
    loss = criterion(output, y)
    print (f"Train Loss: {loss.item()}")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Note

  1. All the tensors are loaded into memory, so if you have a huge dataset you will have to use a Dataset and DataLoader to avoid OOM errors [a small sketch follows this list]
  2. You will have to split the data into train and test sets and validate on the test dataset (the standard model-building stuff)
  3. You will have to normalize the input tensors before passing them to the model (again, the standard model-building stuff)
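A minimal sketch of points 1 and 3, assuming x and y are the padded tensors built above (the class name NamesDataset and the normalization constant are assumptions):

from torch.utils.data import Dataset, DataLoader

class NamesDataset(Dataset):
    def __init__(self, x, y):
        self.x = x / 255.0      # rough normalization of the ord() codes into [0, 1]
        self.y = y

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

loader = DataLoader(NamesDataset(x, y), batch_size=32, shuffle=True)
for batch_x, batch_y in loader:
    pass  # feed batch_x / batch_y to the model instead of the full x / y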

Finally

So how do you make sure your model architecture does not have bugs and is actually learning? As Andrej Karpathy says, overfit the model on a small dataset; if it overfits, then we are fine.

While there may be different approaches depending on the application domain, a common approach to variable-sized input is to pad it to a MAX_SIZE. Either define a sufficiently large MAX_SIZE or pick the longest name in the dataset to define it.

The padding should be zeros or some other null character that fits the tokenization scheme.

Embedding is pretty important for NLP models, and as you can see, the LSTM layer expects you to give it the resulting embedding dimension.

    self.embedding = nn.Embedding(vocab_size, embedding_dim)

vocab_size = the number of unique characters (tokens) in the dataset.

embedding_dim = an appropriate number. Experiment with different dimensions; what gets better results, 5 or 512?

    self.lstm = nn.LSTM(embedding_dim, hidden_dim)
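Putting these pieces together, a minimal end-to-end sketch of this approach: pad names to MAX_SIZE, embed each character index, and classify from the LSTM's final hidden state, which gives the (1, #classes) output per name asked about. The class name NameClassifier, MAX_SIZE, the padding index 0, and the example sizes are all assumptions.

import torch
import torch.nn as nn

MAX_SIZE = 10

class NameClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, n_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, n_classes)

    def forward(self, x):                # x: batch x MAX_SIZE character indices
        embeds = self.embedding(x)       # batch x MAX_SIZE x embedding_dim
        _, (h_n, _) = self.lstm(embeds)  # h_n: 1 x batch x hidden_dim
        return self.fc(h_n[-1])          # batch x n_classes, one row per name

model = NameClassifier(vocab_size=60, embedding_dim=16, hidden_dim=64, n_classes=18)
dummy = torch.zeros(4, MAX_SIZE, dtype=torch.long)   # a batch of 4 fully padded names
print(model(dummy).shape)                            # torch.Size([4, 18])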
