LSTM in PyTorch Classifying Names

I am trying the example presented in https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html, but I am using an LSTM model instead of an RNN. The dataset is composed of names of different lengths and their corresponding languages (18 languages in total), and the objective is to train a model that, given a name, outputs the language it belongs to.

My problems right now are:

  • How to deal with variable-length names, e.g. Hector and Kim, in the LSTM.
  • A whole name (a sequence of characters) is processed every time in the LSTM, so the output of the softmax function has shape (#characters in name, #target classes), but I would like to obtain just (1, #target classes) in order to decide which class each name corresponds to. I have tried taking just the last row, but the results are very bad.

import torch
import torch.nn as nn

class LSTM(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTM, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        self.softmax = nn.LogSoftmax(dim = 1)


    def forward(self, word):
        embeds = self.word_embeddings(word)
        lstm_out, _ = self.lstm(embeds.view(len(word), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(word), -1))
        tag_scores = self.softmax(tag_space)
        return tag_scores

    def initHidden(self):
        # Unused by forward(); nn.LSTM defaults to a zero initial state.
        return torch.zeros(1, self.hidden_dim)

lstm = LSTM(n_embedding_dim, n_hidden, n_characters, n_categories)
optimizer = torch.optim.SGD(lstm.parameters(), lr=learning_rate)
criterion = nn.NLLLoss()

def train(category_tensor, line_tensor):
    # e.g. line_tensor = tensor([37,  4, 14, 13, 19,  0, 17,  0, 10,  8, 18]) and category_tensor = tensor([11])
    optimizer.zero_grad()
    output = lstm(line_tensor)

    loss = criterion(output[-1:], category_tensor) # VERY BAD
    loss.backward()

    optimizer.step()

    return output, loss.item()

Here line_tensor has variable length (depending on the length of each name) and maps each character to its index in the dictionary.

Let's dig into the solution step by step.

Frame the problem

Given your problem statement, you will have to use the LSTM for classification rather than its typical use of tagging. The LSTM is unrolled for a certain number of timesteps, which is why the input and output dimensions of a recurrent model are as follows (in PyTorch this layout requires batch_first=True; by default the time-step dimension comes first):

  • Input: batch size X time steps X input size
  • Output: batch size X time steps X hidden size
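
To make these shapes concrete, here is a minimal sketch that pushes a dummy batch through nn.LSTM. The sizes are hypothetical and chosen only for illustration; the one thing it assumes beyond the text above is batch_first=True, which is what gives the batch-first layout.

```python
import torch
import torch.nn as nn

# Hypothetical sizes chosen only for illustration
batch_size, time_steps, input_size, hidden_size = 4, 10, 1, 32

# batch_first=True makes the layout (batch, time, features);
# without it nn.LSTM expects (time, batch, features).
lstm = nn.LSTM(input_size, hidden_size, batch_first=True)

x = torch.zeros(batch_size, time_steps, input_size)
out, (h_n, c_n) = lstm(x)

print(out.shape)  # per-timestep outputs: (batch, time, hidden)
print(h_n.shape)  # final hidden state: (num_layers, batch, hidden)
```

Note that h_n keeps the layer dimension first regardless of batch_first.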

Now since you want to use it for classification, you have two options:

  1. Put a dense layer over the outputs of all the timesteps/unrollings [My example below uses this]
  2. Ignore all timestep outputs except the last, and put a dense layer over the last timestep
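
For comparison, option 2 could look roughly like the sketch below. LastStepClassifier and its sizes are hypothetical names made up for this illustration, and it assumes a unidirectional, batch-first LSTM whose final hidden state is fed to the dense layer.

```python
import torch
import torch.nn as nn

class LastStepClassifier(nn.Module):
    """Hypothetical model: classify from the final hidden state only."""

    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.linear = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # h_n has shape (num_layers, batch, hidden); take the last layer
        _, (h_n, _) = self.lstm(x)
        return self.linear(h_n[-1])

model = LastStepClassifier(input_size=1, hidden_size=32, output_size=3)
x = torch.zeros(5, 10, 1)  # 5 names, 10 timesteps, 1 feature per step
logits = model(x)
print(logits.shape)
```

One caveat with this variant: if names are zero-padded, the last timestep of a short name is padding, so the final state may be dominated by padding unless the sequences are packed or bucketed.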

So the input to our LSTM model is the name, fed as one character per LSTM timestep, and the output will be the class corresponding to its language.

How to handle variable length inputs/names

We have two options again here.

  1. Batch same-length names together. This is called bucketing.
  2. Fix a max length based on the average length of the names you have. Pad the shorter names and truncate the longer ones [My example below uses a max length of 10]
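
Bucketing (option 1) is not shown in my example, so here is a rough sketch of the idea in plain Python. bucket_by_length is a hypothetical helper and the names are made-up sample data; each bucket could then be turned into a batch without any padding.

```python
from collections import defaultdict

def bucket_by_length(names):
    """Group names so that every bucket holds names of one length."""
    buckets = defaultdict(list)
    for name in names:
        buckets[len(name)].append(name)
    return dict(buckets)

names = ["Kim", "Lee", "Hector", "Victor", "Ana"]
buckets = bucket_by_length(names)
for length, group in sorted(buckets.items()):
    print(length, group)
```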

Do we need an Embedding layer?

No. Embedding layers are typically used to learn good vector representations of words. In a character-level model the input is a character, not a word, so adding an embedding layer does not help. Characters can be encoded directly as numbers, and an embedding layer does very little to capture relationships between different characters. You can still use an embedding layer, but I strongly believe it will not help.
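
As a sketch of what "encoded directly as numbers" can mean, the snippet below maps each character to an integer index. The alphabet, the char_to_index table, and the choice to reserve 0 for padding are all assumptions made for this illustration; my toy model below simply uses ord(c) instead.

```python
import string

# Assumed alphabet for illustration; index 0 is reserved for padding
alphabet = string.ascii_lowercase
char_to_index = {c: i + 1 for i, c in enumerate(alphabet)}

def encode(name):
    """Map each character to its integer index (unknown characters map to 0)."""
    return [char_to_index.get(c, 0) for c in name.lower()]

print(encode("Kim"))  # [11, 9, 13]
```

In this sketch unknown characters share index 0 with padding; a real vocabulary would probably give them their own index.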

Toy character LSTM model code

import numpy as np
import torch
import torch.nn as nn

# Model architecture 
class Recurrent_Model(nn.Module):
    def __init__(self, output_size, time_steps=10):
        super(Recurrent_Model, self).__init__()
        self.time_steps = time_steps
        # batch_first=True so x is read as (batch, time steps, features)
        self.lstm = nn.LSTM(1, 32, bidirectional=True, num_layers=2, batch_first=True)
        self.linear = nn.Linear(32*2*time_steps, output_size)

    def forward(self, x):        
        lstm_out, _ = self.lstm(x)
        return self.linear(lstm_out.view(-1,32*2*self.time_steps))

# Sample input and output
names = ['apple', 'dog', 'donkey', "elephant", "hippopotamus"]
lang = [0,1,2,1,0]

def pad_sequence(names, max_len=10):
    # Encode each name as character codes, zero-padded or truncated to max_len
    x = np.zeros((len(names), max_len))
    for i, name in enumerate(names):
        for j, c in enumerate(name):
            if j >= max_len:
                break
            x[i, j] = ord(c)
    return torch.FloatTensor(x)

x = pad_sequence(names)
x = torch.unsqueeze(x, dim=2)
y = torch.LongTensor(lang)

model = Recurrent_Model(3)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), 0.01)

for epoch in range(500):
    model.train()
    output = model(x)
    loss = criterion(output, y)
    print (f"Train Loss: {loss.item()}")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Note

  1. All the tensors are loaded into memory, so if you have a huge dataset, you will have to use a Dataset and DataLoader to avoid OOM errors.
  2. You will have to split the data into train and test sets and validate on the test set (the standard model-building stuff).
  3. You will have to normalize the input tensors before passing them to the model (again, the standard model-building stuff).
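
For note 3, one simple possibility is to scale the raw character codes into [0, 1] before building the tensor. This is only a sketch of one option: normalize_codes and the max_code value of 127 are assumptions that fit ASCII names, not something the toy code above does.

```python
def normalize_codes(codes, max_code=127.0):
    """Scale raw character codes into [0, 1]; assumes ASCII input."""
    return [code / max_code for code in codes]

raw = [ord(c) for c in "dog"] + [0] * 7  # zero-padded to length 10
scaled = normalize_codes(raw)
print(scaled[:3])
```

Padding positions stay at exactly 0.0 after scaling, so they remain distinguishable from real characters.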

Finally

So how do you make sure your model architecture does not have bugs and is actually learning? As Andrej Karpathy says, overfit the model on a small dataset; if it overfits, then we are fine.
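
A rough sketch of that sanity check is below. Everything in it is made up for illustration: the random dataset, the TinyModel class, and the use of Adam with a fixed seed (my toy code above uses SGD). The point is only that the loss on a handful of samples should drop substantially if the model and training loop are wired correctly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny hypothetical dataset: 4 sequences of 5 steps, 2 classes
x = torch.randn(4, 5, 1)
y = torch.tensor([0, 1, 0, 1])

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(1, 16, batch_first=True)
        self.linear = nn.Linear(16, 2)

    def forward(self, x):
        _, (h_n, _) = self.lstm(x)
        return self.linear(h_n[-1])

model = TinyModel()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Train on the same four samples over and over and record the loss
losses = []
for step in range(200):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

accuracy = (model(x).argmax(dim=1) == y).float().mean().item()
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}, accuracy {accuracy:.2f}")
```

If the loss refuses to go down even on a dataset this small, the bug is more likely in the model or data pipeline than in the hyperparameters.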

While there may be different approaches depending on the application domain, a common approach to variable-sized inputs is to pad them to a MAX_SIZE. Either define a sufficiently large MAX_SIZE or use the length of the longest name in the dataset.

The padding should be zeros or some other null character that fits the tokenization scheme.

Embedding is pretty important for NLP models and, as you see, the LSTM layer expects you to give it the resulting embedding dimension.

    self.embedding = nn.Embedding(vocab_size, embedding_dim)

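vocab_size = number of unique tokens in the dataset; for a character-level model like this one, that is the number of unique characters (plus one if an index is reserved for padding), not the number of unique names.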

embedding_dim = an appropriate number. Experiment with different dimensions and see which gives better results: 5? 512?

    self.lstm = nn.LSTM(embedding_dim, 
                       hidden_dim)
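
Putting the padding and embedding pieces together, a minimal sketch might look like this. The sizes and index values are hypothetical, and two choices go beyond what is stated above: padding_idx=0 keeps the padding vector at zero, and pack_padded_sequence lets the LSTM stop at each name's true length instead of running over the padding.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

vocab_size, embedding_dim, hidden_dim = 30, 8, 16  # illustrative sizes

embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

# Two zero-padded index sequences with true lengths 6 and 3
batch = torch.tensor([[8, 5, 3, 20, 15, 18],
                      [11, 9, 13, 0, 0, 0]])
lengths = torch.tensor([6, 3])

embedded = embedding(batch)  # (batch, time, embedding_dim)
packed = pack_padded_sequence(embedded, lengths, batch_first=True,
                              enforce_sorted=False)
_, (h_n, _) = lstm(packed)
print(h_n[-1].shape)  # one vector per name, taken at its true last step
```

The resulting h_n[-1] can be fed to a linear layer to get one score vector per name.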
