I am trying the example presented in https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html but I am using an LSTM model instead of an RNN. The dataset is composed of names (of different lengths) and their corresponding languages (18 languages in total), and the objective is to train a model that, given a name, outputs the language it belongs to.
My problem right now is that the output of the model for a single name has shape

(#characters of name, #target classes)

but I would like to obtain just (1, #target classes) in order to decide to which class each name corresponds. I have tried just taking the last row, but the results are very bad.

import torch
import torch.nn as nn
from torch.autograd import Variable

class LSTM(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTM, self).__init__()
        self.hidden_dim = hidden_dim
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, word):
        embeds = self.word_embeddings(word)
        lstm_out, _ = self.lstm(embeds.view(len(word), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(word), -1))
        tag_scores = self.softmax(tag_space)
        return tag_scores

    def initHidden(self):
        # (not used in forward)
        return Variable(torch.zeros(1, self.hidden_dim))
lstm = LSTM(n_embedding_dim, n_hidden, n_characters, n_categories)
optimizer = torch.optim.SGD(lstm.parameters(), lr=learning_rate)
criterion = nn.NLLLoss()
def train(category_tensor, line_tensor):
    # e.g. line_tensor = tensor([37, 4, 14, 13, 19, 0, 17, 0, 10, 8, 18]) and category_tensor = tensor([11])
    optimizer.zero_grad()
    output = lstm(line_tensor)
    loss = criterion(output[-1:], category_tensor) # VERY BAD
    loss.backward()
    optimizer.step()
    return output, loss.data.item()
Here line_tensor is of variable size (depending on the length of each name); it maps each character of a name to its index in the dictionary.
Let's dig into the solution step by step.
Given your problem statement, you will have to use an LSTM for making a classification rather than its typical use for tagging. The LSTM is unrolled for a certain number of timesteps, and this is the reason why the input and output dimensions of recurrent models are

batch size X time steps X input size (for the input)
batch size X time steps X hidden size (for the output)
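As a quick shape check (this snippet is my addition; batch_first=True makes PyTorch follow the batch size X time steps X features convention above):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=32, batch_first=True)
x = torch.randn(4, 10, 8)   # batch=4, time steps=10, input size=8
out, (h_n, c_n) = lstm(x)
print(out.shape)            # torch.Size([4, 10, 32]): batch X time steps X hidden size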
Now, since you want to use it for classification, you have two options: use only the output at the last timestep, or use the outputs from all the timesteps together (the toy code further below takes the latter approach; a minimal sketch of both follows).
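A minimal sketch of both options (the shapes are illustrative and chosen by me to match the toy model below: one input feature per character, 32 hidden units, bidirectional):

import torch
import torch.nn as nn

lstm = nn.LSTM(1, 32, bidirectional=True, num_layers=2, batch_first=True)
x = torch.randn(4, 10, 1)                  # 4 names, 10 timesteps, 1 feature per character
lstm_out, _ = lstm(x)                      # (4, 10, 64): 32 hidden units x 2 directions
last_step = lstm_out[:, -1, :]             # option 1: last timestep only, shape (4, 64)
all_steps = lstm_out.reshape(4, 10 * 64)   # option 2: flatten all timesteps, shape (4, 640)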
So the input to our LSTM model is a name fed one character per LSTM timestep, and the output will be the class corresponding to its language.
We have two options again here for representing each character: encode it directly as a number, or learn an embedding vector for it. Do we need an embedding layer?
No. Embedding layers are typically used to learn good vector representations of words. But in the case of a character model, the input is a character, not a word, so adding an embedding layer does not help. A character can be directly encoded as a number, and an embedding layer does very little to capture relationships between different characters. You can still use an embedding layer, but I strongly believe it will not help.
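To make the comparison concrete, here is a small sketch of both encodings (the 128-entry ASCII vocabulary and the embedding size of 8 are arbitrary choices of mine):

import torch
import torch.nn as nn

name = 'apple'
# Direct encoding: each character becomes a plain number (its ASCII code),
# one feature per timestep, as the toy model below expects.
direct = torch.FloatTensor([ord(c) for c in name]).view(1, -1, 1)  # (1, 5, 1)

# Embedding alternative: treat each character code as an index into a
# learned lookup table.
embedding = nn.Embedding(num_embeddings=128, embedding_dim=8)
indices = torch.LongTensor([ord(c) for c in name]).view(1, -1)     # (1, 5)
embedded = embedding(indices)                                      # (1, 5, 8)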
Toy character LSTM model code
import numpy as np
import torch
import torch.nn as nn

# Model architecture
class Recurrent_Model(nn.Module):
    def __init__(self, output_size, time_steps=10):
        super(Recurrent_Model, self).__init__()
        self.time_steps = time_steps
        # batch_first=True so the input is batch size X time steps X input size
        self.lstm = nn.LSTM(1, 32, bidirectional=True, num_layers=2, batch_first=True)
        # Classify from the flattened outputs of all timesteps
        self.linear = nn.Linear(32*2*time_steps, output_size)

    def forward(self, x):
        lstm_out, _ = self.lstm(x)
        return self.linear(lstm_out.reshape(-1, 32*2*self.time_steps))
# Sample input and output
names = ['apple', 'dog', 'donkey', 'elephant', 'hippopotamus']
lang = [0, 1, 2, 1, 0]

def pad_sequence(names, max_len=10):
    # Encode each character as its ASCII code, padding/truncating to max_len
    x = np.zeros((len(names), max_len))
    for i, name in enumerate(names):
        for j, c in enumerate(name):
            if j >= max_len:
                break
            x[i, j] = ord(c)
    return torch.FloatTensor(x)

x = pad_sequence(names)
x = torch.unsqueeze(x, dim=2)   # batch size X time steps X input size = (5, 10, 1)
y = torch.LongTensor(lang)
model = Recurrent_Model(3)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), 0.01)

for epoch in range(500):
    model.train()
    output = model(x)
    loss = criterion(output, y)
    print(f"Train Loss: {loss.item()}")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
So how do you make sure your model architecture does not have bugs and is actually learning? As Andrej Karpathy says, overfit the model on a small dataset; if it can overfit, we are fine.
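Continuing the toy example above (this check is my addition, not part of the original code):

# If the model has overfit the 5 training names, it should classify all of
# them correctly; anything less hints at a bug in the architecture.
model.eval()
with torch.no_grad():
    preds = model(x).argmax(dim=1)
print((preds == y).float().mean().item())  # expect 1.0 on the training set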
While there may be different approaches depending on the application domain, a common approach to variable-sized inputs is to pad them all to a MAX_SIZE. Either define a sufficiently large MAX_SIZE or use the length of the longest name in the dataset.
The padding should be zeros or some other null character that fits the tokenization scheme.
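A minimal padding sketch (assuming characters have already been mapped to indices starting at 1, so that 0 can serve as the null/padding token):

import torch

def pad_to_max(token_ids, max_size=10):
    # Truncate to max_size, then right-pad with the null token 0
    padded = token_ids[:max_size] + [0] * (max_size - len(token_ids))
    return torch.LongTensor(padded)

print(pad_to_max([1, 2, 5]))  # tensor([1, 2, 5, 0, 0, 0, 0, 0, 0, 0])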
Embeddings are pretty important for NLP models, and as you can see, the LSTM layer expects you to give it the embedding dimension as its input size.
self.embedding = nn.Embedding(vocab_size, embedding_dim)
vocab_size = the number of unique characters (tokens) in the dataset, since the embedding table needs one row per character index.
embedding_dim = an appropriate number. Experiment with different dimensions: what gets better results, 5? 512?
self.lstm = nn.LSTM(embedding_dim, hidden_dim)
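Putting the two layers together, a sketch of the wiring (vocab_size=60, embedding_dim=16 and hidden_dim=32 are placeholder values of mine, not recommendations):

import torch
import torch.nn as nn

vocab_size, embedding_dim, hidden_dim = 60, 16, 32
embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

batch = torch.LongTensor([[1, 2, 5, 0, 0]])  # one padded name, MAX_SIZE = 5
out, _ = lstm(embedding(batch))              # (1, 5, 32): batch X time steps X hidden size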