Char RNN classification with batch size

I'm replicating this example for classification with a PyTorch char-RNN:

for iter in range(1, n_iters + 1):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output, loss = train(category_tensor, line_tensor)
    current_loss += loss

I see that in each iteration only one example is taken, at random. I would like each epoch to go over the whole dataset with a specific batch size of examples. I can adjust the code to do this myself, but I was wondering if some flags already exist.

Thank you

If you construct a Dataset class by inheriting from the PyTorch Dataset class and then feed it into the PyTorch DataLoader class, you can set the batch_size parameter to determine how many examples you get in each iteration of your training loop.

I have followed the same tutorial as you. I can show you how I have used the PyTorch classes above to get the data in batches.

# load data into a DataFrame using the findFiles function from the tutorial
import pandas as pd
from pathlib import Path

files = [Path(f) for f in findFiles('data/names/*.txt')]  # wrap in Path so that .stem works below
df_names = pd.concat([
    pd.read_table(f, names=["names"], header=None)
      .assign(lang=f.stem)  # the file name (e.g. "Japanese") becomes the language label
    for f in files]).reset_index(drop=True)
print(df_names.head())

# output: 
#      names      lang
# 0      Abe  Japanese
# 1  Abukara  Japanese
# 2   Adachi  Japanese
# 3     Aida  Japanese
# 4   Aihara  Japanese

# Make train and test data
# (also encode each language as an integer label so that the targets can be turned into a tensor later)
from sklearn.model_selection import train_test_split
df_names["label"] = pd.factorize(df_names.lang)[0]  # integer id per language
X_train, X_dev, y_train, y_dev = train_test_split(df_names.names, df_names.label,
                                                  train_size = 0.8)
df_train = pd.concat([X_train, y_train], axis=1)
df_val = pd.concat([X_dev, y_dev], axis=1)

Now I construct a modified Dataset class using the dataframe(s) above by inheriting from the PyTorch Dataset class.

import torch
from torch.utils.data import Dataset, DataLoader

class NameDatasetReader(Dataset):
    def __init__(self, df: pd.DataFrame):
        self.df = df

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx: int):
        row = self.df.iloc[idx]       # get a row by position (the split indices are no longer contiguous)
        input_name = list(row.names)  # turn the name into a list of chars
        len_name = len(input_name)    # length of the name (used later to pad/pack the sequence)
        labels = row.label            # integer target created above
        return input_name, len_name, labels

train_dat = NameDatasetReader(df_train)  # make a dataset from the dataframe with the training data

Now, the thing is that when you work with batches of sequences, the sequences in each batch need to be of equal length. That is why I also return the length of the extracted name from the dataframe in the __getitem__() function above. It is used in a function that modifies the training examples in each batch.

This is called a collate function (passed to the DataLoader as collate_fn), and in this example it modifies each batch of your training data so that the sequences in a given batch are of equal length.

# Dictionary of all letters (as in the original tutorial,
# I have just also inserted an entry for the padding token)
all_letters_dict = dict(zip(all_letters, range(1, len(all_letters) + 1)))
all_letters_dict['<PAD>'] = 0

# function to turn name into a tensor 
def line_to_tensor(line):
    """turns name into a tensor of one hot encoded vectors"""
    tensor = torch.zeros(len(line),
                         len(all_letters_dict.keys())) # (name_len x vocab_size) - <PAD> is part of vocab
    for li, letter in enumerate(line):
        tensor[li][all_letters_dict[letter]] = 1
    return tensor

from typing import List, Tuple

def collate_batch_lstm(input_data: List[Tuple]) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """
    Combines multiple name samples into a single batch
    :param input_data: The combined input_ids, seq_lens, and labels for the batch
    :return: A tuple of tensors (input_ids, seq_lens, labels)
    """
    
    # loops over batch input and extracts vals 
    names = [i[0] for i in input_data] 
    seq_names_len = [i[1] for i in input_data] 
    labels = [i[2] for i in input_data] 

    max_length = max(seq_names_len) # longest sequence aka. name 

    # Pad all of the input samples to the max length 
    names = [(name + ["<PAD>"] * (max_length - len(name))) for name in names]  
    
    input_ids = [line_to_tensor(name) for name in names] # turn each list of chars into a tensor with one hot vecs
    
    # Make sure each sample is max_length long
    assert (all(len(i) == max_length for i in input_ids))
    return torch.stack(input_ids), torch.tensor(seq_names_len), torch.tensor(labels) 

Now, I can construct a DataLoader by passing the dataset object from above, the collate_batch_lstm() function, and a given batch_size to the DataLoader class.

train_dat_loader = DataLoader(train_dat, batch_size = 4, collate_fn = collate_batch_lstm)

You can now iterate over train_dat_loader, which returns a training batch with 4 names in each iteration.
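
For example, the per-example loop from the question can be replaced by a loop over epochs and batches. This is only a sketch, assuming a hypothetical train() function that accepts a whole padded batch (the tutorial's train() only takes a single example); you can also pass shuffle=True to the DataLoader to randomize the order each epoch:

n_epochs = 10
current_loss = 0
for epoch in range(n_epochs):
    for seq_tensor, seq_lengths, labels in train_dat_loader:
        # each iteration now sees a batch of 4 padded names instead of a single example
        output, loss = train(labels, seq_tensor, seq_lengths)
        current_loss += loss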

Consider a given batch from train_dat_loader:

seq_tensor, seq_lengths, labels = next(iter(train_dat_loader))
print(seq_tensor.shape, seq_lengths.shape, labels.shape)
print(seq_tensor)
print(seq_lengths)
print(labels)
# output: 
# torch.Size([4, 11, 59]) torch.Size([4]) torch.Size([4])
# tensor([[[0., 0., 0.,  ..., 0., 0., 0.],
#          [0., 0., 0.,  ..., 0., 0., 0.],
#          [0., 0., 0.,  ..., 0., 0., 0.],
#          ...,
#          [0., 0., 0.,  ..., 0., 0., 0.],
#          [0., 0., 0.,  ..., 0., 0., 0.],
#          [0., 0., 0.,  ..., 0., 0., 0.]],

#         [[0., 0., 0.,  ..., 0., 0., 0.],
#          [0., 0., 0.,  ..., 0., 0., 0.],
#          [0., 0., 0.,  ..., 0., 0., 0.],
#          ...,
#          [1., 0., 0.,  ..., 0., 0., 0.],
#          [1., 0., 0.,  ..., 0., 0., 0.],
#          [1., 0., 0.,  ..., 0., 0., 0.]],

#         [[0., 0., 0.,  ..., 0., 0., 0.],
#          [0., 0., 0.,  ..., 0., 0., 0.],
#          [0., 0., 0.,  ..., 0., 0., 0.],
#          ...,
#          [1., 0., 0.,  ..., 0., 0., 0.],
#          [1., 0., 0.,  ..., 0., 0., 0.],
#          [1., 0., 0.,  ..., 0., 0., 0.]],

#         [[0., 0., 0.,  ..., 0., 0., 0.],
#          [0., 0., 0.,  ..., 0., 0., 0.],
#          [0., 0., 0.,  ..., 0., 0., 0.],
#          ...,
#          [1., 0., 0.,  ..., 0., 0., 0.],
#          [1., 0., 0.,  ..., 0., 0., 0.],
#          [1., 0., 0.,  ..., 0., 0., 0.]]])
# tensor([11,  3,  8,  7])
# tensor([14,  1, 14,  2])

It gives us a tensor of size (4 x 11 x 59): 4 because we specified a batch size of 4; 11 is the length of the longest name in the given batch (all other names have been padded with the <PAD> token so that they are of equal length); 59 is the number of characters in our vocabulary.

The next thing is to incorporate this into your training routine and to use a packing routine to avoid doing redundant computations on the padding that you have added to your data :)
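
As a rough illustration (not part of the tutorial), the seq_lengths returned by the collate function can be passed to torch.nn.utils.rnn.pack_padded_sequence so that an LSTM skips the padded positions. The model below, its hidden size, and the optimizer are hypothetical placeholders, just to show where the packing goes:

from torch import nn, optim
from torch.nn.utils.rnn import pack_padded_sequence

class CharLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, hidden_size, n_classes):
        super().__init__()
        self.lstm = nn.LSTM(vocab_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, n_classes)

    def forward(self, seq_tensor, seq_lengths):
        # pack so that the LSTM does not run over the <PAD> positions
        packed = pack_padded_sequence(seq_tensor, seq_lengths,
                                      batch_first=True, enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)
        return self.fc(h_n[-1])  # classify from the last hidden state

model = CharLSTMClassifier(vocab_size=len(all_letters_dict), hidden_size=128,
                           n_classes=df_names.label.nunique())
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

for seq_tensor, seq_lengths, labels in train_dat_loader:
    optimizer.zero_grad()
    logits = model(seq_tensor, seq_lengths)
    loss = criterion(logits, labels.long())  # CrossEntropyLoss expects long targets
    loss.backward()
    optimizer.step()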
