Char RNN classification with batch size
I am replicating this example to do classification with a PyTorch char-RNN.
for iter in range(1, n_iters + 1):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output, loss = train(category_tensor, line_tensor)
    current_loss += loss
I see that in each iteration only 1 example is sampled at random. I would like each epoch to go through the whole dataset in batches of a given size. I could adapt the code to do this myself, but I was wondering whether a flag for it already exists.
Thanks
If you build a dataset class by inheriting from the PyTorch Dataset class and feed it into the PyTorch DataLoader class, you can set the batch_size parameter to determine how many examples you get in each iteration of your training loop.
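A minimal sketch of that pattern, with toy integer data standing in for real examples (everything here is illustrative, not from the tutorial):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """A tiny dataset of the integers 0..9."""
    def __init__(self):
        self.data = list(range(10))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

# batch_size controls how many examples each iteration yields
loader = DataLoader(ToyDataset(), batch_size=4)
for batch in loader:
    print(batch)  # tensor([0, 1, 2, 3]), tensor([4, 5, 6, 7]), tensor([8, 9])
```

The DataLoader handles shuffling (via `shuffle=True`) and batching for you; the last batch is simply smaller when the dataset size is not divisible by the batch size.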
I followed the same tutorial as you. I can show you how I used the PyTorch classes above to get my data in batches.
import pandas as pd
from pathlib import Path

# load data into a DataFrame using the findFiles function as in the tutorial;
# wrapping each path in Path gives us .stem (the language name) below
files = [Path(f) for f in findFiles('data/names/*.txt')]
df_names = pd.concat([
    pd.read_table(f, names=["names"], header=None)
      .assign(lang=f.stem)
    for f in files]).reset_index(drop=True)
print(df_names.head())
# output:
# names lang
# 0 Abe Japanese
# 1 Abukara Japanese
# 2 Adachi Japanese
# 3 Aida Japanese
# 4 Aihara Japanese
# Make train and test data
from sklearn.model_selection import train_test_split

X_train, X_dev, y_train, y_dev = train_test_split(df_names.names, df_names.lang,
                                                  train_size=0.8)
df_train = pd.concat([X_train, y_train], axis=1)
df_val = pd.concat([X_dev, y_dev], axis=1)
Now I construct a modified dataset class from the dataframes above, by inheriting from the PyTorch Dataset class.
import torch
from torch.utils.data import Dataset, DataLoader

class NameDatasetReader(Dataset):
    def __init__(self, df: pd.DataFrame):
        self.df = df.reset_index(drop=True)  # reset index so .loc[idx] works after the split
        # map each language to an integer class index so labels fit in a tensor
        self.lang_to_idx = {lang: i for i, lang in enumerate(sorted(self.df.lang.unique()))}

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx: int):
        row = self.df.loc[idx]              # gets a row from the df
        input_name = list(row.names)        # turns name into a list of chars
        len_name = len(input_name)          # length of name (used to pad packed sequence)
        label = self.lang_to_idx[row.lang]  # target as an integer class index
        return input_name, len_name, label

train_dat = NameDatasetReader(df_train)  # make dataset from dataframe with training data
Now, the issue is that when you want to work with batches of sequences, the sequences in each batch need to be of equal length. That is why I also extract the length of each name in the __getitem__() function above. It is used by the function that modifies the training examples in each batch.
This is called a collate_fn, and in this case it modifies the training data of each batch so that the sequences in a given batch are of equal length.
# Dictionary of all letters (as in the original tutorial,
# I have just inserted also an entry for the padding token)
all_letters_dict = dict(zip(all_letters, range(1, len(all_letters) + 1)))
all_letters_dict['<PAD>'] = 0

# function to turn name into a tensor
def line_to_tensor(line):
    """turns name into a tensor of one hot encoded vectors"""
    tensor = torch.zeros(len(line),
                         len(all_letters_dict))  # (name_len x vocab_size) - <PAD> is part of vocab
    for li, letter in enumerate(line):
        tensor[li][all_letters_dict[letter]] = 1
    return tensor
from typing import Tuple

def collate_batch_lstm(input_data: Tuple) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """
    Combines multiple name samples into a single batch
    :param input_data: The combined input_ids, seq_lens, and labels for the batch
    :return: A tuple of tensors (input_ids, seq_lens, labels)
    """
    # loops over batch input and extracts vals
    names = [i[0] for i in input_data]
    seq_names_len = [i[1] for i in input_data]
    labels = [i[2] for i in input_data]

    max_length = max(seq_names_len)  # longest sequence aka. name

    # Pad all of the input samples to the max length
    names = [(name + ["<PAD>"] * (max_length - len(name))) for name in names]
    input_ids = [line_to_tensor(name) for name in names]  # turn each list of chars into a tensor with one hot vecs

    # Make sure each sample is max_length long
    assert all(len(i) == max_length for i in input_ids)
    return torch.stack(input_ids), torch.tensor(seq_names_len), torch.tensor(labels)
Now I can construct a dataloader by plugging the dataset object above, the collate_batch_lstm() function above, and a given batch_size into the DataLoader class.
train_dat_loader = DataLoader(train_dat, batch_size = 4, collate_fn = collate_batch_lstm)
You can now iterate over train_dat_loader, and it returns a training batch with 4 names in each iteration.
Consider a given batch from train_dat_loader:
seq_tensor, seq_lengths, labels = next(iter(train_dat_loader))
print(seq_tensor.shape, seq_lengths.shape, labels.shape)
print(seq_tensor)
print(seq_lengths)
print(labels)
# output:
# torch.Size([4, 11, 59]) torch.Size([4]) torch.Size([4])
# tensor([[[0., 0., 0., ..., 0., 0., 0.],
# [0., 0., 0., ..., 0., 0., 0.],
# [0., 0., 0., ..., 0., 0., 0.],
# ...,
# [0., 0., 0., ..., 0., 0., 0.],
# [0., 0., 0., ..., 0., 0., 0.],
# [0., 0., 0., ..., 0., 0., 0.]],
# [[0., 0., 0., ..., 0., 0., 0.],
# [0., 0., 0., ..., 0., 0., 0.],
# [0., 0., 0., ..., 0., 0., 0.],
# ...,
# [1., 0., 0., ..., 0., 0., 0.],
# [1., 0., 0., ..., 0., 0., 0.],
# [1., 0., 0., ..., 0., 0., 0.]],
# [[0., 0., 0., ..., 0., 0., 0.],
# [0., 0., 0., ..., 0., 0., 0.],
# [0., 0., 0., ..., 0., 0., 0.],
# ...,
# [1., 0., 0., ..., 0., 0., 0.],
# [1., 0., 0., ..., 0., 0., 0.],
# [1., 0., 0., ..., 0., 0., 0.]],
# [[0., 0., 0., ..., 0., 0., 0.],
# [0., 0., 0., ..., 0., 0., 0.],
# [0., 0., 0., ..., 0., 0., 0.],
# ...,
# [1., 0., 0., ..., 0., 0., 0.],
# [1., 0., 0., ..., 0., 0., 0.],
# [1., 0., 0., ..., 0., 0., 0.]]])
# tensor([11, 3, 8, 7])
# tensor([14, 1, 14, 2])
It gives us a tensor of size (4 x 11 x 59). 4 because we specified a batch size of 4. 11 is the length of the longest name in the given batch (all other names are padded with zeros so that they are of equal length). 59 is the number of characters in our vocabulary.
The next thing is to incorporate this into your training routine and use a packing routine to avoid doing redundant computation on the zeros that you padded your data with :)
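A minimal sketch of that packing step, using random stand-in tensors with the same shapes as the batch above (the dimensions and hidden size are illustrative, not from the tutorial):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

# stand-in batch: 4 names, padded to length 11, one-hot over 59 chars
batch, max_len, vocab_size, hidden = 4, 11, 59, 32
seq_tensor = torch.zeros(batch, max_len, vocab_size)
seq_lengths = torch.tensor([11, 3, 8, 7])  # true lengths before padding

lstm = nn.LSTM(vocab_size, hidden, batch_first=True)

# pack so the LSTM skips the padded positions entirely;
# enforce_sorted=False lets us pass lengths in any order
packed = pack_padded_sequence(seq_tensor, seq_lengths,
                              batch_first=True, enforce_sorted=False)
packed_out, (h_n, c_n) = lstm(packed)

# h_n[-1] holds the hidden state at the last *real* time step of each
# sequence, which is what a name classifier would feed to its output layer
print(h_n[-1].shape)  # torch.Size([4, 32])
```

Because the padded positions are never fed through the LSTM, the final hidden state for the short names (e.g. the length-3 one) reflects their actual last character rather than trailing zeros.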
Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0. If you need to repost, please cite this site's URL or the original source. For any questions, contact: yoyou2525@163.com.