Unable to build a vocab for torchtext text classification

Question

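I'm trying to prepare a custom dataset loaded from a csv file for use in a torchtext binary text classification problem. It's a basic dataset with news headlines and a market sentiment label assigned as "positive" or "negative". I've been following some online tutorials on PyTorch to get this far, but they've made some significant changes in the latest torchtext package, so most of the material is out of date.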

Below I've successfully parsed my csv file into a pandas dataframe with two columns - text headline and a label which is either 0 or 1 for positive/negative, split into a training and test dataset then wrapped them as a PyTorch dataset class:

from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset

# eurusd_df is the pandas dataframe with 'Text' and 'Labels' columns
train, test = train_test_split(eurusd_df, test_size=0.2)

class CustomTextDataset(Dataset):
    def __init__(self, text, labels):
        self.text = text
        self.labels = labels

    def __getitem__(self, idx):
        label = self.labels.iloc[idx]
        text = self.text.iloc[idx]
        sample = {"Label": label, "Text": text}
        return sample

    def __len__(self):
        return len(self.labels)

train_dataset = CustomTextDataset(train['Text'], train['Labels'])
test_dataset = CustomTextDataset(test['Text'], test['Labels'])

I'm now trying to build a vocabulary of tokens following this tutorial https://coderzcolumn.com/tutorials/artificial-intelligence/pytorch-simple-guide-to-text-classification and the official PyTorch tutorial https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html.

However, using the code below

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer('basic_english')
train_iter = train_dataset

def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)
        
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

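produces a very small vocabulary, and applying the example vocab(['here', 'is', 'an', 'example']) to a text field taken from the original dataframe yields a list of 0s. This implies the vocab is being built from the label field, which contains only 0s and 1s, rather than from the text field. Could anyone review this and show me how to build the vocab from the text field?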

Answer 1

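The very small vocabulary arises because, under the hood, build_vocab_from_iterator uses a Counter from the collections standard library, and more specifically its update function. This function is used in a way that assumes that what you pass to build_vocab_from_iterator is an iterable wrapping an iterable that contains words/tokens.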

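This means that in its current state, because strings can be iterated over, your code will create a vocab able to encode all the letters, not the words, that make up your dataset, hence the very small vocabulary.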
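The Counter.update behaviour described above can be checked with the standard library alone. This is a minimal sketch: the sample headline and the variable names are illustrative assumptions, not taken from the original post, and str.split stands in for a real tokenizer.

```python
from collections import Counter

headline = "euro rises"  # illustrative sample text (assumption)

# Updating with a bare string iterates over its characters.
char_counts = Counter()
char_counts.update(headline)

# Updating with a list of tokens counts whole words.
word_counts = Counter()
word_counts.update(headline.split())

print(sorted(char_counts))  # single characters, including the space
print(sorted(word_counts))  # whole words
```

A counter fed with strings ends up with at most a few dozen distinct keys, which would match the "very small vocabulary" symptom.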

I don't know whether this is intended by the Python/PyTorch developers, but because of it you need to wrap your simple iterator in a list, for example:

vocab = build_vocab_from_iterator([yield_tokens(train_iter)], specials=["<unk>"])

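Note: if your vocab gives only zeros, it is not because it was built from the label field. It is simply returning the integer that corresponds to the unknown token, because every word longer than a single character is unknown to it.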
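The note above can be illustrated without torchtext. The sketch below uses a plain dict as a stand-in for the vocab object; `stoi`, `default_index`, and `lookup` are hypothetical names chosen for illustration, and the real torchtext Vocab class is more involved.

```python
# Stand-in for a vocab that only knows a few single characters (hypothetical data).
stoi = {"<unk>": 0, "e": 1, "r": 2, "s": 3}
default_index = stoi["<unk>"]

def lookup(tokens):
    """Map each token to its index, falling back to the <unk> index."""
    return [stoi.get(token, default_index) for token in tokens]

indices = lookup(["here", "is", "an", "example"])
print(indices)  # every whole word is unknown, so each maps to the <unk> index
```

Under this assumption, a list of zeros says only that none of the tokens were found, not which column the vocab was built from.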

Hope this helps!

Answer 2

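So it turns out the problem was with the get item function (__getitem__) in my CustomTextDataset class, which was returning a dict. That first caused the problem with building the vocab, and then, when the iterator was passed inside a list, caused a TypeError. Thanks to Callim Ethée for the answer, as it really did point me in the right direction!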
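The answer does not show the final code, so the following is only a sketch of how the dict return value could produce this symptom and one way to read the text field instead. `ToyDataset` is a hypothetical stand-in for CustomTextDataset backed by plain lists, and str.split stands in for the basic_english tokenizer.

```python
class ToyDataset:
    """Hypothetical stand-in for CustomTextDataset, backed by plain lists."""

    def __init__(self, text, labels):
        self.text = text
        self.labels = labels

    def __getitem__(self, idx):
        # Indexing past the end raises IndexError, which ends a for loop.
        return {"Label": self.labels[idx], "Text": self.text[idx]}

    def __len__(self):
        return len(self.labels)

dataset = ToyDataset(["Euro rises", "Dollar falls"], [1, 0])

# Unpacking a two-key dict yields its keys, not its values.
unpacked = [text for _, text in dataset]
print(unpacked)

def yield_tokens(data_iter):
    # Read the text by key; str.split is an assumed stand-in tokenizer.
    for sample in data_iter:
        yield sample["Text"].lower().split()

tokens = list(yield_tokens(dataset))
print(tokens)
```

With the original `for _, text in data_iter` loop, `text` would be the key string "Text" on every iteration, so the tokenizer never sees a headline. Returning a (label, text) tuple from __getitem__ would be another way to make that loop behave as intended.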

Answer 1
1 accepted 2022-07-30 18:39:38

Answer 2
1 2022-07-31 16:50:11
