Building a vocabulary with spaCy

Question

I am using the spaCy tokenizer to tokenize my data and then build a vocab.

This is my code:

import spacy
nlp = spacy.load("en_core_web_sm")

def build_vocab(docs, max_vocab=10000, min_freq=3):
    stoi = {'<PAD>':0, '<UNK>':1}
    itos = {0:'<PAD>', 1:'<UNK>'}
    word_freq = {}
    idx = 2
    for sentence in docs:
        for word in [i.text.lower() for i in nlp(sentence)]:

            if word not in word_freq:
                word_freq[word] = 1
            else:
                word_freq[word] += 1

            if word_freq[word] == min_freq:
                if len(stoi) < max_vocab:
                    stoi[word] = idx
                    itos[idx] = word
                    idx += 1
    return stoi, itos

But because I have more than 800,000 sentences, it takes several hours to finish.

Is there a faster, better way to do this? Thanks.

Update: I tried removing min_freq:

def build_vocab(docs, max_vocab=10000):
  stoi = {'<PAD>':0, '<UNK>':1}
  itos = {0:'<PAD>', 1:'<UNK>'}
  idx = 2
  for sentence in docs:
    for word in [i.text.lower() for i in nlp(sentence)]:
      if word not in stoi:
        if len(stoi) < max_vocab:
          stoi[word] = idx
          itos[idx] = word
          idx += 1
  return stoi, itos

It still takes a long time. Does spaCy have a function that builds the vocab, like .build_vocab in torchtext?

Answer 1

There are a few things you can do to speed this up.

import spacy
from collections import Counter

def build_vocab(texts, max_vocab=10000, min_freq=3):
    nlp = spacy.blank("en") # just the tokenizer
    wc = Counter()
    for doc in nlp.pipe(texts):
        for word in doc:
            wc[word.lower_] += 1

    word2id = {}
    id2word = {}
    for word, count in wc.most_common():
        if count < min_freq: break
        if len(word2id) >= max_vocab: break
        wid = len(word2id)
        word2id[word] = wid
        id2word[wid] = word
    return word2id, id2word

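Explanation:

If you only use the tokenizer, you can use spacy.blank, which skips loading the other pipeline components.
nlp.pipe is fast for large amounts of text (less important, and possibly irrelevant, with a blank model).
Counter is optimized for this kind of counting task.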
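
The function above does not reserve the `<PAD>` and `<UNK>` entries that the question's version had. If those are needed, one possible adaptation is sketched below. It assumes the texts are already tokenized into lists of lowercase strings (for example with `nlp.pipe` as above), so it runs with the standard library alone; `build_vocab_with_specials` and `encode` are hypothetical helper names, not part of spaCy.

```python
from collections import Counter

def build_vocab_with_specials(tokenized_texts, max_vocab=10000, min_freq=3,
                              specials=("<PAD>", "<UNK>")):
    # Count every token across all texts.
    wc = Counter()
    for tokens in tokenized_texts:
        wc.update(tokens)

    # Reserve the lowest ids for the special tokens (an assumption
    # carried over from the question's stoi/itos layout).
    word2id = {tok: i for i, tok in enumerate(specials)}
    for word, count in wc.most_common():
        # most_common() is sorted by count, so we can stop early.
        if count < min_freq or len(word2id) >= max_vocab:
            break
        if word not in word2id:
            word2id[word] = len(word2id)
    id2word = {i: w for w, i in word2id.items()}
    return word2id, id2word

def encode(tokens, word2id, unk="<UNK>"):
    # Map tokens to ids, falling back to the <UNK> id for unseen words.
    unk_id = word2id[unk]
    return [word2id.get(tok, unk_id) for tok in tokens]

texts = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
word2id, id2word = build_vocab_with_specials(texts, max_vocab=10, min_freq=2)
ids = encode(["the", "bird", "sat"], word2id)
```

Here "bird" never appears in the toy corpus, so it is encoded with the `<UNK>` id, while "ran" and "dog" are dropped from the vocab because they occur fewer than `min_freq` times.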

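Another thing: with the way you build the vocab in your initial example, you end up with the first N words that reach min_freq occurrences, not the N most frequent words, which is probably not what you want.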
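
To make the difference concrete, here is a small sketch comparing the two selection rules on a toy, pre-tokenized corpus. `first_n_reaching_min_freq` mirrors the loop in the question and `top_n_by_count` mirrors the `Counter`-based version; both names and the toy data are illustrative assumptions, and the special tokens are left out for brevity.

```python
from collections import Counter

def first_n_reaching_min_freq(tokenized_texts, n, min_freq):
    # Question's rule: admit a word the moment its running count
    # reaches min_freq, until n words have been admitted.
    freq = Counter()
    vocab = []
    for tokens in tokenized_texts:
        for word in tokens:
            freq[word] += 1
            if freq[word] == min_freq and len(vocab) < n:
                vocab.append(word)
    return vocab

def top_n_by_count(tokenized_texts, n, min_freq):
    # Answer's rule: count everything first, then keep the n most
    # frequent words that also meet min_freq.
    freq = Counter()
    for tokens in tokenized_texts:
        freq.update(tokens)
    return [w for w, c in freq.most_common(n) if c >= min_freq]

corpus = [["rare", "rare"], ["common", "common", "common", "common"]]
first = first_n_reaching_min_freq(corpus, n=1, min_freq=2)
top = top_n_by_count(corpus, n=1, min_freq=2)
```

With room for only one word, the first rule keeps "rare" because it reaches two occurrences earlier in the corpus, whereas the second keeps "common", which occurs more often overall.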

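Finally, if you are using spaCy you shouldn't need to build your vocab this way: spaCy has its own built-in Vocab class that handles converting tokens to IDs. I guess you might need this mapping for a downstream task or something similar, but look at the Vocab docs to see whether you can use that instead.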
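
As a rough illustration of what the built-in vocabulary offers, the sketch below looks up the integer IDs spaCy already assigns to token strings. These IDs are 64-bit hashes rather than small consecutive indices, so (as an assumption about typical downstream use, such as an embedding layer) a compact mapping like the one returned by build_vocab above may still be needed.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, no trained pipeline needed
doc = nlp("Hello world")

for token in doc:
    # token.lower_ is the lowercased string;
    # token.lower is the hash ID of that string.
    print(token.lower_, token.lower)

# The StringStore maps between strings and their hash IDs in both directions.
hello_id = nlp.vocab.strings["hello"]
hello_text = nlp.vocab.strings[hello_id]
```

The reverse lookup works here because tokenizing "Hello" has already stored its lowercase form in the string store.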

Solution 1
0 2021-03-30 10:55:44

使用 spacy 構建詞匯

問題描述

1 個解決方案

解決方案1 0 2021-03-30 10:55:44

解決方案1
0 2021-03-30 10:55:44