
Build vocab using spaCy

I'm using the spaCy tokenizer to tokenize my data and then build a vocab.

This is my code:

import spacy
nlp = spacy.load("en_core_web_sm")

def build_vocab(docs, max_vocab=10000, min_freq=3):
    stoi = {'<PAD>': 0, '<UNK>': 1}
    itos = {0: '<PAD>', 1: '<UNK>'}
    word_freq = {}
    idx = 2
    for sentence in docs:
        for word in [i.text.lower() for i in nlp(sentence)]:
            if word not in word_freq:
                word_freq[word] = 1
            else:
                word_freq[word] += 1
            # add the word the moment its count reaches min_freq
            if word_freq[word] == min_freq:
                if len(stoi) < max_vocab:
                    stoi[word] = idx
                    itos[idx] = word
                    idx += 1
    return stoi, itos

But it takes hours to complete, since I have more than 800,000 sentences.

Is there a faster and better way to achieve this? Thanks.

Update: I tried removing min_freq:

def build_vocab(docs, max_vocab=10000):
    stoi = {'<PAD>': 0, '<UNK>': 1}
    itos = {0: '<PAD>', 1: '<UNK>'}
    idx = 2
    for sentence in docs:
        for word in [i.text.lower() for i in nlp(sentence)]:
            if word not in stoi:
                if len(stoi) < max_vocab:
                    stoi[word] = idx
                    itos[idx] = word
                    idx += 1
    return stoi, itos

It still takes a long time. Does spaCy have a function to build a vocab, like torchtext's .build_vocab()?

There are a couple of things you can do to make this faster.

import spacy
from collections import Counter

def build_vocab(texts, max_vocab=10000, min_freq=3):
    nlp = spacy.blank("en")  # just the tokenizer, no tagger/parser/NER
    wc = Counter()
    for doc in nlp.pipe(texts):
        for word in doc:
            wc[word.lower_] += 1

    # most_common() yields words in descending frequency, so we can stop
    # as soon as a word falls below min_freq or the vocab is full
    word2id = {}
    id2word = {}
    for word, count in wc.most_common():
        if count < min_freq:
            break
        if len(word2id) >= max_vocab:
            break
        wid = len(word2id)
        word2id[word] = wid
        id2word[wid] = word
    return word2id, id2word
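
For example, with some toy sentences (just for illustration):

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "the dog barks at the fox.",
]
word2id, id2word = build_vocab(texts, max_vocab=100, min_freq=2)
print(word2id)  # 'the' is most frequent, so it gets ID 0; ties break by first appearance

Note that this version doesn't reserve <PAD>/<UNK> entries like your original; if you need them, seed word2id and id2word with those before the loop.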

Explanation:

  1. If you only need the tokenizer, use spacy.blank instead of loading a full pipeline with a tagger, parser, and NER you never run.
  2. nlp.pipe is fast for large amounts of text (less important here, and maybe irrelevant with a blank model; see the sketch below).
  3. Counter is optimized for exactly this kind of counting task.
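
A minimal sketch combining points 1-3; texts is assumed to be your list of sentences, and the batch_size/n_process values are illustrative, not tuned:

import spacy
from collections import Counter

nlp = spacy.blank("en")                   # tokenizer only, nothing else to run
wc = Counter()
for doc in nlp.pipe(texts, batch_size=1000, n_process=2):
    wc.update(tok.lower_ for tok in doc)  # update() counts the whole iterable at once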

Another issue: with the way you build the vocab in your initial example, you take the first N words that reach min_freq occurrences, not the N most frequent words, which is probably not what you want.
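
A toy illustration of the difference, using a hypothetical token stream:

from collections import Counter

# hypothetical stream: 'rare' reaches min_freq=2 before 'the' does,
# even though 'the' is far more frequent overall
tokens = ["rare", "rare", "the", "the", "the", "the"]

# first-N behaviour (like the original loop): 'rare' claims the single slot
first_n, freq = [], Counter()
for tok in tokens:
    freq[tok] += 1
    if freq[tok] == 2 and len(first_n) < 1:
        first_n.append(tok)
print(first_n)                                         # ['rare']

# top-N behaviour: rank by total frequency instead
print([w for w, _ in Counter(tokens).most_common(1)])  # ['the']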

Also, if you're using spaCy you usually shouldn't build a vocab this way - spaCy has its own built-in Vocab class, backed by a StringStore, that handles converting tokens to IDs. You might still need your own mapping for a downstream task, but look at the vocab docs to see whether you can use that instead.
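
A minimal sketch of the StringStore; note the IDs are 64-bit hashes, not contiguous indices:

import spacy

nlp = spacy.blank("en")
doc = nlp("spaCy hashes every string it sees")
for tok in doc:
    hid = nlp.vocab.strings[tok.text]          # string -> hash ID (same as tok.orth)
    assert nlp.vocab.strings[hid] == tok.text  # hash ID -> string
    print(tok.text, hid)

Because these IDs are hashes rather than small dense integers, you would still need a mapping like word2id above if you want to index into an embedding matrix.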
