
How to delete words from a self-trained word2vec model

I have a self-trained word2vec model (2 GB, ending in ".model"). I converted the model into a text file (over 50 GB, ending in ".txt") because I have to use the text file in my other Python code. I am trying to reduce the size of the text file by deleting the words that I do not need. I have already built a vocabulary set with all the words I need. How can I filter the unnecessary words out of the model?

I have tried to build a dictionary from the text file, but I ran out of RAM:

emb_dict = dict()
with open(emb_path, "r", encoding="utf-8") as f:
    lines = f.readlines()  # readlines() pulls the entire 50 GB file into memory at once
    for l in lines:
        word, embedding = l.strip().split(' ', 1)
        emb_dict[word] = embedding

I am also wondering whether I can delete words from the ".model" file directly. How can I do that? Any help would be appreciated!

It's hard to answer further without more precise code, but you could process the text file in batches, appending the kept lines to a new file as you go:

lines_to_keep = []
new_file = "some_path.txt"
words_to_keep = set(some_words)

def flush(batch):
    # append the current batch to the output file, one embedding per line
    with open(new_file, "a", encoding="utf-8") as out_f:
        out_f.write("\n".join(batch) + "\n")

with open(emb_path, "r", encoding="utf-8") as f:
    for l in f:  # iterate lazily instead of loading the whole file into RAM
        word, embedding = l.strip().split(' ', 1)
        if word in words_to_keep:
            lines_to_keep.append(l.strip())
        if len(lines_to_keep) >= 1000:
            flush(lines_to_keep)
            lines_to_keep = []
if lines_to_keep:  # write any remaining lines from the final partial batch
    flush(lines_to_keep)

Usually the best way to keep a word2vec model size down is to discard more of the less-frequent words that appeared in the original training corpus.

Words with only a few mentions tend to not get very good word-vectors anyway, and throwing out lots of the few-occurrence words usually has the beneficial side-effect of making the remaining word-vectors better.

If you're using the gensim Word2Vec class, two alternative ways to do this before training (sketched after this list) are:

  • Use a larger min_count value.
  • Specify a max_final_vocab count: no more than that many words will be kept by the model.
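
A minimal sketch of both options (the toy sentences corpus here is purely illustrative; min_count and max_final_vocab are standard gensim Word2Vec parameters):

from gensim.models import Word2Vec

# tiny corpus for illustration; in practice this is your full training corpus
sentences = [["king", "queen", "man"], ["woman", "king", "queen"]]

# Option 1: keep only words that appear at least 2 times
model_a = Word2Vec(sentences, min_count=2)  # keeps 'king' and 'queen' only

# Option 2: cap the surviving vocabulary at 2 words; gensim raises the
# effective min_count until no more than that many words remain
model_b = Word2Vec(sentences, min_count=1, max_final_vocab=2)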

After training, with a set of vectors that were already saved with .save_word2vec_format(), you could re-load them using the limit parameter (to only load the leading, most-frequent words), then re-save. For example:

from gensim.models import KeyedVectors
# load only the 500,000 most-frequent word-vectors from the file
w2v_model = KeyedVectors.load_word2vec_format(allvecs_filename, binary=False, limit=500000)
w2v_model.save_word2vec_format(somevecs_filename, binary=False)

Alternatively, if you had a list_of_words_to_keep, you could load the full file (no limit, assuming you have enough RAM), but then thin out the model's .vocab dictionary before re-saving. For example:

from gensim.models import KeyedVectors
w2v_model = KeyedVectors.load_word2vec_format(allvecs_filename, binary=False)
vocab_set = set(w2v_model.vocab.keys())
keep_set = set(list_of_words_to_keep)
drop_set = vocab_set - keep_set
# words removed from .vocab are skipped when the model is re-saved
for word in drop_set:
    del w2v_model.vocab[word]
w2v_model.save_word2vec_format(somevecs_filename, binary=False)
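
Note that the .vocab attribute is from older gensim (3.x); in gensim 4.x it was replaced by .key_to_index. A rough 4.x equivalent, as a hedged sketch reusing the filenames and keep_set from the snippet above, is to build a fresh, smaller KeyedVectors from the kept words:

from gensim.models import KeyedVectors

w2v_model = KeyedVectors.load_word2vec_format(allvecs_filename, binary=False)
keep = [w for w in w2v_model.key_to_index if w in keep_set]

# build a new KeyedVectors holding only the kept words and their vectors
slim = KeyedVectors(vector_size=w2v_model.vector_size)
slim.add_vectors(keep, [w2v_model[w] for w in keep])
slim.save_word2vec_format(somevecs_filename, binary=False)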
