I have a self-trained word2vec model (2 GB, ending with ".model"). I converted the model into a text file (over 50 GB, ending with ".txt") because I have to use the text file in my other Python code. I am trying to reduce the size of the text file by deleting the words I do not need. I have built a vocabulary set with all the words I need. How can I filter the unnecessary words out of the model?
I have tried to build a dictionary from the text file, but I ran out of RAM:
emb_dict = dict()
with open(emb_path, "r", encoding="utf-8") as f:
    lines = f.readlines()  # reads the entire 50 GB file into memory at once
    for l in lines:
        word, embedding = l.strip().split(' ', 1)
        emb_dict[word] = embedding
I am wondering if I can delete words from the ".model" file instead. How can I do that? Any help would be appreciated!
It's hard to answer further without more precise code, but you could stream the text file line by line and write the kept lines out in batches:
lines_to_keep = []
new_file = "some_path.txt"
words_to_keep = set(some_words)

with open(emb_path, "r", encoding="utf-8") as f:
    for l in f:  # iterate line by line instead of loading the whole file
        word, embedding = l.strip().split(' ', 1)
        if word in words_to_keep:
            lines_to_keep.append(l.strip())
        # flush to disk every 1000 kept lines so the list stays small
        if lines_to_keep and len(lines_to_keep) % 1000 == 0:
            with open(new_file, "a", encoding="utf-8") as out:
                out.write("\n".join(lines_to_keep) + "\n")
            lines_to_keep = []

# don't forget the final partial batch after the loop ends
if lines_to_keep:
    with open(new_file, "a", encoding="utf-8") as out:
        out.write("\n".join(lines_to_keep) + "\n")
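One detail worth watching, assuming your ".txt" file is in the word2vec text format produced by .save_word2vec_format(): its first line is a header of the form "vocab_size vector_dim", and the filtered file needs a corrected header if you ever want to load it with gensim again. A sketch under that assumption (the output name "filtered.txt" and the words_to_keep set are placeholders), which skips the original header, streams the rest, and patches in the new counts at the end:

kept = 0
with open(emb_path, "r", encoding="utf-8") as src, \
        open("filtered.txt", "w", encoding="utf-8") as out:
    header = src.readline()       # e.g. "1000000 300"
    dim = header.strip().split()[1]
    out.write(" " * 30 + "\n")    # placeholder header, patched below
    for line in src:
        word = line.split(' ', 1)[0]
        if word in words_to_keep:
            out.write(line)       # line still ends with "\n"
            kept += 1

# overwrite the placeholder with the real counts; the leftover trailing
# spaces are harmless because gensim splits the header on whitespace
with open("filtered.txt", "r+", encoding="utf-8") as out:
    out.seek(0)
    out.write(f"{kept} {dim}")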
Usually the best way to keep a word2vec model size down is to discard more of the less-frequent words that appeared in the original training corpus.
Words with only a few mentions tend to not get very good word-vectors anyway, and throwing out lots of the few-occurrence words usually has the beneficial side-effect of making the remaining word-vectors better.
If you're using the gensim Word2Vec class, two alternate ways to do this, pre-training, are (both sketched below):

- use a higher min_count value, so that rarer words are discarded during the initial vocabulary survey
- use a max_final_vocab count: no more than exactly that count of words will be kept by the model
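For concreteness, a minimal sketch of those two options; corpus is assumed to be an iterable of token lists and "vectors.txt" is a placeholder path, with parameter names from the gensim 3.x API that the rest of this answer also uses:

from gensim.models import Word2Vec

model = Word2Vec(
    sentences=corpus,
    size=300,                # embedding dimensionality (vector_size in gensim 4+)
    min_count=10,            # discard words with fewer than 10 occurrences
    max_final_vocab=500000,  # or: cap the surviving vocabulary at this size
)
model.wv.save_word2vec_format("vectors.txt", binary=False)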
After training, with a set of vectors that were already saved with .save_word2vec_format(), you could re-load them using the limit parameter (to only load the leading, most-frequent words), then re-save. For example:
from gensim.models import KeyedVectors

# gensim-saved word2vec files list words in descending frequency order,
# so limit=500000 keeps the 500,000 most-frequent words
w2v_model = KeyedVectors.load_word2vec_format(allvecs_filename, binary=False, limit=500000)
w2v_model.save_word2vec_format(somevecs_filename, binary=False)
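Note that limit also keeps the re-load itself cheap: only the first 500,000 rows of the file are read into RAM, so this works even when the full file would not fit.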
Alternatively, if you had a list_of_words_to_keep, you could load the full file (no limit, assuming you have enough RAM), but then thin out the model's .vocab dictionary before re-saving. For example:
from gensim.models import KeyedVectors

w2v_model = KeyedVectors.load_word2vec_format(allvecs_filename, binary=False)

# .vocab is the gensim 3.x attribute; gensim 4+ replaced it with key_to_index
vocab_set = set(w2v_model.vocab.keys())
keep_set = set(list_of_words_to_keep)
drop_set = vocab_set - keep_set  # everything not on the keep-list
for word in drop_set:
    del w2v_model.vocab[word]  # only words still in .vocab get re-saved
w2v_model.save_word2vec_format(somevecs_filename, binary=False)
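Once re-saved, the smaller file loads the same way in your other code; "some_word" below is just a placeholder:

from gensim.models import KeyedVectors

small_model = KeyedVectors.load_word2vec_format(somevecs_filename, binary=False)
vector = small_model["some_word"]  # look up one embedding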