
Is there an alternative to fully loading pre-trained word embeddings in memory?

I want to use pre-trained word embeddings in my machine learning model. The word embeddings file I have is about 4GB. I currently read the entire file into memory as a dictionary, and whenever I want to map a word to its vector representation I look it up in that dictionary.
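For reference, this is roughly what I do now (a minimal sketch assuming a GloVe-style text file with one word per line followed by its floats; the file name is just a placeholder):

```python
import numpy as np

# Load every vector into a dict up front (this is the memory-heavy part).
embeddings = {}
with open('embeddings.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

# Later lookups are plain dict accesses.
vector = embeddings.get('computer')
```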

The memory usage is very high, and I would like to know if there is another way of using word embeddings without loading all of the data into memory.

I have recently come across generators in Python. Could they help me reduce the memory usage?

Thank you!

What task do you have in mind? If this is a similarity-based task, you could simply use the load_word2vec_format method in gensim, which lets you pass a limit on the number of vectors loaded. The vectors in a set like GoogleNews are ordered by frequency, so this gives you the most important vectors. It also makes sense theoretically: words with low frequency usually have relatively poor representations anyway.
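For example, a minimal sketch assuming the GoogleNews binary file and gensim's KeyedVectors (the file name and the limit value are placeholders you would adjust):

```python
from gensim.models import KeyedVectors

# Load only the first 500k vectors. Because the file is sorted by word
# frequency, these are the most common words.
vectors = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin',
    binary=True,
    limit=500_000,
)

# Look up a single word's vector (raises KeyError if the word wasn't loaded).
vec = vectors['computer']

# Similarity queries work on the truncated vocabulary as well.
print(vectors.most_similar('computer', topn=5))
```

With limit set, gensim stops reading after that many rows, so memory usage scales with the limit you choose rather than with the full 4GB file.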
