
Adjust Google's Word2Vec loaded with Gensim to your vocabulary and then create the embedding matrix

I was wondering how I can limit Google's Word2Vec to my vocabulary. Google's Word2Vec link: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing

This is what I have:

import gensim
import numpy as np

# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('./model/GoogleNews-vectors-negative300.bin', binary=True)

embedding_matrix = np.zeros((len(my_vocabulary), 300))

where my_vocabulary is a list of the unique words in my corpus. How can I fill the embedding matrix only for words in my_vocabulary? In addition, I would like the flexibility that if a word does not exist in Google's Word2Vec, its row is filled with zeros.

Thanks

You can use gensim.models.Word2Vec to build your custom w2v model.

sentences = [['cats', 'can', 'not', 'fly'], ['dogs', 'cant', 'drive']]
model = gensim.models.Word2Vec(sentences, min_count=1)

Reference: https://rare-technologies.com/word2vec-tutorial/

You can fill your embedding matrix using the following code:

import gensim
import numpy as np

# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('path/to/bin', binary=True)

embedding_matrix = np.zeros((len(my_vocabulary), 300))

for index, word in enumerate(my_vocabulary):
    try:
        # update the embedding matrix with Google's pre-trained vector
        embedding_matrix[index] = model[word]
    except KeyError:
        # word not found in the pre-trained model: leave that row as zeros
        pass

Further, you can explore ways to initialize your out-of-vocabulary words to values other than zero.
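One common choice is to draw small random vectors for out-of-vocabulary words instead of zeros, so that unseen words start with distinct, trainable embeddings. A minimal sketch of that idea, using a small dict as a stand-in for the pre-trained model (the words, the `(-0.25, 0.25)` range, and the dict itself are illustrative assumptions):

```python
import numpy as np

dim = 300
rng = np.random.default_rng(42)

# Stand-in for the loaded pre-trained model: word -> vector.
# In practice, this lookup would go through the gensim model instead.
pretrained = {'cats': np.ones(dim), 'dogs': np.full(dim, 2.0)}

my_vocabulary = ['cats', 'dogs', 'unicorns']
embedding_matrix = np.zeros((len(my_vocabulary), dim))

for index, word in enumerate(my_vocabulary):
    if word in pretrained:
        # copy the pre-trained vector for known words
        embedding_matrix[index] = pretrained[word]
    else:
        # sample OOV rows from a small uniform range instead of leaving zeros
        embedding_matrix[index] = rng.uniform(-0.25, 0.25, dim)
```

With zero rows, every OOV word shares the same embedding; random initialization gives each one its own starting point that downstream training can adjust.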
