Adjust Google's Word2Vec loaded with Gensim to your vocabulary and then create the embedding vector

Question

I was wondering how I can limit Google's Word2Vec to my vocabulary. Google's Word2Vec link: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing

This is what I have:

import gensim
import numpy as np

# Load Google's pre-trained Word2Vec vectors.
model = gensim.models.KeyedVectors.load_word2vec_format('./model/GoogleNews-vectors-negative300.bin', binary=True)

embedding_matrix = np.zeros((len(my_vocabulary), 300))

where my_vocabulary is a list of the unique words in my corpus. How can I fill the embedding matrix only for the words in my_vocabulary? In addition, I would like any word that does not exist in Google's Word2Vec to be filled with zeros.

Thanks

Answer 1

You can use gensim.models.Word2Vec to build your custom w2v model.

import gensim
sentences = [['cats', 'can', 'not', 'fly'], ['dogs', 'cant', 'drive']]
# min_count=1 keeps words that occur only once in the corpus.
model = gensim.models.Word2Vec(sentences, min_count=1)

Reference: https://rare-technologies.com/word2vec-tutorial/

Answer 2

You can fill your embedding matrix using the following code:

import gensim
import numpy as np

# Load Google's pre-trained Word2Vec vectors.
model = gensim.models.KeyedVectors.load_word2vec_format('path/to/bin', binary=True)

embedding_matrix = np.zeros((len(my_vocabulary), 300))

for index, word in enumerate(my_vocabulary):
    try:
        # Copy the pretrained vector for this word into its row.
        embedding_matrix[index] = model[word]
    except KeyError:
        # Word is not in the pretrained model: leave its row as zeros.
        pass

Further, you can explore ways to initialize your out-of-vocabulary words to values other than zero.
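One possible way to do that, as a rough sketch rather than a recommended recipe: draw the rows for missing words from a normal distribution whose mean and standard deviation match the vectors that were found. The helper name build_embedding_matrix, the toy_vectors dict, and the 3-dimensional vectors are made up for illustration; the dict stands in for the loaded model and only needs to support `word in vectors` and `vectors[word]`.

```python
import numpy as np


def build_embedding_matrix(vocabulary, vectors, dim, seed=0):
    """Return (matrix, missing_words) for the given vocabulary.

    `vectors` is a hypothetical stand-in for the pretrained model: anything
    that supports `word in vectors` and `vectors[word]`, such as a dict.
    """
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(vocabulary), dim))

    # Split row indices into words that have a pretrained vector and words that do not.
    known = [i for i, w in enumerate(vocabulary) if w in vectors]
    missing = [i for i, w in enumerate(vocabulary) if w not in vectors]

    # Copy the pretrained vectors into their rows.
    for i in known:
        matrix[i] = vectors[vocabulary[i]]

    # Fill missing rows with random values that match the scale of the known rows.
    # If nothing was found, the missing rows simply stay at zero.
    if known and missing:
        mean = matrix[known].mean()
        std = matrix[known].std()
        matrix[missing] = rng.normal(mean, std, size=(len(missing), dim))

    return matrix, [vocabulary[i] for i in missing]


# Toy stand-in for the pretrained vectors (3 dimensions instead of 300).
toy_vectors = {
    'cats': np.array([0.1, 0.2, 0.3]),
    'dogs': np.array([0.2, 0.1, 0.4]),
}
my_vocabulary = ['cats', 'dogs', 'zyxgh']
embedding_matrix, missing_words = build_embedding_matrix(my_vocabulary, toy_vectors, dim=3)
```

Returning the list of missing words also lets you check how much of your vocabulary the pretrained model actually covers before deciding whether zeros or random values are acceptable.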
