I was wondering how I can limit Google's pre-trained Word2Vec model to my own vocabulary. Google's Word2Vec link: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
This is what I have:
import gensim
import numpy as np
# Load Google's pre-trained Word2Vec model.
# (gensim 1.0+ moved this loader from Word2Vec to KeyedVectors)
model = gensim.models.KeyedVectors.load_word2vec_format('./model/GoogleNews-vectors-negative300.bin', binary=True)
embedding_matrix = np.zeros((len(my_vocabulary), 300))
where my_vocabulary is a list of the unique words in my corpus. How can I fill the embedding matrix only for the words in my_vocabulary? In addition, if a word does not exist in Google's Word2Vec, I would like its row to be filled with zeros.
Thanks
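You can use gensim.models.Word2Vec to train a custom w2v model on your own corpus:
import gensim
# Each sentence is a list of tokens; min_count=1 keeps words that occur only once.
sentences = [['cats', 'can', 'not', 'fly'], ['dogs', 'cant', 'drive']]
model = gensim.models.Word2Vec(sentences, min_count=1)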
You can fill your embedding matrix using the following code:
import gensim
import numpy as np
# Load Google's pre-trained Word2Vec model.
# (gensim 1.0+ moved this loader from Word2Vec to KeyedVectors)
model = gensim.models.KeyedVectors.load_word2vec_format('path/to/bin', binary=True)
embedding_matrix = np.zeros((len(my_vocabulary), 300))
for index, word in enumerate(my_vocabulary):
    try:
        # update the embedding matrix using Google's pretrained model
        embedding_matrix[index] = model[word]
    except KeyError:
        # the word isn't in the pretrained model, so its row stays at zero
        pass
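To use the matrix later you usually also need to know which row belongs to which word, and it helps to see which words were not found. Below is a minimal sketch of that idea, assuming `vectors` is any object that supports `word in vectors` and `vectors[word]` (gensim's KeyedVectors does). The helper name `build_embedding_matrix` and the 3-dimensional toy vectors are made up for illustration, so that the example runs without downloading the GoogleNews file.

```python
import numpy as np

def build_embedding_matrix(vocabulary, vectors, dim):
    """Return (matrix, word_to_row, missing) for the given vocabulary."""
    matrix = np.zeros((len(vocabulary), dim), dtype=np.float32)
    word_to_row = {}
    missing = []
    for row, word in enumerate(vocabulary):
        word_to_row[word] = row
        if word in vectors:
            # copy the pretrained vector into this word's row
            matrix[row] = vectors[word]
        else:
            # row stays all zeros; remember the word for later inspection
            missing.append(word)
    return matrix, word_to_row, missing

# Toy stand-in for the pretrained model (hypothetical values).
toy_vectors = {
    "cats": np.array([0.1, 0.2, 0.3], dtype=np.float32),
    "fly": np.array([0.4, 0.5, 0.6], dtype=np.float32),
}
my_vocabulary = ["cats", "unicorns", "fly"]

embedding_matrix, word_to_row, missing = build_embedding_matrix(
    my_vocabulary, toy_vectors, dim=3
)
print(embedding_matrix.shape)
print(missing)
```

With the real model you would pass the loaded KeyedVectors object and `dim=300` instead of the toy dictionary.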
Further, you can explore ways to initialize your out-of-vocabulary words to values other than zero.
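One common option is to give out-of-vocabulary words small random vectors. The sketch below is only an illustration under stated assumptions: it treats every all-zero row as out of vocabulary, and the helper name `fill_oov_rows`, the uniform range of ±0.1, and the fixed seed are arbitrary choices, not something prescribed by gensim or by the pretrained model.

```python
import numpy as np

def fill_oov_rows(matrix, scale=0.1, seed=0):
    """Replace all-zero rows with small random vectors.

    Returns a new matrix and the indices of the rows that were replaced.
    """
    rng = np.random.default_rng(seed)
    filled = matrix.copy()
    # rows where every entry is zero are assumed to be out-of-vocabulary words
    oov_rows = np.flatnonzero(~filled.any(axis=1))
    filled[oov_rows] = rng.uniform(
        -scale, scale, size=(len(oov_rows), matrix.shape[1])
    )
    return filled, oov_rows

# Toy matrix: the middle row was left at zero because its word was not found.
toy_matrix = np.array([
    [0.1, 0.2, 0.3],
    [0.0, 0.0, 0.0],
    [0.4, 0.5, 0.6],
])
filled_matrix, oov_rows = fill_oov_rows(toy_matrix)
print(oov_rows)
```

If the embeddings are fine-tuned during training, random rows like these give the model a distinct starting point per unknown word, whereas zero rows make all unknown words identical.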