
Gensim word2vec augment or merge pre-trained vectors

I am loading pre-trained vectors from a binary file generated from the word2vec C code with something like:

model_1 = Word2Vec.load_word2vec_format('vectors.bin', binary=True)

I am using those vectors to generate vector representations of sentences that may contain words without existing vectors in vectors.bin. For example, if vectors.bin has no associated vector for the word "yogurt", and I try

yogurt_vector = model_1['yogurt']

I get KeyError: 'yogurt', which makes sense. What I want is to be able to take the sentence words that do not have corresponding vectors and add representations for them to model_1. I am aware from this post that you cannot continue to train the C vectors. Is there then a way to train a new model, say model_2, for the words without vectors and merge model_2 with model_1?

Alternatively, is there a way to test if the model contains a word before I actually try to retrieve it, so that I can at least avoid the KeyError?

Avoiding the key error is easy:

[x for x in 'this model has everything'.split() if x in model_1.vocab]
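The same filtering works without a trained model at hand; a minimal sketch, using a plain dict of toy vectors as a stand-in for the model's vocabulary lookup:

```python
import numpy as np

# Toy stand-in for a loaded model's vector table; a real Word2Vec
# model would supply these vectors itself.
vectors = {
    "this": np.zeros(3),
    "model": np.ones(3),
    "everything": np.full(3, 2.0),
}

def known_words(sentence, table):
    """Return only the words that have a vector in the table."""
    return [w for w in sentence.split() if w in table]

# 'has' is missing from the table, so it is filtered out
print(known_words("this model has everything", vectors))
```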

The more difficult problem is merging new words into an existing model. The problem is that word2vec learns vectors from the likelihood of two words appearing next to each other; if the word 'yogurt' wasn't in the corpus the first model was trained on, it was never seen next to any of those words, so a second model trained separately would not share a coordinate space with the first.
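If you merge the two vector tables anyway, that caveat still applies: the coordinates of the two models are not aligned, so similarities across the two sources are not meaningful. A minimal sketch using hypothetical dicts standing in for the two models' word-to-vector tables, keeping model_1's vector on conflicts:

```python
import numpy as np

# Hypothetical vector tables standing in for model_1 and model_2;
# both map word -> numpy array of the same dimensionality.
model_1 = {"milk": np.array([1.0, 0.0]), "bread": np.array([0.0, 1.0])}
model_2 = {"yogurt": np.array([0.5, 0.5]), "milk": np.array([9.0, 9.0])}

# Later entries win in a dict merge, so listing model_1 second
# keeps its vector for any word both models contain.
merged = {**model_2, **model_1}
```

Note that 'milk' keeps model_1's vector, while 'yogurt' is filled in from model_2; the two subspaces remain incompatible for similarity queries.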

You can look at the internals of how a model is saved (it uses numpy.save), and I would be interested in working with you to come up with code that allows adding vocabulary.

This is a great question, and unfortunately there is no way to add to the vocabulary without changing the internals of the code. Check out this discussion: https://groups.google.com/forum/#!searchin/word2vec-toolkit/online%20word2vec/word2vec-toolkit/L9zoczopPUQ/_Zmy57TzxUQJ

My advice is to ignore words that are not in the vocabulary, and only use the ones that are. If you are using Python, you can do this as follows:

present, absent = [], []
for word in wordlist:
    if word in model.vocab:
        present.append(word)
    else:
        # words that are absent from your model;
        # might be useful for debugging -- ignore if you don't need this info
        absent.append(word)

# do whatever you want with the words in the list 'present'

A possible alternative for handling absent/missing words is suggested by Yoon Kim in "Convolutional Neural Networks for Sentence Classification".

The code: https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py#L88

import numpy as np

def add_unknown_words(word_vecs, vocab, min_df=1, k=300):
    """
    For words that occur in at least min_df documents, create a separate word vector.
    0.25 is chosen so the unknown vectors have (approximately) the same variance as the pre-trained ones.
    """
    for word in vocab:
        if word not in word_vecs and vocab[word] >= min_df:
            word_vecs[word] = np.random.uniform(-0.25, 0.25, k)
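Here is a self-contained run of that function (repeated so this block executes on its own, with the vector size k reduced to 5 and toy inputs for brevity):

```python
import numpy as np

def add_unknown_words(word_vecs, vocab, min_df=1, k=300):
    """Add a random vector for each sufficiently frequent unknown word;
    0.25 keeps the variance of the new vectors roughly in line with
    typical pre-trained word2vec vectors."""
    for word in vocab:
        if word not in word_vecs and vocab[word] >= min_df:
            word_vecs[word] = np.random.uniform(-0.25, 0.25, k)

word_vecs = {"milk": np.zeros(5)}            # toy pre-trained vectors
vocab = {"milk": 3, "yogurt": 2, "rare": 0}  # word -> document frequency
add_unknown_words(word_vecs, vocab, min_df=1, k=5)
# 'yogurt' gets a random vector; 'rare' falls below min_df and is skipped
```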

But this only works because you are using the model as a lookup table for vectors. Functionality such as similarity queries is meaningless for these randomly initialized words.
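A common downstream use of such a lookup table is building a sentence vector by averaging the vectors of the words that are present and skipping the rest; a minimal sketch, again with a plain dict standing in for the model:

```python
import numpy as np

# Toy vector table standing in for a trained model.
vectors = {"i": np.array([1.0, 0.0]), "like": np.array([0.0, 1.0])}

def sentence_vector(sentence, table, dim=2):
    """Average the vectors of the known words; zero vector if none match."""
    vecs = [table[w] for w in sentence.split() if w in table]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# 'yogurt' has no vector, so only 'i' and 'like' are averaged
print(sentence_vector("i like yogurt", vectors))
```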

You can continue adding new words/sentences to a model's vocabulary and train the augmented model with gensim's online training algorithm (https://rutumulkar.com/blog/2015/word2vec/):

https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html

model = gensim.models.Word2Vec.load(temporary_filepath)
more_sentences = [
    ['Advanced', 'users', 'can', 'load', 'a', 'model',
     'and', 'continue', 'training', 'it', 'with', 'more', 'sentences'],
]
model.build_vocab(more_sentences, update=True)
model.train(more_sentences, total_examples=model.corpus_count, epochs=model.epochs)
