
Gensim word2vec augment or merge pre-trained vectors

I am loading pre-trained vectors from a binary file generated from the word2vec C code with something like:

model_1 = Word2Vec.load_word2vec_format('vectors.bin', binary=True)

I am using those vectors to generate vector representations of sentences that contain words that may not have already existing vectors in vectors.bin. For example, if vectors.bin has no associated vector for the word "yogurt", and I try

yogurt_vector = model_1['yogurt']

I get KeyError: 'yogurt', which makes good sense. What I want is to be able to take the sentence words that do not have corresponding vectors and add representations for them to model_1. I am aware from this post that you cannot continue to train the C vectors. Is there then a way to train a new model, say model_2, for the words with no vectors, and merge model_2 with model_1?

Alternatively, is there a way to test whether the model contains a word before I actually try to retrieve it, so that I can at least avoid the KeyError?

Avoiding the KeyError is easy:

[x for x in 'this model has everything'.split() if x in model_1.vocab]
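Beyond a bare membership test, the same in-vocabulary filter can feed a sentence representation, e.g. averaging the vectors of the known words and skipping the rest. A minimal sketch using a plain dict of numpy arrays as a stand-in for the loaded model (the words and toy vectors here are made up for illustration; a real gensim model supports `word in model_1.vocab` the same way):

```python
import numpy as np

# Toy stand-in for a loaded model: word -> vector.
model_1 = {
    "this": np.array([1.0, 0.0]),
    "model": np.array([0.0, 1.0]),
    "everything": np.array([1.0, 1.0]),
}

def sentence_vector(sentence, vectors):
    """Average the vectors of in-vocabulary words, skipping OOV words."""
    known = [vectors[w] for w in sentence.split() if w in vectors]
    if not known:
        return None  # no word in the sentence has a vector
    return np.mean(known, axis=0)

# "has" is out of vocabulary and is silently skipped.
vec = sentence_vector("this model has everything", vectors=model_1)
```

Here `vec` is the mean of the three known vectors, i.e. approximately `[0.667, 0.667]`.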

The more difficult problem is merging a new word into an existing model. The problem is that word2vec estimates the likelihood of two words appearing next to each other; if the word 'yogurt' wasn't in the first corpus the model was trained on, it isn't next to any of those words, so the second model would not correlate with the first.

You can look at the internals of how a model is saved (it uses numpy.save), and I would be interested in working with you to come up with code that allows adding vocabulary.

This is a great question, and unfortunately there is no way to add to the vocabulary without changing the internals of the code. Check out this discussion: https://groups.google.com/forum/#!searchin/word2vec-toolkit/online$20word2vec/word2vec-toolkit/L9zoczopPUQ/_Zmy57TzxUQJ

My advice is to ignore words that are not in the vocabulary, and only use the ones that are. If you are using Python, you can do this by:

present = []
absent = []
for word in wordlist:
    if word in model.vocab:
        present.append(word)
    else:
        # these are all the words that are absent from your model;
        # might be useful for debugging. Ignore if you don't need this info.
        absent.append(word)

<Do whatever you want with the words in the list 'present'>    

A possible alternative for handling absent/missing words is suggested by Yoon Kim in "Convolutional Neural Networks for Sentence Classification".

Its code: https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py#L88

import numpy as np

def add_unknown_words(word_vecs, vocab, min_df=1, k=300):
    """
    For words that occur in at least min_df documents, create a separate word vector.
    0.25 is chosen so the unknown vectors have (approximately) the same variance
    as the pre-trained ones.
    """
    for word in vocab:
        if word not in word_vecs and vocab[word] >= min_df:
            word_vecs[word] = np.random.uniform(-0.25, 0.25, k)
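A toy usage sketch of that function, with a small dimensionality and made-up document frequencies (all the words and counts below are illustrative, not from the original post):

```python
import numpy as np

np.random.seed(0)
k = 4

# Pre-trained vectors for two words; "yogurt" is missing.
word_vecs = {"milk": np.ones(k), "bread": np.zeros(k)}
# Document frequencies for the corpus vocabulary.
vocab = {"milk": 3, "bread": 2, "yogurt": 5, "rare_typo": 1}

def add_unknown_words(word_vecs, vocab, min_df=1, k=300):
    """Give each frequent-enough unknown word a small random vector."""
    for word in vocab:
        if word not in word_vecs and vocab[word] >= min_df:
            word_vecs[word] = np.random.uniform(-0.25, 0.25, k)

# min_df=2 filters out words seen in fewer than 2 documents.
add_unknown_words(word_vecs, vocab, min_df=2, k=k)
```

After the call, "yogurt" gets a random vector in [-0.25, 0.25), while "rare_typo" (document frequency 1, below min_df) is still absent.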

But this only works because you use the model purely to look up the corresponding vectors; functionality like similarity queries is lost for these random vectors.

You can continue adding new words/sentences to a model's vocabulary and train the augmented model with gensim's online training algorithm (https://rutumulkar.com/blog/2015/word2vec/):

https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html

model = gensim.models.Word2Vec.load(temporary_filepath)
more_sentences = [
    ['Advanced', 'users', 'can', 'load', 'a', 'model',
     'and', 'continue', 'training', 'it', 'with', 'more', 'sentences'],
]
model.build_vocab(more_sentences, update=True)
model.train(more_sentences, total_examples=model.corpus_count, epochs=model.epochs)
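When continuing training isn't an option (e.g. for vectors exported from the C tool), the closest thing to "merging model_2 with model_1" is combining the raw lookup tables, keeping model_1's vector wherever both models know a word — with the caveat from above that the two vector spaces are not aligned, so similarities across the two halves are meaningless. A minimal sketch with plain numpy arrays (all names and values are made up):

```python
import numpy as np

dim = 3
# Hypothetical lookup tables extracted from the two models.
model_1_vecs = {"milk": np.zeros(dim), "bread": np.ones(dim)}
model_2_vecs = {"milk": np.full(dim, 9.0), "yogurt": np.full(dim, 2.0)}

# Start from model_2's entries, then overwrite any overlap with model_1's
# vectors, since the pre-trained space is the one we want to preserve.
merged = dict(model_2_vecs)
merged.update(model_1_vecs)
```

The merged table covers both vocabularies, with "milk" resolved in favour of model_1.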

