
Add word embedding to word2vec gensim model

I'm looking for a way to dynamically add pre-trained word vectors to a word2vec gensim model.

I have a pre-trained word2vec model in a txt file (words and their embeddings), and I need to get the Word Mover's Distance (for example via gensim.models.Word2Vec.wmdistance) between documents in a specific corpus and a new document.

To avoid loading the whole vocabulary, I want to load only the subset of the pre-trained model's words that appear in the corpus. However, if the new document contains words that are not in the corpus but are in the original model's vocabulary, they should be added to the model so that they are considered in the computation.

What I want is to save RAM, so any of the following would help:

  • Is there a way to add the word vectors directly to the model?
  • Is there a way to load into gensim from a matrix or another object? I could keep that object in RAM and append the new words to it before loading them into the model.
  • I don't need it to be gensim, so a different WMD implementation that takes the vectors as input would also work (though I do need it in Python).

Thanks in advance.

METHOD 1:

You can just use the keyed vectors class from gensim.models.keyedvectors. It is very easy to use.

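import numpy as np
from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors

new_words = ["cat", "dog"]              # placeholder words to add
their_new_vecs = np.random.rand(2, 50)  # placeholder data: one 50-dim row per word

w2v = WordEmbeddingsKeyedVectors(50)  # 50 = vector length
w2v.add(new_words, their_new_vecs)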
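To get new_words and their_new_vecs without loading the whole pre-trained file, one option is to stream the txt file and keep only the rows you need. The following is a minimal sketch, not a definitive loader. It assumes each line has the form "word v1 v2 ... vN" separated by single spaces, with an optional "count dim" header line. The helper name load_subset and the sample data are hypothetical.

```python
import io
import numpy as np

def load_subset(lines, wanted):
    """Collect vectors for the words in `wanted` from word2vec-style text lines."""
    words, rows = [], []
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) <= 2:  # skip blank lines and an optional "count dim" header
            continue
        word = parts[0]
        if word in wanted:
            words.append(word)
            rows.append([float(x) for x in parts[1:]])
    # One row per kept word; empty array if nothing matched.
    matrix = np.array(rows, dtype=np.float32) if rows else np.empty((0, 0), dtype=np.float32)
    return words, matrix

# Tiny in-memory stand-in for the real txt file (made-up values).
fake_file = io.StringIO(
    "3 4\n"
    "cat 0.1 0.2 0.3 0.4\n"
    "dog 0.5 0.6 0.7 0.8\n"
    "car 0.9 1.0 1.1 1.2\n"
)
corpus_vocab = {"cat", "car", "tree"}  # words seen in the corpus (placeholder)
words, matrix = load_subset(fake_file, corpus_vocab)
```

With a real file you would pass the open file handle instead of fake_file. Only the matching rows are ever held in memory, and the returned list and matrix have the shape that the add call above expects. When a new document arrives, the same function can be called again with just the words that are not yet in the model.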

METHOD 2:

If you have already built a model using gensim.models.Word2Vec, you can assign the vector directly. Suppose I want to add the token <UNK> with a random vector:

model.wv["<UNK>"] = np.random.rand(100)  # 100 is the vector length

The complete example would be like this:

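import numpy as np
import gensim.downloader as api
from gensim.models import Word2Vec

dataset = api.load("text8")  # load dataset as iterable
model = Word2Vec(dataset)    # train with default settings (100-dimensional vectors)

# add the new token with a random vector of matching length
model.wv["<UNK>"] = np.random.rand(model.wv.vector_size)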


 