What is the best way to drop old "words" from gensim word2vec model?

I have a "corpus" built from an item-item graph, which means each sentence is a graph walk path and each word is an item. I want to train a word2vec model on the corpus to obtain the items' embedding vectors. The graph is updated every day, so the word2vec model is trained incrementally (using Word2Vec.save() and Word2Vec.load()) to keep the items' vectors up to date.

Unlike words, the items in my corpus have a limited lifetime, and new items are added every day. In order to prevent the model size from growing constantly, I need to drop items that have reached the end of their lifetime while keeping the model trainable. I've read a similar question here, but its answer doesn't cover incremental training and is based on KeyedVectors. I came up with the code below, but I'm not sure whether it is correct and proper:

from gensim.models import Word2Vec
import numpy as np

texts = [["a", "b", "c"], ["a", "h", "b"]]
m = Word2Vec(texts, size=5, window=5, min_count=1, workers=1)

print(m.wv.index2word)
print(m.wv.vectors)

# drop old words
wordsToDrop = ["b", "c"]
for w in wordsToDrop:
    i = m.wv.index2word.index(w)                        # current row of this word
    m.wv.index2word.pop(i)                              # remove it from the ordered word list
    m.wv.vectors = np.delete(m.wv.vectors, i, axis=0)   # remove its row of input vectors
    del m.wv.vocab[w]                                   # remove its vocabulary entry

print(m.wv.index2word)
print(m.wv.vectors)
m.save("m.model")
del m

# incremental training on new sentences
new = [["a", "e", "n"], ["r", "s"]]
m = Word2Vec.load("m.model")
m.build_vocab(new, update=True)   # add any new words to the existing vocabulary
m.train(new, total_examples=m.corpus_count, epochs=2)
print(m.wv.index2word)
print(m.wv.vectors)

After the deletion and the incremental training, are m.wv.index2word and m.wv.vectors still element-wise corresponding? Are there any side effects of the above code? If my approach is not good, could someone give me an example showing how to drop the old "words" properly while keeping the model trainable?

There's no official support for removing words from a Gensim Word2Vec model, once they've ever "made the cut" for inclusion.

Even the ability to add words isn't on a great footing, as the feature isn't based on any proven/published method of updating a Word2Vec model, and glosses over difficult tradeoffs in how update-batches affect the model, via choice of learning-rate or whether the batches fully represent the existing vocabulary. The safest course is to regularly re-train the model from scratch, with a full corpus with sufficient examples of all relevant words.

So, my main suggestion would be to regularly replace your model with a new one trained with all still-relevant data. That would ensure it's no longer wasting model state on obsolete terms, and that all still-live terms have received coequal, interleaved training.
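
For example, here is a minimal sketch of that "rebuild regularly" approach. The build_live_walks() helper is a hypothetical placeholder for however you regenerate graph walks over only still-live items; the Word2Vec parameters just mirror the toy settings from the question:

from gensim.models import Word2Vec

def build_live_walks():
    # placeholder: regenerate walk "sentences" from the current item-item graph,
    # so expired items simply never enter the new model's vocabulary
    return [["a", "e", "n"], ["r", "s", "a"]]

def rebuild_model(path="m.model"):
    live_corpus = build_live_walks()
    m = Word2Vec(live_corpus, size=5, window=5, min_count=1, workers=1)
    m.save(path)
    return m

m = rebuild_model()   # run as the daily job, instead of incremental updates

Each run produces a self-consistent set of vectors; as the next paragraph notes, they won't be directly comparable to the previous model's vectors.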

After such a reset, word-vectors won't be comparable to word-vectors from a prior 'model era'. (The same word, even if its tangible meaning hasn't changed, could be in an arbitrarily different place - but the relative relationships with other vectors should remain as good or better.) But, that same sort of drift-out-of-comparison is also happening with any set of small-batch updates that don't 'touch' every existing word equally, just at some unquantifiable rate.

OTOH, if you think you need to stay with such incremental updates, even knowing the caveats, it's plausible that you could patch-up the model structures to retain as much as is sensible from the old model & continue training.

Your code so far is a reasonable start, missing a few important considerations for proper functionality:

  • because deleting earlier-words changes the index location of later-words, you'd need to update the vocab[word].index values for every surviving word, to match the new index2word ordering. For example, after doing all deletions, you might do:
for i, word in enumerate(m.wv.index2word):
    m.wv.vocab[word].index = i
  • because in your (default negative-sampling) Word2Vec model, there is also another array of per-word weights related to the model's output layer, which should be updated in sync, so that the right output-values are being checked per word. Roughly, whenever you delete a row from m.wv.vectors, you should delete the same row from m.trainables.syn1neg.

  • because the surviving vocabulary has different relative word-frequencies, both the negative-sampling and downsampling (controlled by the sample parameter) functions should work off different pre-calculated structures to assist their choices. For the cumulative-distribution table used by negative-sampling, this is pretty easy:

m.make_cum_table(m.wv)

For the downsampling, you'd want to update the .sample_int values, similar to the logic you can view around the code at https://github.com/RaRe-Technologies/gensim/blob/3.8.3/gensim/models/word2vec.py#L1534. (But, looking at that code now, I think it may be buggy, in that it updates all words with just the frequency info in the new dict, so probably fouling the usual downsampling of truly-frequent words, and possibly erroneously downsampling words that are only frequent in the new update.) A consolidated sketch of all of these fix-ups follows below.
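
Putting those pieces together, the following is only a rough sketch of such a patched deletion routine, assuming gensim 3.8.x internals (m.trainables.syn1neg for the output-layer weights, m.vocabulary for the sample setting and make_cum_table(), and .count / .sample_int on each vocab entry). It is an unsupported hack, not an official API, and the sample_int recalculation assumes a fractional sample value below 1.0:

import numpy as np

def drop_words(m, words_to_drop):
    # row indexes of the doomed words, taken before anything is modified
    rows = sorted(m.wv.vocab[w].index for w in words_to_drop if w in m.wv.vocab)

    # delete the same rows from the input vectors and the output-layer weights, one call each
    m.wv.vectors = np.delete(m.wv.vectors, rows, axis=0)
    m.trainables.syn1neg = np.delete(m.trainables.syn1neg, rows, axis=0)

    # drop the vocabulary entries and the index2word slots
    for w in words_to_drop:
        m.wv.vocab.pop(w, None)
    m.wv.index2word = [w for w in m.wv.index2word if w in m.wv.vocab]

    # re-assign .index so every surviving word matches its new row position
    for i, word in enumerate(m.wv.index2word):
        m.wv.vocab[word].index = i

    # rebuild the cumulative-distribution table used by negative-sampling
    # (in gensim 3.8.x this method lives on model.vocabulary; other versions expose it elsewhere)
    m.vocabulary.make_cum_table(m.wv)

    # recompute downsampling thresholds from the surviving counts
    retain_total = sum(v.count for v in m.wv.vocab.values())
    threshold_count = m.vocabulary.sample * retain_total
    for v in m.wv.vocab.values():
        p = (np.sqrt(v.count / threshold_count) + 1) * (threshold_count / v.count)
        v.sample_int = int(round(min(p, 1.0) * 2 ** 32))

After such a cleanup you could continue with build_vocab(..., update=True) and train() as in the question, but every caveat above about incremental updates still applies.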

If those internal structures are updated properly, in sync with your existing actions, the model is probably in a consistent state for further training. (But note: these structures change a lot in the forthcoming gensim-4.0.0 release, so any custom tampering will need to be updated when upgrading.)

One other efficiency note: the np.delete() operation will create a new array, the full size of the surviving array, and copy the old values over, each time it is called. So using it to remove many rows, one at a time, from a very-large original array is likely to require a lot of redundant allocation/copying/garbage-collection. You may be able to call it once, at the end, with a list of all indexes to remove.
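
A tiny standalone illustration of that difference (the array here is just a stand-in for m.wv.vectors; the sketch above already applies the same batching to the real model):

import numpy as np

arr = np.arange(12, dtype=np.float32).reshape(4, 3)   # stand-in for a (vocab_size, vector_size) array
rows_to_drop = [1, 3]                                  # collect every doomed row index first
arr = np.delete(arr, rows_to_drop, axis=0)             # one allocation and copy, instead of one per row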

But really: the simpler & better-grounded approach, which may also yield significantly better continually-comparable vectors, would be to retrain with all current data whenever possible, or whenever a large amount of change has occurred.
