How to manually change the vector dimensions of a word in Gensim Word2Vec

I have a Word2Vec model with a lot of word vectors. I can access a word vector like so:

import gensim

word_vectors = gensim.models.Word2Vec.load(wordspace_path)
print(word_vectors['boy'])

Output:

[ -5.48055351e-01   1.08748421e-01  -3.50534245e-02  -9.02988110e-03...]

Now I have a proper vector representation that I want to use in place of the current word_vectors['boy']:

word_vectors['boy'] = [ -7.48055351e-01   3.08748421e-01  -2.50534245e-02  -10.02988110e-03...]

But the following error is thrown:

TypeError: 'Word2Vec' object does not support item assignment

Is there any way or workaround to do this, that is, to manipulate word vectors manually once the model is trained? Is it possible on platforms other than Gensim?

Since word2vec vectors are typically only created by the iterative training process and then accessed, the gensim Word2Vec object does not support direct assignment of new values by word key.

However, because it is written in Python, all of its internal structures are fully viewable and modifiable by you, and because it is open source, you can see exactly how its existing functionality works and use that as a model for doing new things.

Specifically, the raw word-vectors are (in recent versions of gensim) stored in a property of the Word2Vec object called wv, and this wv property is an instance of KeyedVectors. If you examine its source code, you can see that accesses of word-vectors by string key (e.g. 'boy'), including those by [] indexing implemented by the __getitem__() method, go through its word_vec() method. You can view the source of that method either in your local installation or on GitHub:

https://github.com/RaRe-Technologies/gensim/blob/c2201664d5ae03af8d90fb5ff514ffa48a6f305a/gensim/models/keyedvectors.py#L265

There you'll see that the word is actually converted to an integer index (via self.vocab[word].index), which is then used to access an internal syn0 or syn0norm array (depending on whether the user is requesting the raw or the unit-normalized vector). If you look at where these are set up, or simply examine them in your own console or code (for example, via word_vectors.wv.syn0), you'll see that they are numpy arrays, which do support direct assignment by index.

So, you can directly change their values by integer index, for example:

word_vectors.wv.syn0[word_vectors.wv.vocab['boy'].index] = [ -7.48055351e-01   3.08748421e-01  -2.50534245e-02  -10.02988110e-03...]

After that, future accesses of word_vectors.wv['boy'] will return your updated values.
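The lookup-then-assign step can be wrapped in a small helper. This is only a sketch: set_word_vector is a hypothetical name, not a gensim function, and the toy_kv object is a stand-in built for the demonstration rather than a real model. The first branch assumes the attribute names that, to my knowledge, gensim 4.x uses (key_to_index and vectors); the second branch follows the layout described in this answer (vocab[word].index and syn0). Check the names against your installed version before relying on either branch.

```python
from types import SimpleNamespace

import numpy as np


def set_word_vector(kv, word, new_vector):
    """Overwrite the stored vector for `word` in a KeyedVectors-like object."""
    if hasattr(kv, "key_to_index"):
        # Assumed gensim 4.x layout: word -> row index dict plus a `vectors` array.
        index = kv.key_to_index[word]
        array = kv.vectors
    else:
        # Layout described in this answer: vocab[word].index plus a `syn0` array.
        index = kv.vocab[word].index
        array = kv.syn0
    new_vector = np.asarray(new_vector, dtype=array.dtype)
    if new_vector.shape != array[index].shape:
        raise ValueError("replacement must have the same dimensionality")
    array[index] = new_vector  # in-place write into the numpy array


# Stand-in object for demonstration only; with a real model you would pass model.wv.
toy_kv = SimpleNamespace(
    vocab={"boy": SimpleNamespace(index=0), "girl": SimpleNamespace(index=1)},
    syn0=np.zeros((2, 4), dtype=np.float32),
)
set_word_vector(toy_kv, "boy", [-0.748, 0.309, -0.025, -0.010])
print(toy_kv.syn0[toy_kv.vocab["boy"].index])
```

The shape check is there because numpy would otherwise broadcast a shorter vector (or a single number) across the row without complaint, leaving a wrong vector in place of an error.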

Notes:

• If you want syn0norm to be updated so that it holds the proper unit-normed vectors (as used in most_similar() and other operations), it is likely best to modify syn0 first, then discard and recalculate syn0norm via:

word_vectors.wv.syn0norm = None
word_vectors.wv.init_sims()

• Adding new words would require more involved object-tampering, because it requires growing syn0 (replacing it with a larger array) and updating the vocab dict.
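Both notes can be illustrated with plain numpy. The sketch below does not touch gensim at all: raw, word_to_index, unit_normalize and add_word are hypothetical stand-ins for the raw vector array, the vocabulary lookup, the renormalization step and the array-growing step. A real model keeps richer vocabulary entries (the self.vocab[word].index access above implies objects, not bare integers) and possibly other bookkeeping that would also need to stay consistent.

```python
import numpy as np

# Hypothetical stand-ins, not gensim objects: a raw vector array and a
# word -> row index map.
raw = np.array([[3.0, 4.0], [1.0, 0.0]], dtype=np.float32)
word_to_index = {"boy": 0, "girl": 1}


def unit_normalize(vectors):
    """Return a copy of `vectors` with every row scaled to L2 length 1."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms  # assumes no all-zero rows


def add_word(vectors, word_to_index, word, new_vector):
    """Return a larger array with `new_vector` appended; record its row in the map."""
    if word in word_to_index:
        raise ValueError(f"{word!r} is already in the vocabulary")
    new_row = np.asarray(new_vector, dtype=vectors.dtype).reshape(1, -1)
    if new_row.shape[1] != vectors.shape[1]:
        raise ValueError("new vector must match the existing dimensionality")
    grown = np.vstack([vectors, new_row])  # numpy arrays cannot grow in place
    word_to_index[word] = grown.shape[0] - 1
    return grown


# First note: edit the raw row, then rebuild the normalized copy from scratch.
raw[word_to_index["boy"]] = [6.0, 8.0]
normed = unit_normalize(raw)

# Second note: adding a word replaces the array and extends the map; the
# now-stale normalized copy is rebuilt afterwards.
raw = add_word(raw, word_to_index, "child", [0.0, 2.0])
normed = unit_normalize(raw)
print(raw.shape, word_to_index["child"])
```

Rebuilding the normalized copy from the raw array, rather than patching single rows of it, keeps the two arrays from drifting apart, which is the same reasoning behind discarding syn0norm before recalculating it.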
