
Gensim Word2Vec model: Cut dimensions

I have a trained word2vec model in gensim with 300 dimensions and would like to cut the dimensions to 100 (simply drop the last 200 dimensions). What is the easiest and most efficient way to do this in Python?

You could save the model in the word2vec format. Make sure to save it as a text file (.txt), not in binary form. The word2vec text format is as follows:

The first line is <vocabulary_size> <embedding_size>. In your case, <embedding_size> will be 300. Each remaining line is <word> followed by 300 floating-point numbers, with all fields separated by single spaces. You can parse this file in Python and discard the last 200 numbers from each line. Make sure to update <embedding_size> on the first line to 100. Optionally, save the result as a new file. You can then load that file as a fresh set of word vectors using load_word2vec_format().
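The parsing steps above can be sketched in plain Python. This is a minimal sketch, not a definitive implementation: the function name truncate_word2vec_text and its parameters are made up for illustration, and it assumes a UTF-8 text file whose header matches the actual vector length.

```python
def truncate_word2vec_text(src_path, dst_path, keep_dims):
    """Copy a word2vec text file, keeping only the first `keep_dims` values per word."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        # Header line: "<vocabulary_size> <embedding_size>"
        vocab_size, old_dims = map(int, src.readline().split())
        if keep_dims > old_dims:
            raise ValueError("keep_dims exceeds the stored embedding size")
        # The new header must advertise the reduced embedding size.
        dst.write(f"{vocab_size} {keep_dims}\n")
        for line in src:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            # The last `old_dims` tokens are the vector; anything before them is the word.
            word = " ".join(parts[:-old_dims])
            values = parts[-old_dims:][:keep_dims]
            dst.write(word + " " + " ".join(values) + "\n")
```

With hypothetical file names, truncate_word2vec_text("vectors_300.txt", "vectors_100.txt", 100) would write the trimmed file, which can then be loaded with KeyedVectors.load_word2vec_format("vectors_100.txt", binary=False).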

You should be able to trim the dimensions inside a KeyedVectors instance and then save it, so you don't have to do anything special with the format on disk. For example:

kv = w2v_model.wv
kv.vectors = kv.vectors[:,0:100]  # keeps just 1st 100 dims
kv.vector_size = 100

Now kv can be saved (with either gensim's native .save() or the interchange format .save_word2vec_format()), or just operated on as a subset of the original dimensions.

(While any 100 dimensions of a larger embedding are as likely to be as good as any other, you'll be losing some of the 300 dimensions' expressiveness, in arbitrary ways. Re-training with 100 dimensions to begin with might do better, as might some sort of dimensionality-reduction algorithm, which could in effect leave you with the "most expressive" 100 dimensions.)
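As one illustration of the dimensionality-reduction idea, here is a hedged sketch using PCA via NumPy's SVD. PCA is only one possible choice and is not named in the answer; the helper pca_reduce is hypothetical. Note that it centers the vectors first, so cosine similarities will not match those of the original embedding exactly.

```python
import numpy as np

def pca_reduce(vectors, keep_dims):
    """Project row vectors onto their top `keep_dims` principal components."""
    X = np.asarray(vectors, dtype=np.float64)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return (centered @ vt[:keep_dims].T).astype(np.float32)
```

Under these assumptions, the result of pca_reduce(kv.vectors, 100) could be assigned back to kv.vectors (together with kv.vector_size = 100) in place of the plain slice.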


 