Load word vectors from a text file - GENSIM PYTHON

Hello, I have a txt file in this form: the first column is the word and the second is its vector.

word 0.256 0.2659 0.326595
word1 0.528 0.6589 0.62326 ...

I am trying to load this as KeyedVectors because I want to compute the cosine similarity between the words and find the most similar words, but I always get an error.

I'm guessing the actual format includes line breaks, like:

word 0.256 0.2659 0.326595
word1 0.528 0.6589 0.62326

That's more or less the format common for GloVe-trained vectors, and it is very similar to the text format used by Google's original word2vec.c code, which adds a first line with the count of vectors and their dimensionality.
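To make the difference between the two text formats concrete, here is a minimal sketch that prepends the "count dimensionality" line to a headerless file. The function name add_word2vec_header and the file names are hypothetical, and the sketch assumes every line is a single space-free word followed by the same number of values.

```python
import os
import tempfile


def add_word2vec_header(src_path, dst_path):
    """Copy a headerless 'word v1 v2 ...' file, prepending a 'count dims' line.

    Hypothetical helper for illustration; assumes words contain no spaces.
    """
    with open(src_path, encoding="utf-8") as src:
        # Keep only non-blank lines; each one is a word followed by its vector.
        rows = [line.rstrip("\n") for line in src if line.strip()]
    if not rows:
        raise ValueError("no vectors found in " + src_path)
    # Dimensionality = number of fields after the word on the first line.
    dims = len(rows[0].split()) - 1
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(f"{len(rows)} {dims}\n")
        for row in rows:
            dst.write(row + "\n")
    return len(rows), dims


# Demo on the two sample rows from the question.
tmp_dir = tempfile.mkdtemp()
glove_path = os.path.join(tmp_dir, "vectors.txt")
w2v_path = os.path.join(tmp_dir, "vectors_w2v.txt")
with open(glove_path, "w", encoding="utf-8") as f:
    f.write("word 0.256 0.2659 0.326595\nword1 0.528 0.6589 0.62326\n")

count, dims = add_word2vec_header(glove_path, w2v_path)
with open(w2v_path, encoding="utf-8") as f:
    header = f.readline().strip()
print(header)  # 2 3
```

A file rewritten this way has the header that the word2vec.c text format expects, so it no longer needs special handling for the missing first line.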

(If your vectors came from one of those tools, or a public place, and the filename or origin gives more hints as to their format, that would have been helpful to note in your question.)

If I'm guessing your true format correctly, then Gensim's KeyedVectors class can load the GloVe format via the .load_word2vec_format() method, with the optional no_header=True parameter:

from gensim.models import KeyedVectors

vecs = KeyedVectors.load_word2vec_format(filename, binary=False, no_header=True)

See the docs for more options:

https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.load_word2vec_format
