Gensim: how to load precomputed word vectors from text file

I have a text file with my precomputed word vectors in the following format (example):

word -0.0762464299711 0.0128308048976 ... 0.0712385589283

on each line for every word (with 297 extra floats in place of the ...). I am trying to load these with Gensim as KeyedVectors, because I ultimately would like to compute the cosine similarity, find the most similar words, etc. Unfortunately I have not worked with Gensim before, and from the documentation it's not quite clear to me how to do this. I have tried the following, which I found here:

word_vectors = KeyedVectors.load_word2vec_format('/embeddings/word.vectors', binary=False)

However, this gives the following error:

ValueError: invalid literal for int() with base 10: 'the'

'the' is the first word in the text file, so I suspect that the loading function is expecting something to be there that is not. But I can't find any information on what should be there. I would highly appreciate a pointer to such information or any other solution to my problem. Thanks!

You can see here an example of the Word2Vec format. The first line is supposed to contain the number of words you have in your file, followed by the dimension of your vectors. Your file starts directly with 'the', so the loader tries to parse that word as an integer, which is probably why your script raises the ValueError.

In your example:

1 300
word -0.0762464299711 0.0128308048976 ... 0.0712385589283
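
If you would rather not edit the file by hand, the header can be computed and prepended with a few lines of Python. This is only a sketch under some assumptions: the helper names `add_word2vec_header` and `load_vectors` and the output path are made up for illustration, the file is assumed to be UTF-8 with whitespace-separated values, and every line is assumed to have the same number of floats. As far as I know, Gensim 4.0 and later also accept `no_header=True` in `load_word2vec_format` for text files like yours, which would make the rewrite unnecessary, but check that against your installed version.

```python
def add_word2vec_header(src_path, dst_path, encoding="utf-8"):
    """Copy src_path to dst_path, prepending the '<word count> <dimension>' line.

    Hypothetical helper, not part of Gensim.
    """
    count, dim = 0, None
    # First pass: count the vectors and read the dimension off the first line.
    with open(src_path, encoding=encoding) as src:
        for line in src:
            parts = line.split()
            if not parts:
                continue  # ignore blank lines
            if dim is None:
                dim = len(parts) - 1  # everything after the word is a float
            count += 1
    if dim is None:
        raise ValueError("no vectors found in " + src_path)

    # Second pass: write the header, then copy the non-blank lines.
    with open(src_path, encoding=encoding) as src, \
            open(dst_path, "w", encoding=encoding) as dst:
        dst.write(f"{count} {dim}\n")
        for line in src:
            if line.strip():
                dst.write(line.rstrip("\n") + "\n")
    return count, dim


def load_vectors(path):
    """Load a word2vec text file that already has its header line."""
    # Imported here so the header helper can be used without Gensim installed.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=False)
```

You would then call something like `add_word2vec_header('/embeddings/word.vectors', '/embeddings/word.vectors.w2v')` followed by `word_vectors = load_vectors('/embeddings/word.vectors.w2v')`, after which `word_vectors.most_similar('the')` should work. Writing to a new file keeps the original embeddings untouched in case the counts come out wrong.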
