Importing GloVe vectors into gensim. UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 0: invalid continuation byte

I've produced GloVe vectors from my own corpus using the code provided by https://github.com/stanfordnlp/GloVe/blob/master/demo.sh. So, I have the vectors in both a .bin file and a .txt file. I'm trying to import these files into gensim so I can work with them the same way I work with word2vec vectors.

I've tried switching to load with both the binary file and the text file, but only ended up with a pickling error:

models = gensim.models.Word2Vec.load(file)

I've tried ignoring the unicode error, which didn't work. I still got the unicode error.

model = gensim.models.KeyedVectors.load_word2vec_format(file, binary=True, unicode_errors='ignore')

This is what I have for my code right now:

from gensim.models import KeyedVectors
import gensim
from gensim.models import word2vec

file = 'vectors.bin'
model = KeyedVectors.load_word2vec_format(file, binary=True, unicode_errors='ignore')  
model.wv.most_similar(positive=['woman', 'king'], negative=['man'])

This is the error message I keep getting:

Traceback (most recent call last):
  File "glove_to_word2vec.py", line 6, in <module>
    model = KeyedVectors.load_word2vec_format(file, binary=True)  # C  binary format
  File "/home/users/epair/.local/lib/python3.6/site-packages/gensim/models/keyedvectors.py", line 1498, in load_word2vec_format
    limit=limit, datatype=datatype)
  File "/home/users/epair/.local/lib/python3.6/site-packages/gensim/models/utils_any2vec.py", line 343, in _load_word2vec_format
    header = utils.to_unicode(fin.readline(), encoding=encoding)
  File "/home/users/epair/.local/lib/python3.6/site-packages/gensim/utils.py", line 359, in any2unicode
    return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 0: invalid continuation byte

The pickling error was something like this: Unpickling Error while using Word2Vec.load()

Text file format

There's no expectation that a plain .load() would work: it only works with gensim's own models, saved with the matching .save() method.

However, .load_word2vec_format() should work with files in the right format.

Are you sure the file is in a compatible format? (Does it load into the original Google word2vec.c sibling tools, like the distance or word-analogy executables?)

You mentioned having the .txt format as well. Have you tried loading that file (with binary=False)?

Looking at line 343 of utils_any2vec.py (in the version of gensim you're likely using), that appears to be reading the very first line of the file, which should contain only 2 plain space-separated numbers: the number of words and the number of dimensions. (That is, encoding issues with your actual word-tokens shouldn't even be involved.)
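As a rough illustration of that check, the following sketch reads only the first line of a file and reports whether it looks like a word2vec header. The function name read_word2vec_header is made up for this example and is not part of gensim; it only approximates what the loader expects.

```python
def read_word2vec_header(path):
    """Return (vocab_size, dimensions) if the first line looks like a
    word2vec-style header, otherwise None.

    Hypothetical helper for illustration; not a gensim function.
    """
    # Read raw bytes so that a binary file cannot raise while opening.
    with open(path, "rb") as fin:
        first_line = fin.readline()

    try:
        fields = first_line.decode("utf-8").split()
    except UnicodeDecodeError:
        # The first line is not valid UTF-8 text at all.
        return None

    # A header has exactly two fields, both of them integers.
    if len(fields) != 2:
        return None
    try:
        vocab_size, dimensions = int(fields[0]), int(fields[1])
    except ValueError:
        return None
    return vocab_size, dimensions
```

Calling read_word2vec_header('vectors.txt') and getting None back would suggest the file starts with something other than the two-number header.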

If you look at your file with head -1 vectors.txt, do you see only those two numbers? (If not, your GloVe code isn't writing a compatible format.)
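If the first line turns out to be a word-and-vector row rather than a header (which, as far as I know, is what GloVe's demo.sh writes by default), one possible workaround is to write a copy of the text file with the header line prepended. The sketch below does that in plain Python. add_word2vec_header is a hypothetical helper, not a gensim or GloVe function, and it assumes a UTF-8 text file with one space-separated row per word. Depending on your gensim version there may also be built-in support for header-less files (a glove2word2vec script in older releases, a no_header argument to load_word2vec_format in 4.x), so it is worth checking the docs first.

```python
def add_word2vec_header(glove_path, output_path):
    """Copy a header-less GloVe text file to output_path, prepending a
    "<vocab_size> <dimensions>" line. Returns (vocab_size, dimensions).

    Hypothetical helper for illustration; assumes UTF-8 text input.
    """
    vocab_size = 0
    dimensions = None

    # First pass: count the rows and take the dimensionality from the first row.
    with open(glove_path, "r", encoding="utf-8") as fin:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            if dimensions is None:
                # One token for the word, the rest are vector components.
                dimensions = len(line.rstrip().split(" ")) - 1
            vocab_size += 1

    if dimensions is None:
        raise ValueError("no vector rows found in %r" % glove_path)

    # Second pass: write the header, then copy the rows unchanged.
    with open(glove_path, "r", encoding="utf-8") as fin, \
            open(output_path, "w", encoding="utf-8", newline="\n") as fout:
        fout.write("%d %d\n" % (vocab_size, dimensions))
        for line in fin:
            if line.strip():
                fout.write(line if line.endswith("\n") else line + "\n")

    return vocab_size, dimensions
```

The resulting file could then be passed to KeyedVectors.load_word2vec_format(output_path, binary=False).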
