Error: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

Question

I am trying to do the following Kaggle assignment. I am using the gensim package to use word2vec. I am able to create the model and store it to disk, but when I try to load the file back I get the error below.

    -HP-dx2280-MT-GR541AV:~$ python prog_w2v.py 
Traceback (most recent call last):
  File "prog_w2v.py", line 7, in <module>
    models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 579, in load_word2vec_format
    header = utils.to_unicode(fin.readline())
  File "/usr/local/lib/python2.7/dist-packages/gensim/utils.py", line 190, in any2unicode
    return unicode(text, encoding, errors=errors)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

I found a similar question, but I was unable to solve the problem. My prog_w2v.py is as below.

import gensim
import time
start = time.time()    
models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True) 
end = time.time()   
print end-start,"   seconds"

I am trying to generate the model using the code here. The program takes about half an hour to generate the model, so I cannot run it many times to debug it.

Answer 1

You are not loading the file correctly. You should use load() instead of load_word2vec_format(). The latter is for models that were trained with the C code and saved in its binary format. However, you are training the model in Python and are not saving it in that format. So you can simply use the following code and it should work:

models = gensim.models.Word2Vec.load('300features_40minwords_10context.txt')
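A possible explanation for the exact byte in the error message, offered as a sketch rather than a confirmed diagnosis: as far as I know, gensim's save() writes a Python pickle, and every pickle at protocol 2 or higher begins with the byte 0x80. A loader that tries to decode the first line as UTF-8 text then fails at position 0. The snippet below uses only the standard library (Python 3 syntax); `fake_model` and `try_read_header` are hypothetical stand-ins, not gensim code.

```python
import pickle

# Stand-in for a saved model; any picklable object works for this illustration.
fake_model = {"vector_size": 300, "min_count": 40, "window": 10}

# Pickle protocols 2 and above start with the PROTO opcode, which is byte 0x80.
payload = pickle.dumps(fake_model, protocol=2)
first_byte = payload[:1]


def try_read_header(data):
    """Mimic a text-format loader that decodes the first line as UTF-8."""
    header_line = data.split(b"\n", 1)[0]
    try:
        return header_line.decode("utf8")
    except UnicodeDecodeError as exc:
        return "failed: " + exc.reason


result = try_read_header(payload)
print(first_byte, result)
```

If this is what happened, the file name ending in .txt is misleading: the content is a pickle, and only load() knows how to read it.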

Answer 2

If you save your model with:

model.wv.save(OUTPUT_FILE_PATH + 'word2vec.bin')

then loading it with the load_word2vec_format method causes this issue. To make it work, you should use:

wiki_model = KeyedVectors.load(OUTPUT_FILE_PATH + 'word2vec.bin')

The same thing also happens when you save the model with:

 model.wv.save_word2vec_format(OUTPUT_FILE_PATH + 'word2vec.txt', binary=False)

and then try to load it with the KeyedVectors.load method. In this situation, use:

wiki_model = KeyedVectors.load_word2vec_format(OUTPUT_FILE_PATH + 'word2vec.txt', binary=False)

Answer 3

As the other answers say, knowing how you saved the file is important, because each save method has a matching load method. However, you can simply pass the flag unicode_errors='ignore' to skip this issue and load the model anyway.

import gensim  

model = gensim.models.KeyedVectors.load_word2vec_format(file_path, binary=True, unicode_errors='ignore')

By default, this flag is set to 'strict': unicode_errors='strict'.

The documentation gives the following explanation of why errors like this occur.

unicode_errors : str, optional, default 'strict'. A string suitable to be passed as the errors argument to the unicode() (Python 2.x) or str() (Python 3.x) function. If your source file may include word tokens truncated in the middle of a multibyte unicode character (as is common from the original word2vec.c tool), 'ignore' or 'replace' may help.
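To make the quoted passage concrete, here is a small standard-library sketch (Python 3 syntax) of what the three error modes do to a token that was cut in the middle of a multibyte character. The token "naïve" is only an illustrative choice; it says nothing about what a particular model file contains.

```python
# "ï" is two bytes in UTF-8 (0xC3 0xAF), so cutting after three bytes splits it.
token_bytes = "naïve".encode("utf8")
truncated = token_bytes[:3]  # b'na\xc3'

# 'strict' (the default) refuses to decode the damaged token.
try:
    truncated.decode("utf8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'ignore' silently drops the incomplete byte sequence.
ignored = truncated.decode("utf8", errors="ignore")

# 'replace' substitutes U+FFFD, the Unicode replacement character.
replaced = truncated.decode("utf8", errors="replace")

print(strict_failed, repr(ignored), repr(replaced))
```

Note that 'ignore' changes the token itself, so two different truncated words could end up with the same key in the vocabulary.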

All of the above answers are helpful if we can keep track of how each model was saved. But what if we have a bunch of models to load and want one general method for all of them? We can use the above flag to do so.

I have run into this myself: I trained multiple models using the original word2vec.c file, and when I tried to load them into gensim, some models loaded successfully while others gave unicode errors. I have found the above flag to be helpful and convenient.
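For the "bunch of models" case, another option is to look at the first bytes of each file before choosing a loader. The sketch below is a hypothetical helper, not part of gensim. It relies on two assumptions: files written by save() are pickles that start with byte 0x80, and word2vec-format files (text or binary) start with a header line of two integers, the vocabulary size and the vector size. It uses only the standard library and Python 3 syntax.

```python
import os
import pickle
import tempfile


def guess_saved_format(path):
    """Hypothetical helper: guess which loader family a model file needs."""
    with open(path, "rb") as fin:
        header = fin.readline(200)  # read at most 200 bytes of the first line
    if header[:1] == b"\x80":
        return "pickle"  # probably written by save(); try load()
    parts = header.split()
    if len(parts) == 2 and all(part.isdigit() for part in parts):
        return "word2vec"  # "<vocab_size> <vector_size>"; try load_word2vec_format()
    return "unknown"


# Demo on two throwaway files that imitate the two formats.
with tempfile.TemporaryDirectory() as tmp:
    pickled_path = os.path.join(tmp, "model.pkl")
    with open(pickled_path, "wb") as fout:
        pickle.dump({"demo": True}, fout, protocol=2)

    text_path = os.path.join(tmp, "vectors.txt")
    with open(text_path, "wb") as fout:
        fout.write(b"2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")

    guesses = (guess_saved_format(pickled_path), guess_saved_format(text_path))

print(guesses)
```

The helper does not tell text and binary word2vec files apart, so the binary flag still has to come from somewhere else.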

Answer 4

If you saved your model with save(), you must use load().

load_word2vec_format is for models in the format generated by Google's word2vec tool, not for models saved with gensim's save().

4 answers

Answer 1
12 2015-05-12 21:52:06

Answer 2
5 2018-01-06 20:13:54

Answer 3
5 2018-12-03 06:21:07

Answer 4
3 2015-01-20 18:43:12
