Python: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

Question

I am fetching data from a catalog, and it returns the data as bytes.

Bytes data:

b'\x80\x00\x00\x00\n\x00\x00%\x83\xa0\x08\x01\x00\xbb@\x00\x00\x05p 
\x02\x00>\xf3\x00\x00\x00}\x02\x00`\x03\xef0\x00\x00\r\xc0 
\x06\xf0>\xf3\x00\x00\x02\x88\x02\x03\xec\x03\xef0\x00\x00/.....'

When I try to convert this data to a string or any other readable format, I get this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

Code I used (Python 3.7.3):

blobs = blob.decode('utf-8')

and

import json
json.dumps(blob.decode())

I've also tried pickle, ast and pprint, but they were not helpful here.

What I tried:

Answer 1

You can try ignoring the bytes that cannot be decoded.

blob.decode('utf-8', 'ignore')

It's not a great solution, because it silently drops data, and the way you're generating the bytes object seems to have some issues. Maybe UTF-8 is not the proper encoding for your data.
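To see what the error handlers actually do, here is a small sketch using the first few bytes from the question. The name `blob` follows the question; the shortened sample and the comparison with 'replace' are added for illustration only.

```python
# First bytes of the data from the question, shortened for illustration.
blob = b'\x80\x00\x00\x00\n\x00\x00%\x83\xa0'

# 'ignore' silently drops every byte that is not valid UTF-8.
ignored = blob.decode('utf-8', errors='ignore')

# 'replace' substitutes U+FFFD for each undecodable byte,
# so the data loss stays visible in the result.
replaced = blob.decode('utf-8', errors='replace')

print(repr(ignored))
print(repr(replaced))
print(len(blob), len(ignored), len(replaced))
```

Either way the result is a string with information missing, which is why this only hides the symptom.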

Answer 2

The UTF-8 encoding has some built-in redundancy that serves at least two purposes:

1) locating code points when reading forwards or backwards

Start bytes (shown in binary, with dots standing for the bits that carry actual data) match one of these 4 patterns:

0.......
110.....
1110....
11110...

whereas continuation bytes (0 to 3 of them follow each start byte) always have this form:

10......

2) checking for validity

If this encoding is not respected, it is safe to say that the data is not UTF-8, e.g. because it was corrupted during a transfer.
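The bit patterns above can be checked directly. The following is a minimal sketch; the function name `classify_utf8_byte` is made up for this example, and it tests only the leading-bit patterns listed above, not the stricter rules a real decoder also enforces (such as rejecting overlong forms).

```python
def classify_utf8_byte(byte: int) -> str:
    """Classify one byte value (0-255) by its leading bit pattern."""
    if (byte & 0b10000000) == 0b00000000:
        return "start (1-byte sequence)"
    if (byte & 0b11000000) == 0b10000000:
        return "continuation"
    if (byte & 0b11100000) == 0b11000000:
        return "start (2-byte sequence)"
    if (byte & 0b11110000) == 0b11100000:
        return "start (3-byte sequence)"
    if (byte & 0b11111000) == 0b11110000:
        return "start (4-byte sequence)"
    return "invalid"


# Iterating over a bytes object yields integers.
for b in b'\x80\x00%\x83':
    print(f"0x{b:02x} {b:08b} {classify_utf8_byte(b)}")
```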

Conclusion

Why is it possible to say that b'\x80\x00' cannot be UTF-8? The encoding is already violated within the first two bytes: 0x80 (binary 10000000) has the form of a continuation byte, so it cannot appear where a start byte is required. This is exactly what your error message says:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

And even if you skip this one, you hit another problem a few bytes later at b'%\x83', so most likely you are either trying to decode the wrong data or assuming the wrong encoding.
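Both failure points can be located programmatically through the attributes of the exception. This is a sketch only; the helper name `first_decode_error` and the shortened sample are assumptions for illustration.

```python
# First bytes of the data from the question, shortened for illustration.
blob = b'\x80\x00\x00\x00\n\x00\x00%\x83\xa0'


def first_decode_error(data: bytes, start: int = 0):
    """Return (absolute position, reason) of the first UTF-8 error
    at or after `start`, or None if the rest decodes cleanly."""
    try:
        data[start:].decode('utf-8')
    except UnicodeDecodeError as exc:
        # exc.start is relative to the slice, so shift it back.
        return start + exc.start, exc.reason
    return None


print(first_decode_error(blob))           # the error from the question
print(first_decode_error(blob, start=1))  # the next one, after skipping byte 0
```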

Answer 3

The data in your example is clearly not text in any common encoding. Neither Python nor we can figure out a way to turn data which is obviously not text into strings.

If this is a well-defined binary file format, find a parser for it. Ideally that is a popular Python library, though for more obscure or proprietary formats you may not be able to find one. Otherwise, write one yourself if you can figure out how the data is structured, either by clever experimentation and good guesswork, or by finding documentation (if not authoritative, then perhaps more or less speculative third-party documentation).
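As a rough sketch of what such experimentation might look like, the standard `struct` module can interpret bytes under a guessed layout. The layout used here (three unsigned 32-bit integers) is purely hypothetical; the real format of the data in the question is unknown.

```python
import struct

# First 12 bytes of the data from the question.
blob = b'\x80\x00\x00\x00\n\x00\x00%\x83\xa0\x08\x01'

# A hex dump is often the first step when guessing at an unknown layout.
print(' '.join(f'{b:02x}' for b in blob))

# HYPOTHETICAL layout, for illustration only: three unsigned 32-bit
# integers. Both byte orders are tried, since the real one is unknown.
little = struct.unpack('<3I', blob)
big = struct.unpack('>3I', blob)
print(little)
print(big)
```

Whether any of the resulting numbers mean something depends entirely on the actual format, so treat this as a way to test guesses rather than as a parser.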

If you simply want to turn the bytes into a string whose Unicode code points have the same values as the bytes (so that, for example, the input byte \xff maps to the Unicode code point U+00FF), the 'latin-1' encoding does this, obscurely but conveniently. The result in this case will obviously not be useful human-readable text; in many ways, it would be more natural, and quite possibly less error-prone and more convenient, to just keep the data as bytes instead.
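A short sketch of that one-to-one mapping, using a few arbitrary sample bytes chosen for illustration:

```python
blob = b'\x80\x00\xff%\x83'

# 'latin-1' maps every byte value 0-255 to the code point with the
# same number, so decoding can never fail.
text = blob.decode('latin-1')
print([hex(ord(ch)) for ch in text])

# The mapping is reversible: encoding gives back the original bytes.
round_tripped = text.encode('latin-1')
print(round_tripped == blob)
```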

Answer 4

For this encoding error

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

or others like it, you just have to open the database dump file (the one with the .json extension), change its encoding to UTF-8 (for example, in VS Code you can change it in the bar at the bottom right) and save the file.

Now run
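The same re-encoding can also be done from Python instead of an editor. This is a sketch under an assumption: the helper name `reencode_to_utf8` is made up, and the default source encoding of UTF-16 is a guess (a first byte of 0xff often comes from a UTF-16 byte order mark), so check what your file really contains first.

```python
from pathlib import Path


def reencode_to_utf8(path, source_encoding='utf-16'):
    """Rewrite the file at `path` in place as UTF-8.

    `source_encoding` is an ASSUMPTION; adjust it to match the
    encoding your file was actually saved in.
    """
    file = Path(path)
    text = file.read_text(encoding=source_encoding)
    file.write_text(text, encoding='utf-8')


# Example call, using the path from this answer:
# reencode_to_utf8('store/dumps/store.json')
```

Keep a backup of the original file, since the rewrite happens in place.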

 $ git status

You'll get a result something like this:

 On branch master
 Changes not staged for commit:
   (use "git add <file>..." to update what will be committed)
   (use "git restore <file>..." to discard changes in working directory)
         modified:   store/dumps/store.json

 Untracked files:
   (use "git add <file>..." to include in what will be committed)
         .gitignore

 no changes added to commit (use "git add" and/or "git commit -a")

or something like this:

On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        modified:   store/dumps/store.json
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        .gitignore

In the first case, you just have to stage the change:

$ git add store/dumps/

In the second case the change is already staged, so you can skip that step.

Now, in both cases, you have to commit the changes with

$ git commit -m "launching to production"

The console will print a message summarizing the additions and changes.

You then have to trigger a new build of the app with

$ git push heroku master

(for Heroku users)

After the build, you just have to load the database again with

heroku run python manage.py loaddata store/dumps/store.json

It will install the objects.

Apologies for my English!

4 answers

Answer 1
2 2020-06-03 10:32:13

Answer 2
2 2020-06-03 17:43:02

Answer 3
1 2021-07-06 12:12:09

Answer 4
0 2020-10-09 19:15:32
