Recovering a byte string with invalid (UTF-8) characters after decoding in Python

Question

I've logged a lot of text that was decoded to unicode (UTF-8) from byte strings.

Example:

From upstream I received a lot of byte strings, like:

b_st = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xfe\x00'

I saved those on my computer after decoding them:

b_un = b_st.decode("utf-8", "replace")

As you can see, the initial byte string contains bytes that are not valid UTF-8 (e.g. \xff), so those will be replaced.

After that I tried to recover the byte string from that unicode text with b_un.encode("utf-8"), but it returns a different byte string, not the same as the original.
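The mismatch can be reproduced with a minimal sketch that uses only the sample bytes from this question. The name b_back is introduced here for illustration and does not appear in the original code.

```python
# Sample byte string from the question (the start of a JPEG header).
b_st = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xfe\x00'

# Decode with the "replace" error handler, as was done before logging.
b_un = b_st.decode("utf-8", "replace")

# Try to go back to bytes (b_back is a hypothetical name for this sketch).
b_back = b_un.encode("utf-8")

print(b_back == b_st)          # False
print(b_un.count("\ufffd"))    # number of bytes that were replaced
```

Each replaced byte comes back as the three-byte UTF-8 encoding of U+FFFD (b'\xef\xbf\xbd'), so the result is both longer than and different from the input.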

Is it possible to recover the original byte string?

PS. I didn't decode those texts intentionally; I hadn't read up on the default behavior of a class that automatically converts any text to unicode if necessary.

Answer 1

replace is a lossy codec error handler: it replaces any un-decodable bytes with U+FFFD (the Unicode replacement character).

As such, it is impossible to recover your original image: every invalid byte becomes the same character, so the information about which byte it was is gone.
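A short sketch of why the loss cannot be undone: two different inputs decode to the same text, so no function of that text can tell them apart. The byte strings first and second are made up for illustration.

```python
# Two different (hypothetical) byte strings, each starting with an invalid UTF-8 byte.
first = b"\xffA"
second = b"\xfeA"

text_first = first.decode("utf-8", "replace")
text_second = second.decode("utf-8", "replace")

# Both collapse to the same text: U+FFFD followed by "A".
print(text_first == text_second)   # True
print(first == second)             # False
```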

If you're handed a byte string, you can write it to a file by using a binary I/O object:

with open(filename, 'wb') as f:
    f.write(byte_string)
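If the data has to pass through a str in the future, a reversible decoding avoids the loss. The sketch below shows two standard options, the "surrogateescape" error handler and the latin-1 codec. It assumes you control the decoding step, and it does not help with text that was already decoded with "replace". The variable names are illustrative.

```python
raw = b'\xff\xd8\xff\xe0\x00\x10JFIF'

# Option 1: keep invalid bytes as lone surrogates (U+DC80..U+DCFF).
text = raw.decode("utf-8", "surrogateescape")
restored = text.encode("utf-8", "surrogateescape")

# Option 2: latin-1 maps every byte 0..255 to exactly one code point.
text_l1 = raw.decode("latin-1")
restored_l1 = text_l1.encode("latin-1")

print(restored == raw)      # True
print(restored_l1 == raw)   # True
```

Note that a string containing lone surrogates cannot be encoded with the default strict handler, so it must be encoded with "surrogateescape" again before it is written anywhere as UTF-8.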

Solution 1
2 Accepted 2020-09-08 23:13:54
