Recover byte string with invalid characters(UTF-8) after having decoded it in Python

Question

I've logged a lot of text that had been decoded from byte strings to unicode (UTF-8).

Example:

From upstream I received a lot of byte strings, like:

b_st = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xfe\x00'

I saved those on my computer after decoding them:

b_un = b_st.decode("utf-8", "replace")

As you can see, the initial byte string contains bytes that are invalid in UTF-8 (e.g. \xff), so those get replaced.

After that I tried to recover the byte string from the unicode text with b_un.encode("utf-8"), but it returns a different byte string, not the original one.

Is it possible to recover the original byte string?

PS. I didn't decode those texts intentionally; I hadn't read about the default behavior of a class that automatically converts any text to unicode if necessary.

Answer 1

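replace is a lossy codec error handler: it replaces every undecodable byte with \ufffd (the Unicode replacement character).

As such, it is impossible to recover your original image. Once several different bytes have all been turned into the same character, the text no longer records which byte was originally there.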
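A minimal sketch of why the round trip fails, using the bytes from the question. The second half is an assumption beyond the original answer: it only helps for data you have not decoded yet, and it assumes you can change the decoding call to use the standard surrogateescape error handler, which keeps undecodable bytes recoverable.

```python
# Bytes from the question: the start of a JPEG file, which is not valid UTF-8.
original = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xfe\x00'

# "replace" turns every undecodable byte into U+FFFD, so distinct bytes collapse.
lossy_text = original.decode("utf-8", "replace")
lossy_roundtrip = lossy_text.encode("utf-8")
print(lossy_roundtrip == original)  # False

# Two different inputs produce the same text, so nothing can tell them apart later.
same_text = b'\xff'.decode("utf-8", "replace") == b'\xfe'.decode("utf-8", "replace")
print(same_text)  # True

# Assumption: for future data, "surrogateescape" maps each bad byte to a lone
# surrogate (U+DC80..U+DCFF) that encodes back to the exact same byte.
escaped_text = original.decode("utf-8", "surrogateescape")
restored = escaped_text.encode("utf-8", "surrogateescape")
print(restored == original)  # True
```

Text decoded with surrogateescape contains lone surrogates, so it must also be encoded with surrogateescape; a plain .encode("utf-8") on it raises UnicodeEncodeError.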

If you're handed a byte string, you can write it to a file by using a binary io object:

with open(filename, 'wb') as f:
    f.write(byte_string)
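As a hedged, self-contained version of the snippet above: the file name upstream.bin and the temporary directory are hypothetical and used only for illustration. Reading the file back in binary mode returns exactly the bytes that were written, because no decoding step is involved.

```python
import os
import tempfile

# A short sample of undecodable bytes, taken from the question's example.
byte_string = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00'

# Hypothetical location for the raw data; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "upstream.bin")

# 'wb' writes the bytes untouched: no codec, no error handler.
with open(path, "wb") as f:
    f.write(byte_string)

# 'rb' reads them back untouched as well.
with open(path, "rb") as f:
    read_back = f.read()

print(read_back == byte_string)  # True
```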

solution1
2 ACCPTED 2020-09-08 23:13:54
