Binary data gets written as string literal - how to convert it back to bytes?

Question

I am writing compressed data as a bytes type to a black-box API (i.e. I cannot change what happens under the hood). When I get that data back, it is returned as a str, which I cannot decompress using the standard Python modules (zlib, bz2, etc.).

In more detail, part of the problem is that this string includes the leading 'b' , eg
b'x\\x9c\\xabV*HL\\xd1\\xcd\\xccK\\xcbW\\xb2RPJ\\xcb\\xcfOJ,R\\xaa\\x05\\x00T\\x83\\x07b'
(this is a string type).

When I compare this to the original binary representation, it is identical apart from the quotes and the leading b.

If I try to simply convert back to bytes (eg using the bytes function) it wraps the whole thing and escapes the slashes and I get something like the following:

b"b'x\\\\x9c\\\\xabV*HL\\\\xd1\\\\xcd\\\\xccK\\\\xcbW\\\\xb2RPJ\\\\xcb\\\\xcfOJ,R\\\\xaa\\\\x05\\\\x00T\\\\x83\\\\x07b'"

The question is: is it possible to convert this back to a bytes type so I can decompress it? If so, how?

I've seen a few different examples (eg How to cast a string to bytes without encoding ) that don't quite work out for what I'm trying.

UPDATE:

Lots of good answers, thanks folks! I wish I could click accept on multiple of them. And yes, as many of you noted, it is zlib compressed. This is by design as we have extremely limited space to work with and would like to stay with JSON if possible (zlib was chosen arbitrarily to just get the quirks of binary data out, and may not be the final choice).

Answer 1

Assuming your original value is of type str, you have the equivalent of the following raw string (each \xNN is four literal characters, not an actual escape code representing one byte):

s = r"b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'"

If you remove the leading b' and ' , you can use the latin1 encoding to convert to bytes. latin1 is a 1:1 mapping of Unicode code points to byte values, because the first 256 Unicode code points represent the latin1 character set:

>>> b = s[2:-1].encode('latin1')
>>> b
b'x\\x9c\\xabV*HL\\xd1\\xcd\\xccK\\xcbW\\xb2RPJ\\xcb\\xcfOJ,R\\xaa\\x05\\x00T\\x83\\x07b'

This is now a byte string, but it contains literal escape codes. Now decode it with the unicode_escape codec to translate those escape codes into a str of the actual code points:

>>> s2 = b.decode('unicode_escape')
>>> s2
'x\x9c«V*HLÑÍÌKËW²RPJËÏOJ,Rª\x05\x00T\x83\x07b'

This is now a Unicode string, with code points, but we still need a byte string. Encode with latin1 again:

>>> b2 = s2.encode('latin1')
>>> b2
b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'
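The 1:1 property of latin1 that these steps rely on can be checked directly. This is only a small sketch; the variable names are illustrative and not part of the original answer:

```python
# Under latin1, every byte value 0..255 maps to the code point with the same number.
all_bytes = bytes(range(256))
as_text = all_bytes.decode('latin1')

# Compare each character's code point with the byte it came from.
same_numbers = all(ord(ch) == byte for ch, byte in zip(as_text, all_bytes))

# Encoding the text again gives back exactly the original bytes.
round_trip = as_text.encode('latin1') == all_bytes

print(same_numbers, round_trip)
```

Because nothing is lost in either direction, latin1 is a convenient way to move between str and bytes when the str holds only code points below 256.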

In one step:

>>> s = r"b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'"
>>> b = s[2:-1].encode('latin1').decode('unicode_escape').encode('latin1')
>>> b
b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'

It appears this sample data is a zlib-compressed JSON string:

>>> import zlib, json
>>> json.loads(zlib.decompress(b))
{'pad-info': 'foobar'}
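The steps above could be wrapped in a small function. This is a minimal sketch: the name literal_text_to_bytes and its sanity check on the surrounding quotes are assumptions added for illustration, not part of the original answer.

```python
import json
import zlib


def literal_text_to_bytes(text):
    """Hypothetical helper: turn the text "b'...'" into the bytes it describes."""
    # Require a leading b, a quote character, and the same quote at the end.
    if len(text) < 3 or text[0] != 'b' or text[1] not in '\'"' or text[-1] != text[1]:
        raise ValueError('expected text that looks like a bytes literal')
    inner = text[2:-1]
    # str -> bytes (1:1), interpret the escape codes, then str -> bytes (1:1) again.
    return inner.encode('latin1').decode('unicode_escape').encode('latin1')


sample = r"b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'"
payload = json.loads(zlib.decompress(literal_text_to_bytes(sample)))
print(payload)
```

The quote check is only a guard against passing in something that is obviously not a stringified bytes object; it does not validate the escape codes themselves.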

Answer 2

You can take the bytes from your string by selecting everything except the first two characters (b') and the last character ('). Then convert that first to bytes and then decode it back to a string.

Here is an example:

str(bytes(bytes_string[2:-1], encoding), encoding)

Where:

bytes_string = "b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'"

and encoding is the encoding used in the bytes string (e.g. 'UTF-8').
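As a quick check of what this expression produces, here is a sketch that assumes bytes_string holds actual code points (as the non-raw literal above does) and that encoding is 'UTF-8'. The final latin1 step is an assumption added here, not part of this answer: it is one way to get a bytes object that zlib accepts when each character stands for exactly one original byte.

```python
import zlib

# Non-raw literal: each \xNN below is a single character, not four.
bytes_string = "b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'"
encoding = 'UTF-8'

inner = bytes_string[2:-1]

# The expression from the answer: encode, then decode with the same encoding.
round_tripped = str(bytes(inner, encoding), encoding)
print(type(round_tripped).__name__)  # the result is a str, not bytes

# Assumption: one code point per original byte, so latin1 maps them back 1:1.
as_bytes = inner.encode('latin1')
print(zlib.decompress(as_bytes))
```

If the string instead contains literal backslash escapes (four characters per byte), this latin1 step alone is not enough and the escapes still have to be interpreted.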

Answer 3

What is happening is this:

The black-box server is stringifying the bytes before sending them. You need to take the string that represents bytes and turn it back into bytes. The easiest way to do this is with literal_eval from the ast (Abstract Syntax Trees) module.

import ast
import zlib

stringified_bytes = "b'x\\x9c\\xabV*HL\\xd1\\xcd\\xccK\\xcbW\\xb2RPJ\\xcb\\xcfOJ,R\\xaa\\x05\\x00T\\x83\\x07b'"
print(f"{type(stringified_bytes)}: {stringified_bytes}")

actual_bytes = ast.literal_eval(stringified_bytes)
print(f"{type(actual_bytes)}: {actual_bytes}")

answer = zlib.decompress(actual_bytes)
print(f"Answer: {answer}")

Here is a run of the script:

(venv) [ttucker@zim stackoverflow]$ python bin.py 
<class 'str'>: b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'
<class 'bytes'>: b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'
Answer: b'{"pad-info": "foobar"}'
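The script above can be hardened a little. The sketch below assumes the black box does the equivalent of calling str() on the bytes, which is a guess about its internals; the helper name parse_bytes_literal is hypothetical. It rejects any text that does not evaluate to a bytes object.

```python
import ast
import zlib


def parse_bytes_literal(text):
    """Hypothetical helper: evaluate text as a Python literal, accept only bytes."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError) as exc:
        raise ValueError(f'not a valid Python literal: {text!r}') from exc
    if not isinstance(value, bytes):
        raise ValueError(f'expected a bytes literal, got {type(value).__name__}')
    return value


# Simulate what the black box is assumed to do: call str() on the bytes.
original = zlib.compress(b'{"pad-info": "foobar"}')
stringified = str(original)

recovered = parse_bytes_literal(stringified)
print(recovered == original)
print(zlib.decompress(recovered))
```

ast.literal_eval only evaluates literals, so unlike eval it will not execute arbitrary code that happens to arrive in the string.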

... this is pretty interesting stuff ... it looks like they have another byte-string with JSON in it. Is this like one of those hacker encoding challenges?

You have a zlib file, by the way

I know this because the first two bytes of the data are 78 9c ( x = 78 in hex) ... and if you look that up here: https://en.wikipedia.org/wiki/List_of_file_signatures , you can see it is zlib.

So, I used the zlib library to decompress it ... Neat stuff.
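The header check described above can also be done in code rather than by eye. This is a rough sketch based on the two-byte zlib header defined in RFC 1950; the function name looks_like_zlib is hypothetical, and passing the check only suggests zlib data, it does not prove the stream is valid.

```python
def looks_like_zlib(data):
    """Hypothetical check based on the two-byte zlib header (RFC 1950)."""
    if len(data) < 2:
        return False
    cmf, flg = data[0], data[1]
    # The low four bits of the first byte give the method; 8 means deflate.
    uses_deflate = (cmf & 0x0F) == 8
    # The two header bytes, read as a 16-bit number, must be a multiple of 31.
    checksum_ok = (cmf * 256 + flg) % 31 == 0
    return uses_deflate and checksum_ok


print(looks_like_zlib(b'x\x9c\xabV*HL'))  # starts with 78 9c
print(looks_like_zlib(b'{"pad-info"'))    # plain JSON text
```

Calling zlib.decompress and catching zlib.error remains the definitive test; the header check is just a cheap first filter.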
