<bytes> to escaped <str> Python 3

Currently, I have Python 2.7 code that receives <str> objects over a socket connection. All across the code we use <str> objects, comparisons, etc. In an effort to convert to Python 3, I've found that socket connections now return <bytes> objects, which requires us to change all literals to the b'abc' form to make literal comparisons, etc. This is a lot of work, and although it is apparent why this change was made in Python 3, I am curious whether there are any simpler workarounds.

Say I receive <bytes> b'\xf2a27' over a socket connection. Is there a simple way to convert these <bytes> into a <str> object with the same escapes in Python 3.6? I have looked into some solutions myself to no avail.

a = b'\xf2a27'.decode('utf-8', errors='backslashreplace')

The above yields '\\xf2a27' with len(a) = 7 instead of the original len(b'\xf2a27') = 4. Indexing is wrong too; this just won't work, but it seems to be headed down the right path.

a = b'\xf2a27'.decode('latin1')

The above yields 'òa27', which contains Unicode characters that I would like to avoid. In this case len(a) = 5 and comparisons like a[0] == '\xf2' work, but I'd like to keep the information escaped in its representation if possible.

Is there perhaps a more elegant solution that I am missing?

You really have to think about what the data you receive represents, and Python 3 makes a strong point in that direction. There's an important difference between a string of bytes that actually represents a collection of bytes and a string of (abstract, Unicode) characters.
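A minimal sketch of that difference, using only built-in types. The variable names are illustrative, and treating the bytes as Latin-1 text is an assumption made just for the demonstration:

```python
raw = b'\xf2a27'             # four bytes as they might arrive from a socket
text = raw.decode('latin1')  # assumption: the bytes are Latin-1 encoded text

# Indexing bytes yields an int, indexing str yields a one-character str.
first_byte = raw[0]
first_char = text[0]

# bytes and str never compare equal in Python 3, even with similar content.
same = (b'a27' == 'a27')

print(type(first_byte).__name__, type(first_char).__name__, same)  # int str False
```

This is why Python 2 style comparisons against plain string literals stop matching once the data arrives as bytes.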

You may have to think about each piece of data individually if they can have different representations.

Let's take your example of b'\xf2a27', which in the raw form you receive from the socket is just a string of 4 bytes: 0xf2, 0x61, 0x32, 0x37 in hex or 242, 97, 50, 55 in decimal.

  1. Let's say you actually want 4 bytes out of that. You could either keep it as a byte string or convert it into a list or tuple of bytes if that serves you better:

     raw_bytes = b'\xf2a27'
     list_of_bytes = list(raw_bytes)
     tuple_of_bytes = tuple(raw_bytes)

     if raw_bytes == b'\xf2a27':
         pass
     if list_of_bytes == [0xf2, 0x61, 0x32, 0x37]:
         pass
     if tuple_of_bytes == (0xf2, 0x61, 0x32, 0x37):
         pass

  2. Let's say this actually represents a 32-bit integer in which case you should convert it into a Python int . Choose whether it is encoded in little or big endian byte order and make sure you pick the correct one of signed vs. unsigned.

     import struct

     raw_bytes = b'\xf2a27'

     # Each pair computes the same value twice: once with struct.unpack,
     # once with int.from_bytes. Pick whichever reads better in your code.
     signed_little_endian, = struct.unpack('<i', raw_bytes)
     signed_little_endian = int.from_bytes(raw_bytes, byteorder='little', signed=True)

     unsigned_little_endian, = struct.unpack('<I', raw_bytes)
     unsigned_little_endian = int.from_bytes(raw_bytes, byteorder='little', signed=False)

     signed_big_endian, = struct.unpack('>i', raw_bytes)
     signed_big_endian = int.from_bytes(raw_bytes, byteorder='big', signed=True)

     unsigned_big_endian, = struct.unpack('>I', raw_bytes)
     unsigned_big_endian = int.from_bytes(raw_bytes, byteorder='big', signed=False)

     if signed_little_endian == 926048754:
         pass

  3. Let's say it's actually text. Think about what encoding it comes in. In your case it cannot be UTF-8, as b'\xf2' would be a byte string that cannot be correctly decoded as UTF-8. If it's latin1 aka iso8859-1 and you're sure about it, that's fine.

     raw_bytes = b'\xf2a27'
     character_string = raw_bytes.decode('iso8859-1')

     if character_string == '\xf2a27':
         pass

    If your choice of encoding was correct, having a '\xf2' or 'ò' character inside the string will also be correct. It's still a single character. 'ò', '\xf2', '\u00f2' and '\U000000f2' are just 4 different ways to represent the same single character in a (Unicode) string literal. Also, the len will be 4, not 5.

     print(ord(character_string[0]))       # will be 242
     print(hex(ord(character_string[0])))  # will be 0xf2
     print(len(character_string))          # will be 4

    If you actually observed a length of 5, you may have observed it at the wrong point, perhaps after encoding the character string to UTF-8 or having it implicitly encoded to UTF-8 by printing to a UTF-8 terminal.

    Note the difference of the number of bytes output to the shell when changing the default I/O encoding:

     PYTHONIOENCODING=UTF-8 python3 -c 'print(b"\xf2a27".decode("latin1"), end="")' | wc -c    # will output 5
     PYTHONIOENCODING=latin1 python3 -c 'print(b"\xf2a27".decode("latin1"), end="")' | wc -c   # will output 4
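The same effect can be reproduced without a shell by encoding the decoded string explicitly. This is a small sketch that assumes the data is Latin-1 text; the variable names are illustrative:

```python
raw_bytes = b'\xf2a27'
character_string = raw_bytes.decode('latin1')

# The string has 4 characters regardless of how it is later encoded.
n_chars = len(character_string)

# U+00F2 takes 2 bytes in UTF-8 but only 1 byte in Latin-1.
n_utf8 = len(character_string.encode('utf-8'))
n_latin1 = len(character_string.encode('latin1'))

print(n_chars, n_utf8, n_latin1)  # 4 5 4
```

A length of 5 therefore belongs to the UTF-8 encoded bytes, not to the character string itself.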

Ideally, you should perform your comparisons after converting the raw bytes to the correct data type they represent. That makes your code more readable and easier to maintain.

As a general rule of thumb, you should always convert raw bytes to their actual (abstract) data type as soon as you receive them. Then keep it in that abstract data type for processing as long as possible. If necessary, convert it back to some raw data on output.
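One way this rule of thumb could look in code is sketched below. The helper names (`decode_incoming`, `encode_outgoing`, `handle_message`), the reply strings, and the choice of Latin-1 are all hypothetical and only serve the illustration; the real encoding depends on what the peer actually sends.

```python
ENCODING = 'latin1'  # assumption: the peer sends Latin-1 text


def decode_incoming(raw: bytes) -> str:
    """Convert raw socket bytes to text as soon as they arrive."""
    return raw.decode(ENCODING)


def encode_outgoing(text: str) -> bytes:
    """Convert text back to raw bytes only when writing it out."""
    return text.encode(ENCODING)


def handle_message(raw: bytes) -> bytes:
    message = decode_incoming(raw)
    # Everything in between works on str, so comparisons against
    # ordinary string literals behave as expected.
    if message.startswith('\xf2'):
        reply = 'ack:' + message[1:]
    else:
        reply = 'unknown'
    return encode_outgoing(reply)


print(handle_message(b'\xf2a27'))  # b'ack:a27'
```

With the conversion confined to the two boundary functions, the rest of the code only ever sees str, and a change of wire encoding touches a single constant.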


 