<bytes> to escaped <str> Python 3

Currently, I have Python 2.7 code that receives <str> objects over a socket connection. All across the code we use <str> objects, comparisons, etc. In an effort to convert to Python 3, I've found that socket connections now return <bytes> objects, which requires us to change all literals to the form b'abc' to make literal comparisons, etc. This is a lot of work, and although it is apparent why this change was made in Python 3, I am curious whether there are any simpler workarounds.

Say I receive <bytes> b'\xf2a27' over a socket connection. Is there a simple way to convert these <bytes> into a <str> object with the same escapes in Python 3.6? I have looked into some solutions myself to no avail.

a = b'\xf2a27'.decode('utf-8', errors='backslashreplace')

Above yields '\\xf2a27' with len(a) = 7 instead of the original len(b'\xf2a27') = 3. Indexing is wrong too; this just won't work, but it seems like it is headed down the right path.

a = b'\xf2a27'.decode('latin1')

Above yields 'òa27', which contains Unicode characters that I would like to avoid. In this case len(a) = 5 and comparisons like a[0] == '\xf2' work, but I'd like to keep the information escaped in the representation if possible.

Is there perhaps a more elegant solution that I am missing?

You really have to think about what the data you receive represents, and Python 3 makes a strong point in that direction. There's an important difference between a string of bytes that actually represents a collection of bytes and a string of (abstract, Unicode) characters.

You may have to think about each piece of data individually if they can have different representations.

Let's take your example of b'\xf2a27', which in the raw form you receive from the socket is just a string of 4 bytes: 0xf2, 0x61, 0x32, 0x37 in hex, or 242, 97, 50, 55 in decimal.

  1. Let's say you actually want 4 bytes out of that. You could either keep it as a byte string or convert it into a list or tuple of bytes if that serves you better:

     raw_bytes = b'\xf2a27'
     list_of_bytes = list(raw_bytes)
     tuple_of_bytes = tuple(raw_bytes)
     if raw_bytes == b'\xf2a27':
         pass
     if list_of_bytes == [0xf2, 0x61, 0x32, 0x37]:
         pass
     if tuple_of_bytes == (0xf2, 0x61, 0x32, 0x37):
         pass

  2. Let's say this actually represents a 32-bit integer, in which case you should convert it into a Python int. Choose whether it is encoded in little- or big-endian byte order, and make sure you pick the correct one of signed vs. unsigned.

     import struct

     raw_bytes = b'\xf2a27'

     # Each pair shows two equivalent ways to perform the same conversion.
     signed_little_endian, = struct.unpack('<i', raw_bytes)
     signed_little_endian = int.from_bytes(raw_bytes, byteorder='little', signed=True)
     unsigned_little_endian, = struct.unpack('<I', raw_bytes)
     unsigned_little_endian = int.from_bytes(raw_bytes, byteorder='little', signed=False)
     signed_big_endian, = struct.unpack('>i', raw_bytes)
     signed_big_endian = int.from_bytes(raw_bytes, byteorder='big', signed=True)
     unsigned_big_endian, = struct.unpack('>I', raw_bytes)
     unsigned_big_endian = int.from_bytes(raw_bytes, byteorder='big', signed=False)

     if signed_little_endian == 926048754:
         pass

  3. Let's say it's actually text. Think about what encoding it comes in. In your case it cannot be UTF-8, as b'\xf2' is a byte string that cannot be correctly decoded as UTF-8. If it's latin1, aka iso8859-1, and you're sure about it, that's fine.

     raw_bytes = b'\xf2a27'
     character_string = raw_bytes.decode('iso8859-1')
     if character_string == '\xf2a27':
         pass

    If your choice of encoding was correct, having a '\xf2' or 'ò' character inside the string will also be correct. It's still a single character. 'ò', '\xf2', '\u00f2' and '\U000000f2' are just 4 different ways to represent the same single character in a (Unicode) string literal. Also, the len will be 4, not 5.

     print(ord(character_string[0]))       # will be 242
     print(hex(ord(character_string[0])))  # will be 0xf2
     print(len(character_string))          # will be 4

    If you actually observed a length of 5, you may have observed it at the wrong point, perhaps after encoding the character string to UTF-8 or having it implicitly encoded to UTF-8 by printing to a UTF-8 terminal.

    Note the difference in the number of bytes output to the shell when changing the default I/O encoding:

     PYTHONIOENCODING=UTF-8 python3 -c 'print(b"\xf2a27".decode("latin1"), end="")' | wc -c
     # will output 5
     PYTHONIOENCODING=latin1 python3 -c 'print(b"\xf2a27".decode("latin1"), end="")' | wc -c
     # will output 4
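
The same length difference can be checked without a shell. The following is only a minimal sketch, assuming the data really is latin1 text; `escaped_view` is an illustrative name, and using the built-in `ascii()` for an escaped display form is a suggestion rather than part of the original answer.

```python
raw_bytes = b'\xf2a27'

# Decoding as latin1 maps each byte to exactly one character.
text = raw_bytes.decode('latin1')
assert len(text) == 4
assert text[0] == '\xf2' == '\u00f2' == 'ò'

# The count only grows once the text is encoded again as UTF-8,
# where U+00F2 takes two bytes; latin1 keeps it at one byte.
assert len(text.encode('utf-8')) == 5
assert len(text.encode('latin1')) == 4

# Assumption: if an ASCII-only, escaped view is wanted for logging or display,
# ascii() builds one without changing the underlying string.
escaped_view = ascii(text)
print(escaped_view)  # '\xf2a27'
```

The string itself stays 4 characters long; only its display form is escaped, so indexing and comparisons keep working on `text`.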

Ideally, you should perform your comparisons after converting the raw bytes to the correct data type they represent. That makes your code more readable and easier to maintain.

As a general rule of thumb, you should always convert raw bytes to their actual (abstract) data type as soon as you receive them. Then keep the data in that abstract type for processing as long as possible. If necessary, convert it back to raw data on output.
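
One way this rule of thumb might look in code is sketched below. The function names (`decode_incoming`, `handle_message`, `encode_outgoing`), the `ENCODING` constant, and the 'ack:' reply format are all hypothetical, and latin1 is assumed only because it matches the example above; a real protocol would dictate its own encoding.

```python
ENCODING = 'latin1'  # assumption: the peer sends latin1-encoded text


def decode_incoming(raw: bytes) -> str:
    """Convert bytes from the wire into text at the input boundary."""
    return raw.decode(ENCODING)


def handle_message(message: str) -> str:
    """Application logic works on str only, so plain str literals still compare."""
    if message.startswith('\xf2'):
        return 'ack:' + message[1:]
    return 'unknown'


def encode_outgoing(text: str) -> bytes:
    """Convert text back into bytes at the output boundary."""
    return text.encode(ENCODING)


# In real code the bytes would come from sock.recv(); a literal stands in here.
reply = encode_outgoing(handle_message(decode_incoming(b'\xf2a27')))
print(reply)  # b'ack:a27'
```

With this layout only the two boundary functions deal with <bytes>, so existing <str> literals and comparisons in the rest of the code would not need a b'' prefix.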


 