How to convert byte string to character with correct escaping?

I cannot figure out why decoding behaves differently when the byte's hex value starts with a, b, c, d, e or f instead of a digit: the result always contains two backslashes instead of one.

>>> bstr = b'\xfb'
>>> bstr.decode('utf8', 'backslashreplace')
'\\xfb'

What I want is '\xfb' instead.

but,

>>> bstr = b'\x1f'
>>> bstr.decode('utf8', 'backslashreplace')
'\x1f'

works as expected. Do you know what is wrong?

b'\xfb' is a bytestring containing a single byte. That byte has hex value FB, or 251 in decimal.

'\xfb' is a string containing a single Unicode code point. That code point is U+00FB LATIN SMALL LETTER U WITH CIRCUMFLEX, or û.
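The difference between the two objects can be checked directly. This is only a small sketch; the variable names `raw` and `text` are made up for illustration.

```python
import unicodedata

raw = b'\xfb'    # bytes object holding one byte
text = '\xfb'    # str object holding one code point

print(len(raw), raw[0])         # one byte, with integer value 251
print(len(text), ord(text))     # one code point, also numbered 251
print(unicodedata.name(text))   # LATIN SMALL LETTER U WITH CIRCUMFLEX
```

The two share the number 251, but one is a raw byte and the other is a character, so they are not interchangeable.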

b'\xfb' is not the UTF-8 encoding of '\xfb'. The UTF-8 encoding of '\xfb' is b'\xc3\xbb':

>>> '\xfb'.encode('utf-8')
b'\xc3\xbb'

In fact, b'\xfb' is not the UTF-8 encoding of anything at all, and trying to decode it as UTF-8 is an error. 'backslashreplace' specifies a way of handling that error, where the FB byte is replaced with the character sequence backslash-xfb.
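The behaviour described above can be seen in a short script. This is a sketch with illustrative variable names. It also shows why the interactive prompt displays two backslashes: the decoded result really is four characters long, and the REPL prints its repr, which escapes the literal backslash.

```python
raw = b'\xfb'

# Strict decoding (the default) raises, because 0xFB is not valid UTF-8.
try:
    raw.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
print(strict_failed)

# 'backslashreplace' substitutes the four characters \ x f b for the bad byte.
replaced = raw.decode('utf-8', 'backslashreplace')
print(len(replaced))     # 4
print(replaced)          # prints one backslash followed by xfb
print(repr(replaced))    # the repr doubles the backslash, as at the >>> prompt

# 0x1F is a valid one-byte UTF-8 sequence (an ASCII control character),
# so no error handler is involved and the result is a single character.
ok = b'\x1f'.decode('utf-8', 'backslashreplace')
print(len(ok))           # 1
```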

While it is possible to do a thing that will convert b'\xfb' to '\xfb', that conversion has nothing to do with UTF-8, and applying that conversion without getting your requirements straight will only cause more problems. You need to figure out what your program actually needs to be doing. Most likely, the right path forward doesn't involve any b'\xfb' to '\xfb' conversion. We can't tell what you need to do, since we're missing so much context.
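For illustration only, one conversion with that effect is decoding as Latin-1 (ISO-8859-1), which maps every byte value 0x00 to 0xFF to the code point with the same number. Whether that is appropriate depends entirely on what the bytes really represent, so treat this as a sketch under that assumption rather than a recommendation.

```python
raw = b'\xfb'

# Latin-1 maps each byte to the code point with the same numeric value.
text = raw.decode('latin-1')
print(text == '\xfb')          # True
print(len(text))               # 1

# Re-encoding the same character as UTF-8 gives two bytes, not the original one.
print(text.encode('utf-8'))    # b'\xc3\xbb'
```

If the data was actually produced in some other encoding, this yields the wrong characters without raising an error.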
