Why does Python automatically encode hex in strings as UTF-8?

I have been using Python to do ASCII-to-binary translations and kept running into issues when parsing the result. Eventually I thought to look at what the Python commands were actually generating.

There seems to be a rogue 0xc2 inserted in the output, for example:

$ python -c 'print("\x80")' | xxd
00000000: c280 0a                                  ...

Indeed this happens regardless of where such bytes are used:

$ python -c 'print("Test\x80Test2\x81")' | xxd
00000000: 5465 7374 c280 5465 7374 32c2 810a       Test..Test2...

On a hunch, I poked around at UTF-8 and sure enough, U+0080 is encoded as 0xc2 0x80 . Apparently, Python takes the liberty of assuming that by \x80 I actually meant the encoding for U+0080 . Is there a way to change this default behavior, or otherwise explicitly state my intention of including the single byte 0x80 and not its UTF-8 encoding?

Python 3.6.2

Python 3 does the right thing here: it inserts a character into a str, which is a string of characters, not a byte sequence.

UTF-8 is the default encoding, so U+0080 is written as two bytes. If you need that character to come out as a single byte, you need a different encoding, one in which the character is represented by one byte.

$ PYTHONIOENCODING=iso-8859-1 python3 -c 'print("\x80")' | xxd
00000000: 800a
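The same difference shows up inside Python, without a shell pipeline. This is just a small sketch (the variable names are mine): it encodes the one-character string with both codecs and compares the resulting bytes.

```python
# A str holds code points; bytes only appear once an encoding is chosen.
s = "\x80"                           # one character, code point U+0080

utf8_bytes = s.encode("utf-8")       # two bytes: 0xc2 0x80
latin1_bytes = s.encode("latin-1")   # one byte: 0x80 (latin-1 is an alias of iso-8859-1)

print(len(s), utf8_bytes, latin1_bytes)
```

The string itself never changes; only the choice of codec decides whether 0xc2 shows up in the output.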

PYTHONIOENCODING

If this is set before running the interpreter, it overrides the encoding used for stdin/stdout/stderr, in the syntax encodingname:errorhandler. Both the encodingname and the :errorhandler parts are optional and have the same meaning as in str.encode().
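To see roughly what the encodingname:errorhandler pair changes, you can wrap an in-memory buffer in a text stream yourself. This is only a sketch: printed_bytes is a hypothetical helper of mine, and the io.BytesIO buffer is a stand-in for the real stdout, not the interpreter's actual startup code.

```python
import io

def printed_bytes(text, encoding, errors="strict"):
    """Return the bytes print() emits through a text stream with this encoding."""
    raw = io.BytesIO()  # stand-in for the underlying binary stdout
    # newline="\n" keeps the line ending as a single 0x0a byte on every platform
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors, newline="\n")
    print(text, file=stream)
    stream.flush()      # push the encoded text down into the BytesIO
    return raw.getvalue()

utf8_out = printed_bytes("\x80", "utf-8")                      # like the default case
latin1_out = printed_bytes("\x80", "iso-8859-1")               # like PYTHONIOENCODING=iso-8859-1
ascii_out = printed_bytes("\x80", "ascii", errors="replace")   # like PYTHONIOENCODING=ascii:replace

print(utf8_out, latin1_out, ascii_out)
```

With the "replace" error handler the unencodable character becomes "?" instead of raising UnicodeEncodeError.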

If you want to output raw bytes in Python 3 you shouldn't be using the print function, since it's for outputting text in your default encoding. Instead, you can use sys.stdout.buffer.write .

ASCII is a 7-bit encoding, so if your so-called ASCII contains bytes like b'\x80' it's not legal ASCII. Perhaps your data is actually encoded with iso-8859-1, aka latin-1, or it could be the closely related Windows variant cp1252 . To do this kind of thing correctly you need to determine the actual encoding that was used to create the data.
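Here is a quick way to see why the real encoding matters, sketched with the single byte 0x80 (the variable names are just for illustration). The same byte is rejected by ASCII and means different characters in latin-1 and cp1252.

```python
raw = b"\x80"

# ASCII only covers 0x00-0x7f, so this byte cannot be decoded as ASCII.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

as_latin1 = raw.decode("latin-1")   # U+0080, a C1 control character
as_cp1252 = raw.decode("cp1252")    # U+20AC, the euro sign

print(ascii_ok, hex(ord(as_latin1)), hex(ord(as_cp1252)))
```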

If you want to output "Test\x80Test2\x81" and have the hex dump look like this:

00000000  54 65 73 74 80 54 65 73  74 32 81                 |Test.Test2.|

You can do

import sys
s = "Test\x80Test2\x81"
sys.stdout.buffer.write(s.encode('latin1'))

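This works because Latin-1 is a subset of Unicode: the first 256 Unicode code points are exactly the Latin-1 characters, so encoding any code point below 256 with latin1 yields the single byte with the same value. Here's a quick demo: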

import binascii

a = ''.join([chr(i) for i in range(256)])
b = a.encode('latin1')
print(binascii.hexlify(b))

output

b'000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f303132333435363738393a3b3c3d3e3f404142434445464748494a4b4c4d4e4f505152535455565758595a5b5c5d5e5f606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9fa0a1a2a3a4a5a6a7a8a9aaabacadaeafb0b1b2b3b4b5b6b7b8b9babbbcbdbebfc0c1c2c3c4c5c6c7c8c9cacbcccdcecfd0d1d2d3d4d5d6d7d8d9dadbdcdddedfe0e1e2e3e4e5e6e7e8e9eaebecedeeeff0f1f2f3f4f5f6f7f8f9fafbfcfdfeff'

However, if you're actually working with binary data then you shouldn't be storing it in text strings in the first place; you should be using bytes , or possibly bytearray . The sane way to produce the b bytes string from my previous example is to do

b = bytes(range(256))

And if you have a bytes object like b"Test\x80Test2\x81" you can dump those bytes to stdout with

sys.stdout.buffer.write(b"Test\x80Test2\x81")
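If the binary data is built up piece by piece rather than written as a literal, a bytearray keeps everything in bytes from the start. A minimal sketch (the names buf, payload and from_hex are mine):

```python
# Build the payload as bytes, never passing through str.
buf = bytearray(b"Test")
buf.append(0x80)        # append one byte by its integer value
buf += b"Test2"
buf.append(0x81)
payload = bytes(buf)    # immutable copy, ready to write

# The same bytes, spelled as a hex string.
from_hex = bytes.fromhex("5465737480546573743281")

print(payload == from_hex, payload.hex())
```

payload can then be passed to sys.stdout.buffer.write exactly like the literal above.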
