Why does Python automatically encode hex in strings as UTF-8?

I have been using Python to do ASCII-to-binary translations and kept running into issues with parsing the result. Eventually I thought to look at what the Python commands were generating.

There seems to be a rogue 0xc2 inserted in the output, for example:

$ python -c 'print("\x80")' | xxd
00000000: c280 0a                                  ...

Indeed, this happens regardless of where such bytes are used:

$ python -c 'print("Test\x80Test2\x81")' | xxd
00000000: 5465 7374 c280 5465 7374 32c2 810a       Test..Test2...

On a hunch, I poked around at UTF-8 and sure enough, U+0080 is encoded as 0xc2 0x80. Apparently, Python takes the liberty of assuming that by \x80 I actually meant the encoding for U+0080. Is there a way to change this default behavior, or otherwise explicitly state my intention of including the single byte 0x80 and not its UTF-8 encoding?

Python 3.6.2

Python 3 does the right thing: it inserts a character into a str, which is a string of characters, not a byte sequence.
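To make the character-versus-byte distinction concrete, here is a minimal sketch. The variable names are only illustrative. It shows that "\x80" is a single code point in a str, and that bytes only appear once an encoding is chosen.

```python
# A str holds code points; bytes only exist after an encoding is applied.
s = "\x80"

code_points = [hex(ord(ch)) for ch in s]   # one character, U+0080
utf8_bytes = s.encode("utf-8")             # two bytes: 0xc2 0x80
latin1_bytes = s.encode("latin-1")         # one byte: 0x80

print(len(s), code_points)
print(utf8_bytes.hex())
print(latin1_bytes.hex())
```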

UTF-8 is the default encoding. If you need to output a single byte, you need a different encoding in which that character is represented as one byte.

$ PYTHONIOENCODING=iso-8859-1 python3 -c 'print("\x80")' | xxd
00000000: 800a

PYTHONIOENCODING

If this is set before running the interpreter, it overrides the encoding used for stdin/stdout/stderr, in the syntax encodingname:errorhandler. Both the encodingname and the :errorhandler parts are optional and have the same meaning as in str.encode().
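The effect of the stream encoding can be imitated without touching the environment. The sketch below is only a stand-in for the real stdout: `printed_bytes` is a hypothetical helper, not part of Python, that wraps an in-memory buffer in an `io.TextIOWrapper` with a chosen encoding and error handler and returns what print() would emit.

```python
import io

def printed_bytes(text, encoding, errors="strict"):
    """Return the bytes print() emits through a text stream with this encoding.

    Hypothetical helper for illustration; it mimics a text-mode stdout.
    """
    raw = io.BytesIO()
    # newline="\n" disables newline translation so the output is the same on every OS
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors, newline="\n")
    print(text, file=stream)
    stream.flush()
    return raw.getvalue()

print(printed_bytes("\x80", "utf-8"))
print(printed_bytes("\x80", "iso-8859-1"))
print(printed_bytes("\x80", "ascii", errors="replace"))
```

The encoding and errors arguments here play the same roles as the encodingname and :errorhandler parts of the variable.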

If you want to output raw bytes in Python 3 you shouldn't be using the print function, since it's for outputting text in your default encoding. Instead, you can use sys.stdout.buffer.write.

ASCII is a 7-bit encoding, so if your so-called ASCII contains bytes like b'\x80', it's not legal ASCII. Perhaps your data is actually encoded with iso-8859-1, aka latin-1, or it could be the closely related Windows variant cp1252. To do this kind of thing correctly you need to determine the actual encoding that was used to create the data.
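Why the actual encoding matters can be seen by decoding the same byte three ways. This is a small sketch with illustrative variable names, using only the standard codecs named above.

```python
raw = b"\x80"

as_latin1 = raw.decode("latin-1")   # U+0080, a C1 control character
as_cp1252 = raw.decode("cp1252")    # U+20AC, the euro sign

# 0x80 is outside the 7-bit ASCII range, so the ascii codec rejects it
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Python's cp1252 codec leaves a few bytes, such as 0x81, unassigned
try:
    b"\x81".decode("cp1252")
    cp1252_has_0x81 = True
except UnicodeDecodeError:
    cp1252_has_0x81 = False

print(hex(ord(as_latin1)), hex(ord(as_cp1252)), ascii_ok, cp1252_has_0x81)
```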

If you want to output "Test\x80Test2\x81" and have the hex dump look like this:

00000000  54 65 73 74 80 54 65 73  74 32 81                 |Test.Test2.|

you can do:

import sys
s = "Test\x80Test2\x81"
sys.stdout.buffer.write(s.encode('latin1'))

This works because Latin-1 is a subset of Unicode: its 256 byte values map one-to-one onto the code points U+0000 through U+00FF. Here's a quick demo:

import binascii

a = ''.join([chr(i) for i in range(256)])
b = a.encode('latin1')
print(binascii.hexlify(b))

Output:

b'000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f303132333435363738393a3b3c3d3e3f404142434445464748494a4b4c4d4e4f505152535455565758595a5b5c5d5e5f606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9fa0a1a2a3a4a5a6a7a8a9aaabacadaeafb0b1b2b3b4b5b6b7b8b9babbbcbdbebfc0c1c2c3c4c5c6c7c8c9cacbcccdcecfd0d1d2d3d4d5d6d7d8d9dadbdcdddedfe0e1e2e3e4e5e6e7e8e9eaebecedeeeff0f1f2f3f4f5f6f7f8f9fafbfcfdfeff'
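The same correspondence can be checked from the bytes side, along with its limit. The sketch below uses illustrative variable names. It verifies that each byte decodes to the code point with the same number, and that a character above U+00FF cannot be encoded as Latin-1.

```python
# Every byte value decodes to the code point with the same number
all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1")
same_numbers = [ord(ch) for ch in text] == list(all_bytes)
round_trips = text.encode("latin-1") == all_bytes

# Code points above U+00FF have no Latin-1 byte, so encoding fails
try:
    "\u0100".encode("latin-1")
    fits = True
except UnicodeEncodeError:
    fits = False

print(same_numbers, round_trips, fits)
```

So the Latin-1 trick only applies to strings whose characters all lie in the range U+0000 to U+00FF.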

However, if you're actually working with binary data then you shouldn't be storing it in text strings in the first place; you should be using bytes, or possibly bytearray. The sane way to produce the b bytes string from my previous example is:

b = bytes(range(256))

And if you have a bytes object like b"Test\x80Test2\x81" you can dump those bytes to stdout with:

import sys
sys.stdout.buffer.write(b"Test\x80Test2\x81")
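For completeness, here is a hedged sketch of a few ways to build such a payload without going through str at all. The names are illustrative, and an `io.BytesIO` object stands in for sys.stdout.buffer so the written bytes can be inspected.

```python
import io

payload = b"Test\x80Test2\x81"

# The same bytes built from a hex string
from_hex = bytes.fromhex("54657374 80 5465737432 81")

# bytearray is the mutable counterpart, handy when assembling data piece by piece
buf = bytearray(b"Test")
buf.append(0x80)
buf += b"Test2"
buf.append(0x81)

# A binary stream accepts bytes and bytearray directly; no encoding step is involved
sink = io.BytesIO()   # stands in for sys.stdout.buffer
sink.write(payload)

print(from_hex == payload, bytes(buf) == payload, sink.getvalue().hex())
```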
