Python utf-8 encoding not following unicode rules

Background: I've got a byte file that is encoded using unicode. However, I can't figure out the right method to get Python to decode it to a string. Sometimes it uses 1-byte ASCII text. The majority of the time it uses 2-byte "plain latin" text, but it can possibly contain any unicode character. So my Python program needs to be able to decode that and handle it. Unfortunately byte_string.decode('unicode') isn't a thing, so I need to specify another encoding scheme. Using Python 3.9.

I've read through the Python doc on unicode and utf-8. If Python uses unicode for its strings, and utf-8 as default, this should be pretty straightforward, yet I keep getting incorrect decodes.

If I understand how unicode works, the most significant byte is the character code, and the least significant byte is the lookup value in the decode table. So I would expect:
0x00_41 => A
0x00_F2 => ò
0x65_03_01 => é (e with combining acute accent)

I wrote a short test file to experiment with these byte combinations, and I'm running into a few situations that I don't understand (despite extensive reading).

Example code:

def main():
    print("Starting MAIN...")

    vrsn_bytes = b'\x76\x72\x73\x6E'
    serato_bytes = b'\x00\x53\x00\x65\x00\x72\x00\x61\x00\x74\x00\x6F'
    special_bytes = b'\xB2\xF2'  
    combining_bytes = b'\x41\x75\x64\x65\x03\x01'  

    print(f"vrsn_bytes: {vrsn_bytes}")
    print(f"serato_bytes: {serato_bytes}")
    print(f"special_bytes: {special_bytes}")
    print(f"combining_bytes: {combining_bytes}")
    
    encoding_method = 'utf-8'  # also tried latin-1 and cp1252
    vrsn_str = vrsn_bytes.decode(encoding_method)
    serato_str = serato_bytes.decode(encoding_method)
    special_str = special_bytes.decode(encoding_method)
    combining_str = combining_bytes.decode(encoding_method)
    print(f"vrsn_str: {vrsn_str}")
    print(f"serato_str: {serato_str}")
    print(f"special_str: {special_str}")
    print(f"combining_str: {combining_str}")

    return True

if __name__ == '__main__':

    print("Starting Command Line Experiment!")
    
    if not main():
        print("\n Command Line Test FAILED!!")
    else:
        print("\n Command Line Test PASSED!!")

Issue 1: utf-8 encoding. As the experiment is written, I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 0: invalid start byte

I don't understand why this fails to decode. According to the unicode decode table, 0x00B2 should be "SUPERSCRIPT TWO". In fact, it seems like anything above 0x7F returns the same UnicodeDecodeError.

I know that some encoding schemes only support 7 bits, which seems to be what is happening, but utf-8 should support not only 8 bits, but multiple bytes.

If I change encoding_method to encoding_method = 'latin-1', which extends the original ASCII 128 characters to 256 characters (up to 0xFF), then I get better output:

vrsn_str: vrsn
serato_str: Serato
special_str: ²ò
combining_str: Aude

However, this encoding is not handling the 2-byte codes properly. \x00_53 should be S, not �S, and none of the encoding methods I'll mention in this post handle the combining acute accent after Aude properly.

So far I've tried many different encoding methods, but the ones that come closest are: unicode_escape, latin-1, and cp1252. While I expect utf-8 to be what I'm supposed to use, it does not behave the way the Python doc linked above describes.

Any help is appreciated. Besides trying more methods, I don't understand why this isn't decoding according to the table in link 3.

This isn't actually a Python issue; it's how you're encoding the character. To convert a unicode code point to utf-8, you do not simply take the bytes of the code point's numeric value.

For example, the code point U+2192 is →. The actual binary representation in utf-8 is: 0xE28692, or 11100010 10000110 10010010

As we can see, this is 3 bytes, not 2 as we'd expect if we only used the position. To get correct behavior, you can either do the encoding by hand, or use a converter such as this one:

https://onlineunicodetools.com/convert-unicode-to-binary

This will let you input a unicode character and get the utf-8 binary representation.
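Python itself can also act as the converter. The following is a minimal sketch rather than part of the original answer; the helper name `utf8_bytes` is made up for illustration. It encodes a few of the code points discussed here and prints the resulting bytes in hex and binary.

```python
def utf8_bytes(code_point: int) -> bytes:
    """Return the UTF-8 encoding of a single Unicode code point.

    `utf8_bytes` is a hypothetical helper name, not a standard function.
    """
    return chr(code_point).encode("utf-8")


# Code points taken from the question and the answer above.
for code_point in (0x0041, 0x00B2, 0x00F2, 0x2192):
    encoded = utf8_bytes(code_point)
    bits = " ".join(f"{byte:08b}" for byte in encoded)
    print(f"U+{code_point:04X} -> 0x{encoded.hex().upper()} ({bits})")
```

Only U+0041 comes out as a single byte equal to its code point value; every code point above U+007F turns into two or more bytes.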

To get correct output for ò, we need to use 0xC3B2.

>>> s = b'\xC3\xB2'
>>> print(s.decode('utf-8'))
ò
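If the file really stores every character as a fixed 2-byte big-endian value, as the `serato_bytes` sample in the question suggests, then the bytes may be UTF-16 rather than UTF-8. That is an assumption; neither the question nor this answer confirms the file's actual encoding. The sketch below decodes the question's `serato_bytes` with Python's `utf-16-be` codec. The `audee_bytes` value is a hypothetical re-encoding of the question's `combining_bytes` with every code point written as two bytes, since the original sample mixes 1-byte and 2-byte values.

```python
import unicodedata

# Sample from the question: each character stored as 2 bytes, big-endian.
serato_bytes = b"\x00\x53\x00\x65\x00\x72\x00\x61\x00\x74\x00\x6F"

# Hypothetical variant of the question's combining_bytes, with every
# code point (including U+0301, the combining acute accent) as 2 bytes.
audee_bytes = b"\x00\x41\x00\x75\x00\x64\x00\x65\x03\x01"

serato_str = serato_bytes.decode("utf-16-be")
combining_str = audee_bytes.decode("utf-16-be")

# NFC normalization merges "e" + U+0301 into the single code point U+00E9.
composed = unicodedata.normalize("NFC", combining_str)

print(serato_str)                         # Serato
print(len(combining_str), len(composed))  # 5 4
```

The decoded string keeps the combining accent as its own code point; normalization is only needed if a single precomposed character is wanted.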

You can't use the direct binary representation because of the header bits in each byte. In utf-8, a code point is encoded as 1, 2, 3, or 4 bytes. For example, to signify a 1-byte code point, the first bit is encoded as a 0. This means that we can only store 2^7 1-byte code points. So, the code point U+0080, which is a control character, must be encoded as the 2-byte sequence 11000010 10000000.
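The header check can be sketched in a few lines. This is a simplified illustration, not how Python's codec is implemented, and `utf8_sequence_length` is a made-up name. It reads only the leading bits of the first byte, which is enough to show why 0xB2 (binary 10110010) is reported as an "invalid start byte": the prefix 10 marks a continuation byte.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the sequence length announced by a UTF-8 lead byte.

    Simplified: checks only the header bits, so a few lead bytes that
    real decoders reject (0xC0, 0xC1, 0xF5 and above) pass here.
    Raises ValueError for bytes that cannot start a sequence.
    """
    if lead_byte & 0b1000_0000 == 0b0000_0000:  # 0xxxxxxx
        return 1
    if lead_byte & 0b1110_0000 == 0b1100_0000:  # 110xxxxx
        return 2
    if lead_byte & 0b1111_0000 == 0b1110_0000:  # 1110xxxx
        return 3
    if lead_byte & 0b1111_1000 == 0b1111_0000:  # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte; 11111xxx is never used.
    raise ValueError(f"0x{lead_byte:02X} is not a valid UTF-8 start byte")


for byte in (0x41, 0xC3, 0xE2, 0xB2):
    try:
        print(f"0x{byte:02X}: {utf8_sequence_length(byte)}-byte sequence")
    except ValueError as error:
        print(error)
```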

For this character, the first byte begins with the header 110, while the second byte begins with the header 10. This means that the data for the code point is stored in the last 5 bits of the first byte and the last 6 bits of the second byte. If we combine those, we get 00010 000000, which is equivalent to 0x80.
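The bit extraction described above can be done by hand with masks and a shift. The function below is an illustrative sketch for 2-byte sequences only; `decode_two_byte_utf8` is a hypothetical name, and real code should simply call `bytes.decode('utf-8')`.

```python
def decode_two_byte_utf8(data: bytes) -> int:
    """Return the code point stored in a 2-byte UTF-8 sequence."""
    if len(data) != 2:
        raise ValueError("expected exactly 2 bytes")
    first, second = data
    # The first byte must start with 110, the second with 10.
    if first & 0b1110_0000 != 0b1100_0000:
        raise ValueError("first byte lacks the 110 header")
    if second & 0b1100_0000 != 0b1000_0000:
        raise ValueError("second byte lacks the 10 header")
    # Keep the last 5 bits of the first byte and the last 6 of the second.
    return ((first & 0b0001_1111) << 6) | (second & 0b0011_1111)


for sample in (b"\xC2\x80", b"\xC3\xB2"):
    code_point = decode_two_byte_utf8(sample)
    # Cross-check against Python's built-in decoder.
    assert chr(code_point) == sample.decode("utf-8")
    print(f"{sample.hex().upper()} -> U+{code_point:04X}")
```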
