
How do I convert an int representing a UTF-8 character into a Unicode code point?

Let us use the character Latin Capital Letter A with Ogonek (U+0104) as an example.

I have an int that represents its UTF-8 encoded form:

my_int = 0xC484
# Decimal: `50308`
# Binary: `0b1100010010000100`

If I use the unichr function, I get 쒄 (U+C484), because unichr treats the integer as a code point rather than as encoded bytes.

But I need it to output: Ą

How do I convert my_int to a Unicode code point?

To convert the integer 0xC484 to the bytestring '\xc4\x84' (the UTF-8 representation of the Unicode character Ą), you can use struct.pack():

>>> import struct
>>> struct.pack(">H", 0xC484)
'\xc4\x84'

... where > in the format string represents big-endian, and H represents unsigned short int.

Once you have your UTF-8 bytestring, you can decode it to Unicode as usual:

>>> struct.pack(">H", 0xC484).decode("utf8")
u'\u0104'

>>> print struct.pack(">H", 0xC484).decode("utf8")
Ą
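To see why the packed bytes decode to U+0104, the bit arithmetic can be done by hand. The following is only a sketch for the two-byte case, written for Python 3; the helper name decode_two_byte_utf8 is made up for illustration and it does not reject overlong encodings.

```python
def decode_two_byte_utf8(n):
    """Hypothetical helper: decode an int holding a two-byte UTF-8 sequence."""
    if not 0 <= n <= 0xFFFF:
        raise ValueError("expected a value that fits in two bytes")
    lead, cont = n >> 8, n & 0xFF
    # A two-byte UTF-8 sequence has the bit layout 110xxxxx 10yyyyyy.
    if lead & 0xE0 != 0xC0 or cont & 0xC0 != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence: %#x" % n)
    # Keep the 5 payload bits of the lead byte and the 6 of the continuation byte.
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


code_point = decode_two_byte_utf8(0xC484)
print(hex(code_point))  # 0x104
print(chr(code_point))  # Ą
```

For 0xC484 the lead byte 0xC4 contributes the bits 00100 and the continuation byte 0x84 contributes 000100, which together give 0x104.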

Encode the number to a hex string, using hex() or %x. Then you can interpret that as a series of hex bytes using the hex decoder. Finally use the utf-8 decoder to get a unicode string:

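def weird_utf8_integer_to_unicode(n):
    # Python 2: format the integer as hex digits
    s = '%x' % n
    # Pad to an even number of digits so every byte has two
    if len(s) % 2:
        s = '0' + s
    return s.decode('hex').decode('utf-8')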

The len check is in case the first byte is in the range 0x1–0xF, which would leave it missing a leading zero. This should be able to cope with any length string and any character (however, encoding a byte sequence in an integer like this cannot preserve leading zero bytes).
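The 'hex' codec on str objects exists only in Python 2. On Python 3 the same idea could be sketched with bytes.fromhex(); the function name utf8_int_to_str below is an assumption chosen for illustration, not part of the original answer.

```python
def utf8_int_to_str(n):
    """Hypothetical Python 3 variant of the hex-string approach."""
    s = "%x" % n
    # Pad to an even number of hex digits so each byte has two.
    if len(s) % 2:
        s = "0" + s
    # bytes.fromhex turns the digit pairs into bytes, which are then decoded.
    return bytes.fromhex(s).decode("utf-8")


print(utf8_int_to_str(0xC484))    # Ą
print(utf8_int_to_str(0xE282AC))  # €
```

As with the Python 2 version, leading zero bytes cannot be recovered from the integer, and an integer that is not valid UTF-8 raises UnicodeDecodeError.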

>>> int2bytes(0xC484).decode('utf-8')
u'\u0104'
>>> print(_)
Ą

where int2bytes() is defined here.
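The linked definition of int2bytes() is not reproduced on this page. As an assumption about what such a helper might look like on Python 3, here is a minimal sketch built on int.to_bytes(); the original definition may differ.

```python
def int2bytes(n):
    """Assumed stand-in: big-endian bytes of a non-negative int, no leading zero bytes."""
    # Number of bytes needed to hold n; use one byte for n == 0.
    length = (n.bit_length() + 7) // 8 or 1
    return n.to_bytes(length, "big")


print(int2bytes(0xC484))                  # b'\xc4\x84'
print(int2bytes(0xC484).decode("utf-8"))  # Ą
```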
