简体   繁体   中英

Turning a unicode code point into a unicode character in Python

I'm parsing hex/unicode escapes from text.

So I'll have an input string like

\x{abcd}

which is easy enough - I wind up with an array ["ab", "cd"] which I call digits and do this to it:

return bytes(int(d, 16) for d in digits).decode("utf-8")

So I basically accept everything between the {} as a UTF-8-encoded character and turn it into a character. Simple.

>>> bytes(int(d, 16) for d in ["e1", "88", "92"]).decode("utf-8")
'ሒ'

But I want to go the other way: \\u{1212} should result in the same character. The problem is, I don't know how to treat the resulting ["12", "12"] as a unicode code point instead of UTF-8 bytes to get the ሒ character again.

How can I do this in python 3?

You can use chr after parsing the number as base-16:

>>> chr(int('1212', 16))
'ሒ'
>>> '\u1212'
'ሒ'

If you're replacing this globally in some string, using re.sub with a substitution function could make this simple:

import re

def replacer(match):
    if match.group(2) == 'u':
        return chr(int(match.group(3), 16))
    elif match.group(2) == 'x':
        return  # ...

re.sub(r'(\\(x|u)\{(.*?)\})', replacer, r'\x{abcd} foo \u{1212}')

do you mean to encode the string like this?

>>> print u"\u1212"
ሒ
>>> print u"\u00A9"
©

edit:

if you start with a string, it's just

>>> chr(int("1212", 16))
'ሒ'

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM