I'm parsing hex/unicode escapes from text.
So I'll have an input string like \x{abcd}, which is easy enough: I wind up with an array ["ab", "cd"], which I call digits, and do this to it:
return bytes(int(d, 16) for d in digits).decode("utf-8")
So I basically treat everything between the {} as the UTF-8-encoded bytes of a character and turn it into that character. Simple.
>>> bytes(int(d, 16) for d in ["e1", "88", "92"]).decode("utf-8")
'ሒ'
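A minimal sketch of that byte-oriented decoding, assuming the text between the braces is an even-length run of hex digits. The function name decode_x_escape is hypothetical, and bytes.fromhex is swapped in for the per-pair int(d, 16) loop shown above:

```python
def decode_x_escape(hex_text):
    # Hypothetical helper: treat the hex digits as raw UTF-8 bytes.
    # bytes.fromhex("e18892") gives b'\xe1\x88\x92'.
    return bytes.fromhex(hex_text).decode("utf-8")

print(decode_x_escape("e18892"))
```

Hex text that is not valid UTF-8 will raise UnicodeDecodeError, so real input probably needs error handling around the decode call.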
But I want to go the other way: \u{1212} should result in the same character. The problem is, I don't know how to treat the resulting ["12", "12"] as a Unicode code point instead of UTF-8 bytes to get the ሒ character again.
How can I do this in Python 3?
You can use chr after parsing the number as base 16:
>>> chr(int('1212', 16))
'ሒ'
>>> '\u1212'
'ሒ'
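If the digits have already been split into two-character chunks, as in the question, one possible sketch is to join them back into a single hex number before calling chr. The name decode_u_escape is hypothetical:

```python
def decode_u_escape(digits):
    # Hypothetical helper: join the chunks into one hex number,
    # e.g. ["12", "12"] -> "1212" -> 0x1212, then map it to a character.
    return chr(int("".join(digits), 16))

print(decode_u_escape(["12", "12"]))
```

chr raises ValueError for values above 0x10FFFF, so out-of-range input would need to be handled separately.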
If you're replacing this globally in some string, using re.sub with a substitution function could make this simple:
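import re

def replacer(match):
    # group(2) is the escape letter, group(3) the hex digits inside the braces
    if match.group(2) == 'u':
        return chr(int(match.group(3), 16))
    elif match.group(2) == 'x':
        return  # ... (handle the UTF-8 byte form here)

re.sub(r'(\\(x|u)\{(.*?)\})', replacer, r'\x{abcd} foo \u{1212}')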
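For illustration, here is one way the elided \x branch might be filled in, reusing the UTF-8 decoding from the question. This is a sketch under assumptions rather than the answerer's code: ESCAPE_RE and unescape are hypothetical names, and the pattern is narrowed to hex digits only.

```python
import re

# Hypothetical pattern: group 1 is the escape letter, group 2 the hex digits.
ESCAPE_RE = re.compile(r'\\(x|u)\{([0-9a-fA-F]+)\}')

def replacer(match):
    kind, hex_text = match.group(1), match.group(2)
    if kind == 'u':
        # \u{...}: the digits are a Unicode code point.
        return chr(int(hex_text, 16))
    # \x{...}: the digits are UTF-8 encoded bytes.
    return bytes.fromhex(hex_text).decode('utf-8')

def unescape(text):
    return ESCAPE_RE.sub(replacer, text)

print(unescape(r'\x{e18892} foo \u{1212}'))
```

Note that the bytes ab cd from the earlier example do not form valid UTF-8, so \x{abcd} would raise UnicodeDecodeError here; the usage line uses e1 88 92 instead.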
Do you mean to encode the string like this?
>>> print("\u1212")
ሒ
>>> print("\u00A9")
©
Edit: if you start with a string, it's just
>>> chr(int("1212", 16))
'ሒ'