How to decode a Unicode character in a string?
I have the following string:
Conversely, companies that aren\u0019t sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil \u0014 well, they usually end up facing extinction.
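This string contains '\u0019t'. I can't decode it because it is already a string. If I encode it first and then decode it, it still shows '\u0019t'. How do I get it to show a ' instead?
One option is to run it through literal_eval: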
import ast
s = r"Conversely, companies that aren\u0019t sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil \u0014 well, they usually end up facing extinction. \u2661"
r = ast.literal_eval(f'"{s}"')
print(r)
Output:
Conversely, companies that arent sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil  well, they usually end up facing extinction. ♡
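The apostrophe and the dash seem to vanish in that output because the escapes decode to the control characters U+0019 and U+0014, which most terminals do not display. A minimal sketch to make them visible, using a shortened sample string (the names `sample`, `decoded` and `controls` are only for illustration):

```python
import ast

# Shortened version of the string from the question
sample = r"aren\u0019t evil \u0014 well"
decoded = ast.literal_eval(f'"{sample}"')

# repr() shows the control characters that print() renders as nothing
print(repr(decoded))

# Collect every non-printable code point that ended up in the result
controls = [f"U+{ord(ch):04X}" for ch in decoded if not ch.isprintable()]
print(controls)
```

So literal_eval does decode the escapes faithfully; the problem is that the escapes themselves name the wrong characters.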
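For some reason, the Unicode escapes in your string are off by 2000 hex. The Unicode dash and apostrophe are U+2014 (EM DASH) and U+2019 (RIGHT SINGLE QUOTATION MARK).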
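The 0x2000 offset can be checked directly. This small sketch subtracts it from the code points of the right single quotation mark (U+2019) and the em dash (U+2014) and arrives at the two escapes found in the string; `OFFSET` and `pairs` are illustrative names:

```python
import unicodedata

OFFSET = 0x2000  # the suspected error in the source data

pairs = []
for intended in ("\u2019", "\u2014"):
    broken = ord(intended) - OFFSET
    pairs.append((unicodedata.name(intended),
                  f"U+{ord(intended):04X}",
                  f"U+{broken:04X}"))

for name, good, bad in pairs:
    print(f"{name}: {good} appears in the string as {bad}")
```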
So let's fix it anyway, even though the error is at the source (THEM) and not at the destination:
import re
text = r'Conversely, companies that aren\u0019t sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil \u0014 well, they usually end up facing extinction.'
pattern = r'\\u([0-9a-fA-F]{4})'
# used to indicate the end of the previous match
# to save the string parts that don't need character encoding
off = 0
# start with an empty string
s = r''
# find and iterate over all matches of \uHHHH where H is a hex digit
for u in re.finditer(pattern, text):
    # append anything up to the unicode escape
    s += text[off:u.start()]
    # fix encoding mistake, unicode escapes are 2000 hex off the mark
    # then append it
    s += chr(int(u.group(1), 16) + 0x2000)
    # set off to the end of the match
    off = u.end()
# append everything from the last match to the end of the line
s += text[off:]
print(s)
This prints:
Conversely, companies that aren’t sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil — well, they usually end up facing extinction.
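Note, though, that I happily ignored any \\u00xx that might occur in the text (where the backslash itself is escaped); that is a problem I leave for you to solve. And of course, any correct Unicode escapes in the text will be altered as well.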
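One possible way to handle both caveats is sketched below. It is not part of the fix above, and it rests on two assumptions: an even number of backslashes before the u means the backslash is itself escaped, and only escapes below U+0020 are broken, so anything else is decoded without the shift. A genuine control-character escape such as \u0009 would be shifted too, so check that assumption against your data. The names `fix_escapes` and `PATTERN` are hypothetical.

```python
import re

# Capture the whole run of backslashes so escaped backslashes can be detected
PATTERN = re.compile(r'(\\+)u([0-9a-fA-F]{4})')

def fix_escapes(text, offset=0x2000):
    def repl(match):
        slashes = match.group(1)
        code = int(match.group(2), 16)
        # Even number of backslashes: the backslash is escaped, leave the text alone
        if len(slashes) % 2 == 0:
            return match.group(0)
        # Assumption: only code points in the control range are off by `offset`
        if code < 0x20:
            code += offset
        # Keep any preceding backslashes, replace the escape with the character
        return slashes[:-1] + chr(code)
    return PATTERN.sub(repl, text)

print(fix_escapes(r'aren\u0019t evil \u0014 well \u2661 but \\u0019 stays'))
```

With this version a correct escape such as \u2661 is decoded as it stands, and \\u0019 is passed through untouched.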