
How to decode a unicode character in string?

I have the following string:

Conversely, companies that aren\u0019t sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil \u0014 well, they usually end up facing extinction.

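This string contains '\u0019t'. I cannot decode it because it is already a string. If I encode it first and then decode it, it still shows '\u0019t'. How do I get it to show a ' ?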

One option is to literal_eval it:

import ast
s = r"Conversely, companies that aren\u0019t sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil \u0014 well, they usually end up facing extinction. \u2661"
r = ast.literal_eval(f'"{s}"')
print(r)

Output:

Conversely, companies that arent sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil  well, they usually end up facing extinction. ♡
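The apostrophe and dash look like they are simply missing from that output, but the decoded characters U+0019 and U+0014 are control characters that most terminals do not render. A minimal sketch for making them visible, using a shortened version of the string (the variable names `decoded` and `controls` are only for illustration):

```python
import ast

# Shortened sample containing the same two escapes as the original text
s = r"aren\u0019t sharp-eyed ... evil \u0014 well"
decoded = ast.literal_eval(f'"{s}"')

# repr() shows non-printable characters as escapes instead of hiding them
print(repr(decoded))

# List every control character (below U+0020) found in the decoded text
controls = [f"U+{ord(c):04X}" for c in decoded if ord(c) < 0x20]
print(controls)  # ['U+0019', 'U+0014']
```

So the escapes were decoded correctly; they just name the wrong code points.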

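For some reason the Unicode escapes in the string are off by 2000 hex: each one is 0x2000 lower than the character that was intended. The Unicode dash and apostrophe are:

Unicode Character "EM DASH" (U+2014)

Unicode Character "RIGHT SINGLE QUOTATION MARK" (U+2019)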
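The offset can be checked with a few lines of Python. This is only a sanity check of the arithmetic; `OFFSET` and `describe` are names made up for this sketch:

```python
import unicodedata

OFFSET = 0x2000  # gap between the escapes in the text and the intended characters

def describe(broken):
    """Show a broken code point next to the one it was presumably meant to be."""
    fixed = broken + OFFSET
    return f"U+{broken:04X} -> U+{fixed:04X} {unicodedata.name(chr(fixed))}"

for cp in (0x0014, 0x0019):
    print(describe(cp))
# U+0014 -> U+2014 EM DASH
# U+0019 -> U+2019 RIGHT SINGLE QUOTATION MARK
```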

So let's fix it anyway, even though the error is at the source (THEM) rather than at the destination:

import re
text = r'Conversely, companies that aren\u0019t sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil \u0014 well, they usually end up facing extinction.'
pattern = r'\\u([0-9a-fA-F]{4})'

# end offset of the previous match, so that the text between
# matches (which needs no fixing) can be copied through unchanged
off = 0
# start with an empty result string
s = ''
# find and iterate over all matches of \uHHHH where H is a hex digit
for u in re.finditer(pattern, text):
    # append anything up to the unicode escape
    s += text[off:u.start()]
    # fix encoding mistake, unicode escapes are 2000 hex off the mark
    # then append it
    s += chr(int(u.group(1), 16) + 0x2000)
    # set off to the end of the match
    off = u.end()
# append everything from the last match to the end of the text
s += text[off:]
print(s)

This prints:

Conversely, companies that aren’t sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil — well, they usually end up facing extinction.

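Note that I have happily ignored any \\u00xx that may occur in the text (where the backslash itself is escaped); that is a problem I leave for you to solve. And of course, any correct Unicode escapes in the text will be altered as well.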
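One possible way to handle both caveats is sketched below. It is not part of the answer above, and it rests on an assumption you would need to confirm for your own data: that only escapes in the control range (below U+0020) are broken, so anything at or above that is decoded as written. The names `fix_escapes`, `offset` and `limit` are hypothetical.

```python
import re

# A run of one or more backslashes, then u and four hex digits
ESCAPE = re.compile(r'(\\+)u([0-9a-fA-F]{4})')

def fix_escapes(text, offset=0x2000, limit=0x20):
    def repl(m):
        slashes, code = m.group(1), int(m.group(2), 16)
        # An even number of backslashes means the backslash itself is
        # escaped, so this is literal text and is left untouched.
        if len(slashes) % 2 == 0:
            return m.group(0)
        # ASSUMPTION: only control-range escapes are off by `offset`.
        if code < limit:
            code += offset
        # Any preceding backslash pairs are kept exactly as they were.
        return slashes[:-1] + chr(code)
    return ESCAPE.sub(repl, text)

print(fix_escapes(r'aren\u0019t evil \u0014 well \u2661 \\u0019'))
```

With this heuristic a genuine control escape such as \u0009 (tab) would also be shifted, so the `limit` check is a trade-off rather than a complete solution.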


Notice: technical posts on this site are published under the CC BY-SA 4.0 license. If you repost them, please credit this site's URL or the address of the original post.

 