How to decode escaped Unicode characters?

I'm trying to replace escaped Unicode characters with the actual characters:

string = "\\u00c3\\u00a4"
print(string.encode().decode("unicode-escape"))

The expected output is ä, but the actual output is Ã¤.

The following solution seems to work in similar situations (see, for example, this case about decoding broken Hebrew text):

("\\u00c3\\u00a4"
  .encode('latin-1')
  .decode('unicode_escape')
  .encode('latin-1')
  .decode('utf-8')
)
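To see why the chain above has four steps, it may help to look at the intermediate values. The sketch below splits the same calls into named steps; the variable names are illustrative only.

```python
# The input holds literal backslashes: 12 characters, not 2.
escaped = "\\u00c3\\u00a4"

# 1. Turn the text into bytes so the unicode_escape decoder can read it.
step1 = escaped.encode("latin-1")        # b'\\u00c3\\u00a4'

# 2. Interpret the \uXXXX escapes: two characters, U+00C3 and U+00A4.
step2 = step1.decode("unicode_escape")   # 'Ã¤'

# 3. Map each character back to the single byte with the same value.
step3 = step2.encode("latin-1")          # b'\xc3\xa4'

# 4. Those two bytes are the UTF-8 encoding of one character.
step4 = step3.decode("utf-8")            # 'ä'

print(step4)
```

Steps 3 and 4 are needed because the escapes in this input describe the UTF-8 bytes of ä, not the character ä itself.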

The codecs doc page states that unicode_escape decodes from Latin-1 source code, and warns that Python source code actually uses UTF-8 by default.

That means the "unicode-escape" decoder reads its input bytes as Latin-1, even though Python's default encoding is UTF-8, so the UTF-8 bytes of a non-ASCII character come out as separate Latin-1 characters.
So you just need to encode the result back to Latin-1 and then decode it as UTF-8:
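A small check of that behaviour, as a sketch with illustrative variable names: decoding raw UTF-8 bytes with unicode_escape gives the same result as decoding them as Latin-1, and the round trip through Latin-1 restores the original character.

```python
# UTF-8 encodes "é" as two bytes.
utf8_bytes = "é".encode("utf-8")                  # b'\xc3\xa9'

# unicode_escape treats each non-escape byte as a Latin-1 character.
via_escape = utf8_bytes.decode("unicode_escape")  # 'Ã©'
via_latin1 = utf8_bytes.decode("latin-1")         # 'Ã©'

# Encoding back to Latin-1 recovers the original bytes,
# which can then be decoded as UTF-8.
repaired = via_escape.encode("latin-1").decode("utf-8")

print(via_escape == via_latin1, repaired)
```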

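import codecs

# A string that mixes escaped sequences (\\u002F) with plain non-ASCII text (é)
mixed_string_to_be_unescaped = '\u002Fq:85\\u002FczM"},{\"name\":\"Santé\",\"parent_name\":\"Santé'

# Resolve the escapes; non-ASCII characters come back as Latin-1 mojibake
val = codecs.decode(mixed_string_to_be_unescaped, 'unicode-escape')
# Undo the mojibake: back to bytes via Latin-1, then decode as UTF-8
val = val.encode('latin1').decode('utf-8')
print(val)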

/q:85/czM"},{"name":"Santé","parent_name":"Santé

The above solution works, but it was not clear to me: I didn't understand why I should encode to latin-1 before applying unicode_escape (I found that this happens automatically), nor why unicode_escape was being applied to a string that is not escaped.
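For a mixed string like the one in this answer, one alternative is to replace only the \uXXXX sequences with a regular expression and leave the rest of the text alone. This is a different technique from the one used in the answers above, offered as a sketch: the helper name unescape_u_sequences and the pattern are hypothetical, it handles only four-digit \uXXXX escapes, and it does not combine surrogate pairs or touch other escapes such as \n.

```python
import re

# Matches a literal backslash, "u", and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def unescape_u_sequences(text):
    """Replace each \\uXXXX sequence with the character it names."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

mixed = '\u002Fq:85\\u002FczM"},{"name":"Santé"'
result = unescape_u_sequences(mixed)
print(result)
```

Because the text never passes through bytes, the é is left untouched and no Latin-1 round trip is needed. The input from the original question is a different case: "\\u00c3\\u00a4" would still come out as Ã¤ here, since its escapes describe UTF-8 bytes, so it still needs the encode('latin-1').decode('utf-8') step.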
