How do I decode escaped Unicode characters?

Question

I'm trying to replace escaped Unicode characters with the actual characters:

string = "\\u00c3\\u00a4"
print(string.encode().decode("unicode-escape"))

The expected output is ä, but the actual output is Ã¤.

Answer 1

The following solution seems to work in similar situations (see, for example, this case about decoding broken Hebrew text):

("\\u00c3\\u00a4"
  .encode('latin-1')
  .decode('unicode_escape')
  .encode('latin-1')
  .decode('utf-8')
)
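
To see why the chain works, it can help to look at each intermediate value. The walk-through below is only an illustration of the same four calls; the variable names are made up for this sketch.

```python
# Illustrative walk-through; the variable names are hypothetical.
escaped = "\\u00c3\\u00a4"  # 12 ASCII characters: backslash, u, 0, 0, c, 3, ...

# Step 1: str -> bytes. The text is pure ASCII, so latin-1 is lossless here.
raw = escaped.encode("latin-1")

# Step 2: resolve the \uXXXX escapes. This yields two characters, U+00C3 and U+00A4.
mojibake = raw.decode("unicode_escape")

# Step 3: latin-1 maps code points U+0000..U+00FF to identical byte values,
# so this recovers the bytes C3 A4, which are the UTF-8 encoding of "ä".
utf8_bytes = mojibake.encode("latin-1")

# Step 4: read those two bytes as UTF-8.
fixed = utf8_bytes.decode("utf-8")

print(repr(mojibake), utf8_bytes, fixed)
```

In other words, the escapes in the question describe UTF-8 bytes rather than code points, which is why a plain unicode_escape decode stops at Ã¤.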

Answer 2

According to the codecs documentation page, the unicode_escape codec decodes from Latin-1 source code.

That means the output of "unicode-escape" is decoded as Latin-1, even though Python's default encoding is UTF-8.
So you just need to encode back to Latin-1 and then decode as UTF-8:

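import codecs

mixed_string_to_be_unescaped = '\u002Fq:85\\u002FczM"},{\"name\":\"Santé\",\"parent_name\":\"Santé'

# unicode-escape resolves the escapes but treats the input bytes as Latin-1
val = codecs.decode(mixed_string_to_be_unescaped, 'unicode-escape')
# Round-trip through Latin-1 bytes to recover the UTF-8 text
val = val.encode('latin1').decode('utf-8')
print(val)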

/q:85/czM"},{"name":"Santé","parent_name":"Santé

The solution above works, but it was not clear to me: I did not understand why I should convert to latin-1 before applying unicode_escape (I discovered that this happens automatically), nor why unicode_escape was being applied to an unescaped string.
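
One way to make the implicit step visible is to do the str-to-bytes conversion explicitly. The sketch below assumes the input may mix \uXXXX escapes with literal non-ASCII characters such as é, and that every escape stays at or below U+00FF; `unescape_mixed` is a hypothetical helper name, not part of any library.

```python
def unescape_mixed(text: str) -> str:
    """Hypothetical helper: resolve \\uXXXX escapes in text that may also hold literal non-ASCII characters."""
    # Encode explicitly as UTF-8 so literal characters such as "é" become bytes.
    raw = text.encode("utf-8")
    # unicode_escape reads each byte as a Latin-1 character and resolves the escapes.
    decoded = raw.decode("unicode_escape")
    # Map the Latin-1 characters back to bytes and read them as UTF-8.
    # Raises UnicodeEncodeError if an escape produced a code point above U+00FF.
    return decoded.encode("latin-1").decode("utf-8")


print(unescape_mixed("\\u002Fq:85\\u002F Santé"))
print(unescape_mixed("\\u00c3\\u00a4"))
```

The Latin-1 round trip repairs both the literal é, which was split into two characters by unicode_escape, and escapes like the ones in the question, which spell out UTF-8 bytes. It is not a general unescaper: an escape such as \u4e2d would fail at the latin-1 encode step.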
