Python: convert unicode and hex to unicode

Question

I have a supposedly unicode string like this:

u'\xc3\xa3\xc6\u2019\xc2\xa9\xc3\xa3\xc6\u2019\xe2\u20ac\u201c\xc3\xa3\xc6\u2019\xc2\xa9\xc3\xa3\xe2\u20ac\u0161\xc2\xa4\xc3\xa3\xc6\u2019\xe2\u20ac\u201c\xc3\xaf\xc2\xbc\xc2\x81\xc3\xa3\xe2\u20ac\u0161\xc2\xb9\xc3\xa3\xe2\u20ac\u0161\xc2\xaf\xc3\xa3\xc6\u2019\xc2\xbc\xc3\xa3\xc6\u2019\xc2\xab\xc3\xa3\xe2\u20ac\u0161\xc2\xa2\xc3\xa3\xe2\u20ac\u0161\xc2\xa4\xc3\xa3\xc6\u2019\xe2\u20ac\xb0\xc3\xa3\xc6\u2019\xc2\xab\xc3\xa3\xc6\u2019\xe2\u20ac\xa2\xc3\xa3\xe2\u20ac\u0161\xc2\xa7\xc3\xa3\xe2\u20ac\u0161\xc2\xb9\xc3\xa3\xc6\u2019\xe2\u20ac\xa0\xc3\xa3\xe2\u20ac\u0161\xc2\xa3\xc3\xa3\xc6\u2019\xc2\x90\xc3\xa3\xc6\u2019\xc2\xab\xc3\xaf\xc2\xbc\xcb\u2020\xc3\xa3\xe2\u20ac\u0161\xc2\xb9\xc3\xa3\xe2\u20ac\u0161\xc2\xaf\xc3\xa3\xc6\u2019\xe2\u20ac\xa2\xc3\xa3\xe2\u20ac\u0161\xc2\xa7\xc3\xa3\xe2\u20ac\u0161\xc2\xb9\xc3\xaf\xc2\xbc\xe2\u20ac\xb0'

How do I get the correct unicode string out of this? I think the actual unicode value is ラブライブ！スクールアイドルフェスティバル（スクフェス）.

Answer 1

You have a Mojibake, an incorrectly decoded piece of text.
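To make the term concrete, here is a minimal Python 3 sketch (the rest of this answer uses Python 2) of how Mojibake arises. It assumes the common case of UTF-8 bytes being misread as Windows-1252; the variable names are only illustrative.

```python
# Python 3 sketch: create Mojibake on purpose, then undo it.
original = 'ラブライブ'

# Encode correctly as UTF-8, then decode with the wrong codec.
wrong = original.encode('utf-8').decode('cp1252')
print(wrong)      # each katakana character turns into three Latin symbols

# Reversing the two steps recovers the text, as long as every
# byte involved has a defined meaning in cp1252.
restored = wrong.encode('cp1252').decode('utf-8')
print(restored)
```

Every byte in this sample happens to be defined in cp1252, so the strict standard-library codec is enough here.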

You can use the ftfy library to undo the damage:

>>> from ftfy import fix_text
>>> fix_text(s)
u'\u30e9\u30d6\u30e9\u30a4\u30d6!\u30b9\u30af\u30fc\u30eb\u30a2\u30a4\u30c9\u30eb\u30d5\u30a7\u30b9\u30c6\u30a3\u30d0\u30eb(\u30b9\u30af\u30d5\u30a7\u30b9)'
>>> print fix_text(s)
ラブライブ!スクールアイドルフェスティバル(スクフェス)

According to ftfy, your data was encoded as UTF-8, then decoded as Windows codepage 1252; the ftfy.fixes.fix_one_step_and_explain() function shows the repair steps needed:

>>> import ftfy.fixes
>>> ftfy.fixes.fix_one_step_and_explain(s)[-1]
[(u'encode', u'sloppy-windows-1252', 0), (u'decode', u'utf-8', 0)]

(The 'sloppy' encoding is needed because not all UTF-8 bytes can be decoded as cp1252, but some bad decoders then just copy the original byte; the special codec reverses that process.)
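The gap in cp1252 can be seen with the Python 3 standard library alone. The sketch below lists the byte values that the strict cp1252 codec refuses to decode, then imitates a "bad decoder" that copies those bytes through as code points. `sloppy_decode` is a hypothetical helper written for illustration, not part of ftfy.

```python
# Find the byte values that Python's strict cp1252 codec cannot decode.
undefined = []
for value in range(256):
    try:
        bytes([value]).decode('cp1252')
    except UnicodeDecodeError:
        undefined.append(value)

print([hex(v) for v in undefined])


def sloppy_decode(data):
    """Hypothetical helper: decode as cp1252 where defined,
    otherwise copy the byte through as the same-numbered code point."""
    return ''.join(
        chr(b) if b in undefined else bytes([b]).decode('cp1252')
        for b in data
    )


# b'\xc2\x90' is valid UTF-8 (for U+0090), but 0x90 has no cp1252 meaning,
# so a strict decode fails while the sloppy one keeps the byte as U+0090.
garbled = sloppy_decode(b'\xc2\x90')
print(repr(garbled))
```

A codec that reverses this has to map those few pass-through code points back to their raw bytes, which is the part a strict cp1252 encoder will not do.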

In fact, in your case this was done twice, not a feat I had seen before:

>>> print s.encode('sloppy-cp1252').decode('utf8').encode('sloppy-cp1252').decode('utf8')
ラブライブ！スクールアイドルフェスティバル（スクフェス）
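If ftfy is not available, the same double repair can be approximated in Python 3 with the standard library. This is only a sketch under assumptions: `sloppy_encode` and `undo_mojibake` are hypothetical helpers, the set of pass-through bytes is the five values the strict cp1252 codec leaves undefined, and the sample is just two characters' worth of the string from the question.

```python
# Byte values with no meaning in strict cp1252; a sloppy decoder would
# have copied them through as the same-numbered code points.
UNDEFINED_IN_CP1252 = {0x81, 0x8D, 0x8F, 0x90, 0x9D}


def sloppy_encode(text):
    """Hypothetical stand-in for a sloppy cp1252 encoder."""
    out = bytearray()
    for ch in text:
        if ord(ch) in UNDEFINED_IN_CP1252:
            out.append(ord(ch))          # copy the raw byte back
        else:
            out += ch.encode('cp1252')   # normal cp1252 mapping
    return bytes(out)


def undo_mojibake(text, rounds=2):
    """Reverse 'UTF-8 decoded as cp1252' the given number of times."""
    for _ in range(rounds):
        text = sloppy_encode(text).decode('utf-8')
    return text


# Two groups taken from the question's string; the second one contains
# \x90, which is why a strict cp1252 encode would fail.
sample = ('\xc3\xa3\xc6\u2019\xc2\xa9'
          '\xc3\xa3\xc6\u2019\xc2\x90')
print(undo_mojibake(sample))
```

The number of rounds is passed explicitly because nothing in the text itself says how many times it was mangled; ftfy works that out for you.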

Answer 1 (score: 5, accepted, 2017-02-07 21:20:38)
