Python 3: Convert UTF-8 unicode Hindi literal to Unicode

Question

I have a string of UTF-8 literals

'\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2' which converts to

ही बोल in Hindi. I am unable to convert string a to bytes:

a = '\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2'
#convert a to bytes
#also tried a = bytes(a,'utf-8')
a = a.encode('utf-8')
s = str(a,'utf-8')

The string is converted to bytes, but it contains the wrong unicode literals.

RESULT: b'\xc3\xa0\xc2\xa4\xc2\xb9\xc3\xa0\xc2\xa5\xc2\x80 \xc3\xa0\xc2\xa4\xc2\xac\xc3\xa0\xc2\xa5\xc2\x8b\xc3\xa0\xc2\xa4\xc2\xb2' which prints à¤¹à¥ à¤¬à¥à¤²

EXPECTED: It should be b'\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2', which will be ही बोल

Answer 1

Use the raw-unicode-escape codec to encode the string as bytes; you can then decode those bytes as UTF-8.

>>> s = '\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2'
>>> s.encode('raw-unicode-escape').decode('utf-8')
'ही बोल'

This is something of a workaround; the ideal solution would be to prevent the source of the data from stringifying the original bytes.
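To make the two steps visible, here is a minimal sketch that prints the intermediate bytes. The variable names raw and text are only illustrative. It also shows one caveat worth knowing: raw-unicode-escape only maps characters to single bytes while they are in the range U+0000-U+00FF; anything above that is written out as a \uXXXX escape, so the trick applies only to strings that are mangled throughout.

```python
s = '\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2'

# Step 1: every code point in U+0000-U+00FF becomes the byte with the same value.
raw = s.encode('raw-unicode-escape')
print(raw)   # b'\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2'

# Step 2: those bytes are a valid UTF-8 sequence, so they decode to the Hindi text.
text = raw.decode('utf-8')
print(text)  # ही बोल

# Caveat: a character above U+00FF is emitted as a literal backslash escape,
# not as a single byte.
escaped = 'ह'.encode('raw-unicode-escape')
print(escaped)  # b'\\u0939'
```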

Answer 2

Your original string was likely decoded as latin1. Decode it as UTF-8 instead if possible, but if you received it already mangled, you can reverse the damage by encoding as latin1 again and decoding correctly as UTF-8:

>>> s = '\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2'
>>> s.encode('latin1').decode('utf8')
'ही बोल'
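As a rough illustration of how the string probably got into this state, the sketch below starts from the raw bytes and decodes them both ways. The name payload is hypothetical and stands in for whatever bytes your file, socket, or API actually delivered.

```python
# Hypothetical raw bytes as they might arrive from a file or network source.
payload = b'\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2'

# Wrong codec: latin1 turns each byte into one character, giving the mangled str.
mangled = payload.decode('latin1')
print(mangled == '\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2')  # True

# Right codec: decoding as UTF-8 at the source avoids the problem entirely.
correct = payload.decode('utf-8')
print(correct)  # ही बोल

# If only the mangled str is available, the round trip recovers the same text.
repaired = mangled.encode('latin1').decode('utf-8')
print(repaired == correct)  # True
```

If the data comes from a text file, passing encoding='utf-8' to open() is the usual way to decode correctly at the source.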

Note that latin1 encoding matches the first 256 Unicode code points, so U+00E0 ('\xe0' in a Python 3 str object) becomes byte E0h (b'\xe0' in a Python 3 bytes object). It's a 1:1 mapping between U+0000-U+00FF and bytes 00h-FFh.
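The mapping can be checked directly. The short sketch below (variable names are illustrative) also shows why the attempt in the question produced \xc3\xa0 and friends: UTF-8 needs two bytes for every code point from U+0080 to U+07FF, so encoding the mangled string as UTF-8 doubles each byte instead of restoring it.

```python
# All 256 code points U+0000-U+00FF encode to the bytes 00h-FFh, in order.
all_256 = ''.join(chr(i) for i in range(256))
print(all_256.encode('latin1') == bytes(range(256)))  # True

ch = '\xe0'  # U+00E0
as_latin1 = ch.encode('latin1')
as_utf8 = ch.encode('utf-8')
print(as_latin1)  # b'\xe0'      one byte, same value as the code point
print(as_utf8)    # b'\xc3\xa0'  two bytes, which is what the question's output shows

# Outside U+0000-U+00FF there is no latin1 byte, so encoding fails.
try:
    'ह'.encode('latin1')
    failed = False
except UnicodeEncodeError:
    failed = True
print(failed)  # True
```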

Answer 1
1 accepted 2019-12-14 13:47:28

Answer 2
1 2019-12-15 02:17:02
