How to convert an ascii string with escape characters to its unicode equivalent

# coding=ascii
bad_string = '\x9a'
expected = u'š'
good_string = bad_string.decode('unicode-escape').encode('utf-8')
if good_string != expected:
    raise AssertionError()

I would expect the above test to pass, but I'm getting the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

What am I missing here?

(I can't simply change bad_string to be unicode. These are strings arriving from an outside source.)

'\x9a' doesn't have any escape characters in it. The escape is part of the string literal, and it represents just one byte: [0x9a]. The encoding might be Windows-1252, because that's common and has š at 0x9a, but you really have to know what it is. To decode as Windows-1252:

good_string = bad_string.decode('cp1252')
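For readers on Python 3, where bytes and text are separate types, the same idea might look like the sketch below. The variable names are illustrative, and the choice of cp1252 is an assumption about the outside source, not something the byte itself can tell you.

```python
# Python 3 sketch: the byte 0x9a only becomes 'š' once an encoding is chosen.
bad_bytes = b'\x9a'  # a single byte, as it might arrive from an outside source

# Assumption: the source used Windows-1252 (cp1252).
good_string = bad_bytes.decode('cp1252')
assert good_string == '\u0161'  # U+0161, LATIN SMALL LETTER S WITH CARON ('š')

# A different guess yields a different character, so the encoding must be known.
latin1_guess = bad_bytes.decode('latin-1')
assert latin1_guess == '\x9a'  # a C1 control character, not 'š'

# Re-encode as UTF-8 only when bytes are needed again (e.g. for output).
utf8_bytes = good_string.encode('utf-8')
assert utf8_bytes == b'\xc5\xa1'
```

Comparing decoded text against decoded text (rather than UTF-8 bytes against a unicode literal) avoids the implicit ASCII conversion that produced the error in the question.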

If what you actually have is '\\x9a' (one backslash, three other characters), then you'll need to convert it to the above form first. The right way to do this depends on how the escapes managed to get there in the first place. If it's from a Python string literal, use string-escape first:

good_string = bad_string.decode('string-escape').decode('cp1252')
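The string-escape codec exists only in Python 2. On Python 3, one possible substitute is a round trip through unicode_escape and latin-1, sketched below. It assumes the incoming text contains only ASCII characters plus Python-style \xNN escapes, and that the underlying encoding is cp1252; the variable names are hypothetical.

```python
# Python 3 sketch of the two-step conversion.
escaped_text = r'\x9a'  # four characters: backslash, 'x', '9', 'a'
assert len(escaped_text) == 4

# Step 1: interpret the escape sequence. unicode_escape turns the four
# characters into the single code point U+009A.
unescaped = escaped_text.encode('ascii').decode('unicode_escape')

# Step 2: latin-1 maps code points 0-255 one-to-one onto bytes, which
# recovers the raw byte the escape stood for.
raw_bytes = unescaped.encode('latin-1')
assert raw_bytes == b'\x9a'

# Step 3: decode with the real encoding (assumed here to be cp1252).
good_string = raw_bytes.decode('cp1252')
assert good_string == '\u0161'  # 'š'
```

If the escapes came from somewhere other than a Python literal (JSON, a URL, a log format), a different unescaping step would be needed before the final decode.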
