How to determine if a string is escaped unicode

Question

How do you determine if a string contains escaped unicode so you know whether or not to run .decode("unicode-escape")?

For example:

test.py

# -*- coding: utf-8 -*-
str_escaped = '"A\u0026B"'
str_unicode = '"Война́ и миръ"'

arr_all_strings = [str_escaped, str_unicode]

def is_escaped_unicode(str):
    #how do I determine if this is escaped unicode?
    pass

for str in arr_all_strings:
    if is_escaped_unicode(str):
        str = str.decode("unicode-escape")
    print str

Current output:

"A\u0026B"
"Война́ и миръ"

Expected output:

"A&B"
"Война́ и миръ"

How do I define is_escaped_unicode(str) to determine if the string that's passed is actually escaped unicode?

Answer 1

You cannot.

There is no way to tell whether '"A\u0026B"' originally came from some text that was encoded, whether the data are just the bytes '"A\u0026B"', or whether we arrived there from some other encoding.

How do ... you know whether or not to run .decode("unicode-escape")

You have to know if someone earlier has called text.encode('unicode-escape'). The bytes themselves cannot tell you.
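As a rough illustration of why the bytes alone are not enough, the sketch below builds the same byte sequence from two different origins. It is written for Python 3, which is an assumption since the question uses Python 2, and the variable names are made up for this example.

```python
# Origin A: real text holding one Cyrillic letter (U+0416), which an
# earlier processing step escaped with the unicode-escape codec.
escaped = "\u0416".encode("unicode-escape")

# Origin B: text that literally consists of a backslash, the letter "u"
# and four digits, for example documentation about escape sequences.
literal = r"\u0416".encode("ascii")

# Both origins yield exactly the same six bytes.
same_bytes = escaped == literal
print(escaped, literal, same_bytes)
```

Decoding is right for origin A and wrong for origin B, yet nothing in the bytes distinguishes them. Only knowledge of how the data was produced can.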

You can certainly guess, by looking for \u or \U escape sequences, or by wrapping the decoding in try/except and seeing what happens, but I don't recommend going down this route.
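For completeness, here is a minimal sketch of what such guessing could look like, assuming Python 3. The helper names looks_escaped and try_decode, and the regular expression, are hypothetical and not part of any library. Both helpers are heuristics and can be wrong.

```python
import re

# Matches \uXXXX or \UXXXXXXXX written out as literal characters.
ESCAPE_PATTERN = re.compile(r"\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}")

def looks_escaped(text):
    """Guess whether text contains literal unicode escape sequences."""
    return ESCAPE_PATTERN.search(text) is not None

def try_decode(raw):
    """Try unicode-escape decoding of bytes; return None if it fails."""
    try:
        return raw.decode("unicode-escape")
    except UnicodeDecodeError:
        return None

print(looks_escaped('"A\\u0026B"'))   # literal backslash-u sequence present
print(try_decode(b'"A\\u0026B"'))     # escape sequence gets decoded
print(try_decode(b"\\u12"))           # truncated escape cannot be decoded
```

The guess produces false positives: a Windows path such as C:\data\u0026.txt also matches the pattern, even though nobody escaped it.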

If you encounter a bytestring in your application, and you don't already know what the encoding is, then your problem lies elsewhere and should be fixed elsewhere.

Answer 2

# -*- coding: utf-8 -*-
str_escaped = '"A\u0026B"'  # byte string: Python 2 does not interpret \u here
str_unicode = '"Война́ и миръ"'

arr_all_strings = [str_escaped, str_unicode]

def is_ascii(s):
    # True when every character (byte) is in the 7-bit ASCII range
    return all(ord(c) < 128 for c in s)

def is_escaped_unicode(s):
    # Heuristic: escaped unicode is pure ASCII. Plain ASCII text also
    # passes, so this only separates escaped text from raw non-ASCII bytes.
    return is_ascii(s)

for s in arr_all_strings:
    if is_escaped_unicode(s):
        s = s.decode("unicode-escape")
    print s

The code above will work for your case.

Explanation:

All characters in str_escaped are in the ASCII range.
The characters in str_unicode are not all in the ASCII range.
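A rough Python 3 rendering of the same idea is sketched below. It assumes the input arrives as bytes, and decode_guess is a hypothetical name introduced here. The last call shows a limit of the ASCII test: any ASCII data that contains backslashes is treated as escaped.

```python
def is_ascii(data):
    # Iterating over bytes in Python 3 yields integers, so ord() is not needed.
    return all(byte < 128 for byte in data)

def decode_guess(data):
    """Decode escapes if data is pure ASCII, otherwise treat it as UTF-8."""
    if is_ascii(data):
        return data.decode("unicode-escape")
    return data.decode("utf-8")

escaped = b'"A\\u0026B"'
cyrillic = "\u0412\u043e\u0439\u043d\u0430".encode("utf-8")
path = b"C:\\temp"  # ASCII bytes that were never escaped

print(decode_guess(escaped))
print(decode_guess(cyrillic))
print(repr(decode_guess(path)))  # the \t in the path becomes a tab character
```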

Answer 3

Here's a crude way to do it. Try decoding as unicode-escape, and if that succeeds the resulting string will be shorter than the original string.

str_escaped = '"A\u0026B"'
str_unicode = '"Война́ и миръ"'
arr_all_strings = [str_escaped, str_unicode]

def decoder(s):
    y = s.decode('unicode-escape')
    return y if len(y) < len(s) else s.decode('utf8')

for s in arr_all_strings:
    print s, decoder(s)

Output:

"A\u0026B" "A&B"
"Война и миръ" "Война и миръ"

But seriously, you'll save yourself a lot of pain if you can migrate to Python 3. And if you can't immediately migrate to Python 3, you may find this article helpful: Pragmatic Unicode, which was written by SO veteran Ned Batchelder.
