
How to determine if a string is escaped unicode

How do you determine if a string contains escaped unicode, so that you know whether or not to run .decode("unicode-escape")?

For example:

test.py

# -*- coding: utf-8 -*-
str_escaped = '"A\u0026B"'
str_unicode = '"Война́ и миръ"'

arr_all_strings = [str_escaped, str_unicode]

def is_escaped_unicode(str):
    #how do I determine if this is escaped unicode?
    pass

for str in arr_all_strings:
    if is_escaped_unicode(str):
        str = str.decode("unicode-escape")
    print str

Current output:

"A\u0026B"
"Война́ и миръ"

Expected output:

"A&B"
"Война́ и миръ"

How do I define is_escaped_unicode(str) to determine whether the string that's passed is actually escaped unicode?

You cannot.

There is no way to tell whether '"A\u0026B"' originally came from some text that was encoded, whether the data are just the literal bytes '"A\u0026B"', or whether we arrived there from some other encoding.

How do ... you know whether or not to run .decode("unicode-escape")

You have to know whether someone earlier called text.encode('unicode-escape'). The bytes themselves cannot tell you.

You can certainly guess, by looking for \u or \U escape sequences, or by simply wrapping the decoding in try/except and seeing what happens, but I don't recommend going down this route.
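For illustration only, the guessing approach mentioned above might look like the following. This is a minimal sketch in Python 3 (the question's code is Python 2); the names ESCAPE_RE, looks_escaped and guess_decode are hypothetical, and falling back to UTF-8 is an assumption about the input, not something the bytes can confirm.

```python
import re

# Heuristic: a backslash followed by u + 4 hex digits or U + 8 hex digits.
ESCAPE_RE = re.compile(rb'\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}')

def looks_escaped(data: bytes) -> bool:
    """Guess whether data contains unicode-escape sequences."""
    return ESCAPE_RE.search(data) is not None

def guess_decode(data: bytes) -> str:
    """Decode as unicode-escape if it looks escaped, else as UTF-8 (assumed)."""
    if looks_escaped(data):
        try:
            return data.decode('unicode-escape')
        except UnicodeDecodeError:
            pass  # malformed escape; fall through to the UTF-8 assumption
    return data.decode('utf-8')

print(guess_decode(b'"A\\u0026B"'))
print(guess_decode('"Война́ и миръ"'.encode('utf-8')))

# The guess cannot see intent: this text is *about* an escape sequence,
# yet it is flagged (and would be decoded) all the same.
print(looks_escaped(b'Write \\u0026 for an ampersand'))
```

The last line shows why this stays a guess: text that legitimately contains a backslash-u sequence is indistinguishable from text that was escaped.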

If you encounter a bytestring in your application, and you don't already know what the encoding is, then your problem lies elsewhere and should be fixed elsewhere.
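One way to fix it "elsewhere" is to record the encoding at the point where the bytes enter the program, so nothing downstream has to guess. The following is a rough Python 3 sketch of that idea; the Payload container and to_text helper are hypothetical names introduced here, not part of any library.

```python
from typing import NamedTuple

class Payload(NamedTuple):
    """Hypothetical container: raw bytes plus the encoding known at the source."""
    data: bytes
    encoding: str  # e.g. 'utf-8' or 'unicode-escape'

def to_text(payload: Payload) -> str:
    """Decode using the recorded encoding; no inspection of the bytes needed."""
    return payload.data.decode(payload.encoding)

russian = '"Война́ и миръ"'
payloads = [
    Payload(b'"A\\u0026B"', 'unicode-escape'),  # producer said it escaped this
    Payload(russian.encode('utf-8'), 'utf-8'),  # producer said this is UTF-8
]
texts = [to_text(p) for p in payloads]
for text in texts:
    print(text)
```

The decision about whether to call .decode("unicode-escape") is then made from what the producer declared, not from what the bytes happen to look like.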

str_escaped = '"A\u0026B"'  # plain byte string; a u'' literal would already be unescaped
str_unicode = '"Война́ и миръ"'

arr_all_strings = [str_escaped, str_unicode]

def is_ascii(s):
    return all(ord(c) < 128 for c in s)

def is_escaped_unicode(str):
    #how do I determine if this is escaped unicode?
    if is_ascii(str): # escaped unicode is ascii
        return True
    return False

for str in arr_all_strings:
    if is_escaped_unicode(str):
        str = str.decode("unicode-escape")
    print str

The code above will work for your case.

Explanation:

  • All characters in str_escaped are in the ASCII range.

  • The characters in str_unicode are not all in the ASCII range.
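As a hedged illustration of the ASCII check described in the two points above, here is a small Python 3 sketch. The variable plain is a hypothetical example added here: it shows that any pure-ASCII data passes the same test, whether or not it was ever escaped, so the check only separates these two particular inputs.

```python
def is_ascii(data: bytes) -> bool:
    """True if every byte is below 128."""
    return all(b < 128 for b in data)

russian = '"Война́ и миръ"'
escaped = b'"A\\u0026B"'
plain = b'C:\\temp\\new'  # hypothetical ASCII data that was never escaped

print(is_ascii(escaped))                  # ASCII, and really escaped
print(is_ascii(russian.encode('utf-8')))  # contains bytes >= 128
print(is_ascii(plain))                    # ASCII too, so it would be "decoded"

# Decoding the unescaped path turns \t and \n into a real tab and newline.
decoded = plain.decode('unicode-escape')
print('\t' in decoded, '\n' in decoded)
```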

Here's a crude way to do it. Try decoding as unicode-escape, and if that succeeds the resulting string will be shorter than the original string.

str_escaped = '"A\u0026B"'
str_unicode = '"Война́ и миръ"'
arr_all_strings = [str_escaped, str_unicode]

def decoder(s):
    y = s.decode('unicode-escape')
    return y if len(y) < len(s) else s.decode('utf8')

for s in arr_all_strings:
    print s, decoder(s)

Output:

"A\u0026B" "A&B"
"Война и миръ" "Война и миръ"

But seriously, you'll save yourself a lot of pain if you can migrate to Python 3. And if you can't migrate immediately, you may find this article helpful: Pragmatic Unicode, written by SO veteran Ned Batchelder.
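For anyone who does migrate, the same crude length comparison might be written in Python 3 roughly as follows. This is only a sketch: it assumes the input arrives as bytes and that anything not escaped is UTF-8, and the mixed example is a hypothetical input added here to show where the heuristic breaks.

```python
def decoder(data: bytes) -> str:
    """Crude guess: an escaped input shrinks when decoded as unicode-escape."""
    candidate = data.decode('unicode-escape')
    # Unescaped bytes map one-to-one to characters (as Latin-1), so the
    # length only drops when at least one escape sequence was expanded.
    return candidate if len(candidate) < len(data) else data.decode('utf-8')

russian = '"Война́ и миръ"'
print(decoder(b'"A\\u0026B"'))
print(decoder(russian.encode('utf-8')))

# Hypothetical mixed input: UTF-8 text that also contains an escape.
# unicode-escape reads the Cyrillic bytes as Latin-1, so the result is garbled.
mixed = 'Война \\u0026 миръ'.encode('utf-8')
print(decoder(mixed) == 'Война & миръ')
```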
