How to determine if a string is escaped unicode
How do you determine if a string contains escaped unicode, so you know whether or not to run .decode("unicode-escape")?
For example:
test.py
# -*- coding: utf-8 -*-
str_escaped = '"A\u0026B"'
str_unicode = '"Война́ и миръ"'
arr_all_strings = [str_escaped, str_unicode]
def is_escaped_unicode(str):
    # how do I determine if this is escaped unicode?
    pass
for str in arr_all_strings:
    if is_escaped_unicode(str):
        str = str.decode("unicode-escape")
    print str
Current output:
"A\u0026B"
"Война́ и миръ"
Expected output:
"A&B"
"Война́ и миръ"
How do I define is_escaped_unicode(str) to determine if the string that's passed is actually escaped unicode?
You cannot.
There is no way to tell if '"A\u0026B"' originally came from some text that was encoded, or if the data are just the bytes '"A\u0026B"', or if we arrived there from some other encoding.
"How do ... you know whether or not to run .decode("unicode-escape")"
You have to know whether someone earlier has called text.encode('unicode-escape'). The bytes themselves cannot tell you.
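To make the ambiguity concrete, here is a small sketch. It is written for Python 3, which is an assumption on my part since the question uses Python 2, and the variable names are purely illustrative. Two different texts, handled in two different ways, end up as identical bytes:

```python
# Python 3 sketch; all names here are hypothetical, chosen for illustration.

# Case 1: the text is a single Cyrillic letter, and someone escaped it.
escaped_letter = 'Ж'.encode('unicode-escape')

# Case 2: the text really is the six characters \ u 0 4 1 6,
# stored as plain ASCII with no escaping step involved.
literal_backslash = '\\u0416'.encode('ascii')

# Both routes yield exactly the same byte string.
print(escaped_letter)
print(literal_backslash)
same_bytes = escaped_letter == literal_backslash
print(same_bytes)
```

Once the two byte strings are equal, no function that inspects only the bytes can recover which case it was given.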
You can certainly guess, by looking for \u or \U escape sequences, or by wrapping the decoding in try/except and seeing what happens, but I don't recommend going down this route.
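If you do decide to guess anyway, the "look for \u or \U sequences" idea might be sketched roughly as follows. This is Python 3 (an assumption, the question is Python 2), the helper names looks_escaped and decode_escapes are made up for this example, and it remains a heuristic with all the caveats above:

```python
import re

# Matches a literal backslash followed by u + 4 hex digits or U + 8 hex digits.
_ESCAPE_RE = re.compile(r'\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}')

def looks_escaped(text):
    # Guess only: True if the text contains something shaped like an escape.
    return _ESCAPE_RE.search(text) is not None

def _replace(match):
    # Convert the hex digits after "\u" or "\U" to the character they name.
    try:
        return chr(int(match.group(0)[2:], 16))
    except (ValueError, OverflowError):
        # Not a valid code point: leave the sequence untouched.
        return match.group(0)

def decode_escapes(text):
    # Replace only the matched sequences; all other characters are kept as-is.
    return _ESCAPE_RE.sub(_replace, text)

for s in ['"A\\u0026B"', '"Война́ и миръ"']:
    print(decode_escapes(s) if looks_escaped(s) else s)
```

Replacing only the matched sequences, rather than decoding the whole string, avoids mangling non-ASCII text that happens to sit next to an escape. It still cannot distinguish a genuinely escaped string from text that legitimately contains a backslash followed by "u0026".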
If you encounter a bytestring in your application, and you don't already know what the encoding is, then your problem lies elsewhere and should be fixed elsewhere.
# -*- coding: utf-8 -*-
# No u prefix here: in a unicode literal, \u0026 would already be turned into &.
str_escaped = '"A\u0026B"'
str_unicode = '"Война́ и миръ"'
arr_all_strings = [str_escaped, str_unicode]
def is_ascii(s):
    # True if every character (byte, in Python 2) is in the ASCII range
    return all(ord(c) < 128 for c in s)
def is_escaped_unicode(str):
    # escaped unicode is pure ASCII
    return is_ascii(str)
for str in arr_all_strings:
    if is_escaped_unicode(str):
        str = str.decode("unicode-escape")
    print str
The code above will work for your case.
Explanation:
Every character in str_escaped is in the ASCII range.
str_unicode contains characters outside the ASCII range.
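One caveat worth keeping in mind: an ASCII check treats every pure-ASCII string as escaped, including strings that were never escaped by anyone. The following Python 3 sketch (an assumption, the thread is Python 2; the Windows-style path is an invented example) shows how that can silently alter data:

```python
def is_ascii(s):
    # True if every character is in the ASCII range.
    return all(ord(c) < 128 for c in s)

# Plain ASCII text containing backslashes; nobody ever escaped it.
path = 'C:\\temp\\new'
flagged = is_ascii(path)

# Decoding it anyway turns \t into a tab and \n into a newline.
decoded = path.encode('ascii').decode('unicode-escape')

print(flagged)
print(repr(path))
print(repr(decoded))
```

So the check is sufficient for the two sample strings in the question, but it is not a general test for "escaped unicode".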
Here's a crude way to do it. Try decoding as unicode-escape, and if that succeeds the resulting string will be shorter than the original string.
str_escaped = '"A\u0026B"'
str_unicode = '"Война́ и миръ"'
arr_all_strings = [str_escaped, str_unicode]
def decoder(s):
    y = s.decode('unicode-escape')
    # a shorter result means at least one escape sequence was collapsed
    return y if len(y) < len(s) else s.decode('utf8')
for s in arr_all_strings:
    print s, decoder(s)
Output:
"A\u0026B" "A&B"
"Война́ и миръ" "Война́ и миръ"
But seriously, you'll save yourself a lot of pain if you can migrate to Python 3. And if you can't immediately migrate to Python 3, you may find this article helpful: Pragmatic Unicode, which was written by SO veteran Ned Batchelder.
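For readers already on Python 3, the same crude length heuristic might look roughly like this. It is a sketch under the assumption that the input arrives as bytes (in Python 3, str has no .decode method), and it is just as much a guess as the Python 2 version:

```python
def decoder(raw):
    # unicode-escape maps each non-escape byte to one character, so the
    # result is shorter only when an escape sequence was collapsed.
    candidate = raw.decode('unicode-escape')
    return candidate if len(candidate) < len(raw) else raw.decode('utf-8')

samples = [
    b'"A\\u0026B"',                    # ASCII bytes with a literal \u0026
    '"Война́ и миръ"'.encode('utf-8'),  # UTF-8 encoded Cyrillic text
]
for raw in samples:
    print(decoder(raw))
```

The explicit bytes/str split in Python 3 is what makes the underlying problem easier to avoid: text is decoded once at the boundary, where the encoding is known, and no guessing is needed afterwards.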