How can I compare a unicode type to a string in Python?
I am trying to use a list comprehension that compares string objects, but one of the strings is utf-8, the byproduct of json.loads. Scenario:
us = u'MyString' # is the utf-8 string
Part one of my question: why does this return False?
us.encode('utf-8') == "MyString" ## False
Part two: how can I compare within a list comprehension?
myComp = [utfString for utfString in jsonLoadsObj
if utfString.encode('utf-8') == "MyString"] #wrapped to read on S.O.
EDIT: I'm using Google App Engine, which uses Python 2.7.
Here's a more complete example of the problem:
#json coming from remote server:
#response object looks like: {"number1":"first", "number2":"second"}
data = json.loads(response)
k = data.keys()
I need something like:
myList = [item for item in k if item=="number1"]
#### I thought this would work:
myList = [item for item in k if item.encode('utf-8')=="number1"]
You must be looping over the wrong data set; just loop directly over the JSON-loaded dictionary, there is no need to call .keys() first:
data = json.loads(response)
myList = [item for item in data if item == "number1"]
You may want to use u"number1" to avoid implicit conversions between Unicode and byte strings:
data = json.loads(response)
myList = [item for item in data if item == u"number1"]
Both versions work fine:
>>> import json
>>> data = json.loads('{"number1":"first", "number2":"second"}')
>>> [item for item in data if item == "number1"]
[u'number1']
>>> [item for item in data if item == u"number1"]
[u'number1']
Note that in your first example, us is not a UTF-8 string; it is unicode data, and the json library has already decoded it for you. A UTF-8 string, on the other hand, is a sequence of encoded bytes. You may want to read up on Unicode and Python to understand the difference:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
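As a rough illustration of that difference, the sketch below encodes a short Unicode string to UTF-8 and decodes it again. The variable names are made up for this example, and the u'' and b'' literals are used so it should run on Python 2.7 as well as Python 3.3+:

```python
# Unicode text: a sequence of code points.
text = u'caf\u00e9'

# UTF-8 data: the same text encoded into bytes.
encoded = text.encode('utf-8')

# Four code points, but five bytes, because U+00E9 takes two bytes in UTF-8.
text_length = len(text)
byte_length = len(encoded)

# Decoding the bytes gives back the original text.
round_trip = encoded.decode('utf-8')
```

The two lengths differ, which is a quick way to see that the encoded bytes and the decoded text are not the same kind of object.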
On Python 2, your expectation that your test returns True would be correct; you are doing something else wrong:
>>> us = u'MyString'
>>> us
u'MyString'
>>> type(us)
<type 'unicode'>
>>> us.encode('utf8') == 'MyString'
True
>>> type(us.encode('utf8'))
<type 'str'>
There is no need to encode the strings to UTF-8 to make comparisons; use unicode literals instead:
myComp = [elem for elem in json_data if elem == u"MyString"]
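A self-contained version of that comprehension might look like the following. The JSON payload here is invented purely for illustration (the original json_data is not shown), and the u"" literal is accepted by Python 2 and by Python 3.3+:

```python
import json

# Hypothetical payload: a JSON array of strings standing in for json_data.
json_data = json.loads('["MyString", "Other", "MyString"]')

# json.loads already returns text, so compare against a text literal directly.
myComp = [elem for elem in json_data if elem == u"MyString"]
```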
You are trying to compare a string of bytes ('MyString') with a string of Unicode code points (u'MyString'). This is an "apples and oranges" comparison. Unfortunately, Python 2 pretends in some cases that this comparison is valid (it implicitly decodes the byte string using the default ASCII codec), instead of always returning False:
>>> u'MyString' == 'MyString' # in my opinion should be False
True
It's up to you as the designer/developer to decide what the correct comparison should be. Here is one possible way:
a = u'MyString'
b = 'MyString'
a.encode('UTF-8') == b # True
I recommend the above instead of a == b.decode('UTF-8') because all u'' style strings can be encoded into bytes with UTF-8, except possibly in some bizarre cases, but not all byte strings can be decoded to Unicode that way.
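To make the decoding risk concrete, here is a minimal sketch with a hypothetical helper, decodes_as_utf8, that reports whether a byte string is valid UTF-8. The sample bytes are assumptions chosen for the example: b'caf\xe9' is "café" encoded as Latin-1, which is not valid UTF-8.

```python
def decodes_as_utf8(raw):
    """Return True if the byte string `raw` is valid UTF-8 (illustrative helper)."""
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True

# Pure ASCII bytes are always valid UTF-8.
valid = decodes_as_utf8(b'MyString')

# 0xE9 starts a multi-byte UTF-8 sequence, but no continuation bytes follow.
invalid = decodes_as_utf8(b'caf\xe9')
```

A comparison written as a == b.decode('UTF-8') would raise on the second input rather than simply evaluate to False.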
But if you choose to UTF-8 encode the Unicode strings before comparing, the comparison will fail for something like this on a Windows system: u'Em dashes\u2014are cool'.encode('UTF-8') == 'Em dashes\x97are cool'. If you .encode('Windows-1252') instead, it would succeed. That's why it's an apples and oranges comparison.
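The effect can be reproduced with a short sketch. It assumes the byte string came from a Windows-1252 source, where the em dash (U+2014) is the single byte 0x97. The variable names are illustrative only:

```python
# Unicode text containing an em dash (U+2014).
text = u'Em dashes\u2014are cool'

# The same sentence as Windows-1252 bytes: the em dash is the single byte 0x97.
windows_bytes = b'Em dashes\x97are cool'

# UTF-8 encodes U+2014 as three bytes (E2 80 94), so the byte strings differ.
utf8_matches = text.encode('utf-8') == windows_bytes

# Encoding with the codec that produced the bytes makes them equal.
cp1252_matches = text.encode('windows-1252') == windows_bytes
```

Which codec is correct depends on where the bytes came from, so the comparison only makes sense once that is known.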
I'm assuming you're using Python 3. us.encode('utf-8') == "MyString" returns False because the str.encode() function is returning a bytes object:
In [2]: us.encode('utf-8')
Out[2]: b'MyString'
In Python 3, strings are already Unicode, so the u prefix in u'MyString' is superfluous.
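Assuming Python 3, a minimal sketch of the comparisons involved could look like this. The variable names are made up for the example, and the behaviour in the first comparison is specific to Python 3:

```python
us = 'MyString'               # already Unicode text on Python 3
encoded = us.encode('utf-8')  # a bytes object

# On Python 3, bytes and str never compare equal.
bytes_vs_str = encoded == 'MyString'

# Comparing like with like works: bytes against bytes, or text against text.
bytes_vs_bytes = encoded == b'MyString'
text_vs_text = encoded.decode('utf-8') == 'MyString'
```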