How can I compare a unicode type to a string in Python?

I am trying to use a list comprehension that compares string objects, but one of the strings is UTF-8, a byproduct of json.loads. Scenario:

us = u'MyString' # is the utf-8 string

Part one of my question: why does this return False?

us.encode('utf-8') == "MyString" ## False

Part two - how can I compare within a list comprehension?

myComp = [utfString for utfString in jsonLoadsObj
           if utfString.encode('utf-8') == "MyString"] #wrapped to read on S.O.

EDIT: I'm using Google App Engine, which uses Python 2.7.

Here's a more complete example of the problem:

#json coming from remote server:
#response object looks like:  {"number1":"first", "number2":"second"}

data = json.loads(response)
k = data.keys()

# I need something like:
myList = [item for item in k if item=="number1"]  

#### I thought this would work:
myList = [item for item in k if item.encode('utf-8')=="number1"]

You must be looping over the wrong data set. Just loop directly over the JSON-loaded dictionary; there is no need to call .keys() first:

data = json.loads(response)
myList = [item for item in data if item == "number1"]  

You may want to use u"number1" to avoid implicit conversions between Unicode and byte strings:

data = json.loads(response)
myList = [item for item in data if item == u"number1"]  

Both versions work fine:

>>> import json
>>> data = json.loads('{"number1":"first", "number2":"second"}')
>>> [item for item in data if item == "number1"]
[u'number1']
>>> [item for item in data if item == u"number1"]
[u'number1']
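
For readers on Python 3, the same lookup can be sketched as follows. This is an illustration under the assumption of Python 3, where json.loads returns keys as str and no u prefix is needed; the payload is the sample from the question.

```python
import json

# Sample payload taken from the question.
response = '{"number1": "first", "number2": "second"}'
data = json.loads(response)

# Iterating over a dict yields its keys, so calling .keys() is optional.
matches = [key for key in data if key == "number1"]
print(matches)  # ['number1']

# When only presence matters, a membership test is simpler.
has_number1 = "number1" in data
```

If the goal is just to check whether a key exists, the membership test avoids building a list at all.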

Note that in your first example, us is not a UTF-8 string; it is Unicode data that the json library has already decoded for you. A UTF-8 string, on the other hand, is a sequence of encoded bytes. You may want to read up on Unicode and Python to understand the difference.
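
The distinction between Unicode text and UTF-8 bytes can be seen in a short sketch. It assumes Python 3 (where str is Unicode text and bytes is the encoded form), and the sample word is chosen here purely for illustration.

```python
text = "caf\u00e9"              # four Unicode code points
encoded = text.encode("utf-8")  # five bytes: the accented e takes two

print(len(text), len(encoded))  # 4 5
print(encoded)                  # b'caf\xc3\xa9'

# Decoding with the same codec recovers the original text.
decoded = encoded.decode("utf-8")
print(decoded == text)          # True
```

The text and its encoded form have different lengths and different types, which is why they should not be treated as interchangeable.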

On Python 2, your expectation that your test returns True would be correct; you are doing something else wrong:

>>> us = u'MyString'
>>> us
u'MyString'
>>> type(us)
<type 'unicode'>
>>> us.encode('utf8') == 'MyString'
True
>>> type(us.encode('utf8'))
<type 'str'>

There is no need to encode the strings to UTF-8 to make comparisons; use unicode literals instead:

myComp = [elem for elem in json_data if elem == u"MyString"]

You are trying to compare a string of bytes ('MyString') with a string of Unicode code points (u'MyString'). This is an "apples and oranges" comparison. Unfortunately, Python 2 pretends in some cases that this comparison is valid, instead of always returning False:

>>> u'MyString' == 'MyString'  # in my opinion should be False
True

It's up to you as the designer/developer to decide what the correct comparison should be. Here is one possible way:

a = u'MyString'
b = 'MyString'
a.encode('UTF-8') == b  # True

I recommend the above instead of a == b.decode('UTF-8') because all u'' style strings can be encoded into bytes with UTF-8, except possibly in some bizarre cases, but not all byte strings can be decoded to Unicode that way.
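
The asymmetry between encoding and decoding can be sketched as follows. This assumes Python 3 syntax; equals_utf8 is a hypothetical helper named here for illustration, not part of any library.

```python
def equals_utf8(text, raw):
    """Compare text with bytes by encoding the text as UTF-8 (hypothetical helper)."""
    return text.encode("utf-8") == raw

valid = b"MyString"
invalid = b"\x97"  # a lone continuation byte, not valid UTF-8 on its own

print(equals_utf8("MyString", valid))    # True
print(equals_utf8("MyString", invalid))  # False, and no exception is raised

# Decoding the bytes instead would raise on the invalid input.
try:
    invalid.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(decode_failed)  # True
```

Encoding the text side means arbitrary bytes on the other side simply compare unequal rather than raising.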

But if you choose to do a UTF-8 encode of the Unicode strings before comparing, that will fail for something like this on a Windows system: u'Em dashes\u2014are cool'.encode('UTF-8') == 'Em dashes\x97are cool'. But if you .encode('Windows-1252') instead, it would succeed. That's why it's an apples and oranges comparison.
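
The codec mismatch can be checked directly. The sketch below assumes Python 3 and writes the em dash as the escape \u2014; cp1252 is Python's name for the Windows-1252 codec.

```python
text = "Em dashes\u2014are cool"

utf8_bytes = text.encode("utf-8")
cp1252_bytes = text.encode("cp1252")  # Windows-1252

print(utf8_bytes)    # b'Em dashes\xe2\x80\x94are cool'
print(cp1252_bytes)  # b'Em dashes\x97are cool'

# Bytes as they might arrive from a Windows-1252 source.
windows_bytes = b"Em dashes\x97are cool"
print(utf8_bytes == windows_bytes)    # False
print(cp1252_bytes == windows_bytes)  # True
```

The same text produces different bytes under each codec, so a byte-level comparison is only meaningful when both sides are known to use the same encoding.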

I'm assuming you're using Python 3. us.encode('utf-8') == "MyString" returns False because the str.encode() function returns a bytes object:

In [2]: us.encode('utf-8')
Out[2]: b'MyString'

In Python 3, strings are already Unicode, so the u prefix in u'MyString' is superfluous.
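
A minimal sketch of the Python 3 behaviour described in this answer, assuming a default interpreter (run without the -b flag, which would add a warning to the mixed comparison):

```python
us = "MyString"
encoded = us.encode("utf-8")

print(type(encoded).__name__)  # bytes

# In Python 3, bytes and str never compare equal.
mixed = encoded == "MyString"
print(mixed)                   # False

# Compare like with like: bytes to bytes, or text to text.
same_bytes = encoded == b"MyString"
same_text = encoded.decode("utf-8") == "MyString"
print(same_bytes, same_text)   # True True
```

If the data is already text, as it is after json.loads, comparing it to a plain string literal without any encode step is the simplest option.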
