How do I compare a unicode type to a string in Python?

Question

I am trying to use a list comprehension that compares string objects, but one of the strings is UTF-8, a byproduct of json.loads. Scenario:

us = u'MyString' # is the utf-8 string

Part one of my question: why does this return False?

us.encode('utf-8') == "MyString" ## False

Part two: how can I compare within a list comprehension?

myComp = [utfString for utfString in jsonLoadsObj
           if utfString.encode('utf-8') == "MyString"] #wrapped to read on S.O.

EDIT: I'm using Google App Engine, which uses Python 2.7.

Here's a more complete example of the problem:

#json coming from remote server:
#response object looks like:  {"number1":"first", "number2":"second"}

data = json.loads(response)
k = data.keys()

# I need something like:
myList = [item for item in k if item=="number1"]

#### I thought this would work:
myList = [item for item in k if item.encode('utf-8')=="number1"]

Answer 1

You must be looping over the wrong data set; just loop directly over the JSON-loaded dictionary. There is no need to call .keys() first:

data = json.loads(response)
myList = [item for item in data if item == "number1"]

You may want to use u"number1" to avoid implicit conversions between Unicode and byte strings:

data = json.loads(response)
myList = [item for item in data if item == u"number1"]

Both versions work fine:

>>> import json
>>> data = json.loads('{"number1":"first", "number2":"second"}')
>>> [item for item in data if item == "number1"]
[u'number1']
>>> [item for item in data if item == u"number1"]
[u'number1']

Note that in your first example, us is not a UTF-8 string; it is unicode data, which the json library has already decoded for you. A UTF-8 string, on the other hand, is a sequence of encoded bytes. You may want to read up on Unicode and Python to understand the difference:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
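To make the distinction between decoded text and encoded bytes concrete, here is a minimal sketch. It assumes Python 3, where the text type is called str and the byte type is called bytes (on Python 2 the corresponding types are unicode and str). The variable names key, encoded and decoded are only illustrative.

```python
import json

# json.loads hands back decoded text, not UTF-8 bytes.
data = json.loads('{"number1": "first", "number2": "second"}')
key = sorted(data)[0]  # the key u'number1', as text

# Encoding turns text into bytes; decoding turns those bytes back into text.
encoded = key.encode('utf-8')
decoded = encoded.decode('utf-8')

print(type(key).__name__, type(encoded).__name__)
print(decoded == key)
```

The round trip ends with the same text it started from, which is why encoding before a comparison adds nothing when both sides are already text.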

On Python 2, your expectation that your test returns True would be correct; you are doing something else wrong:

>>> us = u'MyString'
>>> us
u'MyString'
>>> type(us)
<type 'unicode'>
>>> us.encode('utf8') == 'MyString'
True
>>> type(us.encode('utf8'))
<type 'str'>

There is no need to encode the strings to UTF-8 to make comparisons; use unicode literals instead:

myComp = [elem for elem in json_data if elem == u"MyString"]

Answer 2

You are trying to compare a string of bytes ( 'MyString' ) with a string of Unicode code points ( u'MyString' ). This is an "apples and oranges" comparison. Unfortunately, Python 2 pretends in some cases that this comparison is valid, instead of always returning False:

>>> u'MyString' == 'MyString'  # in my opinion should be False
True

It's up to you as the designer/developer to decide what the correct comparison should be. Here is one possible way:

a = u'MyString'
b = 'MyString'
a.encode('UTF-8') == b  # True

I recommend the above instead of a == b.decode('UTF-8') because all u'' style strings can be encoded into bytes with UTF-8, except possibly in some bizarre cases, but not all byte strings can be decoded to Unicode that way.
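The asymmetry described above can be checked directly. The following is a small sketch, not part of the original answer: the helper name can_decode_utf8 is hypothetical, and the byte 0xff is chosen only because it never occurs in valid UTF-8.

```python
def can_decode_utf8(raw):
    """Return True if the byte string raw is valid UTF-8 (hypothetical helper)."""
    try:
        raw.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

a = u'MyString'
b = b'MyString'

# Encoding the text side works for this string and the comparison succeeds.
print(a.encode('utf-8') == b)

# Decoding the byte side depends on the bytes actually being UTF-8.
print(can_decode_utf8(b))
print(can_decode_utf8(b'\xff'))
```

Decoding arbitrary bytes inside a comparison can therefore raise, whereas encoding the text side of an ordinary string will not.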

But if you choose to do a UTF-8 encode of the Unicode strings before comparing, that will fail for something like this on a Windows system: u'Em dashes\u2014are cool'.encode('UTF-8') == 'Em dashes\x97are cool' . If you .encode('Windows-1252') instead, it would succeed. That's why it's an apples and oranges comparison.
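The em dash case can be reproduced as a short sketch. It uses the codec name 'cp1252', which is Python's name for Windows-1252, and b'' literals so that it behaves the same on Python 3; the variable names are illustrative.

```python
text = u'Em dashes\u2014are cool'

# The same text yields different bytes under different encodings.
as_utf8 = text.encode('utf-8')
as_cp1252 = text.encode('cp1252')

# Bytes as a Windows-1252 source would have produced them.
windows_bytes = b'Em dashes\x97are cool'

print(as_utf8 == windows_bytes)                # False
print(as_cp1252 == windows_bytes)              # True
print(windows_bytes.decode('cp1252') == text)  # True
```

A byte string only compares meaningfully with text once the encoding that produced the bytes is known.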

Answer 3

I'm assuming you're using Python 3. us.encode('utf-8') == "MyString" returns False because the str.encode() function returns a bytes object:

In [2]: us.encode('utf-8')
Out[2]: b'MyString'

In Python 3, strings are already Unicode, so the u in u'MyString' is superfluous.
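A minimal sketch of the Python 3 behaviour described in this answer follows. It assumes Python 3 run without the -b flag, which would otherwise turn the first comparison into a warning or an error.

```python
us = 'MyString'
encoded = us.encode('utf-8')  # a bytes object on Python 3

print(encoded == 'MyString')                  # False: bytes never equal str
print(encoded == b'MyString')                 # True: bytes compared with bytes
print(encoded.decode('utf-8') == 'MyString')  # True: text compared with text
```

On Python 3 the comparison works once both operands are the same type, either both bytes or both str.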

3 answers

Answer 1
21, accepted, 2013-05-09 21:29:53

Answer 2
12, 2013-05-09 21:34:28

Answer 3
3, 2013-05-09 21:27:46
