Python: intersection of a UTF-8 list and a UTF-8 string

I got this code working with a list of ASCII letters and ASCII strings, but I can't make it work with non-ASCII (UTF-8) characters.

# -*- coding: utf-8 -*-
asa = ["ā","ē","ī","ō","ū","ǖ","Ā","Ē","Ī","Ō","Ū","Ǖ",
"á","é","í","ó","ú","ǘ","Á","É","Í","Ó","Ú","Ǘ",
"ǎ","ě","ǐ","ǒ","ǔ","ǚ","Ǎ","Ě","Ǐ","Ǒ","Ǔ","Ǚ",
"à","è","ì","ò","ù","ǜ","À","È","Ì","Ò","Ù","Ǜ"]
[x.decode('utf-8') for x in asa]
print list(set(asa) & set("ō"))

You need to put your character inside a list. Strings are iterable, and in Python 2 the byte string "ō" is two bytes long in UTF-8, so set("ō") splits it into the separate elements '\xc5' and '\x8d':

>>> list("ō")
['\xc5', '\x8d']
>>> print list(set(asa) & set(["ō"]))
['\xc5\x8d']
>>> print list(set(asa) & set(["ō"]))[0]
ō
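
The same effect can be reproduced in Python 3, where bytes objects play the role of Python 2's str. The following is only a minimal sketch under that assumption; the shortened list `letters` and the names `needle`, `split_up` and `as_item` are made up for illustration and are not part of the question.

```python
# -*- coding: utf-8 -*-
# Python 3 sketch: bytes objects stand in for Python 2's str type.

letters = ["ā", "ō", "ū"]                        # shortened, hypothetical sample list
encoded = [s.encode("utf-8") for s in letters]   # byte strings, as Python 2 stored them

needle = "ō".encode("utf-8")                     # two bytes: 0xC5 0x8D

# Mimic Python 2's set("ō"): iterating the byte string yields one element per byte.
split_up = set(needle[i:i + 1] for i in range(len(needle)))

# Wrapping the byte string in a list keeps it as a single element.
as_item = set([needle])

no_match = set(encoded) & split_up   # single bytes never equal a whole character
match = set(encoded) & as_item       # the whole two-byte string is found

print(no_match)
print([m.decode("utf-8") for m in match])
```

Wrapping the value in a list (or a one-element set) is what makes the set treat it as one item rather than as a sequence of its parts.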

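Assuming the result of the decoding step is actually kept (as written, the list comprehension discards it), your first set contains elements of type unicode, like "ō".decode('utf-8'), which is equivalent to u"ō".

The second set contains byte strings like "ō" (type str), so the elements don't compare equal and the intersection is empty.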

Meditate on this:

>>> 'a' == u'a'
True
>>> 'ō' == u'ō'
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
False
>>> list('ō')
['\xc5', '\x8d']
>>> list(u'ō')
[u'\u014d']
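
One way to avoid the mismatch is to convert everything to text before building the sets. The sketch below uses Python 3, where bytes and str correspond roughly to Python 2's str and unicode; the helper `to_text` and the sample data are hypothetical and only illustrate the idea. In Python 2 the equivalent would be to decode with `.decode('utf-8')` and compare against `u"ō"`.

```python
# -*- coding: utf-8 -*-
# Python 3 sketch: normalize byte strings to text before intersecting.

def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte string."""
    return value.decode(encoding) if isinstance(value, bytes) else value

# Hypothetical mixed input: two byte strings and one text string.
raw_letters = ["ā".encode("utf-8"), "ō".encode("utf-8"), "ū"]
query = "ō"

# Mixed types: the byte string for "ō" is not equal to the text "ō".
mixed = set(raw_letters) & {query}

# Same type on both sides: the intersection finds the character.
normalized = {to_text(x) for x in raw_letters} & {to_text(query)}

print(mixed)
print(normalized)
```

The design choice here is to decode at the boundary and work with one string type throughout, so that equality and hashing behave the same on both sides of the intersection.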
