
Comparing user input unicode strings in Python 2.7

What is the best way to compare a string entered by the user with another string?

For example:

# -*- coding: utf-8 -*-

from __future__ import unicode_literals

user_input = raw_input("Please, write árido: ").decode("utf8")
if u"árido" == user_input:
    print "OK"
else:
    print "FALSE"

EDIT:

This script

# -*- coding: utf-8 -*-

from __future__ import unicode_literals
from unicodedata import normalize
import sys

uinput2 = "árbol"
uinput = raw_input("type árbol: ")

print "Encoding %s" % sys.stdout.encoding
print "User Input \t\tProgram Input"
print "-"*50
print "%s \t\t\t%s \t(raw value)" % (uinput, uinput2)
print "%s \t\t\t%s \t(unicode(value))" % (unicode(uinput), unicode(uinput2))
print "%s \t\t\t%s \t(value.decode('utf8'))" % (uinput.decode("utf-8"), uinput2.decode("utf-8"))
print "%s \t\t\t%s \t(normalize('NFC',value))" % (normalize("NFC",uinput.decode("utf-8")), normalize("NFC",uinput2.decode("utf-8")));
print "\n\nUser Input \t\tProgram Input (Repr)"
print "-"*50
print "%s \t%s" % (repr(uinput),repr(uinput2))
print "%s \t%s \t(unicode(value))" % (repr(unicode(uinput)), repr(uinput2))
print "%s \t%s \t(value.decode('utf8'))" % (repr(uinput.decode("utf-8")), repr(uinput2.decode("utf-8")))
print "%s \t%s \t(normalize('NFC',value)))" % (repr(normalize("NFC",uinput.decode("utf-8"))), repr(normalize("NFC",uinput2.decode("utf-8"))));

prints:

type árbol: árbol
Encoding utf-8
User Input      Program Input
--------------------------------------------------
árbol          árbol   (raw value)
árbol          árbol   (unicode(value))
árbol          árbol   (value.decode('utf8'))
árbol          árbol   (normalize('NFC',value))


User Input              Program Input (Repr)
--------------------------------------------------
'\xc3\x83\xc2\xa1rbol'  u'\xe1rbol'
u'\xc3\xa1rbol'         u'\xe1rbol'     (unicode(value))
u'\xc3\xa1rbol'         u'\xe1rbol'     (value.decode('utf8'))
u'\xc3\xa1rbol'         u'\xe1rbol'     (normalize('NFC',value)))

Any ideas? I don't have this problem when I work with other languages such as Java; it only happens to me with Python. I'm using Eclipse.

Thanks in advance :)

Can you check the character encoding of your terminal?

import sys

print sys.stdin.encoding

If it is UTF-8, then decode("utf8") should be fine. Otherwise, you have to decode the result of raw_input() with the right encoding.

For example, use raw_input().decode(sys.stdin.encoding) and check whether the result is correct, together with Unicode normalization if needed.
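To make the point about the right encoding concrete, here is a minimal sketch. decode_input is a hypothetical helper name (it is not from the question), and falling back to UTF-8 when sys.stdin.encoding is unset is my own assumption. It only uses byte literals, so it should behave the same on Python 2.7 and Python 3:

```python
import sys


def decode_input(raw, encoding=None):
    """Decode a byte string read from the terminal into unicode text.

    Hypothetical helper: falls back to UTF-8 (an assumption) when no
    encoding is given and sys.stdin.encoding is unset, e.g. piped input.
    """
    if not isinstance(raw, bytes):
        return raw  # already unicode text, nothing to decode
    encoding = encoding or getattr(sys.stdin, "encoding", None) or "utf-8"
    return raw.decode(encoding)


raw = b"\xc3\xa1rido"                 # UTF-8 bytes for the word in the question
right = decode_input(raw, "utf-8")    # decoded with the matching encoding
wrong = decode_input(raw, "latin-1")  # decoded with a mismatched encoding

print(right == u"\xe1rido")  # True
print(wrong == u"\xe1rido")  # False: the two bytes became two separate characters
```

In the question's script this would be called as decode_input(raw_input("...")) instead of hard-coding .decode("utf8").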

Your current approach isn't bad, but you should probably use unicodedata.normalize() for the comparison. The documentation for that function explains why this is a good idea. For example, try evaluating the following:

u'C\u0327' == u'\u00c7'  # both sides render as Ç

Spoiler alert, this will give you False because the left side is the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA), and the right side is the single character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA).
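If you want to see which code points a string really contains, something like the following sketch can help. describe is a hypothetical helper name of my own; unicodedata.name() is the standard library call that looks up each character's Unicode name:

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch)) for ch in text]


# The two strings look identical on screen but differ in their code points.
for label, value in (("decomposed", u"C\u0327"), ("precomposed", u"\u00c7")):
    print(label)
    for code_point, name in describe(value):
        print("  %s %s" % (code_point, name))
```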

You can use unicodedata.normalize() to handle this properly by first converting the strings to a normalized form. For example:

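# -*- coding: utf-8 -*-
# __future__ imports must come before any other import
from __future__ import unicode_literals

from unicodedata import normalize

# Decode the raw bytes, then normalize both sides to NFC before comparing
user_input = normalize('NFC', raw_input("Please, write árido: ").decode("utf8"))
if normalize('NFC', u"árido") == user_input:
    print "OK"
else:
    print "FALSE"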
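If you compare in more than one place, the normalization step can be wrapped in a small function. This is only a sketch: same_text is a hypothetical name, and NFC as the default form is a choice carried over from the example above, not a requirement. It expects unicode strings that have already been decoded:

```python
from unicodedata import normalize


def same_text(a, b, form="NFC"):
    """Return True if two unicode strings are equal after normalization."""
    return normalize(form, a) == normalize(form, b)


decomposed = u"C\u0327"   # C followed by COMBINING CEDILLA
precomposed = u"\u00c7"   # single character C WITH CEDILLA

print(decomposed == precomposed)           # False
print(same_text(decomposed, precomposed))  # True
```

Passing form="NFD" gives the same answer for this pair; what matters is that both sides go through the same form before the comparison.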
