Comparing user input unicode strings in Python 2.7
What is the best way to compare a string entered by the user with another string?
For example:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
user_input = raw_input("Please, write árido: ").decode("utf8")
if u"árido" == user_input:
    print "OK"
else:
    print "FALSE"
EDIT:
This
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from unicodedata import normalize
import sys
uinput2 = "árbol"
uinput = raw_input("type árbol: ")
print "Encoding %s" % sys.stdout.encoding
print "User Input \t\tProgram Input"
print "-"*50
print "%s \t\t\t%s \t(raw value)" % (uinput, uinput2)
print "%s \t\t\t%s \t(unicode(value))" % (unicode(uinput), unicode(uinput2))
print "%s \t\t\t%s \t(value.decode('utf8'))" % (uinput.decode("utf-8"), uinput2.decode("utf-8"))
print "%s \t\t\t%s \t(normalize('NFC',value))" % (normalize("NFC",uinput.decode("utf-8")), normalize("NFC",uinput2.decode("utf-8")));
print "\n\nUser Input \t\tProgram Input (Repr)"
print "-"*50
print "%s \t%s" % (repr(uinput),repr(uinput2))
print "%s \t%s \t(unicode(value))" % (repr(unicode(uinput)), repr(uinput2))
print "%s \t%s \t(value.decode('utf8'))" % (repr(uinput.decode("utf-8")), repr(uinput2.decode("utf-8")))
print "%s \t%s \t(normalize('NFC',value)))" % (repr(normalize("NFC",uinput.decode("utf-8"))), repr(normalize("NFC",uinput2.decode("utf-8"))));
prints:
type árbol: árbol
Encoding utf-8
User Input Program Input
--------------------------------------------------
árbol árbol (raw value)
árbol árbol (unicode(value))
árbol árbol (value.decode('utf8'))
árbol árbol (normalize('NFC',value))
User Input Program Input (Repr)
--------------------------------------------------
'\xc3\x83\xc2\xa1rbol' u'\xe1rbol'
u'\xc3\xa1rbol' u'\xe1rbol' (unicode(value))
u'\xc3\xa1rbol' u'\xe1rbol' (value.decode('utf8'))
u'\xc3\xa1rbol' u'\xe1rbol' (normalize('NFC',value)))
Any idea? I don't have problems when I work with other languages like Java. This only happens to me with Python. I'm using Eclipse.
Thanks in advance :)
Can you check the character encoding of your terminal?
import sys
sys.stdin.encoding
If it is UTF-8, then decode("utf8") should be fine. Otherwise, you have to decode the raw_input() result with the right encoding.
For example, use raw_input().decode(sys.stdin.encoding), and apply Unicode normalization as well if needed.
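The decoding step described above can be sketched as a small helper. This is only a sketch under assumptions: `decode_input` and its `fallback` parameter are hypothetical names, not part of any library, and falling back to UTF-8 is an assumption for the case where `sys.stdin.encoding` is None (which can happen in Python 2 when input is piped). The snippet is written so that it runs on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import sys


def decode_input(raw, encoding=None, fallback="utf-8"):
    """Decode the byte string returned by raw_input() into unicode.

    decode_input and fallback are illustrative names (assumptions).
    """
    if not isinstance(raw, bytes):
        # Already text (for example from Python 3's input()); nothing to do.
        return raw
    if encoding is None:
        # sys.stdin.encoding can be None, e.g. when stdin is a pipe.
        encoding = getattr(sys.stdin, "encoding", None) or fallback
    return raw.decode(encoding)


utf8_bytes = "árbol".encode("utf-8")   # the bytes a UTF-8 terminal would send

good = decode_input(utf8_bytes, "utf-8")
bad = decode_input(utf8_bytes, "latin-1")   # deliberately the wrong codec

print(good == "árbol")       # True: the right codec round-trips
print(len(good), len(bad))   # 5 6: the wrong codec splits 'á' into two characters
```

Decoding the same bytes as Latin-1 yields u'\xc3\xa1rbol', which is the value shown in the repr table in the question, so a codec mismatch somewhere between the console and the script is one plausible explanation for the failed comparison.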
Your current approach isn't bad, but you should probably use unicodedata.normalize() for the comparison. The unicodedata documentation explains why this is a good idea. For example, try evaluating the following:
u'C\u0327' == u'\xc7'  # both sides render as Ç
Spoiler alert: this will give you False, because the left side is the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA), and the right side is the single character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA).
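To make the difference between the two sides visible, the code points can be listed with `unicodedata.name()`. This is a minimal sketch; the variable names are illustrative, and both strings are written with escapes so that copying the snippet cannot silently change their form.

```python
from __future__ import print_function, unicode_literals

import unicodedata

decomposed = "C\u0327"   # LATIN CAPITAL LETTER C + COMBINING CEDILLA
composed = "\u00c7"      # single precomposed character

for label, text in (("decomposed", decomposed), ("composed", composed)):
    names = [unicodedata.name(ch) for ch in text]
    print(label, len(text), names)

# The raw comparison fails; comparing in a common normalized form succeeds.
print(decomposed == composed)                                  # False
print(unicodedata.normalize("NFC", decomposed) == composed)    # True
print(unicodedata.normalize("NFD", composed) == decomposed)    # True
```

Either form works for comparison as long as both strings are normalized to the same one.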
You can use unicodedata.normalize() to handle this properly by first converting the strings to a normalized form. For example:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals  # must come before any other import
from unicodedata import normalize
user_input = normalize('NFC', raw_input("Please, write árido: ").decode("utf8"))
if normalize('NFC', u"árido") == user_input:
    print "OK"
else:
    print "FALSE"
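The same idea can be wrapped in a small function so that every comparison goes through normalization. This is a sketch only: `same_text` is a hypothetical helper name, and NFC is the default here simply because the example above uses it. It runs on both Python 2.7 and Python 3 and expects strings that have already been decoded.

```python
from __future__ import print_function, unicode_literals

from unicodedata import normalize


def same_text(left, right, form="NFC"):
    """Return True if two unicode strings are equal after normalization.

    same_text is an illustrative name, not a standard-library function.
    Both arguments must already be decoded (unicode, not bytes).
    """
    return normalize(form, left) == normalize(form, right)


typed = "a\u0301rido"     # 'a' + COMBINING ACUTE ACCENT
expected = "\u00e1rido"   # precomposed U+00E1

print(typed == expected)            # False: different code point sequences
print(same_text(typed, expected))   # True: identical after NFC
```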