How to read Unicode input and compare Unicode strings in Python?

I work in Python and would like to read user input (from the command line) as Unicode, i.e. I am looking for a Unicode equivalent of raw_input.

Also, I would like to test Unicode strings for equality, and it looks like a standard == does not work.

raw_input() returns strings as encoded by the OS or UI facilities. The difficulty is knowing which encoding that is. You might attempt the following:

import sys, locale
text= raw_input().decode(sys.stdin.encoding or locale.getpreferredencoding(True))

which should work correctly in most cases.
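As a rough sketch of the same idea, the fallback can be wrapped in two small helpers. The names guess_input_encoding and decode_input are made up for this illustration and are not part of any library. The sketch treats anything that is not a byte string as already decoded, which is what input() returns on Python 3.

```python
import locale
import sys


def guess_input_encoding(stream=None):
    """Return the encoding to assume for bytes read from `stream`."""
    stream = sys.stdin if stream is None else stream
    # A redirected or replaced stream may lack `encoding` or report None,
    # so fall back to the locale's preferred encoding in that case.
    return getattr(stream, "encoding", None) or locale.getpreferredencoding(True)


def decode_input(raw, encoding=None):
    """Decode a byte string read from the terminal into Unicode text."""
    if not isinstance(raw, bytes):
        return raw  # already text, nothing to decode
    return raw.decode(encoding or guess_input_encoding())


# The UTF-8 bytes for "être" decode to the composed form used further down.
print(decode_input(b"\xc3\xaatre", "utf-8") == u"\xeatre")
```

On Python 2 you would pass the result of raw_input() to decode_input; if the guess is wrong for your terminal, pass the encoding explicitly.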

We need more details about the Unicode comparisons that are not working in order to help you. However, it might be a matter of normalization. Consider the following:

>>> a1= u'\xeatre'
>>> a2= u'e\u0302tre'

a1 and a2 are canonically equivalent (they render as the same text) but do not compare equal:

>>> print a1, a2
être être
>>> print a1 == a2
False

So you might want to use the unicodedata.normalize() function:

>>> import unicodedata as ud
>>> ud.normalize('NFC', a1)
u'\xeatre'
>>> ud.normalize('NFC', a2)
u'\xeatre'
>>> ud.normalize('NFC', a1) == ud.normalize('NFC', a2)
True
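If you compare strings in several places, the normalization step can be folded into a small helper. This is only a sketch: unicode_equal is a hypothetical name, and NFC as the default form is an assumption (NFD would work equally well, as long as both sides use the same form).

```python
import unicodedata


def unicode_equal(left, right, form="NFC"):
    """Compare two Unicode strings after normalizing both to `form`."""
    return unicodedata.normalize(form, left) == unicodedata.normalize(form, right)


a1 = u"\xeatre"      # precomposed e with circumflex
a2 = u"e\u0302tre"   # plain e followed by a combining circumflex

print(a1 == a2)               # False
print(unicode_equal(a1, a2))  # True
```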

If you give us more information, we might be able to help you further.

It should work. raw_input returns a byte string, which you must decode using the correct encoding to get your unicode object. For example, the following works for me under Python 2.5 / Terminal.app / OSX:

>>> bytes = raw_input()
日本語 Ελληνικά
>>> bytes
'\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e \xce\x95\xce\xbb\xce\xbb\xce\xb7\xce\xbd\xce\xb9\xce\xba\xce\xac'

>>> uni = bytes.decode('utf-8') # substitute the encoding of your terminal if it's not utf-8
>>> uni
u'\u65e5\u672c\u8a9e \u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac'

>>> print uni
日本語 Ελληνικά

As for comparing unicode strings: can you post an example where the comparison doesn't work?

I'm not really sure which format you mean by "Unicode format"; there are several. UTF-8? UTF-16? In any case, you should be able to read a normal string with raw_input and then decode it using the string's decode method:

raw = raw_input("Please input some funny characters: ")
decoded = raw.decode("utf-8")

If you have a different input encoding, just use "utf-16" or whatever applies instead of "utf-8". Also see the codecs module docs for the different kinds of encodings.
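To illustrate why the encoding name matters, here is a small sketch. decode_with is a hypothetical helper; codecs.lookup is the standard library call that resolves an encoding name and raises LookupError if the name is unknown.

```python
import codecs


def decode_with(raw, encoding):
    """Decode `raw` bytes, failing early if `encoding` is not a known codec."""
    codecs.lookup(encoding)  # raises LookupError for an unknown name
    return raw.decode(encoding)


text = u"\u65e5\u672c\u8a9e"          # the three characters of "Japanese"
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16")   # includes a byte order mark

# Same text, different byte sequences: each must be decoded with its own codec.
print(utf8_bytes != utf16_bytes)
print(decode_with(utf8_bytes, "utf-8") == text)
print(decode_with(utf16_bytes, "utf-16") == text)
```

Decoding with the wrong codec either raises UnicodeDecodeError or silently produces the wrong characters, so it is worth finding out what your terminal really uses.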

Comparing should then work just fine with ==. If you have string literals containing special characters, you should prefix them with "u" to mark them as unicode:

if decoded == u"äöü":
  print "Do you speak German?"

And if you want to output these strings again, you probably want to encode them in the desired encoding first:

print decoded.encode("utf-8")
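If you are not sure the output encoding can represent every character, something like the following sketch may help. encode_for_output is a made-up name, and falling back to UTF-8 when the stream reports no encoding is an assumption on my part. With errors="replace", characters the target encoding cannot represent come out as "?" instead of raising UnicodeEncodeError.

```python
import sys


def encode_for_output(text, encoding=None, errors="replace"):
    """Encode Unicode text into bytes suitable for writing to the terminal."""
    # Assumption: use UTF-8 when stdout does not report an encoding.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, errors)


print(encode_for_output(u"\xe4\xf6\xfc", "utf-8") == b"\xc3\xa4\xc3\xb6\xc3\xbc")
print(encode_for_output(u"\xe4\xf6\xfc", "ascii") == b"???")
```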

In the general case, comparing unicode strings directly is probably not reliable. The problem is that there are several ways to compose the same characters. A simple example is accented roman characters. Although there are code points for basically all of the commonly used accented characters, it is also correct to compose them from unaccented base letters and a non-spacing accent. This issue is more significant in many non-roman alphabets.
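To see the two spellings side by side, you can list the code points that make up each string. A minimal sketch (code_point_names is a hypothetical helper built on the standard unicodedata module):

```python
import unicodedata


def code_point_names(text):
    """Return the Unicode name of every code point in `text`."""
    return [unicodedata.name(ch, "<unnamed>") for ch in text]


composed = u"\xea"       # one code point
decomposed = u"e\u0302"  # base letter plus non-spacing accent

print(code_point_names(composed))
# ['LATIN SMALL LETTER E WITH CIRCUMFLEX']
print(code_point_names(decomposed))
# ['LATIN SMALL LETTER E', 'COMBINING CIRCUMFLEX ACCENT']

# NFD splits the composed form into exactly the decomposed sequence.
print(unicodedata.normalize("NFD", composed) == decomposed)
```

Both strings render identically, which is why a comparison that looks correct on screen can still come out False.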
