
Comparing unicode with unicode in python

I am trying to count how many times the same word occurs in an Urdu document that is saved as UTF-8.

For example, I have a document containing 3 identical words separated by spaces:

خُداوند خُداوند خُداوند

I tried to count the words by reading the file with the following code:

        import codecs

        # path is the location of the UTF-8 encoded file
        file_obj = codecs.open(path, encoding="utf-8")
        lst = repr(file_obj.readline()).split(" ")
        word = lst[0]
        count = 0
        for w in lst:
            if word == w:
                count += 1
        print count

But the value of count I get is 1, while I should get 3.

How does one compare Unicode strings?

Remove the repr() from your code. Use repr() only to create debug output; you are turning a unicode value into a string that can be pasted back into the interpreter.

This means your line from the file is now stored as:

>>> repr(u'خُداوند خُداوند خُداوند\n').split(" ")
["u'\\u062e\\u064f\\u062f\\u0627\\u0648\\u0646\\u062f", '\\u062e\\u064f\\u062f\\u0627\\u0648\\u0646\\u062f', "\\u062e\\u064f\\u062f\\u0627\\u0648\\u0646\\u062f\\n'"]

Note the double backslashes (escaped unicode escapes); the first string also starts with u' and the last string ends with \\n'. These values are obviously never equal.

Remove the repr(), and use .split() without arguments to remove the trailing whitespace too:

lst = file_obj.readline().split()

and your code will work:

>>> res = u'خُداوند خُداوند خُداوند\n'.split()
>>> res[0] == res[1] == res[2]
True
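
Putting the fix together, the whole count might look like the sketch below. It is only an illustration under a few assumptions: it uses Python 3 style print(), it opens the file with io.open (also available in Python 2.6+) rather than codecs.open, and the helper name count_first_word and the temporary demo file are hypothetical, made up for this example.

```python
import io
import os
import tempfile


def count_first_word(path):
    """Count how often the first word of the first line occurs on that line."""
    with io.open(path, encoding="utf-8") as file_obj:
        # No repr(): keep the decoded text as-is, and let split() drop
        # the trailing newline together with the spaces.
        words = file_obj.readline().split()
    if not words:
        return 0
    return words.count(words[0])


# Hypothetical demo file holding the three identical words from the question.
word = u"\u062e\u064f\u062f\u0627\u0648\u0646\u062f"
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(demo_path, "w", encoding="utf-8") as out:
    out.write(u" ".join([word] * 3) + u"\n")

count = count_first_word(demo_path)
os.remove(demo_path)
print(count)
```

For counting every distinct word rather than only the first one, collections.Counter(words) would be a natural replacement for words.count().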

You may need to normalize the input first; some characters can be expressed either as one unicode codepoint or as two combining codepoints. Normalizing moves all such characters to a composed or decomposed state. See Normalizing Unicode.
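
A minimal sketch of what normalization changes, using the standard unicodedata module. The letter "é" is used only as a convenient illustration (it is not from the question), and the variable names are made up for this example.

```python
import unicodedata

# "é" written two ways: one precomposed code point vs. "e" + combining accent.
composed = u"\u00e9"
decomposed = u"e\u0301"

# The raw strings differ because their code point sequences differ.
raw_equal = composed == decomposed
print(raw_equal)

# Normalize both sides to the same form (NFC here) before comparing.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
print(nfc_equal)
```

Either NFC (composed) or NFD (decomposed) works for comparison, as long as both sides use the same form.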

Try removing the repr:

lst = file_obj.readline().split(" ")

The point is that you should at least print variables like lst and w to see what they are.
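
A small sketch of that kind of inspection, assuming Python 3, where ascii() produces escaped output similar to Python 2's repr(). The sample line and variable names are made up for illustration. Here the debug view reveals that split(" ") leaves the trailing newline attached to the last word, so the two words do not compare equal.

```python
# Two identical words followed by a newline, as readline() would return them.
line = u"\u062e\u064f\u062f\u0627\u0648\u0646\u062f \u062e\u064f\u062f\u0627\u0648\u0646\u062f\n"

with_space_split = line.split(" ")

# Debug view only: ascii() shows escapes, so stray characters such as the
# trailing newline become visible. Compare the real strings, not these.
debug_view = [ascii(w) for w in with_space_split]
for item in debug_view:
    print(item)

# The comparison itself is done on the original strings.
same = with_space_split[0] == with_space_split[1]
print(same)
```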

Comparing unicode strings in Python:

a = u'Artur'
print(a)
b = u'\u0041rtur'
print(b)

if a == b:
    print('the same')

Result:

Artur
Artur
the same
