Comparing unicode with unicode in Python

Question

I am trying to count the number of identical words in an Urdu document that is saved as UTF-8.

For example, I have a document containing three identical words separated by spaces:

خُداوند خُداوند خُداوند

I tried to count the words by reading the file with the following code:

        file_obj = codecs.open(path,encoding="utf-8")
        lst = repr(file_obj.readline()).split(" ")
        word = lst[0]
        count =0
        for w in lst:
            if word == w:
                count += 1
        print count

But the value of count I get is 1, while I should get 3.

How does one compare Unicode strings?

Answer 1

Remove the repr() from your code. Use repr() only to create debug output; you are turning a unicode value into a string that can be pasted back into the interpreter.

This means your line from the file is now stored as:

>>> repr(u'خُداوند خُداوند خُداوند\n').split(" ")
["u'\\u062e\\u064f\\u062f\\u0627\\u0648\\u0646\\u062f", '\\u062e\\u064f\\u062f\\u0627\\u0648\\u0646\\u062f', "\\u062e\\u064f\\u062f\\u0627\\u0648\\u0646\\u062f\\n'"]

Note the double backslashes (escaped unicode escapes), and that the first string starts with u' and the last string ends with \\n'. These values are obviously never equal.

Remove the repr(), and use .split() without arguments to remove the trailing whitespace too:

lst = file_obj.readline().split()

and your code will work:

>>> res = u'خُداوند خُداوند خُداوند\n'.split()
>>> res[0] == res[1] == res[2]
True
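Putting the fix together, the counting loop from the question could look roughly like the sketch below. The function name count_first_word and the temporary sample file are made up for illustration, and io.open is used in place of codecs.open because it behaves the same way for this purpose on Python 2 and 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def count_first_word(path):
    """Count how often the first word of the first line occurs on that line."""
    with io.open(path, encoding="utf-8") as file_obj:
        # No repr(): split the decoded text itself. split() without
        # arguments also discards the trailing newline.
        words = file_obj.readline().split()
    if not words:
        return 0
    first = words[0]
    return sum(1 for w in words if w == first)


# Hypothetical sample file holding the word from the question three times.
word = u"\u062e\u064f\u062f\u0627\u0648\u0646\u062f"
fd, sample_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(sample_path, "w", encoding="utf-8") as out:
    out.write(u" ".join([word] * 3) + u"\n")

count = count_first_word(sample_path)
print(count)
os.remove(sample_path)
```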

You may need to normalize the input first; some characters can be expressed either as one unicode codepoint or as a base codepoint followed by one or more combining codepoints. Normalizing moves all such characters to a composed or decomposed state. See Normalizing Unicode.
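As a small illustration of why normalization can matter, the sketch below compares two spellings of the same visible character using unicodedata.normalize from the standard library. The Latin letter é is used only because it is an easy example; the helper name nfc is made up here, and whether your Urdu text actually needs this step depends on how it was typed.

```python
import unicodedata

composed = u"\u00e9"      # "é" as a single code point
decomposed = u"e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two strings render the same but hold different code points.
print(composed == decomposed)


def nfc(text):
    """Return text in Normalization Form C (composed)."""
    return unicodedata.normalize("NFC", text)


# After normalizing both sides to the same form, they compare equal.
print(nfc(composed) == nfc(decomposed))
```

Normalizing every word to the same form (NFC or NFD, as long as it is used consistently) before comparing avoids this kind of mismatch.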

Answer 2

Try removing the repr():

lst = file_obj.readline().split(" ")

The point is that you should at least print variables like lst and w to see what they are.
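One way to do that kind of inspection is sketched below: it prints each token with its length, once for the repr() version and once without. The variable names broken and fixed are made up for this example, and the exact escapes shown for each token differ between Python 2 and Python 3.

```python
word = u"\u062e\u064f\u062f\u0627\u0648\u0646\u062f"
line = u" ".join([word] * 3) + u"\n"

broken = repr(line).split(" ")  # what the question's code compares
fixed = line.split()            # the text itself, split on whitespace

for label, tokens in (("with repr()", broken), ("without repr()", fixed)):
    print(label)
    for token in tokens:
        # %r shows quotes, escapes and stray newlines that a plain
        # print would hide.
        print("  %d chars: %r" % (len(token), token))
```

With repr(), the first token carries the opening quote and the last one the escaped newline and closing quote, so the three tokens all differ; without it, all three are identical.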

Answer 3

Comparing unicode strings in Python:

a = u'Artur'
print(a)
b = u'\u0041rtur'
print(b)

if a == b:
    print('the same')

result:

Artur
Artur
the same

Answer 1
3 accepted 2013-11-03 10:28:56

Answer 2
1 2013-11-03 10:16:22

Answer 3
0 2013-11-03 10:21:03