Comparing unicode with unicode in Python
I am trying to count the number of identical words in an Urdu document that is saved as UTF-8.
For example, I have a document containing 3 exactly identical words separated by spaces:
خُداوند خُداوند خُداوند
I tried to count the words by reading the file with the following code:
import codecs

file_obj = codecs.open(path, encoding="utf-8")
lst = repr(file_obj.readline()).split(" ")
word = lst[0]
count = 0
for w in lst:
    if word == w:
        count += 1
print count
But the value of count I am getting is 1, while I should get 3.
How does one compare Unicode strings?
Remove the repr() from your code. Use repr() only to create debug output; it turns a unicode value into a string literal that can be pasted back into the interpreter.
This means your line from the file is now stored as:
>>> repr(u'خُداوند خُداوند خُداوند\n').split(" ")
["u'\\u062e\\u064f\\u062f\\u0627\\u0648\\u0646\\u062f", '\\u062e\\u064f\\u062f\\u0627\\u0648\\u0646\\u062f', "\\u062e\\u064f\\u062f\\u0627\\u0648\\u0646\\u062f\\n'"]
Note the double backslashes (escaped unicode escapes), and that the first string starts with u' while the last string ends with \\n'. These values are obviously never equal.
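A minimal sketch of the effect described above, written so that it should run on Python 2 or Python 3. The helper name count_like_first is made up for illustration. The exact repr() text differs between versions (Python 3 leaves the Urdu letters unescaped), but in both cases the quote characters and the escaped newline end up attached to the first and last tokens.

```python
# The sample line from the question, spelled with escapes so the source stays ASCII.
WORD = u'\u062e\u064f\u062f\u0627\u0648\u0646\u062f'
LINE = WORD + u' ' + WORD + u' ' + WORD + u'\n'


def count_like_first(tokens):
    """Count how many tokens compare equal to the first one (hypothetical helper)."""
    first = tokens[0]
    return sum(1 for token in tokens if token == first)


# repr() wraps the text in quotes and escapes the newline, so the first and
# last tokens carry extra characters and no longer match the middle one.
with_repr = repr(LINE).split(" ")

# Splitting the real text on any whitespace yields three identical words.
without_repr = LINE.split()

print(count_like_first(with_repr))
print(count_like_first(without_repr))
```

Only the second count reflects the words actually present in the line.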
Remove the repr(), and use .split() without arguments to remove the trailing whitespace too:
lst = file_obj.readline().split()
and your code will work:
>>> res = u'خُداوند خُداوند خُداوند\n'.split()
>>> res[0] == res[1] == res[2]
True
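For counting across a whole file rather than a single line, something along these lines may be a reasonable starting point. It is a sketch, not the original poster's code: count_words and the temporary sample file are assumptions made so the example can run on its own, and io.open plus collections.Counter are used in place of codecs.open and the manual loop.

```python
import io
import os
import tempfile
from collections import Counter

WORD = u'\u062e\u064f\u062f\u0627\u0648\u0646\u062f'


def count_words(path):
    """Return a Counter of whitespace-separated words in a UTF-8 file."""
    counts = Counter()
    # io.open decodes the bytes to unicode text on both Python 2 and Python 3.
    with io.open(path, encoding="utf-8") as file_obj:
        for line in file_obj:
            # split() without arguments drops the trailing newline as well.
            counts.update(line.split())
    return counts


# Write a small sample file so the sketch is self-contained.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(sample_path, "w", encoding="utf-8") as out:
    out.write(WORD + u" " + WORD + u" " + WORD + u"\n")

word_counts = count_words(sample_path)
os.remove(sample_path)

print(word_counts[WORD])
```

A Counter keeps a tally for every distinct word, so the same pass also answers the question for words other than the first one.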
You may need to normalize the input first; some characters can be expressed either as a single unicode codepoint or as a base codepoint followed by a combining codepoint. Normalizing moves all such characters to a composed or decomposed state. See Normalizing Unicode.
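As a rough illustration of why normalization can matter, the sketch below compares two spellings of the same letter with the standard unicodedata.normalize function. The helper same_word is hypothetical, and the choice of NFC (composed form) is an assumption; NFD would work equally well as long as both sides use the same form.

```python
import unicodedata

# ALEF WITH MADDA ABOVE as one code point, and as ALEF + combining MADDAH ABOVE.
composed = u'\u0622'
decomposed = u'\u0627\u0653'


def same_word(a, b, form='NFC'):
    """Compare two unicode strings after normalizing both (hypothetical helper)."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The raw strings render alike but hold different code point sequences.
print(composed == decomposed)

# After normalization both spellings compare equal.
print(same_word(composed, decomposed))
```

Normalizing each word once while reading the file, before counting, avoids repeating the work in every comparison.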
Try removing the repr?
lst = file_obj.readline().split(" ")
The point is that you should at least print variables like lst and w to see what they are.
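One possible way to do that inspection, beyond a plain print, is to list the code points in each word. This is only a sketch: describe is a made-up helper, and the sample line is hard-coded here instead of being read from the file.

```python
import unicodedata


def describe(word):
    """Return (code point, Unicode name) pairs for each character in a word."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>")) for ch in word]


# Two copies of the word from the question, split the way the answer suggests.
lst = u'\u062e\u064f\u062f\u0627\u0648\u0646\u062f \u062e\u064f\u062f\u0627\u0648\u0646\u062f\n'.split()

for w in lst:
    # repr() is fine here: it is used for debug output, not for comparison.
    print(repr(w))
    for code_point, name in describe(w):
        print("  %s  %s" % (code_point, name))
```

Stray quotes, escaped newlines, or unexpected combining marks show up immediately in such a listing.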
Comparing unicode strings in Python:
a = u'Artur'
print(a)
b = u'\u0041rtur'
print(b)
if a == b:
    print('the same')
Result:
Artur
Artur
the same