
UTF-8: search for a word in a list

I have a lookup list generated from a UTF-8 file by:

import itertools

with open('stop_word_Tiba.txt') as f:
    newStopWords = list(itertools.chain(line.split() for line in f))  # one list of words per line
newStopWords1d = list(itertools.chain(*newStopWords))  # flatten the 2D list into a 1D list

When I open the file I see that the word 'الو' is in there, so it is in the list, but the list now looks like this:

['\xd8\xa7\xd9\x84\xd9\x88', '\xd8\xa3\xd9\x84\xd9\x88', '\xd8\xa7\xd9\x88\xd9\x83\xd9\x8a', '\xd8\xa7\xd9\x84', '\xd8\xa7\xd9\x87', '\xd8\xa3\xd9\x87', '\xd9\x87\xd9\x84\xd9\x88', '\xd8\xa3\xd9\x88\xd9\x83\xd9\x8a', '\xd9\x88']

Then I would like to check whether a specific word is in newStopWords1d. The word 'الو' appears in the list as '\xd8\xa7\xd9\x84\xd9\x88'.

word='الو'
for w in newStopWords1d:
    if word == w.encode("utf-8"):
        print 'found'

The word is not found. I also tried:

    if word in newStopWords1d:
        print 'found'

but again the word is not found. It seems to be an encoding problem, but I couldn't solve it. Can you please help me?

It would have been worth mentioning that you are using Python 2.7. There, the built-in open function returns byte strings, so decode each list entry before comparing:

word = u'الو'  # unicode literal, so both sides of the comparison are text
for w in newStopWords1d:
    if word == w.decode("utf-8"):
        print('found')
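To make the bytes-versus-text distinction concrete, here is a minimal sketch that should run on both Python 2.7 and Python 3. The word is written with escape sequences (U+0627, U+0644, U+0648, which is my reading of 'الو' from the byte values in the question), and the names raw, decoded_matches and bytes_match are only illustrative.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

# The word from the question as text: alef, lam, waw (assumed code points).
word = '\u0627\u0644\u0648'

# The same word as it is stored in the UTF-8 file: six bytes, two per letter.
raw = word.encode('utf-8')

# Decoding the bytes gives back the three-character text string.
decoded_matches = (raw.decode('utf-8') == word)

# The byte string is exactly what showed up in the question's list.
bytes_match = (raw == b'\xd8\xa7\xd9\x84\xd9\x88')

print(len(word), len(raw), decoded_matches, bytes_match)
```

The comparison only succeeds when both sides are the same kind of object: either both decoded text or both raw bytes.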

An even better solution is to use either the open function from io:

import io

with io.open('stop_word_Tiba.txt', encoding="utf-8") as f:
    ...

or the codecs module:

import codecs

with codecs.open('stop_word_Tiba.txt', encoding="utf-8") as f:
    ...

since the built-in open function in Python 2.7 doesn't support specifying the encoding.
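As a rough end-to-end sketch of that approach: the snippet below writes a small temporary UTF-8 file (a stand-in for stop_word_Tiba.txt, with three words taken from the list in the question), reads it back with io.open, and tests membership. The helper name load_stop_words and the use of a set instead of a flattened list are my own choices, not part of the original code.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import io
import os
import tempfile


def load_stop_words(path):
    """Return all whitespace-separated words of a UTF-8 file as a set of text strings."""
    with io.open(path, encoding='utf-8') as f:
        # io.open decodes while reading, so every word is already text.
        return {w for line in f for w in line.split()}


# Create a throwaway file standing in for the real stop-word file.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(path, 'w', encoding='utf-8') as f:
    f.write('\u0627\u0644\u0648 \u0648\n\u0627\u0644\n')

stop_words = load_stop_words(path)
os.remove(path)

word = '\u0627\u0644\u0648'  # the word searched for in the question
print('found' if word in stop_words else 'not found')
```

Because the file is decoded once at read time, the lookup needs no further encode or decode calls, and a set makes each membership test cheap.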

The problem was solved by changing the open statement to:

import codecs
import itertools

word = u'الو'
with codecs.open("stop_word_Tiba.txt", "r", "utf-8") as f:
    newStopWords = list(itertools.chain(line.split() for line in f))  # one list of words per line
newStopWords1d = list(itertools.chain(*newStopWords))  # flatten to a 1D list of unicode strings
for w in newStopWords1d:
    if word == w:  # both sides are unicode, so no encode/decode is needed
        print('found')

Thank you.
