Searching for a word in a list of UTF-8 strings

Question

I have a lookup list generated from a UTF-8 file with:

import itertools

with open('stop_word_Tiba.txt') as f:
    newStopWords = list(itertools.chain(line.split() for line in f))  # one list of words per line (2d list)
newStopWords1d = list(itertools.chain(*newStopWords))  # convert 2d list to 1d list

When I open the file I can see that the word 'الو' is in there, so it should be in the list. However, the list now looks like this:

['\xd8\xa7\xd9\x84\xd9\x88', '\xd8\xa3\xd9\x84\xd9\x88', '\xd8\xa7\xd9\x88\xd9\x83\xd9\x8a', '\xd8\xa7\xd9\x84', '\xd8\xa7\xd9\x87', '\xd8\xa3\xd9\x87', '\xd9\x87\xd9\x84\xd9\x88', '\xd8\xa3\xd9\x88\xd9\x83\xd9\x8a', '\xd9\x88']

Then I would like to check whether a specific word is in newStopWords1d. The word 'الو' is stored as '\xd8\xa7\xd9\x84\xd9\x88'.

word='الو'
for w in newStopWords1d:
    if word == w.encode("utf-8"):
        print 'found'

The word is not found. I also tried:

    if word in newStopWords1d:
        print 'found'

but again the word is not found. It looks like an encoding problem, but I couldn't solve it. Can you please help me?

Answer 1

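It is worth mentioning that you are using Python 2.7. There, the built-in open returns byte strings, so each list item has to be decoded before it is compared with a unicode string: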

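word = u'الو'  # unicode literal; a plain 'الو' is a byte string in Python 2.7
for w in newStopWords1d:
    if word == w.decode("utf-8"):  # decode the bytes read from the file
        print 'found'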
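To illustrate why the decode step matters, here is a minimal sketch comparing the same word as raw UTF-8 bytes and as text. The names raw, decoded and word are only illustrative, and the word is written with escape sequences so the snippet does not depend on the source file's encoding.

```python
# -*- coding: utf-8 -*-
# The word from the question, written as Unicode code points
# (alef U+0627, lam U+0644, waw U+0648).
word = u'\u0627\u0644\u0648'

# The same word as the UTF-8 byte string that appears in the list.
raw = b'\xd8\xa7\xd9\x84\xd9\x88'

# Decoding the bytes gives text that compares equal to the word.
decoded = raw.decode('utf-8')
print(decoded == word)

# Equivalently, encoding the text gives back the original bytes.
print(word.encode('utf-8') == raw)
```

Either direction works, as long as both sides of the comparison are the same type: both text or both bytes.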

An even better solution is to use either the open function from the io module

import io

with io.open('stop_word_Tiba.txt', encoding="utf-8") as f:
    ...

or the codecs module

import codecs

with codecs.open('stop_word_Tiba.txt', encoding="utf-8") as f:
    ...

as the built-in open function in Python 2.7 does not support specifying the encoding. Both variants decode the file while reading, so the list items are already unicode strings.
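As a rough end-to-end sketch of this approach, the snippet below writes a small demo file, reads it back with io.open, and tests membership directly. The helper load_stop_words and the demo file contents are assumptions made for illustration; they are not part of the original code.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def load_stop_words(path):
    """Read a UTF-8 file and return its whitespace-separated words as a flat list of text strings."""
    with io.open(path, encoding="utf-8") as f:
        return [w for line in f for w in line.split()]


# Create a temporary demo file (hypothetical content: three words on two lines).
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(demo_path, "w", encoding="utf-8") as f:
    f.write(u"\u0627\u0644\u0648 \u0627\u0644\n\u0648\n")

stop_words = load_stop_words(demo_path)
os.remove(demo_path)

# The words were decoded while reading, so a plain membership test is enough.
word = u"\u0627\u0644\u0648"
found = word in stop_words
print(found)
```

If the list is large and searched often, converting it once with set(stop_words) should make each lookup faster.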

Answer 2

The problem was solved by changing the file-opening statement as follows:

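import codecs
import itertools

word = u'الو'  # assumed to be a unicode string; its definition is not shown in the original snippet
with codecs.open("stop_word_Tiba.txt", "r", "utf-8") as f:
    newStopWords = list(itertools.chain(line.split() for line in f))  # one list of words per line
newStopWords1d = list(itertools.chain(*newStopWords))  # flatten to a 1d list
for w in newStopWords1d:
    if word.encode("utf-8") == w.encode("utf-8"):  # compare both sides as UTF-8 bytes
        print 'found'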

Thank you.
