Searching for a UTF-8 word in a list

Question

I have a lookup list generated from a UTF-8 file:

import itertools

with open('stop_word_Tiba.txt') as f:
    newStopWords = list(itertools.chain(line.split() for line in f))  # one list of words per line
newStopWords1d = list(itertools.chain(*newStopWords))  # flatten the 2-D list into a 1-D list

When I open the file, I can see the word “الو” in it, so it should be in the list. However, the list now looks like this: ['\xd8\xa7\xd9\x84\xd9\x88', '\xd8\xa3\xd9\x84\xd9\x88', '\xd8\xa7\xd9\x88\xd9\x83\xd9\x8a', '\xd8\xa7\xd9\x84', '\xd8\xa7\xd9\x87', '\xd8\xa3\xd9\x87', '\xd9\x87\xd9\x84\xd9\x88', '\xd8\xa3\xd9\x88\xd9\x83\xd9\x8a', '\xd9\x88']

Then I want to check whether a specific word is in newStopWords1d. The word 'الو' is '\xd8\xa7\xd9\x84\xd9\x88':

word='الو'
for w in newStopWords1d:
    if word == w.encode("utf-8"):
        print 'found'

The word is not found. I also tried:

    if word in newStopWords1d:
        print 'found'

but the word is still not found. It looks like an encoding problem, but I cannot solve it. Can you help me?

Answer 1

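It is worth mentioning that you are using Python 2.7. There, the built-in open returns byte strings, so each list item has to be decoded before it is compared with a unicode word: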

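# -*- coding: utf-8 -*-
word = u'الو'  # a unicode literal, so it can be compared with the decoded text
for w in newStopWords1d:
    if word == w.decode("utf-8"):  # w is a UTF-8 byte string read from the file
        print 'found'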
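To illustrate why the original comparison fails, here is a minimal sketch that compares the encoded bytes with the text form of the same word. The word is written with escape sequences so the snippet does not depend on the source file's encoding; the variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
# The word 'الو' written as Unicode escapes (alef, lam, waw).
word = u"\u0627\u0644\u0648"

# What the built-in open() hands back in Python 2.7: raw UTF-8 bytes.
raw = word.encode("utf-8")

same_bytes = raw == b"\xd8\xa7\xd9\x84\xd9\x88"   # matches the first list item in the question
bytes_vs_text = raw == word                        # bytes and text never compare equal here
decoded_vs_text = raw.decode("utf-8") == word      # decoding first makes them comparable

print(same_bytes, bytes_vs_text, decoded_vs_text)
```

The takeaway is that both sides of the comparison need to be the same type: either both decoded text or both UTF-8 bytes.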

A better solution is to use the open function from the io module, which decodes the file while reading it:

import io

with io.open('stop_word_Tiba.txt', encoding="utf-8") as f:
    ...

or the codecs module:

import codecs

with codecs.open('stop_word_Tiba.txt', encoding="utf-8") as f:
    ...

because the built-in open function in Python 2.7 does not support specifying an encoding.
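As a rough sketch of the io.open approach end to end, the snippet below reads a file as UTF-8 text, flattens it into a list of words, and tests membership. The helper name load_stop_words and the temporary sample file are assumptions made for illustration; they stand in for stop_word_Tiba.txt and are not part of the original code.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def load_stop_words(path):
    """Return a flat list of whitespace-separated words, decoded as UTF-8 text."""
    with io.open(path, encoding="utf-8") as f:
        return [w for line in f for w in line.split()]


# Hypothetical sample data standing in for stop_word_Tiba.txt.
sample = u"\u0627\u0644\u0648 \u0623\u0644\u0648\n\u0648\n"
fd, sample_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(sample_path, "w", encoding="utf-8") as f:
    f.write(sample)

stop_words = load_stop_words(sample_path)
os.remove(sample_path)

word = u"\u0627\u0644\u0648"  # the word 'الو' written with escapes
found = word in stop_words    # text compared with text, so no encode/decode is needed
print(found)
```

Because io.open yields text in both Python 2.7 and Python 3, the same membership test works unchanged on either version.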

Answer 2

I solved it by changing the file-opening statement to:

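import codecs
import itertools

word = u'الو'  # assumed to be a unicode string; the original snippet did not show its definition

with codecs.open("stop_word_Tiba.txt", "r", "utf-8") as f:
    newStopWords = list(itertools.chain(line.split() for line in f))  # one list of words per line
newStopWords1d = list(itertools.chain(*newStopWords))  # flatten to a 1-D list
for w in newStopWords1d:
    if word.encode("utf-8") == w.encode("utf-8"):  # both sides are now UTF-8 bytes
        print 'found'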

Thank you.

Answer 1: 0, accepted, 2018-04-06 00:23:05

Answer 2: 0, 2018-04-06 01:35:38
