Removing stopwords from a list using python3

Question

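I have been trying to remove stopwords from a csv file that I read with Python code, but my code does not seem to work. I tried using sample text in the code to verify it, but the result is the same. My code is below; I would appreciate it if anyone could help me correct the problem.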

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import csv

article = ['The computer code has a little bug',
           'im learning python',
           'thanks for helping me',
           'this is trouble',
           'this is a sample sentence',
           'cat in the hat']

tokenized_models = [word_tokenize(str(i)) for i in article]
stopset = set(stopwords.words('english'))
stop_models = [i for i in tokenized_models if str(i).lower() not in stopset]
print('token:'+str(stop_models))

Answer 1

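Your tokenized_models is a list of tokenized sentences, so it is a list of lists. As a result, the following line tries to match a whole list of words against the stopwords, rather than each word: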

stop_models = [i for i in tokenized_models if str(i).lower() not in stopset]

Instead, iterate a second time, over the words in each sentence. Something like:

clean_models = []
for m in tokenized_models:
    stop_m = [i for i in m if str(i).lower() not in stopset]
    clean_models.append(stop_m)

print(clean_models)
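
As a self-contained sketch of the same nested filtering, the version below swaps in str.split for word_tokenize and a tiny hand-written stopword set for stopwords.words('english'), so it runs without the NLTK data files. Both substitutions, and the names SAMPLE_STOPWORDS and remove_stopwords, are assumptions made for illustration and are not part of the original code.

```python
# Hypothetical stand-in for set(stopwords.words('english')); far from complete.
SAMPLE_STOPWORDS = {'the', 'a', 'is', 'this', 'in', 'for', 'me', 'has'}


def remove_stopwords(tokenized_sentences, stopset):
    """Return a new list of sentences with stopword tokens dropped."""
    return [
        [token for token in sentence if token.lower() not in stopset]
        for sentence in tokenized_sentences
    ]


article = ['The computer code has a little bug',
           'this is trouble',
           'cat in the hat']

# str.split stands in for word_tokenize (assumption: whitespace-separated tokens).
tokenized_models = [sentence.split() for sentence in article]
clean_models = remove_stopwords(tokenized_models, SAMPLE_STOPWORDS)
print(clean_models)
```

The outer comprehension walks the sentences and the inner one walks the words, so each membership test compares a single lowercase word against the set.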

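Off-topic but useful tip:
To define a string that spans several lines, wrap it in parentheses and do not use commas. Adjacent literals are joined with nothing in between, so each piece except the last needs a trailing space:

article = ('The computer code has a little bug '
           'im learning python '
           'thanks for helping me '
           'this is trouble '
           'this is a sample sentence '
           'cat in the hat')

With this version, the original filtering line works once the string is tokenized in a single call, tokenized_models = word_tokenize(article), because the result is then a flat list of words.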
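
A short sketch of how Python joins adjacent string literals may make the tip easier to check. The variable names joined, spaced and items are chosen here for illustration only.

```python
# Adjacent string literals inside parentheses are joined into one string,
# with nothing inserted between them.
joined = ('this is trouble'
          'cat in the hat')
print(joined)

# A trailing space on each piece keeps the words apart.
spaced = ('this is trouble '
          'cat in the hat')
print(spaced.split())

# The same rule applies inside a list: a missing comma merges two items.
items = ['this is trouble',
         'this is a sample sentence'
         'cat in the hat']
print(len(items))
```

The last case is easy to miss because it raises no error; the list is simply one element shorter than it looks.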

Answer 2

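word_tokenize(str(i)) returns a list of words, so tokenized_models is a list of lists. You need to flatten that list, or better yet make article a single string, since I don't see why it is a list at the moment.

This matters because the in operator does not look inside a list and then inside the strings in that list; it only compares against the list's own elements. For example: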

>>> 'a' in 'abc'
True
>>> 'a' in ['abc']
False
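
The flattening approach could look roughly like the sketch below. It uses itertools.chain.from_iterable from the standard library; the SAMPLE_STOPWORDS set is a small hypothetical stand-in for NLTK's English list, and the pre-tokenized input is written out by hand so the example runs without NLTK.

```python
from itertools import chain

# Hypothetical stand-in for set(stopwords.words('english')).
SAMPLE_STOPWORDS = {'the', 'a', 'is', 'this', 'in'}

tokenized_models = [['this', 'is', 'trouble'], ['cat', 'in', 'the', 'hat']]

# str() of a whole sentence gives "['this', 'is', 'trouble']", which is never
# a stopword, so this filter keeps every sentence unchanged.
unfiltered = [s for s in tokenized_models
              if str(s).lower() not in SAMPLE_STOPWORDS]

# Flatten the list of lists into one list of words, then test each word.
flat_tokens = list(chain.from_iterable(tokenized_models))
kept = [w for w in flat_tokens if w.lower() not in SAMPLE_STOPWORDS]
print(kept)
```

Flattening discards the sentence boundaries, so it fits best when only the overall bag of words is needed.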

2 answers

Answer 1: 3, accepted, 2016-05-26 21:08:55

Answer 2: 0, 2016-05-26 21:09:26
