
Removing stopwords from list using python3

I have been trying to remove stopwords from a CSV file that I'm reading with Python, but my code does not seem to work. I have tried using sample text in the code to validate it, but the result is still the same. Below is my code; I would appreciate it if anyone could help me rectify the issue.

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import csv

article = ['The computer code has a little bug',
           'im learning python',
           'thanks for helping me',
           'this is trouble',
           'this is a sample sentence'
           'cat in the hat']

tokenized_models = [word_tokenize(str(i)) for i in article]
stopset = set(stopwords.words('english'))
stop_models = [i for i in tokenized_models if str(i).lower() not in stopset]
print('token:'+str(stop_models))

Your tokenized_models is a list of tokenized sentences, so a list of lists. Ergo, the following line tries to match a whole list of words against a stopword:

stop_models = [i for i in tokenized_models if str(i).lower() not in stopset]

Instead, iterate through the words again. Something like:

clean_models = []
for m in tokenized_models:
    stop_m = [i for i in m if str(i).lower() not in stopset]
    clean_models.append(stop_m)

print(clean_models)
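The loop above can also be written as a nested list comprehension. A minimal, self-contained sketch — a small hard-coded stop set stands in for stopwords.words('english') so it runs without downloading NLTK corpora:

```python
# Small stand-in for NLTK's English stop set (illustration only).
stopset = {'the', 'a', 'is', 'for', 'in', 'this', 'me', 'has'}

# Pre-tokenized sentences, as word_tokenize would produce them.
tokenized_models = [
    ['The', 'computer', 'code', 'has', 'a', 'little', 'bug'],
    ['thanks', 'for', 'helping', 'me'],
]

# One inner comprehension per sentence: keep tokens whose lowercase
# form is not in the stop set.
clean_models = [[w for w in m if w.lower() not in stopset]
                for m in tokenized_models]

print(clean_models)
# [['computer', 'code', 'little', 'bug'], ['thanks', 'helping']]
```

Both forms preserve the sentence boundaries, which the original flat comprehension threw away.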

Off-topic useful hint: to define a multi-line string, use parentheses and no commas:

article = ('The computer code has a little bug '
           'im learning python '
           'thanks for helping me '
           'this is trouble '
           'this is a sample sentence '
           'cat in the hat')

This version would work with your original code.
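A minimal sketch of why the single-string version works with the original comprehension: tokenizing one string yields a flat list of words, so each i is a word rather than a list. Here str.split stands in for word_tokenize, and a small hard-coded stop set for stopwords.words('english'), so the snippet runs without NLTK:

```python
# Stand-in stop set for illustration.
stopset = {'the', 'a', 'is', 'in', 'has'}

# Adjacent string literals inside parentheses concatenate into one string
# (note the trailing spaces so words don't fuse at the joins).
article = ('The computer code has a little bug '
           'cat in the hat')

tokens = article.split()  # flat list of words, one level deep
stop_models = [i for i in tokens if i.lower() not in stopset]

print(stop_models)
# ['computer', 'code', 'little', 'bug', 'cat', 'hat']
```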

word_tokenize(str(i)) returns a list of words, so tokenized_models is a list of lists. You need to flatten that list, or better yet just make article a single string, since I don't see why it needs to be a list at the moment.

This is because the in operator won't search through a list and through the strings inside that list at the same time, e.g.:

>>> 'a' in 'abc'
True
>>> 'a' in ['abc']
False
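If you do keep the list-of-lists, flatten it before filtering. A sketch using itertools.chain.from_iterable, again with a hard-coded stop set in place of NLTK's:

```python
from itertools import chain

# Stand-in stop set for illustration.
stopset = {'the', 'a', 'for'}

tokenized_models = [['thanks', 'for', 'helping'], ['the', 'cat']]

# chain.from_iterable joins the inner lists into one flat word stream.
flat = list(chain.from_iterable(tokenized_models))
stop_models = [w for w in flat if w.lower() not in stopset]

print(stop_models)
# ['thanks', 'helping', 'cat']
```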
