Removing stop words from a list of text files

Question

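I have a list of processed text files that looks like this:

text = ["this is the first text document", "this is the second text document", "this is the third document"]

I have been able to tokenize these sentences successfully: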

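from nltk.tokenize import sent_tokenize, word_tokenize

sentences = sent_tokenize(text)
for ii, sentence in enumerate(sentences):
    # remove_punctuation is a helper defined elsewhere
    sentences[ii] = remove_punctuation(sentence)
sentence_tokens = [word_tokenize(sentence) for sentence in sentences]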

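Now I want to remove the stop words from this list of tokens.
However, because it is a list of sentences within a list of text documents, I can't figure out how to do it.

Here is what I have tried so far, but it returns no results: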

sentence_tokens_no_stopwords = [w for w in sentence_tokens if w not in stopwords]

I assume this will require some kind of for loop, but I can't get it to work at the moment. Any help would be appreciated!

Answer 1

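Your attempt iterates over the sentences, so each w is a whole list of tokens rather than a single word. You can nest two list comprehensions instead: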

sentence_tokens_no_stopwords = [[w for w in s if w not in stopwords] for s in sentence_tokens]
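
To make the nesting concrete, here is a minimal sketch that runs without NLTK. The tiny stop_words set and the pre-tokenized sentence_tokens are made up for illustration, and the lower() call is my own assumption (it only matters if your tokens keep their original casing). With NLTK you would more likely build the set from stopwords.words("english").

```python
# Hypothetical stand-in for a real stop word list. With NLTK this would
# typically be set(stopwords.words("english")) from nltk.corpus.
stop_words = {"this", "is", "the"}

# Illustrative, already tokenized input: one inner list of tokens per sentence.
sentence_tokens = [
    ["this", "is", "the", "first", "text", "document"],
    ["this", "is", "the", "second", "text", "document"],
    ["this", "is", "the", "third", "document"],
]

# The inner comprehension filters the tokens of one sentence;
# the outer comprehension walks over the sentences.
filtered = [[w for w in s if w.lower() not in stop_words] for s in sentence_tokens]

# The same logic written as explicit for loops.
filtered_loop = []
for s in sentence_tokens:
    kept = []
    for w in s:
        if w.lower() not in stop_words:
            kept.append(w)
    filtered_loop.append(kept)

print(filtered)
```

Using a set rather than a list for the stop words keeps each membership check cheap, which helps once the documents get large.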

Solution 1
2 Accepted 2017-02-04 15:48:11