Python NLTK - Prevent stopword removal from removing every word

Question

I'm working with very short strings of words, and a few of them are stupid. Hypothetically, I could have a string of "you an a", and if I remove stopwords, that string would be blank. Since I'm classifying in a loop, it stops with an error as soon as it reaches a blank string. I've created the following code to fix this:

from nltk.corpus import stopwords

def title_features(words):
    filter_words = [word for word in words.split() if word not in stopwords.words('english')]
    features = {}
    if len(filter_words) >= 1:
        features['First word'] = ''.join(filter_words[0])
    else:
        # Every word was a stopword: fall back to the first word of the original string
        features['First word'] = ''.join(words.split()[0])
    return features

This ensures that I don't get the error, but I'm wondering if there is a more efficient way to do it, or a way to remove stopwords that won't get rid of all the words when every one of them is a stopword.

Answer 1

The simplest solution is to check the result of filtering, and restore the full word list if necessary. Then the rest of your code can use a single variable without checks.

from nltk.corpus import stopwords

def title_features(words):
    filter_words = [word for word in words.split() if word not in stopwords.words('english')]
    if not filter_words:       # Use full list if necessary
        # Split here so that [0] below yields a word, not a single character
        filter_words = words.split()

    features={}
    features['First word'] = filter_words[0]
    features[...] = ...

    return features
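
As a rough illustration of the fallback described above, here is a minimal sketch that runs without the NLTK corpus. `STOPWORDS` is a small hard-coded stand-in (an assumption made for this example, not NLTK's actual list), and the empty-string default for blank input is also an assumption that the original answer does not cover. In real code the set could be built once with `set(stopwords.words('english'))`, which also avoids re-reading the list for every word.

```python
# Hypothetical stand-in for set(stopwords.words('english')); illustration only.
STOPWORDS = frozenset({"you", "an", "a", "the", "is"})

def title_features(words):
    tokens = words.split()
    # Keep only non-stopwords, but fall back to every token if none survive.
    filter_words = [w for w in tokens if w not in STOPWORDS] or tokens
    features = {}
    # A blank title has no first word; defaulting to '' is an assumption.
    features['First word'] = filter_words[0] if filter_words else ''
    return features

print(title_features("you an a"))       # {'First word': 'you'}
print(title_features("the quick fox"))  # {'First word': 'quick'}
```

Because the fallback happens once, right after filtering, any further features can read from `filter_words` without repeating the emptiness check.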

Answer 2

You could rewrite it as:

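from nltk.corpus import stopwords

def title_features(words):
    filtered = [word for word in words.split() if word not in stopwords.words('english')]
    # First non-empty choice wins: filtered words, then the original words, then ['']
    return {'First word': (filtered or words.split(None, 1) or [''])[0]}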

This takes filtered if it's not empty (i.e. it has a length of one or more). If it is empty, it splits the original string instead, and if that is empty too, it defaults to a one-element list containing an empty string. You then take the first element, using [0], of whichever of those was chosen (the first non-stopword, the first word of the string, or an empty string).
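
The three cases in that explanation can be checked with a short sketch. It again uses a hard-coded `STOPWORDS` set as a stand-in for NLTK's list, and `first_word` is a hypothetical helper name introduced only for this example.

```python
# Hypothetical stand-in for NLTK's English stopword list; illustration only.
STOPWORDS = frozenset({"you", "an", "a", "the", "is"})

def first_word(words):
    filtered = [w for w in words.split() if w not in STOPWORDS]
    # `or` returns its first truthy operand, and an empty list is falsy,
    # so each alternative is used only when the previous one was empty.
    candidates = filtered or words.split(None, 1) or ['']
    return candidates[0]

for title in ("the red door", "you an a", "   "):
    print(repr(title), '->', repr(first_word(title)))
# 'the red door' -> 'red'
# 'you an a' -> 'you'
# '   ' -> ''
```

Passing `1` as the second argument to `split` stops after the first word, which is all that `[0]` needs.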

Solution 1
2 2016-11-19 07:42:50

Solution 2
1 Accepted 2016-11-18 18:27:13
