
Python NLTK - Preventing stop word removal from removing EVERY word

I'm working with very short strings of words, and a few of them are stupid. Hypothetically, I could have a string of "you an a", and if I remove stop words, that string would be blank. Since I'm classifying in a loop, a blank string makes it stop with an error. I've created the following code to fix this:

from nltk.corpus import stopwords

def title_features(words):
    filter_words = [word for word in words.split() if word not in stopwords.words('english')]
    features = {}
    if len(filter_words) >= 1:
        features['First word'] = ''.join(filter_words[0])
    else:
        # Every word was a stop word: fall back to the first original word
        features['First word'] = ''.join(words.split()[0])
    return features

This ensures that I don't get the error, but I'm wondering if there is a more efficient way to do it, or a way to remove stop words that won't get rid of all the words when every one of them is a stop word.

The simplest solution is to check the result of filtering, and restore the full word list if necessary. Then the rest of your code can use a single variable without checks.

def title_features(words):
    filter_words = [word for word in words.split() if word not in stopwords.words('english')]
    if not filter_words:       # Use full list if necessary
        filter_words = words.split()   # split first, so [0] is a word, not a character

    features = {}
    features['First word'] = filter_words[0]
    # features[...] = ...      (further features, elided in the original answer)

    return features
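
On the efficiency part of the question: the list comprehension above evaluates stopwords.words('english') once per word, and each membership test scans a list. A minimal sketch of one way to avoid that is to build the stop word collection once as a set. Here STOP_WORDS is a small hypothetical stand-in so the snippet runs without NLTK data; in practice it would presumably be set(stopwords.words('english')). The stop_words parameter and the empty-string fallback are also assumptions, not part of the original answer.

```python
# Hypothetical stand-in for set(stopwords.words('english')); the real NLTK
# list is much longer and requires the 'stopwords' corpus to be downloaded.
STOP_WORDS = frozenset({"a", "an", "the", "you", "of", "is"})

def title_features(words, stop_words=STOP_WORDS):
    tokens = words.split()
    # The set is built once, and membership tests are O(1) on average.
    filter_words = [word for word in tokens if word not in stop_words]
    if not filter_words:       # every token was a stop word: keep them all
        filter_words = tokens
    # Assumption: an empty input string yields an empty first word.
    return {'First word': filter_words[0] if filter_words else ''}

print(title_features("you an a"))       # {'First word': 'you'}
print(title_features("the quick fox"))  # {'First word': 'quick'}
```

Splitting the string once into tokens also means the fallback reuses the same list instead of splitting a second time.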

You could re-write as:

def title_features(words):
    filtered = [word for word in words.split() if word not in stopwords.words('english')]
    return {'First word': (filtered or words.split(None, 1) or [''])[0]}

This takes filtered if it is not empty (i.e. it has a length of one or more). If it is empty, it falls back to splitting the original string, and if that is also empty, it defaults to a one-element list containing an empty string. You then use [0] to take the first element of whichever was chosen (the first non-stop word, the first word of the string, or an empty string).
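
To see the fallback chain on its own, here is a small sketch that isolates the `or` expression from the NLTK filtering. The helper name pick_first is hypothetical and only exists for illustration; the filtered list is passed in directly instead of being computed from stop words.

```python
def pick_first(filtered, words):
    # `a or b or c` evaluates to the first truthy operand, or to the last
    # operand if none is truthy. Empty lists are falsy.
    chosen = filtered or words.split(None, 1) or ['']
    return chosen[0]

print(pick_first(['quick', 'fox'], "the quick fox"))  # quick
print(pick_first([], "you an a"))                     # you
print(repr(pick_first([], "")))                       # ''
```

The maxsplit of 1 in words.split(None, 1) stops after the first word is separated off, which is enough because only element [0] is used.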


 