Python NLTK - Preventing stop word removal from removing EVERY word

I'm working with very short strings of words, and a few of them are stupid. Hypothetically, I could have a string of "you an a" and if I remove stopwords, that string would be blank. Since I'm classifying in a loop, if it comes to a blank string it just stops with an error. I've created the following code to fix this:

def title_features(words):
    filter_words = [word for word in words.split() if word not in stopwords.words('english')]
    features={}
    if len(filter_words) >= 1:
        features['First word'] = ''.join(filter_words[0])
    else:
        features['First word'] = ''.join(words.split()[0])
    return features

This ensures that I don't have the error, but I'm wondering if there is a more efficient way to do it. Or a way to do it where it won't get rid of all the words, if they are all stopwords.

The simplest solution is to check the result of filtering, and restore the full word list if necessary. Then the rest of your code can use a single variable without checks.

from nltk.corpus import stopwords

def title_features(words):
    filter_words = [word for word in words.split() if word not in stopwords.words('english')]
    if not filter_words:       # Use full list if necessary
        filter_words = words.split()   # split here: words itself is a string, so words[0] would be a single character

    features={}
    features['First word'] = filter_words[0]
    features[...] = ...

    return features
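
On the efficiency part of the question: the list comprehension above calls stopwords.words('english') once per word and then searches a list. A minimal sketch of one possible alternative is to build a set once and reuse it. To keep the sketch runnable without the NLTK corpus, STOP_WORDS below is a small hypothetical stand-in; in real code it would presumably be set(stopwords.words('english')). The stop_words parameter and the empty-string default for blank input are also assumptions, not part of the original answer.

```python
# Hypothetical stand-in for set(stopwords.words('english')), so the sketch
# runs without downloading the NLTK stop-word corpus.
STOP_WORDS = frozenset({"a", "an", "the", "you", "is", "of"})

def title_features(words, stop_words=STOP_WORDS):
    tokens = words.split()
    # The set is built once, and membership tests on it are fast.
    filter_words = [word for word in tokens if word not in stop_words]
    if not filter_words:        # every token was a stop word
        filter_words = tokens   # fall back to the unfiltered tokens

    features = {}
    # Only blank input leaves the list empty at this point.
    features['First word'] = filter_words[0] if filter_words else ''
    return features

print(title_features("you an a"))
print(title_features("the quick fox"))
```

The comparison is case-sensitive, as in the original code; lowercasing each word before the membership test would be a separate design decision.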

You could re-write as:

def title_features(words):
    filtered = [word for word in words.split() if word not in stopwords.words('english')]
    return {'First word': (filtered or words.split(None, 1) or [''])[0]}

This takes filtered if it is not empty (i.e. it has a length of one or more). If it is empty, the expression falls back to splitting the original string, and if that is empty too, it defaults to a one-element list containing an empty string. You then take the first element, using [0], of whichever list was chosen (the first non-stop word, the first word of the string, or an empty string).
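
The fallback chain relies on Python's or returning its first truthy operand, and an empty list is falsy. A small sketch that isolates just that expression, with the filtering step left out; the helper name first_word_or_default is made up for illustration:

```python
def first_word_or_default(filtered, original):
    # `or` yields the first non-empty list; [''] guarantees that [0] is safe.
    return (filtered or original.split(None, 1) or [''])[0]

# Some words survived filtering: the first of them is used.
print(first_word_or_default(['quick', 'fox'], 'the quick fox'))
# Everything was filtered out: fall back to the first word of the original.
print(first_word_or_default([], 'you an a'))
# The original is blank as well: the result is an empty string.
print(repr(first_word_or_default([], '')))
```

Passing 1 as maxsplit means the original string is split only once, since only its first word is needed.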
