Python - remove stopwords from a string

I am having trouble creating code which removes stop words from a string input. Currently, here is my code:

stopWords = [ "a", "i", "it", "am", "at", "on", "in", "to", "too", "very", \
                 "of", "from", "here", "even", "the", "but", "and", "is", "my", \
                 "them", "then", "this", "that", "than", "though", "so", "are" ]
stemEndings = [ "-s", "-es", "-ed", "-er", "-ly", "-ing", "-'s", "-s'" ]
punctuation = [ ".", ",", ":", ";", "!", "?" ]
line = raw_input ("Type in lines, finish with a . at start of line only:")
while line != ".":
    def remove_punctuation(input): #removes punctuation from input
        output = ""
        text= 0
        while text<=(len(input)-1) :
            if input[text] not in punctuation:
               output=output + input[text]
            text+=1
        return output
    newline= remove_punctuation(line)
    newline= newline.lower()

What code could be added to remove stopWords from a string based on the stopWords list above? Thank you in advance.

As I understand your problem, you want to remove punctuation from an input string. My variant of the remove_punctuation function:

def remove_punctuation(input_string):
    for item in punctuation:
        input_string = input_string.replace(item, '')
    return input_string

As greg suggested, you should use a for loop instead of a while loop, because it is more Pythonic and makes the code easier to understand. You should also define the function before the while loop that reads the input, so that the Python interpreter does not redefine the function on every iteration.

Also, if you want, you can make punctuation a string rather than a list (for readability and ease):

stopWords = [ "a", "i", "it", "am", "at", "on", "in", "to", "too", "very", \
              "of", "from", "here", "even", "the", "but", "and", "is", "my", \
              "them", "then", "this", "that", "than", "though", "so", "are" ]
stemEndings = [ "-s", "-es", "-ed", "-er", "-ly", "-ing", "-'s", "-s'" ]
punctuation = ".,:;!?"

def remove_punctuation(input_string):
    for item in punctuation:
        input_string = input_string.replace(item, '')
    return input_string

line = raw_input("Type in lines, finish with a . at start of line only:")

while line != ".":
    newline = remove_punctuation(line)
    newline = newline.lower()
    print(newline)
    line = raw_input()  # read the next line; without this the loop never ends
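
The code above only strips punctuation, while the question asks how to drop the stop words as well. A minimal sketch of that step, assuming the stopWords list and punctuation string from the post; the helper name remove_stopwords is hypothetical and not part of the original answer. It splits the cleaned line on whitespace and keeps only the words that are not in the list:

```python
stopWords = ["a", "i", "it", "am", "at", "on", "in", "to", "too", "very",
             "of", "from", "here", "even", "the", "but", "and", "is", "my",
             "them", "then", "this", "that", "than", "though", "so", "are"]
punctuation = ".,:;!?"

def remove_punctuation(input_string):
    # Delete every punctuation character from the string.
    for item in punctuation:
        input_string = input_string.replace(item, '')
    return input_string

def remove_stopwords(input_string):
    # Hypothetical helper: keep only the words that are not stop words.
    # Expects lower-case input, because stopWords is lower-case.
    words = input_string.split()
    kept = [word for word in words if word not in stopWords]
    return " ".join(kept)

cleaned = remove_stopwords(remove_punctuation("The cat is on my mat, and it purrs!").lower())
print(cleaned)  # cat mat purrs
```

Inside the input loop this would be one more call after newline.lower(), for example newline = remove_stopwords(newline).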

I found something interesting in another post that can speed up your code considerably: store the stop words in a set instead of a list, as described in the question "Faster way to remove stop words in Python".

Credit goes to alko
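
The set suggestion can be sketched as follows. This is an illustration under assumptions, not the code from the linked post: the names stop_list, stop_set and filter_words are made up here, and the timings depend on the machine, so none are quoted. Membership tests on a set take constant time on average, while a list is scanned element by element:

```python
import timeit

stop_list = ["a", "i", "it", "am", "at", "on", "in", "to", "too", "very",
             "of", "from", "here", "even", "the", "but", "and", "is", "my",
             "them", "then", "this", "that", "than", "though", "so", "are"]
stop_set = frozenset(stop_list)  # same words, hashed for fast lookup

def filter_words(words, stop_words):
    # Works with any container that supports the "in" operator.
    return [word for word in words if word not in stop_words]

# Artificially long input so that the difference becomes measurable.
words = "this is a very long line of text that even the list can handle".split() * 1000

# Both containers must produce the same result.
same_result = filter_words(words, stop_list) == filter_words(words, stop_set)

list_time = timeit.timeit(lambda: filter_words(words, stop_list), number=20)
set_time = timeit.timeit(lambda: filter_words(words, stop_set), number=20)
print("same result: %s  list: %.4fs  set: %.4fs" % (same_result, list_time, set_time))
```

For a list as short as this one the gain is modest; it grows with the size of the stop word list.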

You can use the NLTK library instead of defining the stop words yourself.

pip install nltk
