
Removing stopwords using NLTK in Python

I am using NLTK to remove stopwords from a list element. Here is my code snippet:

dict1 = {}
for ctr, row in enumerate(cur.fetchall()):
    list1 = [row[0], row[1], row[2], row[3], row[4]]
    dict1[row[0]] = list1
    print ctr+1,"\n",dict1[row[0]][2]
    list2 = [w for w in dict1[row[0]][3] if not w in stopwords.words('english')]
    print list2

The problem is that this not only removes the stopwords but also removes characters from other words. For example, from the word 'orientation', 'i' and other stopwords are removed, and list2 ends up storing characters instead of words, i.e. ['O', 'r', 'e', 'n', 'n', ' ', 'f', ' ', '3', ' ', 'r', 'e', 'r', 'e', ' ', 'p', 'n', '\\n', '\\n', '\\n', 'O', 'r', 'e', 'n', 'n', ' ', 'f', ' ', 'n', ' ', 'r', 'e', 'r', 'e', ' ', 'r', 'p', 'l'....................... whereas I want it stored as ['Orientation','....................

First, make sure that what you filter is a list of words, not a string of characters. Here is a code snippet that you may be able to adapt.

from nltk import word_tokenize
from nltk.corpus import stopwords

english_stopwords = stopwords.words('english')    # get english stop words

# test document
document = '''A moody child and wildly wise
Pursued the game with joyful eyes
'''

# first tokenize your document to a list of words
words = word_tokenize(document)
print(words)

# then remove all stop words
content = [w for w in words if w.lower() not in english_stopwords]
print(content)

The output will be:

['A', 'moody', 'child', 'and', 'wildly', 'wise', 'Pursued', 'the', 'game', 'with', 'joyful', 'eyes']
['moody', 'child', 'wildly', 'wise', 'Pursued', 'game', 'joyful', 'eyes']

First, your construction of list1 looks a little peculiar to me. I think there's a more Pythonic solution (if row is a tuple, the slice is a tuple too, so wrap it in list() if you need a list):

list1 = row[:5]

Then, is there a reason you're accessing row[3] with dict1[row[0]][3], rather than row[3] directly?

Finally, assuming that row was a list of strings, constructing list2 from row[3] iterates over every character, rather than every word. That might be why you're parsing out 'i' and 'a' (and a few other characters).
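To see the effect in isolation, here is a minimal sketch that does not need NLTK. The STOP set is a small hand-picked stand-in for stopwords.words('english'), which also contains single letters such as 'i', 'a', 't' and 'o'; the names STOP and text are only for illustration.

```python
# Hand-picked stand-in for stopwords.words('english') (an assumption,
# not the full NLTK list); note the single-letter entries.
STOP = {"i", "a", "t", "o", "s", "d", "m", "y", "of", "for"}

text = "Orientation"

# Iterating over a string yields one character at a time, so each
# character is compared against the stop word list.
chars = [w for w in text if w not in STOP]
print(chars)
```

The lowercase 'i', 't', 'a' and 'o' are dropped as if they were stop words, leaving 'O', 'r', 'e', 'n', 'n', which matches the start of the output in the question.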

The correct comprehension would be:

list2 = [w for w in row[3].split(' ') if w not in stopwords.words('english')]

You have to split your strings apart somehow, probably around spaces. That takes something from:

'Hello, this is row3'

To

['Hello,', 'this', 'is', 'row3']

Iterating over that gives you full words, rather than individual characters.
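Putting the pieces together for the loop in the question, a rough sketch might look like the following. Everything named here is an assumption for illustration: rows stands in for cur.fetchall(), STOP_WORDS is a small hand-written stand-in for set(stopwords.words('english')), and remove_stopwords is a hypothetical helper. It also strips surrounding punctuation and compares in lowercase, which the comprehension above does not do.

```python
import string

# Stand-in for set(stopwords.words('english')); a set keeps lookups fast.
STOP_WORDS = {"this", "is", "of", "for", "the", "a", "and"}

def remove_stopwords(text, stop_words):
    """Split text on whitespace and drop stop words (case-insensitive)."""
    kept = []
    for token in text.split():
        word = token.strip(string.punctuation)  # 'Hello,' -> 'Hello'
        if word and word.lower() not in stop_words:
            kept.append(word)
    return kept

# Hypothetical rows standing in for cur.fetchall()
rows = [
    (1, "x", "title", "Orientation of the new students", "y"),
    (2, "x", "title", "Hello, this is row3", "y"),
]

dict1 = {}
for row in rows:
    dict1[row[0]] = list(row[:5])  # same five columns as list1
    list2 = remove_stopwords(row[3], STOP_WORDS)
    print(list2)
```

With these sample rows the loop prints ['Orientation', 'new', 'students'] and then ['Hello', 'row3'], so whole words survive instead of single characters.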

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact yoyou2525@163.com.

 