
What is the error in the following Python code?

I want to remove stop words. Here is my code:

import nltk
from nltk.corpus import stopwords
import string

u="The apple is the pomaceous fruit of the apple tree, species Malus domestica in the rose family (Rosaceae). It is one of the most widely cultivated tree fruits, and the most widely known of the many members of genus Malus that are used by humans."

v="An orange is a fruit of the orangle tree. it is the most cultivated tree fruits"

u=u.lower()
v=v.lower()

u_list=nltk.word_tokenize(u)
v_list=nltk.word_tokenize(v)

for word in u_list:
    if word in stopwords.words('english'):
        u_list.remove(word)
for word in v_list:
    if word in stopwords.words('english'):
        v_list.remove(word)

print u_list
print "\n\n\n\n"
print v_list

But only some of the stop words are removed. Please help me with this.

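One problem with what you are doing is that list.remove(x) only removes the first occurrence of x, not every x. In addition, removing items from a list while you are iterating over it makes the loop skip the element that follows each removal. To remove every instance, you could use filter, but I would opt for something like this: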

u_list = [word for word in u_list if word not in stopwords.words('english')] 
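
A possible refinement, as a minimal sketch: the line above calls stopwords.words('english') again for every word, so it may help to build a set once and test against that. To keep the snippet runnable without the NLTK corpus, STOP_WORDS below is a small hypothetical stand-in for set(stopwords.words('english')), and the tokens are written by hand rather than produced by nltk.word_tokenize.

```python
# Hypothetical stand-in for set(stopwords.words('english'));
# with NLTK available you would build the set from the corpus instead.
STOP_WORDS = {"an", "is", "a", "of", "the", "it", "most"}

# Hand-written tokens standing in for the output of nltk.word_tokenize.
tokens = ["an", "orange", "is", "a", "fruit", "of", "the", "orange", "tree", "."]

# Build a new list instead of mutating the one being iterated.
filtered = [word for word in tokens if word not in STOP_WORDS]

# Equivalent version using filter(), the alternative mentioned above.
filtered_with_filter = list(filter(lambda word: word not in STOP_WORDS, tokens))

print(filtered)  # ['orange', 'fruit', 'orange', 'tree', '.']
```

Both versions keep the original word order and keep repeated words such as "orange".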

I would remove the words by converting both the list of split words and the list of stop words to sets and computing the difference:

u_list = list(set(u_list).difference(set(stopwords.words('english'))))

This should properly remove all occurrences of the stop words.
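
One thing to keep in mind with the set approach is that a set also collapses repeated words and does not preserve the original word order. The sketch below illustrates this with a small hypothetical STOP_WORDS set (a stand-in for set(stopwords.words('english'))) and hand-written tokens, then shows an order-preserving alternative.

```python
# Hypothetical stand-in for set(stopwords.words('english')).
STOP_WORDS = {"is", "the", "a", "of"}

tokens = ["the", "apple", "is", "the", "fruit", "of", "the", "apple", "tree"]

# Set difference removes the stop words, but "apple" now appears only once
# and the iteration order of the result is arbitrary.
kept = set(tokens).difference(STOP_WORDS)
print(sorted(kept))  # ['apple', 'fruit', 'tree']

# If order and duplicates matter, filter the list against the set instead.
ordered = [word for word in tokens if word not in STOP_WORDS]
print(ordered)  # ['apple', 'fruit', 'apple', 'tree']
```

Which version fits depends on whether later steps need word counts or word positions.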

I struggled for a while with a similar piece of code using the remove(x) function. I noticed that only about 50% of the stop words were removed. I knew it was not caused by case (I had lowercased my words) nor by punctuation or other characters around the words (I had used strip()). My theory (I am a beginner) is that when you remove a token, the list shrinks and the remaining items slide down by one index, but the loop still advances to the next index. It therefore skips the item that moved into the removed token's place and does not visit every word. The solution is instead to build a new list with the words that are not stop words and that you want to keep.
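
The skipping behaviour described here can be reproduced with a small sketch. STOP_WORDS is a hypothetical stand-in for the NLTK stop word list, and the tokens are written by hand, so the snippet runs without NLTK.

```python
# Hypothetical stand-in for set(stopwords.words('english')).
STOP_WORDS = {"it", "is", "the", "most"}

tokens = ["it", "is", "the", "most", "cultivated", "tree"]

# Buggy pattern: removing from the same list that the loop iterates over.
buggy = list(tokens)
for word in buggy:
    if word in STOP_WORDS:
        # After this removal, the next item slides into the current index
        # and the loop moves past it without ever checking it.
        buggy.remove(word)

print(buggy)  # ['is', 'most', 'cultivated', 'tree']

# Fix: leave the original list alone and collect the words to keep.
kept = []
for word in tokens:
    if word not in STOP_WORDS:
        kept.append(word)

print(kept)  # ['cultivated', 'tree']
```

In the buggy loop, "is" and "most" survive because each one directly follows a removed stop word, which matches the observation that only about half of the stop words disappear.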
