spam filtering: remove stopwords

I have created two lists: l1 is my main list and l2 is a list containing certain stopwords. I want to remove the stopwords in l2 from the second element of each nested list in l1. However, the code does not work as expected: only one stopword is removed while the rest remain in l1. This is what l1 looks like:

[['ham', 'And how you will do that, princess? :)'], ['spam', 'Urgent! Please call 09061213237 from landline. £5000 cash or a luxury 4* Canary Islands Holiday await collection.....]],...]

This is what l2 looks like:

['a', ' able', ' about', ' across', ' after', ' all', ' almost', ' also', ' am', ' among', ' an', ' and', ' any',....]

This is what I have tried:

for i in l1:
    i[1] = i[1].lower()
    i[1] = i[1].split()
    for j in i[1]:
        if j in l2:
            i[1].remove(j)

If you don't want to re-invent the wheel, you can use nltk to tokenise your text and remove stopwords:

import nltk

# word_tokenize and the stopword corpus need one-off downloads if not present:
# nltk.download('punkt'); nltk.download('stopwords')

data = [['ham', 'And how you will do that, princess? :)'], ['spam', 'Urgent! Please call 09061213237 from landline. £5000 cash or a luxury 4* Canary Islands Holiday await collection']]

for text in (label_text[1] for label_text in data): 
    filtered_tokens = [token for token in nltk.word_tokenize(text) if token.lower() not in nltk.corpus.stopwords.words('english')]
    print(filtered_tokens)

And the output should be:

>>> [',', 'princess', '?', ':', ')']
>>> ['Urgent', '!', 'Please', 'call', '09061213237', 'landline', '.', '£5000', 'cash', 'luxury', '4*', 'Canary', 'Islands', 'Holiday', 'await', 'collection']
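Note that nltk.corpus.stopwords.words('english') is re-evaluated for every single token in the comprehension above, which gets slow on larger corpora. Building the stopword set once keeps each lookup O(1); a minimal variant of the same loop:

import nltk

# Build the stopword set once instead of once per token
stop_set = set(nltk.corpus.stopwords.words('english'))

for text in (label_text[1] for label_text in data):
    filtered_tokens = [token for token in nltk.word_tokenize(text) if token.lower() not in stop_set]
    print(filtered_tokens)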

If you still want to use your own list of stopwords, the following should do the trick:

import nltk

data = [['ham', 'And how you will do that, princess? :)'], ['spam', 'Urgent! Please call 09061213237 from landline. £5000 cash or a luxury 4* Canary Islands Holiday await collection']]
stopwords = ['a', 'able', 'about', 'across', 'after', 'all', 'almost', 'also', 'am', 'among', 'an', 'and', 'any' ]

for text in (label_text[1] for label_text in data): 
    filtered_tokens = [token for token in nltk.word_tokenize(text) if token.lower() not in stopwords]
    print(filtered_tokens)

>>> ['how', 'you', 'will', 'do', 'that', ',', 'princess', '?', ':', ')']
>>> ['Urgent', '!', 'Please', 'call', '09061213237', 'from', 'landline', '.', '£5000', 'cash', 'or', 'luxury', '4*', 'Canary', 'Islands', 'Holiday', 'await', 'collection']

You should probably convert l2 to a regex and re.sub each string in l1 using it. Something like:

import re

l1 = [['ham', 'And how you will do that, princess? :)'],
      ['spam',
       'Urgent! Please call 09061213237 from landline. £5000 cash or a luxury 4* Canary Islands Holiday await collection.....']]

l2 = ['a', ' able', ' about', ' across', ' after', ' all', ' almost', ' also', ' am', ' among', ' an', ' and', ' any']

stop_re = re.compile(
    r'(\s+|\b)({})\b'.format(r'|'.join(word.strip() for word in l2)),
    re.IGNORECASE)
cleaned = [[stop_re.sub('', part).strip() for part in sublist] for sublist in l1]

# cleaned ==>
#     [['ham', 'how you will do that, princess? :)'],
#      ['spam',
#       'Urgent! Please call 09061213237 from landline. £5000 cash or luxury 4* Canary Islands Holiday await collection.....']]
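A small hardening note: the join above assumes that none of the stopwords contain regex metacharacters. If your real l2 might, escape each word first (same pattern, with re.escape added):

stop_re = re.compile(
    r'(\s+|\b)({})\b'.format('|'.join(re.escape(word.strip()) for word in l2)),
    re.IGNORECASE)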

One of the problems here is that you iterate over l2 for every word in l1 when you do if j in l2 (membership testing on a list is O(n)), which makes it quite slow. Since you are only interested in whether a word is in l2, you can convert it to a set, which offers O(1) membership tests. It also appears that the words in l2 carry leading spaces, which makes them harder to match.

A second bug (quite common when deleting from a list while iterating over it) is that removing an item during forward iteration shifts the remaining items left, so the check on the next item is skipped. This is easily fixed by iterating in reverse over the list you are deleting from, as shown below.
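For example (a minimal demonstration, not code from the question):

words = ['an', 'and', 'any', 'offer']
for w in words:
    if w in ('an', 'and', 'any'):
        words.remove(w)
print(words)  # ['and', 'offer'] -- 'and' was never checked, because removing 'an' shifted it into an already-visited position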

# Strip the spaces in l2 by using strip() on each element, and convert it to a set
l2 = set(map(lambda x: x.strip(), l2))

for i in l1:
    i[1] = i[1].lower()
    i[1] = i[1].split()
    # Reverse so it won't skip words on iteration
    for j in reversed(i[1]):
        if j in l2:
            i[1].remove(j)
    # Put back the strings again
    i[1] = ' '.join(i[1])

The previous solution had a time complexity of O(m*n), where m is the total number of words to check and n is the number of stopwords. This solution should have a time complexity of only O(m).
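If you would rather avoid remove() calls altogether, the same O(m) filtering can be written as one comprehension per message (an equivalent sketch, not the answer's original code):

l2 = set(word.strip() for word in l2)

for i in l1:
    # Keep only the non-stopwords, then rebuild the string
    i[1] = ' '.join(w for w in i[1].lower().split() if w not in l2)

This produces the same lowercased, stopword-free string as the loop above.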
