Efficient method to exclude items in one list from another list in Python

Question

I have a list of 8000 strings (stop_words) and a list of 100,000 strings of various lengths, running to millions of individual words. I am using the function below to tokenize the 100,000 strings and to exclude non-alphanumeric tokens and tokens found in the list stop_words.

    def tokenizer(text):
        return [stemmer.stem(tok.lower()) for tok in nltk.word_tokenize(text)
                if tok.isalpha() and tok.lower() not in stop_words]

I have tested this code on 600 strings and it takes 60 seconds. If I remove the condition that excludes stop words, it takes 1 second on the same 600 strings:

    def tokenizer(text):
        return [stemmer.stem(tok.lower()) for tok in nltk.word_tokenize(text)
                if tok.isalpha()]

I am hoping there is a more efficient way to exclude items found in one list from another list.

I am grateful for any help or suggestions.

Thanks

Answer 1

Make stop_words a set so that lookups are O(1).

stop_words = set(('word1', 'word2', 'word3'))
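The difference between list and set membership can be checked with a rough timing sketch like the one below. The data is made up for illustration (8000 generated strings standing in for the real stop words), and `filter_with` is a hypothetical helper, not part of the original code. Absolute timings will vary by machine.

```python
import timeit

# Hypothetical data standing in for the 8000 stop words from the question.
stop_list = ["word%d" % i for i in range(8000)]
stop_set = set(stop_list)

# A small batch of tokens: one near the end of the list, one absent, one at the start.
tokens = ["word7999", "missing", "word0"] * 100

def filter_with(container):
    """Keep only the tokens that are not in the given container."""
    return [tok for tok in tokens if tok not in container]

# Both containers give the same result; only the lookup cost differs.
assert filter_with(stop_list) == filter_with(stop_set)

list_time = timeit.timeit(lambda: filter_with(stop_list), number=5)
set_time = timeit.timeit(lambda: filter_with(stop_set), number=5)
print("list: %.4fs  set: %.4fs" % (list_time, set_time))
```

The list version scans up to 8000 entries per token, while the set version does a single hash lookup, which is consistent with the slowdown reported in the question.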

Answer 2

Make stop_words a set, since checking membership in a set is O(1), while checking membership in a list is O(N).
Also call lower() once on the whole text instead of calling lower() twice on each token.

stop_words = set(stop_words)
def tokenizer(text):
   return [stemmer.stem(tok) for tok in nltk.word_tokenize(text.lower())
           if tok.isalpha() and tok not in stop_words]

Since accessing local variables is quicker than looking up qualified names, you may also gain a bit of speed by making nltk.word_tokenize and stemmer.stem local:

stop_words = set(stop_words)
def tokenizer(text, stem = stemmer.stem, tokenize = nltk.word_tokenize):
   return [stem(tok) for tok in tokenize(text.lower())
           if tok.isalpha() and tok not in stop_words]

The default values for stem and tokenize are set once, at the time the tokenizer function is defined. Inside tokenizer, stem and tokenize are local variables. Usually this kind of micro-optimization is not important, but since you are calling tokenizer 100K times, it may help you a little bit.
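To illustrate that defaults are bound once at definition time, here is a small sketch that runs without NLTK. `crude_stem` and `str.split` are hypothetical stand-ins for `stemmer.stem` and `nltk.word_tokenize`, chosen only to keep the example self-contained; they are not equivalent to the real functions.

```python
def crude_stem(word):
    # Hypothetical stand-in for stemmer.stem: strips one trailing "s".
    return word[:-1] if word.endswith("s") else word

stop_words = {"the", "and"}

# The defaults are evaluated once, when this def statement runs.
def tokenizer(text, stem=crude_stem, tokenize=str.split):
    return [stem(tok) for tok in tokenize(text.lower())
            if tok.isalpha() and tok not in stop_words]

print(tokenizer("The cats and dogs"))  # ['cat', 'dog']

# Rebinding the global name later does not change the stored default,
# because the function object was captured when tokenizer was defined.
crude_stem = None
print(tokenizer("The cats and dogs"))  # ['cat', 'dog']
```

One side effect of this pattern is that callers can also pass a different stem or tokenize function explicitly, which can be handy for testing.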

Answer 3

Use sets:

set(one_list) - set(other_list)

However, this removes duplicates and loses the original ordering, so if either matters you need something else.
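If order and duplicates do matter, one option is to convert only the exclusion list to a set and filter with a list comprehension. This is a minimal sketch with made-up sample data; the names `excluded` and `kept` are illustrative.

```python
one_list = ["b", "a", "c", "b", "d"]
other_list = ["a", "d"]

# Build the set once so each membership test is a hash lookup.
excluded = set(other_list)

# The comprehension walks one_list in order, so ordering and duplicates survive.
kept = [x for x in one_list if x not in excluded]
print(kept)  # ['b', 'c', 'b']
```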

3 answers

solution1 5 2013-01-12 13:10:19
solution2 3 ACCEPTED 2013-01-12 13:12:50
solution3 0 2013-01-12 13:12:19
