
Efficient method to exclude items in one list from another list in Python

I have a list of 8000 strings (stop_words) and a list of 100,000 strings of various lengths, running to millions of individual words. I am using the function below to tokenize the 100,000 strings and to exclude non-alphabetic tokens and tokens found in the list stop_words.

    def tokenizer(text):
        return [stemmer.stem(tok.lower()) for tok in nltk.word_tokenize(text)
                if tok.isalpha() and tok.lower() not in stop_words]

I have tested this code on 600 strings and it takes 60 seconds. If I remove the condition that excludes stop words, it takes 1 second on the same 600 strings:

    def tokenizer(text):
        return [stemmer.stem(tok.lower()) for tok in nltk.word_tokenize(text)
                if tok.isalpha()]

I am hoping there is a more efficient way to exclude items found in one list from another list.

I am grateful for any help or suggestions

Thanks

Make stop_words a set so that lookups are O(1).

stop_words = set(('word1', 'word2', 'word3'))
  • Make stop_words a set, since checking membership in a set is O(1), while checking membership in a list is O(N).
  • Call lower() on text (once) instead of lower() twice for each token.

stop_words = set(stop_words)
def tokenizer(text):
   return [stemmer.stem(tok) for tok in nltk.word_tokenize(text.lower())
           if tok.isalpha() and tok not in stop_words]
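
To see how much the container type matters, here is a rough timing sketch using only the standard library. The data is made up (the names stop_list, stop_set, tokens and filter_with are illustrative, not from the question), and none of the tokens match a stop word, which is the worst case for a list because every lookup scans all 8000 entries. Actual timings will vary by machine.

```python
import timeit

# Hypothetical data: 8000 fake stop words and 2000 tokens, none of which match.
stop_list = [f"stop{i}" for i in range(8000)]
stop_set = set(stop_list)
tokens = [f"tok{i}" for i in range(2000)]

def filter_with(container):
    # Keep the tokens that are not in the given container (list or set).
    return [tok for tok in tokens if tok not in container]

# Time one pass over the tokens with each container type.
list_time = timeit.timeit(lambda: filter_with(stop_list), number=1)
set_time = timeit.timeit(lambda: filter_with(stop_set), number=1)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

Both calls return the same tokens; only the cost of each membership test differs.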

Since accessing local variables is quicker than looking up qualified names, you may also gain a bit of speed by making nltk.word_tokenize and stemmer.stem local:

stop_words = set(stop_words)
def tokenizer(text, stem = stemmer.stem, tokenize = nltk.word_tokenize):
   return [stem(tok) for tok in tokenize(text.lower())
           if tok.isalpha() and tok not in stop_words]

The default values for stem and tokenize are set once, at the time the tokenizer function is defined. Inside tokenizer, stem and tokenize are local variables. Usually this kind of micro-optimization is not important, but since you are calling tokenizer 100K times, it may help you a little bit.
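
The same default-argument trick can be tried without nltk. The sketch below uses math.sqrt as a stand-in for stemmer.stem; the function names and data are hypothetical, and the size of any speedup depends on the Python version and on how expensive the called function is.

```python
import math
import timeit

def with_global_lookup(values):
    # Looks up the global name `math`, then its attribute `sqrt`, on every iteration.
    return [math.sqrt(v) for v in values]

def with_local_binding(values, sqrt=math.sqrt):
    # `sqrt` is bound once, when the function is defined, and is a local name here.
    return [sqrt(v) for v in values]

values = [float(i) for i in range(100_000)]
t_global = timeit.timeit(lambda: with_global_lookup(values), number=20)
t_local = timeit.timeit(lambda: with_local_binding(values), number=20)
print(f"global lookup: {t_global:.3f}s  local binding: {t_local:.3f}s")
```

Any gain here is small compared with the switch from a list to a set, so it is worth measuring before keeping the extra parameters.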

Use sets:

set(one_list) - set(other_list)

However, this removes duplicates and loses the original ordering, so if either matters you need something else.
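
If duplicates and order do matter, one option is to keep the list comprehension and convert only the exclusion list to a set. This is a minimal sketch; the helper name exclude and the sample words are made up for illustration.

```python
def exclude(items, unwanted):
    # Build the set once so each membership test is O(1) on average.
    unwanted_set = set(unwanted)
    # The list comprehension keeps the original order and any duplicates.
    return [item for item in items if item not in unwanted_set]

words = ["the", "cat", "and", "the", "cat", "sat"]
print(exclude(words, ["the", "and"]))  # ['cat', 'cat', 'sat']
```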
