Efficient way to exclude items in one list from another list in Python

Question

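I have a list of 8000 strings (stop_words) and a list of 100,000 strings of various lengths, running to millions of individual words. I am using the function below to tokenize the 100,000 strings and to exclude non-alphanumeric tokens and tokens from the list stop_words.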

    def tokenizer(text):
        return [stemmer.stem(tok.lower()) for tok in nltk.word_tokenize(text)
                if tok.isalpha() and tok.lower() not in stop_words]

I tested this code on 600 strings and it took 60 seconds. If I remove the condition that excludes stop words, it takes 1 second on the same 600 strings:

    def tokenizer(text):
        return [stemmer.stem(tok.lower()) for tok in nltk.word_tokenize(text)
                if tok.isalpha()]

I am hoping there is a more efficient way to exclude items in one list from another list.

I would appreciate any help or suggestions.

Thanks

Answer 1

Make stop_words a set so that lookups are O(1).

    stop_words = set(('word1', 'word2', 'word3'))
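To see why the container type matters, here is a minimal sketch that filters the same tokens against a list and against a set. It uses made-up data rather than the asker's corpus; the names `stop_list`, `stop_set`, `tokens`, and `filter_tokens`, and the sizes chosen, are assumptions for illustration only, and the timings will vary by machine.

```python
import timeit

# Hypothetical stand-ins for the real data: 8000 stop words and some tokens.
stop_list = ["stop%d" % i for i in range(8000)]
stop_set = set(stop_list)
tokens = ["word%d" % i for i in range(1000)] + stop_list[:200]

def filter_tokens(tokens, stop_words):
    """Keep the tokens that are not in stop_words (a list or a set)."""
    return [tok for tok in tokens if tok not in stop_words]

# Both containers give the same result; only the lookup cost differs.
# The list is scanned element by element, the set is probed by hash.
kept_with_list = filter_tokens(tokens, stop_list)
kept_with_set = filter_tokens(tokens, stop_set)

if __name__ == "__main__":
    t_list = timeit.timeit(lambda: filter_tokens(tokens, stop_list), number=3)
    t_set = timeit.timeit(lambda: filter_tokens(tokens, stop_set), number=3)
    print("list: %.4fs  set: %.4fs" % (t_list, t_set))
```

A token that is not a stop word is the worst case for the list, because every one of the 8000 entries has to be compared before the search gives up.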

Answer 2

Make stop_words a set, since checking membership in a set is O(1), while checking membership in a list is O(N).
Call lower() on the text (once) instead of calling lower() twice for every token.

    stop_words = set(stop_words)
    def tokenizer(text):
        return [stemmer.stem(tok) for tok in nltk.word_tokenize(text.lower())
                if tok.isalpha() and tok not in stop_words]

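Since accessing local variables is faster than looking up qualified names, you may also gain a bit of speed by making nltk.word_tokenize and stemmer.stem local:

    stop_words = set(stop_words)
    def tokenizer(text, stem=stemmer.stem, tokenize=nltk.word_tokenize):
        return [stem(tok) for tok in tokenize(text.lower())
                if tok.isalpha() and tok not in stop_words]

The default values for stem and tokenize are set once, when the tokenizer function is defined. Inside tokenizer, stem and tokenize are local variables. Usually this kind of micro-optimization does not matter, but since you are calling tokenizer 100K times, it may help you a little.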
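The default-argument trick can be illustrated without NLTK. The sketch below uses `math.sqrt` purely as a stand-in for `stemmer.stem`; the function names `root_global` and `root_local` are hypothetical and not part of the original answer.

```python
import math

def root_global(x):
    # Looks up the global name "math", then its attribute "sqrt", on every call.
    return math.sqrt(x)

def root_local(x, sqrt=math.sqrt):
    # "sqrt" was bound once, when the def statement ran; here it is a local.
    return sqrt(x)

# The default is stored on the function object at definition time.
bound_default = root_local.__defaults__[0]

if __name__ == "__main__":
    print(root_global(9.0), root_local(9.0))
    print(bound_default is math.sqrt)
```

One consequence of this design is that rebinding the original name later does not affect the function, since it keeps the object that was captured when it was defined.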

Answer 3

Use sets:

    set(one_list) - set(other_list)

But this removes duplicates and loses the ordering, so if that matters, you need something else.
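If order and duplicates do matter, one possible alternative is a list comprehension that tests membership against a set. This is a sketch, not part of the original answer; the helper name `exclude` and the sample data are made up for illustration.

```python
def exclude(items, unwanted):
    """Return the items not in unwanted, keeping order and duplicates."""
    unwanted_set = set(unwanted)  # built once, so each lookup is O(1) on average
    return [x for x in items if x not in unwanted_set]

one_list = ["the", "cat", "sat", "on", "the", "mat", "cat"]
other_list = ["the", "on"]

# Order is preserved and the repeated "cat" survives.
kept = exclude(one_list, other_list)

# The set difference drops the duplicate and has no defined order.
as_set = set(one_list) - set(other_list)

if __name__ == "__main__":
    print(kept)
    print(as_set)
```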

Answer 1
5 2013-01-12 13:10:19

Answer 2
3 accepted 2013-01-12 13:12:50

Answer 3
0 2013-01-12 13:12:19
