I have a list of 8000 strings (stop_words) and a list of 100,000 strings of various lengths, running to millions of individual words. I am using the function below to tokenize the 100,000 strings and to exclude non-alphabetic tokens and tokens that appear in stop_words.
def tokenizer(text):
    return [stemmer.stem(tok.lower()) for tok in nltk.word_tokenize(text)
            if tok.isalpha() and tok.lower() not in stop_words]
I have tested this code on 600 strings and it takes 60 seconds. If I remove the condition that excludes stop words, it takes 1 second on the same 600 strings:
def tokenizer(text):
    return [stemmer.stem(tok.lower()) for tok in nltk.word_tokenize(text)
            if tok.isalpha()]
I am hoping there is a more efficient way to exclude items found in one list from another list.
I am grateful for any help or suggestions.
Thanks
Make stop_words a set so that lookups are O(1):
stop_words = set(('word1', 'word2', 'word3'))
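Make stop_words a set, since checking membership in a set is O(1), while checking membership in a list is O(N). With 8000 stop words, every token that is not a stop word gets compared against all 8000 list entries, which is why the stop-word check dominates the running time. Also, call lower() on text (once) instead of calling lower() twice for each token:
stop_words = set(stop_words)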
def tokenizer(text):
    return [stemmer.stem(tok) for tok in nltk.word_tokenize(text.lower())
            if tok.isalpha() and tok not in stop_words]
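To see the difference between list and set membership in isolation, here is a minimal timing sketch. The 8000 generated strings are a hypothetical stand-in for the real stop word list, and the absolute numbers will vary by machine; only the relative gap is of interest.

```python
import timeit

# Hypothetical stand-in for the 8000-entry stop word list.
stop_list = ["word%d" % i for i in range(8000)]
stop_set = set(stop_list)

# A token that is not a stop word is the worst case for the list:
# all 8000 entries are compared before the lookup gives up.
token = "tokenizer"

list_time = timeit.timeit(lambda: token in stop_list, number=2000)
set_time = timeit.timeit(lambda: token in stop_set, number=2000)

print("list: %.4f s, set: %.4f s" % (list_time, set_time))
```

Most tokens in ordinary text are not stop words, so the worst case for the list is also the common case.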
Since accessing local variables is quicker than looking up qualified names, you may also gain a bit of speed by making nltk.word_tokenize and stemmer.stem local:
stop_words = set(stop_words)
def tokenizer(text, stem=stemmer.stem, tokenize=nltk.word_tokenize):
    return [stem(tok) for tok in tokenize(text.lower())
            if tok.isalpha() and tok not in stop_words]
The default values for stem and tokenize are set once, at the time the tokenizer function is defined. Inside tokenizer, stem and tokenize are local variables. Usually this kind of micro-optimization is not important, but since you are calling tokenizer 100K times, it may help you a little bit.
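One consequence of binding defaults at definition time is worth knowing: rebinding the original name later does not change what the function uses. A small sketch, where shout and transform are hypothetical stand-ins for stemmer.stem and tokenizer so that it runs without nltk:

```python
def shout(word):
    return word.upper()

def transform(words, func=shout):
    # func is a local name, bound to the object that shout
    # referred to at the moment transform was defined.
    return [func(w) for w in words]

# Rebinding the global name afterwards does not affect the default.
def shout(word):
    return word.lower()

print(transform(["Hello", "World"]))  # ['HELLO', 'WORLD']
```

So if you swap in a different stemmer later, redefine tokenizer as well, or pass the new function explicitly through the stem argument.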
Use sets:
set(one_list) - set(other_list)
However, this removes duplicates and loses the original ordering, so if either matters you need something else.
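If duplicates and order do matter, one option is to keep the first list as a list and use a set only for the lookups. A minimal sketch, where exclude and the sample data are made up for illustration:

```python
def exclude(items, unwanted):
    # Build the set once so each membership test is O(1) on average.
    unwanted_set = set(unwanted)
    # A list comprehension keeps both the order and any duplicates.
    return [x for x in items if x not in unwanted_set]

tokens = ["the", "cat", "sat", "on", "the", "cat", "mat"]
stop_words = ["the", "on"]

print(exclude(tokens, stop_words))  # ['cat', 'sat', 'cat', 'mat']
```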