
The syntax is right but it runs terribly slow. How could I improve this piece of code?

import nltk

def vocab(text):
    vocab = [w for w in text if w not in nltk.corpus.stopwords.words('english')
             and w.isalpha()]
    fd = nltk.FreqDist(vocab)
    print([w for w, n in fd.most_common(50)])

# Define a function that returns the 50 most frequent words in a text
# (filtering out stopwords and punctuation).

The code works fine but is terribly slow. It is a simple function and should not take so long to respond. I wonder if there is a way to speed it up.

A couple of things:

import collections  # collections.Counter will do the frequency counting
import nltk

# Build the set of stopwords once, outside the function, so it is not
# recomputed on every invocation of `vocab`
stopword_set = set(nltk.corpus.stopwords.words('english'))

def vocab2(text):
    # Flip the order of the isalpha and stopword tests:
    # isalpha is faster, and since `and` short-circuits,
    # the stopword lookup is skipped whenever isalpha returns False.
    text = [w for w in text if w.isalpha() and w not in stopword_set]
    return [w for w, n in collections.Counter(text).most_common(50)]

Timeit says the new version is about 140 times faster:

original 1.2306433910052874
fixed 0.008700065001903567
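For reference, here is a minimal sketch of how such a comparison could be run with timeit. The corpus used as `text` is an assumption, since the original post does not say which text was measured; note that `vocab` prints its result, so expect output from the first call.

import timeit
import nltk

# Assumption: a Gutenberg sample serves as the test text; any tokenized
# word list would work the same way.
text = nltk.corpus.gutenberg.words('austen-emma.txt')

print('original', timeit.timeit(lambda: vocab(text), number=1))
print('fixed   ', timeit.timeit(lambda: vocab2(text), number=1))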

You don't say which part of your code is slow, but here is a possibility.

nltk.corpus.stopwords.words('english') returns a list. You can speed up your code by putting its contents in a set before you start iterating through your text.

stopwords = set(nltk.corpus.stopwords.words('english'))
vocab = [w for w in text if w not in stopwords and w.isalpha()]

Looking something up in a set is a hash lookup, which is much faster than scanning through a list element by element.
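A quick sketch of the difference, assuming the nltk stopwords corpus is already downloaded; the word 'walrus' is an arbitrary example that is not a stopword, so the list version has to scan the whole list on every test:

import timeit
import nltk

stopword_list = nltk.corpus.stopwords.words('english')  # a list
stopword_set = set(stopword_list)                       # a set

# A list membership test scans the elements one by one (O(n));
# a set membership test is a hash lookup (O(1) on average).
print('list:', timeit.timeit(lambda: 'walrus' in stopword_list, number=100_000))
print('set: ', timeit.timeit(lambda: 'walrus' in stopword_set, number=100_000))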
