
The syntax is right but it runs terribly slow. How could I improve this piece of code?

import nltk

def vocab(text):
    vocab = [w for w in text if w not in nltk.corpus.stopwords.words('english')
             and w.isalpha()]
    fd = nltk.FreqDist(vocab)
    print([w for w, n in fd.most_common(50)])

# Define a function that returns the 50 most frequent words in a text
# (filtering out stopwords and punctuation).

The code works fine but is terribly slow. It is a simple function and should not take so long to respond. I wonder if there is a way to speed it up.

A couple of things:

import collections  # collections.Counter will do the frequency counting
import nltk

# Build the set of stopwords once, outside the function, so it is not
# recomputed on every invocation of `vocab`
stopword_set = set(nltk.corpus.stopwords.words('english'))

def vocab2(text):
    # Flip the order of the isalpha and stopword tests:
    # isalpha is faster, and since `and` short-circuits,
    # the stopword lookup is skipped whenever isalpha returns False.
    text = [w for w in text if w.isalpha() and w not in stopword_set]
    return [w for w, n in collections.Counter(text).most_common(50)]

Timeit says the new version is about 140 times faster:

original 1.2306433910052874
fixed 0.008700065001903567
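For reference, here is a minimal sketch of how such a comparison could be run with timeit. The corpus used as `text` is an assumption, since the original post does not say which text was measured; note that `vocab` prints its result, so expect output from the first call.

import timeit
import nltk

# Assumption: a Gutenberg sample serves as the test text; any tokenized
# word list would work the same way.
text = nltk.corpus.gutenberg.words('austen-emma.txt')

print('original', timeit.timeit(lambda: vocab(text), number=1))
print('fixed   ', timeit.timeit(lambda: vocab2(text), number=1))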

You don't say which part of your code is slow, but here is a possibility.

nltk.corpus.stopwords.words('english') returns a list. You can speed up your code by putting its contents in a set before you start iterating through your text.

stopwords = set(nltk.corpus.stopwords.words('english'))
vocab = [w for w in text if w not in stopwords and w.isalpha()]

Looking something up in a set is a hash lookup, which is much faster than scanning through a list element by element.
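A quick sketch of the difference, assuming the nltk stopwords corpus is already downloaded; the word 'walrus' is an arbitrary example that is not a stopword, so the list version has to scan the whole list on every test:

import timeit
import nltk

stopword_list = nltk.corpus.stopwords.words('english')  # a list
stopword_set = set(stopword_list)                       # a set

# A list membership test scans the elements one by one (O(n));
# a set membership test is a hash lookup (O(1) on average).
print('list:', timeit.timeit(lambda: 'walrus' in stopword_list, number=100_000))
print('set: ', timeit.timeit(lambda: 'walrus' in stopword_set, number=100_000))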
