# Define a function that returns the 50 most frequent words in a text
# (filtering out stopwords and punctuation).
import nltk

def vocab(text):
    vocab = [w for w in text if w not in nltk.corpus.stopwords.words('english')
             and w.isalpha()]
    fd = nltk.FreqDist(vocab)
    print([w for w, n in fd.most_common(50)])

The code works fine but is terribly slow. It is a simple function and should not take so long to respond. I wonder if there is a way to speed it up.
A couple of things:
import collections  # We'll use `collections.Counter`; it could be optimized

# Make a set of the stopwords, and don't recompute it for
# each invocation of `vocab`.
stopword_set = set(nltk.corpus.stopwords.words('english'))

def vocab2(text):
    # Flip the order of the stopword test and isalpha;
    # we assume isalpha is faster, and since `and` is short-circuited,
    # if it returns False, the stopword test is not done.
    text = [w for w in text if w.isalpha() and w not in stopword_set]
    return [w for w, n in collections.Counter(text).most_common(50)]
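The short-circuit point can be seen in isolation. This tiny snippet (not part of the original answer) shows that when the left operand of `and` is False, the right operand is never evaluated, which is why the cheap `isalpha` check goes first:

cheap_false = '123'.isalpha()          # False: digits are not alphabetic
result = cheap_false and (1 / 0 == 0)  # no ZeroDivisionError; `1 / 0` never runs
print(result)                          # prints False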
Timeit says the new version is about 140 times faster:
original 1.2306433910052874
fixed 0.008700065001903567
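A comparison like that could be produced with `timeit` roughly as below. This is a sketch, not the answer's actual benchmark: the corpus choice (Austen's Emma from NLTK's Gutenberg sample) and the single-run timing are assumptions.

import timeit
import nltk

nltk.download('gutenberg', quiet=True)   # skip if the corpora are already installed
nltk.download('stopwords', quiet=True)

words = list(nltk.corpus.gutenberg.words('austen-emma.txt'))  # assumed test text

# `vocab` prints its own result, so expect that output to appear as well.
print('original', timeit.timeit(lambda: vocab(words), number=1))
print('fixed   ', timeit.timeit(lambda: vocab2(words), number=1))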
You don't say which part of your code is slow, but here is a possibility.
nltk.corpus.stopwords.words('english')
returns a list. You can speed up your code by putting its contents in a set before you start iterating through your text.
stopwords = set(nltk.corpus.stopwords.words('english'))
vocab = [w for w in text if w not in stopwords and w.isalpha()]
Looking something up in a set is a hash-based, roughly constant-time operation, whereas testing membership in a list scans the list element by element, and that scan is repeated for every word in your text.
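As a rough illustration (not from the answer), you can time the difference directly; the word 'walrus' here is an arbitrary non-stopword, so the list lookup has to scan the whole list before giving up:

import timeit
import nltk

stopword_list = nltk.corpus.stopwords.words('english')  # a list of ~180 words
stopword_set = set(stopword_list)                        # the same words, as a set

# A miss forces the list version to scan every element; the set version hashes once.
print('list', timeit.timeit(lambda: 'walrus' in stopword_list, number=100_000))
print('set ', timeit.timeit(lambda: 'walrus' in stopword_set, number=100_000))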