
Fastest way to count a list of words in an article using python

I want to count how many times all the words in a bag of words appear in an article. I am not interested in the frequency of each individual word, only the total number of times all of them are found in the article. I have to analyse hundreds of articles as I retrieve them from the internet, and my algorithm is slow since each article is about 800 words long.

Here is what I do (where amount is the number of times the words were found in a single article, article is a string with the full article text, and I use NLTK to tokenize):

bag_of_words = tokenize(bag_of_words)
tokenized_article = tokenize(article)

occurrences = [word for word in tokenized_article
                    if word in bag_of_words]

amount = len(occurrences)

The tokenized_article looks like this:

[u'sarajevo', u'bosnia', u'herzegovi', u'war', ...]

And so does bag_of_words.

I was wondering if there's any more efficient/faster way of doing it using NLTK or lambda functions, for instance.

I suggest using a set for the words you are matching against - a set has constant-time membership tests, so it is faster than using a list (which has linear-time membership tests).

For example:

bag_of_words_set = set(bag_of_words)  # build the set once, not once per word

occurrences = [word for word in tokenized_article
               if word in bag_of_words_set]

amount = len(occurrences)
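
As a further tweak (my addition, not part of the original answer): since only the total is needed, you can skip building the intermediate list entirely with a generator expression:

# reuses bag_of_words_set from above; sum() never materializes the matches
amount = sum(1 for word in tokenized_article if word in bag_of_words_set)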

Some timing tests (with an artificially created list, repeated ten times):

In [4]: words = s.split(' ') * 10

In [5]: len(words)
Out[5]: 1060

In [6]: to_match = ['NTLK', 'all', 'long', 'I']

In [9]: def f():
   ...:     return len([word for word in words if word in to_match])

In [13]: timeit(f, number = 10000)
Out[13]: 1.0613768100738525

In [14]: set_match = set(to_match)

In [15]: def g():
    ...:     return len([word for word in words if word in set_match])

In [18]: timeit(g, number = 10000)
Out[18]: 0.6921310424804688

Some other tests:

In [22]: p = re.compile('|'.join(set_match))

In [23]: p
Out[23]: re.compile(r'I|all|NTLK|long')


In [28]: def h():
    ...:     return len(list(filter(p.match, words)))  # list() for Python 3, where filter returns an iterator

In [29]: timeit(h, number = 10000)
Out[29]: 2.2606470584869385

Use sets for membership testing.

Another approach is to first count the occurrences of each word in the article, and then add a word's count to the total if the word is in the bag. This pays off when articles repeat words and are not very short: if an article contains "the" 10 times, we check membership once instead of 10 times.

from collections import Counter

def count_total(tokenized_article, bag_of_words_set):
    # one membership test per *distinct* word; add its full count on a match
    return sum(c for word, c in Counter(tokenized_article).items()
               if word in bag_of_words_set)
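
For example, with some illustrative sample data (the data is mine, not from the answer):

tokens = [u'war', u'in', u'bosnia', u'war', u'war']
bag = {u'war', u'peace'}
print(count_total(tokens, bag))  # -> 3: "war" is tested once but counted three times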

If you don't want the counts, it's not a "bag of words" anymore, but a set of words. So convert your document to a set if that is really the case.

Avoid for loops and lambda functions, in particular nested ones. They require a lot of interpreter work and are slow. Instead, try to use optimized calls such as intersection (for performance, libraries such as numpy are also very good, because they do the work in low-level C/Fortran/Cython code).

i.e.

count = len(bag_of_words_set.intersection(set(tokenized_article)))

where bag_of_words_set is the words you are interested in, as a set.
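
A note on semantics (my addition): the intersection counts each matching word once, no matter how often it occurs in the article. Also, set.intersection accepts any iterable, so converting the article to a set first is optional; a minimal sketch, reusing the question's tokenize:

bag_of_words_set = set(tokenize(bag_of_words))  # built once, reused for every article

# counts *distinct* matching words; repeats in the article are ignored
count = len(bag_of_words_set.intersection(tokenized_article))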

If you want a classic word count instead, use a collections.Counter:

from collections import Counter
counter = Counter()
...
counter.update(tokenized_article)

This will count all words though, including those not in your list. You can try this, but it may turn out to be slower because of the loop:

bag_of_words_set = set(bag_of_words)
...
for w in tokenized_article:
    if w in bag_of_words_set:  # use a set, not a list!
        counter[w] += 1

A bit more complex, but potentially faster, is using two Counters: one for the running totals, and one for the current document.

doc_counter.clear()
doc_counter.update(tokenized_article)
for w in list(doc_counter.keys()):  # copy keys: deleting during iteration breaks on Python 3
    if w not in bag_of_words_set:
        del doc_counter[w]
counter.update(doc_counter)  # untested.

Using a Counter for the document is beneficial if you have many repeated unwanted words, since it saves a lookup per repetition. It is also better for multithreaded operation (easier synchronization).
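
Put together, the two-Counter pipeline might look like this (a sketch under my assumptions: tokenize is the question's tokenizer, and articles is a hypothetical iterable of article strings, not a name from the answer):

from collections import Counter

bag_of_words_set = set(tokenize(bag_of_words))  # fixed vocabulary, built once
counter = Counter()      # running totals across all articles
doc_counter = Counter()  # per-article scratch counter

for article in articles:  # 'articles' is an assumed iterable of article strings
    doc_counter.clear()
    doc_counter.update(tokenize(article))
    for w in list(doc_counter.keys()):  # copy keys before deleting
        if w not in bag_of_words_set:
            del doc_counter[w]
    counter.update(doc_counter)

amount = sum(counter.values())  # total occurrences of all bag words, all articles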
