Fastest way of checking for word in huge list of keywords - Python performance

Question

Thanks for amazingly quick response. Stackoverflow is awesome!

I need to check if a word (or rather thousands of them) is matching a dict containing keywords.

For example, say I have a string: "The fluffy fox jumped the friggin fence." I need to check each word of the string against a dict of keywords, and if there's a match, return all values.

I've created a dict filters : ( uniqueid means ie. "lk2m3lk4m2", rest is 'static'.)

filters:
        { "fox" : [
                    { 'subscription' : 'uniqueid', 'link' : 'uniqueid' },
                    { 'subscription' : 'uniqueid', 'link' : 'uniqueid' }
                  ]},

        { "fence" : [
                      { 'subscription' : 'uniqueid', 'link' : 'uniqueid' }
                    ]}

...and plan to iterate over filters for each word in string (and I have to do this with perhaps 5000 words / second. In other words, performance is the issue ABOVE ALL.

The number of filter-keywords may grow to thousands, while the strings will never be more than a normal sentence long (ie. 5-20 words). I'll therefore iterate over each word in the string and check if it's contained in the filter-list. However, at 500 sentences / second, I'm still looking at a lot of computation.

Is it possible to sort the list (ie. the key of dict's in list) and thus drastically improve performance, for example? And are there C-implementations I should use (like I'm using cjson with great performance gain)?

Sorry for the somewhat fluid question -- but how should I go about this task?

Edit:

Expected input:
"The fluffy fox jumped the friggin fence."
Expected output:
{ 'subscription' : 'flskdmfslk32232', 'link' : 'sfdksmfls22323' }, { 'subscription' : '3023940fsdf', 'link' : 'sdflsfm223' }
(ie. the subscriptions listed under each matching keyword.)

Answer 1

You can determine if a word is a key in filters by simply by doing filters.has_key(word) or by doing:

subscriptions = filters.get(word)
if subscriptions is not None:
    pass # TODO do something with subscriptions

or:

try:
    subscriptions = filters[word]
    # TODO do something with subscriptions
except:
    pass # probably don't need to do anything if not present

It isn't necessary to iterate over each entry in filters. Rather you will want to split your input string, add each word to a Set (to eliminate duplicates), and then iterate over your set to look up each word in your filters dictionary.

Answer 2

The fastest way to do it in Python would be to use a dictionary look each word of the sentence up in it, and accumulate and associated values. The main data structure would probably look something like this:

filters = {
    "fox" : (
              ('uniqueid1', 'uniqueid2'),
              ('uniqueid3', 'uniqueid4'),
            ),
    "fence" : (
                ('uniqueid5', 'uniqueid6'),
              ),
          }

Using this way (on 8-bit chars):

from string import punctuation

sentence = 'The fluffy fox jumped the friggin fence.'
sentence = sentence.translate(None, punctuation)  # remove punctuation chars

print [filters.get(word) for word in sentence.split() if word in filters]

Or it might be faster (time it to find out) like this which avoids a double dictionary look-up:

from string import punctuation

def map_words(sentence):
    for word in sentence.translate(None, punctuation).split():
        try:
            yield filters[word]
        except KeyError:
            pass

sentence = 'The fluffy fox jumped the friggin fence.'
print [v for v in map_words(sentence)]

Either way this is the output:

[(('uniqueid1', 'uniqueid2'), ('uniqueid3', 'uniqueid4')), (('uniqueid5', 'uniqueid6'),)]

Fastest way of checking for word in huge list of keywords - Python performance

Question

2 answers

solution1
4 ACCPTED 2012-11-12 20:46:40

solution2
2 2012-11-12 20:44:00

Fastest way of checking for word in huge list of keywords - Python performance

Question

2 answers

solution1 4 ACCPTED 2012-11-12 20:46:40

solution2 2 2012-11-12 20:44:00

solution1
4 ACCPTED 2012-11-12 20:46:40

solution2
2 2012-11-12 20:44:00