
Efficient string searching in Python

Let's say I have a database of about 2,000 keywords, each of which maps to a few common variations.

For example:

 "Node" : ["node.js", "nodejs", "node js", "node"] 

 "Ruby on Rails" : ["RoR", "Rails", "Ruby on Rails"]

and I want to search a string (ok, a document) and return a list of all the contained keywords.

I know I could loop through a ton of regex searches, but is there a more efficient way of doing this? Something approximating "real time" or near-real time for a web app?

I am currently looking at the Elasticsearch documentation, but I want to know whether there is a Pythonic way to achieve this.

I am pretty familiar with regex, but I don't want to write so many regular expressions. I would appreciate your answers, or if you could point me in the right direction.

You can use a data structure that inverts this dictionary of keywords, so that each of ["node.js", "nodejs", "node js", "node", "Node"] is a key with the value "Node", and likewise each of the 10 or so variants for the other 2,000 keywords points to its canonical keyword. That gives a dictionary of roughly 20,000 entries, which is not much.

With that dict, you can retokenize your text so that it is composed only of the normalized forms of the keywords, and then proceed to count them.

primary_dict = {
    "Node": ["node.js", "nodejs", "node js", "node", "Node"],
    "Ruby_on_Rails": ["RoR", "Rails", "Ruby on Rails"]
}

def invert_dict(src):
    dst = {}
    for key, values in src.items():
        for value in values:
            dst[value] = key
    return dst

from collections import Counter

words = invert_dict(primary_dict)

def count_keywords(text):
    counted = Counter()
    for word in text.split():  # or use a regex to split on punctuation as well
        keyword = words.get(word)
        if keyword is not None:  # ignore words that are not keyword variants
            counted[keyword] += 1
    return counted
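
For illustration, assuming the primary_dict above, a quick run looks like this:

text = "I deployed a node.js app and a Rails app"
print(count_keywords(text))
# Counter({'Node': 1, 'Ruby_on_Rails': 1})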

As for efficiency, this approach is rather nice, since each word in the text is looked up in the dictionary only once, and Python's dict lookup is O(1) on average, which makes the whole pass roughly O(n) in the number of words. A single mega-regexp, as you were considering, has to try thousands of alternatives at each position in the text, which is considerably slower than one dict lookup per word.

If the text is very long, pre-tokenizing the whole thing with a single split (or regexp) may not be feasible; in that case, you can read a chunk of text at a time and split each small chunk into words.
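
A minimal sketch of that chunked idea, reusing the words inverted dict and Counter from above (the last, possibly cut-off word of each chunk is carried over to the next one):

def count_keywords_chunked(fileobj, chunk_size=64 * 1024):
    counted = Counter()
    leftover = ""  # partial word carried over from the previous chunk
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        chunk = leftover + chunk
        parts = chunk.split()
        # the last token may be cut in half at the chunk boundary; keep it for the next round
        if parts and not chunk[-1].isspace():
            leftover = parts.pop()
        else:
            leftover = ""
        for word in parts:
            keyword = words.get(word)
            if keyword is not None:
                counted[keyword] += 1
    if leftover:  # flush the final partial word
        keyword = words.get(leftover)
        if keyword is not None:
            counted[keyword] += 1
    return counted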

Other approach

Since you don't necessarily need the count for each word, an alternative is to create Python sets with the words in your document and with all the keyword variants in your list, and then take the intersection of the two sets. You can then map only the variants in this intersection back to their keywords through the inverted words dict above.
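
A minimal sketch of that set-based idea, reusing the words inverted dict from above:

def contained_keywords(text):
    doc_words = set(text.split())        # distinct tokens in the document
    hits = doc_words & set(words)        # variants that actually occur in it
    return {words[variant] for variant in hits}  # map variants back to canonical keywords

# contained_keywords("I deployed a node.js app and a Rails app") -> {'Node', 'Ruby_on_Rails'}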

Catch: none of this takes into account terms that contain whitespace. I am assuming words can be tokenized and matched individually, but str.split and simple punctuation-removing regexps can't account for compound terms like 'ruby on rails' and 'node js'. If there is no other workaround for you, then instead of a plain split you will have to write a custom tokenizer that tries to match windows of one, two and three words throughout the text against the inverted dict.
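
A minimal sketch of such a tokenizer, again reusing the words inverted dict and Counter from above (a greedy scan that prefers the longest window of up to three words):

def count_keywords_multiword(text, max_words=3):
    counted = Counter()
    tokens = text.split()
    i = 0
    while i < len(tokens):
        # try the longest window first: three words, then two, then one
        for size in range(max_words, 0, -1):
            candidate = " ".join(tokens[i:i + size])
            keyword = words.get(candidate)
            if keyword is not None:
                counted[keyword] += 1
                i += size
                break
        else:
            i += 1  # no keyword variant starts at this token
    return counted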

An alternative approach useful for tokenizing long strings is to construct a single omnibus regular expression, then use named groups to identify tokens. It takes a little setup, but the recognition phase is pushed into C/native code, and takes just a single pass, so it can be quite efficient. For example:

import re

tokens = {
    'a': ['andy', 'alpha', 'apple'],
    'b': ['baby']
}

def create_macro_re(tokens, flags=0):
    """
    Given a dict in which keys are token names and values are lists
    of strings that signify the token, return a macro re that encodes
    the entire set of tokens.
    """
    d = {}
    for token, vals in tokens.items():
        d[token] = '(?P<{}>{})'.format(token, '|'.join(vals))
    combined = '|'.join(d.values())
    return re.compile(combined, flags)

def find_tokens(macro_re, s):
    """
    Given a macro re constructed by `create_macro_re()` and a string,
    return a list of tuples giving the token name and actual string matched
    against the token.
    """
    found = []
    for match in re.finditer(macro_re, s):
        found.append([(t, v) for t, v in match.groupdict().items() if v is not None][0])
    return found    

Final step, running it:

macro_pat = create_macro_re(tokens, re.I)
print(find_tokens(macro_pat, 'this is a string of baby apple Andy'))

macro_pat ends up corresponding to:

re.compile(r'(?P<a>andy|alpha|apple)|(?P<b>baby)', re.IGNORECASE)

And the second line prints a list of tuples, each giving the token and the actual string matched against the token:

[('b', 'baby'), ('a', 'apple'), ('a', 'Andy')]

This example shows how a list of tokens can be compiled into a single regular expression, which can then be run efficiently against a string in a single pass.

Left unshown is one of its great strengths: the ability to define tokens not just through strings, but through regular expressions. So if we want alternate spellings of the b token, for example, we don't have to list them exhaustively; normal regex patterns suffice. Say we wanted to also recognize 'babby' as a b token. We could write 'b': ['baby', 'babby'] as before, or we could use a regex to do the same thing: 'b': ['babb?y']. Or 'bab+y' if you also want to allow an arbitrary number of internal 'b' characters.
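
For instance, a quick check of that, using the create_macro_re and find_tokens functions defined above, might look like:

tokens = {
    'a': ['andy', 'alpha', 'apple'],
    'b': ['babb?y'],  # one pattern covers both 'baby' and 'babby'
}
macro_pat = create_macro_re(tokens, re.I)
print(find_tokens(macro_pat, 'babby apple'))
# [('b', 'babby'), ('a', 'apple')]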
