Python Regex - Fast replace of multiple keywords with punctuation and starting with

This is an extension of a previous question.

I have a Python dictionary like this:

a = {"animal": [ "dog", "cat", "dog and cat"], "XXX": ["I've been", "asp*", ":)"]}

I want to replace, as fast as possible, every occurrence of the words in the dictionary values with the corresponding key. The solution should scale to large texts. If a word ends with an asterisk, every word in the text that starts with that prefix should be replaced.

So the sentence "I've been bad but I aspire to be a better person, and behave like my dog and cat:)" should be transformed into "XXX bad but I XXX to be a better person, and behave like my animal XXX".

I am trying to use trrex for this, on the assumption that it is the fastest option. Is it? So far I have not managed to make it work, and I have run into two problems:

  • handling keywords that include punctuation (such as ":)" and "I've been");
  • handling keywords that are contained in other keywords, such as "dog" and "dog and cat".

Can you help me achieve my goal with a scalable solution?

You can tweak this solution to suit your needs:

  • Create another dictionary from a with the same keys, where each value is a regex built from the original list of words.
  • If a * char is found, replace it with \w* to match zero or more word chars, or with \S* to match zero or more non-whitespace chars (adjust the def quote(self, char) method accordingly); otherwise, escape the char.
  • Use unambiguous word boundaries, (?<!\w) and (?!\w), or remove them altogether if they interfere with matching non-word entries.
  • The first regex here will look like (?<!\w)(?:cat|dog(?:\ and\ cat)?)(?!\w) and the second will look like (?<!\w)(?::\)|I've\ been|asp\w*)(?!\w).
  • Replace in a loop.
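
The boundary and wildcard choices in the list above can be checked on small inputs. The following is only an illustrative sketch: the sample strings and variable names are made up for the demonstration and are not part of the solution itself.

```python
import re

emoticon = re.escape(":)")

# \b needs a word character on one side, so it fails when ":)" sits
# between a space and the end of the string
with_b = re.findall(rf"\b{emoticon}\b", "dog and cat :)")
# The lookarounds only require that no word character is adjacent
with_lookarounds = re.findall(rf"(?<!\w){emoticon}(?!\w)", "dog and cat :)")
# The left lookbehind still blocks an emoticon glued to a word ("cat:)")
glued = re.findall(rf"(?<!\w){emoticon}(?!\w)", "dog and cat:)")

sample = "I aspire, using asp.net"
word_chars = re.findall(r"(?<!\w)asp\w*", sample)  # stops at punctuation
non_space = re.findall(r"(?<!\w)asp\S*", sample)   # runs to the next whitespace

print(with_b, with_lookarounds, glued)
print(word_chars, non_space)
```

If entries such as ":)" must also match when they directly follow a word, the left lookbehind is the part that would need to be dropped for those entries.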

See the Python demo:

import re

# Input
text = "I've been bad but I aspire to be a better person, and behave like my dog and cat :)"
a = {"animal": [ "dog", "cat", "dog and cat"], "XXX": ["I've been", "asp*", ":)"]}

class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""
    def __init__(self):
        self.data = {}

    def add(self, word):
        # Walk or create one nested dict per character; '' marks the end of a word
        ref = self.data
        for char in word:
            ref = ref.setdefault(char, {})
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        if char == '*':
            return r'\w*'
        else:
            return re.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
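        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                recurse = self._pattern(data[char])
                if recurse is not None:
                    alt.append(self.quote(char) + recurse)
                elif char == '*':
                    # \w* is a multi-character pattern, so it must not go into a character class
                    alt.append(self.quote(char))
                else:
                    # Leaf character: collect it for a character class
                    cc.append(self.quote(char))
            else:
                # The '' end-of-word marker: whatever follows this point is optional
                q = 1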
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())

# Creating patterns: one compiled regex per key
a2 = {}
for k, v in a.items():
    trie = Trie()
    for w in v:
        trie.add(w)
    a2[k] = re.compile(fr"(?<!\w){trie.pattern()}(?!\w)", re.I)

# Replace in a loop, one pass over the text per key
for k, r in a2.items():
    text = r.sub(k, text)

print(text)
# => XXX bad but I XXX to be a better person, and behave like my animal XXX
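
The loop above scans the text once per key. If that becomes a bottleneck with many keys, one possible variation is to merge everything into a single regex and pick the replacement in a callback. The sketch below is an assumption-laden alternative, not the approach from the demo: it uses a plain alternation sorted longest-first instead of the trie, the names to_regex, keys, groups, combined and replace_all are hypothetical, and it has not been benchmarked.

```python
import re

text = "I've been bad but I aspire to be a better person, and behave like my dog and cat :)"
a = {"animal": ["dog", "cat", "dog and cat"], "XXX": ["I've been", "asp*", ":)"]}

def to_regex(word):
    # Hypothetical helper: a trailing "*" becomes a word-character wildcard
    if word.endswith("*"):
        return re.escape(word[:-1]) + r"\w*"
    return re.escape(word)

keys = list(a)
groups = []
for key in keys:
    # Longest keyword first, so "dog and cat" is tried before "dog"
    words = sorted(a[key], key=len, reverse=True)
    groups.append("(" + "|".join(to_regex(w) for w in words) + ")")

# One capturing group per key; m.lastindex tells which group (and so which key) matched
combined = re.compile(r"(?<!\w)(?:" + "|".join(groups) + r")(?!\w)", re.I)

def replace_all(s):
    # Single pass: inserted keys are never rescanned by another key's pattern
    return combined.sub(lambda m: keys[m.lastindex - 1], s)

print(replace_all(text))
```

Note that when keywords from different keys can start at the same position, the key listed first in the dictionary takes precedence in this sketch.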


 