
Python Regex - Fast replace of multiple keywords with punctuation and starting with

This is an extension of this previous question.

I have a Python dictionary, made like this:

a = {"animal": [ "dog", "cat", "dog and cat"], "XXX": ["I've been", "asp*", ":)"]}

I want to replace, as fast as possible, every occurrence of the words in the dictionary values with their keys. The solution should scale to large texts. If a word ends with an asterisk, every word in the text that starts with that prefix should be replaced.

So the following sentence "I've been bad but I aspire to be a better person, and behave like my dog and cat:)" should be transformed into "XXX bad but I XXX to be a better person, and behave like my animal XXX".

I am trying to use trrex for this, thinking it should be the fastest option. Is it? However, I cannot get it to work. Moreover, I run into problems:

  • in handling words that include punctuation (such as ":)" and "I've been");
  • when some string is repeated, as with "dog" and "dog and cat".

Can you help me achieve my goal with a scalable solution?

You can tweak this solution to suit your needs:

  • Create another dictionary from a that contains the same keys and the regexes built from the values
  • If a * char is found, replace it with \w* if you mean any zero or more word chars, or use \S* if you mean any zero or more non-whitespace chars (please adjust the def quote(self, char) method); otherwise, quote the char
  • Use unambiguous word boundaries, (?<!\w) and (?!\w), or remove them altogether if they interfere with matching non-word entries
  • The first regex here will look like (?<!\w)(?:cat|dog(?:\ and\ cat)?)(?!\w) (demo) and the second will look like (?<!\w)(?::\)|I've\ been|asp\w*)(?!\w) (demo)
  • Replace in a loop.
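
The two regex choices in the list above can be checked in isolation. The snippet below is only an illustrative sketch: the sample strings and the names word_b and lookaround are made up for the demonstration and are not part of the solution itself.

```python
import re

# \b needs a word character on one side of the position, so it fails around ":)"
word_b = re.compile(r"\b:\)\b")
# The lookarounds only require that no word character touches the match
lookaround = re.compile(r"(?<!\w):\)(?!\w)")

print(word_b.search("my cat :)"))      # None
print(lookaround.search("my cat :)"))  # matches ":)"
# The lookbehind still rejects ":)" glued to a word; this is the case
# where removing the boundaries may be necessary
print(lookaround.search("my cat:)"))   # None

# "*" translated to \w* stops at the first non-word character,
# while \S* runs up to the next whitespace
print(re.findall(r"(?<!\w)asp\w*", "aspire asp.net"))  # ['aspire', 'asp']
print(re.findall(r"(?<!\w)asp\S*", "aspire asp.net"))  # ['aspire', 'asp.net']
```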

See the Python demo:

import re

# Input
text = "I've been bad but I aspire to be a better person, and behave like my dog and cat :)"
a = {"animal": [ "dog", "cat", "dog and cat"], "XXX": ["I've been", "asp*", ":)"]}

class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""
    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            # Reuse the child node for this character, creating it if needed
            ref = ref.setdefault(char, {})
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        if char == '*':
            return r'\w*'
        else:
            return re.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                recurse = self._pattern(data[char])
                if recurse is None:
                    # Leaf node: nothing follows this character
                    cc.append(self.quote(char))
                else:
                    alt.append(self.quote(char) + recurse)
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())

# Creating patterns
a2 = {}
for k,v in a.items():
    trie = Trie()
    for w in v:
        trie.add(w)
    a2[k] = re.compile(fr"(?<!\w){trie.pattern()}(?!\w)", re.I)

for k,r in a2.items():
    text = r.sub(k, text)
    
print(text)
# => XXX bad but I XXX to be a better person, and behave like my animal XXX
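
If one pass per key becomes a bottleneck, a possible variation is to merge all keys into a single regex and pick the replacement from the capture group that matched. This is a rough sketch and not the approach shown above: it swaps the trie for a plain alternation sorted longest-first, and the helper names to_regex and replace_all are hypothetical. It also assumes no keyword is shared between keys.

```python
import re

a = {"animal": ["dog", "cat", "dog and cat"], "XXX": ["I've been", "asp*", ":)"]}

def to_regex(word):
    # A trailing "*" means "any word starting with this prefix"
    if word.endswith("*"):
        return re.escape(word[:-1]) + r"\w*"
    return re.escape(word)

keys = list(a)
groups = []
for key in keys:
    # Longest alternatives first, so "dog and cat" is tried before "dog"
    words = sorted(a[key], key=len, reverse=True)
    groups.append("(" + "|".join(to_regex(w) for w in words) + ")")

# One capture group per key, wrapped in the same boundaries as above
combined = re.compile(r"(?<!\w)(?:" + "|".join(groups) + r")(?!\w)", re.I)

def replace_all(text):
    # m.lastindex is the 1-based number of the capture group that matched
    return combined.sub(lambda m: keys[m.lastindex - 1], text)

print(replace_all("I've been bad but I aspire to be a better person, and behave like my dog and cat :)"))
# => XXX bad but I XXX to be a better person, and behave like my animal XXX
```

Whether this is actually faster than the per-key loop depends on the number of keys and keywords, so it is worth timing both on real data.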


 