Python Regex - Fast replace of multiple keywords with punctuation and starting with
This is an extension of this previous question.
I have a Python dictionary, built like this:
a = {"animal": [ "dog", "cat", "dog and cat"], "XXX": ["I've been", "asp*", ":)"]}
I want to find a solution that replaces, as fast as possible, all the words in the dictionary values with their keys. The solution should scale to large texts. If a word ends with an asterisk, every word in the text that starts with that prefix should be replaced.
So the following sentence, "I've been bad but I aspire to be a better person, and behave like my dog and cat:)", should be transformed into "XXX bad but I XXX to be a better person, and behave like my animal XXX".
I am trying to use trrex for this, thinking it should be the fastest option. Is it? However, I cannot get it to work. Moreover, I ran into problems:
Can you help me achieve my goal with a scalable solution?
You can tweak this solution to suit your needs:
- Create another dictionary from a that will contain the same keys and the regex created from the values
- If a * char is found, replace it with \w* if you mean any zero or more word chars, or use \S* if you mean any zero or more non-whitespace chars (please adjust the def quote(self, char) method), else, quote the char
- Use (?<!\w) and (?!\w) as word boundaries, or remove them altogether if they interfere with matching non-word entries
- The first regex will look like (?<!\w)(?:cat|dog(?:\ and\ cat)?)(?!\w) (demo) and the second will look like (?<!\w)(?::\)|I've\ been|asp\w*)(?!\w) (demo)
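To make the wildcard and boundary choices above concrete, here is a small sketch on made-up inputs. The sample strings and variable names are assumptions for illustration only; they are not part of the original demo.

```python
import re

# Hypothetical sample: \w* stops at the first non-word character,
# while \S* keeps going until the next whitespace.
sample = "they aspire-to-be asp's"
with_word_chars = re.sub(r"(?<!\w)asp\w*", "XXX", sample)
with_non_space = re.sub(r"(?<!\w)asp\S*", "XXX", sample)
print(with_word_chars)  # they XXX-to-be XXX's
print(with_non_space)   # they XXX XXX

# Lookaround boundaries can block non-word entries such as ":)"
# when they are glued to a word, as in "cat:)".
strict = re.compile(r"(?<!\w)" + re.escape(":)") + r"(?!\w)")
loose = re.compile(re.escape(":)"))
glued_strict = strict.sub("XXX", "cat:)")    # unchanged: ":" follows a word char
glued_loose = loose.sub("XXX", "cat:)")      # replaced, no boundary check
spaced_strict = strict.sub("XXX", "cat :)")  # replaced: ":" follows a space
print(glued_strict, glued_loose, spaced_strict)
```

Note that the sentence in the question has "cat:)" with no space, so with the boundaries in place the smiley is only replaced once a space (or another non-word char) precedes it.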
See the Python demo:
import re
# Input
text = "I've been bad but I aspire to be a better person, and behave like my dog and cat :)"
a = {"animal": [ "dog", "cat", "dog and cat"], "XXX": ["I've been", "asp*", ":)"]}
class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1  # empty key marks the end of a word

    def dump(self):
        return self.data

    def quote(self, char):
        # '*' is the wildcard: any zero or more word chars
        if char == '*':
            return r'\w*'
        else:
            return re.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except TypeError:
                    # recurse is None when char ends a word (str + None fails)
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())

# Creating patterns
a2 = {}
for k,v in a.items():
    trie = Trie()
    for w in v:
        trie.add(w)
    a2[k] = re.compile(fr"(?<!\w){trie.pattern()}(?!\w)", re.I)

# Replacing, one pass over the text per key
for k,r in a2.items():
    text = r.sub(k, text)

print(text)
# => XXX bad but I XXX to be a better person, and behave like my animal XXX
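The loop above scans the text once per key. If the dictionary has many keys, one possible variation is to merge everything into a single regex and pick the replacement from the group that matched. The sketch below is only an illustration under a few assumptions: it uses a plain alternation instead of the Trie to stay short, and the names to_regex, group_to_key, replace_all and the g0, g1 group names are hypothetical.

```python
import re

text = "I've been bad but I aspire to be a better person, and behave like my dog and cat :)"
a = {"animal": ["dog", "cat", "dog and cat"], "XXX": ["I've been", "asp*", ":)"]}

def to_regex(word):
    # A trailing '*' means "any word starting with this prefix"
    if word.endswith("*"):
        return re.escape(word[:-1]) + r"\w*"
    return re.escape(word)

group_to_key = {}  # generated group name -> dictionary key
parts = []
for i, (key, words) in enumerate(a.items()):
    name = f"g{i}"  # keys may not be valid group names, so generate one
    group_to_key[name] = key
    # Longest entries first so "dog and cat" is tried before "dog"
    ordered = sorted(words, key=len, reverse=True)
    parts.append(f"(?P<{name}>" + "|".join(to_regex(w) for w in ordered) + ")")

combined = re.compile(r"(?<!\w)(?:" + "|".join(parts) + r")(?!\w)", re.I)

def replace_all(s):
    # m.lastgroup is the name of the group that matched
    return combined.sub(lambda m: group_to_key[m.lastgroup], s)

result = replace_all(text)
print(result)
# => XXX bad but I XXX to be a better person, and behave like my animal XXX
```

With a single pass, already replaced text is never rescanned, and when entries from different keys could match at the same position, the key that comes first in the dictionary wins. Whether this is actually faster than the per-key loop depends on the data, so it is worth timing both on a realistic text.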