Efficient way to match and replace multiple strings in Python 3?

Question

I have multiple (>30) compiled regexes:

regex_1 = re.compile(...)
regex_2 = re.compile(...)
#... define multiple regex's
regex_n = re.compile(...)

I then have a function which takes a text and replaces some of its words, using every one of the regexes above with re.sub, as follows:

def sub_func(text):
    text = re.sub(regex_1, "string_1", text)
    # multiple substitutions using all regexes ...
    text = re.sub(regex_n, "string_n", text)

    return text

Question: Is there a more efficient way to make these replacements?

The regexes cannot be generalized or simplified from their current form.

I feel like reassigning the value of text for every regex is quite slow, given that each reassignment only replaces a word or two out of the entire text. Also, I have to do this for multiple documents, which slows things down even more.

Thanks in advance!

Answer 1

Reassigning a value takes constant time in Python. Unlike in languages like C, variables are more of a "name tag", so changing what the name tag points to takes very little time.

If the replacements are constant strings, I would collect the pairs into a tuple:

regexes = (
    (regex_1, 'string_1'),
    (regex_2, 'string_2'),
    (regex_3, 'string_3'),
    ...
)

And then in your function, just iterate over the tuple:

def sub_func_2(text):
    for regex, sub in regexes:
        text = re.sub(regex, sub, text)
    return text

But if your regexes are actually named regex_1, regex_2, etc., they should probably be defined directly in a list of some sort.
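A minimal sketch of defining the pairs directly in a list rather than as numbered variables. The patterns and replacement strings below are made up for illustration; they are not the asker's regexes.

```python
import re

# Hypothetical (compiled pattern, replacement) pairs; substitute your own.
REPLACEMENTS = [
    (re.compile(r"\bcolour\b"), "color"),
    (re.compile(r"\bcentre\b"), "center"),
    (re.compile(r"\s+"), " "),
]

def sub_all(text):
    # Apply each substitution in order; later patterns see earlier results.
    for pattern, replacement in REPLACEMENTS:
        text = pattern.sub(replacement, text)
    return text

print(sub_all("The  colour of the   centre"))  # The color of the center
```

Calling pattern.sub on the compiled object is equivalent to re.sub(pattern, ...) here; it just keeps each pattern next to its replacement.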

Also note, if you are doing literal replacements like 'cat' -> 'dog', the str.replace() method might be easier (text = text.replace('cat', 'dog')), and it will probably be faster.
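As a small illustration of the difference, str.replace treats its argument literally, while the same string used as a regex may match more than intended. The sample text is hypothetical.

```python
import re

text = "v1.0 and v1x0"

# str.replace treats "1.0" as a literal substring.
literal = text.replace("1.0", "2.0")

# As a regex, the unescaped "." matches any character, so "1x0" matches too.
as_regex = re.sub("1.0", "2.0", text)

# re.escape makes the regex behave like the literal replacement.
escaped = re.sub(re.escape("1.0"), "2.0", text)

print(literal)   # v2.0 and v1x0
print(as_regex)  # v2.0 and v2.0
print(escaped)   # v2.0 and v1x0
```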

If your strings are very long, rebuilding them from scratch once per regex might take a long time. An implementation of @Oliver Charlesworth's method, which was mentioned in the comments, could be:

# Instead of this:
regexes = (
    ('1(1)', r'\1i'),
    ('2(2)(2)', r'\1a\2'),
    ('(3)(3)3', r'\1a\2')
)


# Merge the regexes:
import re

regex = re.compile('(1(1))|(2(2)(2))|((3)(3)3)')
substitutions = (
    '{1}i', '{1}a{2}', '{1}a{2}'
)

# Keep track of how many inner groups are in each alternative
# (not counting the wrapper group added around it)
group_nos = (1, 2, 2)

# Group number of each alternative's wrapper group, plus a final end marker
starts = [1]
for n in group_nos:
    starts.append(starts[-1] + n + 1)

# (template, first group, one past the last group) for each alternative
cumulative = tuple(zip(substitutions, starts, starts[1:]))

def _sub_func(match):
    for sub, start, end in cumulative:
        if match.group(start) is not None:
            # {0} is the whole alternative; {1}, {2}, ... are its inner groups
            return sub.format(*map(match.group, range(start, end)))

def sub_func(text):
    return regex.sub(_sub_func, text)

But this breaks down if you have overlapping text that you need to substitute.
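One way the merged version can differ from the loop, sketched with a made-up pair of replacements: in a single pass each piece of the input is consumed by at most one alternative, whereas sequential passes let a later regex match text that an earlier one produced. This is an assumption about what "overlapping" covers, not the only case.

```python
import re

text = "cat dog"

# Sequential passes: the second pattern also matches the output of the first.
sequential = re.sub("dog", "bird", re.sub("cat", "dog", text))

# Single pass with a merged pattern: each piece of the input is replaced once.
table = {"cat": "dog", "dog": "bird"}  # hypothetical replacement table
merged = re.compile("cat|dog")
single_pass = merged.sub(lambda m: table[m.group(0)], text)

print(sequential)   # bird bird
print(single_pass)  # dog bird
```

If the original loop relies on one substitution feeding the next, the merged regex is not a drop-in replacement.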

Answer 2

We can pass a function as the repl argument of re.sub.

Simplified to 3 regexes for easier understanding.

Assume regex_1, regex_2 and regex_3 match 111, 222 and 333 respectively. Then regex_replace is the list holding the replacement strings, in the same order as regex_1, regex_2 and regex_3:

regex_1 will be replaced with 'one'
regex_2 will be replaced with 'two', and so on

Not sure how much this will improve the runtime though; give it a try.

import re
regex_x = re.compile('(111)|(222)|(333)')
regex_replace = ['one', 'two', 'three']

def sub_func(text):
    return re.sub(regex_x, lambda x:regex_replace[x.lastindex-1], text)

>>> sub_func('testing 111 222 333')
'testing one two three'
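The lastindex - 1 lookup above relies on each alternative being exactly one group. A possible variation, sketched here with hypothetical patterns, wraps each alternative in a named group and looks up the replacement by match.lastgroup, so an alternative may contain its own inner groups. The rule list and the r0, r1, ... group names are assumptions for illustration.

```python
import re

# Hypothetical (pattern, replacement) rules, for illustration only.
RULES = [
    (r"(\d{3})-(\d{4})", "<phone>"),   # contains inner capturing groups
    (r"[\w.]+@[\w.]+", "<email>"),
    (r"https?://\S+", "<url>"),
]

# Wrap each pattern in a named group r0, r1, ... and join them with "|".
merged = re.compile(
    "|".join(f"(?P<r{i}>{pattern})" for i, (pattern, _) in enumerate(RULES))
)
by_name = {f"r{i}": repl for i, (_, repl) in enumerate(RULES)}

def sub_func(text):
    # lastgroup is the name of the wrapper group that matched,
    # because the wrapper closes after any groups nested inside it.
    return merged.sub(lambda m: by_name[m.lastgroup], text)

print(sub_func("call 555-1234 or mail bob@example.com, see http://example.com/x"))
# call <phone> or mail <email>, see <url>
```

Alternatives are tried left to right at each position, so the order of the rules still matters when two patterns can match at the same place.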
