
How to optimize string substitution / replacement in python?

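Let me start by saying that this is not a duplicate of the questions about replacing letters in a string. This is about replacing whole substrings.

I have a set of documents in which several different substrings need to be replaced, either with an empty string or with some other value. What is the fastest way to do this? Is there anything faster than using regular expressions?

There is a significant difference when the replacement is done word by word or character by character, but that will not work in this case unless some form of string matching is used.

Here are my attempts so far.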

import re 

def standard_for_loop(string, replacements):
    # case sensitive for loop
    for key, value in replacements.items():
        # would not work unless case specific
        string = string.replace(key, value)
        
    return string 


def regex_loop(string, replacements):
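    # case insensitive regex substitution in a for loop
    for key, value in replacements.items():
        # flags must be passed by keyword: the fourth positional
        # argument of re.sub is count, not flags
        string = re.sub(key, value, string, flags=re.IGNORECASE)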
        
    return string
    

def regex_multiple(string, replacements):
    # case insensitive regex substitution using lambda 
    pattern = re.compile("({})".format("|".join(replacements.keys())), re.IGNORECASE)
    return pattern.sub(lambda m: replacements[m.string.lower()[m.start():m.end()]], string)
    

    
def case_insensitive_for_loop(string, replacements):
    def find_next(string, pattern, sub):
        if pattern.lower() in string.lower():
            
            match = string.lower().index(pattern.lower())
            end = int(match + len(pattern))
            
            new_string = string[end:]
            
            # yield a replaced substring of original string
            yield string[:match] + sub
            yield from find_next(new_string, pattern, sub)
            
    '''
    # this is what I'm unsure about. How to negate need for 
    # for loop here and how to fix the append issue.
    # Currently the functionality works but it appends output 
    # replacement to the result. I know the "+=" is the 
    # cause of the problem, but I'm not sure how to fix this. 
    '''
    result = ''
    for k, v in replacements.items():
        for output in find_next(string, k, v):
            result += output
    return result

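There are two problems. In my experience, regex_multiple is the most accurate, but it takes a very long time to finish. The next most accurate is case_insensitive_for_loop, but I don't know how to get it to replace the text instead of appending to it.

For example, it would replace the document: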
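One possible way around the append problem, as a minimal sketch rather than a definitive fix: the generator in case_insensitive_for_loop never yields the text after the last match, and `result +=` concatenates the output for every key instead of feeding one replacement into the next. The helper names `case_insensitive_replace` and `case_insensitive_for_loop_fixed` below are hypothetical, and the sketch assumes that `str.lower()` does not change the length of the text (true for ASCII, not for every Unicode character).

```python
def case_insensitive_replace(text, pattern, sub):
    """Replace every case-insensitive occurrence of pattern in text with sub."""
    if not pattern:
        return text  # an empty pattern would match everywhere; leave text as-is
    lowered = text.lower()   # assumption: lower() keeps the length unchanged
    needle = pattern.lower()
    pieces = []
    start = 0
    while True:
        match = lowered.find(needle, start)
        if match == -1:
            break
        pieces.append(text[start:match])  # untouched text before the match
        pieces.append(sub)                # the replacement itself
        start = match + len(needle)
    pieces.append(text[start:])  # keep the tail after the last match
    return "".join(pieces)


def case_insensitive_for_loop_fixed(text, replacements):
    # Feed the output of one replacement into the next instead of appending.
    for key, value in replacements.items():
        text = case_insensitive_replace(text, key, value)
    return text
```

Collecting the pieces in a list and joining once avoids repeated string concatenation, and lowercasing the text once per key avoids the repeated `string.lower()` calls of the recursive version.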

# for a sample document 

doc = """Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus accumsan pulvinar massa ut pulvinar. Cras blandit quam non dictum tempus. Maecenas id posuere nibh. Nullam sit amet pharetra neque. Etiam nec imperdiet tellus. Nulla facilisi. Proin sit amet massa aliquam, pulvinar justo in, suscipit purus. Fusce in tempus orci. In consectetur, ipsum nec volutpat dapibus, felis magna scelerisque enim, et rutrum nunc ligula eget augue. Phasellus aliquam feugiat venenatis. Sed lobortis pharetra ipsum ut venenatis.

Nullam ut accumsan orci. Vivamus faucibus augue in facilisis facilisis. Donec ut scelerisque ipsum. Ut mollis elit nibh, ut vulputate eros ultrices ac. Nunc ac urna sed libero imperdiet maximus non sed dui. Morbi ornare eu eros eget pharetra. Vivamus vestibulum nisi eu eros pulvinar aliquet.

Maecenas at justo bibendum, viverra urna nec, pellentesque orci. Cras ut molestie sem. Proin in tincidunt ex. Aliquam euismod id ligula a bibendum. Morbi at diam euismod, auctor ex non, venenatis ante. Proin convallis ex eu semper posuere. Etiam sed tincidunt massa. Vivamus aliquam mollis massa, nec lacinia est dictum vitae. In varius convallis pulvinar. Pellentesque aliquet pulvinar nibh vel dictum"""

#replacement strings where k is the substring to be searched and v is the value to be replaced with
repl = { 'venenatis ante':'', 
'ipsum nec volutpat dapibus' : '',
'ipsum vulputate accumsan' : '',
'dolor sit amet':'', 
'vivamus aliquam mollis massa':''
}

with:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus accumsan pulvinar massa ut pulvinar. Cras blandit quam non dictum tempus. Maecenas id posuere nibh. Nullam sit amet pharetra neque. Etiam nec imperdiet tellus. Nulla facilisi. Proin sit amet massa aliquam, pulvinar justo in, suscipit purus. Fusce in tempus orci. In consectetur, ipsum nec volutpat dapibus, felis magna scelerisque enim, et rutrum nunc ligula eget augue. Phasellus aliquam feugiat venenatis. Sed lobortis pharetra ipsum ut venenatis.

Nullam ut accumsan orci. Vivamus faucibus augue in facilisis facilisis. Donec ut scelerisque ipsum. Ut mollis elit nibh, ut vulputate eros ultrices ac. Nunc ac urna sed libero imperdiet maximus non sed dui. Morbi ornare eu eros eget pharetra. Vivamus vestibulum nisi eu eros pulvinar aliquet.

Maecenas at justo bibendum, viverra urna nec, pellentesque orci. Cras ut molestie sem. Proin in tincidunt ex. Aliquam euismod id ligula a bibendum. Morbi at diam euismod, auctor ex non, Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus accumsan pulvinar massa ut pulvinar. Cras blandit quam non dictum tempus. Maecenas id posuere nibh. Nullam sit amet pharetra neque. Etiam nec imperdiet tellus. Nulla facilisi. Proin sit amet massa aliquam, pulvinar justo in, suscipit purus. Fusce in tempus orci. In consectetur, Lorem ipsum Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus accumsan pulvinar massa ut pulvinar. Cras blandit quam non dictum tempus. Maecenas id posuere nibh. Nullam sit amet pharetra neque. Etiam nec imperdiet tellus. Nulla facilisi. Proin sit amet massa aliquam, pulvinar justo in, suscipit purus. Fusce in tempus orci. In consectetur, ipsum nec volutpat dapibus, felis magna scelerisque enim, et rutrum nunc ligula eget augue. Phasellus aliquam feugiat venenatis. Sed lobortis pharetra ipsum ut venenatis.

Nullam ut accumsan orci. Vivamus faucibus augue in facilisis facilisis. Donec ut scelerisque ipsum. Ut mollis elit nibh, ut vulputate eros ultrices ac. Nunc ac urna sed libero imperdiet maximus non sed dui. Morbi ornare eu eros eget pharetra. Vivamus vestibulum nisi eu eros pulvinar aliquet.

Maecenas at justo bibendum, viverra urna nec, pellentesque orci. Cras ut molestie sem. Proin in tincidunt ex. Aliquam euismod id ligula a bibendum. Morbi at diam euismod, auctor ex non, venenatis ante. Proin convallis ex eu semper posuere. Etiam sed tincidunt massa.

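After benchmarking them, standard_for_loop is the fastest at 4 µs per loop over 50k loops. The second fastest is regex_loop at 14 µs per loop over 20k loops, followed by case_insensitive_for_loop, which takes 28.3 µs per loop over 10k loops. Surprisingly, the regular expression in regex_multiple takes the longest, at 103 µs per loop over 2k loops.

This is the python timeit output for each function.

I am wondering whether there is any string-matching algorithm I have overlooked that would solve this. Any suggestions are welcome.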

regex_multiple is inefficient because the entire string is lowercased again on every match. You can lowercase only the matched text instead. Here is how:

def regex_multiple(string, replacements):
    # case insensitive regex substitution using lambda 
    pattern = re.compile("({})".format("|".join(replacements.keys())), re.IGNORECASE)
    return pattern.sub(lambda m: replacements[m[0].lower()], string)

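Compared with the other case-insensitive implementations, this solution should be faster, and it is much faster than the original on large documents.

Note, however, that you are comparing case-sensitive and case-insensitive methods. Case-insensitive replacement is more computationally expensive and therefore slower. To be fair, you should compare methods that do the same thing.

Finally, if you are processing ASCII documents, you can add the re.ASCII flag to the regular expression. This makes parsing a bit faster.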
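A variant of the function above, offered as a hedged sketch: it also escapes the keys with `re.escape`, so that keys containing regex metacharacters are matched literally, and lowercases the dictionary keys so the `m[0].lower()` lookup cannot miss because of key casing. The name `regex_multiple_escaped` is hypothetical, and the sketch assumes ASCII-like text where every case-insensitive match lowercases to one of the keys.

```python
import re


def regex_multiple_escaped(string, replacements):
    # Normalise keys to lower case so the lookup matches m[0].lower().
    lowered = {key.lower(): value for key, value in replacements.items()}
    # Try longer keys first so a long key wins over a key that is its prefix.
    keys = sorted(lowered, key=len, reverse=True)
    # re.escape makes characters such as "." or "$" match literally.
    pattern = re.compile("|".join(re.escape(k) for k in keys), re.IGNORECASE)
    return pattern.sub(lambda m: lowered[m[0].lower()], string)
```

For example, `regex_multiple_escaped("Dolor Sit Amet!", {"dolor sit amet": "X"})` returns `"X!"`.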
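To make the comparison fair, one option is to time only the case-insensitive variants against each other and to check first that they return the same result. The following is a sketch under assumptions: the functions are simplified restatements of regex_loop and regex_multiple with `re.escape` added, the sample text and repetition counts are invented for illustration, and all keys are lower case. Absolute timings depend on the machine, so none are quoted here.

```python
import re
import timeit


def regex_loop(string, replacements):
    # One case-insensitive pass over the text per key.
    for key, value in replacements.items():
        string = re.sub(re.escape(key), value, string, flags=re.IGNORECASE)
    return string


def regex_multiple(string, replacements):
    # A single pass with one alternation; keys are assumed to be lower case.
    pattern = re.compile("|".join(map(re.escape, replacements)), re.IGNORECASE)
    return pattern.sub(lambda m: replacements[m[0].lower()], string)


# Hypothetical sample data, not the document from the question.
sample = "Lorem ipsum DOLOR sit amet, venenatis ante. " * 200
repl = {"dolor sit amet": "", "venenatis ante": ""}

# Only compare timings of functions that produce the same output.
assert regex_loop(sample, repl) == regex_multiple(sample, repl)

runs = 200
for func in (regex_loop, regex_multiple):
    seconds = timeit.timeit(lambda: func(sample, repl), number=runs)
    print(f"{func.__name__}: {seconds / runs * 1e6:.1f} us per call")
```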
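A sketch of how the flag could be combined with re.IGNORECASE. The function name `regex_multiple_ascii` is hypothetical, and `re.escape` is an addition not present in the code above. With re.ASCII, case-insensitive matching is restricted to ASCII letters, so this assumes that both the documents and the lower-case keys are plain ASCII.

```python
import re


def regex_multiple_ascii(string, replacements):
    # Combine flags with "|"; re.ASCII limits case folding to a-z / A-Z.
    pattern = re.compile(
        "|".join(re.escape(key) for key in replacements),
        re.IGNORECASE | re.ASCII,
    )
    # Keys are assumed to be lower case so the lookup below succeeds.
    return pattern.sub(lambda m: replacements[m[0].lower()], string)
```

Whether the speed-up is noticeable depends on the documents, so it is worth measuring with timeit before relying on it.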
