Improving the speed of regular expression operations in Python

Question

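I have a Python script that runs over 1M lines of varying length. The script is very slow: in the past 12 hours it has processed only a little over 30,000 lines. Splitting the file is not an option, because the file has already been split. My code looks like this: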

import re
import sys

regex1 = re.compile(r"(\{\{.*?\}\})", flags=re.IGNORECASE)
regex2 = re.compile(r"(<ref.*?</ref>)", flags=re.IGNORECASE)
regex3 = re.compile(r"(<ref.*?\/>)", flags=re.IGNORECASE)
regex4 = re.compile(r"(==External links==.*?)", flags=re.IGNORECASE)
regex5 = re.compile(r"(<!--.*?-->)", flags=re.IGNORECASE)
regex6 = re.compile(r"(File:[^ ]*? )", flags=re.IGNORECASE)
regex7 = re.compile(r" [0-9]+ ", flags=re.IGNORECASE)
regex8 = re.compile(r"(\[\[File:.*?\]\])", flags=re.IGNORECASE)
regex9 = re.compile(r"(\[\[.*?\.JPG.*?\]\])", flags=re.IGNORECASE)
regex10 = re.compile(r"(\[\[Image:.*?\]\])", flags=re.IGNORECASE)
regex11 = re.compile(r"^[^_].*(\) )", flags=re.IGNORECASE)

fout = open(sys.argv[2],'a+')

with open(sys.argv[1]) as f:
    for line in f:
        parts=line.split("\t")
        label=parts[0].replace(" ","_").lower()
        line=parts[1].lower()
        try:
            line = regex1.sub("",line )
        except:
            pass
        try:
            line = regex2.sub("",line )
        except:
            pass
        try:
            line = regex3.sub("",line )
        except:
            pass
        try:
            line = regex4.sub("",line )
        except:
            pass
        try:
            line = regex5.sub("",line )
        except:
            pass
        try:
            line = regex6.sub("",line )
        except:
            pass
        try:
            line = regex8.sub("",line )
        except:
            pass
        try:
            line = regex9.sub("",line )
        except:
            pass
        try:
            line = regex10.sub("",line )
        except:
            pass

        try:     
            for match in re.finditer(r"(\[\[.*?\]\])", line):
                replacement_list=match.group(0).replace("[","").replace("]","").split("|")
                replacement_list = [w.replace(" ","_") for w in replacement_list]
                replacement_for_links=' '.join(replacement_list)
                line = line.replace(match.group(0),replacement_for_links)
        except:
            pass
        try:
            line = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', '', line, flags=re.MULTILINE)  
        except:
            pass    
        try:
            # Python 3 form; the Python 2 original was line.translate(None, chars)
            line = line.translate(str.maketrans('', '', '!"#$%&\'*+,./:;<=>?@[\\]^`{|}~'))
        except:
            pass
        try:
            line = line.replace(' (',' ')   
            line=' '.join([word.rstrip(")") if not '(' in word else word for word in line.split(" ")])
            line = re.sub(r' isbn [\w-]+ ', ' ', line)
            line = re.sub(r' [p]+ [\w-]+ ', ' ', line)
            line = re.sub(r' \d+ ', ' ', line)
            line = re.sub(r"^\d+\s|\s\d+\s|\s\d+$", " ", line)
            line = re.sub(r'\s+', ' ', line).strip()
            line = re.sub(r' isbn [\w-]+ ', ' ', line)
        except:
            pass    
        out_string=label+"\t"+line
        fout.write(out_string)
        fout.write("\n")

fout.close()

Are there any changes I can make to the current version?

Update 1: After profiling as suggested by @fearless_fool, I realized that regex3, regex9, and the HTTP (URL) removal are the least efficient.

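Update 2: Interestingly, using .* in a regex pattern increases the number of steps a lot. I tried replacing it with [^X]*, where X is a character I know never occurs in the text being matched. For 1000 long lines, this is about a 20x improvement. For example, regex1 is now regex1 = re.compile(r"(\{\{[^\}]*?\}\})", flags=re.IGNORECASE) ... What I don't know is how to negate a two-character sequence. For example, I would like to change (\{\{[^\}]*?\}\}) to (\{\{[^\}\}]*?\}\}), but I know that is wrong, because everything inside [] is treated as a separate character.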
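
On the two-character case: a character class can only exclude single characters, so [^\}\}] means the same as [^\}]. One common way to say "anything up to the sequence }}" is to accept either a non-brace character or a } that is not followed by another }, using a negative lookahead. The following is a minimal sketch under that assumption, not the asker's code; the names `strict` and `tempered` and the sample string are made up for illustration.

```python
import re

# [^}]* stops at the first '}', so it cannot cross a lone '}' inside a template.
strict = re.compile(r"\{\{[^}]*\}\}")

# Each step consumes either a non-'}' character,
# or a '}' that is NOT followed by another '}'.
tempered = re.compile(r"\{\{(?:[^}]|\}(?!\}))*\}\}")

# Made-up input: the first template contains a single '}' in its body.
sample = "a {{x}y}} b {{z}} c"

print(repr(strict.sub("", sample)))    # leaves "{{x}y}}" in place
print(repr(tempered.sub("", sample)))  # removes both templates
```

The tempered form keeps the property that made the single-character class attractive: it can never run past the first }}, so a missing terminator does not turn into a scan of the rest of the line from every starting point.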

Answer 1

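(Promoting a comment to an answer): I suggest using the neat and handy Regex 101 tool to analyze each of your regexes and see which of them take excessive time.

While you're at it, you can post complete examples on the site so that others can see what you use as typical input. (I see you've already done that. Great!)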
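
As a local complement to an online analyzer, each pattern can also be timed against a representative line in Python itself. This is only a rough sketch: the helper `time_patterns`, the choice of three patterns, and the sample line are assumptions for illustration, and real measurements should use lines from the actual input file.

```python
import re
import time

# A subset of the question's patterns, keyed by name.
patterns = {
    "regex1": re.compile(r"(\{\{.*?\}\})", flags=re.IGNORECASE),
    "regex3": re.compile(r"(<ref.*?\/>)", flags=re.IGNORECASE),
    "regex9": re.compile(r"(\[\[.*?\.JPG.*?\]\])", flags=re.IGNORECASE),
}


def time_patterns(patterns, line, repeat=5):
    """Return {name: best wall-clock seconds for one sub() call on line}."""
    timings = {}
    for name, pattern in patterns.items():
        best = float("inf")
        for _ in range(repeat):
            start = time.perf_counter()
            pattern.sub("", line)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return timings


# Made-up sample line standing in for one long input line.
sample_line = "text [[link]] <ref name=a>cite</ref> {{tpl}} " * 200

# Print the slowest pattern first.
results = time_patterns(patterns, sample_line)
for name, seconds in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {seconds:.6f} s")
```

Taking the best of several runs reduces noise from other processes; for whole-script profiling, the standard library's cProfile module is another option.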

Answer 2

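After using the useful regex tool recommended by @fearless_fool, I sped things up significantly by replacing .* with a stricter pattern that covers the same text, for example [^\]]*. Making this change throughout the script improved performance considerably.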
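
To illustrate the kind of change described here, the sketch below compares regex3 from the question with a stricter variant. The stricter pattern `<ref[^>]*\/>` is an assumption chosen by analogy with `[^\]]*`, not the pattern the author actually ended up with, and the sample lines are made up.

```python
import re
import timeit

# regex3 from the question, and a hypothetical stricter variant whose
# middle part cannot run past the end of the tag.
lazy = re.compile(r"(<ref.*?\/>)", flags=re.IGNORECASE)
strict = re.compile(r"(<ref[^>]*\/>)", flags=re.IGNORECASE)

# On a simple self-closing tag both patterns remove the same text.
short = "a <ref name=x/> b"
assert lazy.sub("", short) == strict.sub("", short) == "a  b"

# Made-up worst case: many "<ref ...>" openings and no "/>" anywhere, so each
# attempt of the lazy pattern scans to the end of the line before failing,
# while the strict pattern gives up at the first ">".
long_line = "<ref name=a>cite</ref> text " * 300

for name, pattern in (("lazy", lazy), ("strict", strict)):
    seconds = timeit.timeit(lambda: pattern.sub("", long_line), number=5)
    print(f"{name}: {seconds:.4f} s")
```

The two patterns are not interchangeable on every input: the lazy version will match across a `>` to reach a later `/>`, while the strict one will not, so results should be compared on real data before swapping patterns.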

2 answers

Answer 1
1 2015-12-31 19:44:52

Answer 2
0 Accepted 2016-01-04 21:59:20