
Speeding up regular expressions in Python

I need to quickly extract text from HTML files. I am using the following regular expressions instead of a full-fledged parser since I need to be fast rather than accurate (I have more than a terabyte of text). The profiler shows that most of the time in my script is spent in the re.sub procedure. What are good ways of speeding up my process? I can implement some portions in C, but I wonder whether that will help, given that the time is spent inside re.sub, which I think would be efficiently implemented.

# Remove scripts, styles, tags, entities, and extraneous spaces:
scriptRx    = re.compile(r"<script.*?/script>", re.I)
styleRx     = re.compile(r"<style.*?/style>", re.I)
tagsRx      = re.compile(r"<[!/]?[a-zA-Z-]+[^<>]*>")
entitiesRx  = re.compile(r"&[0-9a-zA-Z]+;")
spacesRx    = re.compile(r"\s{2,}")
....
text = scriptRx.sub(" ", text)
text = styleRx.sub(" ", text)
....

Thanks!

First, use an HTML parser built for this, like BeautifulSoup:

http://www.crummy.com/software/BeautifulSoup/
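For illustration, a minimal sketch of that approach (assuming the bs4 package is installed; the file name page.html is just a placeholder):

from bs4 import BeautifulSoup

# Parse one document, drop <script>/<style> subtrees, and keep the visible text.
with open('page.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

for element in soup(['script', 'style']):
    element.decompose()          # remove the element and everything inside it

text = ' '.join(soup.get_text().split())   # collapse runs of whitespace
print(text)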

Then you can identify any remaining slow spots with the profiler:

http://docs.python.org/library/profile.html
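For example, running the script under the standard cProfile module from the command line shows where the time goes (clean.py stands in for your own script):

$ python -m cProfile -s cumulative clean.py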

And for learning about regular expressions, I've found Mastering Regular Expressions very valuable, no matter what the programming language:

http://oreilly.com/catalog/9781565922570

Also:

How can I debug a regular expression in Python?

Given the clarified use-case, I would say the above is not what you want for this request. My alternate recommendation would be: Speeding up regular expressions in Python

You're processing each file five times, so the first thing you should do (as Paul Sanwald said) is try to reduce that number by combining your regexes together. I would also avoid using reluctant quantifiers, which are designed for convenience at the expense of efficiency. Consider this regex:

<script.*?</script>

Each time the . goes to consume another character, it first has to make sure </script> won't match at that spot. It's almost like doing a negative lookahead at every position:

<script(?:(?!</script>).)*</script>

But we know there's no point doing the lookahead if the next character is anything but <, and we can tailor the regex accordingly:

<script[^<]*(?:<(?!/script>)[^<]*)*</script>

When I test them in RegexBuddy with this target string:

<script type="text/javascript">var imagePath='http://sstatic.net/stackoverflow/img/';</script>

...the reluctant regex takes 173 steps to make the match, while the tailored regex takes only 28.
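The same difference is easy to check in Python with timeit; this is only a rough sketch, and the sample string and repeat count are arbitrary:

import re
import timeit

sample = ('<script type="text/javascript">'
          "var imagePath='http://sstatic.net/stackoverflow/img/';</script>") * 100

reluctant = re.compile(r"<script.*?</script>", re.I)
unrolled  = re.compile(r"<script[^<]*(?:<(?!/script>)[^<]*)*</script>", re.I)

for name, rx in (('reluctant', reluctant), ('unrolled', unrolled)):
    t = timeit.timeit(lambda: rx.sub(' ', sample), number=1000)
    print(name, round(t, 4))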

Combining your first three regexes into one yields this beast:

<(?:(script|style)[^<]*(?:<(?!/\1)[^<]*)*</\1>|[!/]?[a-zA-Z-]+[^<>]*>)

You might want to zap the <HEAD> element while you're at it (i.e., (script|style|head)).

I don't know what you're doing with the fourth regex, for character entities--are you just deleting those, too? I'm guessing the fifth regex has to be run separately, since some of the whitespace it's cleaning up is generated by the earlier steps. But try it with the first three regexes combined and see how much difference it makes. That should tell you if it's worth going forward with this approach.
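For what it's worth, applying the combined pattern in Python might look like the sketch below; the re.I flag is an assumption carried over from the original scriptRx/styleRx, and the entity and whitespace regexes are kept as separate passes as discussed:

import re

combinedRx = re.compile(
    r"<(?:(script|style)[^<]*(?:<(?!/\1)[^<]*)*</\1>|[!/]?[a-zA-Z-]+[^<>]*>)",
    re.I)
entitiesRx = re.compile(r"&[0-9a-zA-Z]+;")
spacesRx   = re.compile(r"\s{2,}")

def clean(text):
    text = combinedRx.sub(" ", text)     # scripts, styles, and tags in one pass
    text = entitiesRx.sub(" ", text)     # character entities
    return spacesRx.sub(" ", text)       # collapse leftover whitespace

print(clean('<p>Hello&amp;goodbye</p> <script>var x=1;</script>'))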

One thing you can do is combine the script/style regexes using backreferences. Here's some sample data:

$ cat sample 
<script>some stuff</script>
<html>whatever </html>
<style>some other stuff</style>

Using perl:

perl -ne 'if (/<(script|style)>.*?<\/\1>/) { print $1; }' sample

It will match either script or style. I second the recommendation for "Mastering Regular Expressions"; it's an excellent book.
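The same backreference works in Python's re module; a small sketch against the sample data above:

import re

sample = ("<script>some stuff</script>\n"
          "<html>whatever </html>\n"
          "<style>some other stuff</style>\n")

# \1 matches whatever tag name group 1 captured, so one pattern
# covers both script and style blocks.
pairRx = re.compile(r"<(script|style)>.*?</\1>", re.S)
print(pairRx.sub(" ", sample))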

The suggestion to use an HTML parser is a good one, since it'll quite possibly be faster than regular expressions. But I'm not sure BeautifulSoup is the right tool for the job, since it constructs a parse tree from the entire file and stores the whole thing in memory. For a terabyte of HTML, you'd need an obscene amount of RAM to do that ;-) I'd suggest you look at HTMLParser, which is written at a lower level than BeautifulSoup, but I believe it's a stream parser, so it will only load a bit of the text at a time.
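A rough sketch of that streaming approach with the standard-library parser (the module is html.parser in Python 3; the chunk size is arbitrary):

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text that is outside <script> and <style> elements."""
    def __init__(self):
        super().__init__()
        self.skip = 0            # depth inside script/style
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.chunks.append(data)

parser = TextExtractor()
with open('giant.html') as f:
    for chunk in iter(lambda: f.read(64 * 1024), ''):   # feed 64 KB at a time
        parser.feed(chunk)
parser.close()
text = ' '.join(' '.join(parser.chunks).split())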

If your use-case is indeed to parse a few things for each of millions of documents, then my above answer won't help. I recommend some heuristics, like doing a couple of "straight text" regexes on them to begin with - like just plain /script/ and /style/ to throw things out quickly if you can. In fact, do you really need to do the end-tag check at all? Isn't <style good enough? Leave validation for someone else. If the quick ones succeed, then put the rest into a single regex, like /<script|<style|\s{2,}|etc.../ so that it doesn't have to go through all that text once for each regex.
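A rough sketch of that heuristic; the alternation list is only illustrative, and end-tag validation is skipped as suggested:

import re

# Everything folded into one alternation so each document is scanned once,
# rather than once per regex; no end-tag checking.
onePassRx = re.compile(r"<script|<style|<[!/]?[a-zA-Z-]+[^<>]*>|&[0-9a-zA-Z]+;|\s{2,}", re.I)

def clean_document(text):
    # Cheap "straight text" checks first: plain documents skip the regex entirely.
    if '<' not in text and '&' not in text:
        return text
    return onePassRx.sub(' ', text)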

I would use a simple program with Python's regular str.partition, something like this, but it has only been tested with one example file containing a style tag:

## simple filtering when the discarded tags are not nested inside other discarded tags

start_tags=('<style','<script')
end_tags=('</style>','</script>')

##print("input:\n %s" % open('giant.html').read())
out=open('cleaned.html','w')
end_tag=''   # non-empty while we are inside a <style> or <script> element

for line in open('giant.html'):
    line=' '.join(line.split())          # collapse runs of whitespace
    if end_tag:
        if end_tag in line:
            # closing tag found: keep whatever follows it on this line
            _,tag,end = line.partition(end_tag)
            if end.strip():
                out.write(end+'\n')
            end_tag=''
        continue ## discard rest of line while still inside the element

    # indices of the start tags that appear on this line
    found=[index for index, start_tag in enumerate(start_tags)
           if start_tag in line]
    for index in found:
        start,tag,end = line.partition(start_tags[index])
        if start.strip():
            out.write(start+'\n')        # keep text preceding the start tag
        # drop until closing angle bracket of start tag
        tag,_ ,end = end.partition('>')
        # check if closing tag already in same line
        if end_tags[index] in end:
            _,tag,end = end.partition(end_tags[index])
            if end.strip():
                out.write(end+'\n')
            end_tag = '' # end tag reset after found
        else:
            end_tag=end_tags[index]      # keep discarding following lines
    if not found:
        out.write(line+'\n')             # no style/script tag on this line

out.close()
##print('result:\n%s' % open('cleaned.html').read())
