
Replacement using multiple regexes or a bigger one in Python

I switched to Python fairly recently, and I want to clean up a large number of web pages (around 12k; they can just as well be treated as plain text files) by removing particular tags and other string patterns. For this I'm using the re.sub(..) function in Python.

My question is whether it is better, from an efficiency point of view, to create one big regular expression that matches several of my patterns, or to call the function several times with smaller and simpler regular expressions.

For example, is it better to use something like

 import re
 content = re.sub(r"<[^<>]*>", "", content)
 content = re.sub(r"some_other_pattern", "", content)

or

 content = re.sub(r"<[^<>]*>|some_other_pattern", "", content)

Of course, the patterns above are deliberately simple for the sake of the example, and I haven't compiled them here, but in my real-life scenario I will.

LE: The question is not about the HTML nature of the files, but about how Python behaves when dealing with multiple regex patterns.

Thanks!

Keep it simple.

I would say that you are safer using smaller regexes to parse through this stuff. That way, if something behaves abnormally, you don't have to go digging to find which particular section of a massive regex is misbehaving. Provided you log the replacements you make, it would be trivial to determine the source of a problem, should one arise.
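A minimal sketch of the small-patterns-with-logging idea, assuming the patterns are applied in a fixed order. The names CLEANUP_STEPS and clean, and the second pattern, are hypothetical and not from the question; re.subn is the standard-library call that returns the new string together with the number of replacements made, which makes per-pattern logging cheap.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Hypothetical cleanup steps: (label, compiled pattern, replacement).
CLEANUP_STEPS = [
    ("tags", re.compile(r"<[^<>]*>"), ""),
    ("extra_spaces", re.compile(r"[ \t]{2,}"), " "),
]


def clean(content):
    """Apply each small pattern in turn and record how many replacements it made."""
    counts = {}
    for label, pattern, replacement in CLEANUP_STEPS:
        # subn returns (new_string, number_of_replacements)
        content, n = pattern.subn(replacement, content)
        counts[label] = n
        logger.debug("pattern %r made %d replacement(s)", label, n)
    return content, counts
```

If one page comes out wrong, the per-label counts point at the step that fired more (or less) often than expected.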

You don't want to run into this

Generally speaking, "sequential" and "parallel" application are not the same and may produce different results, because sequential replacements can affect each other.
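A small illustration of that difference, using made-up one-letter patterns rather than anything from the question. In the sequential version the second substitution sees the output of the first; in the single-pass version each position of the original text is matched at most once.

```python
import re

text = "ab"

# Sequential: the "b" produced by the first call is rewritten by the second.
step1 = re.sub(r"a", "b", text)        # "bb"
sequential = re.sub(r"b", "c", step1)  # "cc"

# Single pass ("parallel"): one alternation, with the replacement chosen per match.
mapping = {"a": "b", "b": "c"}
parallel = re.sub(r"a|b", lambda m: mapping[m.group(0)], text)  # "bc"
```

When every replacement is the empty string the two approaches often agree, but they can still differ if removing one match creates text that a later pattern matches.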

As to performance, I would guess that a single expression performs better, but that's only a guess. Personally, I prefer to keep them complex and use "verbose" mode for readability's sake.
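Rather than guessing, the two variants can be timed on real input. The sketch below is one way to do that under stated assumptions: the entity pattern stands in for the unspecified second pattern, and the function names are hypothetical. It also shows the combined pattern written in verbose mode (re.VERBOSE), where whitespace in the pattern is ignored and # starts a comment.

```python
import re
import timeit

# Two small compiled patterns; ENTITY is a hypothetical stand-in
# for the question's "some_other_pattern".
TAG = re.compile(r"<[^<>]*>")
ENTITY = re.compile(r"&[a-zA-Z]+;")

# The same two alternatives as one pattern, in verbose mode.
COMBINED = re.compile(
    r"""
      <[^<>]*>       # any run between angle brackets
    | &[a-zA-Z]+;    # a named entity such as &nbsp;
    """,
    re.VERBOSE,
)


def clean_separately(content):
    """Two passes over the text, one per small pattern."""
    return ENTITY.sub("", TAG.sub("", content))


def clean_combined(content):
    """One pass over the text with the combined pattern."""
    return COMBINED.sub("", content)


def compare(content, repeat=1000):
    """Return (seconds for separate passes, seconds for the combined pass)."""
    separate = timeit.timeit(lambda: clean_separately(content), number=repeat)
    combined = timeit.timeit(lambda: clean_combined(content), number=repeat)
    return separate, combined
```

Calling compare on a representative page gives the two timings side by side. Which one wins depends on the patterns and the data, so it is worth measuring on a sample of the actual files. Note that in verbose mode a literal space in the pattern has to be escaped or placed in a character class.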

I understand your additional comment regarding "it's the non-HTML parts I'm cleaning up". Because a later RE might find and replace content that an earlier RE already replaced, you'd be better off using the "alternative" operator (|) and a single RE.
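A minimal sketch of the single-RE approach when the alternatives need different replacements. The group names, the whitespace pattern, and the function names are hypothetical. Each alternative is wrapped in a named group, and the replacement function uses match.lastgroup to see which one matched; because the text is scanned once, replacement output is never re-examined by another pattern.

```python
import re

# Hypothetical combined pattern: one named group per alternative.
COMBINED = re.compile(r"(?P<tag><[^<>]*>)|(?P<spaces>[ \t]{2,})")


def _replace(match):
    """Pick the replacement based on which alternative matched."""
    if match.lastgroup == "tag":
        return ""   # drop tags entirely
    return " "      # collapse runs of spaces/tabs to one space


def clean_once(content):
    """Single pass over the text; replaced content is not scanned again."""
    return COMBINED.sub(_replace, content)
```

For example, clean_once("<p>Hello   <b>world</b></p>") yields "Hello world".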

Also, consider using BeautifulSoup to load and examine your HTML files. This will help you find the appropriate parts of your text with far less risk of capturing some HTML construct when you only intended to replace some text.


 