Python html2text regex performance

I have built an HTML-to-plain-text regex sequence. I use it in up to 100 threads to clean up HTML files. I want to get all the visible text of a given HTML file.

    self.content = re.sub(r'<!--(.|\n)*?-->', '', self.content)
    self.content = re.sub(r'<script (.|\n)*?>(.|\n)*?</script>', '', self.content)
    self.content = re.sub(r'<style (.|\n)*?>(.|\n)*?</style>', '', self.content)
    self.content = re.sub(r'(<[^>]*?>+)', ' ', self.content)

I am not really a regex pro. Maybe I could improve the performance of these regexes?

I don't want to use BeautifulSoup, Django, or the html2text C++ distribution. In my tests they were slower than my regexes. I just need a space-separated string, not a tree or links etc.

Thanks for helping. I know there are some really smart people on Stack Overflow.

Use a tool like BeautifulSoup or htmllib and don't try to be smarter than the rest of the world. Parsing HTML using regular expressions is the worst thing you can do! There will always be one more HTML file on which your regexes fail.
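For comparison, the parser-based route can be sketched with the standard library alone. This is only an illustration, not the code of either tool named above: it assumes Python 3, where `htmllib` no longer exists and `html.parser` takes its place, and the names `VisibleTextParser` and `visible_text` are made up for this example. Whether it is fast enough for the use case in the question would have to be measured.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {'script', 'style'}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach this method; they go to handle_comment,
        # which does nothing by default.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse whitespace into single spaces.
        return ' '.join(' '.join(self._chunks).split())


def visible_text(html):
    """Hypothetical helper: return the visible text of an HTML string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Like the regex sequence in the question, this puts a space wherever a tag was, so a word split by inline markup comes out as two words.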

There is a common credo according to which HTML and XML texts must ne-e-ever be treated with regex tools. You must take into account that the risks of such treatments are real, and impossible to manage if regexes are used for overly ambitious aims. HTML and XML are markup languages too complicated to be analysed by regexes.

However, I don't entirely share this common credo. In my opinion, regexes are not such an absurd method if they are used lucidly, that is, only under conditions that can reasonably be considered to legitimate their use because the risks appear minimal.

  • I believe that regexes can be used for limited and simple treatments of HTML or XML texts. I have well understood here on stackoverflow.com that it is impracticable to parse HTML/XML with regexes. But when a treatment doesn't involve parsing (extracting all or part of a markup tree), why reject regexes so religiously? (I allude to the cited link.)

  • It seems to me that a good safety measure is to restrict regex-based code to texts from a constant origin, and not to try to make it analyse arbitrary HTML or XML texts.

After these warnings, I dare to propose the following improvements to your REs:

re.sub('<!--.*?-->', '', self.content, flags=re.DOTALL)

and

re.sub('<(script|style) .*?\\1>', '', self.content, flags=re.DOTALL)
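Put together, the two patterns above and the tag-stripping step from the question might look like the sketch below. The names `COMMENT_RE`, `SCRIPT_STYLE_RE`, `TAG_RE` and `html_to_text` are invented for illustration, and precompiling is my own addition: the `re` module already caches compiled patterns, so the gain from it may be small and should be measured.

```python
import re

# Compiled once at import time and reused by every call.
COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)
# The backreference \1 makes the match end at the same tag name that opened it.
# As in the question, a space is still required after the tag name, so a bare
# <script> or <style> without attributes is not matched by this pattern.
SCRIPT_STYLE_RE = re.compile(r'<(script|style) .*?\1>', re.DOTALL)
# Tag stripping as in the question, without the capturing group.
TAG_RE = re.compile(r'<[^>]*>+')


def html_to_text(html):
    """Hypothetical helper: strip comments, script/style blocks and tags."""
    text = COMMENT_RE.sub('', html)
    text = SCRIPT_STYLE_RE.sub('', text)
    text = TAG_RE.sub(' ', text)
    # Collapse runs of whitespace so the result is a space-separated string.
    return ' '.join(text.split())
```

With `re.DOTALL`, `.` also matches newlines, so `.*?` does the job of `(.|\n)*?` without an alternation and a capture at every character.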
