How to replace every single HTML section in a large file in Python?
I have several hundred long files, each containing repeated blocks of HTML that I don't need for my further text analysis. I would like to get rid of them, as they occupy quite a lot of valuable memory when analyzing these files.
These HTML blocks are occasionally broken by a newline character. Just like regular HTML, the removable blocks always begin with <!DOCTYPE and end with </html>.
My approach was the following:
content = inputfile.read()
pattern = re.compile('<!DOCTYPE.*[\s\S]*<\/html>')
match = pattern.findall(content)
However, this always returns only a single match. The regex correctly identifies the very first instance of <!DOCTYPE and the very last instance of </html>. Thus, even if I have 10,000 HTML blocks across the document that I want to remove using
content = re.sub(pattern, '', content)
only one match is found, and so almost my whole file gets removed.
How could I find all the HTML blocks separately throughout the document?
PS: I use Python 3.x and my OS is Windows 10.
Regular expression quantifiers are greedy by default. That means the match keeps extending until it reaches the last </html> instance. Change your expression as follows:
pattern = re.compile(r'<!DOCTYPE.*?</html>', flags=re.DOTALL)
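To see the difference in practice, here is a minimal sketch of the non-greedy approach applied to a small made-up string. The `?` after `.*` makes the quantifier stop at the first `</html>` it can reach, and `re.DOTALL` lets `.` match newline characters, so blocks broken across lines are still caught. The sample text, the helper name `strip_html_blocks`, and the `re.IGNORECASE` flag are assumptions added for illustration, not part of the original question.

```python
import re

# Non-greedy pattern: each match ends at the first </html> after a <!DOCTYPE.
# re.DOTALL lets "." cross newlines; re.IGNORECASE is an extra assumption
# in case some blocks use <!doctype or </HTML>.
HTML_BLOCK = re.compile(r"<!DOCTYPE.*?</html>", flags=re.DOTALL | re.IGNORECASE)


def strip_html_blocks(text):
    """Return text with every <!DOCTYPE ... </html> block removed."""
    return HTML_BLOCK.sub("", text)


# Hypothetical sample: two HTML blocks, the first broken by newlines.
sample = (
    "intro text\n"
    "<!DOCTYPE html><html>\n<body>one</body>\n</html>\n"
    "middle text\n"
    "<!DOCTYPE html><html><body>two</body></html>\n"
    "closing text\n"
)

blocks = HTML_BLOCK.findall(sample)   # each block is matched separately
cleaned = strip_html_blocks(sample)   # text between the blocks survives

print(len(blocks))
print(cleaned)
```

With the greedy version of the pattern, the same sample would yield a single match spanning from the first `<!DOCTYPE` to the last `</html>`, which is why "middle text" would be lost.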