How to replace every HTML section in a large file in Python?

Question

I have several hundred long files, each containing repeated blocks of HTML that I don't need for my further text analysis. I would like to get rid of them, as they occupy quite a lot of valuable memory when I analyze these files.

These HTML blocks are occasionally broken by a newline character. Just like regular HTML, the blocks to be removed always begin with <!DOCTYPE and end with </html>.

My approach was the following:

content = inputfile.read()
pattern = re.compile('<!DOCTYPE.*[\s\S]*<\/html>')
match = pattern.findall(content)

However, this always returns a single match. The regex matches from the very first instance of <!DOCTYPE to the very last instance of </html>. Thus, even if the document contains 10,000 HTML blocks that I want to remove using

content = re.sub(pattern, '', content)

only one match is found, and so almost my whole file gets removed.

How can I find all the HTML blocks separately throughout the document?

PS: I use Python 3.x and my OS is Windows 10.

Answer 1

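Regular expression quantifiers are greedy by default. That means .* keeps matching until it reaches the last </html> instance. Adding ? makes the quantifier lazy, so it stops at the first </html> after each <!DOCTYPE, and re.DOTALL lets . match newlines as well. Change your expression as follows: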

pattern = re.compile(r'<!DOCTYPE.*?</html>', flags=re.DOTALL)
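
To see the difference between the greedy and the lazy pattern, here is a minimal sketch. The sample text in content is made up for illustration and is not from the asker's files; it assumes the blocks start with <!DOCTYPE and end with a lowercase </html>, as described in the question.

```python
import re

# Hypothetical sample: two HTML blocks embedded in ordinary text,
# the first one broken across several lines.
content = (
    "intro text\n"
    "<!DOCTYPE html>\n<html><body>first</body>\n</html>\n"
    "middle text\n"
    "<!DOCTYPE html><html><body>second</body></html>\n"
    "closing text\n"
)

# Greedy: .* runs to the LAST </html>, so everything in between is swallowed.
greedy = re.compile(r"<!DOCTYPE.*</html>", flags=re.DOTALL)
# Lazy: .*? stops at the FIRST </html> following each <!DOCTYPE.
lazy = re.compile(r"<!DOCTYPE.*?</html>", flags=re.DOTALL)

greedy_matches = greedy.findall(content)
lazy_matches = lazy.findall(content)

# Remove every block separately; the text between blocks survives.
cleaned = lazy.sub("", content)

print(len(greedy_matches), len(lazy_matches))
print(cleaned)
```

With the greedy pattern, "middle text" ends up inside the single match and would be deleted; with the lazy pattern it is kept. If the closing tag may also appear as </HTML>, adding re.IGNORECASE to the flags should cover that case.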

Solution 1
1 Accepted 2020-06-30 14:57:30
