
How to replace every single HTML section in a large file in Python?

I have several hundred long files, each containing repeated blocks of HTML that I don't need for my further text analysis. I would like to get rid of these blocks because they occupy a lot of valuable memory when I analyze the files.

These HTML blocks are occasionally broken by a newline character. Just like regular HTML, the removable blocks always begin with <!DOCTYPE and end with </html>.

My approach was the following:

import re

content = inputfile.read()
pattern = re.compile(r'<!DOCTYPE.*[\s\S]*<\/html>')
match = pattern.findall(content)

However, this always returns only a single match. The regex matches from the very first instance of <!DOCTYPE to the very last instance of </html>. Thus, even if I have 10,000 HTML blocks across the document that I want to remove using

content = re.sub(pattern, '', content)

only one match is found, and thus almost my whole file gets removed.

How can I find all the HTML blocks separately throughout the document?

PS: I use Python 3.x and my OS is Windows 10.

Regular expression quantifiers are greedy by default. That means the pattern keeps consuming text until it reaches the last </html> instance. Make the quantifier non-greedy with ? and pass re.DOTALL so that the dot also matches newlines. Change your expression as follows:

pattern = re.compile(r'<!DOCTYPE.*?<\/html>', flags=re.DOTALL)
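
To see the difference between the two patterns, here is a minimal sketch. The sample string is a made-up stand-in for one of the files (two removable blocks, each broken by a newline), and the variable names are only illustrative:

```python
import re

# Hypothetical sample: two removable HTML blocks around text worth keeping.
sample = (
    "keep 1\n"
    "<!DOCTYPE html>\n<html><body>junk\nA</body></html>\n"
    "keep 2\n"
    "<!DOCTYPE html><html><body>junk B</body>\n</html>\n"
    "keep 3\n"
)

# Greedy pattern from the question: runs from the first <!DOCTYPE
# to the last </html>, so everything in between is swallowed.
greedy = re.compile(r"<!DOCTYPE.*[\s\S]*</html>")

# Non-greedy pattern: stops at the first </html> after each <!DOCTYPE.
# re.DOTALL lets "." match newlines inside a block.
lazy = re.compile(r"<!DOCTYPE.*?</html>", flags=re.DOTALL)

greedy_matches = greedy.findall(sample)
lazy_matches = lazy.findall(sample)

# Remove every block separately; the text between blocks survives.
cleaned = lazy.sub("", sample)

print(len(greedy_matches), len(lazy_matches))
print(cleaned)
```

With the greedy pattern, "keep 2" would be deleted along with both blocks; with the non-greedy one, only the blocks themselves are removed. If the tags in the real files vary in case (for example </HTML>), adding re.IGNORECASE to the flags should cover that, though the question does not say whether this occurs.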

