Remove HTML tag contents from page using Python

I have an HTML file like the one below:

<!DOCTYPE HTML>
<html>

<head>

<title>Sezione microbiologia</title>
<link rel="stylesheet" href="./style.css">

</head>

<body>

<div id="content">
    <section id="main">
        <!-- SOME CONTENT... -->
        <h1>Prima diluizione</h1>
        <p>Some content including "prima diluizione"...</p>
        <h1>Seconda diluizione</h1>
        <p>Some content including "seconda diluizione"...</p>
        <h1>Terza diluizione</h1>
        <p>Some content including "terza diluizione"...</p>
    </section>

    <section id="second">
        <!-- SOME CONTENT... -->
    </section>

    <section id="third">
        <!-- SOME CONTENT... -->
    </section>

    <section id="footer">
        <!-- SOME CONTENT... -->
    </section>
</div>
</body>

</html>

Problem description:

I am trying to modify the <h1> headings that contain the word diluizione, replacing this word and its prefix with "Diluizione seriale". I tried to do this using Python's replace(), but the problem is that lines in the <p> paragraphs are changed as well, whereas I would like only the text in the h1 tags to be modified. On top of that, I still have not found a way to automatically strip the prefix, i.e. "Prima", "Seconda", "Terza", etc.

The code I tried

I currently came up with this:

with open('./home.html') as file:
    text = file.read()


if "diluizione" in text:
    text = text.replace("diluizione", "diluizione seriale")

But this outputs:

<div id="content">
    <section id="main">
        <!-- SOME CONTENT... -->
        <h1>Prima diluizione seriale</h1>
        <p>Some content including "prima diluizione seriale"...</p>
        <h1>Seconda diluizione seriale</h1>
        <p>Some content including "seconda diluizione seriale"...</p>
        <h1>Terza diluizione seriale</h1>
        <p>Some content including "terza diluizione seriale"...</p>
    </section>

So as you can see, even the text in the <p> tags is affected, and in the headings the prefix is still there.

My desired output would be:

<div id="content">
    <section id="main">
        <!-- SOME CONTENT... -->
        <h1>Diluizione seriale</h1>
        <p>Some content including "prima diluizione"...</p>
        <h1>Diluizione seriale</h1>
        <p>Some content including "seconda diluizione"...</p>
        <h1>Diluizione seriale</h1>
        <p>Some content including "terza diluizione"...</p>
    </section>

Any help or suggestion is very much appreciated. Thanks in advance.

You could use a regular expression through Python's re module to achieve this. To match only text inside the h1 tags, you can combine a positive lookbehind with a positive lookahead.

Code:

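import re

# Read the whole HTML file into a single string.
with open("path/to/home.html") as file:
    text = file.read()

# Raw string so that \w reaches the regex engine unchanged.
# Replace the two words between <h1> and </h1> with the new heading.
text = re.sub(r"(?<=<h1>)\w+ \w+(?=</h1>)", "Diluizione seriale", text)

print(text)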

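Explanation:

The regular expression (?<=<h1>)\w+ \w+(?=</h1>) matches exactly two words (runs of word characters) separated by a single space, and only when they make up the entire text between <h1> and </h1>. The lookbehind and lookahead check for the tags without consuming them, so the tags themselves are left untouched by the substitution.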

Output:

<!-- SOME CONTENT... -->
<h1>Diluizione seriale</h1>
<p>Some content including "prima diluizione"...</p>
<h1>Diluizione seriale</h1>
<p>Some content including "seconda diluizione"...</p>
<h1>Diluizione seriale</h1>
<p>Some content including "terza diluizione"...</p>
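
Note that the pattern above rewrites any two-word heading, whether or not it mentions diluizione. As a hedged sketch that is not part of the original answer, the variant below only rewrites <h1> elements whose text ends with the word diluizione, with any number of prefix words. The sample string and the names html, pattern and result are assumptions made for illustration.

```python
import re

# Hypothetical sample input, shortened from the question's HTML.
html = """<h1>Prima diluizione</h1>
<p>Some content including "prima diluizione"...</p>
<h1>Altro titolo</h1>
<h1>La seconda diluizione</h1>"""

# Match the text of an <h1> only when it ends with the word "diluizione".
# [^<]*? covers any prefix words without crossing into another tag.
pattern = re.compile(r"(?<=<h1>)[^<]*?\bdiluizione(?=</h1>)", re.IGNORECASE)

result = pattern.sub("Diluizione seriale", html)
print(result)
```

In this sketch the "Altro titolo" heading and the paragraph are left alone, while both headings ending in diluizione are replaced. Like any regex over HTML, it assumes plain <h1> tags with no attributes and no nested markup.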

Have a look at html.parser. Instead of trying to do string replacement on the raw text, parse the HTML into a structure and then traverse it from there.
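
One way this suggestion could look, as a minimal sketch using only the standard library's html.parser: the parser re-emits every piece of the document as it goes and swaps the text of any <h1> that contains the search word. The class name H1Rewriter, the helper rewrite_headings and the sample string are hypothetical names introduced here, not something from the answer.

```python
from html.parser import HTMLParser


class H1Rewriter(HTMLParser):
    """Re-emit the document, replacing the text of matching <h1> elements."""

    def __init__(self, needle, replacement):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.needle = needle.lower()
        self.replacement = replacement
        self.out = []        # pieces of the rewritten document
        self.h1_text = None  # buffer for heading content while inside <h1>

    def _emit(self, text):
        # Inside an <h1>, buffer the piece; elsewhere pass it through.
        (self.out if self.h1_text is None else self.h1_text).append(text)

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()  # start tag exactly as in the source
        if tag == "h1" and self.h1_text is None:
            self.out.append(raw)
            self.h1_text = []
        else:
            self._emit(raw)

    def handle_startendtag(self, tag, attrs):
        self._emit(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag == "h1" and self.h1_text is not None:
            content = "".join(self.h1_text)
            if self.needle in content.lower():
                content = self.replacement
            self.h1_text = None
            self.out.append(content + "</h1>")
        else:
            self._emit(f"</{tag}>")  # note: tag names come back lowercased

    def handle_data(self, data):
        self._emit(data)

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self._emit(f"<!{decl}>")

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")


def rewrite_headings(html, needle="diluizione", replacement="Diluizione seriale"):
    parser = H1Rewriter(needle, replacement)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = """<h1>Prima diluizione</h1>
<p>Some content including "prima diluizione"...</p>
<h1>Altro titolo</h1>"""
print(rewrite_headings(sample))
```

Because the decision is made per element rather than per line, the <p> text is never touched, and the prefix ("Prima", "Seconda", ...) disappears with the rest of the old heading text. A third-party parser such as Beautiful Soup would make this shorter, but the sketch stays within the module the answer names.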
