Remove HTML tag content from a page using Python

Question

I have an HTML file like the one below:

<!DOCTYPE HTML>
<html>

<head>

<title>Sezione microbiologia</title>
<link rel="stylesheet" src="./style.css">

</head>

<body>

<div id="content">
    <section id="main">
        <!-- SOME CONTENT... -->
        <h1>Prima diluizione</h1>
        <p>Some content including "prima diluizione"...</p>
        <h1>Seconda diluizione</h1>
        <p>Some content including "seconda diluizione"...</p>
        <h1>Terza diluizione</h1>
        <p>Some content including "terza diluizione"...</p>
    </section>

    <section id="second">
        <!-- SOME CONTENT... -->
    </section>

    <section id="third">
        <!-- SOME CONTENT... -->
    </section>

    <section id="footer">
        <!-- SOME CONTENT... -->
    </section>
</div>
</body>

</html>

Problem description:

I am trying to modify the <h1> headings that contain the word diluizione, replacing this word and its prefix with "Diluizione seriale". I tried to do this with Python's replace(), but the problem is that lines in the <p> paragraphs are changed as well, whereas I only want the lines in the h1 tags to be modified. On top of that, I still have not found a way to automatically remove the prefix, i.e. "Prima", "Seconda", "Terza", etc.

The code I tried

So far I have come up with this:

with open('./home.html') as file:
    text = file.read()


if "diluizione" in text:
    text = text.replace("diluizione", "diluizione seriale")

But this outputs:

<div id="content">
    <section id="main">
        <!-- SOME CONTENT... -->
        <h1>Prima diluizione seriale</h1>
        <p>Some content including "prima diluizione seriale"...</p>
        <h1>Seconda diluizione seriale</h1>
        <p>Some content including "seconda diluizione seriale"...</p>
        <h1>Terza diluizione seriale</h1>
        <p>Some content including "terza diluizione seriale"...</p>
    </section>

As you can see, even the text in the <p> tags is affected, and in the headings the prefix is still there.

My desired output would be:

<div id="content">
    <section id="main">
        <!-- SOME CONTENT... -->
        <h1>Diluizione seriale</h1>
        <p>Some content including "prima diluizione"...</p>
        <h1>Diluizione seriale</h1>
        <p>Some content including "seconda diluizione"...</p>
        <h1>Diluizione seriale</h1>
        <p>Some content including "terza diluizione"...</p>
    </section>

Any help or suggestion is much appreciated. Thanks very much in advance.

Answer 1

You could use a regular expression through Python's re module to achieve this. To match only the text within the h1 tags, you can use a positive lookbehind and a positive lookahead.

Code:

import re

with open("path/to/home.html") as file:
    text = file.read()

# Raw string so that \w reaches the regex engine without an invalid-escape warning
text = re.sub(r"(?<=<h1>)\w+ \w+(?=</h1>)", "Diluizione seriale", text)

print(text)

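Explanation:

The regular expression (?<=<h1>)\w+ \w+(?=</h1>) matches two words (runs of word characters) separated by a single space, but only when they are immediately preceded by <h1> and immediately followed by </h1>. The lookbehind and lookahead are not part of the match, so the tags themselves stay in place. Note that it matches any two-word heading, not only those containing "diluizione".

Output: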

<!-- SOME CONTENT... -->
<h1>Diluizione seriale</h1>
<p>Some content including "prima diluizione"...</p>
<h1>Diluizione seriale</h1>
<p>Some content including "seconda diluizione"...</p>
<h1>Diluizione seriale</h1>
<p>Some content including "terza diluizione"...</p>
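If the headings may have attributes, more than one prefix word, or different capitalisation, the pattern can be tightened so that it only touches headings ending in "diluizione". The following is a minimal sketch of that idea, not part of the original answer; the names H1_DILUIZIONE and replace_diluizione_headings are made up for illustration, and it assumes the heading contains plain text with no nested tags.

```python
import re

# Group 1: the opening <h1 ...> tag (attributes allowed).
# [^<]* : any prefix text that does not start another tag.
# Group 2: the closing </h1> tag.
H1_DILUIZIONE = re.compile(
    r"(<h1\b[^>]*>)[^<]*\bdiluizione\s*(</h1>)",
    re.IGNORECASE,
)


def replace_diluizione_headings(html, replacement="Diluizione seriale"):
    """Replace the text of every <h1> that ends with 'diluizione'."""
    # A function is used as the replacement so that backslashes in
    # `replacement` are never interpreted as group references.
    return H1_DILUIZIONE.sub(
        lambda m: m.group(1) + replacement + m.group(2), html
    )


sample = (
    '<h1>Prima diluizione</h1>\n'
    '<p>Some content including "prima diluizione"...</p>\n'
    '<h1 class="x">La seconda diluizione</h1>\n'
    '<h1>Altro titolo</h1>\n'
)
print(replace_diluizione_headings(sample))
```

Headings that do not mention "diluizione", and all <p> paragraphs, are left untouched. Regular expressions remain brittle on real-world HTML, so this is only reasonable for markup as regular as the example.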

Answer 2

Have a look at html.parser. Instead of trying to do string manipulation, parse the HTML into a structure and then traverse it from there.
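One way this suggestion could look is sketched below, using only the standard-library html.parser module. It is an illustration under assumptions rather than the answerer's own code: the class H1Rewriter and the function rewrite_h1 are hypothetical names, the parser re-emits everything it sees and only swaps the text of <h1> elements that contain the keyword, and end tags come out in lowercase.

```python
from html.parser import HTMLParser


class H1Rewriter(HTMLParser):
    """Re-emit HTML as parsed, replacing <h1> text that contains a keyword."""

    def __init__(self, keyword, replacement):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.keyword = keyword.lower()
        self.replacement = replacement
        self.out = []        # pieces of the rebuilt document
        self.h1_text = None  # buffer for heading content; None outside <h1>

    def emit(self, piece):
        # Inside an <h1>, buffer the piece; otherwise write it straight out
        target = self.out if self.h1_text is None else self.h1_text
        target.append(piece)

    def handle_starttag(self, tag, attrs):
        self.emit(self.get_starttag_text())  # original tag text, attributes included
        if tag == "h1" and self.h1_text is None:
            self.h1_text = []

    def handle_startendtag(self, tag, attrs):
        self.emit(self.get_starttag_text())  # self-closing tag such as <br/>

    def handle_endtag(self, tag):
        if tag == "h1" and self.h1_text is not None:
            content = "".join(self.h1_text)
            self.h1_text = None
            if self.keyword in content.lower():
                content = self.replacement
            self.out.append(content)
        self.emit(f"</{tag}>")

    def handle_data(self, data):
        self.emit(data)

    def handle_entityref(self, name):
        self.emit(f"&{name};")

    def handle_charref(self, name):
        self.emit(f"&#{name};")

    def handle_comment(self, data):
        self.emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.emit(f"<!{decl}>")


def rewrite_h1(html, keyword="diluizione", replacement="Diluizione seriale"):
    parser = H1Rewriter(keyword, replacement)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = (
    "<!DOCTYPE HTML>\n"
    "<!-- SOME CONTENT... -->\n"
    "<h1>Prima diluizione</h1>\n"
    '<p>Some content including "prima diluizione"...</p>\n'
    "<h1>Altro titolo</h1>\n"
)
print(rewrite_h1(sample))
```

To apply it to the file from the question, read the file into a string, pass it to rewrite_h1, and write the result back. Because the decision is made per element rather than per line, text in <p> paragraphs is never considered for replacement.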

Answer 1: 2, accepted, 2021-03-24 19:39:36
Answer 2: 1, 2021-03-24 19:37:10
