Remove HTML tag contents from page using Python

Question

I have an HTML file like the one below:

<!DOCTYPE HTML>
<html>

<head>

<title>Sezione microbiologia</title>
<link rel="stylesheet" src="./style.css">

</head>

<body>

<div id="content">
    <section id="main">
        <!-- SOME CONTENT... -->
        <h1>Prima diluizione</h1>
        <p>Some content including "prima diluizione"...</p>
        <h1>Seconda diluizione</h1>
        <p>Some content including "seconda diluizione"...</p>
        <h1>Terza diluizione</h1>
        <p>Some content including "terza diluizione"...</p>
    </section>

    <section id="second">
        <!-- SOME CONTENT... -->
    </section>

    <section id="third">
        <!-- SOME CONTENT... -->
    </section>

    <section id="footer">
        <!-- SOME CONTENT... -->
    </section>
</div>
</body>

</html>

Problem description:

I am trying to modify the <h1> headings that contain the word diluizione, replacing this word and its prefix with "Diluizione seriale". I tried Python's str.replace(), but it also changes the text in the <p> paragraphs, whereas I only want the text inside the h1 tags to be modified. On top of that, I have not found a way to automatically strip the prefix, i.e. "Prima", "Seconda", "Terza", etc.

The code I tried

So far I have come up with this:

with open('./home.html') as file:
    text = file.read()


if "diluizione" in text:
    text = text.replace("diluizione", "diluizione seriale")

But this outputs:

<div id="content">
    <section id="main">
        <!-- SOME CONTENT... -->
        <h1>Prima diluizione seriale</h1>
        <p>Some content including "prima diluizione seriale"...</p>
        <h1>Seconda diluizione seriale</h1>
        <p>Some content including "seconda diluizione seriale"...</p>
        <h1>Terza diluizione seriale</h1>
        <p>Some content including "terza diluizione seriale"...</p>
    </section>

So as you can see, the text in the <p> tags is affected too, and the prefix is still there in the headings.

My desired output would be:

<div id="content">
    <section id="main">
        <!-- SOME CONTENT... -->
        <h1>Diluizione seriale</h1>
        <p>Some content including "prima diluizione"...</p>
        <h1>Diluizione seriale</h1>
        <p>Some content including "seconda diluizione"...</p>
        <h1>Diluizione seriale</h1>
        <p>Some content including "terza diluizione"...</p>
    </section>

Any help or suggestion is much appreciated. Thanks very much in advance.

Answer 1

You could use a regular expression through Python's re module to achieve this. To match only the text inside the h1 tags, you can use a positive lookbehind and a positive lookahead.

Code:

import re

with open("path/to/home.html") as file:
    text = file.read()

# Raw string so that \w reaches the regex engine as written
text = re.sub(r"(?<=<h1>)\w+ \w+(?=</h1>)", "Diluizione seriale", text)

print(text)

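Explanation:

The regular expression (?<=<h1>)\w+ \w+(?=</h1>) matches two words separated by a single space, but only when they are immediately preceded by <h1> and immediately followed by </h1>. The lookbehind and lookahead are not part of the match, so the tags themselves are left in place. Note that it replaces any two-word h1 heading, whether or not it contains "diluizione".

Output (excerpt):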

<!-- SOME CONTENT... -->
<h1>Diluizione seriale</h1>
<p>Some content including "prima diluizione"...</p>
<h1>Diluizione seriale</h1>
<p>Some content including "seconda diluizione"...</p>
<h1>Diluizione seriale</h1>
<p>Some content including "terza diluizione"...</p>
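
If the headings may carry attributes, or if only headings that actually mention "diluizione" should change, the pattern can be loosened. The following is a minimal sketch under those assumptions and is not part of the original answer. The names H1_WITH_KEYWORD and replace_headings are made up for illustration, and it still assumes each heading holds plain text with no nested tags:

```python
import re

# Group 1: the opening <h1 ...> tag, with or without attributes.
# Middle:  heading text (no nested tags) that contains the word "diluizione".
# Group 2: the closing </h1> tag.
H1_WITH_KEYWORD = re.compile(
    r"(<h1\b[^>]*>)[^<]*\bdiluizione\b[^<]*(</h1>)",
    flags=re.IGNORECASE,
)


def replace_headings(text, replacement="Diluizione seriale"):
    """Replace the text of every <h1> that mentions 'diluizione'."""
    # A function is used as the replacement so that backslashes or
    # group references in `replacement` are never interpreted by re.
    return H1_WITH_KEYWORD.sub(
        lambda m: m.group(1) + replacement + m.group(2), text
    )


sample = (
    '<h1 class="title">Prima diluizione</h1>\n'
    '<p>Some content including "prima diluizione"...</p>\n'
    "<h1>Altro titolo</h1>"
)
print(replace_headings(sample))
```

Headings without the keyword, such as the hypothetical "Altro titolo" above, are left alone, as is the paragraph text.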

Answer 2

Have a look at html.parser. Instead of manipulating the HTML as a plain string, parse it and then work on the parsed elements.
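
As a rough sketch of what that could look like with only the standard library: html.parser is event-based, so one option is to subclass HTMLParser, write every piece of the document back out unchanged, and swap the text only while inside an <h1>. The class name HeadingRewriter and the helper rewrite_headings are hypothetical names chosen for this example, and the sketch assumes the headings contain plain text without nested tags:

```python
from html.parser import HTMLParser


class HeadingRewriter(HTMLParser):
    """Re-emit the HTML as parsed, replacing <h1> text that mentions a keyword."""

    def __init__(self, keyword, replacement):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.keyword = keyword.lower()
        self.replacement = replacement
        self.in_h1 = False
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
        # get_starttag_text() returns the start tag exactly as it was written
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, unchanged
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Only text inside an <h1> that mentions the keyword is replaced
        if self.in_h1 and self.keyword in data.lower():
            data = self.replacement
        self.out.append(data)

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def rewrite_headings(html, keyword="diluizione", replacement="Diluizione seriale"):
    parser = HeadingRewriter(keyword, replacement)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


page = (
    "<!DOCTYPE HTML>\n"
    '<section id="main">\n'
    "<!-- SOME CONTENT... -->\n"
    "<h1>Prima diluizione</h1>\n"
    '<p>Some content including "prima diluizione"...</p>\n'
    "</section>"
)
print(rewrite_headings(page))
```

Because the paragraph text is never seen while in_h1 is set, it passes through untouched. CDATA sections and headings with nested markup are not handled here; for those, a tree-building parser would probably be the more comfortable choice.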

solution1
2 ACCEPTED 2021-03-24 19:39:36

solution2
1 2021-03-24 19:37:10
