
Remove HTML tag contents from page using Python

I have an HTML file like the one below:

<!DOCTYPE HTML>
<html>

<head>

<title>Sezione microbiologia</title>
<link rel="stylesheet" href="./style.css">

</head>

<body>

<div id="content">
    <section id="main">
        <!-- SOME CONTENT... -->
        <h1>Prima diluizione</h1>
        <p>Some content including "prima diluizione"...</p>
        <h1>Seconda diluizione</h1>
        <p>Some content including "seconda diluizione"...</p>
        <h1>Terza diluizione</h1>
        <p>Some content including "terza diluizione"...</p>
    </section>

    <section id="second">
        <!-- SOME CONTENT... -->
    </section>

    <section id="third">
        <!-- SOME CONTENT... -->
    </section>

    <section id="footer">
        <!-- SOME CONTENT... -->
    </section>
</div>
</body>

</html>

Problem description:

I am trying to modify the <h1> headings that contain the word diluizione, replacing this word and its prefix with "Diluizione seriale". I tried to do this with Python's replace(), but the problem is that lines in the <p> paragraphs are changed as well, whereas I would like only the text in the h1 tags to be modified. On top of that, I still have not found a way to automatically strip the prefix, i.e. "Prima", "Seconda", "Terza", etc.

The code I tried with

I currently came up with this:

with open('./home.html') as file:
    text = file.read()


if "diluizione" in text:
    text = text.replace("diluizione", "diluizione seriale")

But this outputs:

<div id="content">
    <section id="main">
        <!-- SOME CONTENT... -->
        <h1>Prima diluizione seriale</h1>
        <p>Some content including "prima diluizione seriale"...</p>
        <h1>Seconda diluizione seriale</h1>
        <p>Some content including "seconda diluizione seriale"...</p>
        <h1>Terza diluizione seriale</h1>
        <p>Some content including "terza diluizione seriale"...</p>
    </section>

So as you can see, even the text in the <p> tags is affected, and in the headings the prefix is still there.

My desired output would be:

<div id="content">
    <section id="main">
        <!-- SOME CONTENT... -->
        <h1>Diluizione seriale</h1>
        <p>Some content including "prima diluizione"...</p>
        <h1>Diluizione seriale</h1>
        <p>Some content including "seconda diluizione"...</p>
        <h1>Diluizione seriale</h1>
        <p>Some content including "terza diluizione"...</p>
    </section>

Any help or suggestion is much appreciated. Thanks very much in advance.

You could use a regular expression through Python's re module to achieve this. To restrict the match to text within the h1 tags, you can use a positive lookbehind and a positive lookahead.

Code:

import re

with open("path/to/home.html") as file:
    text = file.read()

# Raw string, so that \w is passed to the regex engine unchanged.
text = re.sub(r"(?<=<h1>)\w+ \w+(?=</h1>)", "Diluizione seriale", text)

print(text)

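Explanation:

The regular expression (?<=<h1>)\w+ \w+(?=</h1>) matches two words separated by a single space, but only when they are immediately preceded by <h1> and immediately followed by </h1>. The lookbehind and lookahead are not part of the match, so the tags themselves are left in place. Note that this replaces any two-word heading, not only those ending in diluizione.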

Output:

<!-- SOME CONTENT... -->
<h1>Diluizione seriale</h1>
<p>Some content including "prima diluizione"...</p>
<h1>Diluizione seriale</h1>
<p>Some content including "seconda diluizione"...</p>
<h1>Diluizione seriale</h1>
<p>Some content including "terza diluizione"...</p>
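
If the page may contain other two-word headings, the pattern can be tightened so that the second word has to be diluizione. The following is only a sketch of that variation: the inline html string and the extra "Introduzione generale" heading are made up for illustration and are not part of the original page.

```python
import re

# Hypothetical sample: one heading that should NOT be touched.
html = '''<h1>Prima diluizione</h1>
<p>Some content including "prima diluizione"...</p>
<h1>Introduzione generale</h1>
<h1>Seconda diluizione</h1>'''

# One prefix word, a space, then the literal word "diluizione",
# only when wrapped directly in <h1>...</h1>.
pattern = re.compile(r"(?<=<h1>)\w+ diluizione(?=</h1>)")

result = pattern.sub("Diluizione seriale", html)
print(result)
```

Both approaches assume the heading is written exactly as <h1>text</h1>, with no attributes on the tag and no extra whitespace or nested markup inside it.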

Have a look at html.parser. Instead of doing string manipulation on the raw text, parse the HTML into a structure and then traverse it from there.
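
A minimal sketch of that idea with the standard library's html.parser.HTMLParser is shown here. HTMLParser is event based and does not build a tree by itself, so this version re-emits every piece of the document as it is parsed and only rewrites text seen while inside an <h1>. The class name H1Rewriter, the out list and the sample string are assumptions made for the example.

```python
import re
from html.parser import HTMLParser


class H1Rewriter(HTMLParser):
    """Re-emit the HTML as parsed, rewriting only text inside <h1> tags."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
        self.out.append(self.get_starttag_text())  # original tag text

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False
        self.out.append(f"</{tag}>")  # end tags are re-emitted in lower case

    def handle_data(self, data):
        # Replace "<prefix> diluizione" only while inside an <h1>.
        if self.in_h1 and re.fullmatch(r"\w+ diluizione", data.strip(), re.IGNORECASE):
            data = "Diluizione seriale"
        self.out.append(data)

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


sample = """<!DOCTYPE HTML>
<section id="main">
    <!-- SOME CONTENT... -->
    <h1>Prima diluizione</h1>
    <p>Some content including "prima diluizione"...</p>
    <h1>Seconda diluizione</h1>
</section>"""

parser = H1Rewriter()
parser.feed(sample)
parser.close()  # flush any buffered trailing text
result = "".join(parser.out)
print(result)
```

Compared with a regular expression on the raw text, this still works when the <h1> tag carries attributes, at the cost of more code.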
