
How to replace HTML comments with custom <comment> elements

I'm working on mass-converting a number of HTML files to XML using BeautifulSoup in Python.

A sample HTML file looks something like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- this is an HTML comment -->
<!-- this is another HTML comment -->
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        ...
        <!-- here is a comment inside the head tag -->
    </head>
    <body>
        ...
        <!-- Comment inside body tag -->
        <!-- Another comment inside body tag -->
        <!-- There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample. -->
    </body>
</html>
<!-- This comment is the last line of the file -->

I figured out how to find the doctype and replace it with a <doctype>...</doctype> tag, but the comments are giving me a lot of trouble. I want to replace each HTML comment with <comment>...</comment>. In this example HTML, I was able to replace the first two comments, but not the ones inside the html tag or the last one after the closing html tag.

Here is my code:

import re

import bs4
from bs4 import BeautifulSoup

file = open("sample.html", "r")
soup = BeautifulSoup(file, "xml")

for child in soup.children:

    # This takes care of the first two HTML comments
    if isinstance(child, bs4.Comment):
        child.replace_with("<comment>" + child.strip() + "</comment>")

    # This should find all nested HTML comments and replace.
    # It looks like it works but the changes are not finalized
    if isinstance(child, bs4.Tag):
        re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)
        re.sub("(-->)|(--&gt;)", "</comment>", child.text, flags=re.MULTILINE)

# The HTML comments should have been replaced but nothing changed.
print(soup.prettify(formatter=None))

This is my first time using BeautifulSoup. How do I use BeautifulSoup to find and replace all HTML comments with the <comment> tag?

Could I serialize it to a byte stream with pickle, apply the regex, and then deserialize it back into a BeautifulSoup object? Would this work or just cause more problems?

I tried using pickle on the child tag object but deserialization fails with TypeError: __new__() missing 1 required positional argument: 'name' .

Then I tried pickling just the text of the tag, via child.text , but deserialization failed due to AttributeError: can't set attribute . Basically, child.text is read-only, which explains why the regex doesn't work. So, I have no idea how to modify the text.

You have a couple of problems:

  1. You can't modify child.text. It's a read-only property that just calls get_text() behind the scenes, and its result is a brand new string unconnected to your document.

  2. re.sub() doesn't modify anything in-place; it returns a new string. Your line

         re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)

     would have had to be

         child.text = re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)

     ... but that wouldn't work anyway, because of point 1.

  3. Trying to modify the document by replacing chunks of text in it with a regex is the wrong way to use BeautifulSoup. Instead, you need to find nodes and replace them with other nodes.
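To make points 1 and 2 concrete, here is a minimal sketch. The one-line HTML snippet and the variable names are made up for illustration, and it assumes the built-in "html.parser" backend:

```python
import re
from bs4 import BeautifulSoup

# Tiny illustrative document (not the asker's file)
soup = BeautifulSoup("<body><p>hello</p></body>", "html.parser")
body = soup.body
before = str(soup)

# re.sub() returns a new string; neither body.text nor the document changes.
replaced = re.sub("hello", "goodbye", body.text)
unchanged = str(soup) == before

# Assigning to .text fails because it is a read-only property.
try:
    body.text = replaced
    assignment_failed = False
except AttributeError:
    assignment_failed = True

print(replaced, unchanged, assignment_failed)
```

The substitution itself succeeds, but only on a detached copy of the text, which is why the printed document never changed.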

Here's a solution that works:

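import bs4

# Name the parser explicitly so the result doesn't depend on what is installed
with open("example.html") as f:
    soup = bs4.BeautifulSoup(f, "html.parser")

# find_all() returns a list, so replacing nodes while looping over it is safe
for comment in soup.find_all(text=lambda e: isinstance(e, bs4.Comment)):
    tag = bs4.Tag(name="comment")    # new, empty <comment> element
    tag.string = comment.strip()     # comment text without surrounding whitespace
    comment.replace_with(tag)        # swap the comment node for the new tag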

This code starts by iterating over the result of a call to find_all(), taking advantage of the fact that we can pass a function as the text argument (Beautiful Soup 4.4.0 and later also accept this argument under the name string). In BeautifulSoup, Comment is a subclass of NavigableString, so we search for it as though it were a string. The lambda ... is just shorthand for, e.g.:

def is_comment(e):
    return isinstance(e, bs4.Comment)

soup.find_all(text=is_comment)

Then, we create a new Tag with the appropriate name, set its content to be the stripped content of the original comment, and replace the comment with the tag we just created.
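If you want to try the same steps without a file on disk, the following is a self-contained sketch. The inline HTML string and the names is_comment and result are illustrative only. It also makes two substitutions that are my assumptions rather than part of the code above: soup.new_tag() instead of constructing bs4.Tag directly, and the string= spelling of the filter argument (Beautiful Soup 4.4.0 or later).

```python
from bs4 import BeautifulSoup, Comment

# Illustrative input: comments in the head, in the body, and after </html>
html = """<html><head><!-- in head --></head>
<body><p>text</p><!-- in body --></body></html>
<!-- after html -->"""

soup = BeautifulSoup(html, "html.parser")


def is_comment(node):
    """Return True for comment nodes, False for ordinary text."""
    return isinstance(node, Comment)


# find_all() searches the whole tree, so nested and top-level comments
# are all found, not just the direct children of the soup.
for comment in soup.find_all(string=is_comment):
    tag = soup.new_tag("comment")   # assumption: new_tag() instead of bs4.Tag(...)
    tag.string = comment.strip()
    comment.replace_with(tag)

result = str(soup)
print(result)
```

Because the search covers every descendant, this avoids the problem in the question's loop, which only looked at soup.children.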

Here's the result:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<comment>this is an HTML comment</comment>
<comment>this is another HTML comment</comment>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
        ...
        <comment>here is a comment inside the head tag</comment>
</head>
<body>
        ...
        <comment>Comment inside body tag</comment>
<comment>Another comment inside body tag</comment>
<comment>There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample.</comment>
</body>
</html>
<comment>This comment is the last line of the file</comment>


 