How to replace HTML comments with custom <comment> elements

Question

I'm working on mass-converting a number of HTML files to XML using BeautifulSoup in Python.

A sample HTML file looks something like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- this is an HTML comment -->
<!-- this is another HTML comment -->
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        ...
        <!-- here is a comment inside the head tag -->
    </head>
    <body>
        ...
        <!-- Comment inside body tag -->
        <!-- Another comment inside body tag -->
        <!-- There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample. -->
    </body>
</html>
<!-- This comment is the last line of the file -->

I figured out how to find the doctype and replace it with the tag <doctype>...</doctype>, but the comments are giving me a lot of frustration. I want to replace the HTML comments with <comment>...</comment>. In this example HTML, I was able to replace the first two HTML comments, but not the comments inside the html tag or the last comment after the closing html tag.

Here is my code:

import re

import bs4
from bs4 import BeautifulSoup

file = open("sample.html", "r")
soup = BeautifulSoup(file, "xml")

for child in soup.children:

    # This takes care of the first two HTML comments
    if isinstance(child, bs4.Comment):
        child.replace_with("<comment>" + child.strip() + "</comment>")

    # This should find all nested HTML comments and replace.
    # It looks like it works but the changes are not finalized
    if isinstance(child, bs4.Tag):
        re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)
        re.sub("(-->)|(--&gt;)", "</comment>", child.text, flags=re.MULTILINE)

# The HTML comments should have been replaced but nothing changed.
print (soup.prettify(formatter=None))

This is my first time using BeautifulSoup. How do I use BeautifulSoup to find and replace all HTML comments with the <comment> tag?

Could I convert it to a byte stream via pickle, serialize it, apply the regex, and then deserialize it back to a BeautifulSoup object? Would this work or just cause more problems?

I tried using pickle on the child tag object, but deserialization fails with TypeError: __new__() missing 1 required positional argument: 'name'.

Then I tried pickling just the text of the tag, via child.text, but deserialization failed due to AttributeError: can't set attribute. Basically, child.text is read-only, which explains why the regex doesn't work. So, I have no idea how to modify the text.

Answer 1

You have a couple of problems:

1. You can't modify child.text. It's a read-only property that just calls get_text() behind the scenes, and its result is a brand new string unconnected to your document.
2. re.sub() doesn't modify anything in-place. Your line
```
re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)
```
would have had to be
```
child.text = re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)
```
... but that wouldn't work anyway, because of point 1.
3. Trying to modify the document by replacing chunks of text in it with a regex is the wrong way to use BeautifulSoup. Instead, you need to find nodes and replace them with other nodes.
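The point about re.sub() can be checked without BeautifulSoup at all. The following is a minimal sketch using only the standard library; the variable names and the sample string are made up for illustration. It shows that re.sub() returns a new string and leaves its input alone, so the result has to be assigned somewhere to have any effect.

```python
import re

original = "<!-- a comment -->"

# The return value is discarded here, so nothing observable changes.
re.sub(r"<!--", "<comment>", original)
unchanged = original

# Capturing the return value is the only way to see the substitution.
opened = re.sub(r"<!--", "<comment>", original)
replaced = re.sub(r"-->", "</comment>", opened)

print(unchanged)
print(replaced)
```

Even with the assignment, the new value is just a Python string. It has no connection to the parsed tree, which is why the fix has to happen at the node level.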

Here's a solution that works:

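import bs4

# No parser is named here, so BeautifulSoup falls back to the best one installed.
with open("example.html") as f:
    soup = bs4.BeautifulSoup(f)

# Comment is a NavigableString subclass, so a string filter can select it.
for comment in soup.find_all(text=lambda e: isinstance(e, bs4.Comment)):
    tag = bs4.Tag(name="comment")   # new <comment> element
    tag.string = comment.strip()    # comment text without surrounding whitespace
    comment.replace_with(tag)       # swap the comment node for the new tag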

This code starts by iterating over the result of a call to find_all(), taking advantage of the fact that we can pass a function as the text argument. In BeautifulSoup, Comment is a subclass of NavigableString, so we search for it as though it were a string, and the lambda ... is just shorthand for, e.g.:

def is_comment(e):
    return isinstance(e, bs4.Comment)

soup.find_all(text=is_comment)

Then, we create a new Tag with the appropriate name, set its content to be the stripped content of the original comment, and replace the comment with the tag we just created.

Here's the result:
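For experimenting without a file on disk, the same node-replacement idea can be wrapped in a small function. This is only a sketch under a few assumptions: the function name comments_to_tags and the SAMPLE markup are hypothetical, the parser is pinned to "html.parser" (the answer does not name one), the comments are collected by filtering soup.descendants rather than through find_all(), and the new elements are built with soup.new_tag().

```python
import bs4

# Hypothetical inline sample standing in for the files from the question.
SAMPLE = """<!-- top-level comment -->
<html>
<head><title>t</title><!-- head comment --></head>
<body>
<p>text</p>
<!-- body comment -->
</body>
</html>"""


def comments_to_tags(markup):
    """Return the markup with each HTML comment replaced by a <comment> element."""
    soup = bs4.BeautifulSoup(markup, "html.parser")

    # Build the full list first, because the tree is modified in the loop below.
    comments = [node for node in soup.descendants
                if isinstance(node, bs4.Comment)]

    for comment in comments:
        tag = soup.new_tag("comment")
        tag.string = comment.strip()
        comment.replace_with(tag)

    return str(soup)


result = comments_to_tags(SAMPLE)
print(result)
```

Snapshotting the comments into a list before the loop avoids replacing nodes while the tree is still being traversed.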

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<comment>this is an HTML comment</comment>
<comment>this is another HTML comment</comment>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
        ...
        <comment>here is a comment inside the head tag</comment>
</head>
<body>
        ...
        <comment>Comment inside body tag</comment>
<comment>Another comment inside body tag</comment>
<comment>There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample.</comment>
</body>
</html>
<comment>This comment is the last line of the file</comment>

Answer 1: 4 votes, accepted, 2015-02-18 18:51:38