How to replace HTML comments with custom <comment> elements

I'm working on mass-converting a number of HTML files to XML using BeautifulSoup in Python.

A sample HTML file looks something like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- this is an HTML comment -->
<!-- this is another HTML comment -->
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        ...
        <!-- here is a comment inside the head tag -->
    </head>
    <body>
        ...
        <!-- Comment inside body tag -->
        <!-- Another comment inside body tag -->
        <!-- There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample. -->
    </body>
</html>
<!-- This comment is the last line of the file -->

I figured out how to find the doctype and replace it with the tag <doctype>...</doctype>, but the comments are giving me a lot of frustration. I want to replace the HTML comments with <comment>...</comment>. In this example HTML, I was able to replace the first two HTML comments, but not the ones inside the html tag or the last comment after the closing html tag.

Here is my code:

import re

import bs4
from bs4 import BeautifulSoup

file = open("sample.html", "r")
soup = BeautifulSoup(file, "xml")

for child in soup.children:

    # This takes care of the first two HTML comments
    if isinstance(child, bs4.Comment):
        child.replace_with("<comment>" + child.strip() + "</comment>")

    # This should find all nested HTML comments and replace.
    # It looks like it works but the changes are not finalized
    if isinstance(child, bs4.Tag):
        re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)
        re.sub("(-->)|(--&gt;)", "</comment>", child.text, flags=re.MULTILINE)

# The HTML comments should have been replaced but nothing changed.
print (soup.prettify(formatter=None))

This is my first time using BeautifulSoup. How do I use BeautifulSoup to find and replace all HTML comments with the <comment> tag?

Could I convert it to a byte stream via pickle, apply a regex to the serialized data, and then deserialize it back into a BeautifulSoup object? Would this work, or just cause more problems?

I tried using pickle on the child tag object, but deserialization fails with TypeError: __new__() missing 1 required positional argument: 'name'.

Then I tried pickling just the text of the tag, via child.text, but deserialization failed with AttributeError: can't set attribute. Basically, child.text is read-only, which explains why the regex doesn't work. So I have no idea how to modify the text.

You have a couple of problems:

  1. You can't modify child.text. It's a read-only property that just calls get_text() behind the scenes, and its result is a brand new string unconnected to your document.

  2. re.sub() doesn't modify anything in-place. Your line

     re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE) 

    would have had to be

     child.text = re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE) 

    ... but that wouldn't work anyway, because of point 1.

  3. Trying to modify the document by replacing chunks of text in it with a regex is the wrong way to use BeautifulSoup. Instead, you need to find nodes and replace them with other nodes.
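
Points 1 and 2 can be checked without touching a file. The following is a minimal sketch, not part of the original answer: it assumes the built-in html.parser, and the one-line document and the variable names are made up for illustration.

```python
import re

import bs4

# Hypothetical one-line document, used only for illustration.
soup = bs4.BeautifulSoup("<body><!-- note --><p>hello</p></body>", "html.parser")
body = soup.body

# .text builds a fresh string each time it is read.
text = body.text

# re.sub() returns a new string and leaves its input alone.
changed = re.sub("hello", "goodbye", text)

# Assigning to .text is rejected because the property has no setter.
try:
    body.text = changed
    assigned = True
except AttributeError:
    assigned = False

# The parsed document is the same as before any of the above.
after = str(soup)
print(text, changed, assigned, after, sep="\n")
```

Whatever is done to the string held in text, the tree behind soup never sees it, which is why the regex approach appears to run but changes nothing.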

Here's a solution that works:

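import bs4

with open("example.html") as f:
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = bs4.BeautifulSoup(f, "html.parser")

# Comment nodes are strings, so they are matched through the text argument
# (newer Beautiful Soup releases also accept it under the name string).
for comment in soup.find_all(text=lambda e: isinstance(e, bs4.Comment)):
    tag = bs4.Tag(name="comment")   # new, empty <comment> element
    tag.string = comment.strip()    # comment text without surrounding whitespace
    comment.replace_with(tag)       # swap the comment node for the element

print(soup)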

This code starts by iterating over the result of a call to find_all(), taking advantage of the fact that we can pass a function as the text argument. In BeautifulSoup, Comment is a subclass of NavigableString, so we search for it as though it were a string, and the lambda ... is just a shorthand for e.g.

def is_comment(e):
    return isinstance(e, bs4.Comment)

soup.find_all(text=is_comment)

Then, we create a new Tag with the appropriate name, set its content to be the stripped content of the original comment, and replace the comment with the tag we just created.

Here's the result:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<comment>this is an HTML comment</comment>
<comment>this is another HTML comment</comment>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
        ...
        <comment>here is a comment inside the head tag</comment>
</head>
<body>
        ...
        <comment>Comment inside body tag</comment>
<comment>Another comment inside body tag</comment>
<comment>There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample.</comment>
</body>
</html>
<comment>This comment is the last line of the file</comment>
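
For converting many files, the same approach can be wrapped in a function that works on strings. This is only a sketch under stated assumptions, not part of the original answer: the function name comments_to_tags, its tag_name parameter, and the sample string are made up for illustration. It assumes the built-in html.parser, and it uses soup.new_tag() and the string= spelling of the search argument, both of which exist in Beautiful Soup 4.

```python
import bs4


def comments_to_tags(html, tag_name="comment"):
    """Return html with every comment node replaced by a <tag_name> element.

    Hypothetical helper for illustration; html is a str of markup.
    """
    soup = bs4.BeautifulSoup(html, "html.parser")
    # find_all() returns a list, so the tree can be modified while looping.
    comments = soup.find_all(string=lambda s: isinstance(s, bs4.Comment))
    for comment in comments:
        tag = soup.new_tag(tag_name)      # empty element owned by this soup
        tag.string = comment.strip()      # comment text without outer whitespace
        comment.replace_with(tag)         # swap the node in place
    return str(soup)


# Made-up sample with comments inside and after the html element.
sample = (
    "<html><body><!-- first --><p>text</p><!-- second --></body></html>"
    "<!-- last -->"
)
result = comments_to_tags(sample)
print(result)
```

Returning a string keeps the function easy to call once per file, for example inside a loop over the files to be converted.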
