How to replace HTML comments with custom <comment> elements
I'm working on mass-converting a number of HTML files to XML using BeautifulSoup in Python.
A sample HTML file looks something like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- this is an HTML comment -->
<!-- this is another HTML comment -->
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
...
<!-- here is a comment inside the head tag -->
</head>
<body>
...
<!-- Comment inside body tag -->
<!-- Another comment inside body tag -->
<!-- There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample. -->
</body>
</html>
<!-- This comment is the last line of the file -->
I figured out how to find the doctype and replace it with the tag <doctype>...</doctype>, but the comments are giving me a lot of frustration. I want to replace the HTML comments with <comment>...</comment>. In this example HTML, I was able to replace the first two HTML comments, but not the comments inside the html tag or the last comment after the closing html tag.
Here is my code:
import re
import bs4
from bs4 import BeautifulSoup

file = open("sample.html", "r")
soup = BeautifulSoup(file, "xml")
for child in soup.children:
    # This takes care of the first two HTML comments
    if isinstance(child, bs4.Comment):
        child.replace_with("<comment>" + child.strip() + "</comment>")
    # This should find all nested HTML comments and replace.
    # It looks like it works but the changes are not finalized
    if isinstance(child, bs4.Tag):
        re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)
        re.sub("(-->)|(--&gt;)", "</comment>", child.text, flags=re.MULTILINE)
# The HTML comments should have been replaced but nothing changed.
print(soup.prettify(formatter=None))
This is my first time using BeautifulSoup. How do I use BeautifulSoup to find all HTML comments and replace them with the <comment> tag?
Could I convert it to a byte stream by serializing it with pickle, apply a regex, and then deserialize it back to a BeautifulSoup object? Would this work or just cause more problems?
I tried using pickle on the child tag object, but deserialization fails with TypeError: __new__() missing 1 required positional argument: 'name'.
Then I tried pickling just the text of the tag, via child.text, but deserialization failed with AttributeError: can't set attribute. Basically, child.text is read-only, which explains why the regex doesn't work. So I have no idea how to modify the text.
You have a couple of problems:
1. You can't modify child.text. It's a read-only property that just calls get_text() behind the scenes, and its result is a brand-new string with no connection to your document.
2. re.sub() doesn't modify anything in place; it returns a new string. Your line

    re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)

   would have had to be

    child.text = re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)

   ... but that wouldn't work anyway, because of point 1.
3. Trying to modify the document by replacing chunks of text in it with a regex is the wrong way to use BeautifulSoup. Instead, you need to find nodes and replace them with other nodes.
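The first two points can be reproduced without BeautifulSoup at all. The sketch below is only an illustration: Node is a hypothetical stand-in class (not part of bs4) whose text is a read-only property, which is the assumption it shares with a real tag.

```python
import re


class Node:
    """Hypothetical stand-in for a tag: `text` is a read-only property."""

    def __init__(self, raw):
        self._raw = raw

    @property
    def text(self):
        return self._raw


node = Node("<!-- note -->")

# re.sub() returns a new string; discarding the result changes nothing.
re.sub("<!--", "<comment>", node.text)
unchanged = node.text

# Capturing the return value gives the substituted string...
updated = re.sub("<!--", "<comment>", node.text)
updated = re.sub("-->", "</comment>", updated)

# ...but it cannot be written back through a property that has no setter.
try:
    node.text = updated
    assignment_failed = False
except AttributeError:
    assignment_failed = True
```

Here the substituted string only ever exists in updated; the node itself never changes, which matches what the question observed.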
Here's a solution that works:
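import bs4

with open("example.html") as f:
    # No parser is named here, so Beautiful Soup picks the best one installed.
    soup = bs4.BeautifulSoup(f)

# Comment nodes are strings, so they are matched through the text filter.
for comment in soup.find_all(text=lambda e: isinstance(e, bs4.Comment)):
    tag = bs4.Tag(name="comment")   # new <comment> element
    tag.string = comment.strip()    # comment text without surrounding whitespace
    comment.replace_with(tag)       # swap the comment node for the element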
This code starts by iterating over the result of a call to find_all(), taking advantage of the fact that we can pass a function as the text argument (newer Beautiful Soup releases name this argument string). In BeautifulSoup, Comment is a subclass of NavigableString, so we search for it as though it were a string, and the lambda ... is just shorthand for, e.g.:
def is_comment(e):
    return isinstance(e, bs4.Comment)

soup.find_all(text=is_comment)
Then we create a new Tag with the appropriate name, set its content to the stripped text of the original comment, and replace the comment with the tag we just created.
Here's the result:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<comment>this is an HTML comment</comment>
<comment>this is another HTML comment</comment>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
...
<comment>here is a comment inside the head tag</comment>
</head>
<body>
...
<comment>Comment inside body tag</comment>
<comment>Another comment inside body tag</comment>
<comment>There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample.</comment>
</body>
</html>
<comment>This comment is the last line of the file</comment>
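To try the node-replacement approach without a file on disk, it can be wrapped in a small function that works on a string. This is a minimal sketch under a few assumptions: replace_comments and the sample markup are made up for illustration, the parser is pinned to "html.parser" (the answer above does not name one), and the filter is passed as string, the newer name for the text argument.

```python
import bs4


def replace_comments(markup: str) -> str:
    """Return markup with each HTML comment turned into a <comment> element."""
    # Assumption: the stdlib-backed "html.parser" is good enough for the input.
    soup = bs4.BeautifulSoup(markup, "html.parser")
    # find_all() returns a list, so replacing nodes while looping is safe.
    for comment in soup.find_all(string=lambda s: isinstance(s, bs4.Comment)):
        tag = soup.new_tag("comment")   # element created by the soup itself
        tag.string = comment.strip()    # comment text without padding
        comment.replace_with(tag)       # swap the node inside the tree
    return str(soup)


# Hypothetical sample with comments at the top level and nested in tags.
sample = (
    "<!-- top -->"
    "<html><head><!-- in head --></head>"
    "<body><p>text</p><!-- in body --></body></html>"
    "<!-- last -->"
)
converted = replace_comments(sample)
print(converted)
```

Because find_all() searches the whole tree, the nested comments and the one after the closing html tag are handled by the same loop as the top-level ones. Different parsers may arrange content outside the html element differently, so the serialized output is worth checking against the parser actually in use.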