Including HTML closing tags in the output of BeautifulSoup soup.find_all

Question

I hope I'm just missing a parameter, and I look forward to your help. I want to get all tags from a piece of HTML, including the closing tags. I'm analyzing the ordering of HTML tags across thousands of pages, so I need to extract both opening and closing tags in the order they appear on the page.

Snippet of my code so far:

from bs4 import BeautifulSoup

data = '<h1>Overview</h1> <p>Several methods can be used...</p><ol><li>hello world</li></ol><br>'
tags = []
soup = BeautifulSoup(data, "html.parser")

# find_all() with no arguments returns every element, one entry per element
for tag in soup.find_all():
    tags.append(tag.name)

tag_string = '-'.join(tags)
print(tags)
print(tag_string)

Current output:

['h1', 'p', 'ol', 'li', 'br']

h1-p-ol-li-br

Desired output (showing the closing tags, so I can see they are in the correct order):

['h1', '/h1', 'p', '/p', 'ol', 'li', '/li', '/ol', 'br']
h1-/h1-p-/p-ol-li-/li-/ol-br

Answer 1

This should help:

from html.parser import HTMLParser

tagsOrder = []

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        tagsOrder.append(tag)

    def handle_endtag(self, tag):
        tagsOrder.append("/"+tag)

parser = MyHTMLParser()

# feed() returns None, so call it directly instead of printing its result
parser.feed('<h1>Overview</h1> <p>Several methods can be used...</p><ol><li>hello world</li></ol><br>')
print(tagsOrder)
print('-'.join(tagsOrder))

Result

['h1', '/h1', 'p', '/p', 'ol', 'li', '/li', '/ol', 'br']
h1-/h1-p-/p-ol-li-/li-/ol-br

For more information, see the "Example HTML Parser Application" section of the official html.parser documentation.
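As a possible variation on the approach above, the tag list can live on the parser instance instead of in a module-level variable, which may matter when running this over thousands of pages. The sketch below is not from the original answer: the names TagOrderParser and tag_order are hypothetical, and it assumes that self-closing syntax such as <br/> should be recorded once, the same way a bare <br> is.

```python
from html.parser import HTMLParser


class TagOrderParser(HTMLParser):
    """Collects opening and closing tag names in document order.

    TagOrderParser is a hypothetical name used only for this sketch.
    """

    def __init__(self):
        super().__init__()
        self.tags = []  # per-instance list, so parsers do not share state

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)

    def handle_startendtag(self, tag, attrs):
        # Assumption: treat <br/> like a bare <br> and record it once.
        # The default implementation would call both handlers above.
        self.tags.append(tag)


def tag_order(html):
    """Return the tag sequence for one HTML string (hypothetical helper)."""
    parser = TagOrderParser()  # fresh parser per document
    parser.feed(html)
    parser.close()
    return parser.tags


data = '<h1>Overview</h1> <p>Several methods can be used...</p><ol><li>hello world</li></ol><br>'
tags = tag_order(data)
print(tags)
print('-'.join(tags))
```

Because html.parser reports tags as they appear in the source rather than repairing the markup, an unclosed tag in the input simply produces no closing entry, which is probably what you want for an ordering analysis.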

Solution 1
1 accepted 2020-02-21 14:06:35
