Include the HTML closing tags in BeautifulSoup soup.find_all output

I hope I'm just missing a parameter, and I'm looking forward to your help. I want to get all tags from a piece of HTML, including the closing tags. I'm analysing the ordering of HTML tags across thousands of pages, so I need to extract both opening and closing tags in the order they appear on the page.

Snippet of my code so far:

from bs4 import BeautifulSoup

data = '<h1>Overview</h1> <p>Several methods can be used...</p><ol><li>hello world</li></ol><br>'
tags = []
soup = BeautifulSoup(data, "html.parser")
for tag in soup.find_all():
    tags.append(tag.name)
tag_string = '-'.join(tags)
print(tags)
print(tag_string)

Current output:

['h1', 'p', 'ol', 'li', 'br']

h1-p-ol-li-br

Desired output (showing the closing tags so I can see they are in the correct order):

['h1', '/h1', 'p', '/p', 'ol', 'li', '/li', '/ol', 'br']
h1-/h1-p-/p-ol-li-/li-/ol-br

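This should help you. Python's built-in html.parser.HTMLParser calls a handler for every start tag and every end tag, in document order: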

from html.parser import HTMLParser

tagsOrder = []

class MyHTMLParser(HTMLParser):
    # Called for each opening tag, e.g. <h1>
    def handle_starttag(self, tag, attrs):
        tagsOrder.append(tag)

    # Called for each closing tag, e.g. </h1>
    def handle_endtag(self, tag):
        tagsOrder.append("/"+tag)

parser = MyHTMLParser()

# feed() returns None, so call it directly instead of printing its result
parser.feed('<h1>Overview</h1> <p>Several methods can be used...</p><ol><li>hello world</li></ol><br>')
print(tagsOrder)
print('-'.join(tagsOrder))

Result:

['h1', '/h1', 'p', '/p', 'ol', 'li', '/li', '/ol', 'br']
h1-/h1-p-/p-ol-li-/li-/ol-br
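
Since the question mentions running this over thousands of pages, the same idea can be wrapped so that each page gets its own list instead of sharing one module-level list. The following is only a sketch: the names TagOrderParser and tag_sequence are made up for illustration and are not part of the answer above or of any library.

```python
from html.parser import HTMLParser


class TagOrderParser(HTMLParser):
    """Collects opening and closing tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []  # per-instance list, so parsers do not share state

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def tag_sequence(html):
    """Hypothetical helper: return the tag order for one HTML string."""
    parser = TagOrderParser()
    parser.feed(html)
    parser.close()  # process any input still buffered by the parser
    return parser.tags


data = '<h1>Overview</h1> <p>Several methods can be used...</p><ol><li>hello world</li></ol><br>'
print('-'.join(tag_sequence(data)))
# prints: h1-/h1-p-/p-ol-li-/li-/ol-br
```

One thing to keep in mind: a bare <br> is reported only as 'br', but an XHTML-style <br/> is reported as 'br' followed by '/br', because HTMLParser's default handle_startendtag calls both handlers. Override handle_startendtag if the two forms should be counted the same way.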

For more information, see the "Example HTML Parser Application" section of the official html.parser documentation.
