Include the HTML closing tags in BeautifulSoup soup.find_all() output

I hope I'm just missing a parameter, and I'm looking forward to your help. I want to get all tags from a piece of HTML, including the closing tags. I'm analysing the ordering of HTML tags across thousands of pages, so I need to extract both opening and closing tags in the order they appear on the page.

Snippet of my code so far:

from bs4 import BeautifulSoup

data = '<h1>Overview</h1> <p>Several methods can be used...</p><ol><li>hello world</li></ol><br>'
tags = []
soup = BeautifulSoup(data, "html.parser")
for tag in soup.find_all():
    tags.append(tag.name)
tag_string = '-'.join(tags)
print(tags)
print(tag_string)

Current output:

['h1', 'p', 'ol', 'li', 'br']

h1-p-ol-li-br

Desired output (show the closing tag so I can see it is in the correct order):

['h1', '/h1', 'p', '/p', 'ol', 'li', '/li', '/ol', 'br']
h1-/h1-p-/p-ol-li-/li-/ol-br

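This should help. Python's built-in html.parser.HTMLParser calls handle_starttag and handle_endtag separately as it reads the markup, so you can record both kinds of tag in the order they appear: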

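from html.parser import HTMLParser

tagsOrder = []  # tag names in the order the parser encounters them

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # called for each opening tag, e.g. <h1>
        tagsOrder.append(tag)

    def handle_endtag(self, tag):
        # called for each closing tag, e.g. </h1>
        tagsOrder.append("/" + tag)

parser = MyHTMLParser()

# feed() returns None, so call it directly instead of printing its result
parser.feed('<h1>Overview</h1> <p>Several methods can be used...</p><ol><li>hello world</li></ol><br>')
print(tagsOrder)
print('-'.join(tagsOrder))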

Result

['h1', '/h1', 'p', '/p', 'ol', 'li', '/li', '/ol', 'br']
h1-/h1-p-/p-ol-li-/li-/ol-br
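
The parser above collects tags in a module-level list, so the results from successive pages would run together when processing thousands of documents. A minimal sketch of a reusable variant follows, assuming one fresh parser per document. The names TagOrderParser and tag_sequence are illustrative and not part of the original answer.

```python
from html.parser import HTMLParser


class TagOrderParser(HTMLParser):
    """Records opening and closing tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []  # per-instance list, so pages do not share state

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def tag_sequence(html):
    """Return the ordered tag names found in one HTML string."""
    parser = TagOrderParser()
    parser.feed(html)
    parser.close()  # process any input still buffered by the parser
    return parser.tags


data = '<h1>Overview</h1> <p>Several methods can be used...</p><ol><li>hello world</li></ol><br>'
print('-'.join(tag_sequence(data)))
```

Calling tag_sequence once per page gives an independent list each time, which could then be joined with '-' as in the question.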

For more information, see the "Example HTML Parser Application" section of the official html.parser documentation.
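
One detail from that documentation matters for tag-order analysis: an XHTML-style tag such as <br/> is reported through handle_startendtag, whose default implementation calls handle_starttag and then handle_endtag. With the parser above, <br/> would therefore be recorded as both br and /br, while a plain <br> gives only br. The sketch below assumes you want both spellings recorded once; the class name VoidAwareTagParser is hypothetical.

```python
from html.parser import HTMLParser


class VoidAwareTagParser(HTMLParser):
    """Records tag order, treating <br/> the same as <br>."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)

    def handle_startendtag(self, tag, attrs):
        # Overrides the default, which would call handle_starttag and
        # handle_endtag; here the tag is recorded once, without a closing entry.
        self.tags.append(tag)


parser = VoidAwareTagParser()
parser.feed('<p>line one<br/>line two<br></p>')
parser.close()
print(parser.tags)  # ['p', 'br', 'br', '/p']
```

If you would rather keep the distinction between the two spellings, leave handle_startendtag alone or append a different marker there.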
