Include the HTML closing tag from BeautifulSoup soup.find_all output

Question

I hope I'm just missing a parameter, and I'm looking forward to your help. I want to get all tags from a piece of HTML, including the closing tags. I'm analysing the ordering of HTML tags across thousands of HTML pages, so I need to extract both opening and closing tags in the order they appear on the page.

Snippet of my code so far:

from bs4 import BeautifulSoup

data = '<h1>Overview</h1> <p>Several methods can be used...</p><ol><li>hello world</li></ol><br>'

tags = []
soup = BeautifulSoup(data, "html.parser")

# find_all() with no arguments yields every element in document order
for tag in soup.find_all():
    tags.append(tag.name)

tag_string = '-'.join(tags)
print(tags)
print(tag_string)

Current output:

['h1', 'p', 'ol', 'li', 'br']

h1-p-ol-li-br

Desired output (show the closing tag so I can see it is in the correct order):

['h1', '/h1', 'p', '/p', 'ol', 'li', '/li', '/ol', 'br']
h1-/h1-p-/p-ol-li-/li-/ol-br

Answer 1

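find_all() returns one Tag object per element, so closing tags never show up in its output. The standard library's HTMLParser reports start tags and end tags as separate events, in document order, which gives you the sequence you are after: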

from html.parser import HTMLParser

tagsOrder = []

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        tagsOrder.append(tag)

    def handle_endtag(self, tag):
        tagsOrder.append("/"+tag)

parser = MyHTMLParser()

# feed() returns None, so call it for its side effect rather than printing it
parser.feed('<h1>Overview</h1> <p>Several methods can be used...</p><ol><li>hello world</li></ol><br>')
print(tagsOrder)
print('-'.join(tagsOrder))

Result

['h1', '/h1', 'p', '/p', 'ol', 'li', '/li', '/ol', 'br']
h1-/h1-p-/p-ol-li-/li-/ol-br

For more information, see the "Example HTML Parser Application" section of the official html.parser documentation.
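Since the question mentions thousands of pages, here is a minimal sketch of the same idea with the tag list kept on the parser instance instead of in a module-level list, so results from different pages do not mix. The names TagOrderParser and tag_sequence are illustrative and not part of the original answer.

```python
from html.parser import HTMLParser


class TagOrderParser(HTMLParser):
    """Collects opening and closing tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []  # per-instance list, so parsers do not share state

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def tag_sequence(html):
    """Return the ordered list of tag names for one HTML string."""
    parser = TagOrderParser()  # fresh parser for every page
    parser.feed(html)
    parser.close()  # flush any buffered input
    return parser.tags


# Hypothetical sample pages; in practice these would be your downloaded HTML
pages = [
    "<h1>Overview</h1> <p>Several methods can be used...</p>",
    "<ol><li>hello world</li></ol><br>",
]
for page in pages:
    print("-".join(tag_sequence(page)))
```

Note that an XHTML-style self-closing tag such as `<br/>` would be recorded as both 'br' and '/br', because HTMLParser's default handle_startendtag calls both handlers. Override handle_startendtag if you want those counted differently.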

1 answer

solution1
1 ACCEPTED 2020-02-21 14:06:35
