I hope I'm just missing a parameter, and I'm looking forward to your help. I want to get all tags from a piece of HTML, including the closing tags. I'm analyzing the ordering of HTML tags across thousands of pages, so I need to extract both opening and closing tags in the order they appear on the page.
Snippet of my code so far:
from bs4 import BeautifulSoup

data = '<h1>Overview</h1> <p>Several methods can be used...</p><ol><li>hello world</li></ol><br>'
tags = []
soup = BeautifulSoup(data, "html.parser")
# find_all() with no arguments yields every element in document order
for tag in soup.find_all():
    tags.append(tag.name)
tag_string = '-'.join(tags)
print(tags)
print(tag_string)
Current output:
['h1', 'p', 'ol', 'li', 'br']
h1-p-ol-li-br
Desired output (show the closing tag so I can see it is in the correct order):
['h1', '/h1', 'p', '/p', 'ol', 'li', '/li', '/ol', 'br']
h1-/h1-p-/p-ol-li-/li-/ol-br
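Python's built-in html.parser module reports start tags and end tags as it encounters them, so you can record both in order: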
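from html.parser import HTMLParser

tagsOrder = []  # tag names in the order the parser encounters them

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Called for every opening tag, e.g. <h1>
        tagsOrder.append(tag)

    def handle_endtag(self, tag):
        # Called for every closing tag, e.g. </h1>
        tagsOrder.append("/" + tag)

parser = MyHTMLParser()
# feed() returns None, so call it directly instead of printing its result
parser.feed('<h1>Overview</h1> <p>Several methods can be used...</p><ol><li>hello world</li></ol><br>')
print(tagsOrder)
print('-'.join(tagsOrder))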
Result:
['h1', '/h1', 'p', '/p', 'ol', 'li', '/li', '/ol', 'br']
h1-/h1-p-/p-ol-li-/li-/ol-br
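Since the question mentions thousands of pages, it may be more convenient to keep the tag list on the parser instance rather than in a module-level list, so that each page gets a fresh result. The following is only a sketch of that idea; the names TagOrderParser and tag_sequence are hypothetical and not part of the original answer or of html.parser itself.

```python
from html.parser import HTMLParser


class TagOrderParser(HTMLParser):
    """Collects opening and closing tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []  # per-instance, so results from different pages never mix

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def tag_sequence(html):
    """Hypothetical helper: return the tag sequence for one HTML string."""
    parser = TagOrderParser()
    parser.feed(html)
    parser.close()  # flush any input still buffered by the parser
    return parser.tags


data = '<h1>Overview</h1> <p>Several methods can be used...</p><ol><li>hello world</li></ol><br>'
tags = tag_sequence(data)
print(tags)            # ['h1', '/h1', 'p', '/p', 'ol', 'li', '/li', '/ol', 'br']
print('-'.join(tags))  # h1-/h1-p-/p-ol-li-/li-/ol-br
```

Note that a self-closing tag written as <br/> is reported as 'br' followed by '/br', because the default handle_startendtag() calls both handlers. If that matters for your analysis, you would need to override handle_startendtag() as well.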
For more information, see the "Example HTML Parser Application" section of the official Python html.parser documentation.