I have a simple HTML file like the one below. I pulled it from a wiki page, removed some HTML attributes, and reduced it to this simple page.
<html>
<body>
<h1>draw electronics schematics</h1>
<h2>first header</h2>
<p>
<!-- ..some text images -->
</p>
<h3>some header</h3>
<p>
<!-- ..some image -->
</p>
<p>
<!-- ..some text -->
</p>
<h2>second header</h2>
<p>
<!-- ..again some text and images -->
</p>
</body>
</html>
I read this HTML file with Python and Beautiful Soup like this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("test.html"))
pages = []
What I'd like to do is split this HTML page into two parts. The first part runs from the first <h2> header up to the second <h2> header, and the second part runs from the second <h2> header to the closing </body> tag. I'd then like to store the parts in a list, e.g. pages, so that I can create multiple pages from one HTML page according to its <h2> tags.
Any ideas on how I should do this? Thanks.
Look for the h2 tags, then use .next_sibling to grab everything until you reach another h2 tag:
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("test.html"))
pages = []
h2tags = soup.find_all('h2')

def next_element(elem):
    while elem is not None:
        # Find next element, skip NavigableString objects
        elem = elem.next_sibling
        if hasattr(elem, 'name'):
            return elem

for h2tag in h2tags:
    # Each page starts with its own h2 header
    page = [str(h2tag)]
    elem = next_element(h2tag)
    # Collect siblings until the next h2 or the end of the parent
    while elem and elem.name != 'h2':
        page.append(str(elem))
        elem = next_element(elem)
    pages.append('\n'.join(page))
Using your sample, this gives:
>>> pages
['<h2>first header</h2>\n<p>\n<!-- ..some text images -->\n</p>\n<h3>some header</h3>\n<p>\n<!-- ..some image -->\n</p>\n<p>\n<!-- ..some text -->\n</p>', '<h2>second header</h2>\n<p>\n<!-- ..again some text and images -->\n</p>']
>>> print(pages[0])
<h2>first header</h2>
<p>
<!-- ..some text images -->
</p>
<h3>some header</h3>
<p>
<!-- ..some image -->
</p>
<p>
<!-- ..some text -->
</p>
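The same idea can also be sketched as a reusable function. This is only an illustration under a few assumptions: the names `split_on_tag` and `HTML` are made up for the example, the markup is inlined as a string instead of being read from test.html, the built-in "html.parser" backend is passed explicitly, and `isinstance(sibling, Tag)` is used in place of the `hasattr` check to skip text nodes.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical inline sample, shaped like the page in the question.
HTML = """<html>
<body>
<h1>draw electronics schematics</h1>
<h2>first header</h2>
<p>some text</p>
<h3>some header</h3>
<p>more text</p>
<h2>second header</h2>
<p>again some text</p>
</body>
</html>"""


def split_on_tag(soup, tag_name="h2"):
    """Return one HTML string per <tag_name>, each running up to the next one."""
    pages = []
    for header in soup.find_all(tag_name):
        page = [str(header)]
        # next_siblings yields Tag and NavigableString objects alike;
        # keep only the tags and stop at the next header of the same name.
        for sibling in header.next_siblings:
            if not isinstance(sibling, Tag):
                continue
            if sibling.name == tag_name:
                break
            page.append(str(sibling))
        pages.append("\n".join(page))
    return pages


soup = BeautifulSoup(HTML, "html.parser")
pages = split_on_tag(soup)
print(len(pages))
```

Note that, like the loop above, this drops anything before the first h2 (here the h1) and any text or comment nodes that sit directly between the sibling tags; content nested inside the kept tags is preserved.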