How to split an HTML page into multiple pages using Python and Beautiful Soup

I have a simple HTML file like this. In fact, I pulled it from a wiki page, removed some HTML attributes, and converted it to this simple HTML page.

<html>
   <body>
      <h1>draw electronics schematics</h1>
      <h2>first header</h2>
      <p>
         <!-- ..some text images -->
      </p>
      <h3>some header</h3>
      <p>
         <!-- ..some image -->
      </p>
      <p>
         <!-- ..some text -->
      </p>
      <h2>second header</h2>
      <p>
         <!-- ..again some text and images -->
      </p>
   </body>
</html>

I read this HTML file using Python and Beautiful Soup like this:

from bs4 import BeautifulSoup

with open("test.html") as f:
    soup = BeautifulSoup(f, "html.parser")

pages = []

What I'd like to do is split this HTML page into two parts. The first part will be between the first header and the second header, and the second part will be between the second header <h2> and the </body> tag. Then I'd like to store them in a list, e.g. pages, so that I can create multiple pages from one HTML page according to its <h2> tags.

Any ideas on how I should do this? Thanks.

Look for the h2 tags, then use .next_sibling to grab everything until you reach another h2 tag:

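from bs4 import BeautifulSoup
from bs4.element import Tag

with open("test.html") as f:
    soup = BeautifulSoup(f, "html.parser")

pages = []
h2tags = soup.find_all('h2')

def next_element(elem):
    # Return the next sibling that is a tag, skipping NavigableString
    # objects (whitespace, comments); return None when siblings run out.
    elem = elem.next_sibling
    while elem is not None and not isinstance(elem, Tag):
        elem = elem.next_sibling
    return elem

for h2tag in h2tags:
    # Each page starts with its h2 and runs up to (not including) the next h2
    page = [str(h2tag)]
    elem = next_element(h2tag)
    while elem is not None and elem.name != 'h2':
        page.append(str(elem))
        elem = next_element(elem)
    pages.append('\n'.join(page))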

Using your sample, this gives:

>>> pages
['<h2>first header</h2>\n<p>\n<!-- ..some text images -->\n</p>\n<h3>some header</h3>\n<p>\n<!-- ..some image -->\n</p>\n<p>\n<!-- ..some text -->\n</p>', '<h2>second header</h2>\n<p>\n<!-- ..again some text and images -->\n</p>']
>>> print(pages[0])
<h2>first header</h2>
<p>
<!-- ..some text images -->
</p>
<h3>some header</h3>
<p>
<!-- ..some image -->
</p>
<p>
<!-- ..some text -->
</p>
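
Since the goal is to create separate pages from the <h2> sections, here is a minimal, self-contained sketch of the same sibling-walking idea that also wraps each fragment in its own HTML document. The names SAMPLE, split_on_h2 and wrap_page and the "Part N" titles are made up for illustration and do not come from the question; the sample markup is a shortened stand-in for the original file. Like the code above, it keeps only tags that are direct siblings of each <h2>, so loose text or comments sitting between those tags are dropped.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for the question's test.html (hypothetical content).
SAMPLE = """<html><body>
<h1>draw electronics schematics</h1>
<h2>first header</h2>
<p>some text</p>
<h3>some header</h3>
<p>more text</p>
<h2>second header</h2>
<p>again some text</p>
</body></html>"""


def split_on_h2(html):
    """Return one HTML fragment per <h2>, each running up to the next <h2>."""
    soup = BeautifulSoup(html, "html.parser")
    fragments = []
    for h2 in soup.find_all("h2"):
        parts = [str(h2)]
        for sibling in h2.next_siblings:
            if not isinstance(sibling, Tag):
                continue  # skip strings and comments between tags
            if sibling.name == "h2":
                break  # the next section starts here
            parts.append(str(sibling))
        fragments.append("\n".join(parts))
    return fragments


def wrap_page(fragment, title):
    """Wrap a fragment in a bare-bones standalone HTML document."""
    return (
        "<html>\n<head><title>{}</title></head>\n<body>\n{}\n</body>\n</html>"
    ).format(title, fragment)


pages = split_on_h2(SAMPLE)
documents = [
    wrap_page(page, "Part {}".format(number))
    for number, page in enumerate(pages, start=1)
]
```

Each entry in documents is a plain string, so it can be written to its own file with an ordinary open(..., "w") call if separate files are what you need.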
