
Beautiful Soup: iterating over HTML tags

I have the following HTML:

<section>
    <section>
        <h2>Title1</h2>
        <p>Text1</p>
        <p>Text1</p>
    </section>
    <section>
        <h2>Title2</h2>
        <p>Text2</p>
        <p>Text2</p>
    </section>
    <section>
        <h2>Title3</h2>
        <p>Text3</p>
        <p>Text3</p>
    </section>
</section>
<section>
    <h2>Title2-1</h2>
    <p>Text2-1</p>
    <p>Text2-1</p>
</section>
<section>
    <h2>Title3-1</h2>
    <p>Text3-1</p>
    <p>Text3-1</p>
</section>

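As you can see, some sections contain subsections and some do not. I want to get the subsections as well as the sections that have no subsections. I am trying to iterate over these subsections so that I can create an index in Scrapy. This is my Scrapy code: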

class RUSpider(BaseSpider):
    name = "ru"
    allowed_domains = ["http://127.0.0.1:8000/"]
    start_urls = [
        "http://127.0.0.1:8000/week2/1_am/#/",
        "http://127.0.0.1:8000/week1/1/",
        "http://127.0.0.1:8000/week3/1_am/"
    ]
    rules = [
        Rule(SgmlLinkExtractor(), follow=True)
    ]

    def parse(self, response):
        filename = response.url.split("/")[3]
        hxs = HtmlXPathSelector(response)
        divs = hxs.select('//div')
        sections = divs.select('//section').extract()
        # print sections.extract
        # class definition for scrapy and html selector
        for each in sections:  # iterate over loop [above sections]
            soup = BeautifulSoup(each)
            sp = soup.prettify()
            elements = soup.findAll("section".split())
            print len(elements), 'sublength'
            if len(elements) > 1:
                for element in elements:
                    for subelement in element:
                        print subelement, 'element'
            else:
                item = RItem()  # create Index Item
                item['html_content'] = each
                print each
                yield item

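Some of the output comes out in the correct format, but some of the sections that do not contain subsections are broken up into individual elements.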

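I want each section separately. What I mean is that one section contains other sections, and I want to iterate over those and get them one by one so that I can keep track of them. Sections that have no subsections do not need to be iterated over.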

Is there a better way to do this in BeautifulSoup? I want the following output:

<section>
    <h2>Title1</h2>
    <p>Text1</p>
    <p>Text1</p>
</section>
<section>
    <h2>Title2</h2>
    <p>Text2</p>
    <p>Text2</p>
</section>
<section>
    <h2>Title3</h2>
    <p>Text3</p>
    <p>Text3</p>
</section>
<section>
    <h2>Title2-1</h2>
    <p>Text2-1</p>
    <p>Text2-1</p>
</section>
<section>
    <h2>Title3-1</h2>
    <p>Text3-1</p>
    <p>Text3-1</p>
</section>

Try this approach. It works generically for the data you provided.

data = """
<section>
    <section>
        <h2>Title1</h2>
        <p>Text1</p>
        <p>Text1</p>
     </section>
  <section>
        <h2>Title2</h2>
        <p>Text2</p>
        <p>Text2</p>
     </section>
  <section>
        <h2>Title3</h2>
        <p>Text3</p>
        <p>Text3</p>
     </section>
  </section>
<section>
        <h2>Title2-1</h2>
        <p>Text2-1</p>
        <p>Text2-1</p>
</section>
<section>
        <h2>Title3-1</h2>
        <p>Text3-1</p>
        <p>Text3-1</p>
</section>
"""
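from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, "html.parser")

# find_all() searches recursively, so it returns outer and nested sections alike
sections = soup.find_all('section')

for each in sections:
    # Skip any section that still contains a nested section
    if each.find('section'):
        continue
    print(each.prettify())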
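
The filter above can also be wrapped in a small helper that returns the innermost sections as a list, which may be easier to feed into an item pipeline. This is only a sketch under a few assumptions: Python 3, the built-in "html.parser" backend, and the names `leaf_sections` and `sample`, which are hypothetical and not part of the original question or answer.

```python
from bs4 import BeautifulSoup


def leaf_sections(html):
    """Return every <section> tag that has no <section> nested inside it."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() is recursive, so wrapper sections and nested sections
    # are both returned; keep only those without a nested <section>.
    return [s for s in soup.find_all("section") if s.find("section") is None]


# Shortened, hypothetical version of the HTML from the question
sample = """
<section>
  <section><h2>Title1</h2><p>Text1</p></section>
  <section><h2>Title2</h2><p>Text2</p></section>
</section>
<section><h2>Title2-1</h2><p>Text2-1</p></section>
"""

titles = [s.h2.get_text() for s in leaf_sections(sample)]
print(titles)
```

The wrapper section is dropped because `find("section")` finds a descendant inside it, while the two nested sections and the standalone one are kept in document order. Each returned tag can be converted back to markup with `str(tag)` if the raw HTML is what needs to be stored.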

