Beautiful Soup: iterating over HTML tags

Question

I have the following HTML:

<section>
    <section>
        <h2>Title1</h2>
        <p>Text1</p>
        <p>Text1</p>
    </section>
    <section>
        <h2>Title2</h2>
        <p>Text2</p>
        <p>Text2</p>
    </section>
    <section>
        <h2>Title3</h2>
        <p>Text3</p>
        <p>Text3</p>
    </section>
</section>
<section>
    <h2>Title2-1</h2>
    <p>Text2-1</p>
    <p>Text2-1</p>
</section>
<section>
    <h2>Title3-1</h2>
    <p>Text3-1</p>
    <p>Text3-1</p>
</section>

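Some sections contain subsections and some do not. I want to get both the subsections and the sections that have no subsections. I am trying to iterate over these subsections so that I can build an index in Scrapy. This is my Scrapy code: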

class RUSpider(BaseSpider):
    name = "ru"
    allowed_domains = ["http://127.0.0.1:8000/"]
    start_urls = [
        "http://127.0.0.1:8000/week2/1_am/#/",
        "http://127.0.0.1:8000/week1/1/",
        "http://127.0.0.1:8000/week3/1_am/"
    ]
    rules = [
        Rule(SgmlLinkExtractor(), follow=True)
    ]

    def parse(self, response):
        filename = response.url.split("/")[3]
        hxs = HtmlXPathSelector(response)
        divs = hxs.select('//div')
        sections = divs.select('//section').extract()
        # print sections.extract
        # class definition for scrapy and html selector
        for each in sections:  # iterate over loop [above sections]
            soup = BeautifulSoup(each)
            sp = soup.prettify()
            elements = soup.findAll("section".split())
            print len(elements), 'sublength'
            if len(elements) > 1:
                for element in elements:
                    for subelement in element:
                        print subelement, 'element'
            else:
                item = RItem()  # create Index Item
                item['html_content'] = each
                print each
                yield item

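Some of the results come out in the right format, but some sections that contain no subsections are broken up into individual elements.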

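I want each section on its own. What I mean is that one section contains other sections. I want to iterate over those inner sections and get them one by one so that I can keep track of them. Sections that have no subsections do not need to be iterated over.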

Is there a better way to do this in BeautifulSoup? I want the following output:

<section>
    <h2>Title1</h2>
    <p>Text1</p>
    <p>Text1</p>
</section>
<section>
    <h2>Title2</h2>
    <p>Text2</p>
    <p>Text2</p>
</section>
<section>
    <h2>Title3</h2>
    <p>Text3</p>
    <p>Text3</p>
</section>
<section>
    <h2>Title2-1</h2>
    <p>Text2-1</p>
    <p>Text2-1</p>
</section>
<section>
    <h2>Title3-1</h2>
    <p>Text3-1</p>
    <p>Text3-1</p>
</section>

Answer 1

Try this approach. It is a generic one for the data you provided.

data = """
<section>
    <section>
        <h2>Title1</h2>
        <p>Text1</p>
        <p>Text1</p>
     </section>
  <section>
        <h2>Title2</h2>
        <p>Text2</p>
        <p>Text2</p>
     </section>
  <section>
        <h2>Title3</h2>
        <p>Text3</p>
        <p>Text3</p>
     </section>
  </section>
<section>
        <h2>Title2-1</h2>
        <p>Text2-1</p>
        <p>Text2-1</p>
</section>
<section>
        <h2>Title3-1</h2>
        <p>Text3-1</p>
        <p>Text3-1</p>
</section>
"""
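from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(data, "html.parser")

sections = soup.find_all('section')

for each in sections:
    # Skip any section that contains a nested section; only leaf sections are printed.
    if each.find('section'):
        continue
    print(each.prettify())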
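
If the goal is to feed the leaf sections into an index rather than print them, the same check can be wrapped in a small helper that returns a list. This is only a sketch of that idea: the function name `leaf_sections` and the shortened `sample` markup are made up for illustration and are not part of the original code.

```python
from bs4 import BeautifulSoup


def leaf_sections(html):
    """Return the HTML of every <section> that has no nested <section>.

    `leaf_sections` is a hypothetical helper name used only for this sketch.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Tag.find() searches descendants only, so a section never matches itself.
    return [str(s) for s in soup.find_all("section") if s.find("section") is None]


# Shortened sample markup (an assumption, modelled on the data in the question).
sample = """
<section>
  <section><h2>Title1</h2><p>Text1</p></section>
  <section><h2>Title2</h2><p>Text2</p></section>
</section>
<section><h2>Title2-1</h2><p>Text2-1</p></section>
"""

for fragment in leaf_sections(sample):
    print(fragment)
```

In the spider from the question, each returned string could presumably be assigned to `item['html_content']` and yielded, so that wrapper sections and standalone sections go through the same code path.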

Solution 1
3, accepted, 2014-10-29 08:49:59