Beautiful Soup: iterate over HTML tags
I have the following HTML:
<section>
    <section>
        <h2>Title1</h2>
        <p>Text1</p>
        <p>Text1</p>
    </section>
    <section>
        <h2>Title2</h2>
        <p>Text2</p>
        <p>Text2</p>
    </section>
    <section>
        <h2>Title3</h2>
        <p>Text3</p>
        <p>Text3</p>
    </section>
</section>
<section>
    <h2>Title2-1</h2>
    <p>Text2-1</p>
    <p>Text2-1</p>
</section>
<section>
    <h2>Title3-1</h2>
    <p>Text3-1</p>
    <p>Text3-1</p>
</section>
class RUSpider(BaseSpider):
    name = "ru"
    allowed_domains = ["http://127.0.0.1:8000/"]
    start_urls = [
        "http://127.0.0.1:8000/week2/1_am/#/",
        "http://127.0.0.1:8000/week1/1/",
        "http://127.0.0.1:8000/week3/1_am/"
    ]
    rules = [
        Rule(SgmlLinkExtractor(), follow=True)
    ]

    def parse(self, response):
        filename = response.url.split("/")[3]
        hxs = HtmlXPathSelector(response)
        divs = hxs.select('//div')
        sections = divs.select('//section').extract()
        # print sections.extract
        # class definition for scrapy and html selector
        for each in sections:  # iterate over loop [above sections]
            soup = BeautifulSoup(each)
            sp = soup.prettify()
            elements = soup.findAll("section".split())
            print len(elements), 'sublength'
            if len(elements) > 1:
                for element in elements:
                    for subelement in element:
                        print subelement, 'element'
            else:
                item = RItem()  # create Index Item
                item['html_content'] = each
                print each
                yield item
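Some of the results come out in the right format, but some sections that contain no subsections are broken up into individual elements.
I want each section on its own. Because one section contains other sections, I want to iterate over those nested sections and get them one by one so that I can keep track of them. Sections that have no subsections do not need to be iterated over.
Is there a better way to do this in BeautifulSoup? I want the following output: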
<section>
    <h2>Title1</h2>
    <p>Text1</p>
    <p>Text1</p>
</section>
<section>
    <h2>Title2</h2>
    <p>Text2</p>
    <p>Text2</p>
</section>
<section>
    <h2>Title3</h2>
    <p>Text3</p>
    <p>Text3</p>
</section>
<section>
    <h2>Title2-1</h2>
    <p>Text2-1</p>
    <p>Text2-1</p>
</section>
<section>
    <h2>Title3-1</h2>
    <p>Text3-1</p>
    <p>Text3-1</p>
</section>
Try this approach. It is generic for the data you provided.
data = """
<section>
<section>
<h2>Title1</h2>
<p>Text1</p>
<p>Text1</p>
</section>
<section>
<h2>Title2</h2>
<p>Text2</p>
<p>Text2</p>
</section>
<section>
<h2>Title3</h2>
<p>Text3</p>
<p>Text3</p>
</section>
</section>
<section>
<h2>Title2-1</h2>
<p>Text2-1</p>
<p>Text2-1</p>
</section>
<section>
<h2>Title3-1</h2>
<p>Text3-1</p>
<p>Text3-1</p>
</section>
"""
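from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(data, "html.parser")
sections = soup.find_all('section')
for each in sections:  # iterate over every <section>, nested or not
    if each.find('section'):
        # This section wraps other sections, so skip the wrapper itself.
        continue
    print(each.prettify())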
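The same filter can be wrapped in a small helper so that it is easy to reuse, for example inside the spider's parse method. This is only a minimal sketch: the names `leaf_sections` and `sample` are made up for illustration, and it assumes the built-in "html.parser" backend is acceptable for your markup.

```python
from bs4 import BeautifulSoup


def leaf_sections(html):
    """Return every <section> tag that contains no nested <section>."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        section
        for section in soup.find_all("section")
        # find() returns None when the tag has no <section> descendant
        if section.find("section") is None
    ]


# Shortened, hypothetical sample with the same shape as the data above.
sample = """
<section>
  <section><h2>Title1</h2><p>Text1</p></section>
  <section><h2>Title2</h2><p>Text2</p></section>
</section>
<section><h2>Title2-1</h2><p>Text2-1</p></section>
"""

titles = [section.h2.get_text() for section in leaf_sections(sample)]
print(titles)  # ['Title1', 'Title2', 'Title2-1']
```

The wrapper section is dropped and only the innermost sections come back, in document order. Note that looping directly over a tag, as `for subelement in element` does in the question, walks that tag's direct children, which is why the output there was split into separate h2 and p pieces.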