Web scraping data in order using Python (BeautifulSoup)
How can I use BeautifulSoup to scrape data from an HTML page with the following content?
<div class="accordion-item accordion-item-active">
<p class="accordion-title">
<a href="javascript:void(0)"><span class="accordion-toggle"></span> What different payment modes are available to purchase SysTools products?</a>
</p>
<div class="accordion-content" style="display: block;">
<div>
<p> There are various payment methods available for purchasing SysTools products:</p>
<ul class="list-with-icons list-icons-right-open">
<li>Credit Card/Debit Card</li>
<li>PayPal Account</li>
<li>Pay with Amazon</li>
<li>Purchase Order</li>
<li>Wire Transfer</li>
<li>eCheque Payment</li>
</ul>
<p>We accept all major cards such as MasterCard, VISA, Maestro Card, American Express, etc.</p>
</div>
</div>
</div>
The div above repeats with different data, and some of those divs do not contain ul/li tags, only some p tags. I can of course scrape the p tags and the ul/li tags separately, but I want to scrape the whole div in order: first the title p tag, then the other p tags, then the list tags, and then iterate over the other div tags (which have the same format).
Your question is not very clear, but I think I've figured out what you mean. So here is a function that does what you need!
from bs4 import BeautifulSoup

def process_div(div_tag):
    """
    Parameter: a div tag.
    Returns: lists of p, ul and li tags.
    """
    p_tags = [div_tag.find('p', class_='accordion-title'),
              div_tag.find('div', class_='accordion-content').find_all('p')]
    ul_tag = div_tag.find('div', class_='accordion-content').find('ul')
    # Some divs have no ul at all, so guard against ul_tag being None
    li_tags = ul_tag.find_all('li') if ul_tag else []
    return p_tags, ul_tag, li_tags
It will return the contents of the div tag in order. When you use it on a larger html tree, it may be useful to add try/except statements here as well. Here is my small demo:
html = """
<div class="accordion-item accordion-item-active">
<p class="accordion-title">
<a href="javascript:void(0)"><span class="accordion-toggle"></span> What different payment modes are available to purchase SysTools products?</a>
</p>
<div class="accordion-content" style="display: block;">
<div>
<p> There are various payment methods available for purchasing SysTools products:</p>
<ul class="list-with-icons list-icons-right-open">
<li>Credit Card/Debit Card</li>
<li>PayPal Account</li>
<li>Pay with Amazon</li>
<li>Purchase Order</li>
<li>Wire Transfer</li>
<li>eCheque Payment</li>
</ul>
<p>We accept all major cards such as MasterCard, VISA, Maestro Card, American Express, etc.</p>
</div>
</div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
Then,
p_tags, ul_tag, li_tags = process_div(soup.div)
print(p_tags)
[<p class="accordion-title">\n<a href="javascript:void(0)"><span class="accordion-toggle"></span> What different payment modes are available to purchase SysTools products?</a>\n</p>,
[<p> There are various payment methods available for purchasing SysTools products:</p>,
<p>We accept all major cards such as MasterCard, VISA, Maestro Card, American Express, etc.</p>]]
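Since the asker also wants to iterate over every such div in order, including divs that have no ul/li at all, here is a minimal sketch of that loop. It assumes BeautifulSoup 4 is installed; `scrape_accordion` is a hypothetical helper name, not from the original answer:

```python
from bs4 import BeautifulSoup

def scrape_accordion(soup):
    """Walk every accordion div in document order and collect its
    title, paragraphs and (possibly empty) list items."""
    results = []
    # class_ matches divs whose class attribute *contains* 'accordion-item',
    # so 'accordion-item accordion-item-active' matches too
    for div in soup.find_all('div', class_='accordion-item'):
        title = div.find('p', class_='accordion-title').get_text(strip=True)
        content = div.find('div', class_='accordion-content')
        paragraphs = [p.get_text(strip=True) for p in content.find_all('p')]
        ul = content.find('ul')  # some divs contain no list at all
        items = [li.get_text(strip=True) for li in ul.find_all('li')] if ul else []
        results.append({'title': title, 'paragraphs': paragraphs, 'items': items})
    return results
```

Because `find_all` returns tags in document order, the title, paragraphs and list items come out in the same sequence they appear on the page, and divs without a ul simply produce an empty `items` list.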