How can I scrape data using BeautifulSoup from an HTML page that contains the following?
<div class="accordion-item accordion-item-active">
<p class="accordion-title">
<a href="javascript:void(0)"><span class="accordion-toggle"></span> What different payment modes are available to purchase SysTools products?</a>
</p>
<div class="accordion-content" style="display: block;">
<div>
<p> There are various payment methods available for purchasing SysTools products:</p>
<ul class="list-with-icons list-icons-right-open">
<li>Credit Card/Debit Card</li>
<li>PayPal Account</li>
<li>Pay with Amazon</li>
<li>Purchase Order</li>
<li>Wire Transfer</li>
<li>eCheque Payment</li>
</ul>
<p>We accept all major cards such as MasterCard, VISA, Maestro Card, American Express, etc.</p>
</div>
</div>
</div>
The above "div" is repeated with different data; a few of the divs contain no "ul" or "li" tags, only a few "p" tags. I can, of course, scrape the "p" tags and the "ul"/"li" tags separately. But I want to scrape each "div" in order: first the "p" tag, then the next "p" tag, then the list tags, and then repeat this for the other "div" tags (which have the same format).
Your question is not entirely clear, but I think I understand it. Here is a function that might fit your needs:
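def process_div(div_tag):
    """
    Parameter: a div tag with class "accordion-item".
    Returns: the p tags (title, then the list of content p tags), the ul tag and its li tags.
    """
    content = div_tag.find('div', class_='accordion-content')
    # First the title <p>, then the list of <p> tags inside the content div.
    p_tags = [div_tag.find('p', class_='accordion-title'),
              content.find_all('p')]
    ul_tag = content.find('ul')
    # Some divs have no <ul>; return an empty list instead of failing on None.
    li_tags = ul_tag.find_all('li') if ul_tag is not None else []
    return p_tags, ul_tag, li_tags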
It returns the contents of your div tag in order. Adding a try/except statement may be useful when you run it on larger HTML trees, where some of the expected tags can be missing. Here is a small demo:
from bs4 import BeautifulSoup

html = """
<div class="accordion-item accordion-item-active">
<p class="accordion-title">
<a href="javascript:void(0)"><span class="accordion-toggle"></span> What different payment modes are available to purchase SysTools products?</a>
</p>
<div class="accordion-content" style="display: block;">
<div>
<p> There are various payment methods available for purchasing SysTools products:</p>
<ul class="list-with-icons list-icons-right-open">
<li>Credit Card/Debit Card</li>
<li>PayPal Account</li>
<li>Pay with Amazon</li>
<li>Purchase Order</li>
<li>Wire Transfer</li>
<li>eCheque Payment</li>
</ul>
<p>We accept all major cards such as MasterCard, VISA, Maestro Card, American Express, etc.</p>
</div>
</div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
Then:
p_tags, ul_tag, li_tags = process_div(soup.div)
print(p_tags)
[<p class="accordion-title">\n<a href="javascript:void(0)"><span class="accordion-toggle"></span> What different payment modes are available to purchase SysTools products?</a>\n</p>,
[<p> There are various payment methods available for purchasing SysTools products:</p>,
<p>We accept all major cards such as MasterCard, VISA, Maestro Card, American Express, etc.</p>]]
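To cover the other part of the question (keeping the "p" and "li" tags in page order and repeating that for every div), here is a minimal sketch. It is not part of the original answer: the name extract_items and the shortened sample HTML are hypothetical, and it assumes every repeated block carries the "accordion-item" class as in the snippet above. It relies on find_all accepting a list of tag names and returning matches in document order.

```python
from bs4 import BeautifulSoup


def extract_items(soup):
    """Return one dict per accordion item, with its body blocks in page order.

    Hypothetical helper: assumes each item is a <div class="accordion-item ...">
    holding a <p class="accordion-title"> and a <div class="accordion-content">.
    """
    results = []
    for item in soup.find_all('div', class_='accordion-item'):
        title = item.find('p', class_='accordion-title')
        content = item.find('div', class_='accordion-content')
        blocks = []
        if content is not None:
            # A list of names matches any of them, in document order,
            # so <p> and <li> come out interleaved as they appear on the page.
            for tag in content.find_all(['p', 'li']):
                blocks.append((tag.name, tag.get_text(strip=True)))
        results.append({
            'question': title.get_text(strip=True) if title else None,
            'blocks': blocks,
        })
    return results


# Shortened, made-up sample: one item with a list, one without.
sample = """
<div class="accordion-item accordion-item-active">
<p class="accordion-title"><a href="javascript:void(0)">Q1?</a></p>
<div class="accordion-content"><div>
<p>Intro</p>
<ul><li>A</li><li>B</li></ul>
<p>Outro</p>
</div></div>
</div>
<div class="accordion-item">
<p class="accordion-title"><a href="javascript:void(0)">Q2?</a></p>
<div class="accordion-content"><div><p>Only text</p></div></div>
</div>
"""

items = extract_items(BeautifulSoup(sample, 'html.parser'))
for item in items:
    print(item['question'])
    for name, text in item['blocks']:
        print('  ', name, text)
```

Divs without a "ul" simply produce no "li" entries, so no special case is needed. If a "p" were ever nested inside an "li", its text would appear twice, so this sketch assumes the flat layout shown in the question.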