
Web scraping data in order using Python (BeautifulSoup)

How can I use BeautifulSoup to scrape data from an HTML page that contains the following markup?

<div class="accordion-item accordion-item-active">
      <p class="accordion-title">
        <a href="javascript:void(0)"><span class="accordion-toggle"></span> What different payment modes are available to purchase SysTools products?</a>
      </p>
      <div class="accordion-content" style="display: block;">
        <div>
          <p> There are various payment methods available for purchasing SysTools products:</p>
          <ul class="list-with-icons list-icons-right-open">
            <li>Credit Card/Debit Card</li>
            <li>PayPal Account</li>
            <li>Pay with Amazon</li>
            <li>Purchase Order</li>
            <li>Wire Transfer</li>
            <li>eCheque Payment</li>
          </ul>
          <p>We accept all major cards such as MasterCard, VISA, Maestro Card, American Express, etc.</p>
        </div>
      </div>
    </div>

The above "div" is repeated with different data; a few of the divs contain no "ul"/"li" tags, only a few "p" tags. I can, of course, scrape the "p" tags and the "ul"/"li" tags separately. But I want to scrape the entire "div" in order: first the "p" tag, then the next "p" tag, then the list tags, and then repeat this for the other "div" tags (which have the same format).

Your question is not very clear, but I think I understand it. Here is a function that might fit your needs:

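def process_div(div_tag):
    """
    Parameter: an accordion-item div tag.
    Returns: the p tags, the ul tag (or None) and the list of li tags.
    """
    # Look up the content div once instead of searching for it twice.
    content = div_tag.find('div', class_='accordion-content')
    # First the title paragraph, then the list of paragraphs in the content.
    p_tags = [div_tag.find('p', class_='accordion-title'),
              content.find_all('p')]
    # Some divs have no <ul>; find() returns None in that case.
    ul_tag = content.find('ul')
    li_tags = ul_tag.find_all('li') if ul_tag is not None else []
    return p_tags, ul_tag, li_tags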

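It returns the title p tag, the content p tags, the ul tag and its li tags, in that order. On larger HTML trees some divs may be missing one of these elements; find then returns None and a chained call on it raises AttributeError, so a None check or a try/except statement around the call can be useful. Here is my small demo: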

from bs4 import BeautifulSoup

html = """
<div class="accordion-item accordion-item-active">
      <p class="accordion-title">
        <a href="javascript:void(0)"><span class="accordion-toggle"></span> What different payment modes are available to purchase SysTools products?</a>
      </p>
      <div class="accordion-content" style="display: block;">
        <div>
          <p> There are various payment methods available for purchasing SysTools products:</p>
          <ul class="list-with-icons list-icons-right-open">
            <li>Credit Card/Debit Card</li>
            <li>PayPal Account</li>
            <li>Pay with Amazon</li>
            <li>Purchase Order</li>
            <li>Wire Transfer</li>
            <li>eCheque Payment</li>
          </ul>
          <p>We accept all major cards such as MasterCard, VISA, Maestro Card, American Express, etc.</p>
        </div>
      </div>
    </div>
"""
soup = BeautifulSoup(html, 'html.parser') 

Then:

p_tags, ul_tag, li_tags = process_div(soup.div)
print(p_tags)
[<p class="accordion-title">\n<a href="javascript:void(0)"><span class="accordion-toggle"></span> What different payment modes are available to purchase SysTools products?</a>\n</p>,
 [<p> There are various payment methods available for purchasing SysTools products:</p>,
  <p>We accept all major cards such as MasterCard, VISA, Maestro Card, American Express, etc.</p>]]
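
If the goal is to get the tags in the order they appear on the page, one possible approach is to pass a list of tag names to find_all, which returns matches in document order. This is only a sketch and not part of the answer above: the helper name texts_in_order and the shortened sample markup are made up for illustration, and the second item stands in for a div that has no "ul".

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical markup in the same shape as the question's HTML.
sample = """
<div class="accordion-item accordion-item-active">
  <p class="accordion-title"><a href="javascript:void(0)">Question one?</a></p>
  <div class="accordion-content"><div>
    <p>Intro:</p>
    <ul><li>First</li><li>Second</li></ul>
    <p>Outro.</p>
  </div></div>
</div>
<div class="accordion-item">
  <p class="accordion-title"><a href="javascript:void(0)">Question two?</a></p>
  <div class="accordion-content"><div><p>Only a paragraph.</p></div></div>
</div>
"""


def texts_in_order(item_div):
    """Return (tag name, text) pairs for the p and li tags of one item, in document order."""
    # find_all with a list of names matches any of them, in the order they occur.
    return [(tag.name, tag.get_text(strip=True))
            for tag in item_div.find_all(['p', 'li'])]


soup = BeautifulSoup(sample, 'html.parser')
# class_ matches any single class, so "accordion-item accordion-item-active" is found too.
results = [texts_in_order(div)
           for div in soup.find_all('div', class_='accordion-item')]

for item in results:
    for name, text in item:
        print(name, text)
    print('---')
```

Divs without a "ul" need no special handling here, since find_all simply returns fewer matches for them.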

