
How to treat missing elements from certain pages when scraping with BeautifulSoup

I need to scrape the code below from a product page and then split it to show the author and the illustrator separately.

The problem is that the pages differ: some have both an author and an illustrator (page 1), some have only an author (page 2), and some have neither an author nor an illustrator (page 3).

The only way to differentiate between the author and the illustrator is to check whether the word (Illustreerder) is present.

    How can I assign default values to author and illustrator when they are missing?

     <ul class="product-brands">
         <li class="brand-item">
             <a href="https://lapa.co.za/Skrywer/zinelda-mcdonald-illustreerder.html" title="Zinelda McDonald (Illustreerder)">Zinelda McDonald (Illustreerder)</a>
         </li>
         <li class="brand-item">
             <a href="https://lapa.co.za/Skrywer/jose-reinette-palmer.html" title="Jose Palmer &amp; Reinette Lombard">Jose Palmer &amp; Reinette Lombard</a>
         </li>
     </ul>

     from bs4 import BeautifulSoup
     import requests

     headers = {
         'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'
     }

     # AUTHOR & ILLUSTRATOR
     page1 = 'https://lapa.co.za/kinder-en-tienerboeke/leer-my-lees-vlak-r-grootboek-10-tippie-help-vir-frikkie'

     # AUTHOR ONLY
     page2 = 'https://lapa.co.za/catalog/product/view/id/1649/s/hoendervleis-grillerige-stories-en-rympies/category/84/'

     # NO AUTHOR and NO ILLUSTRATOR
     page3 = 'https://lapa.co.za/catalog/product/view/id/1633/s/sanri-steyn-7-vampiere-van-vlermuishoogte/category/84/'

     # PAGE WITH NO STOCK
     page4 = 'https://lapa.co.za/kinder-en-tienerboeke/my-groot-lofkleuterbybel-2-oudiomusiek'

     illustrator = '(Illustreerder)'
     productlist = []

     r = requests.get(page2, headers=headers)
     soup = BeautifulSoup(r.content, 'lxml')

     isbn = soup.find('div', class_='value', itemprop='sku').text.replace(" ", "")
     stocks = soup.find('div', class_='stock available')

     if stocks is not None:
         stock = stocks.text.strip()
     if stocks is None:
         stock = 'n/a'

     for ultag in soup.find_all('ul', {'class': 'product-brands'}):
         for litag in ultag.find_all('li'):
             author = litag.text.strip() or 'None'
             if illustrator not in author:
                 author = author

     for ultag in soup.find_all('ul', {'class': 'product-brands'}):
         for litag in ultag.find_all('li'):
             author = litag.text.strip()
             if illustrator in author:
                 illustrator = author

     bookdata = [isbn, stock, author, illustrator]
     print(bookdata)

    Page 1 expected output:

    ['9781776356515', 'In voorraad', 'Jose Palmer & Reinette Lombard', 'Zinelda McDonald']

    Page 2 expected output:

    ['9780799383874', 'In voorraad', 'Jaco Jacobs', 'None']

    Page 3 expected output:

    ['9780799383690', 'In voorraad', 'None', 'None']
  • You can do that with an if/else check, as follows:

    from bs4 import BeautifulSoup
    import requests
    
    headers = {
        'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'
    }
    
    # AUTHOR & ILLUSTRATOR
    #page1 = 'https://lapa.co.za/kinder-en-tienerboeke/leer-my-lees-vlak-r-grootboek-10-tippie-help-vir-frikkie'
    
    # AUTHOR ONLY
    page2 = 'https://lapa.co.za/catalog/product/view/id/1649/s/hoendervleis-grillerige-stories-en-rympies/category/84/'
    
    # NO AUTHOR and NO ILLUSTRATOR
    #page3 = 'https://lapa.co.za/catalog/product/view/id/1633/s/sanri-steyn-7-vampiere-van-vlermuishoogte/category/84/'
    
    # PAGE WITH NO STOCK
    #page4 = 'https://lapa.co.za/kinder-en-tienerboeke/my-groot-lofkleuterbybel-2-oudiomusiek'
    
    
    #illustrator = '(Illustreerder)'
    productlist = []
    
    r = requests.get(page2, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    
    isbn = soup.find('div', class_='value', itemprop='sku').text.replace(" ", "")
    stocks = soup.find('div', class_='stock available')
    
    stock = stocks.text.strip() if stocks else 'n/a'
    
    # Defaults used when a page lists no author and/or no illustrator
    author = None
    illustrator = None

    for ultag in soup.find_all('ul', {'class': 'product-brands'}):
        for litag in ultag.find_all('li'):
            name = litag.text.strip()
            if '(Illustreerder)' in name:
                # Remove the marker so only the illustrator's name is kept
                illustrator = name.replace('(Illustreerder)', '').strip()
            elif name:
                author = name

    bookdata = [isbn, stock, author, illustrator]
    print(bookdata)
    

    Output:

    ['9780799383874', 'In voorraad', 'Jaco Jacobs', None] 
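
If the same logic is needed for many pages, it may help to keep the author/illustrator split in a small function that can be checked without hitting the site. The sketch below is one possible arrangement, not the only way to do it: the names `MARKER`, `split_contributors`, and `contributors_from_html` are hypothetical, and it assumes (as in the question) that the illustrator entry is the one containing `(Illustreerder)`. It uses the built-in `html.parser` so that `lxml` is not required.

```python
MARKER = '(Illustreerder)'  # marker text taken from the question


def split_contributors(names, default='None'):
    """Return (author, illustrator) from a list of brand-item strings.

    Either value falls back to `default` when no matching entry exists.
    """
    author = default
    illustrator = default
    for raw in names:
        name = raw.strip()
        if not name:
            continue  # ignore empty list items
        if MARKER in name:
            # Drop the marker so only the person's name remains
            illustrator = name.replace(MARKER, '').strip()
        else:
            author = name
    return author, illustrator


def contributors_from_html(html, default='None'):
    """Collect the brand items from a product page and split them."""
    # Imported here so split_contributors() stays usable without bs4
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    items = [li.get_text() for li in soup.select('ul.product-brands li')]
    return split_contributors(items, default)
```

With the two list items from the question's HTML, `split_contributors(['Zinelda McDonald (Illustreerder)', 'Jose Palmer & Reinette Lombard'])` gives `('Jose Palmer & Reinette Lombard', 'Zinelda McDonald')`, and an empty list gives `('None', 'None')`. Pass `default=None` if a real `None` is preferred over the string `'None'`.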
    
    
    
    
        
    


     