How to handle elements that are missing from some pages when scraping with BeautifulSoup

Question

I need to scrape the markup below from a product page and then split it to show the author and the illustrator separately.

The problem is that some pages have both an author and an illustrator (page 1), some pages have only an author (page 2), and some pages have neither an author nor an illustrator (page 3).

The only way to differentiate between the author and the illustrator is to check whether the word (Illustreerder) is present.

How can I assign default values to author and illustrator when they are missing?

<ul class="product-brands">
    <li class="brand-item">
        <a href="https://lapa.co.za/Skrywer/zinelda-mcdonald-illustreerder.html" title="Zinelda McDonald (Illustreerder)">Zinelda McDonald (Illustreerder)</a>
    </li>
    <li class="brand-item">
        <a href="https://lapa.co.za/Skrywer/jose-reinette-palmer.html" title="Jose Palmer &amp; Reinette Lombard">Jose Palmer &amp; Reinette Lombard</a>
    </li>
</ul>

from bs4 import BeautifulSoup
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'
}

# AUTHOR & ILLUSTRATOR
page1 = 'https://lapa.co.za/kinder-en-tienerboeke/leer-my-lees-vlak-r-grootboek-10-tippie-help-vir-frikkie'

# AUTHOR ONLY
page2 = 'https://lapa.co.za/catalog/product/view/id/1649/s/hoendervleis-grillerige-stories-en-rympies/category/84/'

# NO AUTHOR and NO ILLUSTRATOR
page3 = 'https://lapa.co.za/catalog/product/view/id/1633/s/sanri-steyn-7-vampiere-van-vlermuishoogte/category/84/'

# PAGE WITH NO STOCK
page4 = 'https://lapa.co.za/kinder-en-tienerboeke/my-groot-lofkleuterbybel-2-oudiomusiek'

illustrator = '(Illustreerder)'
productlist = []

r = requests.get(page2, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')

isbn = soup.find('div', class_='value', itemprop='sku').text.replace(" ", "")
stocks = soup.find('div', class_='stock available')

if stocks is not None:
    stock = stocks.text.strip()
if stocks is None:
    stock = 'n/a'

for ultag in soup.find_all('ul', {'class': 'product-brands'}):
    for litag in ultag.find_all('li'):
        author = litag.text.strip() or 'None'
        if illustrator not in author:
            author = author

for ultag in soup.find_all('ul', {'class': 'product-brands'}):
    for litag in ultag.find_all('li'):
        author = litag.text.strip()
        if illustrator in author:
            illustrator = author

bookdata = [isbn, stock, author, illustrator]
print(bookdata)

Page 1 Expected Output

['9781776356515', 'In voorraad', 'Jose Palmer & Reinette Lombard', 'Zinelda McDonald']

Page 2 Expected Output

['9780799383874', 'In voorraad', 'Jaco Jacobs', 'None']

Page 3 Expected Output

['9780799383690', 'In voorraad', 'None', 'None']

Answer 1

You can do this with conditional (if/else) expressions that fall back to a default when an element is missing, as follows:

from bs4 import BeautifulSoup
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'
}

# AUTHOR & ILLUSTRATOR
#page1 = 'https://lapa.co.za/kinder-en-tienerboeke/leer-my-lees-vlak-r-grootboek-10-tippie-help-vir-frikkie'

# AUTHOR ONLY
page2 = 'https://lapa.co.za/catalog/product/view/id/1649/s/hoendervleis-grillerige-stories-en-rympies/category/84/'

# NO AUTHOR and NO ILLUSTRATOR
#page3 = 'https://lapa.co.za/catalog/product/view/id/1633/s/sanri-steyn-7-vampiere-van-vlermuishoogte/category/84/'

# PAGE WITH NO STOCK
#page4 = 'https://lapa.co.za/kinder-en-tienerboeke/my-groot-lofkleuterbybel-2-oudiomusiek'


#illustrator = '(Illustreerder)'
productlist = []

r = requests.get(page2, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')

isbn = soup.find('div', class_='value', itemprop='sku').text.replace(" ", "")
stocks = soup.find('div', class_='stock available')

stock = stocks.text.strip() if stocks else 'n/a'

# Defaults used when a page lists no author or no illustrator
author = None
illustrator = None
marker = '(Illustreerder)'

for ultag in soup.find_all('ul', {'class': 'product-brands'}):
    for litag in ultag.find_all('li'):
        name = litag.text.strip()
        if marker in name:
            # Drop the "(Illustreerder)" suffix to keep only the name
            illustrator = name.replace(marker, '').strip()
        elif name:
            author = name

bookdata = [isbn, stock, author, illustrator]
print(bookdata)

Output:

['9780799383874', 'In voorraad', 'Jaco Jacobs', None]
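
To check all three cases from the question without fetching the pages, the classification step can be pulled out into a small function that works on the already extracted text of each brand-item. This is only a sketch: the names `split_contributors` and `ILLUSTRATOR_MARKER` are made up for illustration, and the string 'None' is assumed as the default because that is what the expected outputs in the question show.

```python
ILLUSTRATOR_MARKER = '(Illustreerder)'  # hypothetical constant name


def split_contributors(names, default='None'):
    """Return (author, illustrator) from a list of brand-item texts.

    Any role that is not found keeps the given default value.
    """
    author = default
    illustrator = default
    for name in names:
        name = name.strip()
        if not name:
            continue  # skip empty list items
        if ILLUSTRATOR_MARKER in name:
            # Keep only the name, without the "(Illustreerder)" suffix
            illustrator = name.replace(ILLUSTRATOR_MARKER, '').strip()
        else:
            author = name
    return author, illustrator


# Page 1: author and illustrator
print(split_contributors(['Zinelda McDonald (Illustreerder)',
                          'Jose Palmer & Reinette Lombard']))
# Page 2: author only
print(split_contributors(['Jaco Jacobs']))
# Page 3: neither
print(split_contributors([]))
```

With the soup from the answer above, the input list could be built with something like [li.get_text(strip=True) for li in soup.select('ul.product-brands li')]. An empty list (no product-brands block on the page) simply yields the defaults for both values.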
