How to handle missing elements on certain pages when scraping with BeautifulSoup
I need to scrape the markup below from a product page and then split it so that the author and the illustrator are shown separately.
The problem is that the pages differ: some have both an author and an illustrator (page 1), some have only an author (page 2), and some have neither an author nor an illustrator (page 3).
The only way to differentiate between the author and the illustrator is the "(Illustreerder)" marker in the illustrator's link text.
How can I assign default values to the author and the illustrator when they are missing?
<ul class="product-brands">
    <li class="brand-item">
        <a href="https://lapa.co.za/Skrywer/zinelda-mcdonald-illustreerder.html" title="Zinelda McDonald (Illustreerder)">Zinelda McDonald (Illustreerder)</a>
    </li>
    <li class="brand-item">
        <a href="https://lapa.co.za/Skrywer/jose-reinette-palmer.html" title="Jose Palmer & Reinette Lombard">Jose Palmer & Reinette Lombard</a>
    </li>
</ul>
from bs4 import BeautifulSoup
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'
}

# AUTHOR & ILLUSTRATOR
page1 = 'https://lapa.co.za/kinder-en-tienerboeke/leer-my-lees-vlak-r-grootboek-10-tippie-help-vir-frikkie'
# AUTHOR ONLY
page2 = 'https://lapa.co.za/catalog/product/view/id/1649/s/hoendervleis-grillerige-stories-en-rympies/category/84/'
# NO AUTHOR and NO ILLUSTRATOR
page3 = 'https://lapa.co.za/catalog/product/view/id/1633/s/sanri-steyn-7-vampiere-van-vlermuishoogte/category/84/'
# PAGE WITH NO STOCK
page4 = 'https://lapa.co.za/kinder-en-tienerboeke/my-groot-lofkleuterbybel-2-oudiomusiek'

illustrator = '(Illustreerder)'
productlist = []

r = requests.get(page2, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')

isbn = soup.find('div', class_='value', itemprop='sku').text.replace(" ", "")
stocks = soup.find('div', class_='stock available')
if stocks is not None:
    stock = stocks.text.strip()
if stocks is None:
    stock = 'n/a'

for ultag in soup.find_all('ul', {'class': 'product-brands'}):
    for litag in ultag.find_all('li'):
        author = litag.text.strip() or 'None'
        if illustrator not in author:
            author = author

for ultag in soup.find_all('ul', {'class': 'product-brands'}):
    for litag in ultag.find_all('li'):
        author = litag.text.strip()
        if illustrator in author:
            illustrator = author

bookdata = [isbn, stock, author, illustrator]
print(bookdata)
Page 1 expected output:
['9781776356515', 'In voorraad', 'Jose Palmer & Reinette Lombard', 'Zinelda McDonald']
Page 2 expected output:
['9780799383874', 'In voorraad', 'Jaco Jacobs', 'None']
Page 3 expected output:
['9780799383690', 'In voorraad', 'None', 'None']
You can do that with if/else checks, as follows:
from bs4 import BeautifulSoup
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'
}
# AUTHOR & ILLUSTRATOR
#page1 = 'https://lapa.co.za/kinder-en-tienerboeke/leer-my-lees-vlak-r-grootboek-10-tippie-help-vir-frikkie'
# AUTHOR ONLY
page2 = 'https://lapa.co.za/catalog/product/view/id/1649/s/hoendervleis-grillerige-stories-en-rympies/category/84/'
# NO AUTHOR and NO ILLUSTRATOR
#page3 = 'https://lapa.co.za/catalog/product/view/id/1633/s/sanri-steyn-7-vampiere-van-vlermuishoogte/category/84/'
# PAGE WITH NO STOCK
#page4 = 'https://lapa.co.za/kinder-en-tienerboeke/my-groot-lofkleuterbybel-2-oudiomusiek'
#illustrator = '(Illustreerder)'
productlist = []
r = requests.get(page2, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
isbn = soup.find('div', class_='value', itemprop='sku').text.replace(" ", "")
stocks = soup.find('div', class_='stock available')
stock = stocks.text.strip() if stocks else 'n/a'
# Start from the defaults so pages without these elements still yield a value.
author = None
illustrator = None
for ultag in soup.find_all('ul', {'class': 'product-brands'}):
    for litag in ultag.find_all('li'):
        name = litag.text.strip()
        if '(Illustreerder)' in name:
            # Drop the marker so only the illustrator's name remains.
            illustrator = name.replace('(Illustreerder)', '').strip()
        elif name:
            author = name
bookdata = [isbn, stock, author, illustrator]
print(bookdata)
Output:
['9780799383874', 'In voorraad', 'Jaco Jacobs', None]
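To check all three cases without hitting the live site, the same idea can be sketched as a small function that works on an HTML string. This is only an illustration under some assumptions: `split_brands`, `ILLUSTRATOR_MARKER` and the `default` parameter are hypothetical names, the sample snippets are simplified stand-ins for the real pages, and the built-in `html.parser` is used in place of `lxml` so nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# Marker text taken from the markup in the question.
ILLUSTRATOR_MARKER = '(Illustreerder)'


def split_brands(html, default='None'):
    """Return (author, illustrator), falling back to `default` when one is missing.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, 'html.parser')
    author = default
    illustrator = default
    # If the <ul> is absent, find_all returns an empty list and the defaults stay.
    for ul in soup.find_all('ul', class_='product-brands'):
        for li in ul.find_all('li'):
            name = li.get_text(strip=True)
            if not name:
                continue
            if ILLUSTRATOR_MARKER in name:
                # Remove the marker so only the name is kept.
                illustrator = name.replace(ILLUSTRATOR_MARKER, '').strip()
            else:
                author = name
    return author, illustrator


# Simplified sample markup (assumed, modelled on the snippet in the question).
both = (
    '<ul class="product-brands">'
    '<li class="brand-item"><a href="#">Zinelda McDonald (Illustreerder)</a></li>'
    '<li class="brand-item"><a href="#">Jose Palmer &amp; Reinette Lombard</a></li>'
    '</ul>'
)
author_only = (
    '<ul class="product-brands">'
    '<li class="brand-item"><a href="#">Jaco Jacobs</a></li>'
    '</ul>'
)
neither = '<div class="product-info"></div>'

print(split_brands(both))         # ('Jose Palmer & Reinette Lombard', 'Zinelda McDonald')
print(split_brands(author_only))  # ('Jaco Jacobs', 'None')
print(split_brands(neither))      # ('None', 'None')
```

Passing `default=None` instead returns real `None` values rather than the string `'None'`, which may be more convenient if the results are written to a CSV file or a database later.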