[英]I want to extract the text inside h4 and the text related to h4 and the link related to them(with xpath)
[英]How to extract text within h4 strong?
我試圖從每個產品頁面中提取每個“總體評級”(強標簽中的數字值) https://www.guitarguitar.co.uk/product/12082017334688--ephonehone-les-paul-standard-plus-top- pro-translucent-blue結構如下:
<div class="col-sm-12">
<h2 class="line-bottom"> Customer Reviews</h2>
<h4>
Overall Rating
<strong>5</strong>
<span></span>
</h4>
</div>
我試圖只提取強大的價值觀。
productsRating = soup.find("div", {"class": "col-sm-12"}.h4
這有時會起作用,但頁面會為不同的元素使用相同的類,因此它會提取不需要的html元素。
有沒有解決方案只能獲得產品的整體評論?
EDITED!
這是我程序的整個循環。
for page in range(1, 2):
guitarPage = requests.get('https://www.guitarguitar.co.uk/guitars/electric/page-{}'.format(page)).text
soup = BeautifulSoup(guitarPage, 'lxml')
guitars = soup.find_all(class_='col-xs-6 col-sm-4 col-md-4 col-lg-3')
for guitar in guitars:
title_text = guitar.h3.text.strip()
print('Guitar Name: ', title_text)
price = guitar.find(class_='price bold small').text.strip()
trim = re.compile(r'[^\d.,]+')
int_price = trim.sub('', price)
print('Guitar Price: ', int_price)
priceSave = guitar.find('span', {'class': 'price save'})
if priceSave is not None:
priceOf = priceSave.text
trim = re.compile(r'[^\d.,]+')
int_priceOff = trim.sub('', priceOf)
print('Save: ', int_priceOff)
else:
print("No discount!")
image = guitar.img.get('src')
print('Guitar Image: ', image)
productLink = guitar.find('a').get('href')
linkProd = url + productLink
print('Link of product', linkProd)
productsPage.append(linkProd)
for products in productsPage:
response = requests.get(products)
soup = BeautifulSoup(response.content, "lxml")
productsDetails = soup.find("div", {"class": "description-preview"})
if productsDetails is not None:
description = productsDetails.text
print('product detail: ', description)
else:
print('none')
time.sleep(0.2)
productsRating = soup.find_all('strong')[0].text
print(productsRating)
嘗試:
import requests
from bs4 import BeautifulSoup
url = 'https://www.guitarguitar.co.uk/product/190319340849008--gibson-les-paul-standard-60s-iced-tea'
html = requests.get(url).text
soup = BeautifulSoup(html, "lxml")
try:
productsRating = soup.find('h2', string=lambda s: "Customer reviews" in s).find_next_siblings()[0].find('strong').text
except:
productsRating = None
print(productsRating)
審閱信息全部在腳本標記中,您可以使用json提取和加載。 簡單地看看如何在循環中適應它。
import requests
from bs4 import BeautifulSoup as bs
import json
url = 'https://www.guitarguitar.co.uk/product/12082017334688--epiphone-les-paul-standard-plus-top-pro-translucent-blue'
r = requests.get(url)
soup = bs(r.content, 'lxml')
script = soup.select_one('[type="application/ld+json"]').text
data = json.loads(script.strip())
overall_rating = data['@graph'][2]['aggregateRating']['ratingValue']
reviews = [review for review in data['@graph'][2]['review']] #extract what you want
輸出:
探索json
要處理沒有評論,您可以使用簡單的try except
:
import requests
from bs4 import BeautifulSoup as bs
import json
url = 'https://www.guitarguitar.co.uk/product/190319340849008--gibson-les-paul-standard-60s-iced-tea'
r = requests.get(url)
soup = bs(r.content, 'lxml')
script = soup.select_one('[type="application/ld+json"]').text
data = json.loads(script.strip())
try:
overall_rating = data['@graph'][2]['aggregateRating']['ratingValue']
reviews = [review for review in data['@graph'][2]['review']] #extract what you want
except: #you might want to use except KeyError
overall_rating = "None"
reviews = ['None']
或者,使用if語句:
if 'aggregateRating' in script:
overall_rating = data['@graph'][2]['aggregateRating']['ratingValue']
reviews = [review for review in data['@graph'][2]['review']] #extract what you want
else:
overall_rating = "None"
reviews = ['None']
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.