
How to extract text within h4 strong?

I am trying to extract the "Overall Rating" (the number inside the strong tag) from each product page, for example https://www.guitarguitar.co.uk/product/12082017334688--epiphone-les-paul-standard-plus-top-pro-translucent-blue. The structure is as follows:

  <div class="col-sm-12">
    <h2 class="line-bottom"> Customer Reviews</h2>
    <h4>
      Overall Rating
      <strong>5</strong>
      <span></span>
    </h4>
  </div>

I am trying to extract only the value inside the strong tag.

 productsRating = soup.find("div", {"class": "col-sm-12"}).h4

This sometimes works, but the page uses the same class for different elements, so it also extracts unwanted HTML elements.

Is there a way to get only each product's overall rating?

Edit: this is the whole loop of my program.

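import re
import time

import requests
from bs4 import BeautifulSoup

# `url` (the site's base URL) and `productsPage` (a list of product links)
# are defined elsewhere in the program and are not shown here.
for page in range(1, 2):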
    guitarPage = requests.get('https://www.guitarguitar.co.uk/guitars/electric/page-{}'.format(page)).text
    soup = BeautifulSoup(guitarPage, 'lxml')
    guitars = soup.find_all(class_='col-xs-6 col-sm-4 col-md-4 col-lg-3')

    for guitar in guitars:

        title_text = guitar.h3.text.strip()
        print('Guitar Name: ', title_text)
        price = guitar.find(class_='price bold small').text.strip()
        trim = re.compile(r'[^\d.,]+')
        int_price = trim.sub('', price)
        print('Guitar Price: ', int_price)

        priceSave = guitar.find('span', {'class': 'price save'})
        if priceSave is not None:
            priceOf = priceSave.text
            trim = re.compile(r'[^\d.,]+')
            int_priceOff = trim.sub('', priceOf)
            print('Save: ', int_priceOff)
        else:
            print("No discount!")

        image = guitar.img.get('src')
        print('Guitar Image: ', image)

        productLink = guitar.find('a').get('href')
        linkProd = url + productLink
        print('Link of product', linkProd)
        productsPage.append(linkProd)

        for products in productsPage:
            response = requests.get(products)
            soup = BeautifulSoup(response.content, "lxml")
            productsDetails = soup.find("div", {"class": "description-preview"})
            if productsDetails is not None:
                description = productsDetails.text
                print('product detail: ', description)
            else:
                print('none')
            time.sleep(0.2)
            productsRating = soup.find_all('strong')[0].text
            print(productsRating)

Try:

import requests
from bs4 import BeautifulSoup 

url = 'https://www.guitarguitar.co.uk/product/190319340849008--gibson-les-paul-standard-60s-iced-tea'

html = requests.get(url).text

soup = BeautifulSoup(html, "lxml")
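try:
    # Find the reviews heading by its text (case-insensitive; s can be None),
    # then read the <strong> inside the <h4> that follows it.
    heading = soup.find('h2', string=lambda s: s is not None and "customer reviews" in s.lower())
    productsRating = heading.find_next_sibling('h4').find('strong').text
except AttributeError:
    # The heading, the <h4>, or the <strong> is missing: no rating on this page.
    productsRating = None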

print(productsRating)
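
To see why anchoring on the heading avoids the repeated class name, here is a small offline sketch that runs the same idea against the HTML fragment from the question. The function name `overall_rating`, the sample markup, and the extra "distractor" div are made up for illustration, and `html.parser` is used only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# The first div is a made-up distractor that reuses the same class;
# the second div is the fragment from the question.
SAMPLE_HTML = """
<div class="col-sm-12">
  <h4>Something else <strong>not a rating</strong></h4>
</div>
<div class="col-sm-12">
  <h2 class="line-bottom"> Customer Reviews</h2>
  <h4>
    Overall Rating
    <strong>5</strong>
    <span></span>
  </h4>
</div>
"""


def overall_rating(html):
    """Return the text of the <strong> under the reviews heading, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Match the heading by its text, ignoring case; s is None for tags
    # that do not contain exactly one string.
    heading = soup.find(
        "h2", string=lambda s: s is not None and "customer reviews" in s.lower()
    )
    if heading is None:
        return None
    h4 = heading.find_next_sibling("h4")
    if h4 is None or h4.strong is None:
        return None
    return h4.strong.get_text(strip=True)


print(overall_rating(SAMPLE_HTML))
print(overall_rating("<p>no reviews here</p>"))
```

In the question's loop, the same function could be called with `response.content` in place of `soup.find_all('strong')[0].text`.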

The review information is all in a script tag (type application/ld+json) that you can extract and load with json. It should be simple enough to fit that into your loop.

import requests
from bs4 import BeautifulSoup as bs
import json

url = 'https://www.guitarguitar.co.uk/product/12082017334688--epiphone-les-paul-standard-plus-top-pro-translucent-blue'
r = requests.get(url)
soup = bs(r.content, 'lxml')
script = soup.select_one('[type="application/ld+json"]').text
data = json.loads(script.strip())
overall_rating = data['@graph'][2]['aggregateRating']['ratingValue']
reviews = [review for review in data['@graph'][2]['review']] #extract what you want
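
The index `[2]` above depends on where the product entry happens to sit in `@graph`. As a hedged alternative, the sketch below searches the graph for the first entry that carries an `aggregateRating` key. The sample page and the helper name `find_rated_node` are invented for illustration and only mimic the shape used in the answer; the real page may contain other fields.

```python
import json

from bs4 import BeautifulSoup

# Made-up page that mimics the JSON-LD shape used in the answer.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org",
 "@graph": [
   {"@type": "Organization", "name": "Example Shop"},
   {"@type": "Product", "name": "Example Guitar",
    "aggregateRating": {"@type": "AggregateRating", "ratingValue": "5"},
    "review": [{"@type": "Review", "reviewBody": "Great"},
               {"@type": "Review", "reviewBody": "Lovely"}]}
 ]}
</script>
</head><body></body></html>
"""


def find_rated_node(html):
    """Return the first @graph entry that has an aggregateRating, or None."""
    soup = BeautifulSoup(html, "html.parser")
    script = soup.select_one('script[type="application/ld+json"]')
    if script is None or not script.string:
        return None
    data = json.loads(script.string)
    if not isinstance(data, dict):
        return None
    for node in data.get("@graph", []):
        if isinstance(node, dict) and "aggregateRating" in node:
            return node
    return None


node = find_rated_node(SAMPLE_PAGE)
print(node["aggregateRating"]["ratingValue"])
print(len(node["review"]))
```

This trades the fixed position for a key lookup, so it keeps working if the site adds or reorders graph entries, at the cost of assuming only one entry has a rating.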


To handle products with no reviews, you could use a simple try/except:

import requests
from bs4 import BeautifulSoup as bs
import json

url = 'https://www.guitarguitar.co.uk/product/190319340849008--gibson-les-paul-standard-60s-iced-tea'
r = requests.get(url)
soup = bs(r.content, 'lxml')
script = soup.select_one('[type="application/ld+json"]').text
data = json.loads(script.strip())
try:
    overall_rating = data['@graph'][2]['aggregateRating']['ratingValue']
    reviews = [review for review in data['@graph'][2]['review']] #extract what you want
except (KeyError, IndexError):  # raised when the rating or review data is missing
    overall_rating = "None"
    reviews = ['None']

Or use an if statement:

if 'aggregateRating' in script:
    overall_rating = data['@graph'][2]['aggregateRating']['ratingValue']
    reviews = [review for review in data['@graph'][2]['review']] #extract what you want
else:
    overall_rating = "None"
    reviews = ['None']
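
To fit this into a loop over product pages, one option is to wrap the lookup in a function that returns None when anything is missing. The sketch below keeps the `data['@graph'][2]` path from the answer and runs offline on three made-up pages; `rating_from_html`, `make_page`, and `PAGES` are hypothetical names introduced here, not part of the original program.

```python
import json

from bs4 import BeautifulSoup


def rating_from_html(html):
    """Return ratingValue from the page's JSON-LD, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    script = soup.select_one('script[type="application/ld+json"]')
    if script is None or not script.string:
        return None
    try:
        data = json.loads(script.string)
        return data["@graph"][2]["aggregateRating"]["ratingValue"]
    except (ValueError, KeyError, IndexError, TypeError):
        # Invalid JSON, a missing key, a short graph, or an unexpected shape.
        return None


def make_page(graph):
    """Build a tiny fake product page around the given @graph list."""
    payload = json.dumps({"@graph": graph})
    return '<script type="application/ld+json">' + payload + "</script>"


# Stand-ins for downloaded product pages.
PAGES = {
    "reviewed": make_page([{}, {}, {"aggregateRating": {"ratingValue": "4.5"}}]),
    "unreviewed": make_page([{}, {}, {"name": "No reviews yet"}]),
    "no-json": "<html><body>nothing here</body></html>",
}

ratings = {name: rating_from_html(html) for name, html in PAGES.items()}
for name, rating in ratings.items():
    print(name, rating)
```

In the real program, the HTML would come from `requests.get(link).text` for each link collected in `productsPage`.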
