Extract Information from BeautifulSoup bs4.element.Tag using Python

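I am practicing web scraping by extracting some information from the website https://www.kerastase.com.au/. As an example, I am focusing on the best-seller items (7 of them). I have been able to extract the names, descriptions, and prices with the following code.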

import requests
from bs4 import BeautifulSoup

url='https://www.kerastase.com.au/'
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")

prod_names = soup.find_all("h3", class_="c-product-tile__name")
prod_names = [prod.get_text() for prod in prod_names]
prices = soup.find_all("span", class_="c-product-price__value")
prices = [float(price.get_text()[2:]) for price in prices if (len(price) > 0)]
prod_descs = soup.find_all("p", class_="c-product-tile__description")
prod_descs = [desc.get_text() for desc in prod_descs]

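However, extracting the ratings and the number of reviews seems more complicated. They sit inside a nested div. I have been able to extract the caption of the first item with the following command, but the result is a mess and I don't know what to do after this step: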

soup.findAll('figcaption', class_="c-product-tile__caption")[0]

Here is the full caption I get for one item:

<figcaption class="c-product-tile__caption"> <div class="c-product-tile__caption-inner"> <div class="c-product-tile__wishlist"> <button aria-label="Add to Wishlist Elixir Ultime Pride Edition Hair Oil" aria-pressed="" class="c-add-to-wishlist" data-analytics='{"products":[{"pid":"3474637116088","title":"Elixir Ultime Pride Edition Hair Oil","description":"","url":"https://www.kerastase.com.au/collections/elixir-ultime/elixir-ultime-pride-edition-hair-oil/3474637116088.html","imgUrl":"https://www.kerastase.com.au/on/demandware.static/-/Sites-kerastase-master-catalog/default/dw377882d1/2022/Elixir%20Ultime/Pride/1.%20Product.jpg","currency":"AUD","price":65,"name":"Elixir Ultime Pride Edition Hair Oil","subname":"Iconic nourishing hair oil for all hair types. Kérastase will be donating to Minus18, subsidising LGBTQIA+ Inclusion Workshops for schools across Australia.","id":"elixir-pride","salePrice":65,"brand":"Kérastase","category":"others/collections/elixir ultime","productTopCategory":"products","variant":"100 ml","size":"100 ml","color":"","fragrance":"","stock":"in stock","autoReplenishmentInterval":"not present","upc":"3474637116088","regularPrice":null,"isProductSet":false,"isProductGroup":false,"isBundle":false,"bundleID":"","rating":5,"numberReviews":2,"vtoState":"not present","collection":["Elixir Ultime"],"customizations":{"engraving":"not present"},"badges":"none","remainingStock":null}],"label":"elixir ultime pride edition hair oil::3474637116088","category":"{{dataLayer.page.category}}"}' data-component="product/AddToWishlist" data-component-options='{"pid":"3474637116088","url":{"add":"https://www.kerastase.com.au/on/demandware.store/Sites-kerastase-au-ng-Site/en_AU/Wishlist-AddToWishList","remove":"https://www.kerastase.com.au/on/demandware.store/Sites-kerastase-au-ng-Site/en_AU/Wishlist-RemoveFromWishList"},"text":{"title":{"add":"Add to Wishlist","remove":"Remove from Wishlist"},"accessibility":{"addAriaLabel":"Add to Wishlist Elixir Ultime Pride 
Edition Hair Oil","removeAriaLabel":"Remove from Wishlist Elixir Ultime Pride Edition Hair Oil"}},"isLabel":false}' title="Add to Wishlist"> <span class="h-show-for-sr" data-js-wishlist-text="">Wishlist</span> </button> </div> <h3 class="c-product-tile__name"><a data-js-product-name="" data-lora-datalayer='{"products":{"3474637116088":{"name":"Elixir Ultime Pride Edition Hair Oil"}}}' href="/collections/elixir-ultime/elixir-ultime-pride-edition-hair-oil/elixir-pride.html"> Elixir Ultime Pride Edition Hair Oil </a></h3><p class="c-product-tile__description"> Iconic nourishing hair oil for all hair types. Kérastase will be donating to Minus18, subsidising LGBTQIA+ Inclusion Workshops for schools across Australia. </p> <div class="c-product-tile__info m-multiple-items"> <div class="c-product-tile__info-item c-product-tile__rating"> <div data-bv-productid="elixir-pride" data-bv-redirect-url="/collections/elixir-ultime/elixir-ultime-pride-edition-hair-oil/elixir-pride.html" data-bv-seo="false" data-bv-show="inline_rating" data-component="product/BazaarvoiceService"> </div> </div> <div class="c-product-tile__info-item c-product-tile__price"> <div class="c-product-price" data-component="product/ProductPrice" data-component-options='{"pid":"3474637116088","reloadData":{"configid":null},"dataModelId":"productprice"}'> <span class="c-product-price__label h-hidden" data-js-pricelabel="">Old price</span> <span class="c-product-price__value m-old h-hidden" data-js-standardprice=""></span> <span class="c-product-price__label h-hidden" data-js-pricelabel="">New price</span> <span class="c-product-price__value" data-js-saleprice="">A$65.00</span> </div> </div> </div> <div class="c-product-tile__variations-group"> <div class="c-product-tile__swatch-group"> </div> <div class="c-product-tile__variations"> <div class="c-product-tile__variations-label">One size available</div> <div class="c-product-tile__variations-single-text"> <span data-js-pid="">100 ml</span> </div> </div> </div> 
</div> <div class="c-product-tile__actions m-add-bag-enabled" data-js-producttile-actions=""> <div data-component="global/ComponentPlaceholder" data-component-options='{"_lazyload":true,"reloadData":{"id":"productmainaction","section":"product","configid":"producttile","reloadUrl":"https://www.kerastase.com.au/on/demandware.store/Sites-kerastase-au-ng-Site/en_AU/CDSLazyload-product_productmainaction?configid=producttile&amp;data=3474637116088&amp;id=productmainaction&amp;pageId=homepage&amp;section=product"}}'> <button class="c-button m-expand-for-medium-down c-product-add-bag__button m-loading"> <span>Loading ...</span> </button> </div> </div> </figcaption>

How can I get the product rating and the number of reviews from this? Example: "rating":5,"numberReviews":2

(It is probably possible to get all of the product information from the above, but I don't know the best way to do it.)

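If you look closely, the main tag holding the product details is the button tag: its data-analytics attribute contains the data in JSON format, so we can parse that attribute and look up the relevant fields.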

import json

# Reuses the `soup` object built in the question's code.
# Each product tile sits in a div with this class.
main_tag = soup.find_all("div", class_="c-product-tile__figure")
dict1 = {}
for i, tile in enumerate(main_tag):
    # The wishlist button stores the product data as JSON in data-analytics.
    json_data = tile.find("button")["data-analytics"]
    product = json.loads(json_data)["products"][0]
    dict1[i] = {
        "name": product["title"],
        "price": product["price"],
        "rating": product["rating"],
        "reviews": product["numberReviews"],
    }

Output:

{0: {'name': 'Elixir Ultime Pride Edition Hair Oil',
  'price': 65,
  'rating': 5,
  'reviews': 2},
 1: {'name': 'Nutritive 8HR Magic Night Hair Serum',
  'price': 67,
  'rating': 4.5701,
  'reviews': 749},
  ....
  }
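
The loop above raises an error as soon as one tile has no button or no data-analytics attribute. Below is a minimal sketch of a more defensive variant that can be run without hitting the live site. It makes several assumptions: SAMPLE_HTML is a trimmed, hand-written stand-in for two product tiles (not the real page markup), extract_products is a hypothetical helper name, and the built-in "html.parser" is used in place of "lxml" so that no extra parser is required.

```python
import json
from bs4 import BeautifulSoup

# Hand-written stand-in for two product tiles (assumption: the real page
# carries many more attributes; only the fields used below are kept).
SAMPLE_HTML = """
<div class="c-product-tile__figure">
  <button class="c-add-to-wishlist"
          data-analytics='{"products":[{"title":"Elixir Ultime Pride Edition Hair Oil","price":65,"rating":5,"numberReviews":2}]}'>
  </button>
</div>
<div class="c-product-tile__figure">
  <button class="c-add-to-wishlist">No analytics here</button>
</div>
"""


def extract_products(html):
    """Return one dict per tile whose button carries a data-analytics payload."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tile in soup.find_all("div", class_="c-product-tile__figure"):
        # Match only buttons that actually have the attribute.
        button = tile.find("button", attrs={"data-analytics": True})
        if button is None:
            continue  # tile without analytics data: skip instead of raising
        product = json.loads(button["data-analytics"])["products"][0]
        rows.append({
            "name": product.get("title"),
            "price": product.get("price"),
            "rating": product.get("rating"),         # None if the key is absent
            "reviews": product.get("numberReviews"),  # None if the key is absent
        })
    return rows


rows = extract_products(SAMPLE_HTML)
print(rows)
```

Using dict.get means a product without a rating yields None rather than a KeyError, which may be preferable when some items have no reviews yet.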

