
Beautiful Soup returns None on existing element

I am trying to scrape the price of a product. Here is my code:

from bs4 import BeautifulSoup as soup
import requests

page_url = "https://www.falabella.com/falabella-cl/product/5311682/Smartphone-iPhone-7-PLUS-32GB/5311682/"
headers={
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
}
uClient = requests.get(page_url, headers=headers)
print(uClient)
page_soup = soup(uClient.content, "html.parser") #requests
test = page_soup.find("p", {"class":"fb-price"})
print(test)

But I get the following response instead of the desired price:

<Response [200]>
None

I have checked with Chrome Developer Tools that the element exists. URL: https://www.falabella.com/falabella-cl/product/5311682/Smartphone-iPhone-7-PLUS-32GB/5311682/

If you open the Network tab, you will find the following link, which retrieves the data in JSON format. You can do this without Selenium or Beautiful Soup.

Url="https://www.falabella.com/rest/model/falabella/rest/browse/BrowseActor/fetch-item-details?{%22products%22:[{%22productId%22:%225311634%22},{%22productId%22:%225311597%22},{%22productId%22:%225311505%22},{%22productId%22:%226009874%22},{%22productId%22:%225311494%22},{%22productId%22:%225311510%22},{%22productId%22:%226009845%22},{%22productId%22:%226009871%22},{%22productId%22:%226009868%22},{%22productId%22:%226009774%22},{%22productId%22:%226782957%22},{%22productId%22:%226009783%22},{%22productId%22:%226782958%22},{%22productId%22:%228107608%22},{%22productId%22:%228107640%22},{%22productId%22:%226009875%22},{%22productId%22:%226782967%22},{%22productId%22:%226782922%22}]}"

Try the code below.

import requests

page_url = "https://www.falabella.com/rest/model/falabella/rest/browse/BrowseActor/fetch-item-details?{%22products%22:[{%22productId%22:%225311634%22},{%22productId%22:%225311597%22},{%22productId%22:%225311505%22},{%22productId%22:%226009874%22},{%22productId%22:%225311494%22},{%22productId%22:%225311510%22},{%22productId%22:%226009845%22},{%22productId%22:%226009871%22},{%22productId%22:%226009868%22},{%22productId%22:%226009774%22},{%22productId%22:%226782957%22},{%22productId%22:%226009783%22},{%22productId%22:%226782958%22},{%22productId%22:%228107608%22},{%22productId%22:%228107640%22},{%22productId%22:%226009875%22},{%22productId%22:%226782967%22},{%22productId%22:%226782922%22}]}"
headers={
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
}
response=requests.get(page_url, headers=headers)
res=response.json()
for item in res['products'][0]['product']['prices']:
    print(item['symbol'] + item['originalPrice'])

Output:

$ 379.990
$ 569.990
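The query string in that URL is a JSON object of the form {"products": [{"productId": ...}, ...]} with its double quotes percent-encoded as %22. Rather than editing the long string by hand, it can be generated. The following is a minimal sketch: `build_item_details_url` is a hypothetical helper name, and whether the endpoint accepts an arbitrary list of product IDs is an assumption that has not been verified here.

```python
import json
from urllib.parse import quote

# Endpoint copied from the URL above; everything after "?" is a JSON object.
BASE_URL = (
    "https://www.falabella.com/rest/model/falabella/rest/browse/"
    "BrowseActor/fetch-item-details"
)

def build_item_details_url(product_ids):
    """Hypothetical helper: encode a {"products": [...]} payload into the query string."""
    payload = {"products": [{"productId": str(pid)} for pid in product_ids]}
    # Compact separators avoid spaces in the JSON text.
    raw = json.dumps(payload, separators=(",", ":"))
    # Keep braces, brackets, colons and commas literal so the result has the
    # same shape as the URL seen in the Network tab; quotes become %22.
    return BASE_URL + "?" + quote(raw, safe="{}[]:,")

url = build_item_details_url(["5311634", "5311597"])
print(url)
```

The resulting string can be passed to requests.get in place of the hard-coded page_url.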

To get the product name:

print(res['products'][0]['product']['displayName'])

Output:

Smartphone iPhone 7 PLUS 32GB

If you only want to print the value $ 379.990:

print(res['products'][0]['product']['prices'][0]['symbol'] +res['products'][0]['product']['prices'][0]['originalPrice'] )
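Indexing straight into res['products'][0]['product']['prices'] raises an exception as soon as one level is missing. A slightly more defensive version could look like the sketch below. The `sample` dictionary is hand-written to mirror the keys used above (products, product, prices, symbol, originalPrice, displayName); it is not real API output, and `format_prices` is a hypothetical helper name.

```python
def format_prices(res):
    """Return 'symbol + originalPrice' strings for the first product.

    Entries lacking either key are skipped instead of raising KeyError.
    """
    products = res.get("products") or []
    if not products:
        return []
    prices = products[0].get("product", {}).get("prices", [])
    return [
        p["symbol"] + p["originalPrice"]
        for p in prices
        if "symbol" in p and "originalPrice" in p
    ]

# Hand-written sample shaped like the response described above (assumption).
sample = {
    "products": [{
        "product": {
            "displayName": "Smartphone iPhone 7 PLUS 32GB",
            "prices": [
                {"symbol": "$ ", "originalPrice": "379.990"},
                {"symbol": "$ ", "originalPrice": "569.990"},
                {"label": "entry without a price"},
            ],
        }
    }]
}

print(format_prices(sample))  # ['$ 379.990', '$ 569.990']
print(format_prices({}))      # []
```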

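The problem is that a JS script inserts this HTML node dynamically after the page has loaded. requests only retrieves the raw HTML and does not wait for scripts to run.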
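A quick way to confirm this diagnosis is to check whether the class appears in the HTML the server actually sends. The sketch below uses only the standard library; `raw_html` and `rendered_html` are made-up stand-ins for the server response and for the DOM after scripts have run, not copies of the real page.

```python
from html.parser import HTMLParser

class ClassFinder(HTMLParser):
    """Collect the tag names of elements that carry a given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted_class in classes:
            self.matches.append(tag)

def has_class(html, wanted_class):
    finder = ClassFinder(wanted_class)
    finder.feed(html)
    return bool(finder.matches)

# Made-up examples: what the server sends vs. the DOM after scripts ran.
raw_html = '<div id="price"></div><script>var cfg = {"price": "379.990"};</script>'
rendered_html = '<div id="price"><p class="fb-price">$ 379.990</p></div>'

print(has_class(raw_html, "fb-price"))       # False
print(has_class(rendered_html, "fb-price"))  # True
```

If the check returns False for the text of a requests response while the element is visible in the browser's inspector, the node is being created client-side.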

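You can use a headless browser such as Chrome WebDriver, which loads the page in a real browser, lets the scripts run, and then gives you access to the live DOM. Here is an example of how to use it once it is installed: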

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

url = "https://www.falabella.com/falabella-cl/product/5311682/Smartphone-iPhone-7-PLUS-32GB/5311682/"
opts = Options()
opts.add_argument("--headless")
opts.add_argument("log-level=3") # suppress console noise
driver = webdriver.Chrome(options=opts)
try:
    driver.get(url)
    # find_element_by_class_name was removed in Selenium 4; use find_element with By instead
    print(driver.find_element(By.CLASS_NAME, "fb-price").text) # => $ 379.990
finally:
    driver.quit() # always close the browser process
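The example above looks up fb-price immediately after driver.get(url). If the script inserts the node slightly later, that lookup can fail, so an explicit wait is a common safeguard. The following is a sketch only: `get_price_text` and the 10-second default timeout are assumptions, and the Selenium imports are done inside the function so the snippet can be loaded even where Selenium is not installed.

```python
def get_price_text(driver, url, timeout=10):
    """Hypothetical helper: load url and return the text of the first .fb-price element."""
    # Imported lazily so this module can be imported without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    # Poll the DOM until the element is present; raises TimeoutException
    # if it has not appeared after `timeout` seconds.
    element = WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CLASS_NAME, "fb-price"))
    )
    return element.text
```

It would be called as get_price_text(driver, url) with a driver created as in the example above.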

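As the other answer points out, another good option is to make the same API call to the URL that the script itself uses to fetch the data. That approach needs no browser or driver to be installed, so it is very lightweight, and the API may be less brittle than the class names (or vice versa).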

This is very hacky; for real-world use cases I would recommend this instead: Web-scraping JavaScript page with Python


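By downloading the raw HTML with cURL and using grep (in your case, you can search the source code in your browser's "Sources" tab), I found that the price is stored in the fbra_browseMainProductConfig variable. Using BeautifulSoup, I was able to extract the script that contains it: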

import requests, re
from bs4 import BeautifulSoup
url = "https://www.falabella.com/falabella-cl/product/5311682/Smartphone-iPhone-7-PLUS-32GB/5311682/"
# name the parser explicitly so bs4 does not have to guess one (and warn about it)
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# grab the text where it has `fbra_browseMainProductConfig` in it, and strip the extra whitespace
script_contents = soup.find_all(string=re.compile("fbra_browseMainProductConfig"))[0].strip()

From there, I inspected the output and found that the first line is the fbra_browseMainProductConfig declaration. So:

import json
# split the contents of the script tag into lines, take the first element (0th index), strip any additional whitespace
mainProductConfigLine = script_contents.splitlines()[0].strip()
# split the variable from the declaration, JSON that (removing the ending semicolon)
mainProductConfig = json.loads(mainProductConfigLine.split(" = ",1)[1][:-1])
# grab the prices (plural, there are more than one)
# in order to find the key, I messed around with the dict in a Python REPL and found it
prices = [price["originalPrice"] for price in mainProductConfig["state"]["product"]["prices"] if "originalPrice" in price]
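Splitting on " = " and dropping the last character works as long as the declaration is the first line and ends with exactly one semicolon. A somewhat more tolerant variant searches every line for the variable name and makes the semicolon optional. This is a sketch under assumptions: `parse_js_assignment` is a hypothetical helper, the `script` string is a made-up miniature of the real script tag, and the value is assumed to be valid JSON on a single line.

```python
import json
import re

def parse_js_assignment(script_text, var_name):
    """Return the JSON value assigned to var_name on a single line, or None if absent."""
    # Capture everything after "name =" up to an optional trailing semicolon.
    pattern = re.compile(r"\b" + re.escape(var_name) + r"\s*=\s*(.+?);?\s*$")
    for line in script_text.splitlines():
        match = pattern.search(line.strip())
        if match:
            return json.loads(match.group(1))
    return None

# Made-up miniature of the script contents (assumption, not the real page).
script = (
    'var fbra_browseMainProductConfig = {"state": {"product": {"prices": '
    '[{"originalPrice": "379.990"}, {"type": 3}]}}};\n'
    "var other = 1;"
)

config = parse_js_assignment(script, "fbra_browseMainProductConfig")
prices = [
    p["originalPrice"]
    for p in config["state"]["product"]["prices"]
    if "originalPrice" in p
]
print(prices)  # ['379.990']
```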

Hope this helps!

