Beautiful Soup returns None on an existing element

Question

I am trying to scrape the price of a product. Here is my code:

from bs4 import BeautifulSoup as soup
import requests

page_url = "https://www.falabella.com/falabella-cl/product/5311682/Smartphone-iPhone-7-PLUS-32GB/5311682/"
headers={
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
}
uClient = requests.get(page_url, headers=headers)
print(uClient)
page_soup = soup(uClient.content, "html.parser") #requests
test = page_soup.find("p", {"class":"fb-price"})
print(test)

But I get the following response instead of the desired price:

<Response [200]>
None

I have checked with Chrome developer tools that the element exists. URL: https://www.falabella.com/falabella-cl/product/5311682/Smartphone-iPhone-7-PLUS-32GB/5311682/

Answer 1

If you open the Network tab, you will find the following link, which retrieves the data in JSON format. You can do this without Selenium or Beautiful Soup.

Url="https://www.falabella.com/rest/model/falabella/rest/browse/BrowseActor/fetch-item-details?{%22products%22:[{%22productId%22:%225311634%22},{%22productId%22:%225311597%22},{%22productId%22:%225311505%22},{%22productId%22:%226009874%22},{%22productId%22:%225311494%22},{%22productId%22:%225311510%22},{%22productId%22:%226009845%22},{%22productId%22:%226009871%22},{%22productId%22:%226009868%22},{%22productId%22:%226009774%22},{%22productId%22:%226782957%22},{%22productId%22:%226009783%22},{%22productId%22:%226782958%22},{%22productId%22:%228107608%22},{%22productId%22:%228107640%22},{%22productId%22:%226009875%22},{%22productId%22:%226782967%22},{%22productId%22:%226782922%22}]}"
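The query string in that link appears to be a percent-encoded JSON object listing product IDs. As a rough sketch, assuming the endpoint accepts any list of IDs in that shape (I have not verified this), the URL could be built rather than pasted. The helper name build_item_details_url is made up for illustration.

```python
import json
from urllib.parse import quote

# Endpoint path copied from the link above; whether it accepts arbitrary
# product IDs is an assumption, not something confirmed here.
BASE_URL = "https://www.falabella.com/rest/model/falabella/rest/browse/BrowseActor/fetch-item-details"


def build_item_details_url(product_ids):
    """Build a URL whose query is the JSON payload with only the quotes escaped."""
    payload = {"products": [{"productId": str(pid)} for pid in product_ids]}
    # Compact separators avoid spaces, matching the shape of the link above.
    compact = json.dumps(payload, separators=(",", ":"))
    # Leave braces, brackets, colons and commas as-is; '"' becomes %22.
    return BASE_URL + "?" + quote(compact, safe="{}[]:,")


print(build_item_details_url(["5311634", "5311597"]))
```

The printed URL has the same layout as the one from the Network tab, just with fewer IDs.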

Try the code below.

import requests

page_url = "https://www.falabella.com/rest/model/falabella/rest/browse/BrowseActor/fetch-item-details?{%22products%22:[{%22productId%22:%225311634%22},{%22productId%22:%225311597%22},{%22productId%22:%225311505%22},{%22productId%22:%226009874%22},{%22productId%22:%225311494%22},{%22productId%22:%225311510%22},{%22productId%22:%226009845%22},{%22productId%22:%226009871%22},{%22productId%22:%226009868%22},{%22productId%22:%226009774%22},{%22productId%22:%226782957%22},{%22productId%22:%226009783%22},{%22productId%22:%226782958%22},{%22productId%22:%228107608%22},{%22productId%22:%228107640%22},{%22productId%22:%226009875%22},{%22productId%22:%226782967%22},{%22productId%22:%226782922%22}]}"
headers={
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
}
response=requests.get(page_url, headers=headers)
res=response.json()
for item in res['products'][0]['product']['prices']:
    print(item['symbol'] + item['originalPrice'])

Output:

$ 379.990
$ 569.990

To get the product name:

print(res['products'][0]['product']['displayName'])

Output:

Smartphone iPhone 7 PLUS 32GB

If you are only looking for the value $ 379.990, print this:

print(res['products'][0]['product']['prices'][0]['symbol'] +res['products'][0]['product']['prices'][0]['originalPrice'] )
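Indexing straight into the response raises KeyError or IndexError if a key is missing or the product list is empty. Here is a slightly more defensive sketch that runs offline. The sample_response dictionary is a hypothetical stand-in that only mirrors the keys used above (products, product, prices, symbol, originalPrice); the real payload likely contains many more fields.

```python
# Hypothetical sample shaped like the keys read above; not real API output.
sample_response = {
    "products": [
        {
            "product": {
                "displayName": "Smartphone iPhone 7 PLUS 32GB",
                "prices": [
                    {"symbol": "$ ", "originalPrice": "379.990"},
                    {"symbol": "$ ", "originalPrice": "569.990"},
                    {"symbol": "$ "},  # entry without a price, to show the guard
                ],
            }
        }
    ]
}


def list_prices(res):
    """Return 'symbol + originalPrice' strings, skipping entries without a price."""
    lines = []
    for entry in res.get("products", []):
        product = entry.get("product", {})
        for price in product.get("prices", []):
            if "originalPrice" in price:
                lines.append(price.get("symbol", "") + price["originalPrice"])
    return lines


print(list_prices(sample_response))  # ['$ 379.990', '$ 569.990']
```

With a live response, you would pass response.json() to list_prices instead of the sample.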

Answer 2

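The problem is that a JS script inserts this HTML node dynamically after the page loads. The request retrieves only the raw HTML and does not wait for scripts to run.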
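To illustrate, here is a minimal offline sketch using the standard library's html.parser, which is the backend Beautiful Soup uses when you pass "html.parser". The RAW_HTML markup is a made-up stand-in for what requests receives, not the real page: the price only exists inside a script, so the parser never sees a p element with class fb-price.

```python
from html.parser import HTMLParser

# Hypothetical raw HTML: the price node is created by JavaScript at runtime.
RAW_HTML = """
<html><body>
<div id="price-container"></div>
<script>
  var p = document.createElement("p");
  p.className = "fb-price";
  p.textContent = "$ 379.990";
  document.getElementById("price-container").appendChild(p);
</script>
</body></html>
"""


class TagCollector(HTMLParser):
    """Record every start tag the parser encounters, with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


collector = TagCollector()
collector.feed(RAW_HTML)
collector.close()

# The script body is treated as plain text, so no <p class="fb-price"> exists.
price_tags = [t for t in collector.tags if t[0] == "p" and t[1].get("class") == "fb-price"]
print(price_tags)             # []
print("379.990" in RAW_HTML)  # True: the text is in the source, just not as an element
```

A browser's developer tools show the DOM after scripts have run, which is why the element is visible there but not in the response body.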

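You can use a headless browser such as Chrome WebDriver, which can wait for the page to load in real time and then access the DOM dynamically. Here is an example of how to use it once installed: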

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

url = "https://www.falabella.com/falabella-cl/product/5311682/Smartphone-iPhone-7-PLUS-32GB/5311682/"
opts = Options()
opts.add_argument("--headless")
opts.add_argument("log-level=3") # suppress console noise
driver = webdriver.Chrome(options=opts)
driver.get(url)

# find_element_by_class_name was removed in Selenium 4; use find_element with By
print(driver.find_element(By.CLASS_NAME, "fb-price").text) # => $ 379.990
driver.quit() # close the browser when done

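As pointed out in the other answer, another good option is to make the same API call to the URL that the script uses to fetch the data. This approach requires no extra installs or imports beyond requests, so it is very lightweight, and the API may be less brittle than the class names (or vice versa).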

Answer 3

This is very hacky; for real use cases I would recommend this instead: Web-scraping JavaScript page with Python

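By downloading the raw HTML with cURL and using grep (in your case, you could search the source code in the Sources tab of your browser's developer tools), I was able to find that the price is stored in the fbra_browseMainProductConfig variable. Using BeautifulSoup, I was able to extract the script that contains it: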

import requests, re
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get("https://www.falabella.com/falabella-cl/product/5311682/Smartphone-iPhone-7-PLUS-32GB/5311682/").content)
# grab the text where it has `fbra_browseMainProductConfig` in it, and strip the extra whitespace
script_contents = soup(text=re.compile("fbra_browseMainProductConfig"))[0].strip()

From there, I inspected the output and found that the first line is the fbra_browseMainProductConfig declaration. So:

import json
# split the contents of the script tag into lines, take the first element (0th index), strip any additional whitespace
mainProductConfigLine = script_contents.splitlines()[0].strip()
# split the variable from the declaration, JSON that (removing the ending semicolon)
mainProductConfig = json.loads(mainProductConfigLine.split(" = ",1)[1][:-1])
# grab the prices (plural, there are more than one)
# in order to find the key, I messed around with the dict in a Python REPL and found it
prices = [price["originalPrice"] for price in mainProductConfig["state"]["product"]["prices"] if "originalPrice" in price]
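Splitting on " = " and trimming the last character depends on the declaration sitting on one line and ending in exactly one semicolon. As a possible alternative (a swapped-in technique, not what I ran against the page), json.JSONDecoder().raw_decode parses one JSON value starting at a given offset and ignores whatever follows it. The sample_script below is a hypothetical, heavily shortened stand-in for the real script, and extract_js_object is an illustrative helper name.

```python
import json
import re

# Hypothetical, shortened stand-in for the page's inline script contents.
sample_script = """
var fbra_browseMainProductConfig = {"state": {"product": {"displayName": "Smartphone iPhone 7 PLUS 32GB", "prices": [{"symbol": "$ ", "originalPrice": "379.990"}, {"symbol": "$ ", "originalPrice": "569.990"}, {"label": "entry without a price"}]}}};
var somethingElse = 1;
"""


def extract_js_object(script_text, variable_name):
    """Return the JSON value assigned to variable_name, or None if it is absent."""
    match = re.search(r"\b" + re.escape(variable_name) + r"\s*=\s*", script_text)
    if match is None:
        return None
    # raw_decode reads a single JSON value at the offset and stops there,
    # so the trailing semicolon and later statements do not matter.
    value, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return value


config = extract_js_object(sample_script, "fbra_browseMainProductConfig")
prices = [p["originalPrice"] for p in config["state"]["product"]["prices"] if "originalPrice" in p]
print(prices)  # ['379.990', '569.990']
```

This only works when the assigned value is strict JSON; a JavaScript object literal with unquoted keys or single quotes would still need a different approach.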

Hope this helps!

3 answers

Answer 1: score 3, accepted, 2019-12-17 16:21:27
Answer 2: score 2, 2019-12-17 16:15:20
Answer 3: score 1, 2019-12-17 16:25:07