Beautiful Soup returns None on existing element

I'm trying to scrape the price of a product. Here's my code:

from bs4 import BeautifulSoup as soup
import requests

page_url = "https://www.falabella.com/falabella-cl/product/5311682/Smartphone-iPhone-7-PLUS-32GB/5311682/"
headers={
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
}
uClient = requests.get(page_url, headers=headers)
print(uClient)
page_soup = soup(uClient.content, "html.parser") #requests
test = page_soup.find("p", {"class":"fb-price"})
print(test)

But I get the following response instead of the desired price:

<Response [200]>
None

I have checked that the element exists using Chrome developer tools. URL: https://www.falabella.com/falabella-cl/product/5311682/Smartphone-iPhone-7-PLUS-32GB/5311682/

If you open the Network tab in the developer tools, you will find the following link, which retrieves the data in JSON format. You can get the price from it without Selenium or Beautiful Soup.

Url="https://www.falabella.com/rest/model/falabella/rest/browse/BrowseActor/fetch-item-details?{%22products%22:[{%22productId%22:%225311634%22},{%22productId%22:%225311597%22},{%22productId%22:%225311505%22},{%22productId%22:%226009874%22},{%22productId%22:%225311494%22},{%22productId%22:%225311510%22},{%22productId%22:%226009845%22},{%22productId%22:%226009871%22},{%22productId%22:%226009868%22},{%22productId%22:%226009774%22},{%22productId%22:%226782957%22},{%22productId%22:%226009783%22},{%22productId%22:%226782958%22},{%22productId%22:%228107608%22},{%22productId%22:%228107640%22},{%22productId%22:%226009875%22},{%22productId%22:%226782967%22},{%22productId%22:%226782922%22}]}"

Try the code below.

import requests

page_url = "https://www.falabella.com/rest/model/falabella/rest/browse/BrowseActor/fetch-item-details?{%22products%22:[{%22productId%22:%225311634%22},{%22productId%22:%225311597%22},{%22productId%22:%225311505%22},{%22productId%22:%226009874%22},{%22productId%22:%225311494%22},{%22productId%22:%225311510%22},{%22productId%22:%226009845%22},{%22productId%22:%226009871%22},{%22productId%22:%226009868%22},{%22productId%22:%226009774%22},{%22productId%22:%226782957%22},{%22productId%22:%226009783%22},{%22productId%22:%226782958%22},{%22productId%22:%228107608%22},{%22productId%22:%228107640%22},{%22productId%22:%226009875%22},{%22productId%22:%226782967%22},{%22productId%22:%226782922%22}]}"
headers={
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
}
response=requests.get(page_url, headers=headers)
res=response.json()
for item in res['products'][0]['product']['prices']:
    print(item['symbol'] + item['originalPrice'])

Output:

$ 379.990
$ 569.990

To get the product name:

print(res['products'][0]['product']['displayName'])

Output:

Smartphone iPhone 7 PLUS 32GB
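
The lookups above index straight into the response, so a missing key or an empty product list would raise an exception. Here is a minimal sketch of a more forgiving version. The helper name `format_prices` and the `SAMPLE_PAYLOAD` dictionary are made up for illustration. The sample mirrors only the keys used in this answer and is not a captured response, and the assumption that `symbol` already carries a trailing space comes only from the output shown above.

```python
def format_prices(payload):
    """Return 'symbol + originalPrice' strings for the first product, or [] if absent."""
    products = payload.get("products") or []
    if not products:
        return []
    product = products[0].get("product") or {}
    formatted = []
    for price in product.get("prices") or []:
        # Skip entries that lack either field instead of raising KeyError.
        if "symbol" in price and "originalPrice" in price:
            formatted.append(price["symbol"] + price["originalPrice"])
    return formatted


# Hand-written stand-in for response.json(); only the keys used above are included.
SAMPLE_PAYLOAD = {
    "products": [
        {
            "product": {
                "displayName": "Smartphone iPhone 7 PLUS 32GB",
                "prices": [
                    {"symbol": "$ ", "originalPrice": "379.990"},
                    {"symbol": "$ ", "originalPrice": "569.990"},
                ],
            }
        }
    ]
}

print(format_prices(SAMPLE_PAYLOAD))
print(format_prices({}))
```

With a live response you would pass `response.json()` instead of the sample dictionary.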

If you are only looking for the value $ 379.990, then print this:

print(res['products'][0]['product']['prices'][0]['symbol'] +res['products'][0]['product']['prices'][0]['originalPrice'] )
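
The long query string in the URL above appears to be compact JSON with only the double quotes percent-encoded. Assuming that is all there is to it, a sketch like the following builds the same kind of URL from a list of product IDs. The function name `build_item_details_url` is made up for illustration, and whether the endpoint accepts an arbitrary list of IDs is an assumption that the answer does not confirm.

```python
import json
from urllib.parse import quote

BASE_URL = (
    "https://www.falabella.com/rest/model/falabella/rest/browse/"
    "BrowseActor/fetch-item-details"
)


def build_item_details_url(product_ids):
    """Build a fetch-item-details URL in the same shape as the one shown above."""
    payload = {"products": [{"productId": str(pid)} for pid in product_ids]}
    # Compact separators avoid spaces, matching the original query string.
    query = json.dumps(payload, separators=(",", ":"))
    # Leave braces, brackets, colons and commas unescaped; quotes become %22.
    return BASE_URL + "?" + quote(query, safe="{}[]:,")


print(build_item_details_url(["5311634", "5311597"]))
```

The resulting string can be passed to `requests.get` in place of the hard-coded `page_url`.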

The problem is that a JS script inserts this HTML node dynamically after the page loads. The request retrieves only the raw HTML and doesn't wait around for scripts to run.
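
To see the effect in isolation, here is a small sketch using only the standard library's `html.parser`, which is the module behind Beautiful Soup's `"html.parser"` option. The two HTML strings are invented for illustration and are not copies of the real page: one is what a plain HTTP request might return, the other is what the DOM might look like after the script has run.

```python
from html.parser import HTMLParser

# Hypothetical raw HTML: the price markup exists only as text inside a script.
RAW_HTML = """
<html><body>
<div id="price-box"></div>
<script>
  document.getElementById("price-box").innerHTML =
      '<p class="fb-price">$ 379.990</p>';
</script>
</body></html>
"""

# Hypothetical DOM after the browser has executed the script.
RENDERED_HTML = """
<html><body>
<div id="price-box"><p class="fb-price">$ 379.990</p></div>
</body></html>
"""


class ClassFinder(HTMLParser):
    """Record whether a start tag with the given name and CSS class is seen."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.found = True


def has_element(html, tag, css_class):
    finder = ClassFinder(tag, css_class)
    finder.feed(html)
    return finder.found


print(has_element(RAW_HTML, "p", "fb-price"))
print(has_element(RENDERED_HTML, "p", "fb-price"))
```

The parser treats the contents of a script element as plain text, so the markup inside it never becomes an element. This is why `find` returns None even though the element is visible in the browser's developer tools.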

You can use a headless browser such as Chrome WebDriver, which can wait for the page to load and then lets you access the DOM after the scripts have run. Here's a sample of how you could use this after installing it:

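from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

url = "https://www.falabella.com/falabella-cl/product/5311682/Smartphone-iPhone-7-PLUS-32GB/5311682/"
opts = Options()
opts.add_argument("--headless")
opts.add_argument("log-level=3") # suppress console noise
driver = webdriver.Chrome(options=opts)
try:
    driver.get(url)
    # find_element_by_class_name was removed in Selenium 4, so use a By locator.
    # Wait up to 10 seconds for the script-inserted node to appear.
    price = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "fb-price"))
    )
    print(price.text) # => $ 379.990
finally:
    driver.quit() # always close the browser, even on failure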

As pointed out in the other answer, another good option is to make the same API call to the URL that the script uses to access the data. There's nothing extra to install or import using this approach, so it's very lightweight, and the API may be less brittle than the class name (or vice versa).

This is extremely hacky, and for real use cases, I would suggest using this: Web-scraping JavaScript page with Python


By downloading the raw HTML via cURL and using grep (in your case, you could search the page source in the Sources tab of the browser's developer tools), I was able to find that the price was stored in the fbra_browseMainProductConfig variable. Using BeautifulSoup, I was able to pull the script for it:

import requests, re
from bs4 import BeautifulSoup
# name the parser explicitly so bs4 does not warn and pick one on its own
soup = BeautifulSoup(requests.get("https://www.falabella.com/falabella-cl/product/5311682/Smartphone-iPhone-7-PLUS-32GB/5311682/").content, "html.parser")
# grab the text where it has `fbra_browseMainProductConfig` in it, and strip the extra whitespace
script_contents = soup(text=re.compile("fbra_browseMainProductConfig"))[0].strip()

From there, I checked the output, and found that the first line was the fbra_browseMainProductConfig declaration. So:

import json
# split the contents of the script tag into lines, take the first element (0th index), strip any additional whitespace
mainProductConfigLine = script_contents.splitlines()[0].strip()
# split the variable from the declaration, JSON that (removing the ending semicolon)
mainProductConfig = json.loads(mainProductConfigLine.split(" = ",1)[1][:-1])
# grab the prices (plural, there are more than one)
# in order to find the key, I messed around with the dict in a Python REPL and found it
prices = [price["originalPrice"] for price in mainProductConfig["state"]["product"]["prices"] if "originalPrice" in price]
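
The split above depends on the declaration being the first line of the script and ending in exactly one semicolon. A slightly less fragile sketch, assuming the value assigned to the variable is valid JSON, is to locate the assignment with a regular expression and let `json.JSONDecoder.raw_decode` read one JSON value from that position, ignoring whatever follows. The helper name `extract_js_object` and the `SAMPLE_SCRIPT` text are made up for illustration. The sample reuses only the keys from this answer, plus one invented entry without `originalPrice` to exercise the filter.

```python
import json
import re


def extract_js_object(script_text, var_name):
    """Return the JSON value assigned to var_name in script_text, or None."""
    match = re.search(re.escape(var_name) + r"\s*=\s*", script_text)
    if match is None:
        return None
    # raw_decode parses a single JSON value starting at the given index and
    # returns it with the end position, so trailing ';' and code are ignored.
    value, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return value


# Hand-written stand-in for the contents of the script tag.
SAMPLE_SCRIPT = """
var fbra_browseMainProductConfig = {"state": {"product": {"prices": [
    {"originalPrice": "379.990"},
    {"originalPrice": "569.990"},
    {"label": "entry without a price"}
]}}};
var somethingElse = 1;
"""

config = extract_js_object(SAMPLE_SCRIPT, "fbra_browseMainProductConfig")
sample_prices = [
    price["originalPrice"]
    for price in config["state"]["product"]["prices"]
    if "originalPrice" in price
]
print(sample_prices)
```

If the assigned value is not valid JSON (for example, a JavaScript object with unquoted keys), `raw_decode` raises `json.JSONDecodeError`, so this only works for pages that embed real JSON.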

Hope this helps!

