Beautiful Soup find returns [] or None

I'm making my first little web scraping program. I'm trying to get the price of a product but soup.find returns "None".

import requests
from bs4 import BeautifulSoup

site = 'https://www.pichau.com.br/placa-de-video-asus-geforce-gtx-1650-dual-4gb-gddr5-128-bit-dual-gtx1650-4g'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36 OPR/82.0.4227.50'}

page = requests.get(site, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
price = soup.find(class_ = 'jss237')

print(price)

This returns None. However, if I use the class of the box that wraps the whole product block, like this:

price = soup.find(class_ = 'MuiGrid-root MuiGrid-item MuiGrid-grid-xs-12 MuiGrid-grid-sm-5').get_text()

it returns everything, including the prices that I'm trying to get:

Placa de Video Asus GeForce GTX 1650 Dual, 4GB, GDDR5, 128-bit, DUAL-GTX1650-4G...SKU: DUAL-GTX1650-4Gà vistaR$1.989,00no PIX com 12% descontoR$ 2.260,23em até 12x de 188,35sem juros no cartão CaracterísticasGarantia: 12 Meses

The .jssN class names (such as jss237) appear to be auto-generated, or to vary between A/B test variants. I noticed they were changing from load to load after posting my initial answer (see the edit history if you want to see the old solution), so they can't be relied on as selectors.
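One rough way to see which class names are safe to select on is to compare the classes present in two separate loads of the page. The sketch below uses two made-up HTML snapshots (load_a and load_b are hypothetical, not the real page source) and a hypothetical helper classes_in; only the class names that survive both loads are worth depending on.

```python
from bs4 import BeautifulSoup


def classes_in(html):
    """Return the set of all class names used anywhere in the markup."""
    soup = BeautifulSoup(html, "html.parser")
    # class is a multi-valued attribute, so tag["class"] is a list of names.
    return {name for tag in soup.find_all(class_=True) for name in tag["class"]}


# Hypothetical snapshots of the same page from two separate loads.
load_a = '<div class="MuiGrid-root MuiGrid-item"><div class="jss237">R$1.989,00</div></div>'
load_b = '<div class="MuiGrid-root MuiGrid-item"><div class="jss412">R$1.989,00</div></div>'

# Class names that appear in both loads are the stable ones.
stable = classes_in(load_a) & classes_in(load_b)
print(sorted(stable))  # ['MuiGrid-item', 'MuiGrid-root']
```

In a real run you would feed in the text of two separate requests.get responses instead of the inline strings.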

The primary price is available in the static markup as metadata:

<meta property="product:price:amount" content="R$1.989,00" />

Select that with

print(soup.select_one('[property="product:price:amount"]')['content'])
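The content value is a Brazilian-formatted string, so it usually needs converting before you can compare or sum prices. A minimal sketch, assuming the value always looks like "R$1.989,00" (period for thousands, comma for decimals); parse_brl is a hypothetical helper name, not part of any library:

```python
from decimal import Decimal


def parse_brl(text):
    """Convert a price such as 'R$1.989,00' or 'R$ 2.260,23' to a Decimal."""
    digits = text.replace("R$", "").strip()
    # In pt-BR formatting '.' separates thousands and ',' marks the decimals.
    return Decimal(digits.replace(".", "").replace(",", "."))


print(parse_brl("R$1.989,00"))   # 1989.00
print(parse_brl("R$ 2.260,23"))  # 2260.23
```

Decimal avoids the rounding surprises that float can introduce with currency values.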

If you want 188,35, you could use some nearby expected text, an icon, or the DOM structure to identify it, or use a regex on the body text to grab price-looking substrings:

import re
import requests
from bs4 import BeautifulSoup

url = "https://www.pichau.com.br/placa-de-video-asus-geforce-gtx-1650-dual-4gb-gddr5-128-bit-dual-gtx1650-4g"
response = requests.get(url)
response.raise_for_status()
soup = BeautifulSoup(response.content, "lxml")
body = soup.select_one("body").decode_contents()
print(re.findall(r"\b(?:\d+\.)?\d+,\d{2,}\b", body)) 
# => ['1.989,00', '2.260,23', '188,35']

You can be more specific than body to reduce false positives, at the risk of depending on that selector existing (use case dependent).
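As an illustration of anchoring on nearby text, the installment price can be picked out by the "em até 12x de" wording that precedes it. This is only a sketch: it runs on the flattened text from the question's output (abridged) rather than on a live response, and the pattern and the find_installment helper are assumptions that will break if the site changes its wording.

```python
import re

# Flattened text as printed in the question (abridged).
text = (
    "à vistaR$1.989,00no PIX com 12% desconto"
    "R$ 2.260,23em até 12x de 188,35sem juros no cartão"
)

# Capture the installment count and the amount that follows "em até Nx de".
INSTALLMENT = re.compile(r"em até\s*(\d+)x de\s*(\d{1,3}(?:\.\d{3})*,\d{2})")


def find_installment(s):
    """Return (count, amount) for the installment offer, or None if absent."""
    match = INSTALLMENT.search(s)
    return (int(match.group(1)), match.group(2)) if match else None


print(find_installment(text))  # (12, '188,35')
```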

Note that I used lxml, which is faster than html.parser and tends to cope better with malformed markup, but you can use html.parser if you don't have lxml installed.
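If the script has to run on machines that may or may not have lxml, one option is to try lxml first and fall back to the built-in parser. A small sketch; make_soup is a hypothetical helper, and the inline markup stands in for the real response body:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when it is installed, otherwise use html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is unavailable.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup('<meta property="product:price:amount" content="R$1.989,00" />')
print(soup.select_one('[property="product:price:amount"]')["content"])  # R$1.989,00
```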

