
Python / Selenium / Beautiful Soup not scraping desired elements

I'm struggling to get this code to extract the desired information from one single page.

I've tried the usual Selenium tactics and added a time delay. Hopefully it's something simple. I'm not getting any error messages.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup as bs
from time import sleep

options = Options()
options.add_argument("--headless")
options.add_argument("window-size=1400,600")
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36"
options.add_argument(f'user-agent={user_agent}')
capabilities = { 'chromeOptions':  { 'useAutomationExtension': False},'args': ['--disable-extensions']}
browser = webdriver.Chrome(executable_path=r'/usr/local/bin/chromedriver',desired_capabilities = capabilities,options=options)

url='https://groceries.asda.com/product/celery-spring-onions/asda-growers-selection-trimmed-spring-onions/41676'

browser.get(url)
sleep(3)
source_data = browser.page_source
bs_data = bs(source_data,"html.parser")

#product id
try:
    product_id = bs_data.findAll('span', {'class': 'pdp-main-details__product-code'})
    product_id = product_id.replace('Product code:','').strip()
except:
    product_id = "n/a"

#image address
try:
    for image in bs_data.find("div", {"class":"s7staticimage"}):
        image_url = image.find('img')['src']
except:
    image_url = "n/a"

#product description
try:
    product_desc = bs_data.find('class',{'pdp-main-pdp-main-details__title'})
    product_desc = product_desc.get_text().strip()
except:
    product_desc = "n/a"

#product price
try:
    product_price = bs_data.find('class',{'co-product__price pdp-main-details__price'})
    product_price = product_price.get_text().strip()
except:
    product_price = "n/a"

print (url,'|',image_url,'|',product_id,'|',product_desc,'|',product_price)        


browser.quit()

Any assistance is greatly appreciated.

Thanks

Because the content is generated dynamically, the soup you parse may not contain the elements you are looking for. Selenium on its own is enough here. I don't know why you treated the elements as lists, since there is only one of each on this page.
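As a side note on the list point, here is a minimal sketch of how `find` and `find_all` behave on static markup. The HTML string is made up for illustration (the real page is rendered by JavaScript and its structure may differ), and only the `pdp-main-details__product-code` class name is taken from the question. It also shows why `find('class', ...)` returns nothing: the first argument is the tag name, so that call looks for a `<class>` tag.

```python
from bs4 import BeautifulSoup

# Hypothetical static markup, not copied from the real page.
html = """
<div>
  <h1 class="pdp-main-details__title">Trimmed Spring Onions</h1>
  <span class="pdp-main-details__product-code">Product code: 410212</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() takes the tag name first, then a filter on the class attribute,
# and returns a single Tag (or None when nothing matches).
code_tag = soup.find("span", class_="pdp-main-details__product-code")
if code_tag is not None:
    product_id = code_tag.get_text().replace("Product code:", "").strip()
else:
    product_id = "n/a"

# Passing "class" as the first argument searches for a <class> tag,
# which does not exist, so the result is None.
missing = soup.find("class", {"class": "pdp-main-details__title"})

# find_all() returns a list-like ResultSet; string methods such as
# .replace() have to be applied to its items, not to the ResultSet itself.
matches = soup.find_all("span", class_="pdp-main-details__product-code")
```

With a bare `except:` around each lookup, both mistakes fall through silently to "n/a", which would explain the lack of error messages.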

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument('--headless')
capabilities = { 'chromeOptions':  { 'useAutomationExtension': False},'args': ['--disable-extensions']}
browser = webdriver.Chrome(executable_path='C:/bin/chromedriver.exe',desired_capabilities = capabilities,options=options)
url='https://groceries.asda.com/product/celery-spring-onions/asda-growers-selection-trimmed-spring-onions/41676'

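browser.get(url)
# Poll for up to 15 seconds whenever an element lookup does not match immediately
browser.implicitly_wait(15)
# There is only one product-code element, so a single-element lookup is enough
product_id = browser.find_element_by_class_name('pdp-main-details__product-code')
print(product_id.text)
# The product image sits inside the s7viewer_flyout container
image = browser.find_element_by_xpath("//*[@id=\"s7viewer_flyout\"]/div[1]/img[1]")
image_url = image.get_attribute('src')
print(image_url)
browser.quit()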

Output:

Product code: 410212
https://ui.assets-asda.com/dm/asdagroceries/5050854288142_T1?defaultImage=asdagroceries/noImage&resMode=sharp2&id=PqaST3&fmt=jpg&fit=constrain,1&wid=188&hei=188
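For anyone on a recent Selenium 4 release: as far as I know, the `find_element_by_*` helpers and the `executable_path` and `desired_capabilities` arguments are no longer accepted there. Below is a rough sketch of the same lookups using `By` locators and an explicit wait. The function names `clean_product_code` and `scrape_product` are my own, the selectors are the ones from the code above and may no longer match the live page, and it assumes Selenium can locate chromedriver by itself (or that it is on PATH).

```python
def clean_product_code(raw_text):
    """Strip the 'Product code:' label seen in the output above."""
    return raw_text.replace("Product code:", "").strip()


def scrape_product(url, timeout=15):
    """Return (product_id, image_url) for one product page.

    Selenium is imported inside the function so that the helper above
    can be used on its own without Selenium installed.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    # Assumption: chromedriver is discoverable without an explicit path.
    browser = webdriver.Chrome(options=options)
    try:
        browser.get(url)
        # Wait explicitly until each element is present in the DOM.
        wait = WebDriverWait(browser, timeout)
        code_el = wait.until(EC.presence_of_element_located(
            (By.CLASS_NAME, "pdp-main-details__product-code")))
        image_el = wait.until(EC.presence_of_element_located(
            (By.XPATH, '//*[@id="s7viewer_flyout"]/div[1]/img[1]')))
        return clean_product_code(code_el.text), image_el.get_attribute("src")
    finally:
        # Always close the headless browser, even if a wait times out.
        browser.quit()
```

Calling `scrape_product(url)` with the URL from the question should give the product code without the "Product code:" prefix, plus the image address, provided the page still uses the same markup. If an element never appears, the wait raises a timeout error instead of silently falling back to "n/a".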


 