Python Web Scraping - How To Skip Non-Existent Elements In Iteration

How do I return a different value in my iterator for xpaths that don't exist?

I'm scraping a comparison site but not all elements are present across all the listings due to different offerings.

Problem breakdown:

  1. I start by locating all the listings, storing the resulting elements in the variable 'savings', and using len(savings) as my iteration range.

savings = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="rc-ratetable"]/div/div')))

  2. There is a field specifying the offer mechanics, with the XPath '//*[@id="rc-ratetable"]/div/div[{i}]/div[2]/div[2]/div[4]/p'. However, it is not present on all listings because some have no applicable offers, so when my iterator reaches a listing without it, an exception is thrown.

My current solution: I got it working with try-except, but it takes too long since it has to wait through every exception, and surely there's a better way to do this.

I've included my code below:

import time
import datetime
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
url = 'https://www.ratecity.com.au/savings-accounts'
driver.get(url)

time.sleep(2.5)

driver.execute_script("window.scrollTo(0, 1080)")

time.sleep(2.5)

driver.execute_script("window.scrollTo(0, 2080)")

get_today = datetime.datetime.now()
today = get_today.strftime('%d/%m/%Y')
affiliate = 'RateCity'
rank = 1

results = [['Date', 'Affiliate', 'Position', 'Provider', 'Product', 'Maximum Rate', 'Standard Rate', 'Max Rate Conditions', 'Savings Details']]

load_more_button = '//*[@id="__next"]/div/main/div[3]/div/div[2]/div[4]/div[1]/div/button'
load_more_clicks = 2

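# click the "Load more" button (load_more_clicks - 1 times) to reveal additional listings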
for i in range(1, load_more_clicks):
    driver.execute_script('arguments[0].click();', WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH, load_more_button))))
    time.sleep(2.5)
    
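# expand every truncated " ...read more" description so the full text is available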
read_more = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="rc-ratetable"]/div/div//span[contains(text()," ...read more")]')))
for i in range(1, len(read_more) + 1):
    read_more_button = '//*[@id="rc-ratetable"]/div/div//span[contains(text()," ...read more")]'
    driver.execute_script('arguments[0].click();', WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH, read_more_button))))
    time.sleep(1.5)
    
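# collect all listing rows; len(savings) drives the iteration below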
savings = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="rc-ratetable"]/div/div')))
    
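# try to read the optional conditions field for each listing; fall back when it is absent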
for i in range(1, len(savings) + 1):

    try:
        savings_conditions = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f'//*[@id="rc-ratetable"]/div/div[{i}]/div[2]/div[2]/div[4]/p'))).text.replace(' ...read less', '')
    except:
        savings_conditions = 'No max rate conditions listed'

    print(savings_conditions)

I would like to point out that the slowness isn't entirely caused by the exception handling itself, but rather by the 30-second presence check performed for each element, especially for listings without savings_conditions, e.g. //*[@id="rc-ratetable"]/div/div[2]/div[2]/div[2]/div[4]/p. Since a presence check has already been done when retrieving savings, you can simply retrieve savings_conditions without waiting again:

for i in range(1, len(savings) + 1):
    try:
        savings_conditions = driver.find_element(
            By.XPATH, f'//*[@id="rc-ratetable"]/div/div[{i}]/div[2]/div[2]/div[4]/p').text.replace(
            ' ...read less', '')
    except NoSuchElementException:
        savings_conditions = 'No max rate conditions listed'
    print(savings_conditions)

In case you do need to wait for those elements to load, you can reduce the WebDriverWait timeout, for example to 3 seconds instead of 30, as follows.

from selenium.common.exceptions import TimeoutException

for i in range(1, len(savings) + 1):
    try:
        savings_conditions = WebDriverWait(driver, 3).until(EC.presence_of_element_located((By.XPATH, f'//*[@id="rc-ratetable"]/div/div[{i}]/div[2]/div[2]/div[4]/p'))).text.replace(' ...read less', '')
    except TimeoutException:
        savings_conditions = 'No max rate conditions listed'

    print(savings_conditions)

In case you do not really need to wait there, since the page is already loaded, you can use the plain driver.find_elements method (provided you did not set implicitly_wait in your code; you should not, since you are using WebDriverWait and the two should not be mixed). driver.find_elements always returns a list of web element objects and never throws an exception: the list is empty if there is no match and non-empty if there is.
So your code could be as follows:

for i in range(1, len(savings) + 1):
    result = driver.find_elements(By.XPATH, f'//*[@id="rc-ratetable"]/div/div[{i}]/div[2]/div[2]/div[4]/p')
    if result:
        savings_conditions = result[0].text.replace(' ...read less', '')
    else:
        savings_conditions = 'No max rate conditions listed'
    print(savings_conditions)
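
Going one step further with the point above that the savings row elements are already located: as a minimal sketch (assuming each row keeps the same internal ./div[2]/div[2]/div[4]/p structure, which is my reading of the XPaths above), you could search relative to each row element instead of rebuilding absolute XPaths by index:

for row in savings:
    # './' scopes the search to this row element; find_elements still
    # returns an empty list when the conditions paragraph is absent
    matches = row.find_elements(By.XPATH, './div[2]/div[2]/div[4]/p')
    if matches:
        savings_conditions = matches[0].text.replace(' ...read less', '')
    else:
        savings_conditions = 'No max rate conditions listed'
    print(savings_conditions)

This avoids re-querying the whole document for every row; note it assumes the savings elements are still attached to the DOM when you read them.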
