
Web Scraping | Python Selenium webdriver find dynamic elements using xpath

Apologies in advance if this long question seems basic!

Given a search query link on a library website:

url = 'https://digi.kansalliskirjasto.fi/search?query=economic%20crisis&orderBy=RELEVANCE'

I want to extract all the useful information of each individual search result for this specific query (20 in total on one page), as marked by the red rectangles in this figure:

[screenshot: search results page with the target fields marked in red]

So far, I have the following code:

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

def run_selenium(URL):
    options = Options()
    options.add_argument("--remote-debugging-port=9222")
    options.headless = True
    
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    
    driver.get(URL)
    pt = "//app-digiweb/ng-component/section/div/div/app-binding-search-results/div/div"
    medias = driver.find_elements(By.XPATH, pt) # expect to obtain a list with 20 elements!!
    print(medias) # >>>>>> result: []
    print("#"*100)
    for i, v in enumerate(medias):
        print(i, v.get_attribute("innerHTML"))

if __name__ == '__main__':
    url = 'https://digi.kansalliskirjasto.fi/search?query=economic%20crisis&orderBy=RELEVANCE'
    run_selenium(URL=url)
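
Aside: on recent Selenium 4 releases the options.headless setter is deprecated (and eventually removed) in favour of passing the headless flag explicitly; a minimal sketch of the current form:

from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # replaces options.headless = True on newer Chrome/Selenium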

Problem

Taking a look at the Inspect section in Chrome:

[screenshot: Chrome DevTools inspection of a search result element]

I have tried several XPaths generated by the Chrome extensions XPath Helper and SelectorsHub, using them as the pt variable in my Python code for this library search engine, but the result is [] or simply nothing.

Using SelectorsHub and hovering over Rel XPath, I get this warning: id & class both look dynamic. Uncheck id & class checkbox to generate rel xpath without them if it is generated with them.
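
One common workaround for dynamic id/class values is to anchor the locator on an attribute fragment that stays stable across page builds. A minimal sketch, assuming the result rows keep a stable result-row substring in their class attribute (this matches the class name used in the update below, but verify it in DevTools):

from selenium.webdriver.common.by import By

# Anchor on a stable class fragment rather than auto-generated ids/classes.
rows = driver.find_elements(By.XPATH, "//div[contains(@class, 'result-row')]")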

Question

Assuming selenium is the tool of choice for scraping pages with dynamic attributes (rather than BeautifulSoup, as recommended here and here), shouldn't driver.find_elements() return a list of 20 elements, each containing all the information to be extracted?

>>>>> UPDATE <<<<< Working solution (though time-inefficient)

As recommended by @JaSON in the answer, I now use WebDriverWait in a try/except block, as follows:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common import exceptions

def get_all_search_details(URL):
    st_t = time.time()
    SEARCH_RESULTS = {}
    options = Options()
    options.headless = True    
    options.add_argument("--remote-debugging-port=9222")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-gpu")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-extensions")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    driver.get(URL)
    print(f"Scraping {driver.current_url}")
    try:
        medias = WebDriverWait(driver, timeout=10).until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'result-row')))
        for media_idx, media_elem in enumerate(medias):
            outer_html = media_elem.get_attribute('outerHTML')
            result = scrap_newspaper(outer_html) # some function to retrieve results
            SEARCH_RESULTS[f"result_{media_idx}"] = result
    except exceptions.StaleElementReferenceException as e:
        print(f"Selenium: {type(e).__name__}: {e.args}")
        return
    except exceptions.NoSuchElementException as e:
        print(f"Selenium: {type(e).__name__}: {e.args}")
        return
    except exceptions.TimeoutException as e:
        print(f"Selenium: {type(e).__name__}: {e.args}")
        return
    except exceptions.SessionNotCreatedException as e:
        print(f"Selenium: {type(e).__name__}: {e.args}")
        return
    except exceptions.WebDriverException as e:
        # catch-all for remaining Selenium errors; must come after its subclasses above
        print(f"Selenium: {type(e).__name__}: {e.args}")
        return
    except Exception as e:
        print(f"Selenium: {type(e).__name__} line {e.__traceback__.tb_lineno} of {__file__}: {e.args}")
        return
    print(f"\t\tFound {len(medias)} media(s) => {len(SEARCH_RESULTS)} search result(s)\tElapsed_t: {time.time()-st_t:.2f} s")
    return SEARCH_RESULTS

if __name__ == '__main__':
    url = 'https://digi.kansalliskirjasto.fi/search?query=economic%20crisis&orderBy=RELEVANCE'
    get_all_search_details(URL=url)

This approach works but seems very time-consuming and inefficient:

Found 20 media(s) => 20 search result(s) Elapsed_t: 15.22 s
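
Most of those 15 s are likely page load plus one get_attribute('outerHTML') round-trip per element. A possible speed-up, sketched under the assumption that scrap_newspaper accepts the raw HTML of one row: wait once, fetch page_source in a single call, and parse it locally with the already-imported BeautifulSoup:

from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# One wait, then a single driver round-trip for the whole page:
WebDriverWait(driver, timeout=10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, 'result-row')))
soup = BeautifulSoup(driver.page_source, 'html.parser')
rows = soup.select('.result-row')  # parsed locally, no further WebDriver calls
results = {f"result_{i}": scrap_newspaper(str(row)) for i, row in enumerate(rows)}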

Answer (@JaSON): This is an answer to Question #2 only, since #1 and #3 (as Prophet already said in the comments) are not valid for SO.

Since you're dealing with dynamic content, find_elements on its own is not what you need: it queries the DOM immediately, before the results have been rendered. Try waiting for the required data to appear:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

medias = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'media')))

Another answer: At the top of the search results there is an option to download the search results as an Excel file, including the newspaper/journal metadata and the text surrounding the hits. Wouldn't that be easier to use than scraping the individual elements? (The Excel file only contains the first 10,000 hits, though...)

[screenshot: loading the results as an Excel file]
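
For completeness, a hedged sketch of automating that export rather than scraping rows; it slots into the script above (options and driver as defined there). The button locator is hypothetical (inspect the page for the real control), the download directory is pinned via Chrome prefs, and reading the file with pandas assumes a standard .xlsx export:

import os, time, glob
import pandas as pd
from selenium.webdriver.common.by import By

# Pin downloads to a known folder (set on options BEFORE creating the driver).
options.add_experimental_option("prefs", {"download.default_directory": "/tmp/digi"})

driver.find_element(By.XPATH, "//button[contains(., 'Excel')]").click()  # hypothetical locator
time.sleep(10)  # crude wait for the download; polling the folder would be more robust

latest = max(glob.glob("/tmp/digi/*.xlsx"), key=os.path.getmtime)
df = pd.read_excel(latest)  # newspaper/journal metadata + surrounding text, per this answer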
