How do I stop Selenium WebDriver after reaching the last page while scraping a website?

Question

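The amount of data (the number of pages) on the site keeps changing, and I need to scrape every page by looping through the pagination. URL: https://monentreprise.bj/page/annonces

The code I tried: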

xpath= "//*[@id='yw3']/li[12]/a"        
while True:
    next_page = driver.find_elements(By.XPATH,xpath)
    if len(next_page) < 1:
        print("No more pages")
        break
    else:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, xpath))).click()
        print('ok')

"ok" is printed over and over and the loop never stops.

Answer 1

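This happens because the condition if len(next_page) < 1 is always False.

For example, I tried the URL monentreprise.bj/page/annonces?Company_page=9999999999999999999999 and it returned page 13, which is the last page. The element matched by your XPath is still present on the last page, so find_elements never returns an empty list.

One thing you could try is to check whether the "next page" button is disabled.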
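
A minimal sketch of that check, assuming the page marks its last page by adding a disabled class to the element matching the CSS selector .pagination .next (an assumption about this site's markup). The helper names has_class and is_last_page are hypothetical. The string "css selector" is the value of Selenium's By.CSS_SELECTOR, written out here so the snippet needs no import. A missing "next" item is also treated as the last page.

```python
def has_class(class_attr, name):
    """Return True if `name` is one of the space-separated class tokens."""
    # get_attribute("class") may return None, so fall back to an empty string
    return name in (class_attr or "").split()


def is_last_page(driver, next_selector=".pagination .next"):
    """Return True when the "next" item is missing or marked as disabled.

    `driver` is any object that offers Selenium's find_elements(by, value).
    `next_selector` is an assumption about the page's markup.
    """
    # "css selector" is the string behind selenium's By.CSS_SELECTOR
    items = driver.find_elements("css selector", next_selector)
    if not items:
        return True
    return has_class(items[0].get_attribute("class"), "disabled")
```

In the loop from the question, the break condition would then become if is_last_page(driver): instead of testing the length of the find_elements result.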

Answer 2

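There are several problems here:

//*[@id='yw3']/li[12]/a is not the correct locator for the "next" pagination button.
A better indicator that the last page has been reached is to check whether the element matched by the CSS selector .pagination .next has the disabled class.
You have to scroll the page down before clicking the "next page" button.
You have to add a delay after clicking the pagination button. Otherwise this will not work.

This code worked for me: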

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

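driver = webdriver.Chrome()
my_url = "https://monentreprise.bj/page/annonces"
driver.get(my_url)
# Parent item of the "next" link; it carries the "disabled" class on the last page
next_page_parent = '.pagination .next'
# The clickable "next" link inside that item
next_page_parent_arrow = '.pagination .next a'
while True:
    # Scroll to the bottom so the pagination controls are in view
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(0.5)
    parent = driver.find_element(By.CSS_SELECTOR,next_page_parent)
    class_name = parent.get_attribute("class")
    # Stop once the "next" item is marked as disabled
    if "disabled" in class_name:
        print("No more pages")
        break
    else:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, next_page_parent_arrow))).click()
        # Give the next page time to load before checking again
        time.sleep(1.5)
        print('ok')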

The output is:

ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
No more pages
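
The loop can also be kept separate from the Selenium calls, which makes the stop condition easy to check without a browser. This is only a sketch: walk_pages, FakeSite and max_pages are hypothetical names, not part of Selenium or of the code above, and the cap is an added guard against a stop condition that never becomes true, as happened in the question.

```python
def walk_pages(is_last_page, go_to_next_page, handle_page, max_pages=500):
    """Visit pages in order and return how many were handled.

    All three callables are hypothetical hooks supplied by the caller:
      is_last_page()    -> bool, True when there is no further page
      go_to_next_page() -> clicks "next" and waits for the new page
      handle_page()     -> scrapes the page currently shown
    max_pages is a safety cap so a wrong stop check cannot loop forever.
    """
    handled = 0
    while handled < max_pages:
        handle_page()
        handled += 1
        if is_last_page():
            print("No more pages")
            break
        go_to_next_page()
    return handled


class FakeSite:
    """In-memory stand-in for a paginated site, used only for the demo."""

    def __init__(self, total_pages):
        self.total_pages = total_pages
        self.current = 1
        self.seen = []

    def is_last_page(self):
        return self.current >= self.total_pages

    def go_to_next_page(self):
        self.current += 1

    def handle_page(self):
        self.seen.append(self.current)


site = FakeSite(total_pages=4)
count = walk_pages(site.is_last_page, site.go_to_next_page, site.handle_page)
print(count, site.seen)  # 4 [1, 2, 3, 4]
```

With Selenium, the three callables would wrap the disabled-class check, the click followed by a wait, and whatever extraction each page needs.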
