
How to stop the selenium webdriver after reaching the last page while scraping the website?

The amount of data (number of pages) on the site keeps changing, and I need to scrape all the pages by looping through the pagination. Website: https://monentreprise.bj/page/annonces

Code I tried:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://monentreprise.bj/page/annonces")

xpath = "//*[@id='yw3']/li[12]/a"
while True:
    next_page = driver.find_elements(By.XPATH, xpath)
    if len(next_page) < 1:
        print("No more pages")
        break
    else:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, xpath))).click()
        print('ok')

ok is printed continuously.

Because the condition if len(next_page) < 1 is always False.

For instance, I tried the URL monentreprise.bj/page/annonces?Company_page=99999999999999999999999 and it gives page 13, which is the last page.
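If the site clamps an oversized Company_page value to the last page, the last page number could be read back from the final URL. A minimal sketch of the URL-parsing part, assuming the server normalizes driver.current_url to the real page (the helper name is made up for illustration; the Company_page parameter comes from the question):

```python
from urllib.parse import urlparse, parse_qs

def company_page(url: str) -> int:
    """Return the Company_page query parameter of a URL, defaulting to 1."""
    params = parse_qs(urlparse(url).query)
    return int(params.get("Company_page", ["1"])[0])

# Hypothetical usage after requesting an oversized page number:
#   last = company_page(driver.current_url)
print(company_page("https://monentreprise.bj/page/annonces?Company_page=13"))  # 13
```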

What you could maybe try is checking whether the "next page" button is disabled.
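The disabled check itself needs no Selenium state; a small sketch of the class-attribute test (the helper name is made up for illustration, and splitting on whitespace avoids false substring matches):

```python
def is_last_page(class_attr: str) -> bool:
    """True when the pagination item's class attribute contains 'disabled'."""
    return "disabled" in class_attr.split()

# In Selenium this string would come from something like:
#   driver.find_element(By.CSS_SELECTOR, ".pagination .next").get_attribute("class")
print(is_last_page("next disabled"))  # True
print(is_last_page("next"))           # False
```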

There are several issues here:

  1. //*[@id='yw3']/li[12]/a is not a correct locator for the next pagination button.
  2. A better indication of the last-page-reached state is to check whether the element matched by the CSS selector .pagination .next contains the disabled class.
  3. You have to scroll the page down before clicking the next-page button.
  4. You have to add a delay after clicking the pagination button; otherwise this will not work.

This code worked for me:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

driver = webdriver.Chrome()
my_url = "https://monentreprise.bj/page/annonces"
driver.get(my_url)
next_page_parent = '.pagination .next'
next_page_parent_arrow = '.pagination .next a'
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(0.5)
    parent = driver.find_element(By.CSS_SELECTOR,next_page_parent)
    class_name = parent.get_attribute("class")
    if "disabled" in class_name:
        print("No more pages")
        break
    else:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, next_page_parent_arrow))).click()
        time.sleep(1.5)
        print('ok')

The output is:

ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
No more pages
