
Python - Selenium next page

I'm trying to scrape Hants.gov.uk with a scraping app, and right now I'm only working on clicking through the pages rather than scraping. When it reaches the last row of page 1 it stops, so what I did was have it click the "Next Page" button, but first it has to go back to the original URL. It clicks through to page 2 fine, but after page 2 has been scraped it doesn't move on to page 3; it just reloads page 2.

Can someone help me fix this?

Code:

import time
import config # Don't worry about this. This is an external file to make a DB
import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://planning.hants.gov.uk/SearchResults.aspx?RecentDecisions=True"

driver = webdriver.Chrome(executable_path=r"C:\Users\Goten\Desktop\chromedriver.exe")
driver.get(url)

driver.find_element_by_id("mainContentPlaceHolder_btnAccept").click()

def start():
    elements = driver.find_elements_by_css_selector(".searchResult a")
    links = [link.get_attribute("href") for link in elements]

    result = []
    for link in links:
        if link not in result:
            result.append(link)
        else:
            driver.get(link)
            goUrl = urllib.request.urlopen(link)
            soup = BeautifulSoup(goUrl.read(), "html.parser")
            #table = soup.find_element_by_id("table", {"class": "applicationDetails"})
            for i in range(20):
                pass # Don't worry about all this commented code, it isn't relevant right now
                #table = soup.find_element_by_id("table", {"class": "applicationDetails"})
                #print(table.text)
            #   div = soup.select("div.applicationDetails")
            #   getDiv = div[i].split(":")[1].get_text()
            #   log = open("log.txt", "a")
            #   log.write(getDiv + "\n")
            #log.write("\n")

start()
driver.get(url)

for i in range(5):
    driver.find_element_by_id("ctl00_mainContentPlaceHolder_lvResults_bottomPager_ctl02_NextButton").click()
    url = driver.current_url
    start()
    driver.get(url)
driver.close()

Try this:

import time
# import config # Don't worry about this. This is an external file to make a DB
import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://planning.hants.gov.uk/SearchResults.aspx?RecentDecisions=True"

driver = webdriver.Chrome()
driver.get(url)

driver.find_element_by_id("mainContentPlaceHolder_btnAccept").click()

result = []


def start():
    elements = driver.find_elements_by_css_selector(".searchResult a")
    links = [link.get_attribute("href") for link in elements]
    result.extend(links)

def start2():
    for link in result:
        # if link not in result:
        #     result.append(link)
        # else:
            driver.get(link)
            goUrl = urllib.request.urlopen(link)
            soup = BeautifulSoup(goUrl.read(), "html.parser")
            #table = soup.find_element_by_id("table", {"class": "applicationDetails"})
            for i in range(20):
                pass # Don't worry about all this commented code, it isn't relevant right now
                #table = soup.find_element_by_id("table", {"class": "applicationDetails"})
                #print(table.text)
            #   div = soup.select("div.applicationDetails")
            #   getDiv = div[i].split(":")[1].get_text()
            #   log = open("log.txt", "a")
            #   log.write(getDiv + "\n")
            #log.write("\n")


while True:
    start()
    element = driver.find_element_by_class_name('rdpPageNext')
    try:
        check = element.get_attribute('onclick')
        if check != "return false;":
            element.click()
        else:
            break

    except:
        break

print(result)
start2()
driver.get(url)
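
A note for readers on newer Selenium versions: the find_element_by_* helpers used throughout this page were removed in Selenium 4, so the loop above will raise AttributeError on a current install. Below is a minimal sketch of the same onclick-check pagination technique ported to Selenium 4 locators; the element IDs and class names (mainContentPlaceHolder_btnAccept, .searchResult a, rdpPageNext) are taken from the code above, and the crude time.sleep between clicks is an assumption standing in for a proper explicit wait.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # Selenium 4.6+ resolves the chromedriver binary itself
driver.get("https://planning.hants.gov.uk/SearchResults.aspx?RecentDecisions=True")
driver.find_element(By.ID, "mainContentPlaceHolder_btnAccept").click()

result = []
while True:
    # Collect the result links on the current page, as start() does above
    for el in driver.find_elements(By.CSS_SELECTOR, ".searchResult a"):
        result.append(el.get_attribute("href"))
    # On the last page the pager's next arrow carries onclick="return false;"
    next_btn = driver.find_element(By.CLASS_NAME, "rdpPageNext")
    if next_btn.get_attribute("onclick") == "return false;":
        break
    next_btn.click()
    time.sleep(2)  # crude pause for the postback; an explicit wait would be better

print(result)
driver.quit()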

To click through all the pages of the website https://planning.hants.gov.uk/SearchResults.aspx?RecentDecisions=True you can use the following solution:

  • Code Block:

     from selenium import webdriver
     from selenium.webdriver.chrome.options import Options
     from selenium.webdriver.support.ui import WebDriverWait
     from selenium.webdriver.common.by import By
     from selenium.webdriver.support import expected_conditions as EC

     options = Options()
     options.add_argument("start-maximized")
     options.add_argument("disable-infobars")
     options.add_argument("--disable-extensions")
     driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
     driver.get('https://planning.hants.gov.uk/SearchResults.aspx?RecentDecisions=True')
     WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.ID, "mainContentPlaceHolder_btnAccept"))).click()
     numLinks = len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div#ctl00_mainContentPlaceHolder_lvResults_topPager div.rdpWrap.rdpNumPart>a"))))
     print(numLinks)
     for i in range(numLinks):
         print("Perform your scraping here on page {}".format(str(i+1)))
         WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@id='ctl00_mainContentPlaceHolder_lvResults_topPager']//div[@class='rdpWrap rdpNumPart']//a[@class='rdpCurrentPage']/span//following::span[1]"))).click()
     driver.quit()
  • Console Output:

     8
     Perform your scraping here on page 1
     Perform your scraping here on page 2
     Perform your scraping here on page 3
     Perform your scraping here on page 4
     Perform your scraping here on page 5
     Perform your scraping here on page 6
     Perform your scraping here on page 7
     Perform your scraping here on page 8
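
If you want the loop to collect the result links instead of just printing a placeholder, here is a minimal sketch continuing straight on from the block above; the .searchResult a selector is borrowed from the question's code and is an assumption about the results markup:

all_links = []
for i in range(numLinks):
    # Gather the application links visible on the current results page
    rows = WebDriverWait(driver, 20).until(
        EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".searchResult a")))
    all_links.extend(row.get_attribute("href") for row in rows)
    if i < numLinks - 1:  # no next-page arrow to click after the last page
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH,
            "//div[@id='ctl00_mainContentPlaceHolder_lvResults_topPager']"
            "//div[@class='rdpWrap rdpNumPart']//a[@class='rdpCurrentPage']"
            "/span//following::span[1]"))).click()
print(len(all_links))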

Hi @Feitan Portor, the code you have written is absolutely fine. The only reason you are redirected back to the first page is that in the last for loop you set url = driver.current_url: that URL stays static, and it is only the JavaScript that fires the next-page click event. Just remove url = driver.current_url and driver.get(url),

and you are good to go. I have tested it myself. To know which page your scraper is currently on, just add this bit inside the for loop so you can tell where the scraper is:

ss = driver.find_element_by_class_name('rdpCurrentPage').text
print(ss)
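
Applied to the question's final loop, the fix described above amounts to this sketch (keeping the question's element IDs and its five-iteration loop, with the two offending lines dropped):

for i in range(5):
    driver.find_element_by_id("ctl00_mainContentPlaceHolder_lvResults_bottomPager_ctl02_NextButton").click()
    ss = driver.find_element_by_class_name('rdpCurrentPage').text
    print(ss)  # shows which page the scraper has reached
    start()    # scrape the page the click just loaded
driver.close()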

Hope this clears up your confusion.

