Pagination navigation with Selenium in Python

Question

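I am scraping this website with Python and Selenium. My code works, but it currently scrapes only the first page. I want to iterate over all the pages and scrape each one, but the site handles pagination in an odd way. How can I step through the pages and scrape them one by one?

Pagination HTML: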

<div class="pagination">
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,1" title="Go to first page">First</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,1" title="Go to previous page">Prev</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,1" title="Go to page 1">1</a>
    <span class="current">2</span>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,3" title="Go to page 3">3</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,4" title="Go to page 4">4</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,3" title="Go to next page">Next</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,4" title="Go to last page">Last</a>
</div>

Scraper:

import re
import json
import requests
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.options import Options

options = Options()
# options.add_argument('--headless')
options.add_argument("start-maximized")
options.add_argument('disable-infobars')
driver=webdriver.Chrome(chrome_options=options, 
executable_path=r'/Users/weaabduljamac/Downloads/chromedriver')

url = 'https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList'
driver.get(url)

def getData():
    data = []
    rows = driver.find_element_by_xpath('//*[@id="form1"]/table/tbody').find_elements_by_tag_name('tr')
    for row in rows:
        app_number = row.find_elements_by_tag_name('td')[1].text
        address = row.find_elements_by_tag_name('td')[2].text
        proposals = row.find_elements_by_tag_name('td')[3].text
        status = row.find_elements_by_tag_name('td')[4].text
        data.append({"CaseRef": app_number, "address": address, "proposals": proposals, "status": status})
    print(data)
    return data


def main():
    all_data = []
    select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
    list_options = select.options

    for item in range(len(list_options)):
        select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
        select.select_by_index(str(item))
        driver.find_element_by_css_selector("input.formbutton#csbtnSearch").click()
        all_data.extend(getData())
        driver.find_element_by_xpath('//*[@id="form1"]/div[3]/a[4]').click()
        driver.get(url)

    with open('wiltshire.json', 'w+') as f:
        json.dump(all_data, f)
    driver.quit()


if __name__ == "__main__":
    main()

Answer 1

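Before automating any scenario, always write down the manual steps you would perform to carry it out. The manual steps you want (as I understand the question) are:

1) Go to the website: https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList

2) Select the first week option

3) Click Search

4) Get the data from every page

5) Load the URL again

6) Select the second week option

7) Click Search

8) Get the data from every page

... and so on.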

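You have a loop that selects the different weeks, but inside each iteration of that week loop you also need a loop that iterates over all the pages. Because you don't have one, your code returns only the data from the first page.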
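
The nesting described here can be sketched without Selenium at all. In this minimal sketch the four callables (`select_week`, `count_pages`, `scrape_page`, `go_to_next_page`) are hypothetical placeholders for the browser actions, not part of any library; only the loop structure is the point.

```python
def scrape_all_weeks(num_weeks, select_week, count_pages, scrape_page, go_to_next_page):
    """Outer loop over weeks, inner loop over the pages of each week.

    All four callables are hypothetical stand-ins for Selenium actions.
    """
    all_data = []
    for week in range(num_weeks):
        select_week(week)            # choose the week and run the search
        pages = count_pages()        # number of result pages for this week
        for page in range(pages):
            all_data.extend(scrape_page())
            if page < pages - 1:     # no click is needed after the last page
                go_to_next_page()
    return all_data
```

Note that every page is scraped, including the last one, while the Next action runs one time fewer than the number of pages.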

Another problem is how you locate the "Next" button:

driver.find_element_by_xpath('//*[@id="form1"]/div[3]/a[4]').click()

You are selecting the fourth <a> element, which is not robust, because the index of the "Next" button differs from page to page. Use this better locator instead:

driver.find_element_by_xpath("//a[contains(text(),'Next')]").click()
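
A stricter variant, not taken from this answer, is to match the link by its title attribute inside the pagination block, since the question's HTML shows title="Go to next page" on that link. The sketch below assumes that markup is the same on every page. `find_next_link` is a hypothetical helper name; it passes the locator strategy as the plain string "xpath", which is the value of Selenium's `By.XPATH` constant, and uses `find_elements` so that a missing link yields `None` rather than an exception.

```python
# Locator built from the title attribute shown in the question's pagination HTML.
NEXT_LINK_XPATH = "//div[@class='pagination']/a[@title='Go to next page']"

def find_next_link(driver):
    """Return the 'Next' link element, or None when the page has no such link."""
    # "xpath" is the string value of selenium's By.XPATH constant.
    # find_elements returns an empty list on no match instead of raising.
    links = driver.find_elements("xpath", NEXT_LINK_XPATH)
    return links[0] if links else None
```

A caller would check the result before clicking, for example `link = find_next_link(driver)` followed by `if link is not None: link.click()`.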

The logic for building a loop that walks through the pages:

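First, you need the number of pages. I got it by finding the <a> element immediately before the "Next" button. In the pagination HTML from the question, that link reads 4, which is also the page the Last link points to, so its text equals the number of pages.

I did that with the following code: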

number_of_pages = int(driver.find_element_by_xpath("//a[contains(text(),'Next')]/preceding-sibling::a[1]").text)
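
To see what that XPath picks out, the same rule can be applied to the question's pagination HTML without a browser. This sketch uses only the standard library's `html.parser`; `PaginationParser` and `page_count` are hypothetical names, and it assumes the link just before Next shows the highest page number, as it does in the question's snippet.

```python
from html.parser import HTMLParser

class PaginationParser(HTMLParser):
    """Collect the text of every <a> element, in document order."""
    def __init__(self):
        super().__init__()
        self.in_link = False
        self.link_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            self.link_texts.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.link_texts[-1] += data

def page_count(html):
    """Return the number shown on the link immediately before 'Next'."""
    parser = PaginationParser()
    parser.feed(html)
    texts = [t.strip() for t in parser.link_texts]
    next_index = texts.index("Next")   # raises ValueError if there is no Next link
    return int(texts[next_index - 1])

html = '''<div class="pagination">
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,1" title="Go to page 1">1</a>
    <span class="current">2</span>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,3" title="Go to page 3">3</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,4" title="Go to page 4">4</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,3" title="Go to next page">Next</a>
</div>'''
print(page_count(html))  # 4
```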

Now that you have the page count in number_of_pages, you only need to click the "Next" button number_of_pages - 1 times!

Final code for the main function:

import time

# json, Select, driver, url and getData come from the script in the question.
def main():
    all_data = []
    select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
    list_options = select.options

    for item in range(len(list_options)):
        # Re-locate the dropdown on every pass: the page was reloaded, so the old element is stale.
        select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
        select.select_by_index(item)
        driver.find_element_by_css_selector("input.formbutton#csbtnSearch").click()
        number_of_pages = int(driver.find_element_by_xpath("//a[contains(text(),'Next')]/preceding-sibling::a[1]").text)
        # Scrape the current page, then click "Next"; this covers pages 1 to n-1.
        for j in range(number_of_pages - 1):
            all_data.extend(getData())
            driver.find_element_by_xpath("//a[contains(text(),'Next')]").click()
            time.sleep(1)
        # Scrape the last page, which the loop navigates to but does not read.
        all_data.extend(getData())
        driver.get(url)

    with open('wiltshire.json', 'w+') as f:
        json.dump(all_data, f)
    driver.quit()
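
An alternative that avoids reading the page count is to keep following Next until the link is gone. This is a sketch rather than the method used in this answer: it assumes the site renders no Next link on the last page, which the question does not confirm. `scrape_page` is a hypothetical callable standing in for `getData`, and `max_pages` is a safety cap added here so that a site which always shows Next cannot loop forever.

```python
import time

def scrape_until_last_page(driver, scrape_page, pause=1.0, max_pages=1000):
    """Scrape the current page, then follow 'Next' links until none remain."""
    rows = []
    for _ in range(max_pages):
        rows.extend(scrape_page())
        # "xpath" is the string value of By.XPATH; find_elements returns [] on no match.
        next_links = driver.find_elements("xpath", "//a[contains(text(),'Next')]")
        if not next_links:
            break
        next_links[0].click()
        time.sleep(pause)  # crude wait for the next page to load
    return rows
```

Inside the week loop this would replace the page-count lookup and the inner `for` loop with a single call such as `all_data.extend(scrape_until_last_page(driver, getData))`.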

Answer 2

First, read the total number of pages from the pagination block:

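from bs4 import BeautifulSoup

# `ins` is the Selenium WebDriver instance.
ins.get('https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList/10702380,1')
ins.find_element_by_class_name("pagination")  # raises if the pagination block is missing
source = BeautifulSoup(ins.page_source, 'html.parser')
div = source.find_all('div', {'class': 'pagination'})
all_as = div[0].find_all('a')
total = 0

# The link just before "Next" carries the highest page number.
for i in range(len(all_as)):
    if 'Next' in all_as[i].text:
        total = int(all_as[i-1].text)
        break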

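Now simply loop over that range:

import time

for count in range(1, int(total) + 1):
    ins.get('https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList/10702380,{}'.format(count))
    # Read ins.page_source here and extract the data for this page.
    time.sleep(1)

The loop increments count for you; on each pass, fetch the page source and extract its data. Note: don't forget to sleep when moving from one page to the next.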
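
The URL pattern this answer relies on can be captured in a small generator. This is a sketch: `page_urls` is a hypothetical helper, and it assumes, based on the pagination HTML in the question, that each page's address is the list id followed by a comma and a 1-based page number.

```python
BASE_URL = "https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList"

def page_urls(list_id, total_pages, base_url=BASE_URL):
    """Yield the URL of every result page, numbered from 1 to total_pages."""
    for page in range(1, total_pages + 1):
        yield "{}/{},{}".format(base_url, list_id, page)

for url in page_urls("10702380", 3):
    print(url)
```

In a real run each URL would be passed to `ins.get(url)`, followed by a short sleep before the next request.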

Answer 3

The following approach is simple and worked for me.

driver.find_element_by_link_text("3").click()
driver.find_element_by_link_text("4").click()
....
driver.find_element_by_link_text("Next").click()
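
The same idea can be written as a loop instead of one line per page. A minimal sketch, assuming every page number is available as a link whose visible text is that number, as in the question's HTML; a pager that shows only a window of page numbers would need a different approach. `visit_pages` and `scrape_page` are hypothetical names, and "link text" is the string value of Selenium's `By.LINK_TEXT` constant.

```python
def visit_pages(driver, scrape_page, last_page):
    """Scrape page 1, then click the numbered links 2..last_page in order."""
    rows = list(scrape_page())          # the search lands on page 1
    for number in range(2, last_page + 1):
        driver.find_element("link text", str(number)).click()
        rows.extend(scrape_page())
    return rows
```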


Answer 1: score 3, accepted, 2018-08-08 11:07:06

Answer 2: score 0, 2018-08-08 11:02:39

Answer 3: score 0, 2019-05-06 05:10:39
