
Navigating through pagination with Selenium in Python

I'm scraping this website using Python and Selenium. I have the code working, but it currently only scrapes the first page. I would like to iterate through all the pages and scrape each one, but the site handles pagination in a strange way. How would I go through the pages and scrape them one by one?

Pagination HTML:

<div class="pagination">
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,1" title="Go to first page">First</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,1" title="Go to previous page">Prev</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,1" title="Go to page 1">1</a>
    <span class="current">2</span>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,3" title="Go to page 3">3</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,4" title="Go to page 4">4</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,3" title="Go to next page">Next</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,4" title="Go to last page">Last</a>
</div>
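
As an aside, the total page count is already encoded in the "Go to last page" link: it is the number after the comma in its href. A minimal, browser-free sketch of extracting it with only the standard library's html.parser, run against the snippet above (the PaginationParser class and last_page_number helper are illustrative names, not part of the site or of Selenium):

```python
from html.parser import HTMLParser

PAGINATION_HTML = """
<div class="pagination">
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,1" title="Go to first page">First</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,1" title="Go to previous page">Prev</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,1" title="Go to page 1">1</a>
    <span class="current">2</span>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,3" title="Go to page 3">3</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,4" title="Go to page 4">4</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,3" title="Go to next page">Next</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,4" title="Go to last page">Last</a>
</div>
"""

class PaginationParser(HTMLParser):
    """Remembers the href of the link titled 'Go to last page'."""
    def __init__(self):
        super().__init__()
        self.last_href = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("title") == "Go to last page":
            self.last_href = a.get("href")

def last_page_number(html):
    p = PaginationParser()
    p.feed(html)
    # hrefs look like /PlanningGIS/LLPG/WeeklyList/<list id>,<page number>
    return int(p.last_href.rsplit(",", 1)[1])

print(last_page_number(PAGINATION_HTML))  # 4
```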

Scraper:

import re
import json
import requests
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.options import Options

options = Options()
# options.add_argument('--headless')
options.add_argument("start-maximized")
options.add_argument('disable-infobars')
driver = webdriver.Chrome(chrome_options=options,
                          executable_path=r'/Users/weaabduljamac/Downloads/chromedriver')

url = 'https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList'
driver.get(url)

def getData():
    data = []
    rows = driver.find_element_by_xpath('//*[@id="form1"]/table/tbody').find_elements_by_tag_name('tr')
    for row in rows:
        app_number = row.find_elements_by_tag_name('td')[1].text
        address = row.find_elements_by_tag_name('td')[2].text
        proposals = row.find_elements_by_tag_name('td')[3].text
        status = row.find_elements_by_tag_name('td')[4].text
        data.append({"CaseRef": app_number, "address": address, "proposals": proposals, "status": status})
    print(data)
    return data


def main():
    all_data = []
    select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
    list_options = select.options

    for item in range(len(list_options)):
        select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
        select.select_by_index(str(item))
        driver.find_element_by_css_selector("input.formbutton#csbtnSearch").click()
        all_data.extend(getData())
        driver.find_element_by_xpath('//*[@id="form1"]/div[3]/a[4]').click()
        driver.get(url)

    with open('wiltshire.json', 'w+') as f:
        json.dump(all_data, f)
    driver.quit()


if __name__ == "__main__":
    main()

Before moving on to automating any scenario, always write down the manual steps you would perform to execute it. The manual steps for what you want (as I understand it from the question) are -

1) Go to the site - https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList

2) Select the first week option

3) Click Search

4) Get the data from every page

5) Load the URL again

6) Select the second week option

7) Click Search

8) Get the data from every page

.. and so on.
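
The steps above boil down to a nested loop: weeks on the outside, result pages on the inside. A minimal, browser-free sketch of that shape, where scrape_all_weeks, pages_for_week, and scrape_page are hypothetical stand-ins for the actual Selenium calls:

```python
def scrape_all_weeks(weeks, pages_for_week, scrape_page):
    """Outer loop over week options, inner loop over that week's pages."""
    all_data = []
    for week in weeks:
        for page in range(1, pages_for_week(week) + 1):
            all_data.extend(scrape_page(week, page))
    return all_data

# Toy run: two weeks with 3 and 2 result pages, one row scraped per page.
rows = scrape_all_weeks(
    weeks=["week1", "week2"],
    pages_for_week=lambda w: {"week1": 3, "week2": 2}[w],
    scrape_page=lambda w, p: [{"week": w, "page": p}],
)
print(len(rows))  # 5
```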

You have a loop to select the different weeks, but inside each iteration for a week option you also need a loop that iterates over all the pages. Since you are not doing that, your code returns only the data from the first page.

Another problem is with how you are locating the 'Next' button -

driver.find_element_by_xpath('//*[@id="form1"]/div[3]/a[4]').click()

You are selecting the 4th <a> element, which is of course not robust, because the Next button's index will differ on different pages. Instead, use this better locator -

driver.find_element_by_xpath("//a[contains(text(),'Next')]").click()
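
What `//a[contains(text(),'Next')]` does is match the link by its text rather than its position, so it keeps working no matter where the button sits. For illustration only, the same idea in plain Python over a stripped-down copy of the question's pagination markup (find_href_by_text is a hypothetical helper, not a Selenium API, and the regex assumes href comes first in each anchor):

```python
import re

PAGINATION = """
<a href="/PlanningGIS/LLPG/WeeklyList/41826123,1" title="Go to page 1">1</a>
<span class="current">2</span>
<a href="/PlanningGIS/LLPG/WeeklyList/41826123,3" title="Go to page 3">3</a>
<a href="/PlanningGIS/LLPG/WeeklyList/41826123,3" title="Go to next page">Next</a>
"""

def find_href_by_text(html, text):
    """Return the href of the first <a> whose link text contains `text`."""
    for href, body in re.findall(r'<a href="([^"]+)"[^>]*>([^<]+)</a>', html):
        if text in body:
            return href
    return None

print(find_href_by_text(PAGINATION, "Next"))  # /PlanningGIS/LLPG/WeeklyList/41826123,3
```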

Logic for creating the loop which will iterate through the pages -

First you will need the number of pages. I got that by locating the <a> immediately before the "Next" button. As the screenshot below shows, the text of this element equals the number of pages -

[screenshot of the pagination bar omitted]

I did that using the following code -

number_of_pages = int(driver.find_element_by_xpath("//a[contains(text(),'Next')]/preceding-sibling::a[1]").text)

Now, once you have the number of pages as number_of_pages, you only need to click the "Next" button number_of_pages - 1 times!
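
To make the counting concrete: with n pages you scrape n times but click "Next" only n - 1 times, because after scraping the last page there is nowhere left to go. A small illustration (visit_order is a hypothetical helper that just records the sequence of actions):

```python
def visit_order(number_of_pages):
    """Record the scrape/click sequence for a paginated listing."""
    actions = []
    for page in range(1, number_of_pages + 1):
        actions.append("scrape page {}".format(page))
        # The last page needs no click, hence n - 1 clicks in total.
        if page < number_of_pages:
            actions.append("click Next")
    return actions

print(visit_order(3))
# ['scrape page 1', 'click Next', 'scrape page 2', 'click Next', 'scrape page 3']
```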

Final code for your main function -

import time  # needed for time.sleep below

def main():
    all_data = []
    select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
    list_options = select.options

    for item in range(len(list_options)):
        select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
        select.select_by_index(str(item))
        driver.find_element_by_css_selector("input.formbutton#csbtnSearch").click()
        number_of_pages = int(driver.find_element_by_xpath("//a[contains(text(),'Next')]/preceding-sibling::a[1]").text)
        for j in range(number_of_pages):
            all_data.extend(getData())
            # scrape n pages but click Next only n - 1 times,
            # otherwise the last page would never be scraped
            if j < number_of_pages - 1:
                driver.find_element_by_xpath("//a[contains(text(),'Next')]").click()
                time.sleep(1)
        driver.get(url)

    with open('wiltshire.json', 'w+') as f:
        json.dump(all_data, f)
    driver.quit()

First get the total number of pages from the pagination, using

# ins is the webdriver instance; BeautifulSoup comes from the bs4 package
from bs4 import BeautifulSoup

ins.get('https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList/10702380,1')
source = BeautifulSoup(ins.page_source, 'html.parser')
div = source.find_all('div', {'class': 'pagination'})
all_as = div[0].find_all('a')
total = 0

for i in range(len(all_as)):
    if 'Next' in all_as[i].text:
        # the <a> just before "Next" holds the last page number
        total = int(all_as[i - 1].text)
        break

Now just loop through the range

for count in range(1, total + 1):
    ins.get('https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList/10702380,{}'.format(count))

Keep incrementing the count, get the source code for each page, and then extract the data from it. Note: don't forget to sleep when clicking from one page to another.
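
Since this site encodes the page number after the comma in the URL, the per-page URLs can also be built up front instead of being clicked through. A sketch of that (page_urls is a hypothetical helper; the list id 10702380 is the one used in the snippet above):

```python
def page_urls(list_id, total):
    """Build the URL for every page of a weekly list, pages being 1-indexed."""
    base = 'https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList'
    return ['{}/{},{}'.format(base, list_id, n) for n in range(1, total + 1)]

urls = page_urls(10702380, 3)
print(urls[0])  # https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList/10702380,1
```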

The following approach simply worked for me.

driver.find_element_by_link_text("3").click()
driver.find_element_by_link_text("4").click()
# ...
driver.find_element_by_link_text("Next").click()
