
How can I loop through several pages to download excel files using Selenium and Python

I am trying to build a web scraper that will go through a website's pages and download the Excel files from a dropdown menu at the bottom of each page.

The webpages only let me download the 50 locations displayed per page; I cannot download all of them at once.

I am able to download the first page's Excel file, but the following pages yield nothing.

I get the following output after running the code provided below.

Skipped a page 
No more pages.

If I exclude the lines that ask to download the files, it can go through each page to the end successfully.

I'll provide an example below of what I am trying to accomplish.

I would appreciate any help and advice! Thank you!

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import Select
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By

state = 'oklahoma'
rent_to_own = 'rent to own'

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.maximize_window()
driver.get('https://www.careeronestop.org/toolkit/jobs/find-businesses.aspx')

industry = driver.find_element(By.ID, "txtKeyword") 
industry.send_keys(rent_to_own)

location = driver.find_element(By.ID, "txtLocation")
location.send_keys(state)

driver.find_element(By.ID, "btnSubmit").click()

driver.implicitly_wait(3)
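# note: implicitly_wait sets a global element-lookup timeout and returns immediately; it is not a sleep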
        
def web_scrape():
        more_drawer = driver.find_element(By.XPATH, "//div[@class='more-drawer']//a[@href='/toolkit/jobs/find-businesses.aspx?keyword="+rent_to_own+"&ajax=0&location="+state+"&lang=en&Desfillall=y#Des']")
        more_drawer.click()

        driver.implicitly_wait(5)

        get_50 = Select(driver.find_element(By.ID, 'ViewPerPage'))
        get_50.select_by_value('50')

        driver.implicitly_wait(5)

        filter_description = driver.find_element(By.XPATH, "//ul[@class='filters-list']//a[@href='/toolkit/jobs/find-businesses.aspx?keyword="+rent_to_own+"&ajax=0&location="+state+"&lang=en&Desfillall=y&pagesize=50&currentpage=1&descfilter=Furniture~B~Renting ~F~ Leasing']")
        filter_description.click()
        
        while True:
            try:
                download_excel = Select(driver.find_element(By.ID, 'ResultsDownload'))
                download_excel.select_by_value('Excel')
                driver.implicitly_wait(20)
                first_50 = driver.find_element(By.XPATH, "//div[@id='relatedOccupations']//a[@onclick='hideMoreRelatedOccupations()']")
                first_50.click()
                driver.implicitly_wait(20)
                next_page = driver.find_element(By.XPATH, "//div[@class='pagination-wrap']//div//a[@class='next-page']")
                next_page.click()
                driver.implicitly_wait(20)
                print("Skipped a page.")
            except:
                print("No more pages.")
                return
web_scrape()

Below is something that works. Again, I think the way I went about this could be improved. I stuck with Selenium, but you really don't even need to open the webpage: you can just scrape with Beautiful Soup using the correct URL params (see the sketch after the code below). Also, writing every item into Excel one at a time is probably not the fastest way, but it works; a better approach is probably to use pandas and create an Excel workbook at the end. Anyway, if you have any questions, let me know.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import Select
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import openpyxl as xl
import os
import math

cwd = os.getcwd() # or whatever dir you want
filename = 'test123.xlsx' # note: '\test123.xlsx' would begin with a tab escape ('\t')

location = 'oklahoma'
keyword = 'rent to own'

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.maximize_window()
driver.get('https://www.careeronestop.org/toolkit/jobs/find-businesses.aspx?keyword=' + keyword + '&ajax=0&location=' + location + '&radius=50&pagesize=50&currentpage=1&lang=en')

driver.implicitly_wait(3)

wb = xl.Workbook()
ws = wb.worksheets[0]

# get the total record count and work out how many 50-result pages there are
ret = driver.find_element(By.ID, 'recordNumber')
lp = math.ceil(float(ret.text) / 50)
r = 1 # next empty worksheet row

for i in range(1, lp + 1): # range is exclusive at the top, so include the last page
    
    print(i)
    driver.get('https://www.careeronestop.org/toolkit/jobs/find-businesses.aspx?keyword=' + keyword + '&ajax=0&location=' + location + '&radius=50&pagesize=50&currentpage=' + str(i) + '&lang=en')
    table_id = driver.find_elements(By.CLASS_NAME, 'res-table')[0]
    rows = table_id.find_elements(By.TAG_NAME, "tr")
    
    for row in rows:
        cols = row.find_elements(By.TAG_NAME, "td")
        refs = row.find_elements(By.TAG_NAME, "a")
        # write each link as an Excel HYPERLINK formula
        for c, ref in enumerate(refs, start=1):
            ws.cell(row=r, column=c).value = '=HYPERLINK("{}", "{}")'.format(ref.get_attribute("href"), ref.text)
        # fill in the remaining cell text; column 1 already holds the link
        for c, col in enumerate(cols, start=1):
            if c > 1:
                ws.cell(row=r, column=c).value = col.text
        r += 1

wb.save(os.path.join(cwd, filename))
print('done')

This returns an Excel file with 750+ rows of data, links included.
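As mentioned above, you don't actually need a browser for this: the same results pages can be fetched directly with the URL params and parsed with Beautiful Soup, collecting everything into a pandas DataFrame and writing the workbook once at the end. Below is a minimal sketch of that approach. It assumes the server returns the rendered results table for these URL params without JavaScript (worth verifying first); the recordNumber id and res-table class are the same ones used above, and the output filename businesses.xlsx is arbitrary.

import math
import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE = 'https://www.careeronestop.org/toolkit/jobs/find-businesses.aspx'

def fetch_page(page, keyword='rent to own', location='oklahoma'):
    # Same query string as the Selenium version; requests handles the URL encoding.
    params = {'keyword': keyword, 'ajax': 0, 'location': location,
              'radius': 50, 'pagesize': 50, 'currentpage': page, 'lang': 'en'}
    resp = requests.get(BASE, params=params, timeout=30)
    resp.raise_for_status()
    return BeautifulSoup(resp.text, 'html.parser')

# The first page carries the total record count (the recordNumber element above).
soup = fetch_page(1)
total = float(soup.find(id='recordNumber').get_text(strip=True))
last_page = math.ceil(total / 50)

rows = []
for page in range(1, last_page + 1):
    if page > 1:
        soup = fetch_page(page)
    table = soup.find('table', class_='res-table')
    for tr in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        link = tr.find('a') # hrefs may be site-relative
        if cells:
            # column 1 already holds the link text, so skip it like the Selenium version
            rows.append([link['href'] if link else ''] + cells[1:])

# Build the workbook in one shot instead of writing cell by cell.
pd.DataFrame(rows).to_excel('businesses.xlsx', index=False, header=False)

This keeps the same layout as the Selenium version (link first, then the remaining columns) and writes the file in one go, which should be noticeably faster than per-cell writes.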
