Unable to scrape multiple pages with Python when the URL does not change

Question

After searching for a hint, I found that my problem is closely related to this question, and based on this answer I thought I was close to solving it, but I could not.

I need to extract all URLs from this site, http://elempleo.com/cr/ofertas-empleo. I did the following:

import requests
from bs4 import BeautifulSoup

page_no = 1
payload = {"jobOfferId": 0,
           "salaryInfo": [],
           "city": 0,
           "publishDate": 0,
           "area": 40,
           "countryId": 0,
           "departmentId": 0,
           "companyId": 0,
           "pageIndex": page_no,
           "pageSize": "20",
           "sortExpression": "PublishDate_Desc"}

page = requests.get('http://elempleo.com/cr/ofertas-empleo/get', params=payload)
soup = BeautifulSoup(page.content, 'html.parser')

# each job link on the results page carries the "text-ellipsis" class
href_list = soup.select(".text-ellipsis")

for link in href_list:
    print("http://elempleo.com" + link.get("href"))

http://elempleo.com/cr/ofertas-trabajo/ap-representative/757190
http://elempleo.com/cr/ofertas-trabajo/ingeniero-de-procesos-sap/757189
http://elempleo.com/cr/ofertas-trabajo/sr-program-analyst-months/757188
http://elempleo.com/cr/ofertas-trabajo/executive-asistant/757187
http://elempleo.com/cr/ofertas-trabajo/asistente-comercial-bilingue/757186
http://elempleo.com/cr/ofertas-trabajo/accounting-assistant/757185
http://elempleo.com/cr/ofertas-trabajo/asistente-contable/757184
http://elempleo.com/cr/ofertas-trabajo/personal-para-cajas-alajuela-con-experiencia-en-farmacia/757183
http://elempleo.com/cr/ofertas-trabajo/oficial-de-seguridad/743703
http://elempleo.com/cr/ofertas-trabajo/tecnico-de-mantenimiento-en-extrusion/757182
http://elempleo.com/cr/ofertas-trabajo/gerente-servicio-al-cliente-y-ventas/757181
http://elempleo.com/cr/ofertas-trabajo/encargadoa-departamento-de-recursos-humanos-ingles-intermedio/757180
http://elempleo.com/cr/ofertas-trabajo/director-of-development/757177
http://elempleo.com/cr/ofertas-trabajo/generalista-de-recursos-humanos-ingles-intermedio/757178
http://elempleo.com/cr/ofertas-trabajo/accounts-payable-specialist-seasonal-contract/757176
http://elempleo.com/cr/ofertas-trabajo/electricista-industrial/757175
http://elempleo.com/cr/ofertas-trabajo/payroll-analyst-months-contract/757174
http://elempleo.com/cr/ofertas-trabajo/gerente-servicio-post-venta/757172
http://elempleo.com/cr/ofertas-trabajo/operario-de-proceso/757171
http://elempleo.com/cr/ofertas-trabajo/cajero-de-kiosco-ubicacion-area-metropolitana-fines-de-semana-disponibilidad-de-horarios/757170

As you can see, it prints 20 URLs, which is fine, but if I change page_no=2, page_no=3, ... page_no=100 and run the code again, it returns the same result as before. I need all the URLs from all pages of this website. Can anybody help me?

Also, I set "area":40, which corresponds to the sistemas category in the Área de trabajo field. It has no effect: the results are not filtered to the sistemas category.

I used BeautifulSoup with Python 3 on Ubuntu 18.04.

Answers using the rvest package in R are also welcome!

Answer 1

If you page through the results with the browser's web console open, you will notice that pagination is done through the findByFilter JavaScript query. Python requests cannot handle this kind of page modification, because it does not execute JavaScript.
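To see why the GET attempt in the question kept returning the same page, it can help to look at what such a request sends. The sketch below uses the standard library's urlencode on a shortened copy of the question's payload (the helper name query_string is made up for illustration); requests.get(..., params=...) builds a comparable query string.

```python
from urllib.parse import urlencode


def query_string(page_no):
    """Encode a shortened copy of the question's payload as a GET query string."""
    params = {
        "area": 40,
        "pageIndex": page_no,
        "pageSize": "20",
        "sortExpression": "PublishDate_Desc",
    }
    return urlencode(params)


# Only the pageIndex value differs between the two query strings.
print(query_string(1))
print(query_string(2))
```

The page number does reach the server in the URL, but judging by the identical results reported in the question, the /get page seems to ignore it, since the listing is filled in afterwards by the JavaScript call.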

You have two choices here:

1. Use a Selenium-driven browser to get a JavaScript-enabled scraper.
2. Try to mock the headers and request payload of the http://elempleo.com/cr/api/joboffers/findbyfilter POST request and get the data straight from the API (which would also conveniently give you a JSON response that you can load straight into a Python dictionary).
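The second option could look roughly like the sketch below. It is only a sketch under several assumptions: the payload field names are copied from the question rather than from any API documentation, pageIndex is assumed to start at 1, the shape of the JSON reply is unknown (so the decoded data is yielded as-is), and the helper names build_payload, post_json and iter_pages are made up for illustration. It uses only the standard library; requests.post(url, json=payload) would be the equivalent call with requests. The site may also require extra headers copied from the browser's network tab.

```python
import json
from urllib.request import Request, urlopen

# Endpoint named in the answer above.
API_URL = "http://elempleo.com/cr/api/joboffers/findbyfilter"


def build_payload(page_index, page_size=20, area=0):
    """Build the filter payload; field names are copied from the question."""
    return {
        "jobOfferId": 0,
        "salaryInfo": [],
        "city": 0,
        "publishDate": 0,
        "area": area,
        "countryId": 0,
        "departmentId": 0,
        "companyId": 0,
        "pageIndex": page_index,
        "pageSize": page_size,  # the question sent this as the string "20"
        "sortExpression": "PublishDate_Desc",
    }


def post_json(url, payload, timeout=30):
    """POST a JSON body and return the decoded JSON reply."""
    request = Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def iter_pages(fetch=post_json, max_pages=200):
    """Yield one decoded reply per page until a reply is empty or repeats."""
    previous = None
    for page_index in range(1, max_pages + 1):
        data = fetch(API_URL, build_payload(page_index))
        if not data or data == previous:
            break
        previous = data
        yield data
```

Calling `for data in iter_pages(): print(data)` would then dump each page's reply, which makes it possible to see which keys hold the job URLs. The fetch parameter exists so the paging logic can be tried with a stand-in function before hitting the real site.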

Answer 2

To set up Selenium, visit this link

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
url = "http://elempleo.com/cr/ofertas-empleo/"

Note: You need to download the suitable browser driver from this link and add its path to the system environment variable

# Chrome webdriver is used here.
# Note: executable_path and find_element_by_* are Selenium 3 APIs; later Selenium 4
# releases removed them in favour of a Service object and driver.find_element(By.ID, ...).
driver = webdriver.Chrome(executable_path=r"F:\Projects\sms_automation\chromedriver.exe")  # initialize webdriver instance
driver.get(url)  # open URL in browser
driver.find_element_by_id("ResultsByPage").send_keys('100')  # set items per page to 100
time.sleep(5)  # give the page time to reload the results
soup = BeautifulSoup(driver.page_source, "html.parser")
url_set = ["http://elempleo.com"+i.get("href") for i in soup.select(".text-ellipsis")]
while True:
    try:
        driver.find_element_by_class_name("js-btn-next").click()  # go to next page
        time.sleep(3)
        soup = BeautifulSoup(driver.page_source, "html.parser")
        current_page_url = ["http://elempleo.com"+i.get("href") for i in soup.select(".text-ellipsis")]
        # stop once the last link no longer changes, i.e. the final page was already collected
        if url_set[-1] == current_page_url[-1]:
            break
        url_set += current_page_url
    except WebDriverException:
        time.sleep(5)  # the click failed; wait and try again

Result:

print(len(url_set))   # outputs 2641
print(url_set)  # outputs ['http://elempleo.comhttp://www.elempleo.com/cr/ofertas-trabajo/analista-de-sistemas-financieros/753845', 'http://elempleo.comhttp://www.elempleo.com/cr/ofertas-trabajo/balance-sheet-and-cash-flow-specialist/755211', 'http://elempleo.comhttp://www.elempleo.com/cr/ofertas-trabajo/coordinador-de-compensacion/757369', 'http://elempleo.comhttp://www.elempleo.com/cr/ofertas-trabajo/gerente-de-agronomia/757368', 'http://elempleo.comhttp://www.elempleo.com/cr/ofertas-trabajo/responsable-de-capacitacion-y-desempeno/757367', 'http://elempleo.comhttp://www.elempleo.com/cr/ofertas-trabajo/pmp-gestor-de-proyectos/757366', ....]
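The output above shows the host twice in every entry ("http://elempleo.comhttp://www.elempleo.com/..."), which suggests that the hrefs on the rendered page are already absolute, so prepending the base with string concatenation doubles it. A small sketch of one way around this, using urljoin from the standard library (the helper name absolute_url is made up for illustration):

```python
from urllib.parse import urljoin

BASE_URL = "http://elempleo.com"


def absolute_url(href, base=BASE_URL):
    """Resolve href against base; hrefs that are already absolute are returned unchanged."""
    return urljoin(base, href)


# Relative href, as in the output shown in the question.
print(absolute_url("/cr/ofertas-trabajo/ap-representative/757190"))
# Absolute href, as in the Selenium output above.
print(absolute_url("http://www.elempleo.com/cr/ofertas-trabajo/gerente-de-agronomia/757368"))
```

In the list comprehensions above, `"http://elempleo.com"+i.get("href")` could then be replaced with `absolute_url(i.get("href"))`, which handles both relative and absolute links.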

Answer 1 | 1 | 2018-07-05 03:09:31

Answer 2 | 1 | accepted | 2018-07-05 04:27:18