Unable to scrape multiple pages with Python when the URL does not change

Question

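After searching for hints, I found that my problem is closely related to this question, and based on this answer I thought I was about to solve it, but I have not.

I need to extract all the URLs from the site http://elempleo.com/cr/ofertas-empleo, and this is what I have done: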

import requests
from bs4 import BeautifulSoup

page_no = 1
payload = {"jobOfferId": 0,
           "salaryInfo": [],
           "city": 0,
           "publishDate": 0,
           "area": 40,
           "countryId": 0,
           "departmentId": 0,
           "companyId": 0,
           "pageIndex": page_no,
           "pageSize": "20",
           "sortExpression": "PublishDate_Desc"}

page = requests.get('http://elempleo.com/cr/ofertas-empleo/get', params=payload)
soup = BeautifulSoup(page.content, 'html.parser')

href_list=soup.select(".text-ellipsis")

for urls in href_list:
    print("http://elempleo.com"+urls.get("href"))

http://elempleo.com/cr/ofertas-trabajo/ap-representative/757190
http://elempleo.com/cr/ofertas-trabajo/ingeniero-de-procesos-sap/757189
http://elempleo.com/cr/ofertas-trabajo/sr-program-analyst-months/757188
http://elempleo.com/cr/ofertas-trabajo/executive-asistant/757187
http://elempleo.com/cr/ofertas-trabajo/asistente-comercial-bilingue/757186
http://elempleo.com/cr/ofertas-trabajo/accounting-assistant/757185
http://elempleo.com/cr/ofertas-trabajo/asistente-contable/757184
http://elempleo.com/cr/ofertas-trabajo/personal-para-cajas-alajuela-con-experiencia-en-farmacia/757183
http://elempleo.com/cr/ofertas-trabajo/oficial-de-seguridad/743703
http://elempleo.com/cr/ofertas-trabajo/tecnico-de-mantenimiento-en-extrusion/757182
http://elempleo.com/cr/ofertas-trabajo/gerente-servicio-al-cliente-y-ventas/757181
http://elempleo.com/cr/ofertas-trabajo/encargadoa-departamento-de-recursos-humanos-ingles-intermedio/757180
http://elempleo.com/cr/ofertas-trabajo/director-of-development/757177
http://elempleo.com/cr/ofertas-trabajo/generalista-de-recursos-humanos-ingles-intermedio/757178
http://elempleo.com/cr/ofertas-trabajo/accounts-payable-specialist-seasonal-contract/757176
http://elempleo.com/cr/ofertas-trabajo/electricista-industrial/757175
http://elempleo.com/cr/ofertas-trabajo/payroll-analyst-months-contract/757174
http://elempleo.com/cr/ofertas-trabajo/gerente-servicio-post-venta/757172
http://elempleo.com/cr/ofertas-trabajo/operario-de-proceso/757171
http://elempleo.com/cr/ofertas-trabajo/cajero-de-kiosco-ubicacion-area-metropolitana-fines-de-semana-disponibilidad-de-horarios/757170

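As you can see, it shows 20 URLs, which is fine, but if I set page_no=2, page_no=3, ... page_no=100 and run the code above again, it returns the same results as before. I need all the URLs from every page of this site. Can anyone help me?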

In addition, I set "area":40 to select the sistemas category in the Área de trabajo field. It has no effect: the results are not filtered to the sistemas category.

I am using beautifulsoup with Python 3 on Ubuntu 18.04.

Answers in R using the rvest package are also welcome!

Answer 1

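If you scroll through the page with the web console open, you will notice that pagination is done through a findByFilter JavaScript query. Python requests only downloads the HTML it is sent and does not execute JavaScript, so it cannot follow this kind of in-page update.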

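You have two options:

Use Selenium to drive a real browser, which gives you a JavaScript-enabled scraper.
Try to replicate the headers and payload of the POST request to http://elempleo.com/cr/api/joboffers/findbyfilter and fetch the data directly from the API (this may also readily give you a JSON response that you can load straight into a Python dict).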
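
The second option could be sketched roughly as follows. This is a minimal sketch, not a verified client: the endpoint is the one named above and the payload fields are copied from the question, but the headers, the JSON encoding of the body, and the helper names build_payload, build_request and fetch_page are assumptions for illustration. Compare them with the real request in the browser's network tab before relying on them. It uses only the standard library; requests.post(API_URL, json=payload) would serve the same purpose.

```python
import json
import urllib.request

# Endpoint named in the answer; everything else about the request is assumed.
API_URL = "http://elempleo.com/cr/api/joboffers/findbyfilter"


def build_payload(page_no, page_size=20, area=40):
    """Return the filter payload from the question for one page of results."""
    return {
        "jobOfferId": 0,
        "salaryInfo": [],
        "city": 0,
        "publishDate": 0,
        "area": area,
        "countryId": 0,
        "departmentId": 0,
        "companyId": 0,
        "pageIndex": page_no,
        "pageSize": str(page_size),
        "sortExpression": "PublishDate_Desc",
    }


def build_request(page_no):
    """Build (but do not send) a POST request for the given page."""
    body = json.dumps(build_payload(page_no)).encode("utf-8")
    # Assumed headers: copy the real ones from the browser if these are rejected.
    headers = {
        "Content-Type": "application/json",
        "X-Requested-With": "XMLHttpRequest",
    }
    return urllib.request.Request(API_URL, data=body, headers=headers, method="POST")


def fetch_page(page_no, timeout=30):
    """Send the request and decode the response, assuming it is JSON."""
    with urllib.request.urlopen(build_request(page_no), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling fetch_page(1), fetch_page(2), and so on until a page comes back empty would then walk the listing. The structure of the response is not shown in this thread, so inspect one response before extracting fields from it.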

Answer 2

To set up Selenium, see this link.

import time
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

url = "http://elempleo.com/cr/ofertas-empleo/"
base_url = "http://elempleo.com"

Note: you need to download a suitable browser driver from this link and add its path to your system environment variables.

# here I am using chrome webdriver
# setting up selenium (Selenium 4 API; Selenium 3 used executable_path= and find_element_by_id)
driver = webdriver.Chrome(service=Service(executable_path=r"F:\Projects\sms_automation\chromedriver.exe"))  # initialize webdriver instance
driver.get(url)  # open URL in browser
driver.find_element(By.ID, "ResultsByPage").send_keys('100')  # set items per page to 100
time.sleep(5)
soup = BeautifulSoup(driver.page_source, "html.parser")
# urljoin leaves hrefs that are already absolute untouched, so the host is not prepended twice
url_set = [urljoin(base_url, i.get("href")) for i in soup.select(".text-ellipsis")]
while True:
    try:
        driver.find_element(By.CLASS_NAME, "js-btn-next").click()  # go to next page
        time.sleep(3)
        soup = BeautifulSoup(driver.page_source, "html.parser")
        current_page_url = [urljoin(base_url, i.get("href")) for i in soup.select(".text-ellipsis")]
        if url_set[-1] == current_page_url[-1]:  # listing did not change: last page reached
            break
        url_set += current_page_url
    except WebDriverException:
        time.sleep(5)  # wait, then retry the click

Result:

print(len(url_set))   # outputs 2641
print(url_set)  # outputs ['http://www.elempleo.com/cr/ofertas-trabajo/analista-de-sistemas-financieros/753845', 'http://www.elempleo.com/cr/ofertas-trabajo/balance-sheet-and-cash-flow-specialist/755211', 'http://www.elempleo.com/cr/ofertas-trabajo/coordinador-de-compensacion/757369', 'http://www.elempleo.com/cr/ofertas-trabajo/gerente-de-agronomia/757368', 'http://www.elempleo.com/cr/ofertas-trabajo/responsable-de-capacitacion-y-desempeno/757367', 'http://www.elempleo.com/cr/ofertas-trabajo/pmp-gestor-de-proyectos/757366', ....]
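
Hrefs scraped from a listing page may be site-relative paths or complete URLs, and the same offer can show up on more than one page. A minimal sketch of normalizing both forms with urllib.parse.urljoin and dropping repeats while keeping the original order; normalize_links is a hypothetical helper name, and the sample hrefs are only illustrations built from paths that appear in this thread.

```python
from urllib.parse import urljoin

BASE_URL = "http://elempleo.com"


def normalize_links(hrefs, base=BASE_URL):
    """Resolve each href against base and remove duplicates, preserving order."""
    # urljoin keeps an already-absolute href as it is and prefixes relative ones.
    absolute = [urljoin(base, href) for href in hrefs]
    # dict preserves insertion order, so this de-duplicates without reordering.
    return list(dict.fromkeys(absolute))


sample = [
    "/cr/ofertas-trabajo/ap-representative/757190",
    "http://www.elempleo.com/cr/ofertas-trabajo/gerente-de-agronomia/757368",
    "/cr/ofertas-trabajo/ap-representative/757190",
]
links = normalize_links(sample)
```

Joining with urljoin rather than string concatenation means the code does not need to know in advance which form the site emits.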

2 answers

Answer 1
1 2018-07-05 03:09:31

Answer 2
1 accepted 2018-07-05 04:27:18
