Unable to scrape multiple pages with an unchanging URL with Python
After searching for a hint, I found my problem is closely related to this question, and based on this answer I thought I was about to solve it, but I could not.
I need to extract all URLs from this site, http://elempleo.com/cr/ofertas-empleo , so I did the following:
import requests
from bs4 import BeautifulSoup

page_no = 1
payload = {
    "jobOfferId": 0,
    "salaryInfo": [],
    "city": 0,
    "publishDate": 0,
    "area": 40,
    "countryId": 0,
    "departmentId": 0,
    "companyId": 0,
    "pageIndex": page_no,
    "pageSize": "20",
    "sortExpression": "PublishDate_Desc",
}
page = requests.get('http://elempleo.com/cr/ofertas-empleo/get', params=payload)
soup = BeautifulSoup(page.content, 'html.parser')
href_list=soup.select(".text-ellipsis")
for urls in href_list:
    print("http://elempleo.com" + urls.get("href"))
http://elempleo.com/cr/ofertas-trabajo/ap-representative/757190
http://elempleo.com/cr/ofertas-trabajo/ingeniero-de-procesos-sap/757189
http://elempleo.com/cr/ofertas-trabajo/sr-program-analyst-months/757188
http://elempleo.com/cr/ofertas-trabajo/executive-asistant/757187
http://elempleo.com/cr/ofertas-trabajo/asistente-comercial-bilingue/757186
http://elempleo.com/cr/ofertas-trabajo/accounting-assistant/757185
http://elempleo.com/cr/ofertas-trabajo/asistente-contable/757184
http://elempleo.com/cr/ofertas-trabajo/personal-para-cajas-alajuela-con-experiencia-en-farmacia/757183
http://elempleo.com/cr/ofertas-trabajo/oficial-de-seguridad/743703
http://elempleo.com/cr/ofertas-trabajo/tecnico-de-mantenimiento-en-extrusion/757182
http://elempleo.com/cr/ofertas-trabajo/gerente-servicio-al-cliente-y-ventas/757181
http://elempleo.com/cr/ofertas-trabajo/encargadoa-departamento-de-recursos-humanos-ingles-intermedio/757180
http://elempleo.com/cr/ofertas-trabajo/director-of-development/757177
http://elempleo.com/cr/ofertas-trabajo/generalista-de-recursos-humanos-ingles-intermedio/757178
http://elempleo.com/cr/ofertas-trabajo/accounts-payable-specialist-seasonal-contract/757176
http://elempleo.com/cr/ofertas-trabajo/electricista-industrial/757175
http://elempleo.com/cr/ofertas-trabajo/payroll-analyst-months-contract/757174
http://elempleo.com/cr/ofertas-trabajo/gerente-servicio-post-venta/757172
http://elempleo.com/cr/ofertas-trabajo/operario-de-proceso/757171
http://elempleo.com/cr/ofertas-trabajo/cajero-de-kiosco-ubicacion-area-metropolitana-fines-de-semana-disponibilidad-de-horarios/757170
As you can see, it shows 20 URLs, which is fine, but if I change page_no=2 , page_no=3 , ... page_no=100 and run the above code again, it returns the same result as before. I need all URLs from all pages on this website. Can anybody help me?
Also, I set "area":40 , which corresponds to the sistemas category in the Área de trabajo field, but it does nothing: the results are not filtered to the sistemas category.
I used beautifulsoup in Python 3 running on Ubuntu 18.04.
Answers using the rvest package in R are also welcome!
If you try scrolling through the pages with the web console open, you will notice that pagination is done through the findByFilter JavaScript request. A plain requests fetch of the page cannot handle this kind of dynamic page modification.
You have two choices here: call the paging endpoint directly, or drive a real browser with Selenium.
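For the first choice, a minimal sketch of hitting the endpoint directly with requests, assuming the /cr/ofertas-empleo/get route from the question accepts the filter as a JSON POST body (the exact route, HTTP method, and field names are assumptions and should be confirmed in the browser's network tab):

```python
import requests

BASE = "http://www.elempleo.com"
# Endpoint taken from the question's code; whether it expects URL params
# or a JSON POST body must be verified against the real findByFilter call.
ENDPOINT = BASE + "/cr/ofertas-empleo/get"

def build_payload(page_no, area=40, page_size=20):
    """Same filter as in the question, with only the page index varying."""
    return {
        "jobOfferId": 0,
        "salaryInfo": [],
        "city": 0,
        "publishDate": 0,
        "area": area,
        "countryId": 0,
        "departmentId": 0,
        "companyId": 0,
        "pageIndex": page_no,
        "pageSize": page_size,
        "sortExpression": "PublishDate_Desc",
    }

def fetch_page(page_no):
    # Sending the filter as a JSON body instead of URL params; a server
    # that ignores query params is the usual reason every "page" looks
    # identical no matter what pageIndex you pass.
    resp = requests.post(ENDPOINT, json=build_payload(page_no))
    resp.raise_for_status()
    return resp.text
```

If the response turns out to be JSON rather than HTML, parse it with `resp.json()` instead of feeding it to BeautifulSoup.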
To set up Selenium, visit this link.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
url = "http://elempleo.com/cr/ofertas-empleo/"
Note: You need to download the suitable browser driver from this link and add its path to the system environment variables.
# here I am using the Chrome webdriver
# setting up selenium
driver = webdriver.Chrome(executable_path=r"F:\Projects\sms_automation\chromedriver.exe")  # initialize webdriver instance
driver.get(url)  # open URL in browser
driver.find_element_by_id("ResultsByPage").send_keys('100')  # set items per page to 100
time.sleep(5)

soup = BeautifulSoup(driver.page_source, "html.parser")
url_set = ["http://elempleo.com" + i.get("href") for i in soup.select(".text-ellipsis")]

while True:
    try:
        driver.find_element_by_class_name("js-btn-next").click()  # go to next page
        time.sleep(3)
        soup = BeautifulSoup(driver.page_source, "html.parser")
        current_page_url = ["http://elempleo.com" + i.get("href") for i in soup.select(".text-ellipsis")]
        if url_set[-1] == current_page_url[-1]:
            # last page reached: clicking "next" no longer changes the results
            break
        url_set += current_page_url
    except WebDriverException:
        time.sleep(5)
Result:
print(len(url_set)) # outputs 2641
print(url_set) # outputs ['http://elempleo.comhttp://www.elempleo.com/cr/ofertas-trabajo/analista-de-sistemas-financieros/753845', 'http://elempleo.comhttp://www.elempleo.com/cr/ofertas-trabajo/balance-sheet-and-cash-flow-specialist/755211', 'http://elempleo.comhttp://www.elempleo.com/cr/ofertas-trabajo/coordinador-de-compensacion/757369', 'http://elempleo.comhttp://www.elempleo.com/cr/ofertas-trabajo/gerente-de-agronomia/757368', 'http://elempleo.comhttp://www.elempleo.com/cr/ofertas-trabajo/responsable-de-capacitacion-y-desempeno/757367', 'http://elempleo.comhttp://www.elempleo.com/cr/ofertas-trabajo/pmp-gestor-de-proyectos/757366', ....]
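Note that the printed URLs come out double-prefixed (`http://elempleo.comhttp://www.elempleo.com/...`) because the site sometimes returns absolute hrefs, which the code above blindly prepends the base to. A small sketch using the standard library's `urljoin` avoids that:

```python
from urllib.parse import urljoin

BASE = "http://www.elempleo.com/"

def absolutize(href):
    # urljoin keeps an already-absolute href unchanged and resolves a
    # relative one against BASE, so no double prefix can appear
    return urljoin(BASE, href)

print(absolutize("/cr/ofertas-trabajo/ap-representative/757190"))
print(absolutize("http://www.elempleo.com/cr/ofertas-trabajo/ap-representative/757190"))
# both print http://www.elempleo.com/cr/ofertas-trabajo/ap-representative/757190
```

Replacing the `"http://elempleo.com" + i.get("href")` concatenations with `absolutize(i.get("href"))` would give a clean, consistent list.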