繁体   English   中英

使用python抓取动态javascript内容网页

[英]Scrape dynamic javascript content webpage using python

我正在尝试使用 Python 抓取此网站:“ https://ec.europa.eu/research/mariecurieactions/how-to/find-job_en ”。

首先我注意到我感兴趣的表格实际上在这个网址: https : //ec.europa.eu/assets/eac/msca/jobs/import-jobs_en.htm

但是,requests + BS4 只给了我 HTML 格式的页面源。 我认为这是因为内容是动态的。

因此,我尝试了 Selenium + BS4 来抓取网站,但我仍然只能抓取页面源。

from selenium.webdriver import Firefox
from bs4 import BeautifulSoup
import lxml

driver = Firefox()
url = 'https://ec.europa.eu/assets/eac/msca/jobs/import-jobs_en.htm'
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')

如何抓取上述网站?

如果再进一步,您会在此处找到真实数据: https : //euraxess.ec.europa.eu/sites/default/files/exports/msca.xml下面是使用 SimplifiedDoc 的示例。

from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = req.get('https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml') 
doc = SimplifiedDoc(html)
jobs = doc.selects('job-opportunity')
for job in jobs:
    print (job.select('job-id>text()'),job.select('job-title>text()'))

结果:

367020 Early-Stage Researcher (ESR) 3-year PhD position - "Efficient intra-cavity and extra-cavity generation of beams with radial and azimuthal polarization in high-power thin-disk lasers" - Project: GREAT
377512 8 Short-term Early Stage Researcher positions available through the EvoCELL ITN (single cell genomics, evo-devo and science outreach)
383978 ESR (early stage researcher) for intelligent quality control cycles in Industry 4.0 process chains enabled by machine learning
......

实际上,您可以使用 requests + BS4 获得所需的结果。 您需要做的就是使用 API https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml以及标头。

代码

import requests
from bs4 import BeautifulSoup

headers = {
    'authority': 'euraxess.ec.europa.eu',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'accept': 'application/xml, text/xml, */*; q=0.01',
    'sec-ch-ua-mobile': '?0',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36',
    'origin': 'https://ec.europa.eu',
    'sec-fetch-site': 'same-site',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://ec.europa.eu/',
    'accept-language': 'en-US,en;q=0.9',
}

response = requests.get('https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml',headers=headers)
# print(response.text)

soup = BeautifulSoup(response.content, 'html.parser')
ID = soup.find_all('job-id')
Title = soup.find_all('job-title')
for ID,Title in zip(ID,Title):
    print(ID.text,Title.text)

输出

383876 PhD position in the framework of HEalth data LInkage for ClinicAL benefit (Helical) project
433411 PhD Student in Biophysics/Electrophysiology
454880 15 PhD positions in Marie Sklodowska Curie ITN “Active Monitoring of Cancer As An Alternative To Surgery” (CAST)
465392 15 Marie Curie PhD Positions in ''Mobility and Training for Beyond 5G Ecosystems (MOTOR5G)''
480654 Early Stage Research Position in mmWave-based communication systems at National Instruments Dresden GmbH
....

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM