Scrape dynamic JavaScript content webpage using Python

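I am trying to scrape this website with Python: "https://ec.europa.eu/research/mariecurieactions/how-to/find-job_en".

First, I noticed that the table I am interested in is actually served from this URL: https://ec.europa.eu/assets/eac/msca/jobs/import-jobs_en.htm

However, requests + BS4 returns only the HTML page source, not the rendered table. I think this is because the content is loaded dynamically.

So I tried Selenium + BS4 to scrape the site, but I still get only the page source.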

from selenium.webdriver import Firefox
from bs4 import BeautifulSoup

driver = Firefox()
url = 'https://ec.europa.eu/assets/eac/msca/jobs/import-jobs_en.htm'
driver.get(url)
# The 'lxml' parser only needs lxml installed; it does not have to be imported
soup = BeautifulSoup(driver.page_source, 'lxml')

How can I scrape the website above?

If you dig one step further, you will find the real data here: https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml

Below is an example using SimplifiedDoc.

from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc

# Fetch the XML export that backs the job table
html = req.get('https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml')
doc = SimplifiedDoc(html)
# Each <job-opportunity> element describes one job
jobs = doc.selects('job-opportunity')
for job in jobs:
    print(job.select('job-id>text()'), job.select('job-title>text()'))

Result:

367020 Early-Stage Researcher (ESR) 3-year PhD position - "Efficient intra-cavity and extra-cavity generation of beams with radial and azimuthal polarization in high-power thin-disk lasers" - Project: GREAT
377512 8 Short-term Early Stage Researcher positions available through the EvoCELL ITN (single cell genomics, evo-devo and science outreach)
383978 ESR (early stage researcher) for intelligent quality control cycles in Industry 4.0 process chains enabled by machine learning
......

Actually, you can get the desired result with requests + BS4. All you need to do is request the API endpoint https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml and send the headers along with it.

Code:

import requests
from bs4 import BeautifulSoup

headers = {
    'authority': 'euraxess.ec.europa.eu',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'accept': 'application/xml, text/xml, */*; q=0.01',
    'sec-ch-ua-mobile': '?0',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36',
    'origin': 'https://ec.europa.eu',
    'sec-fetch-site': 'same-site',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://ec.europa.eu/',
    'accept-language': 'en-US,en;q=0.9',
}

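response = requests.get('https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml', headers=headers)
response.raise_for_status()  # stop early on an HTTP error
# print(response.text)

# html.parser lowercases tag names, so search with lowercase names
soup = BeautifulSoup(response.content, 'html.parser')
job_ids = soup.find_all('job-id')
job_titles = soup.find_all('job-title')
# Distinct loop names avoid shadowing the two lists being zipped
for job_id, job_title in zip(job_ids, job_titles):
    print(job_id.text, job_title.text)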

Output:

383876 PhD position in the framework of HEalth data LInkage for ClinicAL benefit (Helical) project
433411 PhD Student in Biophysics/Electrophysiology
454880 15 PhD positions in Marie Sklodowska Curie ITN “Active Monitoring of Cancer As An Alternative To Surgery” (CAST)
465392 15 Marie Curie PhD Positions in ''Mobility and Training for Beyond 5G Ecosystems (MOTOR5G)''
480654 Early Stage Research Position in mmWave-based communication systems at National Instruments Dresden GmbH
....
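
As a variation on the code above, here is a minimal sketch that reads the ID and title from inside each job element instead of zipping two separate lists, so a job with a missing title cannot shift the pairing. It uses only the standard library. The element names job-opportunity, job-id, and job-title come from the answers above; the sample XML, its root element name, and the helper extract_jobs are invented for illustration, and the real feed may be structured differently.

```python
import xml.etree.ElementTree as ET

# Invented sample standing in for the real feed. Only the element names
# job-opportunity, job-id and job-title come from the answers above;
# the root element name and the values are hypothetical.
SAMPLE_XML = """<job-opportunities>
  <job-opportunity>
    <job-id>1001</job-id>
    <job-title>Example PhD position</job-title>
  </job-opportunity>
  <job-opportunity>
    <job-id>1002</job-id>
  </job-opportunity>
</job-opportunities>
"""


def extract_jobs(xml_text):
    """Return a list of (job_id, job_title) tuples, one per job-opportunity."""
    root = ET.fromstring(xml_text)
    jobs = []
    # iter() finds job-opportunity elements at any depth below the root
    for job in root.iter('job-opportunity'):
        # findtext returns the default when the child element is missing
        job_id = job.findtext('job-id', default='').strip()
        job_title = job.findtext('job-title', default='').strip()
        jobs.append((job_id, job_title))
    return jobs


for job_id, job_title in extract_jobs(SAMPLE_XML):
    print(job_id, job_title)
```

With the real feed, response.content from the requests call above could probably be passed in place of SAMPLE_XML, assuming the export is well-formed XML without namespaces.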
