简体   繁体   中英

Scrape dynamic javascript content webpage using python

I am trying to scrape this website: ' https://ec.europa.eu/research/mariecurieactions/how-to/find-job_en ' using Python.

First I noticed that the table I am interested in is actually at this url: https://ec.europa.eu/assets/eac/msca/jobs/import-jobs_en.htm

However, requests + BS4 is only giving me the page source in HTML. I assume that this is because the content is dynamic.

Therefore I tried Selenium + BS4 to scrape the website, but I still only manage to scrape the page source.

from selenium.webdriver import Firefox
from bs4 import BeautifulSoup
import lxml

driver = Firefox()
url = 'https://ec.europa.eu/assets/eac/msca/jobs/import-jobs_en.htm'
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')

How can I scrape the aforementioned website?

If you go further, you'll find the real data here: https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml Here is an example of using SimplifiedDoc.

from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = req.get('https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml') 
doc = SimplifiedDoc(html)
jobs = doc.selects('job-opportunity')
for job in jobs:
    print (job.select('job-id>text()'),job.select('job-title>text()'))

Result:

367020 Early-Stage Researcher (ESR) 3-year PhD position - "Efficient intra-cavity and extra-cavity generation of beams with radial and azimuthal polarization in high-power thin-disk lasers" - Project: GREAT
377512 8 Short-term Early Stage Researcher positions available through the EvoCELL ITN (single cell genomics, evo-devo and science outreach)
383978 ESR (early stage researcher) for intelligent quality control cycles in Industry 4.0 process chains enabled by machine learning
......

Actually, you can get the desired results using requests + BS4. All you need to do is to use the API https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml along with headers.

Code

import requests
from bs4 import BeautifulSoup

headers = {
    'authority': 'euraxess.ec.europa.eu',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'accept': 'application/xml, text/xml, */*; q=0.01',
    'sec-ch-ua-mobile': '?0',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36',
    'origin': 'https://ec.europa.eu',
    'sec-fetch-site': 'same-site',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://ec.europa.eu/',
    'accept-language': 'en-US,en;q=0.9',
}

response = requests.get('https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml',headers=headers)
# print(response.text)

soup = BeautifulSoup(response.content, 'html.parser')
ID = soup.find_all('job-id')
Title = soup.find_all('job-title')
for ID,Title in zip(ID,Title):
    print(ID.text,Title.text)

output

383876 PhD position in the framework of HEalth data LInkage for ClinicAL benefit (Helical) project
433411 PhD Student in Biophysics/Electrophysiology
454880 15 PhD positions in Marie Sklodowska Curie ITN “Active Monitoring of Cancer As An Alternative To Surgery” (CAST)
465392 15 Marie Curie PhD Positions in ''Mobility and Training for Beyond 5G Ecosystems (MOTOR5G)''
480654 Early Stage Research Position in mmWave-based communication systems at National Instruments Dresden GmbH
....

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM