
Scraping Multiple Pages using Python and BeautifulSoup - Website URL does not work

My Python code successfully scrapes text from https://www.groupeactual.eu/offre-emploi and saves it to a CSV file.

However, the site above has multiple pages of results, and I would like to scrape all of them.

For example, when I click the link to "page 2", the URL in the browser changes, but when I use that URL in my code I still get the results from page 1.

How can my code be changed to scrape data from all the available listed pages?

My code:

from bs4 import BeautifulSoup
import requests
import pandas as pd 

response = requests.get('https://www.groupeactual.eu/offre-emploi').text

soup = BeautifulSoup(response, "html.parser")

[Rest of the code goes here .... ]

The data is loaded via Ajax from a different URL, which is why requesting the page URL directly always returns the first page. This script goes through all the pages and prints the title and link of each listing:

import re
import requests
from bs4 import BeautifulSoup


data = {
    '_token': "",
    'limit': "21",
    'order': "",
    'adresse': "",
    'google_adresse': "",
    'distance': "",
    'niveau-experience': "0;10",
    'relations[besoin][contrat][debut]': "",
    'js_range_demarrage_dates': "",
    'informations[remunerations]': "10000;100000",
    'page': ""
}

headers = {
    'X-Requested-With': 'XMLHttpRequest'
}

url = 'https://www.groupeactual.eu/offre-emploi?limit=21&order=&adresse=&distance=&niveau-experience=0%3B10&relations%5Bbesoin%5D%5Bcontrat%5D%5Bdebut%5D=&js_range_demarrage_dates=&informations%5Bremunerations%5D=10000%3B100000&page=1'
api_url = 'https://www.groupeactual.eu/offre-emploi/search'


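urls = []
# Use one session so cookies set by the first request are sent with the later POSTs.
with requests.session() as s:
    # Load the regular page once to read the CSRF token from its <meta> tag.
    soup = BeautifulSoup(s.get(url).content, 'html.parser')
    data['_token'] = soup.select_one('meta[name="csrf-token"]')['content']

    page = 1
    while True:
        data['page'] = page
        print('Page {}...'.format(page))
        # Ask the Ajax endpoint for the listing cards of the current page.
        soup = BeautifulSoup(s.post(api_url, data=data, headers=headers).content, 'html.parser')
        cards = soup.select('.card')
        if not cards:
            # No cards returned: we are past the last page.
            break

        for i, card in enumerate(cards, 1):
            # The offer URL is the first single-quoted string in the onclick attribute.
            u = re.search(r"'(.*?)'", card['onclick']).group(1)
            print('{:<5} {:<60} {}'.format(i, card.h3.text, u))
            urls.append(u)

        page += 1

print(urls)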

Prints:

Page 1...
1     Coffreur bancheur (H/F)                                      https://www.groupeactual.eu/offre-emploi/coffreur-bancheur-hf-ernee-RE0046450A46458?utm_medium=api&utm_campaign=Coffreur+bancheur+%28H%2FF%29-46458
2     PEINTRE H/F                                                  https://www.groupeactual.eu/offre-emploi/peintre-hf-laval-RE0046827A50628?utm_medium=api&utm_campaign=PEINTRE+H%2FF-50628
3     PEINTRE H/F                                                  https://www.groupeactual.eu/offre-emploi/peintre-hf-augny-AG5640208TAA50789?utm_medium=api&utm_campaign=PEINTRE+H%2FF-50789
4     Technicien Fibre Optique (h/f)                               https://www.groupeactual.eu/offre-emploi/technicien-fibre-optique-hf-forbach-AG5640208BCA50790?utm_medium=api&utm_campaign=Technicien+Fibre+Optique+%28h%2Ff%29-50790
5     CONDUCTEUR D'ENGINS H/F                                      https://www.groupeactual.eu/offre-emploi/conducteur-dengins-hf-amblainville-RE0047896A51376?utm_medium=api&utm_campaign=CONDUCTEUR+D%27ENGINS+H%2FF-51376
6     Technicien Informatique (h/f)                                https://www.groupeactual.eu/offre-emploi/technicien-informatique-hf-metz-RE0047858A52066?utm_medium=api&utm_campaign=Technicien+Informatique+%28h%2Ff%29-52066
7     Opérateur Traitement de Surface H/F                          https://www.groupeactual.eu/offre-emploi/operateur-traitement-de-surface-hf-bressuire-RE0050805A53145?utm_medium=api&utm_campaign=Op%C3%A9rateur+Traitement+de+Surface+H%2FF-53145
8     CHAUFFEUR PL SPL (H/F)                                       https://www.groupeactual.eu/offre-emploi/chauffeur-pl-spl-hf-boulogne-sur-mer-RE0047560A53509?utm_medium=api&utm_campaign=CHAUFFEUR+PL+SPL+%28H%2FF%29-53509
9     Technicien d'Installations Électriques (H/F)                 https://www.groupeactual.eu/offre-emploi/technicien-dinstallations-electriques-hf-metz-RE0048762A53801?utm_medium=api&utm_campaign=Technicien+d%27Installations+%C3%89lectriques+%28H%2FF%29-53801
10    Cuisinier en industrie agroalimentaire (H/F)                 https://www.groupeactual.eu/offre-emploi/cuisinier-en-industrie-agroalimentaire-hf-talmont-saint-hilaire-RE0073692A93442?utm_medium=api&utm_campaign=Cuisinier+en+industrie+agroalimentaire+%28H%2FF%29-93442
11    Préparateur de commandes (H/F)                               https://www.groupeactual.eu/offre-emploi/preparateur-de-commandes-hf-sevremoine-RE0074893A94943?utm_medium=api&utm_campaign=Pr%C3%A9parateur+de+commandes+%28H%2FF%29-94943

... and so on (until page 135)
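
Since the question also wants the results in a CSV file, here is a minimal sketch of how the titles and links collected above could be written out. The helper names `extract_url` and `offers_to_csv` are hypothetical, and the sample `onclick` value is an assumption about what the attribute looks like, not something taken from the site. The sketch uses the standard `csv` module rather than pandas to stay self-contained.

```python
import csv
import io
import re


def extract_url(onclick):
    """Return the first single-quoted string in an onclick value, or None."""
    match = re.search(r"'(.*?)'", onclick or "")
    return match.group(1) if match else None


def offers_to_csv(rows):
    """Serialize (title, url) pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["title", "url"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample standing in for one scraped card (assumed attribute format).
sample_onclick = "window.location.href='https://www.groupeactual.eu/offre-emploi/example'"
rows = [("PEINTRE H/F", extract_url(sample_onclick))]

print(offers_to_csv(rows))
```

Inside the scraping loop, one could collect `(card.h3.get_text(strip=True), extract_url(card.get('onclick')))` for each card instead of only the URL, then write the result of `offers_to_csv(rows)` to a file opened with `newline=""` and `encoding="utf-8"`. Unlike the direct `card['onclick']` lookup, this variant returns None for a card without a usable `onclick` attribute instead of raising an exception.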
