
Selenium with Python / Navigating to next page

I am having a hard time browsing through the 448 consecutive pages of https://www.digitalwallonia.be/fr/cartographie/ with Selenium under Python in a robust manner. I have tried (too) many things without a satisfactory result, which makes it difficult to post relevant code.

I would like to see your solution. Apologies if the question is not appropriately formulated: I am a first-timer.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

browser = webdriver.Firefox()
browser.implicitly_wait(20)


browser.get('https://www.digitalwallonia.be/fr/cartographie')
browser.find_element("xpath",'//*[@id="axeptio_btn_acceptAll"]').click()
browser.find_element("xpath",'//*[@id="axeptio_btn_configure"]').click()
browser.find_element("xpath",'//*[@id="axeptio_btn_acceptAllAndNext"]').click()

WebDriverWait(browser, 1000).until(EC.element_to_be_clickable((By.CLASS_NAME,'next'))).click()

input('Press ENTER to close the automated browser')
browser.quit()

I get the following error: selenium.common.exceptions.ElementNotInteractableException: Message: Element could not be scrolled into view

I would advise on several issues here:

  1. You should preferably use WebDriverWait rather than implicitly_wait, since the latter waits for element presence only, while with WebDriverWait you can wait for more mature element states, i.e. visible, clickable and more.
  2. Don't mix WebDriverWait and implicitly_wait in the same script; it may cause unpredictable wait times.
  3. The next-page buttons are at the bottom of the page, so you need to scroll down first and only then click the pager button.
  4. There is no need to set the timeout to more than 30 seconds.

The code below is working:
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC


options = Options()
options.add_argument("start-maximized")


# Raw string so the backslashes in the Windows path are not read as escape sequences
webdriver_service = Service(r'C:\webdrivers\chromedriver.exe')
driver = webdriver.Chrome(service=webdriver_service, options=options)
url = "https://www.digitalwallonia.be/fr/cartographie"
actions = ActionChains(driver)

wait = WebDriverWait(driver, 10)
driver.get(url)

wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="axeptio_btn_acceptAll"]'))).click()
wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="axeptio_btn_configure"]'))).click()
wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="axeptio_btn_acceptAllAndNext"]'))).click()

driver.execute_script("window.scrollBy(0, arguments[0]);", 800)
time.sleep(0.5)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.next a'))).click()
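
To walk through all the pages rather than only the first click, the scroll-and-click step can be wrapped in a loop. The sketch below is one possible way and has not been run against the live site. It assumes that the '.next a' selector from the code above matches the pager link on every page and that the link is no longer found or clickable on the last page. `visit_all_pages` and `make_click_next` are hypothetical helper names, not part of Selenium.

```python
def visit_all_pages(click_next, max_pages=448):
    """Call click_next() until it returns False or max_pages is reached.

    Returns the number of pages visited, counting the first one.
    """
    visited = 1
    while visited < max_pages and click_next():
        # Scrape the freshly loaded page here before moving on.
        visited += 1
    return visited


def make_click_next(driver, timeout=10):
    """Build a click_next() callable for a Selenium driver (hypothetical helper)."""
    # Imported here so visit_all_pages() stays usable without Selenium installed.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    locator = (By.CSS_SELECTOR, '.next a')  # assumption: same pager link on every page

    def click_next():
        try:
            link = wait.until(EC.presence_of_element_located(locator))
            # Scroll the link itself into view instead of a fixed number of pixels.
            driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", link)
            wait.until(EC.element_to_be_clickable(locator)).click()
        except TimeoutException:
            return False  # no clickable pager link: treat as the last page
        return True

    return click_next

# Usage with the driver created above:
#     pages_seen = visit_all_pages(make_click_next(driver))
```

Scrolling the element into view avoids guessing a pixel offset, which depends on the window size. Since the results are loaded asynchronously, an additional wait for the new items may be needed after each click.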

Every time you click the 'Suivant' (next page) button, the JavaScript in the page makes a POST request to an API endpoint, with headers and a payload. The headers, payload and endpoint can be found in the browser's Dev Tools, Network tab (filter on XHR calls). Hence, we can query that API URL directly with requests and avoid the overhead of Selenium/chromedriver. Below is a way of obtaining that data:

import requests
import pandas as pd

big_df = pd.DataFrame()
url = 'https://search.production.ribo.digitalwallonia.be/contentful-entries_production/_search/template'

headers = {
    'content-type': 'application/json',
    'Origin': 'https://www.digitalwallonia.be',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
s = requests.Session()
s.headers.update(headers)
counter = 0
while True:
    payload = '{"id":"filter-profile-search-template-fr-v3","params":{"categoriesSlugList":[],"programsSlugList":[],"from":' + str(counter) + ',"regionsList":[],"size":100}}'
    r = s.post(url, data=payload)
    big_df = pd.concat([big_df, pd.json_normalize(r.json()['hits']['hits'])], axis=0, ignore_index=True)
    counter = counter + 100
    if counter > 448*12:
        break
print(big_df)
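
The request body in the loop above is assembled by string concatenation, and the loop stops at a hard-coded page count. A possible variation, sketched here with the hypothetical helpers `build_payload` and `fetch_all_hits`, builds the body with `json.dumps` and stops as soon as a response carries no hits. It assumes the endpoint returns an empty `hits.hits` list once `from` passes the last record, which has not been verified against the live API.

```python
import json


def build_payload(offset, size=100):
    """Return the JSON body for one batch of results, starting at `offset`."""
    return json.dumps({
        "id": "filter-profile-search-template-fr-v3",
        "params": {
            "categoriesSlugList": [],
            "programsSlugList": [],
            "from": offset,
            "regionsList": [],
            "size": size,
        },
    })


def fetch_all_hits(post, size=100, max_requests=1000):
    """Collect hits batch by batch until a response comes back empty.

    `post` takes the JSON body as a string and returns the decoded response,
    e.g. lambda body: s.post(url, data=body).json() with the session above.
    """
    all_hits = []
    for request_number in range(max_requests):  # safety cap against endless loops
        response = post(build_payload(request_number * size, size))
        hits = response["hits"]["hits"]
        if not hits:
            break  # assumption: an empty batch means we are past the last record
        all_hits.extend(hits)
    return all_hits
```

With the session from the code above, `pd.json_normalize(fetch_all_hits(lambda body: s.post(url, data=body).json()))` would then build the dataframe in one step instead of concatenating inside the loop.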

We are getting 100 items per request (the actual page gets 12 at a time). After a minute or so, you should have the following dataframe displayed in your terminal:

    _index  _type   _id     _score  sort    _source.sys.id  _source.sys.contentType.sys.id  _source.sys.updatedAt   _source.fields.addresses.fr     _source.fields.belgianEnterprisesNumbers.fr     _source.fields.urlsWebSite.fr   _source.fields.shortDescription.en  _source.fields.shortDescription.fr  _source.fields.logoAssetImage.fr.file.en.fileName   _source.fields.logoAssetImage.fr.file.en.details.image.width    _source.fields.logoAssetImage.fr.file.en.details.image.height   _source.fields.logoAssetImage.fr.file.en.details.size   _source.fields.logoAssetImage.fr.file.en.contentType    _source.fields.logoAssetImage.fr.file.en.url    _source.fields.logoAssetImage.fr.file.fr.fileName   _source.fields.logoAssetImage.fr.file.fr.details.image.width    _source.fields.logoAssetImage.fr.file.fr.details.image.height   _source.fields.logoAssetImage.fr.file.fr.details.size   _source.fields.logoAssetImage.fr.file.fr.contentType    _source.fields.logoAssetImage.fr.file.fr.url    _source.fields.logoAssetImage.fr.title.en   _source.fields.logoAssetImage.fr.title.fr   _source.fields.title.en     _source.fields.title.fr     _source.fields.slug.en  _source.fields.slug.fr  _source.fields.urlsSocialNetwork.fr     _source.fields.shortTitle.en    _source.fields.shortTitle.fr    _source.fields.founders.fr  _source.fields.mainNaceCode.fr  _source.fields.staffing.fr  _source.fields.logoAssetImage.fr    _source.fields.partnersAdditionalDescriptions.fr    _source.fields.incubators.fr
0   contentful-entries_productionv3     _doc    3O1t8sTHhj5ZGrmGKtHI6y  None    [ Dynamix JAVA]     3O1t8sTHhj5ZGrmGKtHI6y  profile     2022-09-01T14:36:06.899Z    [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.388591169708497, 'Lat': 50.7035958197085}, 'Northeast': {'Lng': 4.391289130291502, 'Lat': 50.7062937802915}}, 'coordinates': [4.3898572, 50.7050388], 'type': 'Point', 'Location': {'Lng': 4.3898572, 'Lat': 50.7050388}}, 'Metadata': {'PlaceId': 'ChIJOZeR297Rw0cR_y-bZPZvwzQ', 'AddressType': 'head office', 'Timestamp': '2022-08-29T13:55:32.180Z'}, 'FormattedAddress': 'Av. des Dauphins 17, 1410 Waterloo, Belgique', 'MainAddress': True}]  [0715677777]    [{'Metadata': {'Timestamp': '2022-08-29T15:58:45+02:00'}, 'URL': 'https://dynamix-it.be/'}]     Consulting company specialised in JAVA, SAP, DotNet, and son one.   Société de consultance spécialisée en JAVA, SAP, DotNet, etc.   dynamix_java.png    160.0   160.0   15950.0     image/png   //images.ctfassets.net/myqv2p4gx62v/3jrjoVohZ1ooo2VMkum0Ns/1e5bd1ac59dab0126baea85f9156b872/dynamix_java.png    dynamix java.png    160.0   160.0   15950.0     image/png   //images.ctfassets.net/myqv2p4gx62v/3jrjoVohZ1ooo2VMkum0Ns/8e23b45bf77a17026df43cd072d06a52/dynamix_java.png    Dynamix Java    Dynamix Java    Dynamix JAVA    Dynamix JAVA    dynamix-java    dynamix-java    [{'Metadata': {'Timestamp': '2022-08-29T15:58:14+02:00'}, 'URL': 'https://www.facebook.com/DYNAMIXJAVASPRL'}, {'Metadata': {'Timestamp': '2022-08-29T15:58:27+02:00'}, 'URL': 'https://www.linkedin.com/company/dynamixjava/'}]     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
1   contentful-entries_productionv3     _doc    4D2kOg0t4iRD11fzJFaPc8  None    [ Lan-Area ]    4D2kOg0t4iRD11fzJFaPc8  profile     2022-08-25T08:42:32.473Z    [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.744188919708497, 'Lat': 50.3149442697085}, 'Northeast': {'Lng': 4.746886880291502, 'Lat': 50.3176422302915}}, 'coordinates': [4.745529299999999, 50.31632769999999], 'type': 'Point', 'Location': {'Lng': 4.745529299999999, 'Lat': 50.31632769999999}}, 'Metadata': {'PlaceId': 'ChIJm9XAKz6SwUcRs45ovYpmEpc', 'AddressType': 'head office', 'Timestamp': '2022-06-21T14:17:33.655Z'}, 'FormattedAddress': 'Rue d'Ermeton 14, 5537 Anhée, Belgique', 'MainAddress': True}]  [0779822986]    [{'Metadata': {'Timestamp': '2022-08-25T10:42:29+02:00'}, 'URL': 'https://www.lan-area.be/'}]   Platform exclusively focused on local sports competition. Lan-Area has created a central calendar where all local events are announced and a Belgian community space where players can post their teams, courses and successes.     Plateforme exclusivement tournée vers la compétition e-sportive locale . Lan-Area a créé un calendrier central où tous les évènements locaux sont annoncés et un espace communautaire belge où les joueurs peuvent afficher leurs équipes, parcours et succès.  
lan-Aera.jpg    450.0   250.0   21154.0     image/jpeg  //images.ctfassets.net/myqv2p4gx62v/3Gg1nuukov4gaypTawIQs8/346ee9006b0b5e3e33d2fab6ce293a47/lan-Aera.jpg    lan-Aera.jpg    450.0   250.0   21154.0     image/jpeg  //images.ctfassets.net/myqv2p4gx62v/3Gg1nuukov4gaypTawIQs8/7f30ce6782073cf51d16c1f67ef5ee0d/lan-Aera.jpg    lan-Aera    Logo Lan-Aera   Lan-Aera    Lan-Area    lan-aera    lan-area    [{'Metadata': {'Timestamp': '2022-06-21T15:06:34+02:00'}, 'URL': 'https://www.facebook.com/lanarea2020'}, {'Metadata': {'Timestamp': '2022-06-21T15:07:31+02:00'}, 'URL': 'https://twitter.com/LanArea5'}, {'Metadata': {'Timestamp': '2022-06-21T15:59:53+02:00'}, 'URL': 'https://www.twitch.tv/ladh_lanarea'}]   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
2   contentful-entries_productionv3     _doc    6sbdRDRWJXTTtbR1wycE52  None    [1-formation.be]    6sbdRDRWJXTTtbR1wycE52  profile     2022-05-15T11:21:20.388Z    [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.863200770107277, 'Lat': 50.46117977010727}, 'Northeast': {'Lng': 4.865900429892721, 'Lat': 50.46387942989271}}, 'coordinates': [4.864224099999999, 50.462539], 'type': 'Point', 'Location': {'Lng': 4.864224099999999, 'Lat': 50.462539}}, 'Metadata': {'PlaceId': 'ChIJa-SkInKZwUcRsc1Xs-GqwSE', 'AddressType': 'head office', 'Timestamp': '2022-05-07T18:43:01.598Z'}, 'FormattedAddress': 'Rue des Fossés Fleuris 42, 5000 Namur, Belgique', 'MainAddress': True}]   [0891973792]    [{'Metadata': {'Timestamp': '2022-05-07T18:43:01.459Z'}, 'URL': 'http://www.1-formation.be/'}]  Training in IT following based on four subjects: office applications, web and image, web marketing and communication, personnel management and development.     Formations en informatique suivant quatre thématiques: bureautique, web et image, webmarketing et communication, management et développement personnel.     NaN     NaN     NaN     NaN     NaN     NaN     logo-f-1-formation.jpg  350.0   77.0    5569.0  image/jpeg  //images.ctfassets.net/myqv2p4gx62v/7Itx3K16vYyGTHuYUD7TfW/7103d85dbce48d1c3a0535dac76df5c0/logo-f-1-formation.jpg  NaN     logo-f-1-formation.jpg  1-formation.be  1-formation.be  1-formationbe   1-formationbe   [{'Metadata': {'Timestamp': '2022-05-07T18:43:01.459Z'}, 'URL': 'https://twitter.com/1formation_be'}, {'Metadata': {'Timestamp': '2022-05-07T18:43:01.459Z'}, 'URL': 'https://www.facebook.com/1formation'}]    NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
3   contentful-entries_productionv3     _doc    4EuOqP1eQIeka5xHcoq5mQ  None    [1-position.be]     4EuOqP1eQIeka5xHcoq5mQ  profile     2022-05-15T11:21:23.274Z    [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.863200770107277, 'Lat': 50.46117977010727}, 'Northeast': {'Lng': 4.865900429892721, 'Lat': 50.46387942989271}}, 'coordinates': [4.864224099999999, 50.462539], 'type': 'Point', 'Location': {'Lng': 4.864224099999999, 'Lat': 50.462539}}, 'Metadata': {'PlaceId': 'ChIJa-SkInKZwUcRsc1Xs-GqwSE', 'AddressType': 'head office', 'Timestamp': '2022-05-07T18:51:39.745Z'}, 'FormattedAddress': 'Rue des Fossés Fleuris 42, 5000 Namur, Belgique', 'MainAddress': True}]   [0891973792]    []  Communications agency and IT training centre: website creation, professional SEO, the creation of Google Adwords campaigns, copywriting and web content, visual identity creation, communications consulting.   Agence de communication et centre de formation informatique: création de sites web, référencement professionnel, création et gestion de campagnes Google AdWords, copywriting et écriture web, création d'identité visuelle, conseil en communication.  NaN     NaN     NaN     NaN     NaN     NaN     marque-cp52u3ifgt9us27gak951f15p6-1369821343-position.png   169.0   129.0   11128.0     image/png   //images.ctfassets.net/myqv2p4gx62v/2RMVJINCIXiF4O2hZIb6kx/c1aebc77207c1a5ae67af5ebd87b1dd3/marque-cp52u3ifgt9us27gak951f15p6-1369821343-position.png   NaN     marque-cp52u3ifgt9us27gak951f15p6-1369821343-position.png   1-position.be   1-position.be   1-positionbe    1-positionbe    [{'Metadata': {'Timestamp': '2022-05-07T18:51:39.679Z'}, 'URL': 'https://twitter.com/1position'}, {'Metadata': {'Timestamp': '2022-05-07T18:51:39.679Z'}, 'URL': 'https://www.facebook.com/pages/1-positionbe/147447630063'}, {'Metadata': {'Timestamp': '2022-05-07T18:51:39.679Z'}, 'URL': 'https://www.linkedin.com/company/1-position.be'}]     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
4   contentful-entries_productionv3     _doc    1VvYEZncg0lEDL8RzGAvmE  None    [123 Automation Engineering & Development]  1VvYEZncg0lEDL8RzGAvmE  profile     2022-05-15T05:25:51.214Z    [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.456926070107278, 'Lat': 50.53833147010727}, 'Northeast': {'Lng': 4.459625729892722, 'Lat': 50.54103112989272}}, 'coordinates': [4.4582759, 50.5396813], 'type': 'Point', 'Location': {'Lng': 4.4582759, 'Lat': 50.5396813}}, 'Metadata': {'PlaceId': 'EjNSdWUgZGVzIEFydGlzYW5zIDQsIDYyMTAgTGVzIEJvbnMgVmlsbGVycywgQmVsZ2lxdWUiGhIYChQKEgn75Aq3dyzCRxFEh7hEj1NdPBAE', 'AddressType': 'head office', 'Timestamp': '2022-05-07T15:17:32.918Z'}, 'FormattedAddress': 'Rue des Artisans 4, 6210 Les Bons Villers, Belgique', 'MainAddress': True}]    [0820888531]    [{'Metadata': {'Timestamp': '2022-05-07T15:17:32.867Z'}, 'URL': 'http://www.123automation.be/'}]    NaN     Automation et robotique industrielle: étude, conception, développement, intégration et maintenance de solutions automatisées visant l’amélioration de la productivité dans les processus de fabrication quels qu’ils soient.    NaN     NaN     NaN     NaN     NaN     NaN     123automation.png   319.0   111.0   5802.0  image/png   //images.ctfassets.net/myqv2p4gx62v/6uY3Y6EDfICh8wdp4XNK7Z/082273035f7a600ec34098b09ab4fee9/123automation.png   NaN     123automation.png   123 Automation Engineering & Development    123 Automation Engineering & Development    123-automation  123-automation  []  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
5360    contentful-entries_productionv3     _doc    1AbDfyZ4rHL18Bw6aiJKSA  None    [École Centrale des Arts et Métiers - HE Vinci]     1AbDfyZ4rHL18Bw6aiJKSA  profile     2022-05-15T11:43:23.005Z    [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.452325870107279, 'Lat': 50.84853592010727}, 'Northeast': {'Lng': 4.455025529892723, 'Lat': 50.85123557989272}}, 'coordinates': [4.4538028, 50.8499896], 'type': 'Point', 'Location': {'Lng': 4.4538028, 'Lat': 50.8499896}}, 'Metadata': {'PlaceId': 'ChIJwdgtpYbcw0cRfjW1nUhDNk8', 'AddressType': 'head office', 'Timestamp': '2022-05-07T15:44:19.720Z'}, 'FormattedAddress': 'Prom. de l'Alma 50, 1200 Woluwe-Saint-Lambert, Belgique', 'MainAddress': True}]     [0459279954, 0409454123]    [{'Metadata': {'Timestamp': '2022-05-07T15:44:19.660Z'}, 'URL': 'http://www.ecam.be/'}]     NaN     L'ECAM est un Institut Supérieur Industriel ayant pour objet la formation de Master en sciences industrielles dans une des spécialités suivantes: ​automatisation, construction, électromécanique, électronique, géomètre, informatique, business analyst (alternance).     NaN     NaN     NaN     NaN     NaN     NaN     ecam.jpg    512.0   512.0   93657.0     image/jpeg  //images.ctfassets.net/myqv2p4gx62v/4e2oSTcbXRABuyibUwgs95/4e5d8f540ccc67065a94eb528418ddd7/ecam.jpg    NaN     ecam.jpg    École Centrale des Arts et Métiers - HE Vinci   École Centrale des Arts et Métiers - HE Vinci   ecole-centrale-des-arts-et-metiers  ecole-centrale-des-arts-et-metiers  []  ECAM    ECAM    NaN     NaN     NaN     NaN     NaN     NaN
5361    contentful-entries_productionv3     _doc    5vp8xZpO6CucXtOmc1H8yR  None    [École communale fondamentale de Seneffe]   5vp8xZpO6CucXtOmc1H8yR  profile     2022-05-15T09:12:19.246Z    [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.252977370107278, 'Lat': 50.52898217010728}, 'Northeast': {'Lng': 4.255677029892722, 'Lat': 50.53168182989272}}, 'coordinates': [4.2543333, 50.5303456], 'type': 'Point', 'Location': {'Lng': 4.2543333, 'Lat': 50.5303456}}, 'Metadata': {'PlaceId': 'ChIJt1KItgg0wkcR6ekUYWMbdDg', 'AddressType': 'head office', 'Timestamp': '2022-05-07T18:58:11.863Z'}, 'FormattedAddress': 'Rue de Buisseret 19, 7180 Seneffe, Belgique', 'MainAddress': True}]     NaN     []  NaN     Ecole fondamentale.     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     École communale fondamentale de Seneffe     École communale fondamentale de Seneffe     ecole-communale-de-seneffe  ecole-communale-de-seneffe  []  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
[...]

This dataframe has 5365 rows × 40 columns. You can inspect the initial JSON response and dissect it further if you need more, less or other information from it.
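
As one way of dissecting the response, the nested fields of a single hit can be pulled out by following the same dotted paths that appear as column names in the dataframe. The sketch below uses a hypothetical helper `get_path` and a trimmed-down hit whose values are copied from the first row of the output; the real hits contain many more fields.

```python
def get_path(record, dotted_path, default=None):
    """Follow a dotted path such as '_source.fields.title.fr' through nested dicts."""
    current = record
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Trimmed-down example hit; values taken from the first row of the output.
hit = {
    "_id": "3O1t8sTHhj5ZGrmGKtHI6y",
    "_source": {
        "fields": {
            "title": {"fr": "Dynamix JAVA"},
            "slug": {"fr": "dynamix-java"},
            "urlsWebSite": {"fr": [{"URL": "https://dynamix-it.be/"}]},
        }
    },
}

# Keep only the fields of interest; missing paths fall back to the default.
summary = {
    "id": get_path(hit, "_id"),
    "title": get_path(hit, "_source.fields.title.fr"),
    "slug": get_path(hit, "_source.fields.slug.fr"),
    "websites": [site["URL"] for site in get_path(hit, "_source.fields.urlsWebSite.fr", [])],
}
print(summary)
```

Applying the same selection to every hit before building the dataframe keeps it to a handful of columns instead of 40.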

Requests docs: https://requests.readthedocs.io/en/latest/

Pandas relevant documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html
