Selenium with Python / Navigating to next page

I am having a hard time browsing through the 448 consecutive pages of https://www.digitalwallonia.be/fr/cartographie/ with Selenium under Python in a robust manner. I tried (too) many things without a satisfactory result (hence it is difficult to post relevant code).

I would like to see your solution. Apologies if the question is not appropriately formulated: I am a first-timer.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

browser = webdriver.Firefox()
browser.implicitly_wait(20)


browser.get('https://www.digitalwallonia.be/fr/cartographie')
browser.find_element("xpath",'//*[@id="axeptio_btn_acceptAll"]').click()
browser.find_element("xpath",'//*[@id="axeptio_btn_configure"]').click()
browser.find_element("xpath",'//*[@id="axeptio_btn_acceptAllAndNext"]').click()

WebDriverWait(browser, 1000).until(EC.element_to_be_clickable((By.CLASS_NAME,'next'))).click()

input('Press ENTER to close the automated browser')
browser.quit()

I get the following error: selenium.common.exceptions.ElementNotInteractableException: Message: Element could not be scrolled into view

I would advise on several issues here:

  1. Prefer WebDriverWait over implicitly_wait: implicitly_wait only waits for element presence, while WebDriverWait can wait for more mature element states, i.e. visible, clickable and more.
  2. Don't mix WebDriverWait and implicitly_wait in the same script; it may cause problems such as unpredictable wait times.
  3. The next-page buttons are at the bottom of the page, so you need to scroll down first and only then click the pager button.
  4. There is no need to set the timeout to more than 30 seconds.

The code below is working:

import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC


options = Options()
options.add_argument("start-maximized")


# Raw string so the backslashes in the Windows path are kept literally
webdriver_service = Service(r'C:\webdrivers\chromedriver.exe')
driver = webdriver.Chrome(service=webdriver_service, options=options)
url = "https://www.digitalwallonia.be/fr/cartographie"
actions = ActionChains(driver)

wait = WebDriverWait(driver, 10)
driver.get(url)

wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="axeptio_btn_acceptAll"]'))).click()
wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="axeptio_btn_configure"]'))).click()
wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="axeptio_btn_acceptAllAndNext"]'))).click()

driver.execute_script("window.scrollBy(0, arguments[0]);", 800)
time.sleep(0.5)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.next a'))).click()
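
The snippet above clicks the pager only once. As a minimal sketch (not part of the original answer), the same scroll-then-click step could be repeated to walk through further pages. The helper name click_through_pages, the pause value and the use of a JavaScript click are assumptions made for illustration; the '.next a' selector is the one used above. The string "css selector" is the value of By.CSS_SELECTOR, so the helper works with the driver and wait objects already created.

```python
import time

# Same locator as (By.CSS_SELECTOR, '.next a') in the code above
NEXT_LINK = ("css selector", ".next a")


def click_through_pages(driver, wait, clicks, pause=0.5):
    """Click the 'next' link `clicks` times; return how many clicks were made."""
    done = 0
    for _ in range(clicks):
        # Wait until at least one matching link exists (a non-empty list is truthy)
        links = wait.until(lambda d: d.find_elements(*NEXT_LINK))
        link = links[0]
        # Bring the pager into view first, since it sits at the bottom of the page
        driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", link)
        # Assumption: a JavaScript click is acceptable here; it avoids relying on
        # the browser being able to scroll the element into view by itself
        driver.execute_script("arguments[0].click();", link)
        done += 1
        # Crude pause so the next page of results has time to render
        time.sleep(pause)
    return done
```

With the driver and wait from the code above, click_through_pages(driver, wait, 447) would move from page 1 to page 448, assuming the page count given in the question still holds. Any per-page scraping would go inside the loop, and if the link never appears, wait.until raises Selenium's TimeoutException.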

Every time you click to go to the next page (the 'Suivant' button), the JavaScript in the page makes a POST request to an API endpoint, with a header and a payload. The header, payload and API endpoint can be found in the browser dev tools, Network tab (select only XHR calls). Hence, we can try to scrape that API URL using requests, avoiding the overhead of selenium/chromedriver. Below is a way of obtaining that data:

import requests
import pandas as pd

big_df = pd.DataFrame()
url = 'https://search.production.ribo.digitalwallonia.be/contentful-entries_production/_search/template'

headers = {
    'content-type': 'application/json',
    'Origin': 'https://www.digitalwallonia.be',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
s = requests.Session()
s.headers.update(headers)
counter = 0
while True:
    payload = '{"id":"filter-profile-search-template-fr-v3","params":{"categoriesSlugList":[],"programsSlugList":[],"from":' + str(counter) + ',"regionsList":[],"size":100}}'
    r = s.post(url, data=payload)
    big_df = pd.concat([big_df, pd.json_normalize(r.json()['hits']['hits'])], axis=0, ignore_index=True)
    counter = counter + 100
    if counter > 448*12:
        break
print(big_df)
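
The loop above builds the JSON payload by string concatenation and hard-codes the stopping point as 448*12. A possible variant, sketched here under assumptions, builds the payload with json.dumps and stops once a request returns no hits. The helper names build_payload, make_fetcher and collect_hits are hypothetical, and the sketch assumes the endpoint returns an empty hits list once the offset is past the last record.

```python
import json


def build_payload(offset, size=100):
    """Return the JSON body used above, with 'from' set to the given offset."""
    return json.dumps({
        "id": "filter-profile-search-template-fr-v3",
        "params": {
            "categoriesSlugList": [],
            "programsSlugList": [],
            "from": offset,
            "regionsList": [],
            "size": size,
        },
    })


def make_fetcher(session, url):
    """Wrap a requests.Session so that it maps a payload to a list of hits."""
    def fetch_page(payload):
        response = session.post(url, data=payload)
        response.raise_for_status()  # fail loudly on HTTP errors
        return response.json()["hits"]["hits"]
    return fetch_page


def collect_hits(fetch_page, size=100, max_requests=1000):
    """Request consecutive pages until one comes back empty."""
    all_hits = []
    offset = 0
    for _ in range(max_requests):  # safety cap in case the API never returns empty
        hits = fetch_page(build_payload(offset, size))
        if not hits:
            break
        all_hits.extend(hits)
        offset += size
    return all_hits
```

With the session s and url defined above, hits = collect_hits(make_fetcher(s, url)) followed by pd.json_normalize(hits) should give a dataframe equivalent to big_df, built with a single normalization instead of one pd.concat per request.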

We are getting 100 items at once (the actual page gets 12 at once). After a minute or so, you should have the following dataframe displayed in your terminal:

    _index  _type   _id     _score  sort    _source.sys.id  _source.sys.contentType.sys.id  _source.sys.updatedAt   _source.fields.addresses.fr     _source.fields.belgianEnterprisesNumbers.fr     _source.fields.urlsWebSite.fr   _source.fields.shortDescription.en  _source.fields.shortDescription.fr  _source.fields.logoAssetImage.fr.file.en.fileName   _source.fields.logoAssetImage.fr.file.en.details.image.width    _source.fields.logoAssetImage.fr.file.en.details.image.height   _source.fields.logoAssetImage.fr.file.en.details.size   _source.fields.logoAssetImage.fr.file.en.contentType    _source.fields.logoAssetImage.fr.file.en.url    _source.fields.logoAssetImage.fr.file.fr.fileName   _source.fields.logoAssetImage.fr.file.fr.details.image.width    _source.fields.logoAssetImage.fr.file.fr.details.image.height   _source.fields.logoAssetImage.fr.file.fr.details.size   _source.fields.logoAssetImage.fr.file.fr.contentType    _source.fields.logoAssetImage.fr.file.fr.url    _source.fields.logoAssetImage.fr.title.en   _source.fields.logoAssetImage.fr.title.fr   _source.fields.title.en     _source.fields.title.fr     _source.fields.slug.en  _source.fields.slug.fr  _source.fields.urlsSocialNetwork.fr     _source.fields.shortTitle.en    _source.fields.shortTitle.fr    _source.fields.founders.fr  _source.fields.mainNaceCode.fr  _source.fields.staffing.fr  _source.fields.logoAssetImage.fr    _source.fields.partnersAdditionalDescriptions.fr    _source.fields.incubators.fr
0   contentful-entries_productionv3     _doc    3O1t8sTHhj5ZGrmGKtHI6y  None    [ Dynamix JAVA]     3O1t8sTHhj5ZGrmGKtHI6y  profile     2022-09-01T14:36:06.899Z    [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.388591169708497, 'Lat': 50.7035958197085}, 'Northeast': {'Lng': 4.391289130291502, 'Lat': 50.7062937802915}}, 'coordinates': [4.3898572, 50.7050388], 'type': 'Point', 'Location': {'Lng': 4.3898572, 'Lat': 50.7050388}}, 'Metadata': {'PlaceId': 'ChIJOZeR297Rw0cR_y-bZPZvwzQ', 'AddressType': 'head office', 'Timestamp': '2022-08-29T13:55:32.180Z'}, 'FormattedAddress': 'Av. des Dauphins 17, 1410 Waterloo, Belgique', 'MainAddress': True}]  [0715677777]    [{'Metadata': {'Timestamp': '2022-08-29T15:58:45+02:00'}, 'URL': 'https://dynamix-it.be/'}]     Consulting company specialised in JAVA, SAP, DotNet, and son one.   Société de consultance spécialisée en JAVA, SAP, DotNet, etc.   dynamix_java.png    160.0   160.0   15950.0     image/png   //images.ctfassets.net/myqv2p4gx62v/3jrjoVohZ1ooo2VMkum0Ns/1e5bd1ac59dab0126baea85f9156b872/dynamix_java.png    dynamix java.png    160.0   160.0   15950.0     image/png   //images.ctfassets.net/myqv2p4gx62v/3jrjoVohZ1ooo2VMkum0Ns/8e23b45bf77a17026df43cd072d06a52/dynamix_java.png    Dynamix Java    Dynamix Java    Dynamix JAVA    Dynamix JAVA    dynamix-java    dynamix-java    [{'Metadata': {'Timestamp': '2022-08-29T15:58:14+02:00'}, 'URL': 'https://www.facebook.com/DYNAMIXJAVASPRL'}, {'Metadata': {'Timestamp': '2022-08-29T15:58:27+02:00'}, 'URL': 'https://www.linkedin.com/company/dynamixjava/'}]     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
1   contentful-entries_productionv3     _doc    4D2kOg0t4iRD11fzJFaPc8  None    [ Lan-Area ]    4D2kOg0t4iRD11fzJFaPc8  profile     2022-08-25T08:42:32.473Z    [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.744188919708497, 'Lat': 50.3149442697085}, 'Northeast': {'Lng': 4.746886880291502, 'Lat': 50.3176422302915}}, 'coordinates': [4.745529299999999, 50.31632769999999], 'type': 'Point', 'Location': {'Lng': 4.745529299999999, 'Lat': 50.31632769999999}}, 'Metadata': {'PlaceId': 'ChIJm9XAKz6SwUcRs45ovYpmEpc', 'AddressType': 'head office', 'Timestamp': '2022-06-21T14:17:33.655Z'}, 'FormattedAddress': 'Rue d'Ermeton 14, 5537 Anhée, Belgique', 'MainAddress': True}]  [0779822986]    [{'Metadata': {'Timestamp': '2022-08-25T10:42:29+02:00'}, 'URL': 'https://www.lan-area.be/'}]   Platform exclusively focused on local sports competition. Lan-Area has created a central calendar where all local events are announced and a Belgian community space where players can post their teams, courses and successes.     Plateforme exclusivement tournée vers la compétition e-sportive locale . Lan-Area a créé un calendrier central où tous les évènements locaux sont annoncés et un espace communautaire belge où les joueurs peuvent afficher leurs équipes, parcours et succès.  
lan-Aera.jpg    450.0   250.0   21154.0     image/jpeg  //images.ctfassets.net/myqv2p4gx62v/3Gg1nuukov4gaypTawIQs8/346ee9006b0b5e3e33d2fab6ce293a47/lan-Aera.jpg    lan-Aera.jpg    450.0   250.0   21154.0     image/jpeg  //images.ctfassets.net/myqv2p4gx62v/3Gg1nuukov4gaypTawIQs8/7f30ce6782073cf51d16c1f67ef5ee0d/lan-Aera.jpg    lan-Aera    Logo Lan-Aera   Lan-Aera    Lan-Area    lan-aera    lan-area    [{'Metadata': {'Timestamp': '2022-06-21T15:06:34+02:00'}, 'URL': 'https://www.facebook.com/lanarea2020'}, {'Metadata': {'Timestamp': '2022-06-21T15:07:31+02:00'}, 'URL': 'https://twitter.com/LanArea5'}, {'Metadata': {'Timestamp': '2022-06-21T15:59:53+02:00'}, 'URL': 'https://www.twitch.tv/ladh_lanarea'}]   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
2   contentful-entries_productionv3     _doc    6sbdRDRWJXTTtbR1wycE52  None    [1-formation.be]    6sbdRDRWJXTTtbR1wycE52  profile     2022-05-15T11:21:20.388Z    [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.863200770107277, 'Lat': 50.46117977010727}, 'Northeast': {'Lng': 4.865900429892721, 'Lat': 50.46387942989271}}, 'coordinates': [4.864224099999999, 50.462539], 'type': 'Point', 'Location': {'Lng': 4.864224099999999, 'Lat': 50.462539}}, 'Metadata': {'PlaceId': 'ChIJa-SkInKZwUcRsc1Xs-GqwSE', 'AddressType': 'head office', 'Timestamp': '2022-05-07T18:43:01.598Z'}, 'FormattedAddress': 'Rue des Fossés Fleuris 42, 5000 Namur, Belgique', 'MainAddress': True}]   [0891973792]    [{'Metadata': {'Timestamp': '2022-05-07T18:43:01.459Z'}, 'URL': 'http://www.1-formation.be/'}]  Training in IT following based on four subjects: office applications, web and image, web marketing and communication, personnel management and development.     Formations en informatique suivant quatre thématiques: bureautique, web et image, webmarketing et communication, management et développement personnel.     NaN     NaN     NaN     NaN     NaN     NaN     logo-f-1-formation.jpg  350.0   77.0    5569.0  image/jpeg  //images.ctfassets.net/myqv2p4gx62v/7Itx3K16vYyGTHuYUD7TfW/7103d85dbce48d1c3a0535dac76df5c0/logo-f-1-formation.jpg  NaN     logo-f-1-formation.jpg  1-formation.be  1-formation.be  1-formationbe   1-formationbe   [{'Metadata': {'Timestamp': '2022-05-07T18:43:01.459Z'}, 'URL': 'https://twitter.com/1formation_be'}, {'Metadata': {'Timestamp': '2022-05-07T18:43:01.459Z'}, 'URL': 'https://www.facebook.com/1formation'}]    NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
3   contentful-entries_productionv3     _doc    4EuOqP1eQIeka5xHcoq5mQ  None    [1-position.be]     4EuOqP1eQIeka5xHcoq5mQ  profile     2022-05-15T11:21:23.274Z    [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.863200770107277, 'Lat': 50.46117977010727}, 'Northeast': {'Lng': 4.865900429892721, 'Lat': 50.46387942989271}}, 'coordinates': [4.864224099999999, 50.462539], 'type': 'Point', 'Location': {'Lng': 4.864224099999999, 'Lat': 50.462539}}, 'Metadata': {'PlaceId': 'ChIJa-SkInKZwUcRsc1Xs-GqwSE', 'AddressType': 'head office', 'Timestamp': '2022-05-07T18:51:39.745Z'}, 'FormattedAddress': 'Rue des Fossés Fleuris 42, 5000 Namur, Belgique', 'MainAddress': True}]   [0891973792]    []  Communications agency and IT training centre: website creation, professional SEO, the creation of Google Adwords campaigns, copywriting and web content, visual identity creation, communications consulting.   Agence de communication et centre de formation informatique: création de sites web, référencement professionnel, création et gestion de campagnes Google AdWords, copywriting et écriture web, création d'identité visuelle, conseil en communication.  NaN     NaN     NaN     NaN     NaN     NaN     marque-cp52u3ifgt9us27gak951f15p6-1369821343-position.png   169.0   129.0   11128.0     image/png   //images.ctfassets.net/myqv2p4gx62v/2RMVJINCIXiF4O2hZIb6kx/c1aebc77207c1a5ae67af5ebd87b1dd3/marque-cp52u3ifgt9us27gak951f15p6-1369821343-position.png   NaN     marque-cp52u3ifgt9us27gak951f15p6-1369821343-position.png   1-position.be   1-position.be   1-positionbe    1-positionbe    [{'Metadata': {'Timestamp': '2022-05-07T18:51:39.679Z'}, 'URL': 'https://twitter.com/1position'}, {'Metadata': {'Timestamp': '2022-05-07T18:51:39.679Z'}, 'URL': 'https://www.facebook.com/pages/1-positionbe/147447630063'}, {'Metadata': {'Timestamp': '2022-05-07T18:51:39.679Z'}, 'URL': 'https://www.linkedin.com/company/1-position.be'}]     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
4   contentful-entries_productionv3     _doc    1VvYEZncg0lEDL8RzGAvmE  None    [123 Automation Engineering & Development]  1VvYEZncg0lEDL8RzGAvmE  profile     2022-05-15T05:25:51.214Z    [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.456926070107278, 'Lat': 50.53833147010727}, 'Northeast': {'Lng': 4.459625729892722, 'Lat': 50.54103112989272}}, 'coordinates': [4.4582759, 50.5396813], 'type': 'Point', 'Location': {'Lng': 4.4582759, 'Lat': 50.5396813}}, 'Metadata': {'PlaceId': 'EjNSdWUgZGVzIEFydGlzYW5zIDQsIDYyMTAgTGVzIEJvbnMgVmlsbGVycywgQmVsZ2lxdWUiGhIYChQKEgn75Aq3dyzCRxFEh7hEj1NdPBAE', 'AddressType': 'head office', 'Timestamp': '2022-05-07T15:17:32.918Z'}, 'FormattedAddress': 'Rue des Artisans 4, 6210 Les Bons Villers, Belgique', 'MainAddress': True}]    [0820888531]    [{'Metadata': {'Timestamp': '2022-05-07T15:17:32.867Z'}, 'URL': 'http://www.123automation.be/'}]    NaN     Automation et robotique industrielle: étude, conception, développement, intégration et maintenance de solutions automatisées visant l’amélioration de la productivité dans les processus de fabrication quels qu’ils soient.    NaN     NaN     NaN     NaN     NaN     NaN     123automation.png   319.0   111.0   5802.0  image/png   //images.ctfassets.net/myqv2p4gx62v/6uY3Y6EDfICh8wdp4XNK7Z/082273035f7a600ec34098b09ab4fee9/123automation.png   NaN     123automation.png   123 Automation Engineering & Development    123 Automation Engineering & Development    123-automation  123-automation  []  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
5360    contentful-entries_productionv3     _doc    1AbDfyZ4rHL18Bw6aiJKSA  None    [École Centrale des Arts et Métiers - HE Vinci]     1AbDfyZ4rHL18Bw6aiJKSA  profile     2022-05-15T11:43:23.005Z    [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.452325870107279, 'Lat': 50.84853592010727}, 'Northeast': {'Lng': 4.455025529892723, 'Lat': 50.85123557989272}}, 'coordinates': [4.4538028, 50.8499896], 'type': 'Point', 'Location': {'Lng': 4.4538028, 'Lat': 50.8499896}}, 'Metadata': {'PlaceId': 'ChIJwdgtpYbcw0cRfjW1nUhDNk8', 'AddressType': 'head office', 'Timestamp': '2022-05-07T15:44:19.720Z'}, 'FormattedAddress': 'Prom. de l'Alma 50, 1200 Woluwe-Saint-Lambert, Belgique', 'MainAddress': True}]     [0459279954, 0409454123]    [{'Metadata': {'Timestamp': '2022-05-07T15:44:19.660Z'}, 'URL': 'http://www.ecam.be/'}]     NaN     L'ECAM est un Institut Supérieur Industriel ayant pour objet la formation de Master en sciences industrielles dans une des spécialités suivantes: ​automatisation, construction, électromécanique, électronique, géomètre, informatique, business analyst (alternance).     NaN     NaN     NaN     NaN     NaN     NaN     ecam.jpg    512.0   512.0   93657.0     image/jpeg  //images.ctfassets.net/myqv2p4gx62v/4e2oSTcbXRABuyibUwgs95/4e5d8f540ccc67065a94eb528418ddd7/ecam.jpg    NaN     ecam.jpg    École Centrale des Arts et Métiers - HE Vinci   École Centrale des Arts et Métiers - HE Vinci   ecole-centrale-des-arts-et-metiers  ecole-centrale-des-arts-et-metiers  []  ECAM    ECAM    NaN     NaN     NaN     NaN     NaN     NaN
5361    contentful-entries_productionv3     _doc    5vp8xZpO6CucXtOmc1H8yR  None    [École communale fondamentale de Seneffe]   5vp8xZpO6CucXtOmc1H8yR  profile     2022-05-15T09:12:19.246Z    [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.252977370107278, 'Lat': 50.52898217010728}, 'Northeast': {'Lng': 4.255677029892722, 'Lat': 50.53168182989272}}, 'coordinates': [4.2543333, 50.5303456], 'type': 'Point', 'Location': {'Lng': 4.2543333, 'Lat': 50.5303456}}, 'Metadata': {'PlaceId': 'ChIJt1KItgg0wkcR6ekUYWMbdDg', 'AddressType': 'head office', 'Timestamp': '2022-05-07T18:58:11.863Z'}, 'FormattedAddress': 'Rue de Buisseret 19, 7180 Seneffe, Belgique', 'MainAddress': True}]     NaN     []  NaN     Ecole fondamentale.     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     École communale fondamentale de Seneffe     École communale fondamentale de Seneffe     ecole-communale-de-seneffe  ecole-communale-de-seneffe  []  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
[...]

This dataframe has 5365 rows × 40 columns. You can inspect the initial JSON response and dissect it further; maybe you need more, less, or other information from it.
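
As one illustration of dissecting the response, the sketch below pulls a few fields out of a single hit. The key names follow the flattened column names shown above (for example _source.fields.title.fr); the helper name summarize_hit and the choice of fields are assumptions, and the sample hit is abbreviated from the first row of the output.

```python
def _french(fields, key):
    """Return the 'fr' value stored under fields[key], or None if it is missing."""
    return (fields.get(key) or {}).get("fr")


def summarize_hit(hit):
    """Reduce one raw hit to its title, first website and first address."""
    fields = (hit.get("_source") or {}).get("fields") or {}
    sites = _french(fields, "urlsWebSite") or []
    addresses = _french(fields, "addresses") or []
    return {
        "title": _french(fields, "title"),
        "website": sites[0].get("URL") if sites else None,
        "address": addresses[0].get("FormattedAddress") if addresses else None,
    }


# Abbreviated sample, taken from the first row of the dataframe shown above
sample = {
    "_source": {
        "fields": {
            "title": {"fr": "Dynamix JAVA"},
            "urlsWebSite": {"fr": [{"URL": "https://dynamix-it.be/"}]},
            "addresses": {"fr": [
                {"FormattedAddress": "Av. des Dauphins 17, 1410 Waterloo, Belgique"}
            ]},
        }
    }
}
print(summarize_hit(sample))
```

Applied to every element of r.json()['hits']['hits'], this would give a list of small dictionaries that pd.DataFrame can turn into a narrower table than the 40-column one.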

Requests docs: https://requests.readthedocs.io/en/latest/

Pandas relevant documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html
