Selenium with Python / Navigating to next page
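I am having a hard time using Selenium under Python to navigate through the 448 consecutive pages of https://www.digitalwallonia.be/fr/cartographie/. I have tried (too) many things without a satisfactory result (which is why it is hard to post the relevant code).
I would like to see your solutions. Apologies if the question is poorly formulated: this is my first time.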
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
browser = webdriver.Firefox()
browser.implicitly_wait(20)
browser.get('https://www.digitalwallonia.be/fr/cartographie')
browser.find_element("xpath",'//*[@id="axeptio_btn_acceptAll"]').click()
browser.find_element("xpath",'//*[@id="axeptio_btn_configure"]').click()
browser.find_element("xpath",'//*[@id="axeptio_btn_acceptAllAndNext"]').click()
WebDriverWait(browser, 1000).until(EC.element_to_be_clickable((By.CLASS_NAME,'next'))).click()
input('Press ENTER to close the automated browser')
browser.quit()
I get the following error: selenium.common.exceptions.ElementNotInteractableException: Message: Element could not be scrolled into view
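I would advise on a few points here:
1. Use WebDriverWait rather than implicitly_wait: implicitly_wait only waits for an element to be present, whereas with WebDriverWait you can wait for more mature element states, i.e. visible, clickable, etc.
2. Do not mix WebDriverWait and implicitly_wait, as that can lead to unpredictable wait times.
3. The next page button is at the bottom of the page, so you need to scroll down before you can click the pagination button.
import time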
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
options = Options()
options.add_argument("start-maximized")
# Raw string so the backslashes in the Windows path are not treated as escape sequences
webdriver_service = Service(r'C:\webdrivers\chromedriver.exe')
driver = webdriver.Chrome(service=webdriver_service, options=options)
url = "https://www.digitalwallonia.be/fr/cartographie"
actions = ActionChains(driver)
wait = WebDriverWait(driver, 10)
driver.get(url)
wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="axeptio_btn_acceptAll"]'))).click()
wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="axeptio_btn_configure"]'))).click()
wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="axeptio_btn_acceptAllAndNext"]'))).click()
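# The pagination bar sits at the bottom of the page: scroll down before clicking
driver.execute_script("window.scrollBy(0, arguments[0]);", 800)
time.sleep(0.5)  # give the scroll a moment to settle
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.next a'))).click()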
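To walk through all the pages rather than click once, the scroll-then-click step can be repeated in a loop. Here is a minimal sketch that has not been run against the live site. `click_through_pages`, `max_pages` and `pause` are hypothetical names. It assumes the cookie banner has already been dismissed and that the `.next a` link is no longer found on the last page; the `max_pages` cap is a safety net in case the link stays in the DOM. The locator string "css selector" is the value of By.CSS_SELECTOR.

```python
import time

NEXT_LINK_CSS = ".next a"  # selector taken from the answer above


def click_through_pages(driver, max_pages=448, pause=0.5):
    """Click the 'next' link until it is gone or max_pages is reached.

    Returns the number of pages visited, counting the first one.
    """
    visited = 1
    while visited < max_pages:
        # Look the link up again on every page to avoid stale references
        links = driver.find_elements("css selector", NEXT_LINK_CSS)
        if not links:
            break  # assumption: no 'next' link means this is the last page
        # Bring the link into the viewport before clicking it
        driver.execute_script(
            "arguments[0].scrollIntoView({block: 'center'});", links[0]
        )
        time.sleep(pause)  # let the scroll settle
        links[0].click()
        visited += 1
    return visited
```

With the `driver` created above, this would be called as `click_through_pages(driver)` once the cookie dialogs are closed. Any scraping of the current page would go inside the loop, before the click.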
Every time you click to go to the next page (the 'Suivant' button), the JavaScript in the page makes a POST request to an API endpoint, with headers and a payload. The headers, payload and API endpoint can be found in the browser's Dev Tools, Network tab (select XHR calls only). So we can try to scrape the API URL with requests and avoid the overhead of selenium/chromedriver. Here is one way of getting that data:
import requests
import pandas as pd
big_df = pd.DataFrame()
url = 'https://search.production.ribo.digitalwallonia.be/contentful-entries_production/_search/template'
headers = {
'content-type': 'application/json',
'Origin': 'https://www.digitalwallonia.be',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
s = requests.Session()
s.headers.update(headers)
counter = 0
while True:
    # "from" is the offset of the first item to return, "size" the batch size
    payload = '{"id":"filter-profile-search-template-fr-v3","params":{"categoriesSlugList":[],"programsSlugList":[],"from":' + str(counter) + ',"regionsList":[],"size":100}}'
    r = s.post(url, data=payload)
    big_df = pd.concat([big_df, pd.json_normalize(r.json()['hits']['hits'])], axis=0, ignore_index=True)
    counter = counter + 100
    # 448 pages of 12 items each: stop once the offset is past the last item
    if counter > 448*12:
        break
print(big_df)
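As a side note, the payload string and the stop condition can also be written without string concatenation. The sketch below is only an illustration: `build_payload` and `batch_offsets` are hypothetical helper names, while the template id and parameter names are copied from the payload above. The JSON it produces has different whitespace from the hand-written string but parses to the same object.

```python
import json

TEMPLATE_ID = "filter-profile-search-template-fr-v3"  # copied from the payload above


def build_payload(offset, size=100):
    """Return the JSON body for one search request starting at `offset`."""
    return json.dumps({
        "id": TEMPLATE_ID,
        "params": {
            "categoriesSlugList": [],
            "programsSlugList": [],
            "from": offset,
            "regionsList": [],
            "size": size,
        },
    })


def batch_offsets(pages=448, per_page=12, size=100):
    """Offsets needed to cover pages * per_page items in batches of `size`."""
    return list(range(0, pages * per_page, size))


offsets = batch_offsets()
print(len(offsets), offsets[0], offsets[-1])
print(build_payload(offsets[1]))
```

For 448 pages of 12 items this gives 54 offsets, from 0 to 5300, the same requests the `while` loop sends. In the script above it would be used as `for offset in batch_offsets(): r = s.post(url, data=build_payload(offset))`.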
We get 100 items at a time (the actual page gets 12 at a time). After about a minute, you should see the following dataframe in the terminal:
_index _type _id _score sort _source.sys.id _source.sys.contentType.sys.id _source.sys.updatedAt _source.fields.addresses.fr _source.fields.belgianEnterprisesNumbers.fr _source.fields.urlsWebSite.fr _source.fields.shortDescription.en _source.fields.shortDescription.fr _source.fields.logoAssetImage.fr.file.en.fileName _source.fields.logoAssetImage.fr.file.en.details.image.width _source.fields.logoAssetImage.fr.file.en.details.image.height _source.fields.logoAssetImage.fr.file.en.details.size _source.fields.logoAssetImage.fr.file.en.contentType _source.fields.logoAssetImage.fr.file.en.url _source.fields.logoAssetImage.fr.file.fr.fileName _source.fields.logoAssetImage.fr.file.fr.details.image.width _source.fields.logoAssetImage.fr.file.fr.details.image.height _source.fields.logoAssetImage.fr.file.fr.details.size _source.fields.logoAssetImage.fr.file.fr.contentType _source.fields.logoAssetImage.fr.file.fr.url _source.fields.logoAssetImage.fr.title.en _source.fields.logoAssetImage.fr.title.fr _source.fields.title.en _source.fields.title.fr _source.fields.slug.en _source.fields.slug.fr _source.fields.urlsSocialNetwork.fr _source.fields.shortTitle.en _source.fields.shortTitle.fr _source.fields.founders.fr _source.fields.mainNaceCode.fr _source.fields.staffing.fr _source.fields.logoAssetImage.fr _source.fields.partnersAdditionalDescriptions.fr _source.fields.incubators.fr
0 contentful-entries_productionv3 _doc 3O1t8sTHhj5ZGrmGKtHI6y None [ Dynamix JAVA] 3O1t8sTHhj5ZGrmGKtHI6y profile 2022-09-01T14:36:06.899Z [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.388591169708497, 'Lat': 50.7035958197085}, 'Northeast': {'Lng': 4.391289130291502, 'Lat': 50.7062937802915}}, 'coordinates': [4.3898572, 50.7050388], 'type': 'Point', 'Location': {'Lng': 4.3898572, 'Lat': 50.7050388}}, 'Metadata': {'PlaceId': 'ChIJOZeR297Rw0cR_y-bZPZvwzQ', 'AddressType': 'head office', 'Timestamp': '2022-08-29T13:55:32.180Z'}, 'FormattedAddress': 'Av. des Dauphins 17, 1410 Waterloo, Belgique', 'MainAddress': True}] [0715677777] [{'Metadata': {'Timestamp': '2022-08-29T15:58:45+02:00'}, 'URL': 'https://dynamix-it.be/'}] Consulting company specialised in JAVA, SAP, DotNet, and son one. Société de consultance spécialisée en JAVA, SAP, DotNet, etc. dynamix_java.png 160.0 160.0 15950.0 image/png //images.ctfassets.net/myqv2p4gx62v/3jrjoVohZ1ooo2VMkum0Ns/1e5bd1ac59dab0126baea85f9156b872/dynamix_java.png dynamix java.png 160.0 160.0 15950.0 image/png //images.ctfassets.net/myqv2p4gx62v/3jrjoVohZ1ooo2VMkum0Ns/8e23b45bf77a17026df43cd072d06a52/dynamix_java.png Dynamix Java Dynamix Java Dynamix JAVA Dynamix JAVA dynamix-java dynamix-java [{'Metadata': {'Timestamp': '2022-08-29T15:58:14+02:00'}, 'URL': 'https://www.facebook.com/DYNAMIXJAVASPRL'}, {'Metadata': {'Timestamp': '2022-08-29T15:58:27+02:00'}, 'URL': 'https://www.linkedin.com/company/dynamixjava/'}] NaN NaN NaN NaN NaN NaN NaN NaN
1 contentful-entries_productionv3 _doc 4D2kOg0t4iRD11fzJFaPc8 None [ Lan-Area ] 4D2kOg0t4iRD11fzJFaPc8 profile 2022-08-25T08:42:32.473Z [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.744188919708497, 'Lat': 50.3149442697085}, 'Northeast': {'Lng': 4.746886880291502, 'Lat': 50.3176422302915}}, 'coordinates': [4.745529299999999, 50.31632769999999], 'type': 'Point', 'Location': {'Lng': 4.745529299999999, 'Lat': 50.31632769999999}}, 'Metadata': {'PlaceId': 'ChIJm9XAKz6SwUcRs45ovYpmEpc', 'AddressType': 'head office', 'Timestamp': '2022-06-21T14:17:33.655Z'}, 'FormattedAddress': 'Rue d'Ermeton 14, 5537 Anhée, Belgique', 'MainAddress': True}] [0779822986] [{'Metadata': {'Timestamp': '2022-08-25T10:42:29+02:00'}, 'URL': 'https://www.lan-area.be/'}] Platform exclusively focused on local sports competition. Lan-Area has created a central calendar where all local events are announced and a Belgian community space where players can post their teams, courses and successes. Plateforme exclusivement tournée vers la compétition e-sportive locale . Lan-Area a créé un calendrier central où tous les évènements locaux sont annoncés et un espace communautaire belge où les joueurs peuvent afficher leurs équipes, parcours et succès. lan-Aera.jpg 450.0 250.0 21154.0 image/jpeg //images.ctfassets.net/myqv2p4gx62v/3Gg1nuukov4gaypTawIQs8/346ee9006b0b5e3e33d2fab6ce293a47/lan-Aera.jpg lan-Aera.jpg 450.0 250.0 21154.0 image/jpeg //images.ctfassets.net/myqv2p4gx62v/3Gg1nuukov4gaypTawIQs8/7f30ce6782073cf51d16c1f67ef5ee0d/lan-Aera.jpg lan-Aera Logo Lan-Aera Lan-Aera Lan-Area lan-aera lan-area [{'Metadata': {'Timestamp': '2022-06-21T15:06:34+02:00'}, 'URL': 'https://www.facebook.com/lanarea2020'}, {'Metadata': {'Timestamp': '2022-06-21T15:07:31+02:00'}, 'URL': 'https://twitter.com/LanArea5'}, {'Metadata': {'Timestamp': '2022-06-21T15:59:53+02:00'}, 'URL': 'https://www.twitch.tv/ladh_lanarea'}] NaN NaN NaN NaN NaN NaN NaN NaN
2 contentful-entries_productionv3 _doc 6sbdRDRWJXTTtbR1wycE52 None [1-formation.be] 6sbdRDRWJXTTtbR1wycE52 profile 2022-05-15T11:21:20.388Z [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.863200770107277, 'Lat': 50.46117977010727}, 'Northeast': {'Lng': 4.865900429892721, 'Lat': 50.46387942989271}}, 'coordinates': [4.864224099999999, 50.462539], 'type': 'Point', 'Location': {'Lng': 4.864224099999999, 'Lat': 50.462539}}, 'Metadata': {'PlaceId': 'ChIJa-SkInKZwUcRsc1Xs-GqwSE', 'AddressType': 'head office', 'Timestamp': '2022-05-07T18:43:01.598Z'}, 'FormattedAddress': 'Rue des Fossés Fleuris 42, 5000 Namur, Belgique', 'MainAddress': True}] [0891973792] [{'Metadata': {'Timestamp': '2022-05-07T18:43:01.459Z'}, 'URL': 'http://www.1-formation.be/'}] Training in IT following based on four subjects: office applications, web and image, web marketing and communication, personnel management and development. Formations en informatique suivant quatre thématiques: bureautique, web et image, webmarketing et communication, management et développement personnel. NaN NaN NaN NaN NaN NaN logo-f-1-formation.jpg 350.0 77.0 5569.0 image/jpeg //images.ctfassets.net/myqv2p4gx62v/7Itx3K16vYyGTHuYUD7TfW/7103d85dbce48d1c3a0535dac76df5c0/logo-f-1-formation.jpg NaN logo-f-1-formation.jpg 1-formation.be 1-formation.be 1-formationbe 1-formationbe [{'Metadata': {'Timestamp': '2022-05-07T18:43:01.459Z'}, 'URL': 'https://twitter.com/1formation_be'}, {'Metadata': {'Timestamp': '2022-05-07T18:43:01.459Z'}, 'URL': 'https://www.facebook.com/1formation'}] NaN NaN NaN NaN NaN NaN NaN NaN
3 contentful-entries_productionv3 _doc 4EuOqP1eQIeka5xHcoq5mQ None [1-position.be] 4EuOqP1eQIeka5xHcoq5mQ profile 2022-05-15T11:21:23.274Z [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.863200770107277, 'Lat': 50.46117977010727}, 'Northeast': {'Lng': 4.865900429892721, 'Lat': 50.46387942989271}}, 'coordinates': [4.864224099999999, 50.462539], 'type': 'Point', 'Location': {'Lng': 4.864224099999999, 'Lat': 50.462539}}, 'Metadata': {'PlaceId': 'ChIJa-SkInKZwUcRsc1Xs-GqwSE', 'AddressType': 'head office', 'Timestamp': '2022-05-07T18:51:39.745Z'}, 'FormattedAddress': 'Rue des Fossés Fleuris 42, 5000 Namur, Belgique', 'MainAddress': True}] [0891973792] [] Communications agency and IT training centre: website creation, professional SEO, the creation of Google Adwords campaigns, copywriting and web content, visual identity creation, communications consulting. Agence de communication et centre de formation informatique: création de sites web, référencement professionnel, création et gestion de campagnes Google AdWords, copywriting et écriture web, création d'identité visuelle, conseil en communication. NaN NaN NaN NaN NaN NaN marque-cp52u3ifgt9us27gak951f15p6-1369821343-position.png 169.0 129.0 11128.0 image/png //images.ctfassets.net/myqv2p4gx62v/2RMVJINCIXiF4O2hZIb6kx/c1aebc77207c1a5ae67af5ebd87b1dd3/marque-cp52u3ifgt9us27gak951f15p6-1369821343-position.png NaN marque-cp52u3ifgt9us27gak951f15p6-1369821343-position.png 1-position.be 1-position.be 1-positionbe 1-positionbe [{'Metadata': {'Timestamp': '2022-05-07T18:51:39.679Z'}, 'URL': 'https://twitter.com/1position'}, {'Metadata': {'Timestamp': '2022-05-07T18:51:39.679Z'}, 'URL': 'https://www.facebook.com/pages/1-positionbe/147447630063'}, {'Metadata': {'Timestamp': '2022-05-07T18:51:39.679Z'}, 'URL': 'https://www.linkedin.com/company/1-position.be'}] NaN NaN NaN NaN NaN NaN NaN NaN
4 contentful-entries_productionv3 _doc 1VvYEZncg0lEDL8RzGAvmE None [123 Automation Engineering & Development] 1VvYEZncg0lEDL8RzGAvmE profile 2022-05-15T05:25:51.214Z [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.456926070107278, 'Lat': 50.53833147010727}, 'Northeast': {'Lng': 4.459625729892722, 'Lat': 50.54103112989272}}, 'coordinates': [4.4582759, 50.5396813], 'type': 'Point', 'Location': {'Lng': 4.4582759, 'Lat': 50.5396813}}, 'Metadata': {'PlaceId': 'EjNSdWUgZGVzIEFydGlzYW5zIDQsIDYyMTAgTGVzIEJvbnMgVmlsbGVycywgQmVsZ2lxdWUiGhIYChQKEgn75Aq3dyzCRxFEh7hEj1NdPBAE', 'AddressType': 'head office', 'Timestamp': '2022-05-07T15:17:32.918Z'}, 'FormattedAddress': 'Rue des Artisans 4, 6210 Les Bons Villers, Belgique', 'MainAddress': True}] [0820888531] [{'Metadata': {'Timestamp': '2022-05-07T15:17:32.867Z'}, 'URL': 'http://www.123automation.be/'}] NaN Automation et robotique industrielle: étude, conception, développement, intégration et maintenance de solutions automatisées visant l’amélioration de la productivité dans les processus de fabrication quels qu’ils soient. NaN NaN NaN NaN NaN NaN 123automation.png 319.0 111.0 5802.0 image/png //images.ctfassets.net/myqv2p4gx62v/6uY3Y6EDfICh8wdp4XNK7Z/082273035f7a600ec34098b09ab4fee9/123automation.png NaN 123automation.png 123 Automation Engineering & Development 123 Automation Engineering & Development 123-automation 123-automation [] NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5360 contentful-entries_productionv3 _doc 1AbDfyZ4rHL18Bw6aiJKSA None [École Centrale des Arts et Métiers - HE Vinci] 1AbDfyZ4rHL18Bw6aiJKSA profile 2022-05-15T11:43:23.005Z [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.452325870107279, 'Lat': 50.84853592010727}, 'Northeast': {'Lng': 4.455025529892723, 'Lat': 50.85123557989272}}, 'coordinates': [4.4538028, 50.8499896], 'type': 'Point', 'Location': {'Lng': 4.4538028, 'Lat': 50.8499896}}, 'Metadata': {'PlaceId': 'ChIJwdgtpYbcw0cRfjW1nUhDNk8', 'AddressType': 'head office', 'Timestamp': '2022-05-07T15:44:19.720Z'}, 'FormattedAddress': 'Prom. de l'Alma 50, 1200 Woluwe-Saint-Lambert, Belgique', 'MainAddress': True}] [0459279954, 0409454123] [{'Metadata': {'Timestamp': '2022-05-07T15:44:19.660Z'}, 'URL': 'http://www.ecam.be/'}] NaN L'ECAM est un Institut Supérieur Industriel ayant pour objet la formation de Master en sciences industrielles dans une des spécialités suivantes: automatisation, construction, électromécanique, électronique, géomètre, informatique, business analyst (alternance). NaN NaN NaN NaN NaN NaN ecam.jpg 512.0 512.0 93657.0 image/jpeg //images.ctfassets.net/myqv2p4gx62v/4e2oSTcbXRABuyibUwgs95/4e5d8f540ccc67065a94eb528418ddd7/ecam.jpg NaN ecam.jpg École Centrale des Arts et Métiers - HE Vinci École Centrale des Arts et Métiers - HE Vinci ecole-centrale-des-arts-et-metiers ecole-centrale-des-arts-et-metiers [] ECAM ECAM NaN NaN NaN NaN NaN NaN
5361 contentful-entries_productionv3 _doc 5vp8xZpO6CucXtOmc1H8yR None [École communale fondamentale de Seneffe] 5vp8xZpO6CucXtOmc1H8yR profile 2022-05-15T09:12:19.246Z [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.252977370107278, 'Lat': 50.52898217010728}, 'Northeast': {'Lng': 4.255677029892722, 'Lat': 50.53168182989272}}, 'coordinates': [4.2543333, 50.5303456], 'type': 'Point', 'Location': {'Lng': 4.2543333, 'Lat': 50.5303456}}, 'Metadata': {'PlaceId': 'ChIJt1KItgg0wkcR6ekUYWMbdDg', 'AddressType': 'head office', 'Timestamp': '2022-05-07T18:58:11.863Z'}, 'FormattedAddress': 'Rue de Buisseret 19, 7180 Seneffe, Belgique', 'MainAddress': True}] NaN [] NaN Ecole fondamentale. NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN École communale fondamentale de Seneffe École communale fondamentale de Seneffe ecole-communale-de-seneffe ecole-communale-de-seneffe [] NaN NaN NaN NaN NaN NaN NaN NaN
[...]
This dataframe has 5365 rows × 40 columns. You can inspect the initial JSON response and dissect it further, in case you need more/less/other information.
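If only a few fields are needed, each hit can be reduced to a small dictionary before building the dataframe. The following is a rough sketch: `summarize_hit` is a hypothetical helper, and the nesting it relies on is inferred from the dotted column names in the output above (for example `_source.fields.title.fr`), not from any API documentation. The sample hit is a trimmed-down copy of the first row.

```python
def summarize_hit(hit):
    """Pick the id, French title, main address and first website from one hit."""
    fields = hit.get("_source", {}).get("fields", {})
    addresses = fields.get("addresses", {}).get("fr") or []
    websites = fields.get("urlsWebSite", {}).get("fr") or []
    # Prefer the address flagged as the main one, if there is any
    main = next((a for a in addresses if a.get("MainAddress")), None)
    return {
        "id": hit.get("_id"),
        "title": fields.get("title", {}).get("fr"),
        "address": main.get("FormattedAddress") if main else None,
        "website": websites[0].get("URL") if websites else None,
    }


# Trimmed-down version of the first row shown above
sample_hit = {
    "_id": "3O1t8sTHhj5ZGrmGKtHI6y",
    "_source": {
        "fields": {
            "title": {"en": "Dynamix JAVA", "fr": "Dynamix JAVA"},
            "addresses": {"fr": [{
                "FormattedAddress": "Av. des Dauphins 17, 1410 Waterloo, Belgique",
                "MainAddress": True,
            }]},
            "urlsWebSite": {"fr": [{"URL": "https://dynamix-it.be/"}]},
        }
    },
}

print(summarize_hit(sample_hit))
```

In the loop above, `[summarize_hit(h) for h in r.json()['hits']['hits']]` would give a list of such dictionaries, which `pd.DataFrame` accepts directly.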
Requests documentation: https://requests.readthedocs.io/en/latest/
Relevant pandas documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html