I'm trying to get information from the last link in a chain of links on this website.
The problem is that my list of elements comes back empty, even though find_element
(singular) works. Here is my code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import pandas as pd
options = Options()
# Creating our dictionary
all_services = pd.DataFrame(columns=['Profil', 'Motif', 'Questions', 'Reponses'])
path = "C:/Users/Al4D1N/Documents/ChromeDriver_webscraping/chromedriver.exe"
driver = webdriver.Chrome(options=options, executable_path=path)
# we are going to visit all profils procedures
# for profil in ['particuliers','professionnels','associations']:
# driver.get("https://www.demarches.interieur.gouv.fr/{profil}/accueil-{profil}")
driver.get("https://www.demarches.interieur.gouv.fr/associations/accueil-associations")
# Get all first elements in bodyFiche id which contains all procedures for associations profile
list_of_services = driver.find_elements_by_class_name("liste-sous-menu")
for service in list_of_services:
    # In each element, select the tags
    # atags = service.find_elements_by_css_selector('a')
    atags = service.find_elements_by_xpath("//li[starts-with(@id,'summary')]")
    for atag in atags:
        # In each atag, select the href
        href = atag.get_attribute('href')
        print(href)
        # Open a new window
        driver.execute_script("window.open('');")
        # Switch to the new window and open URL
        driver.switch_to.window(driver.window_handles[1])
        driver.get(href)
        # we are now on the second link
        # Get all links in the iterated element
        list_of_services2 = driver.find_elements_by_class_name("content")
        for service2 in list_of_services2:
            atags2 = service2.find_elements_by_css_selector('a')
            for atag2 in atags2:
                href = atag2.get_attribute('href')
                driver.execute_script("window.open('');")
                driver.switch_to.window(driver.window_handles[1])
                driver.get(href)
                # we are now on the third link
                # Get all links in the iterated element
                list_of_services3 = driver.find_elements_by_class_name("content")
                for service3 in list_of_services3:
                    atags3 = service3.find_elements_by_css_selector('a')
                    for atag3 in atags3:
                        href = atag3.get_attribute('href')
                        driver.execute_script("window.open('');")
                        driver.switch_to.window(driver.window_handles[1])
                        driver.get(href)
                        # Get Q/A section
                        list_of_services4 = driver.find_elements_by_class_name("QuestionReponse")
                        for service4 in list_of_services4:
                            atags4 = service4.find_elements_by_css_selector('a')
                            for atag4 in atags4:
                                href = atag4.get_attribute('href')
                                # We store our questions (the link text)
                                questions = atag4.text
                                driver.execute_script("window.open('');")
                                driver.switch_to.window(driver.window_handles[1])
                                driver.get(href)
                                # Get data
                                reponses = driver.find_elements_by_class_name("texte")
                                all_services = all_services.append({'Questions': questions,
                                                                    'Reponses': reponses}, ignore_index=True)
                                driver.close()
                                driver.switch_to.window(driver.window_handles[0])
                        driver.close()
                        driver.switch_to.window(driver.window_handles[0])
                driver.close()
                driver.switch_to.window(driver.window_handles[0])
        # Close the tab with URL B
        driver.close()
        # Switch back to the first tab with URL A
        driver.switch_to.window(driver.window_handles[0])
driver.close()
all_services.to_excel('Limit_Testing.xlsx', index=False)
I'm not sure whether my method works. The idea is to follow the links like a tree, and when I reach a leaf I extract the information I want. Correct me if I'm wrong. I don't know why my list_of_services
is an empty list, even though the class name is correct.
What has worked for me in the past is adding a wait. The reasoning: right after the GET request, your code immediately looks for a WebElement with class='liste-sous-menu' without waiting for the driver to finish loading the page.
The list comes back empty because there is nothing to return yet. My suggestion is therefore the following:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import pandas as pd
## Import sleep
from time import sleep
options = Options()
path = "C:/Users/Al4D1N/Documents/ChromeDriver_webscraping/chromedriver.exe"
driver = webdriver.Chrome(options=options, executable_path=path)
driver.get("https://www.demarches.interieur.gouv.fr/associations/accueil-associations")
# Add some waiting time here; how much you need depends on the speed of your computer/driver
sleep(0.5)
list_of_services = driver.find_elements_by_class_name("liste-sous-menu")
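A fixed sleep is a guess: too short and the list is still empty, too long and every page load pays for it. Selenium has WebDriverWait for this, which polls until a condition holds. As a rough sketch of the same idea in plain Python, you can retry the lookup until it returns something. Here poll_until is a hypothetical helper, not part of Selenium, and the timeout and interval values are assumptions to tune:

```python
import time


def poll_until(fetch, timeout=10.0, interval=0.25):
    """Call fetch() repeatedly until it returns a non-empty result.

    Hypothetical helper, not part of Selenium. Returns the first truthy
    result, or the last (empty) result once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    result = fetch()
    while not result and time.monotonic() < deadline:
        time.sleep(interval)
        result = fetch()
    return result


# Usage with the driver created above (not executed here):
# list_of_services = poll_until(
#     lambda: driver.find_elements_by_class_name("liste-sous-menu"))
```

If the result is still empty after the timeout, the class name or the page structure is more likely the problem than the loading time.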
With the wait added, your code now returns a non-empty list. However, it does not return the links themselves, only the ul (unordered list) elements that contain them, so you need to dig one level deeper once you have each ul. That means adding the following:
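list_of_services = driver.find_elements_by_class_name("liste-sous-menu")
# find_elements_* returns a plain Python list, so query each ul separately
for ul in list_of_services:
    # Now you get the li elements (each row) of this ul
    services = ul.find_elements_by_tag_name('li')
    # Now you iterate over the services object (list of 'li' elements)
    for li in services:
        print(li.text)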
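Keep in mind that the href attribute lives on the a tag, not on the li, so calling get_attribute('href') on an li returns None. Below is a small sketch of pulling the links out of the ul elements, using the same find_elements_by_* style as the code above. collect_hrefs is a hypothetical helper, and it assumes each link is an a nested inside an li, which you should check against the actual page:

```python
def collect_hrefs(ul_elements):
    """Return the href of every <a> nested in an <li> of the given elements.

    Hypothetical helper. `ul_elements` is expected to be the list returned
    by driver.find_elements_by_class_name("liste-sous-menu").
    """
    hrefs = []
    for ul in ul_elements:
        # Assumption: the links are <a> tags nested inside the <li> rows.
        for a in ul.find_elements_by_css_selector("li a"):
            href = a.get_attribute("href")
            if href:  # skip anchors that have no href
                hrefs.append(href)
    return hrefs
```

Collecting the URLs as plain strings first means the later driver.get(href) calls no longer depend on elements from the page you started on.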
I hope this solves your problem.