
Scraping data from a dynamic table containing multiple drop-down options using Selenium Python

I am very new to web scraping and am currently attempting to scrape information about all water utilities from this site, which offers a drop-down of different regions, and to output the results to a CSV file.

The url of this site does not change; it stays the same whenever drop-down options are selected. My code so far (influenced by this stackoverflow post) is able to select the first region from the options, but it doesn't seem to go any further. I have the following so far:

from bs4 import BeautifulSoup
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import Select


url = 'https://database.ib-net.org/search_utilities?type=2'
browser = webdriver.Chrome()
browser.get(url)
time.sleep(4)
print("Retrieving the site...")

# All regions available
regions = ['Africa', 'East Asia and Pacific', 'Europe and Central Asia', 'Latin America (including USA and Canada', 'Middle East and Northern Africa', 'South Asia']

for region in regions:
   print("Starting output for the region: " + region)

   # Select all options from drop down menu
   selectOption = Select(browser.find_element_by_id('MainContent_ddRegion'))

   print("Now constructing output for: " + region)

   # Select table and wait for data to populate
   selectOption.select_by_visible_text(region)

   time.sleep(4)

   # Select the table containing the data and select all rows
   table = browser.find_element_by_xpath("//*[@id='MainContent_gvUtilities']")
   print(table)
   table_rows = table.find_elements_by_xpath(".//tr")

   # Create a list for each column in the table with each column number
   utility_name = [] #0
   country = [] #2
   city = []    #3
   population = [] #4

   for row in table_rows:
      column_element = row.find_elements_by_xpath(".//td")
      utility_name.append(column_element[0])
      country.append(column_element[2])
      city.append(column_element[3])
      population.append(column_element[4])

   #Create a dictionary of all utilities for each region
   dict_output = {
       "Utility Name": utility_name,
       "Country": country,
       "City": city, 
       "Population": population,
   }

   df = pd.DataFrame.from_dict(dict_output)
   df.to_csv(region, index = False)


browser.close()
browser.quit()

I get this error every time:

  File "/home/ken/.local/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
  (Session info: chrome=91.0.4472.77)
  (Driver info: chromedriver=2.26.436382 (70eb799287ce4c2208441fc057053a5b07ceabac),platform=Linux 5.8.0-59-generic x86_64)

I am stuck here and can't seem to figure out what I am doing wrong, or what I am supposed to do to resolve this error. Any help or pointers will be highly appreciated!

Thanks!!

I can't reproduce your error, but I ran your code and here are a few things:

  1. You have a typo in your regions list: 'Latin America (including USA and Canada' should be 'Latin America (including USA and Canada)'
  2. Have you considered using pandas to parse the table? pd.read_html uses an HTML parser (lxml, or BeautifulSoup as a fallback) under the hood, and does the bulk of the work for you.
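To see what pd.read_html does on its own, here is a standalone illustration with a made-up table (no browser needed; the table contents are invented for the demo):

```python
import pandas as pd
from io import StringIO

# A tiny hand-written HTML table standing in for the page's results grid
html = """
<table id="MainContent_gvUtilities">
  <tr><th>Utility</th><th>Country</th></tr>
  <tr><td>Acme Water</td><td>Kenya</td></tr>
  <tr><td>Blue River</td><td>Ghana</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> it finds;
# the <th> row is picked up automatically as the header
tables = pd.read_html(StringIO(html))
df = tables[0]
print(df)
```

From there, `df.to_csv(...)` gets you the CSV output directly, without walking rows and cells by hand.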

Code:

import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import Select


url = 'https://database.ib-net.org/search_utilities?type=2'
browser = webdriver.Chrome()
browser.get(url)
time.sleep(4)
print("Retrieving the site...")

# All regions available
regions = ['Africa', 'East Asia and Pacific', 'Europe and Central Asia', 'Latin America (including USA and Canada)', 'Middle East and Northern Africa', 'South Asia']

for region in regions:
   print("Starting output for the region: " + region)

   # Select all options from drop down menu
   selectOption = Select(browser.find_element_by_id('MainContent_ddRegion'))

   print("Now constructing output for: " + region)

   # Select table and wait for data to populate
   selectOption.select_by_visible_text(region)

   time.sleep(4)

   # Select the table containing the data and select all rows
   table = pd.read_html(browser.page_source)[0][:-1].dropna(axis=1)
   print(table)

   table.to_csv(region + '.csv', index=False)


browser.close()
browser.quit()
