简体   繁体   中英

Scraping multiple select options using Selenium

I am required to scrape PDF's from the website https://secc.gov.in/lgdStateList . There are 3 drop-down menus for a state, a district and a block. There are several states, under each state we have districts and under each district there are blocks.

I tried to implement the following code. I was able to select the state, but there seems to be some error when I select the district.

from selenium import webdriver
from selenium.webdriver.support.ui import Select
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import time

from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Chrome()
url = ("https://secc.gov.in/lgdStateList")
browser.get(url)
html_source = browser.page_source
browser.quit()
soup = BeautifulSoup(html_source, 'html.parser')
for name_list in soup.find_all(class_ ='dropdown-row'):
    print(name_list.text)

driver = webdriver.Chrome()
driver.get('https://secc.gov.in/lgdStateList')

selectState = Select(driver.find_element_by_id("lgdState"))
for state in selectState.options:
    state.click()
    selectDistrict = Select(driver.find_element_by_id("lgdDistrict"))
    for district in selectDistrict.options:
        district.click()
        selectBlock = Select(driver.find_element_by_id("lgdBlock"))
        for block in selectBlock.options():
            block.click()

The error I ran into is:

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="lgdDistrict"]"}
  (Session info: chrome=83.0.4103.106)

I need help crawling through the 3 menus.

Any help/suggestions would be really appreciated. Let me know of any clarifications in the comments.

This is where you can find the value of different states. You can find the same from district and block dropdowns.

You should now use those values within payload to get the table you would like to grab data from:

import urllib3
import requests
from bs4 import BeautifulSoup

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

link = "https://secc.gov.in/lgdGpList"

payload = {
    'stateCode': '10',
    'districtCode': '188',
    'blockCode': '1624'
}

r = requests.post(link,data=payload,verify=False)
soup = BeautifulSoup(r.text,"html.parser")
for items in soup.select("table#example tr"):
    data = [' '.join(item.text.split()) for item in items.select("th,td")]
    print(data)

Output the script produces:

['Select State', 'Select District', 'Select Block']
['', 'Select District', 'Select Block']
['ARARIA BASTI (93638)', 'BANGAMA (93639)', 'BANSBARI (93640)']
['BASANTPUR (93641)', 'BATURBARI (93642)', 'BELWA (93643)']
['BOCHI (93644)', 'CHANDRADEI (93645)', 'CHATAR (93646)']
['CHIKANI (93647)', 'DIYARI (93648)', 'GAINRHA (93649)']
['GAIYARI (93650)', 'HARIA (93651)', 'HAYATPUR (93652)']
['JAMUA (93653)', 'JHAMTA (93654)', 'KAMALDAHA (93655)']
['KISMAT KHAWASPUR (93656)', 'KUSIYAR GAWON (93657)', 'MADANPUR EAST (93658)']
['MADANPUR WEST (93659)', 'PAIKTOLA (93660)', 'POKHARIA (93661)']
['RAMPUR KODARKATTI (93662)', 'RAMPUR MOHANPUR EAST (93663)', 'RAMPUR MOHANPUR WEST (93664)']
['SAHASMAL (93665)', 'SHARANPUR (93666)', 'TARAUNA BHOJPUR (93667)']

You need to scrape the numbers available in brackets adjacent to each results above and then use them in payload and send another post requests to download the pdf files. Make sure to put the script in a folder before execution so that you can get all the files within.

import urllib3
import requests
from bs4 import BeautifulSoup

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

link = "https://secc.gov.in/lgdGpList"
download_link = "https://secc.gov.in/downloadLgdwisePdfFile"

payload = {
    'stateCode': '10',
    'districtCode': '188',
    'blockCode': '1624'
}
r = requests.post(link,data=payload,verify=False)
soup = BeautifulSoup(r.text,"html.parser")
for item in soup.select("table#example td > a[onclick^='downloadLgdFile']"):
    gp_code = item.text.strip().split("(")[1].split(")")[0]
    payload['gpCode'] = gp_code
    with open(f'{gp_code}.pdf','wb') as f:
        f.write(requests.post(download_link,data=payload,verify=False).content)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM