简体   繁体   中英

How can I scrape this?

I need to scrape this page (which has a form): http://kllads.kar.nic.in/MLAWise_reports.aspx , with Python preferably (if not Python, then JavaScript). I was looking at libraries like RoboBrowser (which is basically Mechanize + BeautifulSoup) and (maybe) Selenium but I'm not quite sure on how to go about it. From inspecting the element, it seems to be a WebForm that I need to fill in. After filling that in, the webpage generates some data that I need to store. How should I do this?

You can interact with the javascript web forms relatively easily in Selenium. You may need to install a webdriver quickly, but besides that all you need to do is find the form using its xpath and then have Selenium select an option from the drop down menu using the option's xpath. For the web page provided that would look something like this:

#import functions from selenium module
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# open chrome browser using webdriver
path_to_chromedriver = '/Users/Michael/Downloads/chromedriver'
browser = webdriver.Chrome(executable_path=path_to_chromedriver)

# open web page using browser
browser.get('http://kllads.kar.nic.in/MLAWise_reports.aspx')

# wait for page to load then find 'Constituency Name' dropdown and select 'Aland (46)''
const_name = WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="ddlconstname"]')))
browser.find_element_by_xpath('//*[@id="ddlconstname"]/option[2]').click()

# wait for the page to load then find 'Select Status' dropdown and select 'OnGoing'
sel_status = WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="ddlstatus1"]')))
browser.find_element_by_xpath('//*[@id="ddlstatus1"]/option[2]').click()

# wait for browser to load then click 'Generate Report'
gen_report = WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="BtnReport"]')))
browser.find_element_by_xpath('//*[@id="BtnReport"]').click()

Between each interaction, you are just giving the browser some time to load before attempting click the next element. Once all the forms are filled out, the page will display the data based on the options selected and you should be able to scrape the table data. I had a few issues when attempting to load data for the first Constituency Name option, but the others seemed to work fine.

You should also be able to loop through all the dropdown options available under each web form to display all the data.

Hope that helps!

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM