
Web scraper for dynamic forms in Python

I am trying to fill in the form on this page: http://www.marutisuzuki.com/Maruti-Price.aspx

It consists of three drop-down lists: the first is the car model, the second is the state, and the third is the city. The first two are static. The third, city, is generated dynamically from the value of state: an onclick JavaScript event fetches the cities for the selected state.

I am familiar with the mechanize module in Python. I came across several links telling me that mechanize cannot handle dynamic content. However, the section "Adding item dynamically" of http://toddhayton.com/2014/12/08/form-handling-with-mechanize-and-beautifulsoup/ states that mechanize can handle dynamic content, but I did not understand this line of code from it:

item = Item(br.form.find_control(name='searchAuxCountryID'),{'contents': '3', 'value': '3', 'label': 3})

What is "Item" in this line of code, and how does it correspond to the city field in my form? I also came across the selenium module, which might help me handle the dynamic drop-down list, but I was not able to find anything in its documentation, or any good blog post, on how to use it.

Can someone suggest how to submit this form for different models, states and cities? Any links on how to solve this problem would be appreciated, and sample Python code showing how to submit the form would be helpful. Thanks in advance.

If you look at the request being sent to that site in developer tools, you'll see that a POST is sent as soon as you select a state. The response that is sent back has the form with the values in the city dropdown populated.
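To make that request concrete, here is a minimal sketch of how such a POST body could be put together by hand. It assumes the page is an ASP.NET WebForms page (the .aspx URL suggests this), where the form carries hidden fields such as __VIEWSTATE and the changed control is named in __EVENTTARGET. The helper name build_postback_body, the hidden field values and the state code 'DL' are made up for illustration, so check the real names and values in developer tools.

```python
from urllib.parse import urlencode

def build_postback_body(hidden_fields, selections, event_target):
    """Return the URL-encoded body of a form postback.

    hidden_fields: hidden inputs copied from the page (e.g. __VIEWSTATE)
    selections:    dropdown name -> chosen value
    event_target:  name of the control whose change triggers the POST
    """
    fields = dict(hidden_fields)
    fields.update(selections)
    fields['__EVENTTARGET'] = event_target
    return urlencode(fields)

# Placeholder values; the real ones come from the page's hidden inputs
hidden = {'__VIEWSTATE': 'abc123', '__EVENTVALIDATION': 'xyz789'}
body = build_postback_body(
    hidden,
    {'ctl00$ContentPlaceHolder1$ddlmodel': 'AK',
     'ctl00$ContentPlaceHolder1$ddlState': 'DL'},  # 'DL' is a made-up state code
    event_target='ctl00$ContentPlaceHolder1$ddlState',
)
print(body)
```

With mechanize you should not need to build this yourself: submit() sends the form's hidden fields together with the values you selected.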

So, to replicate this in your script you want something like the following:

  • Open the page
  • Select the form
  • Select values for model and state
  • Submit the form
  • Select the form from the response sent back
  • Select value for city (it should be populated now)
  • Submit the form
  • Parse the response for the table of results

That will look something like:

#!/usr/bin/env python

import re
import mechanize

from bs4 import BeautifulSoup

# Dealer names sit in spans whose ids end in a row number
dealer_name_id = re.compile(r'^ContentPlaceHolder1_dtDealer_lblName_\d+$')

def select_form(form):
    # Predicate for mechanize: match the page's main form by its id
    return form.attrs.get('id', None) == 'form1'

def get_state_items(browser):
    browser.select_form(predicate=select_form)
    ctl = browser.form.find_control('ctl00$ContentPlaceHolder1$ddlState')
    state_items = ctl.get_items()
    return state_items[1:]  # skip the first entry of the dropdown

def get_city_items(browser):
    browser.select_form(predicate=select_form)
    ctl = browser.form.find_control('ctl00$ContentPlaceHolder1$ddlCity')
    city_items = ctl.get_items()
    return city_items[1:]  # skip the first entry of the dropdown

br = mechanize.Browser()
br.open('http://www.marutisuzuki.com/Maruti-Price.aspx')
br.select_form(predicate=select_form)
br.form['ctl00$ContentPlaceHolder1$ddlmodel'] = ['AK']  # model = Maruti Suzuki Alto K10

for state in get_state_items(br):
    # 1 - Submit form for state.name to get cities for this state
    br.select_form(predicate=select_form)
    br.form['ctl00$ContentPlaceHolder1$ddlState'] = [state.name]
    br.submit()

    # 2 - Now the city dropdown is filled for state.name
    for city in get_city_items(br):
        br.select_form(predicate=select_form)
        br.form['ctl00$ContentPlaceHolder1$ddlCity'] = [city.name]
        br.submit()

        # Name the parser explicitly so bs4 does not have to guess one
        soup = BeautifulSoup(br.response().read(), 'html.parser')
        table = soup.find('table', id='ContentPlaceHolder1_dtDealer')
        if table is None:
            continue  # no results table in this response

        header_printed = False
        for span in table.find_all('span', id=dealer_name_id):
            cells = span.find_parent('tr').find_all('td')

            if not header_printed:
                # Avoid naming the variable "str", which shadows the builtin
                header = '%s, %s' % (city.attrs['label'], state.attrs['label'])
                print(header)
                print('-' * len(header))
                header_printed = True

            print(' '.join(cell.text.strip() for cell in cells))
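The parsing step can also be tried offline, without mechanize or BeautifulSoup. The sketch below uses only the standard library's html.parser on a made-up HTML fragment. The class name DealerRowParser and the sample rows are hypothetical; only the span id pattern is taken from the script above. Unlike the script, it does not look up the table by id first, it simply keeps every row that contains a matching span.

```python
import re
from html.parser import HTMLParser

NAME_ID = re.compile(r'^ContentPlaceHolder1_dtDealer_lblName_\d+$')

class DealerRowParser(HTMLParser):
    """Collect the cell texts of every <tr> that contains a dealer-name span."""

    def __init__(self):
        super().__init__()
        self.rows = []               # finished rows: lists of cell strings
        self._cells = None           # cells of the row being read, or None
        self._in_cell = False
        self._is_dealer_row = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._cells, self._is_dealer_row = [], False
        elif tag == 'td' and self._cells is not None:
            self._cells.append('')
            self._in_cell = True
        elif tag == 'span' and NAME_ID.match(dict(attrs).get('id') or ''):
            self._is_dealer_row = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_cell = False
        elif tag == 'tr' and self._cells is not None:
            if self._is_dealer_row:
                self.rows.append([c.strip() for c in self._cells])
            self._cells = None

    def handle_data(self, data):
        # Text only counts while we are inside a <td> of the current row
        if self._in_cell and self._cells:
            self._cells[-1] += data

# Made-up fragment shaped like the results table
sample = '''
<table id="ContentPlaceHolder1_dtDealer">
  <tr><td>Dealer</td><td>Phone</td></tr>
  <tr><td><span id="ContentPlaceHolder1_dtDealer_lblName_0">Example Motors</span></td><td>000-0000</td></tr>
  <tr><td><span id="ContentPlaceHolder1_dtDealer_lblName_1">Sample Autos</span></td><td>111-1111</td></tr>
</table>
'''

parser = DealerRowParser()
parser.feed(sample)
for row in parser.rows:
    print(' '.join(row))
```

The header row is skipped because it has no span with a matching id, which mirrors how the script only prints rows found through the dealer-name spans.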

I ran into the same problem with this tutorial, and this worked for me:

item = mechanize.Item(br.form.find_control(name='searchAuxCountryID'),{'contents': '3', 'value': '3', 'label': 3})
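For context on why that line is needed: as far as I understand it, mechanize only lets you select values that already exist as items of a list control, and Item(control, attrs) registers an extra one on that control. The toy model below is not mechanize. SelectControl and its methods are hypothetical names that mimic this behaviour in plain Python so the idea can be run on its own.

```python
class SelectControl:
    """Toy stand-in for a dropdown: only known option values may be chosen."""

    def __init__(self, name, values):
        self.name = name
        self.values = list(values)
        self.selected = None

    def add_item(self, value):
        # Plays the role of Item(control, {...}): register an option
        # that was not in the HTML the page was first served with.
        self.values.append(value)

    def select(self, value):
        if value not in self.values:
            raise LookupError('no item with value %r in %s' % (value, self.name))
        self.selected = value

city = SelectControl('city', ['0'])  # only a placeholder option so far

try:
    city.select('3')                 # the option does not exist yet
except LookupError as exc:
    print('rejected:', exc)

city.add_item('3')                   # add the missing option first
city.select('3')
print('selected:', city.selected)
```

In the question's case, the city dropdown is empty in the page as first served, so a city value can only be selected after it has been added this way or after the state has been submitted and the server has filled the list.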


 