Web Scraper for dynamic forms in Python

I am trying to fill in the form on this website: http://www.marutisuzuki.com/Maruti-Price.aspx

It consists of three drop-down lists: the first is the car model, the second is the state, and the third is the city. The first two are static. The third, city, is generated dynamically from the value of state: a JavaScript onclick event fetches the cities for the selected state.

I am familiar with the mechanize module in Python. I came across several links saying that mechanize cannot handle dynamic content. But this link, http://toddhayton.com/2014/12/08/form-handling-with-mechanize-and-beautifulsoup/ , states in the section "Adding item dynamically" that mechanize can be used to handle dynamic content. However, I did not understand this line of code in it:

item = Item(br.form.find_control(name='searchAuxCountryID'),{'contents': '3', 'value': '3', 'label': 3})

What is "Item" in this line of code, and how does it correspond to the city field in the form? I also came across the selenium module, which might help me handle the dynamic drop-down list, but I was not able to find anything in its documentation, or any good blog post, on how to use it.

Can someone suggest how to submit this form for different models, states and cities? Any links on how to solve this problem would be appreciated. Sample code in Python showing how to submit the form would be helpful. Thanks in advance.

If you look at the requests being sent to that site in your browser's developer tools, you'll see that a POST is sent as soon as you select a state. The response that comes back contains the form with the city dropdown populated.
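
To make that round trip concrete, here is a minimal standard-library sketch of what such a POST body might contain and how the options of the returned dropdown could be read. It assumes the page is an ASP.NET WebForms page that uses the usual hidden fields `__VIEWSTATE` and `__EVENTTARGET`; whether this particular site needs `__EVENTTARGET` set is an assumption, and `SAMPLE_HTML`, the state value `DL`, and the helper names `FormFieldParser` and `build_postback` are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, parse_qs


class FormFieldParser(HTMLParser):
    """Collect hidden inputs and <select> option values from a page."""

    def __init__(self):
        super().__init__()
        self.hidden = {}      # hidden input name -> value
        self.options = {}     # select name -> list of option values
        self._select = None   # name of the <select> currently open

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") == "hidden" and "name" in a:
            self.hidden[a["name"]] = a.get("value") or ""
        elif tag == "select" and "name" in a:
            self._select = a["name"]
            self.options[self._select] = []
        elif tag == "option" and self._select is not None:
            self.options[self._select].append(a.get("value") or "")

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


def build_postback(html, event_target, selections):
    """Return (urlencoded POST body, options per select) for a postback."""
    parser = FormFieldParser()
    parser.feed(html)
    fields = dict(parser.hidden)            # echo the hidden state back
    fields["__EVENTTARGET"] = event_target  # control that "fired" (assumed)
    fields.update(selections)               # the dropdown values chosen
    return urlencode(fields), parser.options


# Hypothetical, heavily trimmed stand-in for the real page.
SAMPLE_HTML = """
<form id="form1" method="post" action="Maruti-Price.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTTARGET" value="" />
  <select name="ctl00$ContentPlaceHolder1$ddlState">
    <option value="0">Select</option>
    <option value="DL">Delhi</option>
  </select>
  <select name="ctl00$ContentPlaceHolder1$ddlCity">
    <option value="0">Select</option>
  </select>
</form>
"""

body, options = build_postback(
    SAMPLE_HTML,
    "ctl00$ContentPlaceHolder1$ddlState",
    {"ctl00$ContentPlaceHolder1$ddlState": "DL"},
)
print(body)
print(options["ctl00$ContentPlaceHolder1$ddlCity"])
```

Before the state is posted, the city dropdown holds only its placeholder option. Running the same parser over the response to the POST is where the populated city list would show up.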

So, to replicate this in your script, you want something like the following:

  • Open the page
  • Select the form
  • Select values for model and state
  • Submit the form
  • Select the form from the response sent back
  • Select a value for city (it should be populated now)
  • Submit the form
  • Parse the response for the table of results

That will look something like:

#!/usr/bin/env python

import re
import mechanize

from bs4 import BeautifulSoup

def select_form(form):
    # Predicate for Browser.select_form: pick the form with id="form1"
    return form.attrs.get('id', None) == 'form1'

def get_state_items(browser):
    # Return the state options, skipping the first (placeholder) entry
    browser.select_form(predicate=select_form)
    ctl = browser.form.find_control('ctl00$ContentPlaceHolder1$ddlState')
    state_items = ctl.get_items()
    return state_items[1:]

def get_city_items(browser):
    # Return the city options, skipping the first (placeholder) entry
    browser.select_form(predicate=select_form)
    ctl = browser.form.find_control('ctl00$ContentPlaceHolder1$ddlCity')
    city_items = ctl.get_items()
    return city_items[1:]

br = mechanize.Browser()
br.open('http://www.marutisuzuki.com/Maruti-Price.aspx')
br.select_form(predicate=select_form)
br.form['ctl00$ContentPlaceHolder1$ddlmodel'] = ['AK']  # model = Maruti Suzuki Alto K10

for state in get_state_items(br):
    # 1 - Submit form for state.name to get cities for this state
    br.select_form(predicate=select_form)
    br.form['ctl00$ContentPlaceHolder1$ddlState'] = [state.name]
    br.submit()

    # 2 - Now the city dropdown is filled for state.name
    for city in get_city_items(br):
        br.select_form(predicate=select_form)
        br.form['ctl00$ContentPlaceHolder1$ddlCity'] = [city.name]
        br.submit()

        # Name the parser explicitly so bs4 does not have to guess one
        s = BeautifulSoup(br.response().read(), 'html.parser')
        t = s.find('table', id='ContentPlaceHolder1_dtDealer')
        if t is None:
            # No dealer table in the response for this city
            continue
        r = re.compile(r'^ContentPlaceHolder1_dtDealer_lblName_\d+$')

        header_printed = False
        for p in t.find_all('span', id=r):
            tr = p.find_parent('tr')
            td = tr.find_all('td')

            if not header_printed:
                # Print "City, State" once, underlined, before the first row
                header = '%s, %s' % (city.attrs['label'], state.attrs['label'])
                print(header)
                print('-' * len(header))
                header_printed = True

            print(' '.join(x.text.strip() for x in td))
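
The row extraction at the end of the loop can also be tried offline, without mechanize or BeautifulSoup. The sketch below uses only the standard library and keeps every table row that contains a span whose id matches the same pattern as above. `SAMPLE`, including the dealer name and price in it, is invented for illustration; the real page's markup may differ, and nested tables are not handled.

```python
import re
from html.parser import HTMLParser

NAME_ID = re.compile(r"^ContentPlaceHolder1_dtDealer_lblName_\d+$")


class DealerRowParser(HTMLParser):
    """Collect the cell texts of every <tr> that holds a dealer-name span."""

    def __init__(self):
        super().__init__()
        self.rows = []           # finished rows, each a list of cell texts
        self._cells = None       # cells of the row being read, or None
        self._text = None        # text pieces of the cell being read, or None
        self._is_dealer = False  # True once a matching span is seen in the row

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells, self._is_dealer = [], False
        elif tag == "td" and self._cells is not None:
            self._text = []
        elif tag == "span" and NAME_ID.match(dict(attrs).get("id") or ""):
            self._is_dealer = True

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._text is not None:
            # Collapse runs of whitespace inside the cell
            self._cells.append(" ".join("".join(self._text).split()))
            self._text = None
        elif tag == "tr" and self._cells is not None:
            if self._is_dealer:
                self.rows.append(self._cells)
            self._cells, self._text = None, None


# Hypothetical markup modelled on the ids used in the script above.
SAMPLE = """
<table id="ContentPlaceHolder1_dtDealer">
  <tr><td>Dealer</td><td>Price</td></tr>
  <tr>
    <td><span id="ContentPlaceHolder1_dtDealer_lblName_0">Example Motors</span></td>
    <td> 3,50,000 </td>
  </tr>
</table>
"""

parser = DealerRowParser()
parser.feed(SAMPLE)
for row in parser.rows:
    print(" ".join(row))
```

The header row is dropped because it has no matching span, which mirrors how the script above walks up from each matching span to its parent row.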

I ran into the same problem with this tutorial, and this worked for me:

item = mechanize.Item(br.form.find_control(name='searchAuxCountryID'),{'contents': '3', 'value': '3', 'label': 3})
