
Web Scraping an ASP.NET site using Beautiful Soup and Python

I have the following code, but it gives me 200 OK with the first page (the default drop-down state) as the response. Please note that the drop-down lists are dynamic and progressive until the final search button appears. Can someone tell me what is wrong with my code?

import pprint

import requests
from bs4 import BeautifulSoup, SoupStrainer


def process(ghatno):
    post_url = 'http://igrmaharashtra.gov.in/eASR/eASRCommon.aspx?hDistName=Nashik'

    print("Please wait...getting details of: " + ghatno)

    with requests.Session() as session:
        # Initial GET: sets the session cookies and returns the hidden form fields.
        r = session.get(url=post_url)
        pprint.pprint(r.headers)
        gethead = r.headers
        soup = BeautifulSoup(r.text, 'html.parser')
        viewstate = soup.select('input[name="__VIEWSTATE"]')[0]['value']
        csrftoken = soup.select('input[name="__CSRFTOKEN"]')[0]['value']
        eventvalidation = soup.select('input[name="__EVENTVALIDATION"]')[0]['value']
        viewgen = soup.select('input[name="__VIEWSTATEGENERATOR"]')[0]['value']

        data = {
            '__CSRFTOKEN': csrftoken,
            '__EVENTARGUMENT': '',
            '__EVENTTARGET': '',
            '__LASTFOCUS': '',
            '__SCROLLPOSITION': '0',
            '__SCROLLPOSITIONY': '0',
            '__EVENTVALIDATION': eventvalidation,
            '__VIEWSTATE': viewstate,
            '__VIEWSTATEGENERATOR': viewgen,
            'ctl00$ContentPlaceHolder5$ddlLanguage': 'en-US',
            'ctl00$ContentPlaceHolder5$btnSearchCommonSr': 'Search',
            'ctl00$ContentPlaceHolder5$ddlTaluka': '2',
            'ctl00$ContentPlaceHolder5$ddlVillage': '25',
            'ctl00$ContentPlaceHolder5$ddlYear': '20192020',
            'ctl00$ContentPlaceHolder5$grpSurveyLocation': 'rdbSurveyNo',
            'ctl00$ContentPlaceHolder5$txtCommonSurvey': 363
        }

        headers = {
            'Host': 'igrmaharashtra.gov.in',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:60.0) Gecko/20100101 Firefox/60.0',
            'Referer': 'http://igrmaharashtra.gov.in/eASR/eASRCommon.aspx?hDistName=Nashik',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
        }

        # Send the form as URL-encoded fields (a dict, not a JSON string),
        # on the same session so that its cookies are reused.
        r = session.post(url=post_url, data=data, headers=headers)
        # Keep only the <tr> elements of the response.
        rows = BeautifulSoup(r.text, 'html.parser', parse_only=SoupStrainer('tr'))
        print(rows.get_text())
        pprint.pprint(r.headers)
        print(r.text)
        getpost = r.headers
        getpostrequest = r.request.headers
        getrequestbody = r.request.body

        # Dump the GET response headers, the POST request headers and body,
        # and the POST response headers for inspection.
        with open('/var/www/html/nashik/hiren.txt', 'w') as f:
            f.write(str(gethead))
            f.write(str(getpostrequest))
            f.write(str(getrequestbody))
            f.write(str(getpost))

My response is as below:

Response header (GET request)

{'Content-Length': '5994', 'X-AspNet-Version': '4.0.30319', 'Set-Cookie': 'ASP.NET_SessionId=24wwh11lwvzy5gf0xlzi1we4; path=/; HttpOnly, __CSRFCOOKIE=d7b10286-fc9f-4ed2-863d-304737df8758; path=/; HttpOnly', 'Content-Encoding': 'gzip', 'Vary': 'Accept-Encoding', 'X-Powered-By': 'ASP.NET', 'Server': 'Microsoft-IIS/8.0', 'Cache-Control': 'private', 'Date': 'Thu, 02 May 2019 08:21:48 GMT', 'Content-Type': 'text/html; charset=utf-8'}

Request header (POST request)

{'Content-Length': '3726', 'Accept-Language': 'en-US,en;q=0.5', 'Accept-Encoding': 'gzip, deflate', 'Host': 'igrmaharashtra.gov.in', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:60.0) Gecko/20100101 Firefox/60.0', 'Connection': 'keep-alive', 'Referer': 'http://igrmaharashtra.gov.in/eASR/eASRCommon.aspx?hDistName=Nashik', 'Cookie': '__CSRFCOOKIE=d7b10286-fc9f-4ed2-863d-304737df8758; ASP.NET_SessionId=24wwh11lwvzy5gf0xlzi1we4', 'Content-Type': 'application/x-www-form-urlencoded'}

Response header (POST request)

{'Content-Length': '7834', 'X-AspNet-Version': '4.0.30319', 'Content-Encoding': 'gzip', 'Vary': 'Accept-Encoding', 'X-Powered-By': 'ASP.NET', 'Server': 'Microsoft-IIS/8.0', 'Cache-Control': 'private', 'Date': 'Fri, 03 May 2019 10:21:45 GMT', 'Content-Type': 'text/html; charset=utf-8'}

**The default page, with the default drop-down selection, is returned**

The response shows नाशिक and "- - Select Taluka - -" instead of option value "2" (i.e. इगतपुरी). Once option "2" is selected, I want value "25" in the next drop-down, and only then do I enter my final survey number "363" to get the results.

Please note that I tried the Mechanize browser too, but had no luck!

Finally, the solution is to send the POST requests multiple times in the same session, with the same cookies, and iterate through them: one POST per drop-down selection, followed by the final search. It works now!
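The multi-step approach described above might look roughly like the following sketch. It is not the exact code that was used. The helper names (`hidden_fields`, `run_postbacks`, `PREFIX`, `STEPS`) are made up for illustration. It is also an assumption that each drop-down posts back with `__EVENTTARGET` set to its own control name, and that the steps run in the order shown. Compare them against the requests your browser sends in its network tab. To keep the sketch free of third-party imports, the hidden fields are read with the standard-library `HTMLParser` rather than Beautiful Soup, and `session` can be any object with `get` and `post` methods, such as a `requests.Session`.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def hidden_fields(html):
    """Return the hidden form fields (__VIEWSTATE etc.) found in `html`."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


def run_postbacks(session, url, steps):
    """GET the page, then send one POST per step on the same session.

    `steps` is a list of (event_target, form_values) pairs. Every POST
    carries the hidden fields taken from the previous response plus all
    form values chosen so far.
    """
    response = session.get(url)
    chosen = {}
    for event_target, form_values in steps:
        chosen.update(form_values)
        data = hidden_fields(response.text)  # fresh state from the last page
        data.update(chosen)
        data["__EVENTTARGET"] = event_target
        data.setdefault("__EVENTARGUMENT", "")
        response = session.post(url, data=data)
    return response


# Field names and values are taken from the question; the step order and the
# __EVENTTARGET values are assumptions.
PREFIX = "ctl00$ContentPlaceHolder5$"
STEPS = [
    (PREFIX + "ddlTaluka", {PREFIX + "ddlTaluka": "2"}),
    (PREFIX + "ddlVillage", {PREFIX + "ddlVillage": "25"}),
    ("", {
        PREFIX + "grpSurveyLocation": "rdbSurveyNo",
        PREFIX + "txtCommonSurvey": "363",
        PREFIX + "btnSearchCommonSr": "Search",
    }),
]
```

With the requests library, this could be called as `run_postbacks(requests.Session(), post_url, STEPS)`, and the returned response parsed with Beautiful Soup as in the question. Re-reading the hidden fields after every response matters because ASP.NET issues new `__VIEWSTATE` and `__EVENTVALIDATION` values on each postback.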
