
Unable to scrape a name from a webpage using requests

I've created a script in Python to fetch a name that is populated after filling in an input on a webpage. Here is how you can get that name manually: after opening the webpage (the site link is given below), enter 16803 next to CP Number and hit the search button.

I know how to grab it using selenium, but I'm not interested in going that route. Here I'm trying to collect the name using the requests module, mimicking the steps I can see in the Chrome dev tools for how the request is sent to that site. The only thing I can't supply automatically within the payload parameter is ScrollTop.

Website link: https://www.icsi.in/student/Members/MemberSearch.aspx

This is my attempt:

import requests
from bs4 import BeautifulSoup

URL = "https://www.icsi.in/student/Members/MemberSearch.aspx"

with requests.Session() as s:
    r = s.get(URL)
    cookie_item = "; ".join([str(x)+"="+str(y) for x,y in r.cookies.items()])
    soup = BeautifulSoup(r.text,"lxml")

    payload = {
        'StylesheetManager_TSSM':soup.select_one("#StylesheetManager_TSSM")['value'],
        'ScriptManager_TSM':soup.select_one("#ScriptManager_TSM")['value'],
        '__VIEWSTATE':soup.select_one("#__VIEWSTATE")['value'],
        '__VIEWSTATEGENERATOR':soup.select_one("#__VIEWSTATEGENERATOR")['value'],
        '__EVENTVALIDATION':soup.select_one("#__EVENTVALIDATION")['value'],
        'dnn$ctlHeader$dnnSearch$Search':soup.select_one("#dnn_ctlHeader_dnnSearch_SiteRadioButton")['value'],
        'dnn$ctr410$MemberSearch$ddlMemberType':0,
        'dnn$ctr410$MemberSearch$txtCpNumber': 16803,
        'ScrollTop': 474,
        '__dnnVariable': soup.select_one("#__dnnVariable")['value'],
    }

    headers = {
        'Content-Type':'multipart/form-data; boundary=----WebKitFormBoundaryBhsR9ScAvNQ1o5ks',
        'Referer': 'https://www.icsi.in/student/Members/MemberSearch.aspx',
        'Cookie':cookie_item,
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
    }
    res = s.post(URL,data=payload,headers=headers)
    soup_obj = BeautifulSoup(res.text,"lxml")
    name = soup_obj.select_one(".name_head > span").text
    print(name)

When I execute the above script I get the following error:

AttributeError: 'NoneType' object has no attribute 'text'

How can I use requests to grab the name that gets populated after filling in the input on that webpage?

The main issue with your code is the data encoding. You've set the Content-Type header to "multipart/form-data", but that alone does not create multipart-encoded data. It actually makes things worse, because the real encoding is different: the data parameter URL-encodes the POST data. To create multipart-encoded data, you should use the files parameter.
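
You can see the mismatch by preparing such a request yourself. A minimal sketch (not the fix); httpbin.org is used here only as a placeholder endpoint:

import requests

# Setting a multipart Content-Type header does not change how the body is
# built: the data parameter still URL-encodes it, so header and body disagree.
req = requests.Request(
    'POST', 'https://httpbin.org/post',  # placeholder URL for illustration
    data={'field': 'value'},
    headers={'Content-Type': 'multipart/form-data; boundary=x'},
).prepare()
print(req.body)  # field=value  <- URL-encoded despite the multipart header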

You can create multipart-encoded data either by passing an extra dummy parameter to files:

res = s.post(URL, data=payload, files={'file':''})

(that would change the encoding for all POST data, not just the 'file' field)

Or you could convert the values in your payload dictionary to tuples, which is the expected structure when posting files with requests.

payload = {k:(None, str(v)) for k,v in payload.items()}

The first value in each tuple is the file name; it is not needed in this case, so I've set it to None.
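
If you want to verify what requests sends with these tuples, you can prepare the request and inspect its body. A small sketch, again using httpbin.org as a stand-in endpoint:

import requests

# Each value is a (filename, content) tuple; the filename is None, so only
# the field content is sent.
payload = {'field': (None, 'value')}
req = requests.Request('POST', 'https://httpbin.org/post', files=payload).prepare()
print(req.headers['Content-Type'])  # multipart/form-data; boundary=...
print(req.body.decode())            # the multipart-encoded form sections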

Next, your POST data should contain an __EVENTTARGET value, which is required to get a valid response. (When creating the POST data dictionary, it is important to submit all the data that the server expects. You can get that data from a browser, either by inspecting the HTML form or by inspecting the network traffic.) The complete code:

import requests
from bs4 import BeautifulSoup

URL = "https://www.icsi.in/student/Members/MemberSearch.aspx"

with requests.Session() as s:
    r = s.get(URL)
    soup = BeautifulSoup(r.text,"lxml")

    # Collect every named input with its current value, including the hidden
    # ASP.NET fields (__VIEWSTATE, __EVENTVALIDATION, etc.).
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['dnn$ctr410$MemberSearch$txtCpNumber'] = 16803
    payload["__EVENTTARGET"] = 'dnn$ctr410$MemberSearch$btnSearch'
    # Tuples make requests build a multipart/form-data body.
    payload = {k:(None, str(v)) for k,v in payload.items()}

    r = s.post(URL, files=payload)
    soup_obj = BeautifulSoup(r.text,"lxml")
    name = soup_obj.select_one(".name_head > span").text
    print(name)

After some more testing, I discovered that the server also accepts URL-encoded data (probably because no files are posted). So you can get a valid response either with data or with files, provided that you don't change the default Content-Type header.
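
For example, this URL-encoded variant of the complete code should return the same name (a sketch based on that observation):

import requests
from bs4 import BeautifulSoup

URL = "https://www.icsi.in/student/Members/MemberSearch.aspx"

with requests.Session() as s:
    soup = BeautifulSoup(s.get(URL).text, "lxml")

    # Same form data as before, but left as plain values so that the
    # data parameter URL-encodes the body.
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['dnn$ctr410$MemberSearch$txtCpNumber'] = 16803
    payload['__EVENTTARGET'] = 'dnn$ctr410$MemberSearch$btnSearch'

    r = s.post(URL, data=payload)
    print(BeautifulSoup(r.text, "lxml").select_one(".name_head > span").text)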

It is not necessary to add any extra headers. When using a Session object, cookies are stored and submitted by default. The Content-Type header is created automatically: "application/x-www-form-urlencoded" when using the data parameter, "multipart/form-data" when using files. Changing the default User-Agent or adding a Referer is not required.
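
You can verify the automatically generated headers by inspecting the prepared request attached to each response. A minimal sketch, again with httpbin.org as a stand-in endpoint:

import requests

with requests.Session() as s:
    r1 = s.post('https://httpbin.org/post', data={'a': '1'})
    r2 = s.post('https://httpbin.org/post', files={'a': (None, '1')})
    print(r1.request.headers['Content-Type'])  # application/x-www-form-urlencoded
    print(r2.request.headers['Content-Type'])  # multipart/form-data; boundary=...
    # Any cookies set by earlier responses in the session are sent automatically.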
