
Unable to scrape a name from a webpage using requests

I've created a Python script to fetch a name that is populated after filling in an input on a webpage. Here is how you can get that name: after opening the webpage (site link given below), type 16803 next to CP Number and hit the search button.

I know how to grab it using selenium, but I'm not interested in going that route. I'm trying to collect the name using the requests module. Within my script I tried to mimic the steps (what I can see in the Chrome dev tools) for how the request is sent to that site. The only thing I can't supply automatically within the payload parameter is ScrollTop.

Website Link

This is my attempt:

import requests
from bs4 import BeautifulSoup

URL = "https://www.icsi.in/student/Members/MemberSearch.aspx"

with requests.Session() as s:
    r = s.get(URL)
    cookie_item = "; ".join([str(x)+"="+str(y) for x,y in r.cookies.items()])
    soup = BeautifulSoup(r.text,"lxml")

    payload = {
        'StylesheetManager_TSSM':soup.select_one("#StylesheetManager_TSSM")['value'],
        'ScriptManager_TSM':soup.select_one("#ScriptManager_TSM")['value'],
        '__VIEWSTATE':soup.select_one("#__VIEWSTATE")['value'],
        '__VIEWSTATEGENERATOR':soup.select_one("#__VIEWSTATEGENERATOR")['value'],
        '__EVENTVALIDATION':soup.select_one("#__EVENTVALIDATION")['value'],
        'dnn$ctlHeader$dnnSearch$Search':soup.select_one("#dnn_ctlHeader_dnnSearch_SiteRadioButton")['value'],
        'dnn$ctr410$MemberSearch$ddlMemberType':0,
        'dnn$ctr410$MemberSearch$txtCpNumber': 16803,
        'ScrollTop': 474,
        '__dnnVariable': soup.select_one("#__dnnVariable")['value'],
    }

    headers = {
        'Content-Type':'multipart/form-data; boundary=----WebKitFormBoundaryBhsR9ScAvNQ1o5ks',
        'Referer': 'https://www.icsi.in/student/Members/MemberSearch.aspx',
        'Cookie':cookie_item,
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
    }
    res = s.post(URL,data=payload,headers=headers)
    soup_obj = BeautifulSoup(res.text,"lxml")
    name = soup_obj.select_one(".name_head > span").text
    print(name)

When I execute the above script I get the following error:

AttributeError: 'NoneType' object has no attribute 'text'

How can I grab a name that is populated after filling in an input on a webpage, using requests?

The main issue with your code is the data encoding. I've noticed that you've set the Content-Type header to "multipart/form-data", but that alone is not enough to create multipart-encoded data. In fact, it is a problem, because the actual encoding is different: you're using the data parameter, which URL-encodes the POST data. To create multipart-encoded data, you should use the files parameter.

You could do that either by passing an extra dummy parameter to files,

res = s.post(URL, data=payload, files={'file':''})

(that would change the encoding for all POST data, not just the 'file' field)

Or you could convert the values in your payload dictionary to tuples, which is the structure requests expects when posting files.

payload = {k:(None, str(v)) for k,v in payload.items()}

The first value is the file name; it is not needed in this case, so I've set it to None.
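As a quick offline check, requests can prepare such a request without sending it; the (None, value) tuples become plain multipart form fields with no filename (example.com is just a placeholder URL here):

```python
import requests

# Prepare (but don't send) a POST to inspect how (None, value) tuples
# are encoded as multipart form fields.
payload = {'dnn$ctr410$MemberSearch$txtCpNumber': 16803}
files = {k: (None, str(v)) for k, v in payload.items()}

req = requests.Request('POST', 'https://example.com', files=files).prepare()
print(req.headers['Content-Type'])  # multipart/form-data; boundary=...
# The body contains a Content-Disposition part named after the field,
# with no filename= attribute because the first tuple item was None.
```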

Next, your POST data should contain an __EVENTTARGET value, which is required in order to get a valid response. (When creating the POST data dictionary, it is important to submit all the data the server expects. We can get that data from a browser, either by inspecting the HTML form or by inspecting the network traffic.) The complete code:

import requests
from bs4 import BeautifulSoup

URL = "https://www.icsi.in/student/Members/MemberSearch.aspx"

with requests.Session() as s:
    r = s.get(URL)
    soup = BeautifulSoup(r.text,"lxml")

    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['dnn$ctr410$MemberSearch$txtCpNumber'] = 16803
    payload["__EVENTTARGET"] = 'dnn$ctr410$MemberSearch$btnSearch'
    payload = {k:(None, str(v)) for k,v in payload.items()}

    r = s.post(URL, files=payload)
    soup_obj = BeautifulSoup(r.text,"lxml")
    name = soup_obj.select_one(".name_head > span").text
    print(name)
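The dictionary comprehension that collects the form fields can be illustrated offline with a small, hypothetical form snippet standing in for the real page (using the stdlib "html.parser" backend so no lxml is needed):

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page's ASP.NET form.
html = """
<form>
  <input name="__VIEWSTATE" value="abc123">
  <input name="__EVENTVALIDATION" value="xyz789">
  <input name="txtCpNumber">
</form>
"""
soup = BeautifulSoup(html, "html.parser")

# Every <input> that has a name attribute, defaulting missing values to "".
payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
print(payload)
# {'__VIEWSTATE': 'abc123', '__EVENTVALIDATION': 'xyz789', 'txtCpNumber': ''}
```

This mirrors what the browser itself submits: every named input, including the hidden ASP.NET state fields, with empty strings for inputs the user left blank.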

After some more tests, I discovered that the server also accepts URL-encoded data (probably because no files are posted). So you can get a valid response with either data or files, provided that you don't change the default Content-Type header.

It is not necessary to add any extra headers. When using a Session object, cookies are stored and submitted by default. The Content-Type header is created automatically: "application/x-www-form-urlencoded" when using the data parameter, "multipart/form-data" when using files. Changing the default User-Agent or adding a Referer is not required.
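This automatic header behavior can also be verified offline by preparing (not sending) two requests and comparing the headers requests generates:

```python
import requests

# requests picks the Content-Type based on which parameter carries the payload.
urlencoded = requests.Request('POST', 'https://example.com',
                              data={'a': '1'}).prepare()
multipart = requests.Request('POST', 'https://example.com',
                             files={'a': (None, '1')}).prepare()

print(urlencoded.headers['Content-Type'])  # application/x-www-form-urlencoded
print(multipart.headers['Content-Type'])   # multipart/form-data; boundary=...
```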
