
Scraping a website with multiple pages that uses the __doPostBack method, where the URL doesn't change between pages

I am using BeautifulSoup to scrape https://excise.wb.gov.in/chms/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx?Public_District_Code=019
There are two pages of information in total. To navigate between them, there are links at the top and bottom of the page, such as 1 and 2. These links use __doPostBack:

href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView2','Page$2')"
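For reference, the two arguments of such a link correspond to the __EVENTTARGET and __EVENTARGUMENT form fields that the browser posts. A minimal sketch of pulling them out of an href, assuming links in exactly the format shown above; `parse_postback` and `POSTBACK_RE` are hypothetical names used only for illustration:

```python
import re

# Matches __doPostBack('<target>','<argument>') and captures both parts.
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)'\s*,\s*'([^']*)'\)")


def parse_postback(href):
    """Return (event_target, event_argument) from a __doPostBack href, or None."""
    match = POSTBACK_RE.search(href)
    if match is None:
        return None
    target, argument = match.groups()
    # Strip stray whitespace that can surround the values in copied markup.
    return target.strip(), argument.strip()


href = "javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView2','Page$2')"
print(parse_postback(href))
# ('ctl00$ContentPlaceHolder1$GridView2', 'Page$2')
```
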

The problem is that when we navigate from one page to another, the URL doesn't change; only the second argument of __doPostBack changes, i.e. for page 1 it is Page$1 and for page 2 it is Page$2. How do I use BeautifulSoup to iterate over several pages and extract the information? The form data is as follows.

ctl00$ScriptManager1: ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$GridView2
ctl00$ContentPlaceHolder1$ddl_District: 019
ctl00$ContentPlaceHolder1$rdo_Govt_Flag: G
__EVENTTARGET: ctl00$ContentPlaceHolder1$GridView2
__EVENTARGUMENT: Page$2

There is also a variable called __VIEWSTATE in the form data, but its contents are huge. I looked at several solutions and posts that suggest inspecting the parameters of the POST call and reusing them, but I am unable to make sense of the parameters sent in the POST request.
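On the parameters themselves: in ASP.NET Web Forms, __VIEWSTATE is an opaque state blob that only needs to be sent back unchanged, while __EVENTTARGET and __EVENTARGUMENT name the control and the action being triggered. A minimal sketch of assembling such a payload, with placeholder values standing in for the real hidden fields; `build_postback_payload` is a hypothetical helper, not part of any library:

```python
def build_postback_payload(hidden_fields, event_target, event_argument):
    """Merge hidden form fields with the two fields that select the postback action."""
    # Echo __VIEWSTATE and the other hidden fields back unchanged.
    payload = dict(hidden_fields)
    # Set these last so empty hidden inputs of the same name cannot overwrite them.
    payload["__EVENTTARGET"] = event_target
    payload["__EVENTARGUMENT"] = event_argument
    return payload


# Placeholder values; on the real page they come from the <input> elements.
hidden = {"__VIEWSTATE": "opaque-blob", "__EVENTTARGET": "", "__EVENTARGUMENT": ""}
payload = build_postback_payload(
    hidden, "ctl00$ContentPlaceHolder1$GridView2", "Page$2"
)
print(payload["__EVENTARGUMENT"])  # Page$2
```

The resulting dictionary is what would be passed as `data=` to `requests.post`; nothing in __VIEWSTATE has to be decoded or understood for this to work.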

You can use this example to load the next page on this site using requests:

import requests
from bs4 import BeautifulSoup


url = "https://excise.wb.gov.in/chms/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx?Public_District_Code=019"
soup = BeautifulSoup(requests.get(url).content, "html.parser")


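def load_page(soup, page_num):
    """POST the form back to the server and return the soup of page `page_num`."""
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0",
    }

    # Start from every named <input> on the current page; this carries over
    # __VIEWSTATE and the other hidden state fields unchanged.
    payload = {inp["name"]: inp.get("value") for inp in soup.select("input[name]")}

    # Set the postback fields afterwards so that empty hidden inputs with the
    # same names cannot overwrite them.
    payload.update(
        {
            "ctl00$ScriptManager1": "ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$GridView2",
            "__EVENTTARGET": "ctl00$ContentPlaceHolder1$GridView2",
            "__EVENTARGUMENT": "Page${}".format(page_num),
            "__LASTFOCUS": "",
            "__ASYNCPOST": "true",
            "ctl00$ContentPlaceHolder1$ddl_District": "019",
            "ctl00$ContentPlaceHolder1$rdo_Govt_Flag": "G",
        }
    )
    # The captured form data does not include this checkbox, so drop it if present.
    payload.pop("ctl00$ContentPlaceHolder1$chk_Available", None)

    api_url = "https://excise.wb.gov.in/chms/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx?Public_District_Code=019"
    response = requests.post(api_url, data=payload, headers=headers)
    response.raise_for_status()  # fail loudly on HTTP errors
    return BeautifulSoup(response.content, "html.parser")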


# print hospitals from first page:
for h5 in soup.select("h5"):
    print(h5.text)

# load second page
soup = load_page(soup, 2)

# print hospitals from second page
for h5 in soup.select("h5"):
    print(h5.text)

Prints:

 AMRI, Salt Lake - Vivekananda Yuba Bharati Krirangan Salt Lake Stadium (Satellite Govt. Building)
 Calcutta National Medical College and Hospital (Government Hospital)
 CHITTARANJAN NATIONAL CANCER INSTITUTE-CNCI (Government Hospital)
 College of Medicine  Sagore Dutta Hospital (Government Hospital)
 ESI Hospital Maniktala (Government Hospital)
 ESI Hospital Sealdah (Government Hospital)
 I.D. And B.G. Hospital (Government Hospital)
 M R Bangur Hospital (Government Hospital)
 Medical College and Hospital, Kolkata, (Government Hospital)
 Nil Ratan Sarkar Medical College and Hospital (Government Hospital)
 R. G. Kar Medical College and Hospital  (Government Hospital)
 Sambhunath Pandit Hospital (Government Hospital)
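To walk through every page rather than only the second, one option is to follow the Page$N pager links until no unseen page number remains. The sketch below illustrates that loop under stated assumptions: the current page is assumed not to appear as a link in its own pager, `pager_numbers` and `iter_pages` are hypothetical helpers, and `FAKE_SITE` is made-up HTML standing in for the real responses so the example runs offline.

```python
import re


def pager_numbers(html):
    """Return the sorted page numbers referenced by Page$N pager links."""
    return sorted({int(n) for n in re.findall(r"Page\$(\d+)", html)})


def iter_pages(first_html, fetch):
    """Yield the HTML of page 1, then of each further page in order.

    `fetch(previous_html, page_num)` must return the HTML of `page_num`.
    """
    html, seen = first_html, {1}
    yield html
    while True:
        pending = [n for n in pager_numbers(html) if n not in seen]
        if not pending:
            break
        page_num = pending[0]
        seen.add(page_num)
        html = fetch(html, page_num)
        yield html


# Made-up two-page site: each page links only to the other one.
FAKE_SITE = {
    1: "<h5>Hospital A</h5><a href=\"javascript:__doPostBack('grid','Page$2')\">2</a>",
    2: "<h5>Hospital B</h5><a href=\"javascript:__doPostBack('grid','Page$1')\">1</a>",
}


def fake_fetch(previous_html, page_num):
    """Stand-in for a real POST request."""
    return FAKE_SITE[page_num]


pages = list(iter_pages(FAKE_SITE[1], fake_fetch))
print(len(pages))  # 2
```

Against the live site, `fetch` could wrap the `load_page` function above, for example by parsing the previous HTML with BeautifulSoup, calling `load_page`, and returning `str()` of the result. That wiring is untested here, and it assumes the response to the postback still contains the pager links.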
