
Scraping a website with multiple pages that use the __doPostBack method, where the URL doesn't change between pages

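I am using BeautifulSoup to scrape https://excise.wb.gov.in/chms/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx?Public_District_Code=01
There are two pages of information. To move between them there are links at the top and bottom of the page, such as 1 and 2. These links use __doPostBack: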

href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView2','Page$2')"
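A link like this carries two pieces of information: the control that raises the postback and the argument passed to it. As a minimal sketch (the helper name `parse_postback` is made up for illustration), a regular expression can pull both out of the href:

```python
import re

# Matches __doPostBack('<target>','<argument>') and tolerates stray spaces.
POSTBACK_RE = re.compile(r"__doPostBack\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)")


def parse_postback(href):
    """Return (event_target, event_argument) from a __doPostBack href, or None."""
    match = POSTBACK_RE.search(href)
    if match is None:
        return None
    target, argument = (part.strip() for part in match.groups())
    return target, argument


href = "javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView2','Page$2')"
target, argument = parse_postback(href)
print(target)    # the control that triggers the postback
print(argument)  # the page being requested
```

In an ASP.NET WebForms page these two values are normally submitted as the `__EVENTTARGET` and `__EVENTARGUMENT` form fields.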

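The problem is that the URL doesn't change when navigating from one page to another. Only the second argument of the __doPostBack call changes: for page 1 it is Page$1 and for page 2 it is Page$2. How can I iterate over multiple pages with BeautifulSoup and extract the information? The form data is as follows.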

ctl00$ScriptManager1: ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$GridView2
ctl00$ContentPlaceHolder1$ddl_District: 019
ctl00$ContentPlaceHolder1$rdo_Govt_Flag: G
__EVENTTARGET: ctl00$ContentPlaceHolder1$GridView2
__EVENTARGUMENT: Page$2

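The form data also contains a variable called __VIEWSTATE, but its content is very large. I have looked at several solutions and posts that suggest inspecting the parameters of the POST call and reusing them, but I can't make sense of the parameters in the POST.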
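Fields such as __VIEWSTATE do not need to be understood, only copied: they are hidden inputs in the page's form, and the server expects them to be sent back unchanged. The sketch below shows one way to collect them. It uses only the standard library so it runs without network access, and the sample markup and its values are made up (the real page's values are far longer).

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name and (attributes.get("type") or "").lower() == "hidden":
            self.fields[name] = attributes.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden form fields found in html."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


# Made-up sample markup for illustration only.
SAMPLE_HTML = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtMTIzNDU2Nzg5" />
  <input type="hidden" name="__EVENTVALIDATION" value="AbCdEf" />
  <input type="text" name="search" value="ignored" />
</form>
"""

state = hidden_fields(SAMPLE_HTML)
print(state)
```

The resulting dictionary can be merged into the POST payload as-is, however large the values are.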

You can use this example of how to load the next page on this site with requests:

import requests
from bs4 import BeautifulSoup


url = "https://excise.wb.gov.in/chms/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx?Public_District_Code=019"
soup = BeautifulSoup(requests.get(url).content, "html.parser")


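def load_page(soup, page_num):
    """POST a postback for the given GridView page and return the parsed response."""
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0",
    }

    # Start from the form's current state (this includes __VIEWSTATE and the
    # other hidden fields). Inputs without a name attribute are skipped.
    payload = {}
    for inp in soup.select("input[name]"):
        payload[inp["name"]] = inp.get("value")

    # Set the postback fields after copying the inputs, so that any empty
    # hidden inputs of the same name cannot overwrite them.
    payload.update(
        {
            "ctl00$ScriptManager1": "ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$GridView2",
            "__EVENTTARGET": "ctl00$ContentPlaceHolder1$GridView2",
            "__EVENTARGUMENT": "Page${}".format(page_num),
            "__LASTFOCUS": "",
            "__ASYNCPOST": "true",
        }
    )

    payload["ctl00$ContentPlaceHolder1$ddl_District"] = "019"
    payload["ctl00$ContentPlaceHolder1$rdo_Govt_Flag"] = "G"
    # Leave the chk_Available checkbox out of the request; pop() avoids a
    # KeyError if the input is not present on the page.
    payload.pop("ctl00$ContentPlaceHolder1$chk_Available", None)

    api_url = "https://excise.wb.gov.in/chms/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx?Public_District_Code=019"
    response = requests.post(api_url, data=payload, headers=headers)
    return BeautifulSoup(response.content, "html.parser")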


# print hospitals from first page:
for h5 in soup.select("h5"):
    print(h5.text)

# load second page
soup = load_page(soup, 2)

# print hospitals from second page
for h5 in soup.select("h5"):
    print(h5.text)

Prints:

 AMRI, Salt Lake - Vivekananda Yuba Bharati Krirangan Salt Lake Stadium (Satellite Govt. Building)
 Calcutta National Medical College and Hospital (Government Hospital)
 CHITTARANJAN NATIONAL CANCER INSTITUTE-CNCI (Government Hospital)
 College of Medicine  Sagore Dutta Hospital (Government Hospital)
 ESI Hospital Maniktala (Government Hospital)
 ESI Hospital Sealdah (Government Hospital)
 I.D. And B.G. Hospital (Government Hospital)
 M R Bangur Hospital (Government Hospital)
 Medical College and Hospital, Kolkata, (Government Hospital)
 Nil Ratan Sarkar Medical College and Hospital (Government Hospital)
 R. G. Kar Medical College and Hospital  (Government Hospital)
 Sambhunath Pandit Hospital (Government Hospital)
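
If the number of pages is not known in advance, one option is to read it from the pager links themselves, since each link carries a Page$N argument. The following is a minimal sketch under that assumption: `find_page_numbers` is a made-up helper and the sample markup is modelled on the href quoted in the question, not copied from the live site.

```python
import re

# Page arguments look like Page$2, Page$3, ... inside the pager hrefs.
PAGE_ARG_RE = re.compile(r"Page\$(\d+)")


def find_page_numbers(html):
    """Return the sorted, de-duplicated page numbers referenced by pager links."""
    return sorted({int(number) for number in PAGE_ARG_RE.findall(html)})


# Made-up pager markup: the current page is plain text, other pages are links.
SAMPLE_PAGER = (
    "<td><span>1</span></td>"
    "<td><a href=\"javascript:__doPostBack("
    "'ctl00$ContentPlaceHolder1$GridView2','Page$2')\">2</a></td>"
)

pages = find_page_numbers(SAMPLE_PAGER)
print(pages)
```

The current page usually has no link, so it does not appear in the result. Each number found this way could be passed to load_page in turn, for example by calling find_page_numbers(str(soup)) on the first page. Whether the state returned by one response is sufficient for requesting the next page has not been verified here, so reusing the soup from the initial GET for every request may be the safer choice.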
