
Unable to scrape names from the second page of a webpage when the url remains unchanged

I'm trying to scrape different agency names from the second page of a webpage using the requests module. I can parse the names from its landing page by sending a GET request to that very URL.

However, to access the names from the second page onward, I need to send a POST request along with the appropriate parameters. I tried to mimic the POST request exactly the way I see it in dev tools, but all I get in return is the following:

<?xml version='1.0' encoding='UTF-8'?>
<partial-response id="j_id1"><redirect url="/ptn/exceptionhandler/sessionExpired.xhtml"></redirect></partial-response>

This is how I've tried:

import requests
from bs4 import BeautifulSoup
from pprint import pprint

link = 'https://www.gebiz.gov.sg/ptn/opportunity/BOListing.xhtml?origin=menu'
url = 'https://www.gebiz.gov.sg/ptn/opportunity/BOListing.xhtml'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")

    payload = {
        'contentForm': 'contentForm',
        'contentForm:j_idt171_windowName': '',
        'contentForm:j_idt187_listButton2_HIDDEN-INPUT': '',
        'contentForm:j_idt192_searchBar_INPUT-SEARCH': '',
        'contentForm:j_idt192_searchBarList_HIDDEN-SUBMITTED-VALUE': '',
        'contentForm:j_id135_0': 'Title',
        'contentForm:j_id135_1': 'Document No.',
        'contentForm:j_id136': 'Match All',
        'contentForm:j_idt853_select': 'ON',
        'contentForm:j_idt859_select': '0',
        'javax.faces.ViewState': soup.select_one('input[name="javax.faces.ViewState"]')['value'],
        'javax.faces.source': 'contentForm:j_idt902:j_idt955_2_2',
        'javax.faces.partial.event': 'click',
        'javax.faces.partial.execute': 'contentForm:j_idt902:j_idt955_2_2 contentForm:j_idt902',
        'javax.faces.partial.render': 'contentForm:j_idt902:j_idt955 contentForm dialogForm',
        'javax.faces.behavior.event': 'action',
        'javax.faces.partial.ajax': 'true'
    }

    s.headers['Referer'] = 'https://www.gebiz.gov.sg/ptn/opportunity/BOListing.xhtml?origin=menu'
    s.headers['Faces-Request'] = 'partial/ajax'
    s.headers['Origin'] = 'https://www.gebiz.gov.sg'
    s.headers['Host'] = 'www.gebiz.gov.sg'
    s.headers['Accept-Encoding'] = 'gzip, deflate, br'

    res = s.post(url,data=payload,allow_redirects=False)
    # soup = BeautifulSoup(res.text,"lxml")
    # for item in soup.select(".commandLink_TITLE-BLUE"):
    #     print(item.get_text(strip=True))
    print(res.text)

How can I parse names from the second page of a webpage when the url remains unchanged?

You can use Selenium to traverse between the pages. The following code will allow you to do this.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time


chrome_options = Options()
#chrome_options.add_argument("--headless")
#chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36")


driver = webdriver.Chrome(service=Service("./chromedriver"), options=chrome_options)
driver.get("https://www.gebiz.gov.sg/ptn/opportunity/BOListing.xhtml?origin=menu")

# Keep clicking the "Next" button until it no longer appears (i.e. the last page).
# Note: find_elements returns an empty list when nothing matches, whereas
# find_element would raise NoSuchElementException rather than return None.
while True:
    time.sleep(5)
    next_buttons = driver.find_elements(By.XPATH, "//input[starts-with(@value, 'Next')]")
    if not next_buttons:
        break
    next_buttons[0].click()
I have not added the code for extracting the agency names. I presume it will not be difficult for you.
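If it helps, here is a minimal parsing helper for that step. It reuses the `.commandLink_TITLE-BLUE` selector from the commented-out BeautifulSoup code in your question; whether that class still matches the agency-name links on every page is an assumption you should verify in dev tools.

```python
from bs4 import BeautifulSoup

def extract_names(html):
    """Return the link texts matched by the question's
    .commandLink_TITLE-BLUE selector from a page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.select(".commandLink_TITLE-BLUE")]
```

Call it with `driver.page_source` inside the paging loop, before clicking the Next button, so you collect the names of every page as you traverse.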

Make sure to install Selenium and download ChromeDriver. Also make sure to download the correct version of the driver; you can confirm your Chrome version by viewing the 'About' section of your Chrome browser.
