
HTTP response headers after timed redirect using Node.js and Puppeteer

How to get the response headers using Puppeteer has already been answered here:

Possible to get HTTP response headers with Nodejs and Puppeteer

However, I have a peculiar situation where the initial URL redirects to another URL after a few seconds.

Here is the pertinent code I'm running:

const browser = await puppeteer.launch({args: ['--no-sandbox', '--disable-setuid-sandbox'], headless: false});

const page = await browser.newPage();

// get the response object of the initial URL
var page_response_obj = await page.goto(url_str, {timeout: PAGE_TIMEOUT_GOTO_MS, waitUntil: 'domcontentloaded'});

// get page title of initial page
var page_title_1_str = await page.title();

// wait for a few seconds to cover the timed redirect
await page.waitFor(6130);

// get page title of final page
var page_title_2_str = await page.title();

I can get the page titles of the two different pages, but I'm not sure how to get the response headers of the second one, given that page_response_obj contains the response headers for the initial URL.

Is it possible to get the response headers of the final URL?

EDIT

I'm using this for websites that have CloudFlare protection, where you need to wait for about 5 seconds before you get redirected to the actual website.

You can use the redirectChain() method of the Request object.

const puppeteer = require ('puppeteer')
const url = 'http://doodle.google.com/'

;(async () => {
    const browser = await puppeteer.launch({
        args: ['--no-sandbox', '--disable-setuid-sandbox'],
        headless: true
    })

    const page = (await browser.pages())[0]

    // navigate; for HTTP redirects, goto() resolves with the response of the last redirect
    const response = await page.goto(url, {timeout: 0, waitUntil: 'domcontentloaded'})

    // headers of the response that goto() resolved with
    console.log ( response.headers() )

    // get page title of initial page
    const title1 = await page.title()

    const chain = response.request().redirectChain()

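    // If the page redirected over HTTP, the headers of each redirect response are shown here
    for ( const request of chain ) {
        const redirectResponse = request.response() // can be null if no response was received
        if ( redirectResponse ) console.log( redirectResponse.headers() )
        // console.log( request.url() ) // => print the URL
    }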

    // get page title of final page
    const title2 = await page.title()

    // close the browser so the script can exit
    await browser.close()
})()
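
As a rough illustration of the redirect-chain approach, the sketch below gathers the URL, status, and headers of every hop into one array. The helper name summarizeRedirectChain is hypothetical (it is not part of Puppeteer); it only relies on response.request(), request.redirectChain(), request.response(), url(), status(), and headers(). Note that it only sees HTTP (3xx) redirects, not script-driven ones.

```javascript
// Hypothetical helper (not part of Puppeteer): flattens the HTTP redirect
// chain behind a navigation response into an array of plain objects.
function summarizeRedirectChain(response) {
    // page.goto() can resolve with null, so guard against that
    if (!response) {
        return [];
    }

    const hops = [];

    // each entry of redirectChain() is a Request that was answered with a redirect
    for (const request of response.request().redirectChain()) {
        const hopResponse = request.response(); // null if no response was received
        hops.push({
            url: request.url(),
            status: hopResponse ? hopResponse.status() : null,
            headers: hopResponse ? hopResponse.headers() : {}
        });
    }

    // the response that the navigation resolved with goes last
    hops.push({
        url: response.url(),
        status: response.status(),
        headers: response.headers()
    });

    return hops;
}
```

It could be called as summarizeRedirectChain(await page.goto(url)); the last element then describes the response that goto() resolved with.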

On closer inspection, it turns out that some redirects are triggered on the front end (via a script), and so are not captured in a standard redirect chain. As a result, I didn't have success with Edi's suggestion.

So here's what I needed to change to get things to work:

  1. Use a response event handler.
  2. Wait for a long while (30 to 45 seconds) to make sure that you capture the relevant response after redirection. You can adjust the length of time if you need to.

In my case, I was trying to determine whether gzip is enabled, so I needed a valid response object for the final URL. Here's the revised code:

// define url and host
var url_str = 'https://www.example.com';
var url_host_str = 'example.com';

// define GZIP test function
var _checkGZIP = function(resp_headers_obj)
{
    var resp_header_content_encoding_str = resp_headers_obj['content-encoding'];
    var is_gzip_bool = !!(/gzip/i.test(resp_header_content_encoding_str));

    return is_gzip_bool;
};

const browser = await puppeteer.launch({args: ['--no-sandbox', '--disable-setuid-sandbox'], headless: true});

const page = await browser.newPage();

// set result variable outside event handler scope
var is_gzip_bool = false;

/**
 * Set response event handler
 * This will capture all responses from the initial URL and from final URL
 */
page.on('response', function(response_obj)
{
    // get URL and headers
    var resp_url_str = response_obj.url();
    var resp_headers_obj = response_obj.headers();

    // skip further checks once gzip has been detected
    if(!is_gzip_bool)
    {
        // check for only specific URLs
        if(/^ *https?\:\/\/([^\?\/]+)(\/|)([^\n\r\?\.]+|) *$/i.test(resp_url_str) && resp_url_str.includes(url_host_str))
        {
            // do gzip test
            is_gzip_bool = _checkGZIP(resp_headers_obj);
        }
    }
});

// go to page
await page.goto(url_str, {timeout: PAGE_TIMEOUT_GOTO_MS, waitUntil: 'domcontentloaded'});

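// wait for a long while to capture all relevant responses [from both initial and final URL]
// (plain timer; page.waitFor() is deprecated and absent from recent Puppeteer releases)
await new Promise(function(resolve) { setTimeout(resolve, 30000); });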

// document your result if required

// close browser
await browser.close();
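
If the fixed 30-second wait is too slow, one alternative is to resolve as soon as a matching response arrives and fall back to a timeout only when none does. The following is a minimal sketch under stated assumptions, not a drop-in replacement: waitForMatchingResponse, isGzipEncoded, and isOnHost are hypothetical helper names, the host check is a simpler stand-in for the URL regex above, and the only page features used are page.on('response', ...) and page.off(...). Puppeteer's built-in page.waitForResponse() may also be worth a look for the same purpose.

```javascript
// Hypothetical helpers; only page.on() and page.off() come from Puppeteer.

// True when the Content-Encoding header mentions gzip.
function isGzipEncoded(headers) {
    return /gzip/i.test(headers['content-encoding'] || '');
}

// True when the URL parses and its hostname is `host` or a subdomain of it.
function isOnHost(url, host) {
    try {
        const hostname = new URL(url).hostname;
        return hostname === host || hostname.endsWith('.' + host);
    } catch (err) {
        return false; // not a parseable absolute URL
    }
}

// Resolves with the first response accepted by `predicate`,
// or with null once `timeoutMs` has elapsed without a match.
function waitForMatchingResponse(page, predicate, timeoutMs) {
    return new Promise((resolve) => {
        // stop the timer and detach the listener before resolving
        const finish = (value) => {
            clearTimeout(timer);
            page.off('response', onResponse);
            resolve(value);
        };
        const onResponse = (response) => {
            if (predicate(response)) {
                finish(response);
            }
        };
        const timer = setTimeout(() => finish(null), timeoutMs);
        page.on('response', onResponse);
    });
}

// Usage sketch (start waiting before navigating so no response is missed):
//   const pending = waitForMatchingResponse(page,
//       (r) => r.ok() && isOnHost(r.url(), url_host_str), 45000);
//   await page.goto(url_str, {waitUntil: 'domcontentloaded'});
//   const final_response_obj = await pending;
//   const is_gzip_bool = final_response_obj
//       ? isGzipEncoded(final_response_obj.headers()) : false;
```

The r.ok() check in the usage sketch is an assumption about the challenge page: it skips responses whose status is outside 200-299, which may or may not match how a given protected site behaves.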
