node.js request a webpage with async scripts

I'm downloading a webpage using the request module, which is very straightforward.

My problem is that the page I'm trying to download has some async scripts (script tags with the async attribute), and they are not downloaded along with the HTML document returned by the HTTP request.

My question is how I can make an HTTP request, with or without the request module (preferably with), and have the WHOLE page downloaded, without exceptions such as the one described above.

Sounds like you are trying to do web scraping using JavaScript.

Using request is a very fundamental approach, which may be too low-level and time-consuming for your needs. The topic is pretty broad, but you should look into more purpose-built modules such as cheerio, x-ray and nightmare.

x-ray will let you select elements directly from the page in a jQuery-like way instead of parsing the whole body.

nightmare provides a modern headless browser, which makes it possible for you to enter input as though you were using the browser manually. With this you should be better able to handle the AJAX-type requests that are causing you problems.

HTH and good luck!

Using only request, you could try the following approach to pull the async scripts.

Note: I have tested this with a very basic setup, and there is work to be done to make it robust. However, it worked for me:

Test setup

To set up the test I created an HTML file which includes a script in the body like this: <script src="abc.js" async></script>

Then I created a temporary server to serve it (httpster).
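If you would rather not install httpster, a similar fixture could be sketched with Node's built-in http module. This is only an illustration of the test setup described above: the page and script contents, and the names resolveRoute and createTestServer, are assumptions and not part of the original test.

```javascript
const http = require('http');

// Hypothetical fixture contents; the original test files are not shown.
const PAGE = '<!doctype html><html><body><h1>Test</h1>' +
    '<script src="abc.js" async></script></body></html>';
const SCRIPT = 'console.log("abc.js loaded");';

// Map a request path to a response description.
// Kept as a pure function so it can be checked without opening a socket.
function resolveRoute(path) {
    if (path === '/' || path === '/index.html') {
        return { status: 200, type: 'text/html', body: PAGE };
    }
    if (path === '/abc.js') {
        return { status: 200, type: 'application/javascript', body: SCRIPT };
    }
    return { status: 404, type: 'text/plain', body: 'Not found' };
}

// Build (but do not start) the server; the caller chooses the port.
function createTestServer() {
    return http.createServer((req, res) => {
        const route = resolveRoute(req.url);
        res.writeHead(route.status, { 'Content-Type': route.type });
        res.end(route.body);
    });
}
```

Calling createTestServer().listen(3333) should then serve the page at http://localhost:3333/, which matches the URL used in the scraper.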

Scraper

"use strict";

const request = require('request');

const options1 = { url: 'http://localhost:3333/' }

// hard coded script name for test purposes
const options2 = { url: 'http://localhost:3333/abc.js' }

let htmlData = '';  // store html page here (initialised so += does not prepend "undefined")

request.get(options1)
    .on('response', resp => resp.on('data', d => htmlData += d))
    .on('end', () => {
        let scripts = ''; // store scripts here (initialised so += does not prepend "undefined")

        // htmlData contains webpage
        // Use an HTML parser to find all script tags with the async attribute
        // and resolve their src values against the base url
        // NOT DONE FOR THIS EXAMPLE

        request.get(options2)
            .on('response', resp => resp.on('data', d => scripts += d))
            .on('end', () => {
                let allData = htmlData.toString() + scripts.toString();
                console.log(allData);
            })
            .on('error', err => console.log(err))
    })
    .on('error', err => console.log(err))

This basic example works. You will still need to find all the JS scripts on the page and extract their URLs, which I have not done here.
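As a rough idea of what that missing step could look like, here is a minimal sketch that scans the downloaded HTML for script tags carrying the async attribute and resolves each src against the page URL. The function name findAsyncScriptUrls is hypothetical, and the regular expressions are an assumption chosen to keep the sketch dependency-free; they will miss unusual markup, so a real HTML parser such as cheerio is probably the safer choice.

```javascript
// Rough regex-based scan; not a substitute for a real HTML parser.
// URL is a global in current Node.js versions.
function findAsyncScriptUrls(html, baseUrl) {
    const urls = [];
    const tagPattern = /<script\b([^>]*)>/gi;
    let match;
    while ((match = tagPattern.exec(html)) !== null) {
        const attrs = match[1];
        // Only keep tags that carry the async attribute.
        if (!/(^|\s)async(\s|=|$)/i.test(attrs)) continue;
        // Accept double-quoted, single-quoted or unquoted src values.
        const src = /(?:^|\s)src\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/i.exec(attrs);
        const value = src && (src[1] || src[2] || src[3]);
        if (!value) continue; // inline script or empty src
        // Resolve relative paths such as "abc.js" against the page URL.
        urls.push(new URL(value, baseUrl).href);
    }
    return urls;
}
```

The returned list could then take the place of the hard-coded abc.js URL, with one request.get call per entry.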
