
How to get the HTML from a website using NodeJS?

I know this is a pretty basic question, but I can't get anything working. I have a list of URLs and I need to get the HTML from them using NodeJS.

I have tried using Axios, but the response returned is always undefined. I am hitting the endpoint /process-logs with a POST request, and the body contains logFiles (an array).

router.post("/process-logs", function (req, res, next) {
  fileStrings = req.body.logFiles;

  for (var i = 0; i < fileStrings.length; i++) {
    axios(fileStrings[i]).then(function (response) {
      console.log(response.body);
    });  
  }

  res.send("done");
});

A sample fileString is of the form https://amazon-artifacts.s3.ap-south-1.amazonaws.com/q-120/log1.txt .

How can I parallelize this process to do the same task for multiple files at a time?

I can think of two approaches here:

  1. The first is to use ES6 promises (Promise.all) together with async/await, splitting the fileStrings array into chunks of n files. This is a basic approach, and you have to handle many cases yourself (for example, a single failed request rejects the whole Promise.all).
  • This is the general idea of the flow I am thinking of:

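    const axios = require("axios");

    const limit = 5; // files per chunk (example value, tune as needed)

    // Start every request in the chunk and wait for all of them.
    async function handleChunk(chunk) {
      const toBeFulfilled = [];
      for (const file of chunk) {
        toBeFulfilled.push(axios.get(file)); // replace axios.get with logic per file
      }
      return Promise.all(toBeFulfilled);
    }

    async function main(fileStrings) {
      for (let i = 0; i < fileStrings.length; i += limit) {
        const chunk = fileStrings.slice(i, i + limit);
        const results = await handleChunk(chunk);
        // axios puts the payload on response.data; response.body is undefined
        console.log(results.map((response) => response.data));
      }
    }

    // call this inside the route handler, where req is in scope
    main(req.body.logFiles)
      .then(() => console.log("done"))
      .catch((e) => console.log(e));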

One drawback is that the chunks are processed sequentially (chunk by chunk, which is still better than file by file). One enhancement would be to split fileStrings into chunks ahead of time and process the chunks concurrently; it really depends on what you're trying to achieve and what limitations you have.

  2. The second approach is to use the Async library, which has many control flows and collections that let you configure the concurrency, etc. (I really recommend this approach.)

You should have a look at Async's Queue Control Flow to run the same task for multiple files concurrently.
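
To make the idea of a concurrency-limited queue concrete, here is a minimal dependency-free sketch. It is not the Async library's API. The names mapWithLimit and fakeDownload are hypothetical, and fakeDownload only stands in for a real HTTP request so the sketch runs offline. Unlike the chunked loop, a new file starts as soon as any running one finishes, and a failed file is recorded instead of aborting the rest.

```javascript
// Hypothetical helper: run `worker` over `items` with at most `limit`
// calls in flight at once (limit is assumed to be >= 1).
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item nobody has claimed yet

  // Each runner keeps claiming the next item until none remain.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      try {
        results[index] = { ok: true, value: await worker(items[index]) };
      } catch (error) {
        // Record the failure so one bad file does not abort the others.
        results[index] = { ok: false, error };
      }
    }
  }

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results; // same order as `items`
}

// Stand-in for a real download (assumption, for illustration only).
function fakeDownload(url) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(`contents of ${url}`), 10);
  });
}

// Usage: three files, at most two downloads running at a time.
mapWithLimit(["log1.txt", "log2.txt", "log3.txt"], 2, fakeDownload)
  .then((results) => console.log(results));
```

In the route from the question, fakeDownload would be replaced by something like (url) => axios.get(url).then((response) => response.data), and res.send would move into the .then so the reply goes out only after the downloads finish.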
