
JavaScript asynchronous web crawler

I have an async function that reads a list of websites from a csv file.

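const fs = require('fs');
const readline = require('readline');

async function readCSV(){
  const fileStream = fs.createReadStream('./topm.csv');

  // crlfDelay: Infinity treats \r\n as a single line break
  const rl = readline.createInterface({
    input: fileStream,
    crlfDelay: Infinity
  });

  // Lines are handled one at a time: each check finishes before the next line is read
  for await (const line of rl) {
    var currentline=line.split(",");

    var res_server_http = await check_page("http://www."+currentline[1])
  }

}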

Each time I read a site, I call the check_page function, which performs some operations. I wait for each call to finish before moving on to the next site.

async function check_page(web_page){
     // do some operations....

}
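The sequential behaviour described above can be sketched without the CSV file. This is only an illustration: the check_page body is a hypothetical stand-in that waits 10 ms, processLines is a made-up helper name, and the sample lines assume a "rank,domain" layout, which the question does not confirm.

```javascript
// Records when each check starts and ends, so the ordering is observable.
const events = [];

// Hypothetical stand-in for check_page: simulates slow work with a timer.
async function check_page(web_page) {
  events.push("start " + web_page);
  await new Promise((resolve) => setTimeout(resolve, 10));
  events.push("end " + web_page);
}

// Handles the lines one by one; `await` keeps the next check from starting
// until the current one has finished.
async function processLines(lines) {
  for (const line of lines) {
    const currentline = line.split(",");
    await check_page("http://www." + currentline[1]);
  }
  return events;
}
```

Calling processLines(["1,example.com", "2,example.org"]) should record start/end for the first site before anything is recorded for the second.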

Up to this point it works correctly, but now I have to integrate my code with a web crawler. Inside the readCSV function I have to run the crawler for every site that I read, and for each one I should call the check_page function.

I've now edited readCSV like this:

const fileStream = fs.createReadStream('./topm.csv');

  const rl = readline.createInterface({
    input: fileStream,
    crlfDelay: Infinity
  });

for await (const line of rl) {
    var currentline=line.split(",");

    await (new Promise( resolve => {
      new Crawler().configure({depth: 2})
      .crawl(site, async (page) => {
          //console.log(page.url);
          var res_server_http = await check_page("http://www."+currentline[1])

          // Resolve here
          resolve();
      });
    }));
  
  }

I'm using this package for the web crawler: https://www.npmjs.com/package/js-crawler

This function now doesn't work because it is not async. How can I change my code?


Now I get this error:

(node:907) UnhandledPromiseRejectionWarning: ReferenceError: site is not defined
at /Users/francesco/Desktop/tesi/crawler.js:55:14
at new Promise (<anonymous>)
at readCSV (/Users/francesco/Desktop/tesi/crawler.js:53:12)
at processTicksAndRejections (internal/process/task_queues.js:97:5)

(node:907) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag --unhandled-rejections=strict (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 2)
(node:907) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
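The trace points at the Promise executor: reading a name that was never declared throws a ReferenceError there, and an exception thrown inside an executor rejects the promise. With no catch around the await, Node reports the rejection as unhandled. A minimal sketch of that mechanism follows; the function names are made up for illustration and no crawler is involved.

```javascript
// `site` is deliberately left undeclared here, so reading it throws a
// ReferenceError inside the executor, which turns into a rejection.
function crawlWithUndeclaredName() {
  return new Promise((resolve) => {
    resolve(site);
  });
}

// Awaiting inside try/catch handles the rejection instead of leaving it
// to surface as an UnhandledPromiseRejectionWarning.
async function describeFailure() {
  try {
    await crawlWithUndeclaredName();
    return "no error";
  } catch (err) {
    return err.name + ": " + err.message;
  }
}
```

Here describeFailure() resolves to a string naming the ReferenceError rather than crashing, which suggests the fix is to declare `site` before passing it to crawl and, ideally, to keep a try/catch around the await.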

Wrap the crawler call in a Promise and await it inside the loop:

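  for await (const line of rl) {
    const currentline = line.split(",");
    // `site` must be declared before it is passed to crawl(); it is assumed
    // here to be the URL built from the second CSV column.
    const site = "http://www." + currentline[1];

    // Wrap the callback-style crawl in a Promise so the loop can await it.
    await (new Promise( resolve => {
      new Crawler().configure({depth: 2})
      .crawl(site, async (page) => {
          //console.log(page.url);
          var res_server_http = await check_page(site)

          // Resolve here (this fires once the first crawled page has been checked)
          resolve();
      });
    }));
  }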
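Note that the snippet above calls resolve() inside the per-page callback, so the loop moves on after the first page, and an exception in check_page would leave the promise pending. One possible refinement is to resolve only when the whole crawl has finished. The sketch below assumes the crawler's crawl method accepts an options object with url, success, failure and finished callbacks, which is how I read the js-crawler README; please check it against the version you use. crawlSite and FakeCrawler are hypothetical names, and FakeCrawler only stands in for the real library so the example runs on its own.

```javascript
// Hypothetical helper: runs onPage for every crawled page, one after another,
// and settles only after the crawler reports that it has finished.
function crawlSite(crawler, site, onPage) {
  return new Promise((resolve, reject) => {
    const errors = [];
    let chain = Promise.resolve();
    crawler.crawl({
      url: site,
      success: (page) => {
        // Queue the page handler and remember any error it raises.
        chain = chain
          .then(() => onPage(page))
          .catch((err) => { errors.push(err); });
      },
      failure: () => {}, // failed pages are ignored in this sketch
      finished: (crawledUrls) => {
        // Wait for queued handlers, then report the first error if any.
        chain.then(() => (errors.length ? reject(errors[0]) : resolve(crawledUrls)));
      },
    });
  });
}

// Stand-in crawler (not the real library): "finds" two pages asynchronously.
class FakeCrawler {
  crawl({ url, success, finished }) {
    const urls = [url, url + "/about"];
    setTimeout(() => {
      urls.forEach((u) => success({ url: u }));
      finished(urls);
    }, 0);
  }
}
```

Inside the loop this would be used as await crawlSite(new Crawler().configure({depth: 2}), site, (page) => check_page(page.url)), assuming check_page should receive each crawled page's URL.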
