
Node.js Force to Wait for Function to Finish

I have a for-loop in a program I am running with Node.js. The function is x() from the x-ray package, and I am using it to scrape data from a webpage and write that data to a file. The program works when scraping ~100 pages, but I need to scrape ~10,000 pages. When I try to scrape a very large number of pages, the files are created but they do not hold any data. I believe this happens because the for-loop is not waiting for x() to return the data before moving on to the next iteration.

Is there a way to make node wait for the x() function to complete before moving on to the next iteration?

//takes in a file of urls, one per line, and splits them into an array,
//then scrapes each webpage and writes the content to a file named for
//the pmid number that represents the study

var fs = require('fs');
var Xray = require('x-ray');

//split urls into an array
var array = fs.readFileSync('Desktop/formatted_urls.txt').toString().split("\n");

var x = new Xray();

for (var i in array) {
    //get the unique number and url from the array to be put into the text file name
    var number = array[i].substring(35);
    var url = array[i];

    //use the .write function of x from x-ray to write the info to a file
    x(url, 'css selectors').write('filepath' + number + '.txt');
}

Note: Some of the pages I am scraping do not return any value.

You can't make a for loop wait for an async operation to complete. To solve this type of problem, you have to do a manual iteration and you need to hook into a completion function for the async operation. Here's the general outline of how that would work:

var index = 0;
function next() {
    if (index < array.length) {
        var url = array[index];
        x(url, ....)(function(err, data) {
            ++index;
            next();
        });
    }
}
next();

Or, perhaps this:

var index = 0;
function next() {
    if (index < array.length) {
        var url = array[index];
        var number = array[index].substring(35);
        x(url, 'css selectors').write('filepath' + number + '.txt').on('finish', function() {
            ++index;
            next();
        });
    }
}
next();
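The same sequential pattern reads more naturally with async/await (Node 8+). This is a sketch, not the original answer's code: it assumes a hypothetical scrapeToFile() helper that wraps the x-ray call in a Promise, stubbed out below so the example is self-contained.

```javascript
// Stand-in for the real scrape: in the actual program this would be
//   x(url, 'css selectors').write('filepath' + number + '.txt').on('finish', resolve);
// Here it just resolves asynchronously so the sketch can run on its own.
function scrapeToFile(url) {
    return new Promise(function (resolve) {
        setImmediate(function () { resolve(url + ' done'); });
    });
}

// await pauses the loop until the current scrape finishes,
// so the urls are processed strictly one at a time.
async function processAll(urls) {
    var results = [];
    for (var i = 0; i < urls.length; i++) {
        results.push(await scrapeToFile(urls[i]));
    }
    return results;
}
```

In the real program, scrapeToFile() would resolve from the write stream's 'finish' event rather than setImmediate().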

The problem with your code is that you are not waiting for the files to be written to the file system. Rather than downloading the pages strictly one at a time, a better way is to start all of the downloads in one go and then wait until they have all completed.

One of the recommended libraries for dealing with promises in Node.js is Bluebird.

http://bluebirdjs.com/docs/getting-started.html

In the updated sample below, we iterate through all of the urls, start each download, and keep track of a promise for each one; each promise is resolved once its file has been written. Finally, we wait for all of the promises to resolve using Promise.all().

Here's the updated code:

var promises = [];
var getDownloadPromise = function(url, number){
    return new Promise(function(resolve){
        x(url, 'css selectors').write('filepath' + number + '.txt').on('finish', function(){
            console.log('Completed ' + url);
            resolve();
        });
    });
};

for (var i in array) {
    var number = array[i].substring(35);
    var url = array[i];

    promises.push(getDownloadPromise(url, number));
}

Promise.all(promises).then(function(){
    console.log('All urls have been completed');
});
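With ~10,000 urls, starting every download at once may exhaust sockets or memory. A common refinement is to cap how many scrapes run concurrently. Below is a minimal sketch of such a pool; startDownload() is a hypothetical stand-in for getDownloadPromise() above, stubbed so the example runs on its own.

```javascript
// Stand-in for getDownloadPromise(url, number): resolves asynchronously.
function startDownload(url) {
    return new Promise(function (resolve) {
        setImmediate(function () { resolve(url); });
    });
}

// Runs at most `limit` downloads at a time. Each worker takes the next
// url off the shared index when its current download finishes.
function runWithLimit(urls, limit) {
    var index = 0;
    var results = [];
    function worker() {
        if (index >= urls.length) return Promise.resolve();
        var url = urls[index++];
        return startDownload(url).then(function (r) {
            results.push(r);
            return worker(); // pick up the next url
        });
    }
    var workers = [];
    for (var k = 0; k < limit && k < urls.length; k++) {
        workers.push(worker());
    }
    return Promise.all(workers).then(function () { return results; });
}
```

Bluebird, mentioned above, offers this directly: Promise.map(urls, fn, { concurrency: 5 }).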
