
How to scrape a list of URLs and store the data in a global variable using requestJS and cheerioJS?

I have a list of URLs, say four of them. For each one I'd like to scrape some information and store it in a global variable called allData. My code looks like this:

var request = require('request');
var cheerio = require('cheerio');

var urls = [url1,url2,url3,url4];
var allData = [];

for(var url in urls){
      request(url, function(err,response,body){
         var $ = cheerio.load(body);
         var data = $('h1.large','#title_main').text();
         allData.push(data);
   });
}

However, I realize this won't work because request is asynchronous. After the loop finishes, every entry in allData is the same and comes from url4. Any idea how I can fix this? I really need this functionality.

Glad you found a solution that worked for you.

You may already know this, since 9 months have passed, but for future reference: you could also use native JavaScript Array functions that "close" over the scope of each iteration, which avoids adding another dependency to your project. I do this all the time in my own scrapers, using .forEach():

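urls.forEach(function(url){
    // each call of this function gets its own `url`
    request(url, function(err,response,body){
        if (err) { return console.error(url, err); } // skip failed requests
        var $ = cheerio.load(body);
        var data = $('h1.large','#title_main').text();
        allData.push(data);
    });
});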

Array.prototype has a handful of functional-programming-style methods, such as .forEach(), .map() and .filter(), that call a function once for every element of the array. Because each call gets its own parameters, the values are effectively frozen for that iteration, so asynchronous code inside the callback still sees the right element after the loop has moved on.
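To see the difference without touching the network, here is a minimal sketch. fakeRequest is a hypothetical stand-in for request (it just calls back asynchronously), and the URL strings are placeholders. Note that for...in iterates over the array's keys ("0", "1", ...) rather than its values, and that a var declared in the loop is shared by every callback.

```javascript
// Hypothetical stand-in for request(): calls back asynchronously with a
// fake body so the sketch runs without network access.
function fakeRequest(url, callback) {
  setTimeout(function () {
    callback(null, { statusCode: 200 }, 'body of ' + url);
  }, 0);
}

var urls = ['url1', 'url2', 'url3', 'url4']; // placeholder strings

// for...in yields keys, and `var key` is function-scoped, so all four
// callbacks read the same variable after the loop has finished.
function withForIn(done) {
  var seen = [];
  for (var key in urls) {
    fakeRequest(urls[key], function () {
      seen.push(key); // by now `key` holds the last key, '3'
      if (seen.length === urls.length) done(seen);
    });
  }
}

// forEach calls the function once per element, so each callback closes
// over its own `url` parameter.
function withForEach(done) {
  var seen = [];
  urls.forEach(function (url) {
    fakeRequest(url, function () {
      seen.push(url);
      if (seen.length === urls.length) done(seen);
    });
  });
}

withForIn(function (seen) { console.log('for...in:', seen); });   // every entry is '3'
withForEach(function (seen) { console.log('forEach:', seen); });  // one entry per URL
```

Declaring the loop variable with let (in a for...of loop) would also give each iteration its own binding; forEach achieves the same thing with a function parameter.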

The forEach loop above starts four requests, which run asynchronously. Each one is given one of the URLs in your array. allData has a result appended as each request finishes, so its entries are in completion order, which is not necessarily the order of urls.
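Because the results arrive at different times, code that runs right after the loop will still see an empty allData. One common way to find out when everything has finished is to count outstanding requests. The following is only a sketch under assumptions: fakeRequest is a hypothetical stand-in for request, scrapeAll and its done callback are names made up for illustration, and the results are collected in a local array rather than a global.

```javascript
// Hypothetical stand-in for request(); 'url1' is made slower on purpose
// so that completions arrive out of order.
function fakeRequest(url, callback) {
  var delay = url === 'url1' ? 30 : 5;
  setTimeout(function () {
    callback(null, { statusCode: 200 }, '<h1>' + url + '</h1>');
  }, delay);
}

// Starts one request per URL and calls done(allData) after the last
// callback has fired, whether it succeeded or failed.
function scrapeAll(urls, done) {
  var allData = [];
  var pending = urls.length;
  if (pending === 0) { return done(allData); }

  urls.forEach(function (url) {
    fakeRequest(url, function (err, response, body) {
      if (!err) { allData.push(body); } // completion order, not urls order
      pending -= 1;
      if (pending === 0) { done(allData); }
    });
  });
}

scrapeAll(['url1', 'url2', 'url3', 'url4'], function (allData) {
  console.log(allData.length + ' results:', allData);
});
```

In this sketch the entry for url1 ends up last, because its simulated request takes longest.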

If you need the results in the same order as urls, you can use the index that forEach passes to the callback as its second argument:

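urls.forEach(function(url,index){
    request(url, function(err,response,body){
        // on error, allData[index] is simply left unset
        if (err) { return console.error(url, err); }
        var $ = cheerio.load(body);
        var data = $('h1.large','#title_main').text();
        allData[index] = data; // slot matches the position of url in urls
    });
});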
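A different technique that gives both ordering and a clear "all done" signal is to wrap each request in a Promise and combine them with Promise.all, which resolves with the results in the order of the input array regardless of which request finishes first. This is an alternative to the index approach, not what the code above does, and it is only a sketch: fakeRequest is a hypothetical stand-in for request, and fetchTitle and scrapeInOrder are illustrative names.

```javascript
// Hypothetical stand-in for request(); 'url1' is deliberately the slowest.
function fakeRequest(url, callback) {
  var delay = url === 'url1' ? 30 : 5;
  setTimeout(function () {
    callback(null, { statusCode: 200 }, 'title of ' + url);
  }, delay);
}

// Wraps one callback-style request in a Promise.
function fetchTitle(url) {
  return new Promise(function (resolve, reject) {
    fakeRequest(url, function (err, response, body) {
      if (err) { return reject(err); }
      resolve(body); // real code would run cheerio on body here
    });
  });
}

// Resolves with one result per URL, in the same order as `urls`.
function scrapeInOrder(urls) {
  return Promise.all(urls.map(function (url) { return fetchTitle(url); }));
}

scrapeInOrder(['url1', 'url2', 'url3', 'url4'])
  .then(function (allData) { console.log(allData); })
  .catch(function (err) { console.error('a request failed:', err); });
```

Keep in mind that Promise.all rejects as soon as any one request fails, so a single bad URL discards the other results unless each promise handles its own error.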
