
Node.js Multi-page Crawler

I am trying to crawl the pages of a website. Here is my sample code. I used Stack Overflow only for testing; I don't actually want to crawl Stack Overflow.

In this code I want to collect every link on a page, push it into an array, then visit each link and search for "Node" (it's just a test).

var request = require('request');
var cheerio = require('cheerio');

var pages = 20;
var counter = 1;
while(counter<=pages){

    var siteUrl = "http://stackoverflow.com/unanswered/tagged/?page="+counter+"&tab=votes";
    var queue = [];
    request(siteUrl, function(error, response, html){
        if(!error){
            var $ = cheerio.load(html);
            // Extract All links in page
            links = $('a');
            $(links).each(function(i, link){
                queue.push("http://stackoverflow.com"+$(link).attr('href'));
            });
        }
        // Search For Node.js on every question.
        queue.each(function(i,linkItem){
            request(linkItem, function(error, response, html){
                var page = cheerio.load(html);
                var ser = page.match(/node/i);
                if (ser & ser.lenght > 0){
                    console.log(page);
                }
            });
        })
    })

    counter ++;
}

When I run this code, it only shows the links from the first page and then fails with an error saying the object has no method 'each'.

I would be grateful if you could tell me where I went wrong, or whether my code is even the right approach.

First of all, you are mixing asynchronous and synchronous code, which does not work well here: the while loop runs to completion before any request callback fires. The main problem is that the queue variable you are trying to iterate over is a plain array, which has no each method. You can use lodash for that, or just replace the function call with a simple for loop.

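var i, item;
for (i = 0; i < queue.length; i++) {
    item = queue[i];
    request(item, function(error, response, html){
        // skip failed requests, html is not usable then
        if (error) { return; }
        // match against the raw HTML string; the object returned by cheerio.load has no match method
        var ser = html.match(/node/i);
        if (ser && ser.length > 0){
            console.log(html);
        }
    });
}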
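If you would rather keep the callback style of the original, plain arrays do have forEach. Note that its callback receives (item, index), the reverse of the (index, item) order used by cheerio's each. A minimal sketch with made-up example.com links and no network access; the visited array and the hasEach flag exist only for illustration:

```javascript
// Example links; in the real crawler these come from the page.
var queue = ['http://example.com/q/1', 'http://example.com/q/2'];
var visited = [];

// Plain arrays have no each method, which is what triggers the error.
var hasEach = typeof queue.each === 'function';

// forEach passes the item first and the index second.
queue.forEach(function (linkItem, i) {
    visited.push(i + ': ' + linkItem);
});

console.log(hasEach, visited);
```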

Besides, I wrote a tutorial that covers exactly what you are trying to do.
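On the async point: the while loop in the question starts every page request before any callback has run, and because var is function-scoped, all callbacks share whichever queue array was assigned last. One way to avoid both problems is to process the pages one at a time with async/await. This is only a sketch under assumptions, not the tutorial's code: fetchHtml is a hypothetical function that takes a URL and returns a Promise of the page HTML (you would implement it with your HTTP client), and extractLinks uses a crude regular expression as a stand-in for cheerio so the example has no dependencies.

```javascript
// Stand-in for cheerio: pull href values out of double-quoted attributes
// and resolve them against the page URL (works for relative and absolute links).
function extractLinks(html, baseUrl) {
    const links = [];
    const re = /href="([^"]+)"/g;
    let m;
    while ((m = re.exec(html)) !== null) {
        links.push(new URL(m[1], baseUrl).href);
    }
    return links;
}

// fetchHtml is hypothetical: (url) => Promise resolving to the page HTML.
async function crawl(pageUrls, fetchHtml) {
    const matches = [];
    for (const pageUrl of pageUrls) {
        // wait for the listing page before moving on
        const listing = await fetchHtml(pageUrl);
        // a fresh queue for each page, not one shared variable
        const queue = extractLinks(listing, pageUrl);
        for (const link of queue) {
            const html = await fetchHtml(link);
            if (/node/i.test(html)) {
                matches.push(link);
            }
        }
    }
    return matches;
}
```

You would call it as crawl(urls, fetchHtml).then(console.log). Requests run strictly one after another, which is slow but keeps the order predictable and is gentler on the target site.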
