
Node.js Multi-page Crawler

I am trying to crawl the pages of a website. Here is my sample code. I used stackoverflow only for testing; I do not want to crawl stackoverflow.

In this code I want to get every link on a page and push it into an array, then visit every link and search for Node (it's just a test).

var request = require('request');
var cheerio = require('cheerio');

var pages = 20;
var counter = 1;
while(counter<=pages){

    var siteUrl = "http://stackoverflow.com/unanswered/tagged/?page="+counter+"&tab=votes";
    var queue = [];
    request(siteUrl, function(error, response, html){
            if(!error){
                var $ = cheerio.load(html);
                // Extract All links in page
                links = $('a');
                $(links).each(function(i, link){
                    queue.push("http://stackoverflow.com"+$(link).attr('href'));
                    });


            }
                // Search For Node.js on every question.
                queue.each(function(i,linkItem){

                    request(linkItem, function(error, response, html){
                        var page = cheerio.load(html);
                        var ser = page.match(/node/i);
                        if (ser & ser.lenght > 0){
                            console.log(page);
                        }
                    });
                })

        })

    counter ++;
}

When I run this code, it only shows the links from the first page and then gives me the error: each has no method

I would be happy if you could tell me where I went wrong, or whether my code is even the right approach.

First of all, mixing async and sync code the way you do is not a good idea. The main problem is that the queue variable you are trying to iterate through has no each method. You can use lodash for that, or just replace the function call with a simple for loop.

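    var i, item;
    for (i = 0; i < queue.length; i++) {
        item = queue[i];
        request(item, function(error, response, html){
            // Skip failed requests; html is undefined when error is set.
            if (error) { return; }
            // match() is a String method, so search the raw HTML rather than the cheerio object.
            var ser = html.match(/node/i);
            if (ser && ser.length > 0){
                console.log(html);
            }
        });
    }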
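For comparison, plain JavaScript arrays do have a built-in forEach method, so lodash is not strictly required. The sketch below is only an illustration, not code from the question: findMatches and the sample pages array are hypothetical names, and the page bodies are hard-coded strings rather than real responses. Note that the forEach callback receives (item, index), the reverse of the (index, element) order used by cheerio's each.

```javascript
// Hypothetical sample data standing in for downloaded pages.
var pages = [
  { url: 'http://example.com/1', html: '<p>Using Node with Express</p>' },
  { url: 'http://example.com/2', html: '<p>Nothing relevant here</p>' },
  { url: 'http://example.com/3', html: '<p>NODE streams question</p>' }
];

// Return the URLs whose HTML mentions "node" (case-insensitive).
function findMatches(pageList) {
  var matches = [];
  // Array.prototype.forEach passes (item, index), not (index, item).
  pageList.forEach(function (page) {
    // The regular expression is tested against the raw HTML string.
    if (/node/i.test(page.html)) {
      matches.push(page.url);
    }
  });
  return matches;
}

console.log(findMatches(pages));
```

Using test() on the HTML string avoids both the bitwise & and the misspelled length property, since it returns a plain boolean.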

Besides, I wrote a tutorial on doing exactly what you are trying to do.
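The point about mixing sync and async code can be illustrated without any network access. In the sketch below, fakeRequest is a hypothetical stand-in for request that calls back on a later tick via setTimeout, and crawl is an illustrative helper, not code from the tutorial mentioned above. The idea is that anything depending on a response has to run inside the callback, and a simple counter is one way to know when every callback has finished.

```javascript
// Hypothetical stand-in for request(): calls back asynchronously with fake HTML.
function fakeRequest(url, callback) {
  setTimeout(function () {
    callback(null, { statusCode: 200 }, '<p>page body for ' + url + '</p>');
  }, 0);
}

// Fetch every URL, then call done(results) once all callbacks have fired.
function crawl(urls, fetch, done) {
  var results = [];
  var pending = urls.length;
  if (pending === 0) { return done(results); }

  urls.forEach(function (url) {
    fetch(url, function (error, response, html) {
      if (!error) {
        // html is only available here, inside the callback.
        results.push({ url: url, length: html.length });
      }
      pending -= 1;
      if (pending === 0) { done(results); }
    });
  });
  // With an asynchronous fetch, code placed here runs before any callback
  // fires, so results would still be empty at this point.
}

var urls = ['http://example.com/?page=1', 'http://example.com/?page=2'];
crawl(urls, fakeRequest, function (results) {
  console.log('fetched ' + results.length + ' pages');
});
```

Passing the fetch function as a parameter is a design choice for this sketch only: it lets the same crawl helper be exercised with a stub, and the real request function could be passed in its place since it uses the same (error, response, body) callback shape.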

