
NodeJS Web Crawling With node-crawler or simplecrawler

I am new to web crawling and I need some pointers about these two Node.js crawlers.

Aim: I want to crawl a website and obtain ONLY the internal (local) URLs within that domain. I am not interested in any page data or scraping. Just the URLs.

My Confusion: When using node-crawler or simplecrawler, do they have to download the entire page before they return a response? Is there a way to just find a URL, ping it or perhaps perform a GET request, and, if the response is 200, proceed to the next link without actually having to request the entire page data?

Is there any other Node.js crawler or spider that can request and log only URLs? My concern is to make the crawl as lightweight as possible.

Thank you in advance.

Crawling only the HTML pages of a website is usually a pretty lightweight process. The crawler does have to download the response bodies of the HTML pages, though, because the HTML itself is what gets searched for additional URLs. A status code alone tells you that a page exists, not which pages it links to.
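To illustrate why the body is needed, here is a minimal sketch of pulling same-host links out of an HTML string. The helper name `extractInternalLinks` and the sample markup are made up for this example and are not part of simplecrawler or node-crawler. The regular expression is a deliberate simplification; a real crawler would use a proper HTML parser and more careful discovery rules.

```javascript
// Hypothetical helper for illustration only; not part of any crawler library.
// Finds href attributes in an HTML string and keeps the http(s) URLs that
// resolve to the same hostname as the page they were found on.
function extractInternalLinks(html, pageUrl) {
    var base = new URL(pageUrl);
    // Simplified pattern: only matches quoted href="..." or href='...' values.
    var hrefPattern = /href\s*=\s*["']([^"']+)["']/gi;
    var found = new Set();
    var match;

    while ((match = hrefPattern.exec(html)) !== null) {
        var resolved;
        try {
            // Resolve relative links against the URL of the current page.
            resolved = new URL(match[1], base);
        } catch (err) {
            continue; // skip values that cannot be parsed as a URL
        }
        if (!/^https?:$/.test(resolved.protocol)) continue; // e.g. mailto:
        if (resolved.hostname !== base.hostname) continue;  // external link
        resolved.hash = ""; // fragments point into the same page
        found.add(resolved.href);
    }
    return Array.from(found);
}

var sampleHtml =
    '<a href="/about">About</a>' +
    '<a href="contact.html#form">Contact</a>' +
    '<a href="https://other.example.org/">External</a>' +
    '<a href="mailto:someone@example.com">Mail</a>';

console.log(extractInternalLinks(sampleHtml, "http://example.com/index.html"));
// → [ 'http://example.com/about', 'http://example.com/contact.html' ]
```

None of these links can be found without the page body in hand, which is why a HEAD request or a status check alone cannot drive a crawl.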

simplecrawler is configurable, so you can avoid downloading images and other non-HTML resources from a website. Here's a snippet that logs the URLs the crawler visits and avoids downloading image resources.

var Crawler = require("simplecrawler");
var moment = require("moment"); // used only to timestamp the log output

var crawler = new Crawler("http://example.com");

function log() {
    var time = moment().format("HH:mm:ss");
    var args = Array.from(arguments);

    args.unshift(time);
    console.log.apply(console, args);
}

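// Don't download resources whose MIME type the crawler can't parse for links
crawler.downloadUnsupported = false;
// Decode response bodies into strings instead of handing back raw buffers
crawler.decodeResponses = true;

// Don't queue URLs whose path ends in an archive, image or video extension
crawler.addFetchCondition(function(queueItem) {
    return !queueItem.path.match(/\.(zip|jpe?g|png|mp4|gif)$/i);
});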

crawler.on("crawlstart", function() {
    log("crawlstart");
});

crawler.on("fetchcomplete", function(queueItem, responseBuffer) {
    log("fetchcomplete", queueItem.url);
});

crawler.on("fetch404", function(queueItem, response) {
    log("fetch404", queueItem.url, response.statusCode);
});

crawler.on("fetcherror", function(queueItem, response) {
    log("fetcherror", queueItem.url, response.statusCode);
});

crawler.on("complete", function() {
    log("complete");
});

crawler.start();
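One possible refinement of the fetch condition, sketched under an assumption: if the path handed to the condition still carries a query string (worth checking against your simplecrawler version), a URL such as `/img/logo.png?v=2` would slip past a pattern anchored with `$`. The helper name `shouldFetch` below is hypothetical; it strips the query string and fragment before testing the extension.

```javascript
// Same extension list as in the snippet above.
var SKIPPED_EXTENSIONS = /\.(zip|jpe?g|png|mp4|gif)$/i;

// Hypothetical helper: returns true if a path looks worth fetching.
function shouldFetch(path) {
    // Drop anything after "?" or "#" so the extension test sees only the path.
    var pathname = path.split(/[?#]/)[0];
    return !SKIPPED_EXTENSIONS.test(pathname);
}

console.log(shouldFetch("/blog/post-1"));      // → true
console.log(shouldFetch("/img/logo.PNG"));     // → false
console.log(shouldFetch("/img/logo.png?v=2")); // → false
```

It could then be used inside the condition as `return shouldFetch(queueItem.path);`.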
