
NodeJS Crawler log into site

I want to crawl geocaching.com, but some data, like the coordinates, is only available to logged-in users. I'm using "crawler" from npm and have no idea how to log in with it, but I already have the names of the login form fields:

  • ctl00$ContentBody$tbUsername: user
  • ctl00$ContentBody$tbPassword: password
  • ctl00$ContentBody$btnSignIn: "Sign+In"

Here is my code so far:

var Crawler = require("crawler");
var url = require('url');
var mongoose = require("mongoose");
var Cache = require("./models/cache.js");

mongoose.connect("mongodb://localhost:27017/Cache");

// strip all HTML tags from a string
var removeTags = function(text){
    return String(text).replace(/(<([^>]+)>)/ig,'');
};
var c = new Crawler({
    maxConnections: 10,
    skipDuplicates: true,

    callback: function (error, result, $) {

        // geocache detail pages carry the GC code in an element with class "CoordInfoCode"
        if (result.request.uri.href.startsWith("http://www.geocaching.com/geocache/")) {
            var cache = new Cache();
            var id = removeTags($(".CoordInfoCode").html());
            Cache.count({
                "_id": id
            }, function (err, count) {
                if (err)
                    return;
                else if (count < 1) {
                    //Saving the data
                }

            });


        }
        if (result.headers['content-type'] == "text/html; charset=utf-8") {
            if ($('a').length != 0) {
                $('a').each(function (index, a) {
                    var toQueueUrl = $(a).attr('href');
                    // defer queueing so the current callback can finish first
                    process.nextTick(function () {
                        c.queue(toQueueUrl);
                    });

                });
            }
        }

    }
});

c.queue('http://www.geocaching.com/seek/nearest.aspx?ul=Die_3sten_3');
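
A minimal sketch of what the missing login step could look like, untested and assuming the request module and cheerio are available: POST the form fields listed above while keeping a cookie jar so the session survives across requests, and scrape the hidden ASP.NET fields (__VIEWSTATE, __EVENTVALIDATION) from the login page first, since geocaching.com is an ASP.NET site. The login URL here is an assumption and has not been verified against the live site.

var request = require("request");
var cheerio = require("cheerio");

var loginUrl = "https://www.geocaching.com/login/"; // assumed login URL
var jar = request.jar(); // holds the session cookies

// fetch the login page first to pick up the ASP.NET hidden fields
request.get({ url: loginUrl, jar: jar }, function (err, res, body) {
    if (err) throw err;
    var $form = cheerio.load(body);

    request.post({
        url: loginUrl,
        jar: jar,
        form: {
            // hidden fields scraped from the login form
            "__VIEWSTATE": $form("input[name='__VIEWSTATE']").val(),
            "__EVENTVALIDATION": $form("input[name='__EVENTVALIDATION']").val(),
            // field names taken from the question
            "ctl00$ContentBody$tbUsername": "user",
            "ctl00$ContentBody$tbPassword": "password",
            "ctl00$ContentBody$btnSignIn": "Sign In"
        }
    }, function (err, res, body) {
        if (err) throw err;
        // the jar now carries the authenticated session cookies;
        // reuse it when fetching the geocache pages (e.g. request.get with the same jar),
        // or pass it through to the crawler if it forwards request options
    });
});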

I made an example JavaScript crawler on GitHub.

It's event-driven and uses an in-memory queue to store all the resources (i.e. URLs).

How to use it in your Node environment:

var Crawler = require('../lib/crawler')
var crawler = new Crawler('http://www.someUrl.com');

// crawler.maxDepth = 4;
// crawler.crawlInterval = 10;
// crawler.maxListenerCurrency = 10;
// crawler.redisQueue = true;
crawler.start();

Here I'm just showing you two core methods of a JavaScript crawler.

Crawler.prototype.run = function() {
  var crawler = this;
  process.nextTick(() => {
    //the run loop
    crawler.crawlerIntervalId = setInterval(() => {

      crawler.crawl();

    }, crawler.crawlInterval);
    //kick off first one
    crawler.crawl();
  });

  crawler.running = true;
  crawler.emit('start');
}


Crawler.prototype.crawl = function() {
  var crawler = this;

  if (crawler._openRequests >= crawler.maxListenerCurrency) return;


  //go get the item
  crawler.queue.oldestUnfetchedItem((err, queueItem, index) => {
    if (queueItem) {
      //got the item start the fetch
      crawler.fetchQueueItem(queueItem, index);
    } else if (crawler._openRequests === 0) {
      crawler.queue.complete((err, completeCount) => {
        if (err)
          throw err;
        crawler.queue.getLength((err, length) => {
          if (err)
            throw err;
          if (length === completeCount) {
            //no open requests and no unfetched items: stop the crawler
            crawler.emit("complete", completeCount);
            clearInterval(crawler.crawlerIntervalId);
            crawler.running = false;
          }
        });
      });
    }

  });
};
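
Since run() and crawl() emit 'start' and 'complete' events (the crawler presumably extends EventEmitter), a caller could hook into the lifecycle roughly like this, using the crawler instance from the usage example above; the handlers are just for illustration:

// react to the lifecycle events emitted by run() and crawl()
crawler.on('start', function () {
  console.log('crawl started');
});

crawler.on('complete', function (completeCount) {
  console.log('crawl finished, completed items: ' + completeCount);
});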

Here is the GitHub link: https://github.com/bfwg/node-tinycrawler . It is a JavaScript web crawler written in under 1000 lines of code. This should put you on the right track.
