
Crawling with Node.js

Complete Node.js noob, so don't judge me...

I have a simple requirement: crawl a web site, find all the product pages, and save some data from the product pages.

Easier said than done.

Looking at the Node.js samples, I can't find anything similar.

There's the request-based scraper:

var request = require('request');
var jsdom = require('jsdom');

request({ uri: 'http://www.google.com' }, function (error, response, body) {
  if (!error && response.statusCode == 200) {
    var window = jsdom.jsdom(body).createWindow();
    jsdom.jQueryify(window, 'path/to/jquery.js', function (window, jQuery) {
      // jQuery is now loaded on the jsdom window created from 'body'
      jQuery('.someClass').each(function () { /* your custom logic */ });
    });
  }
});

But I can't figure out how to make it call itself once it has scraped the root page, or how to populate an array of URLs that it still needs to scrape.
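One common way to make any of these scrapers recursive is to keep a shared queue: scrape a page, push every newly discovered link onto the queue, and keep pulling from it until it is empty. Here is a minimal sketch of that pattern; `fetchPage` is a hypothetical stand-in for the request/jsdom call above, expected to hand back the links found on the page plus whatever product data you extracted:

```javascript
// A queue-based crawler sketch. `fetchPage(url, cb)` is a placeholder
// for your actual scraping call (request + jsdom, http-agent, etc.);
// it should call cb(err, { links: [...], isProduct: bool, data: ... }).
function crawl(startUrl, fetchPage, onProduct, done) {
  var queue = [startUrl]; // URLs still waiting to be scraped
  var seen = {};          // avoid visiting the same URL twice
  seen[startUrl] = true;

  (function next() {
    if (queue.length === 0) return done();
    var url = queue.shift();
    fetchPage(url, function (err, page) {
      if (!err && page) {
        // enqueue links we haven't seen yet
        page.links.forEach(function (link) {
          if (!seen[link]) {
            seen[link] = true;
            queue.push(link);
          }
        });
        if (page.isProduct) onProduct(url, page.data);
      }
      next(); // move on to the next URL in the queue
    });
  })();
}
```

Because the queue is just an array, discovering new product pages mid-crawl is no different from the initial seeding; the loop only stops when nothing is left.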

Then there's the http-agent way:

var sys = require('sys');
var jsdom = require('jsdom');
var httpAgent = require('http-agent');

var agent = httpAgent.create('www.google.com', ['finance', 'news', 'images']);

agent.addListener('next', function (err, agent) {
  var window = jsdom.jsdom(agent.body).createWindow();
  jsdom.jQueryify(window, 'path/to/jquery.js', function (window, jQuery) {
    // jQuery is now loaded on the jsdom window created from 'agent.body'
    jQuery('.someClass').each(function () { /* your custom logic */ });

    agent.next();
  });
});

agent.addListener('stop', function (agent) {
  sys.puts('the agent has stopped');
});

agent.start();

It takes an array of locations, but once you've started it with that array, you can't add more locations to it, which is what you'd need to reach all the product pages.

And I can't even get Apricot working; for some reason I keep getting an error.

So, how do I modify any of the above examples (or anything not listed above) to scrape a site, find all the product pages, pull some data out of them (the jQuery `.someClass` example should do the trick), and save that to a db?

Thanks!

Personally, I use node.io to scrape some websites: https://github.com/chriso/node.io

More details about scraping can be found in the wiki!


I've had pretty good success crawling and scraping with CasperJS. It's a pretty nice library built on top of PhantomJS. I like it because it's fairly succinct. Callbacks can be executed as `foo.then()`, which is super simple to understand, and I can even use jQuery since PhantomJS is an implementation of WebKit. For example, the following would instantiate an instance of Casper and push all links on an archive page to an array called `links`.

var casper = require("casper").create();

var numberOfLinks = 0;
var currentLink = 0;
var links = [];
var buildPage, capture, selectLink, grabContent, writeContent;

casper.start("http://www.yoursitehere.com/page_to/scrape/", function() {
    numberOfLinks = this.evaluate(function() {
        return __utils__.findAll('.nav-selector a').length;
    });
    this.echo(numberOfLinks + " items found");

    // cause jquery makes it easier
    casper.page.injectJs('/PATH/TO/jquery.js');
});


// Capture links
capture = function() {
    links = this.evaluate(function() {
        var link = [];
        jQuery('.nav-selector a').each(function() {
            link.push($(this).attr('href'));
        });
        return link;
    });
    this.then(selectLink);
};

You can then use Node's `fs` module (or whatever else you want, really) to push your data into XML, CSV, or whatever format you want. The example for scraping BBC photos was exceptionally helpful when I built my scraper.
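For the CSV case, a small helper is enough before handing the text to `fs.writeFile`. A minimal sketch (the column names here are hypothetical, matching whatever fields your scraper collected):

```javascript
// Flatten an array of scraped objects into CSV text. Every value is
// quoted, and embedded quotes are doubled, per the usual CSV convention.
function toCsv(rows, columns) {
  function escapeField(value) {
    return '"' + String(value == null ? '' : value).replace(/"/g, '""') + '"';
  }
  var lines = [columns.map(escapeField).join(',')];
  rows.forEach(function (row) {
    lines.push(columns.map(function (col) {
      return escapeField(row[col]);
    }).join(','));
  });
  return lines.join('\n');
}
```

You'd then persist it with something like `fs.writeFile('links.csv', toCsv(scraped, ['href', 'title']), callback)`.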

This is a 10,000-foot view of what Casper can do. It has a very potent and broad API. I dig it, in case you couldn't tell :).

My full scraping example is here: https://gist.github.com/imjared/5201405
