
Comparison of Nutch vs. Heritrix

I want to select one of the above for building a crawling framework for specific web sites. This is not an internet-wide crawl. I am not building a search index; rather, I am interested in scraping specific pages from the web site.

Could somebody please detail the pros and cons of each? Thanks, Nayn

Your main task is to scrape specific pages from the web site.

Nutch: open-source web-search software, built on Lucene Java.

Heritrix: the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

So, since you need a crawler rather than a search index, I think Heritrix is a much better fit than Nutch for your project.

Learning a framework/library is a valuable exercise, but it takes some time. Since your task is not a very complex one, it may be less painful to write a simple crawler from scratch in Java.
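To give a rough idea of what "a simple crawler from scratch" could look like, here is a minimal sketch, assuming Java 11+ (for `java.net.http.HttpClient`). The class name `SimpleCrawler`, the helper methods, the seed URL and the 20-page limit are all made up for illustration. Link extraction uses a naive regex, so a proper HTML parser would be the safer choice for real pages.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleCrawler {
    // Naive href matcher (assumption: good enough for simple pages).
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    /** Extracts href values from html and resolves them against base. */
    static List<URI> extractLinks(String html, URI base) {
        List<URI> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                links.add(base.resolve(m.group(1).trim()));
            } catch (IllegalArgumentException e) {
                // Skip hrefs that are not valid URIs.
            }
        }
        return links;
    }

    /** True when link is an http(s) URL on the same host as seed. */
    static boolean inScope(URI seed, URI link) {
        String scheme = link.getScheme();
        boolean web = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
        return web && seed.getHost() != null && seed.getHost().equalsIgnoreCase(link.getHost());
    }

    /** Breadth-first crawl from seed, limited to the seed's host and maxPages pages. */
    static Set<URI> crawl(URI seed, int maxPages) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        Set<URI> visited = new HashSet<>();
        Deque<URI> queue = new ArrayDeque<>();
        queue.add(seed);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            URI url = queue.poll();
            if (!visited.add(url)) continue; // already fetched
            HttpRequest request = HttpRequest.newBuilder(url).GET().build();
            HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 200) continue;
            // Scrape response.body() here, then queue the same-site links.
            for (URI link : extractLinks(response.body(), url)) {
                if (inScope(seed, link) && !visited.contains(link)) queue.add(link);
            }
        }
        return visited;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder seed; pass the real start page as the first argument.
        URI seed = URI.create(args.length > 0 ? args[0] : "https://example.com/");
        crawl(seed, 20).forEach(System.out::println);
    }
}
```

Something like this leaves out politeness delays, robots.txt handling, redirects and retries, which is exactly the part that Heritrix or Nutch would give you out of the box.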
