
Comparison of Nutch vs. Heritrix

Original post: 2010-07-16 07:30:46 · Tags: java, web-crawler, nutch

I want to choose between Nutch and Heritrix for building a crawling framework for specific web sites. This is not an internet-wide crawl. I am not building a search index; rather, I am interested in scraping specific pages from the web site.

Could somebody please detail the pros and cons of each? Thanks, Nayn

1 Answer

Your main task is to scrape specific pages from the web site.

Nutch: open-source web-search software, built on Lucene Java.

Heritrix: the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

Since you are not building a search index, I think Heritrix is a much better fit than Nutch for your project.

Learning a framework/library is a valuable exercise, but it takes some time. Since your task is not very complex, it may be less painful to write a simple crawler from scratch in Java.
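
To give a rough idea of what "a simple crawler from scratch" could look like, here is a minimal sketch, not a definitive implementation. The class name SimpleSiteCrawler, the fetcher and inScope parameters, and the regex-based link extraction are all assumptions made for illustration; none of them come from Nutch or Heritrix. The fetcher is passed in as a function so the crawl logic stays independent of how pages are downloaded.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class SimpleSiteCrawler {
    // Naive href matcher (an assumption for brevity); a real HTML parser
    // would be more robust. Pure fragment links such as "#top" do not match.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    private final Function<String, String> fetcher; // URL -> HTML, or null on failure
    private final Predicate<String> inScope;        // decides which URLs may be visited

    public SimpleSiteCrawler(Function<String, String> fetcher, Predicate<String> inScope) {
        this.fetcher = fetcher;
        this.inScope = inScope;
    }

    // Extracts href values from the HTML and resolves them against the page URL.
    public static List<String> extractLinks(String baseUrl, String html) {
        List<String> links = new ArrayList<>();
        URI base = URI.create(baseUrl);
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                links.add(base.resolve(m.group(1).trim()).toString());
            } catch (IllegalArgumentException e) {
                // Skip links that are not valid URIs.
            }
        }
        return links;
    }

    // Breadth-first crawl from the seed; returns the in-scope URLs visited,
    // in visiting order, stopping once maxPages URLs have been visited.
    public Set<String> crawl(String seed, int maxPages) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!inScope.test(url) || !visited.add(url)) {
                continue; // out of scope or already seen
            }
            String html = fetcher.apply(url);
            if (html == null) {
                continue; // fetch failed; nothing to extract
            }
            frontier.addAll(extractLinks(url, html));
        }
        return visited;
    }
}
```

In practice the fetcher could be backed by an HTTP client such as java.net.http.HttpClient (Java 11+), and the scope predicate could be something like url -> url.startsWith("http://example.com/") to keep the crawl on one site. The place where the HTML is fetched is also where the page-specific scraping would go. A real crawler should additionally respect robots.txt and rate limits, which this sketch leaves out.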

