
Comparison of Nutch vs. Heritrix

I want to pick either Nutch or Heritrix to build a crawling framework for specific web sites. This is not an internet-wide crawl, and I am not building a search index; I am only interested in scraping specific pages from each site.

Could somebody please detail the pros and cons of each? Thanks, Nayn

Your main task is to scrape specific pages from a web site.

Nutch: open-source web-search software, built on Lucene Java.

Heritrix: the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

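Nutch is geared toward building a search index, which you said you do not need, while Heritrix is focused on crawling itself. So I think Heritrix is a much better fit than Nutch for your project.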

Learning a framework or library is a valuable exercise, but it takes time. Since your task is not very complex, it may be less painful to write a simple crawler from scratch in Java.
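To give a rough idea of what "a simple crawler from scratch" could look like, here is a minimal sketch using only the JDK (Java 11+ for java.net.http). The class name SimpleCrawler, the seed URL, the 20-page limit and the one-second delay are all assumptions made for illustration, and the regex-based link extraction is deliberately naive; a real HTML parser would be more robust. It also does not check robots.txt.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleCrawler {
    // Naive href matcher (assumption: good enough for a sketch, not for messy HTML).
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    /** Returns http(s) links found in html, resolved against base, on the same host as base. */
    public static List<URI> extractSameHostLinks(URI base, String html) {
        List<URI> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                URI resolved = base.resolve(m.group(1).trim());
                String scheme = resolved.getScheme();
                boolean isHttp = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
                boolean sameHost = base.getHost() != null
                        && base.getHost().equalsIgnoreCase(resolved.getHost());
                if (isHttp && sameHost) {
                    links.add(resolved);
                }
            } catch (IllegalArgumentException e) {
                // Malformed link: skip it.
            }
        }
        return links;
    }

    /** Breadth-first crawl from seed, staying on the seed's host, visiting at most maxPages URLs. */
    public static Set<URI> crawl(URI seed, int maxPages) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        Set<URI> visited = new LinkedHashSet<>();
        Deque<URI> frontier = new ArrayDeque<>();
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            URI url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already fetched
            }
            HttpRequest request = HttpRequest.newBuilder(url).GET().build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 200) {
                continue;
            }
            // The page-specific scraping logic would go here, working on response.body().
            for (URI link : extractSameHostLinks(url, response.body())) {
                if (!visited.contains(link)) {
                    frontier.add(link);
                }
            }
            Thread.sleep(1000); // crude politeness delay between requests
        }
        return visited;
    }

    public static void main(String[] args) throws Exception {
        // The default seed is a placeholder; pass the real site as the first argument.
        URI seed = URI.create(args.length > 0 ? args[0] : "https://example.com/");
        for (URI page : crawl(seed, 20)) {
            System.out.println(page);
        }
    }
}
```

Restricting the frontier to the seed's host is what keeps this a site-specific crawl rather than an internet-wide one. Once you need retries, robots.txt handling or large-scale politeness rules, a framework such as Heritrix starts to pay for its learning curve.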

