I want to select one of the above for building a crawling framework for specific web sites. This is not an internet-wide crawl. I am not building a search index; rather, I am interested in scraping specific pages from a web site.
Could somebody please detail the pros and cons of the above? Thanks, Nayn
Your main task is to scrape specific pages from the web site.
Nutch: open-source web-search software, built on Lucene Java.
Heritrix: the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
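Nutch is geared toward building a search index, which you do not need, while Heritrix is a crawler first. So I think Heritrix is a much better fit than Nutch for your project.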
Learning a framework/library is a valuable exercise, but it takes some time. Since your task is not a very complex one, it may be less painful to write a simple crawler from scratch in Java.
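To give an idea of how small such a from-scratch crawler can be, here is a minimal sketch. It assumes Java 11+ (for `java.net.http`); the class name `SimpleCrawler`, the regex-based link extraction, the same-host restriction, and the page limit are all illustrative choices of mine, not anything taken from Nutch or Heritrix. A real scraper would likely use a proper HTML parser and honor robots.txt.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleCrawler {
    // Naive href matcher (assumption: good enough for simple pages; fragments are dropped).
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    /** Extracts links from the HTML and resolves them against the page's own URI. */
    static List<URI> extractLinks(String html, URI base) {
        List<URI> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                links.add(base.resolve(m.group(1).trim()));
            } catch (IllegalArgumentException e) {
                // Malformed link: skip it.
            }
        }
        return links;
    }

    /** Breadth-first crawl that stays on the start host and stores at most maxPages pages. */
    static Map<URI, String> crawl(URI start, int maxPages, Function<URI, String> fetcher) {
        Map<URI, String> pages = new LinkedHashMap<>();
        Set<URI> seen = new HashSet<>();
        Deque<URI> queue = new ArrayDeque<>();
        queue.add(start);
        seen.add(start);
        while (!queue.isEmpty() && pages.size() < maxPages) {
            URI url = queue.poll();
            String html = fetcher.apply(url);
            if (html == null) continue; // fetch failed; move on
            pages.put(url, html);
            for (URI link : extractLinks(html, url)) {
                boolean sameHost = start.getHost() != null
                    && start.getHost().equalsIgnoreCase(link.getHost());
                if (sameHost && seen.add(link)) queue.add(link);
            }
        }
        return pages;
    }

    /** Fetches a page body over HTTP; returns null on any error or non-200 status. */
    static String fetch(URI url) {
        try {
            HttpRequest req = HttpRequest.newBuilder(url).GET().build();
            HttpResponse<String> res = CLIENT.send(req, HttpResponse.BodyHandlers.ofString());
            return res.statusCode() == 200 ? res.body() : null;
        } catch (Exception e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // The default start URL is a placeholder; pass your own site as the first argument.
        URI start = URI.create(args.length > 0 ? args[0] : "https://example.com/");
        crawl(start, 20, SimpleCrawler::fetch).keySet().forEach(System.out::println);
    }
}
```

The fetcher is passed in as a function so the crawl logic can be exercised without network access, and so the page-specific scraping step can be added where the page body is stored.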