
Nutch Crawl - Deleting segments on each crawl implications

I noticed that during each Nutch crawl, the indexes sent to Solr were not consistent. Sometimes the latest changes to the webpages were shown, sometimes older changes were shown instead.

Cause

I found that Nutch was sending documents from an older segment to Solr.

Current Solution

Deleting all old segments before fetching seemed to solve the problem.

Question

I would like to know whether this approach has any implications, or whether my understanding is incorrect. I would also like to know why Nutch does not automatically remove older segments during a crawl.

Thanks.

If multiple segments are indexed (again) and the same URL is contained in two or more segments, there is no guarantee that the most recent version is indexed. This is a known problem (NUTCH-1416). The easiest solution is to send only the most recently fetched segment to the indexer. The bin/crawl script does this: the index step runs at the end of each cycle, for the segment fetched in that cycle.
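
The idea of indexing only the segment fetched in the current cycle can be sketched in Python as follows. This is a minimal sketch, not what bin/crawl itself runs: the helper names latest_segment and build_index_command are hypothetical, the crawl directory layout (crawldb, linkdb, segments) is assumed, and the exact arguments of bin/nutch index differ between Nutch versions, so check the usage output of your installation before running the resulting command.

```python
from pathlib import Path


def latest_segment(segments_dir):
    """Return the newest segment directory under segments_dir, or None."""
    # Nutch names segments by creation timestamp (yyyyMMddHHmmss), so
    # sorting the directory names lexicographically also sorts them by age.
    segments = sorted(
        (p for p in Path(segments_dir).iterdir() if p.is_dir()),
        key=lambda p: p.name,
    )
    return segments[-1] if segments else None


def build_index_command(crawl_dir, segment, nutch="bin/nutch"):
    """Build (but do not run) an index command for one segment.

    Assumed argument layout: index <crawldb> -linkdb <linkdb> <segment>.
    Verify it against your Nutch version before executing.
    """
    crawl = Path(crawl_dir)
    return [
        nutch,
        "index",
        str(crawl / "crawldb"),
        "-linkdb",
        str(crawl / "linkdb"),
        str(segment),
    ]
```

Calling build_index_command("crawl", latest_segment("crawl/segments")) after each fetch and parse step would hand only the newest segment to the indexer, so older segments left on disk should no longer influence which version of a page reaches Solr.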
