
Nutch-Hadoop: how can we crawl only the updates in a URL going for recrawl?

Can anybody let me know how I can identify updates in a URL that is going for re-crawl? I want to crawl only the updated content of the page when it is re-crawled, not the older content that has already been crawled. Thanks in advance. pragya

I think what you mean is that you want to re-crawl URLs ONLY if the content has been modified at the server end, and you want Nutch to detect this and thereby decide intelligently whether or not to fetch the content.

Nutch has the notion of maintaining a "Last-Modified" time for each page; the value is stored but NOT put to use while re-crawling pages. The developers knew that using it would save disk space and bandwidth, but the idea did not gain traction because other things were considered more important. People have raised this issue, but I still don't see any activity from the Nutch dev team. Efforts were made to improve this, and I am still not sure how precisely the current version uses the "Last-Modified" field.
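To make the "Last-Modified" idea concrete: a crawler that used it would compare the timestamp stored at the previous crawl with the one the server reports now (typically by sending an `If-Modified-Since` header and skipping the fetch on a `304 Not Modified` response). The helper below (`should_refetch` is a hypothetical name, not a Nutch API) is a minimal sketch of that decision:

```python
from email.utils import parsedate_to_datetime

def should_refetch(stored_last_modified, server_last_modified):
    """Decide whether to re-fetch a page by comparing the Last-Modified
    value stored at crawl time with the one the server reports now.
    Both arguments are HTTP-date strings, e.g. 'Tue, 01 Mar 2022 10:00:00 GMT'.
    """
    if stored_last_modified is None or server_last_modified is None:
        # No timestamp available on one side: re-fetch to be safe.
        return True
    stored = parsedate_to_datetime(stored_last_modified)
    current = parsedate_to_datetime(server_last_modified)
    # Only re-fetch when the server's copy is newer than what we stored.
    return current > stored

# Unchanged page: the stored and current timestamps match, so skip it.
print(should_refetch("Tue, 01 Mar 2022 10:00:00 GMT",
                     "Tue, 01 Mar 2022 10:00:00 GMT"))  # False
# Updated page: the server's timestamp is newer, so fetch it again.
print(should_refetch("Tue, 01 Mar 2022 10:00:00 GMT",
                     "Wed, 02 Mar 2022 10:00:00 GMT"))  # True
```

In a real crawler the stored value would be sent as `If-Modified-Since`, letting the server answer `304` without a body, which is where the bandwidth saving comes from.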

You cannot tell Nutch to fetch only the updated portion of a page and skip the unchanged data; it will fetch the full content every time. What you can do is set the recrawl frequency intelligently, so that pages are re-crawled around the time they are likely to have been updated.
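One way to set the recrawl frequency "smartly" is Nutch's `AdaptiveFetchSchedule`, which lengthens a page's fetch interval each time it is found unchanged and shortens it when a change is detected. A sketch of the relevant `nutch-site.xml` properties follows; the property names come from the Nutch 1.x `nutch-default.xml`, so verify them (and the defaults) against your installed version:

```xml
<!-- nutch-site.xml: adaptive re-crawl scheduling (property names as in
     Nutch 1.x nutch-default.xml; check your version before relying on them) -->
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <name>db.fetch.interval.default</name>
  <!-- starting interval: 30 days, in seconds -->
  <value>2592000</value>
</property>
<property>
  <name>db.fetch.schedule.adaptive.inc_rate</name>
  <!-- grow the interval when a page is found unmodified -->
  <value>0.4</value>
</property>
<property>
  <name>db.fetch.schedule.adaptive.dec_rate</name>
  <!-- shrink the interval when a page is found modified -->
  <value>0.2</value>
</property>
```

With this schedule, frequently updated pages drift toward short intervals and static pages toward long ones, which approximates "crawl when updated" without the server-side change detection the question asks for.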
