
Nutch-Hadoop: how can we crawl only the updates in a URL going for recrawl?

Can anybody let me know how I can identify updates in a URL that is going for re-crawl? I want to crawl only the updated content of the page when it is re-crawled, not the older content that has already been crawled. Thanks in advance. pragya

I think what you mean is that you want to re-crawl URLs ONLY if the content has been modified at the server end, and you want Nutch to detect this and thereby decide intelligently whether or not to fetch the content.

Nutch has the notion of maintaining a "Last-Modified" time for each page; the value is stored but NOT put to use while re-crawling pages. The developers knew that using it would save disk space and bandwidth, but the idea did not gain traction because other things were considered more important. People have raised this issue, but I still don't see any activity from the Nutch dev team. Efforts were made to improve this, and I am still not sure how precisely the current version uses the "Last-Modified" field.
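To make the "Last-Modified" idea concrete: a crawler that used it would compare the timestamp stored at the previous crawl with the one the server reports now (typically by sending an `If-Modified-Since` header and skipping the fetch on a `304 Not Modified` response). The helper below (`should_refetch` is a hypothetical name, not a Nutch API) is a minimal sketch of that decision:

```python
from email.utils import parsedate_to_datetime

def should_refetch(stored_last_modified, server_last_modified):
    """Decide whether to re-fetch a page by comparing the Last-Modified
    value stored at crawl time with the one the server reports now.
    Both arguments are HTTP-date strings, e.g. 'Tue, 01 Mar 2022 10:00:00 GMT'.
    """
    if stored_last_modified is None or server_last_modified is None:
        # No timestamp available on one side: re-fetch to be safe.
        return True
    stored = parsedate_to_datetime(stored_last_modified)
    current = parsedate_to_datetime(server_last_modified)
    # Only re-fetch when the server's copy is newer than what we stored.
    return current > stored

# Unchanged page: the stored and current timestamps match, so skip it.
print(should_refetch("Tue, 01 Mar 2022 10:00:00 GMT",
                     "Tue, 01 Mar 2022 10:00:00 GMT"))  # False
# Updated page: the server's timestamp is newer, so fetch it again.
print(should_refetch("Tue, 01 Mar 2022 10:00:00 GMT",
                     "Wed, 02 Mar 2022 10:00:00 GMT"))  # True
```

In a real crawler the stored value would be sent as `If-Modified-Since`, letting the server answer `304` without a body, which is where the bandwidth saving comes from.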

You cannot tell Nutch to fetch only the updated portion of a page and skip the unchanged data; it will fetch the full content every time. What you can do is set the recrawl frequency intelligently, so that pages are re-crawled around the time they are likely to have been updated.
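One way to set the recrawl frequency "smartly" is Nutch's `AdaptiveFetchSchedule`, which lengthens a page's fetch interval each time it is found unchanged and shortens it when a change is detected. A sketch of the relevant `nutch-site.xml` properties follows; the property names come from the Nutch 1.x `nutch-default.xml`, so verify them (and the defaults) against your installed version:

```xml
<!-- nutch-site.xml: adaptive re-crawl scheduling (property names as in
     Nutch 1.x nutch-default.xml; check your version before relying on them) -->
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <name>db.fetch.interval.default</name>
  <!-- starting interval: 30 days, in seconds -->
  <value>2592000</value>
</property>
<property>
  <name>db.fetch.schedule.adaptive.inc_rate</name>
  <!-- grow the interval when a page is found unmodified -->
  <value>0.4</value>
</property>
<property>
  <name>db.fetch.schedule.adaptive.dec_rate</name>
  <!-- shrink the interval when a page is found modified -->
  <value>0.2</value>
</property>
```

With this schedule, frequently updated pages drift toward short intervals and static pages toward long ones, which approximates "crawl when updated" without the server-side change detection the question asks for.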
