
Apache Nutch restart crawl

I am running Apache Nutch 1.12 in local mode.

I needed to edit the seed file to remove a subdomain and add a few new domains, and I now want to restart the crawl from the beginning.

The problem is that whenever I restart the crawl, it resumes from where I stopped it, which is in the middle of the subdomain I removed.

I stopped the crawl by killing the Java process (kill -9). I first tried creating a .STOP file in the bin directory, but that did not work, so I used kill.

Now whenever I restart the crawl, I can see from the output that it picks up where the job was stopped. Googling turned up advice about stopping the Hadoop job, but I don't have a Hadoop installation on this server; the only trace of Hadoop is the jar files in the Apache Nutch directory.

How can I restart the crawl from the very beginning rather than from where it was last stopped? Effectively, I want to start a fresh crawl.

Many thanks

To start from scratch, specify a different crawl directory or delete the existing one. All of the crawl state (crawldb, linkdb, and segments) lives in that directory, so removing it gives you a clean slate.
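As a minimal sketch in local mode: this assumes the previous run used a seed directory named urls/ and a crawl directory named crawl/ (both hypothetical names; substitute whatever paths you passed to bin/crawl) and that you want two fresh rounds:

    # Run from the Nutch runtime/local directory.
    # "urls/" and "crawl/" are assumed names -- use the seed and crawl
    # directories from your previous run.
    rm -rf crawl/                  # wipe crawldb, linkdb and segments
    bin/crawl urls/ crawl/ 2       # fresh crawl from the edited seed list

Because all state lived under crawl/, the new run reads only the edited seed list and knows nothing about the removed subdomain.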

Removing entries from the seed list will not affect the content of the crawldb or the segments. If you want to remove a domain without restarting from zero, you can instead add a pattern to the URL filters so that its URLs are deleted from the crawldb during the update step, or at least are no longer selected during the generate step; a sketch follows below.
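As a sketch of that approach, assuming the removed subdomain is old.example.com (a hypothetical host name): add a reject rule to conf/regex-urlfilter.txt. Rules are evaluated top to bottom and the first match wins, so the rule must come before the catch-all accept line that ships with Nutch:

    # conf/regex-urlfilter.txt
    # Reject everything on the removed subdomain (hypothetical host name).
    -^https?://old\.example\.com/

    # ... existing rules ...

    # accept anything else (Nutch's default catch-all)
    +.

Note that in Nutch 1.x the update step only applies URL filters when asked to (the -filter option of bin/nutch updatedb); without it, the rule still keeps those URLs out of the generate step, so they are never fetched again.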
