
Apache Nutch restart crawl

I am running Apache Nutch 1.12 in local mode.

I needed to edit the seed file to remove a subdomain and add a few new domains, and I now want to restart the crawl from the beginning.

The problem is that whenever I restart the crawl, it resumes from where I stopped it, which is in the middle of the subdomain I removed.

I stopped the crawl by killing the Java process (kill -9). I first tried creating a .STOP file in the bin directory, but that did not work, so I used kill.

Now whenever I restart the crawl, I can see from the output that it picks up where the job was stopped. Googling turned up advice about stopping the Hadoop job, but I don't have a Hadoop installation on this server; the only trace of Hadoop is the jar files in the Apache Nutch directory.

How can I restart the crawl from the very beginning rather than from where it was last stopped? Effectively, I want to start a fresh crawl.

Many thanks

To start from scratch, specify a different crawl directory or delete the existing one. All of the crawl state (crawldb, linkdb, and segments) lives in that directory, so removing it gives you a clean slate.
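As a minimal sketch in local mode: this assumes the previous run used a seed directory named urls/ and a crawl directory named crawl/ (both hypothetical names; substitute whatever paths you passed to bin/crawl) and that you want two fresh rounds:

    # Run from the Nutch runtime/local directory.
    # "urls/" and "crawl/" are assumed names -- use the seed and crawl
    # directories from your previous run.
    rm -rf crawl/                  # wipe crawldb, linkdb and segments
    bin/crawl urls/ crawl/ 2       # fresh crawl from the edited seed list

Because all state lived under crawl/, the new run reads only the edited seed list and knows nothing about the removed subdomain.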

Removing entries from the seed list will not affect the content of the crawldb or the segments. If you want to remove a domain without restarting from zero, you can instead add a pattern to the URL filters so that its URLs are deleted from the crawldb during the update step, or at least are no longer selected during the generate step; a sketch follows below.
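As a sketch of that approach, assuming the removed subdomain is old.example.com (a hypothetical host name): add a reject rule to conf/regex-urlfilter.txt. Rules are evaluated top to bottom and the first match wins, so the rule must come before the catch-all accept line that ships with Nutch:

    # conf/regex-urlfilter.txt
    # Reject everything on the removed subdomain (hypothetical host name).
    -^https?://old\.example\.com/

    # ... existing rules ...

    # accept anything else (Nutch's default catch-all)
    +.

Note that in Nutch 1.x the update step only applies URL filters when asked to (the -filter option of bin/nutch updatedb); without it, the rule still keeps those URLs out of the generate step, so they are never fetched again.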
