
How to configure Nutch to crawl only the URLs in the seed list? (no link following needed)

I have a URL seed list containing more than 100,000 URLs. I know that Nutch will crawl not only the URLs in the seed list but also any URLs it finds linked inside those websites. However, I would like to know whether there is any way to stop this behavior, so that only the URLs specified in the seed list are crawled.

In your nutch-site.xml configuration, set the "db.ignore.external.links" property to true.

This will make Nutch ignore any URLs pointing to domains outside the injected seed list.
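A minimal sketch of what that entry in nutch-site.xml would look like (the property name is standard in Nutch; the surrounding file contents and the comment are only illustrative):

    <configuration>
      <property>
        <name>db.ignore.external.links</name>
        <value>true</value>
        <!-- Outlinks to hosts outside the injected seed list are dropped. -->
      </property>
    </configuration>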

If you are using the crawl command, check the -depth parameter.

-depth <depth> indicates the link depth from the root page that should be crawled.

Using this, you can control the depth to which Nutch crawls. A value of 1 would likely limit the crawl to the seed pages only.
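For example, with the legacy one-step crawl command in Nutch 1.x (the urls seed directory, the crawl output directory, and the -topN value are placeholder assumptions, not taken from the question):

    # Crawl only the injected seeds, one level deep
    bin/nutch crawl urls -dir crawl -depth 1 -topN 100000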
