
How to configure Nutch to crawl only the URLs in the seed list? (no link following needed)

I have a URL seed list containing more than 100,000 URLs. I know that Nutch will crawl not only the URLs in the seed list but also any URLs it finds linked inside those websites. However, I would like to know whether there is any way to stop this behavior, so that only the URLs specified in the seed list are crawled.

In your nutch-site.xml configuration, set the "db.ignore.external.links" property to true.

This will make Nutch ignore any URLs pointing to domains outside the injected seed list.
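A minimal sketch of what that entry in nutch-site.xml would look like (the property name is standard in Nutch; the surrounding file contents and the comment are only illustrative):

    <configuration>
      <property>
        <name>db.ignore.external.links</name>
        <value>true</value>
        <!-- Outlinks to hosts outside the injected seed list are dropped. -->
      </property>
    </configuration>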

If you are using the crawl command, check the -depth parameter.

-depth <depth> indicates the link depth from the root page that should be crawled.

Using this, you can control the depth to which Nutch crawls. A value of 1 would likely limit the crawl to the seed pages only.
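For example, with the legacy one-step crawl command in Nutch 1.x (the urls seed directory, the crawl output directory, and the -topN value are placeholder assumptions, not taken from the question):

    # Crawl only the injected seeds, one level deep
    bin/nutch crawl urls -dir crawl -depth 1 -topN 100000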
