
How to set Nutch to extract content of only the URLs present in the seed file

I am using Nutch 2.3 and I am trying to get the HTML content of some URLs listed in the seed.txt file that I pass to Nutch, and store it into HBase.

The problem is as follows:

First crawl: everything runs fine and I get the data into HBase with the URL as the row key.

Second run: when I run the crawl a second time with different URLs, I see the fetch job running for many URLs, even though I have only one URL in my seed file.

So my question is: how can I make sure that Nutch only crawls and fetches the HTML content of the URLs present in seed.txt, and not the outlinks found in the HTML content of those URLs?

I think you want to fetch only the domains given in the seed file. For that, update nutch-site.xml as follows:

  <property>
   <name>db.ignore.external.links</name>
   <value>true</value>
  </property>
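For context, the effect of `db.ignore.external.links=true` can be sketched like this (a hypothetical Python illustration of the behavior, not Nutch's actual code): outlinks whose host differs from the page they were found on are dropped before they reach the crawl DB.

```python
from urllib.parse import urlparse

def keep_internal(source_url, outlinks):
    """Sketch of db.ignore.external.links=true: drop outlinks
    whose host differs from the page they were found on."""
    src_host = urlparse(source_url).netloc
    return [u for u in outlinks if urlparse(u).netloc == src_host]

links = ["http://domain.com/a", "http://other.com/b"]
print(keep_internal("http://domain.com/index.html", links))
# ['http://domain.com/a']  -- the external link is dropped
```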

Also keep the iteration count of the crawl command at 1; then Nutch will crawl only the URLs present in the seed.txt file.

e.g.

bin/crawl -i -D solr.server.url=<solrUrl> <seed-dir> <crawl-dir> 1

Also, you can restrict the outlinks by configuring the regex-urlfilter.txt file in the conf directory:

# accept only URLs from this domain
+^http://domain\.com
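The regex-urlfilter.txt rules are regular expressions applied to each candidate URL: a line starting with `+` accepts matching URLs, a line starting with `-` rejects them, and the first matching rule wins. A rough Python sketch of that matching logic (hypothetical, for illustration only; not Nutch's implementation):

```python
import re

# Hypothetical mini-version of the regex URL filter logic:
# rules are (sign, pattern) pairs; the first matching rule decides.
rules = [
    ("+", re.compile(r"^http://domain\.com")),  # accept this domain
    ("-", re.compile(r".")),                    # reject everything else
]

def accept(url):
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: reject by default

print(accept("http://domain.com/page.html"))  # True
print(accept("http://other.com/page.html"))   # False
```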
