
How to set Nutch to extract content of only the URLs present in the seed file

I am using Nutch 2.3 and I am trying to get the HTML content of some URLs listed in the seed.txt file that I pass to Nutch, and store it into HBase.

The problem is as follows:

First crawl: everything runs fine and I get the data into HBase with the URL as the row key.

Second run: when I run the crawl a second time with different URLs, I see the fetch job running for many URLs, even though I have only one URL in my seed file.

So my question is: how can I make sure that Nutch only crawls and fetches the HTML content of the URLs present in seed.txt, and not the outlinks found in the HTML content of those URLs?

I think you want to fetch only the domains given in the seed file. For that, update nutch-site.xml as follows:

  <property>
   <name>db.ignore.external.links</name>
   <value>true</value>
  </property>
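For context, the effect of `db.ignore.external.links=true` can be sketched like this (a hypothetical Python illustration of the behavior, not Nutch's actual code): outlinks whose host differs from the page they were found on are dropped before they reach the crawl DB.

```python
from urllib.parse import urlparse

def keep_internal(source_url, outlinks):
    """Sketch of db.ignore.external.links=true: drop outlinks
    whose host differs from the page they were found on."""
    src_host = urlparse(source_url).netloc
    return [u for u in outlinks if urlparse(u).netloc == src_host]

links = ["http://domain.com/a", "http://other.com/b"]
print(keep_internal("http://domain.com/index.html", links))
# ['http://domain.com/a']  -- the external link is dropped
```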

Also keep the iteration count of the crawl command at 1; then Nutch will crawl only the URLs present in the seed.txt file.

e.g.

bin/crawl -i -D solr.server.url=<solrUrl> <seed-dir> <crawl-dir> 1

Also, you can restrict the outlinks by configuring the regex-urlfilter.txt file in the conf directory:

# accept only URLs from this domain
+^http://domain\.com
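The regex-urlfilter.txt rules are regular expressions applied to each candidate URL: a line starting with `+` accepts matching URLs, a line starting with `-` rejects them, and the first matching rule wins. A rough Python sketch of that matching logic (hypothetical, for illustration only; not Nutch's implementation):

```python
import re

# Hypothetical mini-version of the regex URL filter logic:
# rules are (sign, pattern) pairs; the first matching rule decides.
rules = [
    ("+", re.compile(r"^http://domain\.com")),  # accept this domain
    ("-", re.compile(r".")),                    # reject everything else
]

def accept(url):
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: reject by default

print(accept("http://domain.com/page.html"))  # True
print(accept("http://other.com/page.html"))   # False
```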
