Is it possible to define a specific file type that will be crawled?
I'm trying to work with the regex-urlfilter.txt file, but I only see how to specify which types NOT to crawl.
Is it possible to specify that I want to crawl only, say, .doc files?
In the $NUTCH_HOME/conf/regex-urlfilter.txt file, delete the existing regex patterns and paste this:
+\.doc$
-.
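This allows only URLs ending in .doc to be crawled and filters out all other URLs. Rules are evaluated top to bottom and the first matching pattern decides, so the +\.doc$ line must come before the catch-all -. line.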
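To see which URLs these two rules would let through, here is a minimal Python sketch of first-match-wins evaluation. It is only an illustration, not Nutch's actual code: the `RULES` list and the `accepts` helper are hypothetical names, the example URLs are made up, and it assumes each pattern is tested with a "find anywhere in the URL" search.

```python
import re

# Hypothetical stand-in for the two lines pasted into regex-urlfilter.txt.
# Each entry is (sign, compiled pattern); order matters.
RULES = [
    ("+", re.compile(r"\.doc$")),  # accept URLs that end in .doc
    ("-", re.compile(r".")),       # reject anything else
]


def accepts(url: str) -> bool:
    """Return True if the first matching rule is an accept (+) rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    # Assumption: a URL that matches no rule is treated as rejected.
    return False


if __name__ == "__main__":
    for url in (
        "http://example.com/files/report.doc",
        "http://example.com/files/report.docx",
        "http://example.com/files/report.doc?download=1",
        "http://example.com/index.html",
    ):
        print(url, "->", "crawl" if accepts(url) else "skip")
```

In this sketch only the first URL passes. The .docx link and the link with a query string are skipped because the pattern is anchored to the end of the URL, so the pattern may need widening if such links should be crawled too.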