
nutch - how to crawl a specific file type?

Is it possible to define a specific file type that will be crawled?

I'm trying to work with the regex-urlfilter.txt file, but I only see how to specify which file types NOT to crawl.

Is it possible to specify that I want to crawl only, say, .doc files?

In the $NUTCH_HOME/conf/regex-urlfilter.txt file, delete the existing regex patterns and paste this:

+\.doc$
-.

The rules are checked from top to bottom and the first matching one decides, so this allows only URLs ending in .doc to be crawled and filters out all other URLs.
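To see how these two rules behave, here is a minimal Python sketch that approximates the first-match-wins evaluation of regex-urlfilter.txt. It is not Nutch code: `parse_rules` and `accepts` are hypothetical helper names, the example.com URLs are made up, and Python's `re` stands in for the Java regex engine (the two agree on these simple patterns).

```python
import re

# The two rules from the answer, one per line: '+' accepts, '-' rejects.
RULES_TEXT = r"""
+\.doc$
-.
"""

def parse_rules(text):
    """Turn rule lines into (accept, compiled_regex) pairs, skipping blanks and comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Return the verdict of the first rule whose regex is found in the URL."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    # Assumption in this sketch: a URL that matches no rule is rejected.
    return False

rules = parse_rules(RULES_TEXT)

for url in (
    "http://example.com/files/report.doc",
    "http://example.com/files/report.docx",
    "http://example.com/index.html",
):
    print(url, "->", "crawl" if accepts(rules, url) else "skip")
```

Under these assumptions only the first URL passes. Note that ordinary HTML pages are rejected as well, so it is worth checking that the pages linking to your .doc files can still be reached by the crawl.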
