nutch - how to crawl a specific file type?

Original 2012-01-23 12:51:15 8 1 java/ nutch

Is it possible to define a specific file type that will be crawled?

I'm trying to work with the regex-urlfilter.txt file, but I can only see how to specify which types NOT to crawl.

Is it possible to specify that I want to crawl only, say, .doc files?

1 answer

In the $NUTCH_HOME/conf/regex-urlfilter.txt file, delete the existing regex patterns and paste this:

+\.doc$
-.

This allows only URLs ending in .doc to be crawled and filters out all other URLs.
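
To see why these two lines behave that way, here is a minimal sketch of the rule evaluation in plain Java. It assumes the semantics described in the comments of Nutch's default regex-urlfilter.txt: rules are tried in order, the first pattern that matches decides, and a URL that matches nothing is ignored. The class name UrlFilterSketch and the example.com URLs are hypothetical; this is not Nutch's own filter class, only an illustration built on java.util.regex.

```java
import java.util.regex.Pattern;

// Hypothetical helper that mimics first-match-wins evaluation of the two rules above.
class UrlFilterSketch {
    // Same rules as in the answer: '+' means accept, '-' means reject.
    static final String[] RULES = { "+\\.doc$", "-." };

    static boolean accepts(String url) {
        for (String rule : RULES) {
            boolean accept = rule.charAt(0) == '+';
            Pattern pattern = Pattern.compile(rule.substring(1));
            // find() looks for the pattern anywhere in the URL (assumed to match Nutch's behavior).
            if (pattern.matcher(url).find()) {
                return accept; // the first matching rule decides
            }
        }
        return false; // no rule matched: the URL is ignored
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/files/report.doc",
            "http://example.com/files/report.docx",
            "http://example.com/index.html"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + accepts(url));
        }
    }
}
```

In this sketch only the first URL is accepted. The .docx URL fails because `$` anchors the match to the end of the URL, and the ordinary HTML page is rejected by the catch-all `-.` rule as well.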

