
Apache Nutch: No URLs to fetch - check your seed list and URL filters

I'm using Nutch 1.2. When I run the crawl command below, I get the output that follows it:

bin/nutch crawl urls -dir crawl -depth 2 -topN 1000

Injector: starting at 2011-07-11 12:18:37
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-07-11 12:18:44, elapsed: 00:00:07
Generator: starting at 2011-07-11 12:18:45
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 1000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
**No URLs to fetch - check your seed list and URL filters.**
crawl finished: crawl

The problem is that it keeps complaining: "No URLs to fetch - check your seed list and URL filters."

I have a list of URLs to crawl in the nutch_root/urls/nutch file, and my crawl-urlfilter.txt is also set up.

Why would it complain about my URL list and filters? It never did this before.

Here is my crawl-urlfilter.txt:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.


# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*152.111.1.87/
+^http://([a-z0-9]*\.)*152.111.1.88/

# skip everything else
-.

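Your accept rules look odd to me, and I don't think they match your URLs as intended: the ([a-z0-9]*\.)* prefix is the template's subdomain pattern, which an IP address does not need, and the dots in the address are unescaped, so each one matches any character. Something like this should work better: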

+^http://152\.111\.1\.87/
+^http://152\.111\.1\.88/
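
To see which seed URLs a rule set lets through, the rules can be replayed outside Nutch. The following is only a rough Python sketch, not Nutch's own code. It assumes the filter applies the first rule whose pattern is found anywhere in the URL and rejects a URL that no rule matches, and it skips any URL normalization Nutch may apply first. The helper names (parse_rules, accepts, build_rules) and the sample URLs are hypothetical.

```python
import re

# Reject rules copied from the question's crawl-urlfilter.txt.
REJECT_FIRST = [
    r"-^(file|ftp|mailto):",
    r"-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls"
    r"|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$",
    r"-.*(/[^/]+)/[^/]+\1/[^/]+\1/",
]

# Accept rules as written in the question.
QUESTION_ACCEPT = [
    r"+^http://([a-z0-9]*\.)*152.111.1.87/",
    r"+^http://([a-z0-9]*\.)*152.111.1.88/",
]

# Accept rules as suggested in the answer.
SUGGESTED_ACCEPT = [
    r"+^http://152\.111\.1\.87/",
    r"+^http://152\.111\.1\.88/",
]


def parse_rules(lines):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def build_rules(accept_lines):
    """Hypothetical helper: reject rules, then accept rules, then '-.'."""
    return parse_rules(REJECT_FIRST + accept_lines + ["-."])


def accepts(url, rules):
    """Assumed semantics: the first matching rule decides; no match rejects."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    # Hypothetical seed URLs, chosen to exercise the rules.
    seeds = [
        "http://152.111.1.87/",
        "http://152.111.1.87",
        "https://152.111.1.87/",
        "http://152x111y1z87/",
    ]
    for name, accept_lines in (("question", QUESTION_ACCEPT),
                               ("suggested", SUGGESTED_ACCEPT)):
        rules = build_rules(accept_lines)
        for url in seeds:
            print(name, url, accepts(url, rules))
```

Under these assumptions, a seed written without the trailing slash or with an https scheme is rejected by both rule sets, so the exact form of each line in the seed file is worth checking alongside the filters.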
