I'm trying to run the bin/crawl script provided in Nutch 1.6, which performs all of the manual steps below required to go off and spider a site.
When I run these steps manually, everything works fine and my page is indexed as expected (albeit only one page, but I'll look into that separately):
Created a text file containing a URL at seeds/urls.txt, then ran:
bin/nutch inject crawl_test/crawldb seeds/
bin/nutch generate crawl_test/crawldb crawl_test/segments
export SEGMENT=crawl_test/segments/`ls -tr crawl_test/segments|tail -1`
bin/nutch fetch $SEGMENT -noParsing
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl_test/crawldb $SEGMENT -filter -normalize
bin/nutch invertlinks crawl_test/linkdb -dir crawl_test/segments
bin/nutch solrindex http://dev:8080/solr/ crawl_test/crawldb -linkdb crawl_test/linkdb crawl_test/segments/*
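The trickiest of the steps above is picking the newest segment directory with ls -tr | tail -1. That logic can be factored into a small helper; a minimal sketch (the function name newest_segment is my own, not part of Nutch):

```shell
#!/bin/sh
# Pick the newest segment directory under a segments/ parent.
# Mirrors the `ls -tr ... | tail -1` step from the walkthrough above;
# newest_segment is an illustrative helper, not a Nutch command.
newest_segment() {
    segments_dir="$1"
    # ls -t sorts by modification time newest-first; -r reverses that,
    # so tail -1 returns the most recently modified entry.
    echo "$segments_dir/$(ls -tr "$segments_dir" | tail -1)"
}

# Usage: SEGMENT=$(newest_segment crawl_test/segments)
```

Note this returns the full path (prefix included), which matters for the error discussed below.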
The bin/crawl script gives this error:

Indexing 20130412115759 on SOLR index -> someurl:8080/solr/
SolrIndexer: starting at 2013-04-12 11:58:47
SolrIndexer: deleting gone documents: false
SolrIndexer: URL filtering: false
SolrIndexer: URL normalizing: false
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/nutch/20130412115759/crawl_fetch
Input path does not exist: file:/opt/nutch/20130412115759/crawl_parse
Input path does not exist: file:/opt/nutch/20130412115759/parse_data
Input path does not exist: file:/opt/nutch/20130412115759/parse_text
Any idea why this script isn't working? I think it must be an error in the script itself rather than in my config, since the path it is looking for doesn't exist, and I'm not sure why it would even be looking there.
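For context on why that path appears: if a script captures only the segment's basename and passes it to solrindex, Hadoop resolves the bare name against the current working directory (here /opt/nutch), producing exactly this kind of nonexistent path. A sketch of the difference, with illustrative directory names:

```shell
#!/bin/sh
# Demonstrates how a bare segment name resolves against the cwd
# instead of the segments/ directory. Paths are illustrative only.
mkdir -p crawl_test/segments/20130412115759

# Bare basename, without the segments/ prefix:
SEGMENT=$(ls crawl_test/segments | tail -1)   # -> "20130412115759"

# Relative to the cwd this points at ./20130412115759, which does
# not exist -- the same failure mode as the Hadoop error above:
echo "bare: $(pwd)/$SEGMENT"

# Prefixing the segments directory yields the real path:
echo "full: $(pwd)/crawl_test/segments/$SEGMENT"
```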
It looks like there was a bug in the bin/crawl script:
- $bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb -linkdb $CRAWL_PATH/linkdb $SEGMENT
+ $bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb -linkdb $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT