
How or where to run $ ./nutch inject crawl/crawldb urls

Original post 2017-03-08 09:38:57 9 1 apache/ hadoop/ solr/ nutch

I am new to Nutch and I want to crawl a website. I am using Nutch 1.12 and I blindly followed the steps mentioned here.

I downloaded apache-nutch-1.12-bin.zip and unzipped it. Using Cygwin, I am trying to crawl my first website. I just followed the steps on the above page.

I created a directory called urls, and inside it I created seed.txt with http://nutch.apache.org/ in it.
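For reference, the seed setup described above can be sketched as follows. The scratch directory created with mktemp is an assumption made only so the sketch does not touch a real installation; in practice the urls directory would sit wherever the inject command is run from.

```shell
# Work in a throwaway directory (an assumption for this sketch).
workdir="$(mktemp -d)"
cd "$workdir"

# Create the seed directory and a seed file with one URL per line.
mkdir urls
printf '%s\n' 'http://nutch.apache.org/' > urls/seed.txt

# Show what the injector would be asked to read.
cat urls/seed.txt
```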

Now I want to run the command bin/nutch inject crawl/crawldb urls, but I get the exception below.

Chola@BNDA000000615 /cygdrive/c/Airbus/apache-nutch-1.12/bin
$ ./nutch inject crawl/crawldb urls
Injector: starting at 2017-03-08 14:31:17
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: crawl
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:409)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:413)
    at org.apache.hadoop.fs.ChecksumFileSystem.mkdirs(ChecksumFileSystem.java:584)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:350)
    at org.apache.nutch.crawl.Injector.run(Injector.java:467)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.Injector.main(Injector.java:441)
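The message "Parent path is not a directory: crawl" says that something named crawl already exists in the current directory and is not a directory. The prompt shows the command being run from inside bin, and the Nutch 1.x binary distribution ships a script named crawl next to nutch there, so that is one likely cause. The sketch below reproduces the same condition with plain mkdir and no Nutch at all; the scratch layout (a bin directory holding an empty file named crawl) is a stand-in assumed for illustration, not a copy of the real install.

```shell
demo="$(mktemp -d)"

# Stand-in for the bin directory: a regular file named "crawl".
mkdir -p "$demo/bin"
: > "$demo/bin/crawl"

# Creating crawl/crawldb here fails because "crawl" is a file, not a directory.
if mkdir -p "$demo/bin/crawl/crawldb" 2>/dev/null; then
  in_bin="created"
else
  in_bin="failed"
fi

# One level up there is no file named "crawl", so the same relative path works.
if mkdir -p "$demo/crawl/crawldb" 2>/dev/null; then
  in_root="created"
else
  in_root="failed"
fi

echo "inside bin: $in_bin, install root: $in_root"
```

If that is what happened here, running the command exactly as written in the question (bin/nutch inject crawl/crawldb urls) from the apache-nutch-1.12 directory, with urls created there as well, should avoid the clash.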

Could someone please help me resolve this issue?

1 answer

I had the same problem.

Did you create the urls directory with the "-p" argument?

When I did, the problem was solved.
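The "-p" argument presumably refers to mkdir -p. A small sketch of what that flag changes, using scratch paths that are assumptions for illustration only:

```shell
scratch="$(mktemp -d)"
cd "$scratch"

# Without -p, mkdir will not create a missing parent directory.
if mkdir a/b 2>/dev/null; then plain="created"; else plain="failed"; fi

# With -p, missing parents are created along the way ...
mkdir -p crawl/crawldb

# ... and asking for a directory that already exists is not an error.
mkdir -p urls
if mkdir -p urls 2>/dev/null; then second="ok"; else second="failed"; fi

echo "plain mkdir a/b: $plain, repeated mkdir -p urls: $second"
```

Note that -p only helps when the path components are missing or are directories; it does not replace an existing regular file with a directory.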

Good luck.

bin/nutch inject crawl/crawldb urls not working

how to inject urls found during crawl into nutch seed list

How to update the fetch status in crawldb in apache nutch?

Apache nutch inject urls

Is there anyway to log the list of urls 'ignored' in Nutch crawl?

Nutch 1.4 with Solr 3.4 - can't crawl URL, “no URLs to fetch”

how to crawl particular website using Apache Nutch?

How to allow apache nutch to crawl forever

How to Crawl .pdf links using Apache Nutch

Apache Nutch 1.x injection crawldb error




How or where to run $ ./nutch inject crawl/crawldb urls


solution1 0 2017-03-10 03:38:46
