
How or where to run $ ./nutch inject crawl/crawldb urls

I am new to Nutch and want to crawl a website. I am using Nutch 1.12 and blindly followed the steps mentioned here.

I downloaded apache-nutch-1.12-bin.zip and unzipped it. Using Cygwin, I am trying to crawl my first website, following the steps on the above page.

I created a directory called urls, and inside it a file seed.txt containing http://nutch.apache.org/.

Now I want to execute the command bin/nutch inject crawl/crawldb urls, but I get the exception below.

Chola@BNDA000000615 /cygdrive/c/Airbus/apache-nutch-1.12/bin
$ ./nutch inject crawl/crawldb urls
Injector: starting at 2017-03-08 14:31:17
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: crawl
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:409)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:413)
    at org.apache.hadoop.fs.ChecksumFileSystem.mkdirs(ChecksumFileSystem.java:584)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:350)
    at org.apache.nutch.crawl.Injector.run(Injector.java:467)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.Injector.main(Injector.java:441)

Could someone please help me resolve this issue?

I had the same problem.

Did you create the urls directory with the "-p" argument (as in mkdir -p)?

When I did, the problem was solved.
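One possible reading of the stack trace, offered as an assumption rather than something confirmed in this thread: the prompt shows the command being run from inside apache-nutch-1.12/bin, and that directory ships a script named crawl. The relative path crawl/crawldb would then resolve to bin/crawl/crawldb, whose parent is a regular file, which matches "Parent path is not a directory: crawl". The sketch below reproduces that situation with plain shell commands and no Nutch at all; the temporary directory and the empty crawl file are hypothetical stand-ins for the real install.

```shell
#!/bin/sh
# Sketch only: mimic the layout of an unpacked Nutch install in a temp dir.
demo=$(mktemp -d)            # hypothetical stand-in for apache-nutch-1.12/
mkdir -p "$demo/bin"
: > "$demo/bin/crawl"        # stand-in for the bin/crawl script (a regular file)

# Case 1: working directory is bin/, so "crawl" resolves to the script file
# and a directory cannot be created underneath it.
cd "$demo/bin"
if mkdir -p crawl/crawldb 2>/dev/null; then
    in_bin="ok"
else
    in_bin="failed"
fi

# Case 2: working directory is the install root, where no "crawl" exists yet.
cd "$demo"
mkdir -p urls                                   # -p: no error if it already exists
echo "http://nutch.apache.org/" > urls/seed.txt
if mkdir -p crawl/crawldb 2>/dev/null; then
    in_root="ok"
else
    in_root="failed"
fi

echo "from bin/: $in_bin"
echo "from root: $in_root"
```

If this reading is right, the equivalent with a real install would be to change to the apache-nutch-1.12 directory, create urls/seed.txt there, and run bin/nutch inject crawl/crawldb urls from that directory instead of ./nutch from inside bin.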

Best of luck.
