
Apache Nutch 1.x injection crawldb error

Have tried googling the issue but can't find anything useful.

Following the tutorial at https://wiki.apache.org/nutch/NutchTutorial

Verified Nutch with bin/nutch and it is fine

Have Java 8 installed

java -version returns
java version "1.8.0_05"
Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)

And included it in the PATH using export

export JAVA_HOME="/cygdrive/c/program files/java/jre8"
export PATH="$JAVA_HOME/bin:$PATH"

Note: I am on Windows, hence using Cygwin64 as well.

Created the directory urls and added the file seed.txt with one URL

Then ran

bin/nutch inject crawl/crawldb urls/seed.txt

and got the following error:

Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls/seed.txt
Injector: Converting injected urls to crawl db entries.
Injector: java.io.IOException: lock file crawl/crawldb/.locked already exists.

Hi, there are two parts to this problem:

1. There is already a .locked file present in the crawldb folder. Just delete the .locked file (see the sketch after this list).

2. Set the system environment variable Path to include both %JAVA_HOME%\bin and %HADOOP_HOME%\bin, and also set the user environment variables JAVA_HOME and HADOOP_HOME (without \bin).
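
Assuming the standard local Nutch 1.x layout from the question (crawl/crawldb and urls/seed.txt are the question's own paths), a minimal way to clear the stale lock and retry could be:

# remove the stale lock left behind by a crashed or killed job
rm crawl/crawldb/.locked

# re-run the injection
bin/nutch inject crawl/crawldb urls/seed.txt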

The error message is quite clear: either another Nutch job holds a lock on the CrawlDb, or a previous job crashed or was killed before the lock file could be removed on successful completion. Deleting the lock file crawl/crawldb/.locked should solve the problem. But it's also good practice to look into the log files (especially hadoop.log) to find out why the lock file wasn't removed.
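
If the cause isn't obvious, a quick way to inspect the log is shown below. In a local Nutch 1.x run the log normally ends up under logs/hadoop.log relative to the Nutch home, but that path is an assumption here and may differ in your setup:

# show the most recent log output (path assumed; adjust to your install)
tail -n 100 logs/hadoop.log

# or filter for injector-related messages
grep -i injector logs/hadoop.log | tail -n 20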
