
Apache Nutch Hadoop Integration

I configured apache-nutch-1.15 and Hadoop to run in deploy mode, following the tutorial at https://wiki.apache.org/nutch/NutchHadoopTutorial

But when I tried to run the command below:

hadoop jar apache-nutch-${version}.job org.apache.nutch.crawl.Crawl urls -dir crawl -depth 3 -topN 5

I got the following exception:

Exception in thread "main" java.lang.ClassNotFoundException: org.apache.nutch.crawl.Crawl
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:214)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

The class org.apache.nutch.crawl.Crawl is not there in Nutch v1.15, but it is present in Nutch v1.7.

Please help me with this

The Apache Nutch documentation for crawling to HDFS has not been updated since 2014. Recent versions of Apache Nutch no longer have a class named org.apache.nutch.crawl.Crawl.

To run Apache Nutch, follow the tutorial for crawling to the local file system (https://wiki.apache.org/nutch/NutchTutorial). Choose "Option 2: Set up Nutch from a source distribution" in that tutorial; once it is built, the runtime directory contains a deploy folder (deploy mode is the one that writes its data to Hadoop).

Go to the deploy folder and run the same commands the tutorial lists for local mode, replacing every local path with an HDFS path.
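
As a rough sketch of what that can look like, the script below uploads a seed list to HDFS and then starts the bin/crawl script from runtime/deploy. The names seed.txt, urls and crawl, and the round count of 3, are assumptions carried over from the question, and DRY_RUN and run are hypothetical helpers added so that the script only prints the commands by default. The bin/crawl options have changed between releases, so run bin/crawl without arguments to check the usage for your version first.

```shell
#!/bin/sh
# Sketch only: run from the runtime/deploy folder of a Nutch source build.
# Assumed names (not from the Nutch docs): seed.txt, urls, crawl, 3 rounds.

# Hypothetical switch: 1 prints the commands, 0 actually executes them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

SEED_DIR="urls"     # HDFS directory holding the seed list
CRAWL_DIR="crawl"   # HDFS directory for crawldb, linkdb and segments
ROUNDS=3            # plays the role of -depth 3 in the old Crawl command

# Relative HDFS paths resolve under /user/<your-user>.
run hdfs dfs -mkdir -p "$SEED_DIR"
run hdfs dfs -put seed.txt "$SEED_DIR/"

# bin/crawl replaces the removed org.apache.nutch.crawl.Crawl class.
run bin/crawl -s "$SEED_DIR" "$CRAWL_DIR" "$ROUNDS"
```

Once the printed commands look right, set DRY_RUN=0 to execute them. The same pattern applies to the individual bin/nutch steps from the tutorial if you prefer to run them one at a time.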
