Nutch error in Eclipse

I'm trying to run Apache Nutch from Eclipse. I followed the instructions at http://wiki.apache.org/nutch/RunNutchInEclipse. However, the sources of "parse-html" (both java and test) have errors. I ran it anyway; it reads and fetches the URLs from seed.txt and then returns this error:

Fetcher: finished at 2012-03-31 17:21:56, elapsed: 00:00:07
ParseSegment: starting at 2012-03-31 17:21:56
ParseSegment: segment: crawl/segments/20120331172142
Exception in thread "main" java.io.IOException: Job failed!

I would like to point out that my goal is to get indexes from Nutch and store them in MongoDB.

Add the following to ivy.xml:

<dependency org="rome" name="rome" rev="0.9" />
<dependency org="net.sourceforge.nekohtml" name="nekohtml" rev="1.9.13" />
<dependency org="org.ccil.cowan.tagsoup" name="tagsoup" rev="1.2.1" />

I ran into the same problem. Here are two ways that might help:

  • Modify the conf/log4j.properties file to report DEBUG messages;
  • Read the hadoop.log file, which is usually located in $NUTCH_HOME or $NUTCH_HOME/logs.

By examining these messages, you should be able to spot the problem.
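If hadoop.log is long, a small filter can pull out the lines worth reading first. The following is only a rough sketch: the class name HadoopLogScan, the default path logs/hadoop.log, and the idea that problems show up as " ERROR " / " FATAL " entries or exception lines are my assumptions about a typical log4j layout, not something Nutch guarantees.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch or Hadoop.
final class HadoopLogScan {

    // Keeps lines that look like problems: ERROR/FATAL level entries,
    // lines naming an exception, and "Caused by:" lines of stack traces.
    static List<String> findProblems(List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            if (line.contains(" ERROR ") || line.contains(" FATAL ")
                    || line.contains("Exception") || line.startsWith("Caused by:")) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location; pass the real path as the first argument.
        Path log = Paths.get(args.length > 0 ? args[0] : "logs/hadoop.log");
        for (String hit : findProblems(Files.readAllLines(log, StandardCharsets.UTF_8))) {
            System.out.println(hit);
        }
    }
}
```

Run it from $NUTCH_HOME, or pass the path to hadoop.log explicitly. A ClassNotFoundException or NoClassDefFoundError in the output usually points to a missing jar.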

Here is a tutorial on running Nutch in Eclipse, which also covers how to handle several common errors.

I found 3 jars and added them to the project as external jars, and it worked. Those jars are cyberneko.jar, rome-0.9.jar and tagsoup-1.2.jar, and you can find all of them with a simple Google search.
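To confirm that the jars actually ended up on the Eclipse project's classpath, you can probe for one class from each. This is a minimal sketch: ParserJarCheck is a made-up name, and the three probe class names are the ones I believe these libraries ship (treat them as assumptions and adjust them if your jar versions differ).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch.
final class ParserJarCheck {

    // Returns true if the named class can be loaded, without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ParserJarCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed representative classes for each jar.
        Map<String, String> probes = new LinkedHashMap<>();
        probes.put("cyberneko / nekohtml", "org.cyberneko.html.parsers.DOMFragmentParser");
        probes.put("rome 0.9", "com.sun.syndication.io.SyndFeedInput");
        probes.put("tagsoup", "org.ccil.cowan.tagsoup.Parser");

        for (Map.Entry<String, String> probe : probes.entrySet()) {
            String status = isOnClasspath(probe.getValue()) ? "found" : "MISSING";
            System.out.println(probe.getKey() + ": " + status);
        }
    }
}
```

Run it as a Java application inside the same Eclipse project so that it sees the same build path as Nutch.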
