I'm trying to run Apache Nutch from Eclipse. I followed the instructions at http://wiki.apache.org/nutch/RunNutchInEclipse. However, the sources of "parse-html" (both java and test) have errors. I ran it anyway; it reads and fetches the URLs from seed.txt and then fails with this error:
Fetcher: finished at 2012-03-31 17:21:56, elapsed: 00:00:07
ParseSegment: starting at 2012-03-31 17:21:56
ParseSegment: segment: crawl/segments/20120331172142
Exception in thread "main" java.io.IOException: Job failed!
I would like to point out that my goal is to get indexes from Nutch and store them in MongoDB.
Add the following to ivy.xml:
<dependency org="rome" name="rome" rev="0.9" />
<dependency org="net.sourceforge.nekohtml" name="nekohtml" rev="1.9.13" />
<dependency org="org.ccil.cowan.tagsoup" name="tagsoup" rev="1.2.1" />
I ran into the same problem. Here are two ways that might help:
By examining these messages, you should be able to spot the problem.
Here is a tutorial on running Nutch in Eclipse, which also covers how to handle several common errors.
I found three jars, added them to the project as external jars, and it worked. The jars are cyberneko.jar, rome-0.9.jar and tagsoup-1.2.jar; you can find all of them with a simple Google search.
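To check whether those libraries are actually visible to the Eclipse project before re-running the crawl, a small probe like the one below may help. It is only a sketch: the class name ClasspathCheck is made up for illustration, and the three probed class names are my assumption about what NekoHTML, TagSoup and ROME 0.9 ship, so adjust them if your jar versions differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: reports whether the parser
// libraries mentioned above can be found on the current classpath.
public class ClasspathCheck {

    // Returns true if the named class can be located by this class's loader.
    public static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Library label -> a class assumed to live in that jar
        Map<String, String> probes = new LinkedHashMap<>();
        probes.put("nekohtml", "org.cyberneko.html.parsers.DOMFragmentParser");
        probes.put("tagsoup", "org.ccil.cowan.tagsoup.Parser");
        probes.put("rome", "com.sun.syndication.io.SyndFeedInput");

        for (Map.Entry<String, String> probe : probes.entrySet()) {
            String status = isOnClasspath(probe.getValue()) ? "found" : "MISSING";
            System.out.println(probe.getKey() + ": " + status);
        }
    }
}
```

Run it as a Java application inside the same Eclipse project, so that it uses the same build path as the crawl. Any line reporting MISSING points at a jar that still needs to be added, either through ivy.xml or as an external jar.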