Nutch deployment on Hadoop does not index to Solr

Question

I have an Oozie workflow, designed in Hue, that runs a Nutch crawl.

All steps in the process work, except for indexing to Solr.

The Oozie action that defines the solrindex step is as follows:

```xml
<start to="solr-test"/>
<action name="solr-test">
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <main-class>org.apache.nutch.indexer.IndexingJob</main-class>
        <java-opts>solr.server.url=http://ip-redacted:8983/solr/raw</java-opts>
        <arg>hdfs://ip-redacted:8020/user/admin/c</arg>
        <arg>-dir</arg>
        <arg>hdfs://ip-redacted:8020/user/admin/s000</arg>
    </java>
    <ok to="end"/>
    <error to="kill"/>
</action>
<kill name="kill">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
```

When I run the action, I get the following error message:

Main class [org.apache.oozie.action.hadoop.JavaMain], exit code [-1]

The paths hdfs://ip-redacted:8020/user/admin/c and hdfs://ip-redacted:8020/user/admin/s000 contain the crawldb and the segments, respectively.

The stderr of the job says:

```
Log Length: 122
Intercepting System.exit(-1)
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.JavaMain], exit code [-1]
```

The syslog says:

```
ERROR [main] org.apache.nutch.indexer.IndexingJob: Indexer: java.lang.RuntimeException: org.apache.nutch.indexer.IndexWriter not found.
    at org.apache.nutch.indexer.IndexWriters.<init>(IndexWriters.java:51)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:100)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:185)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.oozie.action.hadoop.JavaMain.run(JavaMain.java:55)
    at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:38)
    at org.apache.oozie.action.hadoop.JavaMain.main(JavaMain.java:36)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:225)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
```

I have verified that the class exists in the apache-nutch-1.7.jar file.

And if I ask Hadoop to run it as a MapReduce job from the command shell, as follows:

`hadoop jar apache-nutch-1.7.jar org.apache.nutch.indexer.IndexingJob -D solr.server.url=http://ip-redacted:8983/solr/raw hdfs://ip-redacted:8020/user/admin/c -dir hdfs://ip-redacted:8020/user/admin/s000`

It works! But when I run it as an Oozie job created through Hue, it fails.

The other actions, such as inject, generate, fetch, and parse, work fine in Hue. Only the solrindex step fails, and I don't know how to fix it. Any input on this would be great!

Answer 1

Did you put the Nutch jar (and its dependencies, if needed) in the "lib" directory of the workflow's HDFS workspace?
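For illustration, staging the jar in the workspace might look like the sketch below. `WORKFLOW_DIR` is a hypothetical workspace path (Hue shows the real one in the workflow's properties), and `stage_jars` and `run` are helper names made up for this example. Only `hdfs dfs -mkdir -p` and `hdfs dfs -put -f` are actual Hadoop commands. Oozie adds jars found in the workflow application's `lib` directory to the classpath of its actions.

```shell
#!/bin/sh
# Sketch only: WORKFLOW_DIR is a hypothetical HDFS workspace path;
# replace it with the path of your own Oozie workflow application.
WORKFLOW_DIR="${WORKFLOW_DIR:-/user/admin/workflows/nutch-crawl}"

# Run a command, or only print it when DRY_RUN=1 (useful for review).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Copy the given jar files into <workflow>/lib on HDFS.
stage_jars() {
    run hdfs dfs -mkdir -p "$WORKFLOW_DIR/lib"
    for jar in "$@"; do
        # -f overwrites a copy left over from an earlier attempt.
        run hdfs dfs -put -f "$jar" "$WORKFLOW_DIR/lib/"
    done
}
```

Calling `stage_jars apache-nutch-1.7.jar` and then `hdfs dfs -ls "$WORKFLOW_DIR/lib"` should show the jar in place. Setting `DRY_RUN=1` first prints the commands without touching HDFS.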

Answer 2

Ah, I'm beginning to loathe the packaging of Nutch!

Try extracting the classes/plugins folder from the job archive and copying it to HDFS (something like `hdfs dfs -put plugins lib`; `-put` copies directories recursively, so no `-r` flag is needed). Then add the HDFS path of the plugins folder to the "files" list of the indexing step.

Best, Edoardo
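The extraction and upload could be sketched roughly as follows. Everything named here is an assumption for illustration: `JOB_ARCHIVE` assumes the archive is called apache-nutch-1.7.job, `WORKFLOW_DIR` is a hypothetical workspace path, and `stage_plugins` and `run` are made-up helper names. The sketch relies on the job archive being an ordinary zip/jar file that `unzip` can read.

```shell
#!/bin/sh
# Sketch only: both values below are assumptions; adjust them to your setup.
JOB_ARCHIVE="${JOB_ARCHIVE:-apache-nutch-1.7.job}"
WORKFLOW_DIR="${WORKFLOW_DIR:-/user/admin/workflows/nutch-crawl}"

# Run a command, or only print it when DRY_RUN=1 (useful for review).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Extract classes/plugins from the job archive and copy it to HDFS.
stage_plugins() {
    # Pull only the plugins tree out of the archive into ./nutch-extracted.
    run unzip -q -o "$JOB_ARCHIVE" "classes/plugins/*" -d nutch-extracted
    run hdfs dfs -mkdir -p "$WORKFLOW_DIR/lib"
    # hdfs dfs -put copies a directory recursively.
    run hdfs dfs -put nutch-extracted/classes/plugins "$WORKFLOW_DIR/lib/"
}
```

After `stage_plugins` runs, the folder would sit at `$WORKFLOW_DIR/lib/plugins`, and that HDFS path is the one to add to the indexing step's "files" list in Hue.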

Answer 1: 2014-05-23 06:20:31

Answer 2: 2014-09-23 08:58:48
