
Nutch deployment on Hadoop will not index to Solr

I have an Oozie workflow, designed in Hue, that runs a Nutch crawl.

All steps in the process work except for indexing to Solr.

The Oozie action that defines the solrindex step is as follows:

```xml
<start to="solr-test"/>
    <action name="solr-test">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <main-class>org.apache.nutch.indexer.IndexingJob</main-class>
            <java-opts>solr.server.url=http://ip-redacted:8983/solr/raw</java-opts>
            <arg>hdfs://ip-redacted:8020/user/admin/c</arg>
            <arg>-dir</arg>
            <arg>hdfs://ip-redacted:8020/user/admin/s000</arg>
        </java>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
```

When I run the action, I get the following error message:

Main class [org.apache.oozie.action.hadoop.JavaMain], exit code [-1]

The paths hdfs://ip-redacted:8020/user/admin/c and hdfs://ip-redacted:8020/user/admin/s000 contain the crawldb and the segments, respectively.

The stderr of the job says:

`Log Length: 122
Intercepting System.exit(-1)
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.JavaMain], exit code [-1]`

The syslog says:

`ERROR [main] org.apache.nutch.indexer.IndexingJob: Indexer: java.lang.RuntimeException: org.apache.nutch.indexer.IndexWriter not found.
at org.apache.nutch.indexer.IndexWriters.<init>(IndexWriters.java:51)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:100)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:185)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.oozie.action.hadoop.JavaMain.run(JavaMain.java:55)
at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:38)
at org.apache.oozie.action.hadoop.JavaMain.main(JavaMain.java:36)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:225)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)`

I have verified that the class exists in the apache-nutch-1.7.jar file.

And if I ask Hadoop to run it as a MapReduce job from the command shell, as follows:

`hadoop jar apache-nutch-1.7.jar org.apache.nutch.indexer.IndexingJob -D solr.server.url=http://ip-redacted:8983/solr/raw hdfs://ip-redacted:8020/user/admin/c -dir hdfs://ip-redacted:8020/user/admin/s000`

It works! But when I run it as an Oozie job created through Hue, it fails.

Also, the other actions (inject, generate, fetch, parse) work fine in Hue. Only the solrindex step fails, and I don't know how to fix it. Any input on this would be great!

Did you put the Nutch jar (and its dependencies, if needed) in the "lib" directory of the workflow's HDFS workspace?

Ah, I'm beginning to loathe the packaging of Nutch!

Try extracting the classes/plugins folder from the job archive, copy it to HDFS (something like `hdfs dfs -put plugins lib`; `-put` copies a directory recursively and takes no `-r` flag), and then add the HDFS path of the plugins folder to the "files" list of the indexing step.

Best, Edoardo
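For reference, here is a rough sketch of what adding the plugins folder to the "files" list might look like in the workflow XML that Hue generates. It reuses the action from the question unchanged and adds one `<file>` element. The path `/path/to/workspace/lib/plugins` is a placeholder, not a path from the original post, and the `#plugins` symlink name assumes Nutch is looking for a folder called `plugins` (the usual `plugin.folders` value). Whether the indexer then finds the plugins depends on your Nutch configuration and the task classpath, so treat this as a starting point rather than a confirmed fix.

```xml
<action name="solr-test">
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <main-class>org.apache.nutch.indexer.IndexingJob</main-class>
        <java-opts>solr.server.url=http://ip-redacted:8983/solr/raw</java-opts>
        <arg>hdfs://ip-redacted:8020/user/admin/c</arg>
        <arg>-dir</arg>
        <arg>hdfs://ip-redacted:8020/user/admin/s000</arg>
        <!-- Assumption: the extracted plugins folder was uploaded to this
             placeholder HDFS location. The part after "#" is the name under
             which it is linked into the task's working directory. -->
        <file>${nameNode}/path/to/workspace/lib/plugins#plugins</file>
    </java>
    <ok to="end"/>
    <error to="kill"/>
</action>
```

In the Oozie java action, `<file>` elements come after the `<arg>` elements, which is why the new line sits at the end of the `<java>` block.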
