
How to load an SR parser model file from HDFS in the mapper?

I am trying to use the CoreNLP project in a MapReduce program to find the sentiment of a large amount of text stored in HBase tables. I am using the SR parser for parsing. The model file is stored in HDFS at /user/root/englishSR.ser.gz. I have added the following line to the MapReduce application code:

 job.addCacheFile(new URI("/user/root/englishSR.ser.gz#model"));

Now, in the mapper:

 props.setProperty("parse.model", "./model");
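For reference, the mapper side is wired roughly like this (a minimal sketch; the class name, annotator list, and output types are illustrative, not my exact code):

import java.io.IOException;
import java.util.Properties;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.Text;

import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class SentimentMapper extends TableMapper<ImmutableBytesWritable, Text> {
    private StanfordCoreNLP pipeline;

    @Override
    protected void setup(Context context) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
        // "./model" is the symlink created by addCacheFile(...#model)
        props.setProperty("parse.model", "./model");
        pipeline = new StanfordCoreNLP(props);
    }

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        // annotate the cell text with pipeline.process(...) and emit the sentiment
    }
}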

I am getting edu.stanford.nlp.io.RuntimeIOException: java.io.StreamCorruptedException: invalid stream header. The pom.xml file contains:

<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.4.1</version>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.4.1</version>
    <classifier>models</classifier>
</dependency>

I have tried adding the file to resources and adding it to Maven, all resulting in GC overhead limit exceeded or Java heap space errors.
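(Presumably the heap errors just mean the JVM running the model is too small; on YARN the mapper heap is configured separately from the client JVM, roughly like the sketch below in the driver. The property names are standard Hadoop 2 settings; the values are only illustrative, not tuned.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DriverMemoryConfig {
    // Hadoop 2 (YARN) settings; values illustrative, not tuned.
    static void raiseMapperHeap(Job job) {
        Configuration conf = job.getConfiguration();
        conf.set("mapreduce.map.memory.mb", "5120");   // container size per map task
        conf.set("mapreduce.map.java.opts", "-Xmx4g"); // heap of the mapper JVM itself
    }
}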

I don't know Hadoop well, but I suspect that you're confusing CoreNLP about the compression of the SR parser model.

First, try this without using Hadoop:

java -mx4g edu.stanford.nlp.parser.shiftreduce.ShiftReduceParser -serializedPath /user/root/englishSR.ser.gz

See if that loads the parser fine. If so, it should print something like the following and exit (otherwise, it will throw an exception...).

Loading parser from serialized file edu/stanford/nlp/models/srparser/englishSR.ser.gz ... done [10.4 sec].

If that loads the parser fine, then there is nothing wrong with the model file. I think the problem is then that CoreNLP simply uses whether a file or resource name ends in ".gz" to decide whether it is gzipped, and so it wrongly interprets the line:

props.setProperty("parse.model", "./model");

as saying to load a non-gzipped model. So I would hope that one or the other of the following would work:

cd /user/root ; gunzip englishSR.ser.gz

job.addCacheFile(new URI("/user/root/englishSR.ser#model"));

props.setProperty("parse.model", "./model");

Or:

job.addCacheFile(new URI("/user/root/englishSR.ser#model.gz"));

props.setProperty("parse.model", "./model.gz");
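Either way, the name the mapper hands to CoreNLP has to match the actual compression of the cached file. If it still fails, a quick sanity check along these lines in the mapper's setup() (the helper is hypothetical) would at least confirm that the distributed-cache symlink actually materialized:

import java.io.File;
import java.io.IOException;
import java.util.Properties;

// Hypothetical helper for the mapper's setup(): fail fast if the
// distributed-cache symlink never materialized, instead of hitting a
// StreamCorruptedException deep inside CoreNLP.
class ModelCheck {
    static Properties propsWithModel() throws IOException {
        File model = new File("./model.gz");  // symlink from addCacheFile(...#model.gz)
        if (!model.exists()) {
            throw new IOException("cache symlink missing: " + model.getAbsolutePath());
        }
        Properties props = new Properties();
        props.setProperty("parse.model", "./model.gz");
        return props;
    }
}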
