How to load SR parser file in hdfs in the mapper?

I am trying to use the CoreNLP project in a MapReduce program to find the sentiment of a large number of texts stored in HBase tables. I am using the SR parser for parsing. The model file is stored in HDFS at /user/root/englishSR.ser.gz. I have added the line below to the MapReduce application code:

 job.addCacheFile(new URI("/user/root/englishSR.ser.gz#model"));

Then, in the mapper, I set:

 props.setProperty("parse.model", "./model");

I am getting edu.stanford.nlp.io.RuntimeIOException: java.io.StreamCorruptedException: invalid stream header. The pom.xml file contains:

<dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>3.4.1</version>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.4.1</version>
    <classifier>models</classifier>
</dependency>

I have also tried adding the file to the project resources and adding it through Maven; both attempts resulted in "GC overhead limit exceeded" or Java heap space errors.

I don't know Hadoop well, but I suspect that you're confusing CoreNLP about the compression of the SR parser model.

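First try this without using Hadoop, against a copy of the model file on the local filesystem and with the CoreNLP jar on the classpath: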

java -mx4g edu.stanford.nlp.parser.shiftreduce.ShiftReduceParser -serializedPath /user/root/englishSR.ser.gz

See if that loads the parser fine. If so, it should print something like the below and exit (otherwise, it will throw an exception...).

Loading parser from serialized file edu/stanford/nlp/models/srparser/englishSR.ser.gz ... done [10.4 sec].

If that loads a parser fine, then there is nothing wrong with the model file. I think the problem is then that CoreNLP simply uses whether a file or resource name ends in ".gz" to decide whether it is gzipped, and so it wrongly interprets the line:

props.setProperty("parse.model", "./model");

as saying to load a non-gzipped model. So I would hope that one or the other of the following would work. Either decompress the model and keep the link name as it is:

cd /user/root ; gunzip englishSR.ser.gz

job.addCacheFile(new URI("/user/root/englishSR.ser#model"));

props.setProperty("parse.model", "./model");

Or keep the model gzipped and give the cache link a name that ends in ".gz":

job.addCacheFile(new URI("/user/root/englishSR.ser.gz#model.gz"));

props.setProperty("parse.model", "./model.gz");
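
The suspected failure can be reproduced without CoreNLP or Hadoop. The sketch below is only an illustration of the reasoning above: GzipSuffixDemo, writeGzippedObject, loadByName and looksGzipped are hypothetical names, and loadByName imitates a loader that decides from the file name alone, which is an assumption about CoreNLP's behaviour rather than its actual code. It serializes an object through gzip, then reads the same bytes back once under a name ending in ".gz" and once under a bare name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.StreamCorruptedException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipSuffixDemo {

    /** Serializes a value through a GZIP stream, the way a ".ser.gz" file is laid out. */
    public static byte[] writeGzippedObject(Object value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            out.writeObject(value);
        }
        return bytes.toByteArray();
    }

    /** Hypothetical loader that trusts only the name: it unzips only if the name ends in ".gz". */
    public static Object loadByName(String name, byte[] data)
            throws IOException, ClassNotFoundException {
        InputStream in = new ByteArrayInputStream(data);
        if (name.endsWith(".gz")) {
            in = new GZIPInputStream(in);
        }
        // The ObjectInputStream constructor reads and validates the stream header.
        try (ObjectInputStream objects = new ObjectInputStream(in)) {
            return objects.readObject();
        }
    }

    /** Content-based check: gzip data starts with the magic bytes 0x1f 0x8b. */
    public static boolean looksGzipped(byte[] data) {
        return data.length >= 2 && (data[0] & 0xff) == 0x1f && (data[1] & 0xff) == 0x8b;
    }

    public static void main(String[] args) throws Exception {
        byte[] model = writeGzippedObject("stand-in for a parser model");
        System.out.println("looks gzipped: " + looksGzipped(model));
        System.out.println("model.gz -> " + loadByName("model.gz", model));
        try {
            loadByName("model", model);
        } catch (StreamCorruptedException e) {
            // Same exception class as in the question: gzip bytes are not a serialization header.
            System.out.println("model -> " + e);
        }
    }
}
```

If the real model behaves the same way, checking the first two bytes of the file that ./model points to inside the mapper (as looksGzipped does) would tell you which of the two fixes above applies.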
