How to Preserve Original Line Numbering in the Output of Stanford CoreNLP?

Text corpora are often distributed as large files with one document per line. For instance, I have a file with 10 million product reviews, one per line, and each review contains multiple sentences.

When processing such files with Stanford CoreNLP, using the command line, for instance

java -cp "*" -Xmx16g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma -file test.txt

the output, whether in text or XML format, numbers all sentences from 1 to n, ignoring the original line breaks that separate the documents.

I would like to keep track of the original file's line numbering (e.g. in XML format, an output tree like <original_line id=1>, then <sentence id=1>, then <token id=1>). Alternatively, I would like to reset the sentence numbering at the start of each new line in the original file.

I have tried the answer to a similar question about Stanford's POS tagger, without success. Those options do not keep track of the original line numbers.

A quick solution could be to split the original file into multiple files and then process each of them with CoreNLP and the -filelist input option. However, for large files with millions of documents, creating millions of individual files just to preserve the original line/document numbering seems inefficient.

I suppose it would be possible to modify the source code of Stanford CoreNLP, but I am unfamiliar with Java.

Any solution that preserves the original line numbering in the output would be very helpful, whether through the command line or through example Java code that achieves it.

I've dug through the code base, and I can't find a command line flag that will help you.

I wrote some sample Java code that should do the trick.

I put this in DocPerLineProcessor.java, inside the stanford-corenlp-full-2015-04-20 directory. I also added a file there called sample-doc-per-line.txt, with 4 sentences per line.

First make sure to compile:

cd stanford-corenlp-full-2015-04-20

javac -cp "*:." DocPerLineProcessor.java

Here is the command to run:

java -cp "*:." DocPerLineProcessor sample-doc-per-line.txt

The output file sample-doc-per-line.txt.xml should be in the usual XML format, but each sentence now records which line number it came from.

Here is the code:

import java.io.*;
import java.util.*;
import edu.stanford.nlp.io.*; 
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.trees.TreeCoreAnnotations.*;
import edu.stanford.nlp.util.*;

public class DocPerLineProcessor {
    public static void main (String[] args) throws IOException {
        // set up properties
        Properties props = new Properties();
        props.setProperty("annotators",
            "tokenize, ssplit, pos, lemma, ner, parse");
        // set up pipeline
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // read in a product review per line
        Iterable<String> lines = IOUtils.readLines(args[0]);
        Annotation mainAnnotation = new Annotation("");
        // add a blank list to put sentences into
        List<CoreMap> blankSentencesList = new ArrayList<CoreMap>();
        mainAnnotation.set(CoreAnnotations.SentencesAnnotation.class,blankSentencesList);
        // process each product review
        int lineNumber = 1;
        for (String line : lines) {
            Annotation annotation = new Annotation(line);
            pipeline.annotate(annotation);
            for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
                sentence.set(CoreAnnotations.LineNumberAnnotation.class,lineNumber);
                mainAnnotation.get(CoreAnnotations.SentencesAnnotation.class).add(sentence);
            }
            lineNumber += 1;
        }
        // write all sentences to one XML file; closing the writer flushes it to disk
        try (PrintWriter xmlOut = new PrintWriter(args[0]+".xml")) {
            pipeline.xmlPrint(mainAnnotation, xmlOut);
        }
    }
}

Now when I run this, the sentence tags also carry the appropriate line number. The sentences still have a global id, but you can tell which line each one came from. Because each line is annotated separately, a newline always ends a sentence.
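
If you would rather have sentence numbers that restart on each line, the global order plus the line number is enough to derive them afterwards. Here is a minimal sketch in plain Java, assuming you have already collected the line number of each sentence in output order. The class and method names are hypothetical and not part of CoreNLP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PerLineSentenceIds {
    /**
     * Given the line number of each sentence, in output order, return a
     * sentence index that restarts at 1 whenever the line number changes.
     */
    static List<Integer> localIds(List<Integer> lineNumbers) {
        List<Integer> ids = new ArrayList<>();
        int previousLine = -1; // sentinel: no line seen yet (real lines start at 1)
        int counter = 0;
        for (int line : lineNumbers) {
            if (line != previousLine) {
                counter = 0;          // a new line of the original file begins
                previousLine = line;
            }
            counter += 1;
            ids.add(counter);
        }
        return ids;
    }

    public static void main(String[] args) {
        // three sentences from line 1, one from line 2, two from line 3
        List<Integer> lines = Arrays.asList(1, 1, 1, 2, 3, 3);
        System.out.println(localIds(lines)); // prints [1, 2, 3, 1, 1, 2]
    }
}
```

The same counter could be kept inside the loop of DocPerLineProcessor instead, but doing it as a separate pass leaves the global ids untouched.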

Please let me know if you need any clarification or if I made any errors transcribing my code.

The question is already answered, but I had the same problem and came up with a command-line solution that worked for me. The trick was to specify the tokenizerFactory and give it the option tokenizeNLs=true.

It looks like this:

java -mx1g -cp stanford-corenlp-3.6.0.jar:slf4j-api.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier english.conll.4class.distsim.normal.tagger -outputFormat slashTags -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerOptions "tokenizeNLs=true" -textFile untagged_lines.txt > tagged_lines.txt
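
If the tagged output does keep one document per line, as this answer describes, then the original line number is simply the position of each output line. Here is a minimal sketch in plain Java that makes that explicit. The class name, method name, and sample strings are made up for illustration, and the code does not call CoreNLP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NumberTaggedLines {
    /** Prefix each tagged line with its 1-based line number and a tab. */
    static List<String> numberLines(List<String> taggedLines) {
        List<String> numbered = new ArrayList<>();
        int lineNumber = 1;
        for (String line : taggedLines) {
            numbered.add(lineNumber + "\t" + line);
            lineNumber += 1;
        }
        return numbered;
    }

    public static void main(String[] args) {
        // made-up stand-ins for two lines of tagged_lines.txt
        List<String> tagged = Arrays.asList(
            "Great/O phone/O ./O",
            "Battery/O died/O ./O Avoid/O ./O");
        for (String row : numberLines(tagged)) {
            System.out.println(row);
        }
    }
}
```

In practice the list could come from java.nio.file.Files.readAllLines on the tagged output file. It is worth checking that the input and output files have the same number of lines before relying on this mapping.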
