Stanford NLP API for Java: how to get a name as one full entity instead of separate parts

The aim of my code is to take a document (a PDF or a Word file), extract all of its text, and pass that text to Stanford NLP for analysis. The code works fine. But suppose the document contains a name, e.g. "Pardeep Kumar". The output I receive for it is as follows:

Pardeep NNP PERSON

Kumar NNP PERSON

But I want it to look like this:

Pardeep Kumar NNP PERSON

How do I do that? How do I detect that two adjacent words actually form one name (or a similar multi-word entity)? How do I keep them from being split into separate words?

Here is my code:

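// Imports assume Stanford CoreNLP, Apache POI and iText 5 are on the classpath
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.List;
import java.util.Properties;

import javax.swing.JFileChooser;
import javax.xml.transform.TransformerConfigurationException;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class readstuff {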

      public static void analyse(String data) {

            // creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");

            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);


            // create an empty Annotation just with the given text
            Annotation document = new Annotation(data);

            // run all Annotators on this text
            pipeline.annotate(document);

            List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);

            // System.out.println("word"+"\t"+"POS"+"\t"+"NER");
            for (CoreMap sentence : sentences) {

                // traversing the words in the current sentence
                // a CoreLabel is a CoreMap with additional token-specific methods

                for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                    // this is the text of the token
                    String word = token.get(CoreAnnotations.TextAnnotation.class);
                    // this is the POS tag of the token
                    String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                    // this is the NER label of the token
                    String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);

                    // constant-first equals() stays safe even if a token has no NER tag
                    if("PERSON".equals(ne) || "LOCATION".equals(ne) || "DATE".equals(ne))
                    {

                        System.out.format("%32s%10s%16s",word,pos,ne);
                        System.out.println();
                    //System.out.println(word +"       \t"+pos +"\t"+ne);
                    }

                }
            }
        }

    public static void main(String[] args) throws FileNotFoundException, IOException, TransformerConfigurationException{

        JFileChooser window=new JFileChooser();
        int a=window.showOpenDialog(null);

        if(a==JFileChooser.APPROVE_OPTION){
            String name=window.getSelectedFile().getName();
            String extension = name.substring(name.lastIndexOf(".") + 1, name.length());
            String data = ""; // start empty so the PDF branch can append without a "null" prefix

            if(extension.equals("docx")){
                XWPFDocument doc=new XWPFDocument(new FileInputStream(window.getSelectedFile()));
                XWPFWordExtractor extract= new XWPFWordExtractor(doc);
                //System.out.println("docx file reading...");
                data=extract.getText();
                //extract.getMetadataTextExtractor();
            }
            else if(extension.equals("doc")){
                HWPFDocument doc=new HWPFDocument(new FileInputStream(window.getSelectedFile()));
                WordExtractor extract= new WordExtractor(doc);
                //System.out.println("doc file reading...");
                data=extract.getText();
            }
            else if(extension.equals("pdf")){
                //System.out.println(window.getSelectedFile());
                PdfReader reader=new PdfReader(new FileInputStream(window.getSelectedFile()));
                int n=reader.getNumberOfPages();
                // iText pages are numbered 1..n, so include page n itself
                for(int i=1;i<=n;i++)
                {
                    //System.out.println(data);
                data=data+PdfTextExtractor.getTextFromPage(reader,i );
                }
            }
            else{
                System.out.println("format not supported");
                return; // nothing to analyse
            }

        analyse(data);  
        }
    }



}

You want to use the entitymentions annotator.

package edu.stanford.nlp.examples;

import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.util.*;

import java.util.*;

public class EntityMentionsExample {

  public static void main(String[] args) {
    Annotation document =
        new Annotation("John Smith visited Los Angeles on Tuesday. He left Los Angeles on Wednesday.");
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitymentions");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    pipeline.annotate(document);

    for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
      for (CoreMap entityMention : sentence.get(CoreAnnotations.MentionsAnnotation.class)) {
        System.out.println(entityMention);
        System.out.println(entityMention.get(CoreAnnotations.EntityTypeAnnotation.class));
      }
    }
  }
}

Somehow, "You want to use the entitymentions annotator." didn't work the way I wanted. For example, for a text containing a name such as 'Rodriguez Quinonez, Dora needs a health checkup', it returned 'Rodriguez Quinonez' as one PERSON and 'Dora' as another PERSON. So the only way out seemed to be to apply some post-processing once the NER tags come out of the Stanford engine. See below.

Once I had the entities, I passed them through a method that groups them on the following basis:

If adjacent NER tags are all PERSON (no matter whether there are 1, 2, 3 or more of them), you can group them together to form a single aggregated noun.

Here's my code:

  1. A class that holds the token text in its word attribute and the NER tag (for example PERSON) in its ner attribute

     public class NERData {
         String word;
         String ner;
         ....
     }
  2. Get the NER-tagged tokens out (if you are only interested in the named entities)

     // pipeline is assumed to be a StanfordCoreNLP field created elsewhere with the ner annotator enabled
     public List<NERData> getNers(String data) {
         Annotation document = new Annotation(data);
         pipeline.annotate(document);
         List<CoreMap> sentences = document.get(SentencesAnnotation.class);
         List<NERData> ret = new ArrayList<NERData>();
         for (CoreMap sentence : sentences) {
             for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                 String word = token.get(TextAnnotation.class);
                 String ne = token.get(NamedEntityTagAnnotation.class);
                 // keep only tokens that carry a real entity tag
                 if (!(ne == null || ne.equals("O"))) {
                     NERData d = new NERData(word, ne);
                     //System.out.println("word is "+word+" ner "+ne);
                     ret.add(d);
                 }
             }
         }
         StanfordCoreNLP.clearAnnotatorPool();
         return ret;
     }
  3. Now pass the list of NER-tagged tokens to a method that looks for adjacent tokens with the same tag and aggregates them into one group.

     public List<List<NERData>> getGroups(List<NERData> data) {
         List<List<NERData>> groups = new ArrayList<List<NERData>>();
         List<NERData> group = new ArrayList<NERData>();
         NERData curr = null;
         int count = 0;
         for (NERData val : data) {
             if (curr == null) {
                 // first token starts the first group
                 curr = val;
                 count = 1;
                 group.add(curr);
             } else if (!curr.getNer().equalsIgnoreCase(val.getNer())) {
                 // tag changed: store the finished group and start a new one
                 if (!groups.contains(group)) {
                     groups.add(group);
                 }
                 curr = val;
                 count = 1;
                 group = new ArrayList<NERData>();
                 group.add(val);
             } else {
                 // same tag as the previous token: extend the current group
                 group.add(val);
                 if (!groups.contains(group)) {
                     groups.add(group);
                 }
                 curr = val;
                 ++count;
             }
         }
         // store the last group too; a trailing single-token group was otherwise lost
         if (!group.isEmpty() && !groups.contains(group)) {
             groups.add(group);
         }
         return groups;
     }

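As a result, "Pardeep" and "Kumar" land in the same group, so joining the words of that group gives Pardeep Kumar as a single PERSON. NERData as shown keeps only the word and the NER tag, so carry the POS tag along as well if you want Pardeep Kumar NNP PERSON as the output.

Note: this may not work well if the same sentence contains several person names that are not separated by an entity of another type. Tokens tagged O are dropped in step 2, so such names end up adjacent in the list and are merged into one group.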
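One possible way around that caveat is to keep the O tokens while grouping and let them end the current run, instead of filtering them out first. The following is only a minimal sketch of that idea in plain Java, with no CoreNLP dependency: the class NerGrouping, the holder class Tagged and the method mergeAdjacent are hypothetical names made up for illustration, and the token list in main is hand-written rather than real tagger output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NerGrouping {

    /** Hypothetical stand-in for one tagged token: its text and its NER tag. */
    public static final class Tagged {
        public final String word;
        public final String ner;

        public Tagged(String word, String ner) {
            this.word = word;
            this.ner = ner;
        }
    }

    /**
     * Merges runs of adjacent tokens that share the same entity tag.
     * Tokens tagged "O" (or with no tag) are not emitted, but they do end the current run.
     */
    public static List<Tagged> mergeAdjacent(List<Tagged> tokens) {
        List<Tagged> merged = new ArrayList<>();
        StringBuilder words = new StringBuilder();
        String runTag = null; // tag of the run being built, or null when no run is open

        for (Tagged t : tokens) {
            boolean isEntity = t.ner != null && !t.ner.equals("O");
            if (isEntity && t.ner.equals(runTag)) {
                words.append(' ').append(t.word); // same tag: extend the open run
                continue;
            }
            if (runTag != null) {
                merged.add(new Tagged(words.toString(), runTag)); // close the previous run
            }
            words.setLength(0);
            runTag = null;
            if (isEntity) {
                words.append(t.word); // open a new run with this token
                runTag = t.ner;
            }
        }
        if (runTag != null) {
            merged.add(new Tagged(words.toString(), runTag)); // flush the last run
        }
        return merged;
    }

    public static void main(String[] args) {
        // Hand-written tags for illustration only
        List<Tagged> tokens = Arrays.asList(
            new Tagged("Pardeep", "PERSON"),
            new Tagged("Kumar", "PERSON"),
            new Tagged("met", "O"),
            new Tagged("John", "PERSON"),
            new Tagged("Smith", "PERSON"),
            new Tagged("in", "O"),
            new Tagged("Los", "LOCATION"),
            new Tagged("Angeles", "LOCATION"));
        for (Tagged entity : mergeAdjacent(tokens)) {
            System.out.println(entity.word + "\t" + entity.ner);
        }
    }
}
```

With the sample tokens, "Pardeep Kumar" and "John Smith" stay separate because the O token "met" sits between them. Two names that directly follow each other with nothing in between would still be merged, and a "Last, First" form such as 'Rodriguez Quinonez, Dora' would still be split at the comma, so this sketch does not cover those cases.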
