
Stanford NLP API for Java: how to get a name in full, not in parts

The aim of my code is to take a document (a PDF or DOC file), extract all the text in it, and pass that text to Stanford NLP for analysis. The code works fine. But suppose the document contains a name, e.g. "Pardeep Kumar". The output I receive for it is as follows:

Pardeep NNP PERSON

Kumar NNP PERSON

But I want it to be like this:

Pardeep Kumar NNP PERSON

How do I do that? How do I check whether two adjacent words actually make up one name or something similar? How do I keep them from being split into separate words?

Here is my code:

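import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.List;
import java.util.Properties;

import javax.swing.JFileChooser;
import javax.xml.transform.TransformerConfigurationException;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

// assumption: PdfReader and PdfTextExtractor come from iText 5
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class readstuff {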

      public static void analyse(String data) {

            // creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");

            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);


            // create an empty Annotation just with the given text
            Annotation document = new Annotation(data);

            // run all Annotators on this text
            pipeline.annotate(document);

            List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);

            // System.out.println("word"+"\t"+"POS"+"\t"+"NER");
            for (CoreMap sentence : sentences) {

                // traversing the words in the current sentence
                // a CoreLabel is a CoreMap with additional token-specific methods

                for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                    // this is the text of the token
                    String word = token.get(CoreAnnotations.TextAnnotation.class);
                    // this is the POS tag of the token
                    String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                    // this is the NER label of the token
                    String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);

                    if(ne.equals("PERSON") || ne.equals("LOCATION") || ne.equals("DATE") )
                    {

                        System.out.format("%32s%10s%16s",word,pos,ne);
                        System.out.println();
                    //System.out.println(word +"       \t"+pos +"\t"+ne);
                    }

                }
            }
        }

    public static void main(String[] args) throws FileNotFoundException, IOException, TransformerConfigurationException{

        JFileChooser window=new JFileChooser();
        int a=window.showOpenDialog(null);

        if(a==JFileChooser.APPROVE_OPTION){
            String name=window.getSelectedFile().getName();
            String extension = name.substring(name.lastIndexOf(".") + 1, name.length());
            String data = null;

            if(extension.equals("docx")){
                XWPFDocument doc=new XWPFDocument(new FileInputStream(window.getSelectedFile()));
                XWPFWordExtractor extract= new XWPFWordExtractor(doc);
                //System.out.println("docx file reading...");
                data=extract.getText();
                //extract.getMetadataTextExtractor();
            }
            else if(extension.equals("doc")){
                HWPFDocument doc=new HWPFDocument(new FileInputStream(window.getSelectedFile()));
                WordExtractor extract= new WordExtractor(doc);
                //System.out.println("doc file reading...");
                data=extract.getText();
            }
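            else if(extension.equals("pdf")){
                //System.out.println(window.getSelectedFile());
                PdfReader reader=new PdfReader(new FileInputStream(window.getSelectedFile()));
                int n=reader.getNumberOfPages();
                // pages are numbered 1..n inclusive; collect the text in a StringBuilder
                // so the result does not start with the literal "null"
                StringBuilder text=new StringBuilder();
                for(int i=1;i<=n;i++)
                {
                    text.append(PdfTextExtractor.getTextFromPage(reader,i));
                }
                reader.close();
                data=text.toString();
            }
            else{
                System.out.println("format not supported");
                return; // nothing to analyse
            }

        analyse(data);  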
        }
    }



}

You want to use the entitymentions annotator.

package edu.stanford.nlp.examples;

import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.util.*;

import java.util.*;

public class EntityMentionsExample {

  public static void main(String[] args) {
    Annotation document =
        new Annotation("John Smith visited Los Angeles on Tuesday. He left Los Angeles on Wednesday.");
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitymentions");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    pipeline.annotate(document);

    for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
      for (CoreMap entityMention : sentence.get(CoreAnnotations.MentionsAnnotation.class)) {
        System.out.println(entityMention);
        System.out.println(entityMention.get(CoreAnnotations.EntityTypeAnnotation.class));
      }
    }
  }
}
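
To make the idea concrete without depending on CoreNLP, here is a minimal sketch of what merging adjacent tokens with the same entity tag looks like. The `Token` class, the `mergeAdjacent` method and the hard-coded tokens are hypothetical stand-ins for the (word, POS, NER) triples printed in the question; the real entitymentions annotator may apply additional rules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MergeAdjacentEntities {

    /** Hypothetical stand-in for one token's word, POS tag and NER tag. */
    public static final class Token {
        public final String word, pos, ner;

        public Token(String word, String pos, String ner) {
            this.word = word;
            this.pos = pos;
            this.ner = ner;
        }
    }

    /** Joins each run of consecutive tokens sharing a non-"O" NER tag into one line. */
    public static List<String> mergeAdjacent(List<Token> tokens) {
        List<String> lines = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Token first = tokens.get(i);
            if (first.ner == null || first.ner.equals("O")) {
                i++; // not an entity, skip it
                continue;
            }
            StringBuilder words = new StringBuilder(first.word);
            int j = i + 1;
            // extend the run while the next token carries the same tag
            while (j < tokens.size() && first.ner.equals(tokens.get(j).ner)) {
                words.append(' ').append(tokens.get(j).word);
                j++;
            }
            // report the POS tag of the first token in the run
            lines.add(words + " " + first.pos + " " + first.ner);
            i = j;
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
            new Token("Pardeep", "NNP", "PERSON"),
            new Token("Kumar", "NNP", "PERSON"),
            new Token("lives", "VBZ", "O"),
            new Token("in", "IN", "O"),
            new Token("Delhi", "NNP", "LOCATION"));
        for (String line : mergeAdjacent(tokens)) {
            System.out.println(line);
        }
    }
}
```

With the sample tokens above, this prints `Pardeep Kumar NNP PERSON` followed by `Delhi NNP LOCATION`.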

Somehow, "You want to use the entitymentions annotator." didn't work for me the way I wanted. For example, if the text contained a name such as "Rodriguez Quinonez, Dora need a health checkup", it returned "Rodriguez Quinonez" as one PERSON and "Dora" as another PERSON. So the only way out seemed to be to apply some post-processing once the NER tags come out of the Stanford engine. See below.

Once I got the entities out, I passed them through a method that groups them on the following basis:

If adjacent NER tags are PERSON (no matter whether there are 1, 2, 3 or more of them), you can group them together to form a single aggregated noun.

Here's my code:

  1. A class to hold the token text in the word attribute and its NER tag (e.g. PERSON) in the ner attribute

     public class NERData { String word; String ner; .... }
  2. Get the NERs out (if you are only interested in NERs)

     public List<NERData> getNers(String data){
         Annotation document = new Annotation(data);
         pipeline.annotate(document);
         List<CoreMap> sentences = document.get(SentencesAnnotation.class);
         List<NERData> ret = new ArrayList<NERData>();
         for(CoreMap sentence: sentences) {
             for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
                 String word = token.get(TextAnnotation.class);
                 String ne = token.get(NamedEntityTagAnnotation.class);
                 if(!(ne == null || ne.equals("O"))){
                     NERData d = new NERData(word, ne);
                     //System.out.println("word is "+word+" ner "+ne);
                     ret.add(d);
                 }
             }
         }
         StanfordCoreNLP.clearAnnotatorPool();
         return ret;
     }
  3. Now pass the list of NERs to a method that looks for adjacent PERSON tags and aggregates them into one.

     public List<List<NERData>> getGroups(List<NERData> data){
         List<List<NERData>> groups = new ArrayList<List<NERData>>();
         List<NERData> group = new ArrayList<NERData>();
         NERData curr = null;
         int count = 0;
         for (NERData val: data) {
             if (curr == null) {
                 curr = val;
                 count = 1;
                 group.add(curr);
             } else if (!curr.getNer().equalsIgnoreCase(val.getNer())) {
                 // the tag changed: store the finished group and start a new one
                 if(!groups.contains(group)){
                     groups.add(group);
                 }
                 curr = val;
                 count = 1;
                 group = new ArrayList<NERData>();
                 group.add(val);
             } else {
                 // same tag as the previous entity: extend the current group
                 group.add(val);
                 if(!groups.contains(group)){
                     groups.add(group);
                 }
                 curr = val;
                 ++count;
             }
         }
         // store the last group, which is otherwise lost when it holds a single entity
         if(!group.isEmpty() && !groups.contains(group)){
             groups.add(group);
         }
         return groups;
     }

As a result, you will get Pardeep Kumar NNP PERSON as the output.

Note: This may not work well if you have multiple person names in the same sentence that are not separated by any noun.
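
One way to soften this limitation is to remember where each entity token sits in the text and only merge entities that are truly next to each other. The sketch below is an illustration under assumptions, not part of CoreNLP: the `Entity` class (word, NER tag, token position) and the `group` method are hypothetical, the positions are hard-coded rather than read from the pipeline, and the list is assumed to be sorted by position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PositionAwareGrouping {

    /** Hypothetical entity token: its text, its NER tag and its position in the token stream. */
    public static final class Entity {
        public final String word, ner;
        public final int index;

        public Entity(String word, String ner, int index) {
            this.word = word;
            this.ner = ner;
            this.index = index;
        }
    }

    /**
     * Merges entities that share a tag AND occupy consecutive token positions.
     * Entities with the same tag but with other tokens between them stay separate.
     */
    public static List<String> group(List<Entity> entities) {
        List<String> groups = new ArrayList<>();
        StringBuilder current = null;
        Entity prev = null;
        for (Entity e : entities) {
            boolean sameRun = prev != null
                && prev.ner.equals(e.ner)
                && e.index == prev.index + 1;
            if (sameRun) {
                current.append(' ').append(e.word);
            } else {
                // close the previous run, if any, and start a new one
                if (current != null) {
                    groups.add(current + " " + prev.ner);
                }
                current = new StringBuilder(e.word);
            }
            prev = e;
        }
        if (current != null) {
            groups.add(current + " " + prev.ner); // do not lose the last run
        }
        return groups;
    }

    public static void main(String[] args) {
        // "John Smith and Mary Jones": "and" (position 2) is not an entity
        List<Entity> entities = Arrays.asList(
            new Entity("John", "PERSON", 0),
            new Entity("Smith", "PERSON", 1),
            new Entity("Mary", "PERSON", 3),
            new Entity("Jones", "PERSON", 4));
        for (String g : group(entities)) {
            System.out.println(g);
        }
    }
}
```

With the sample input this prints `John Smith PERSON` and `Mary Jones PERSON` instead of one four-word name. The trade-off is that a name written as "Rodriguez Quinonez, Dora" is split again at the comma, so which rule fits depends on how names appear in your documents.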
