
StanfordNLP to detect compound entities with prepositions

Basically, in the sentence:

<Lord of the bracelets> is a fantasy movie.

I would like to detect the compound "Lord of the bracelets" as a single entity (one that the entitylink annotator could also link). This means detecting structures whose POS tags follow a pattern such as NNP DT NNP or NN IN DT NNP.

Is this possible with CoreNLP?

My current setup doesn't detect them, and I couldn't find a way to do it.


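  import java.util.Properties;

  import edu.stanford.nlp.pipeline.CoreDocument;
  import edu.stanford.nlp.pipeline.StanfordCoreNLP;

  public class NamedEntityRecognition {
    private final StanfordCoreNLP pipeline;

    public NamedEntityRecognition() {
      Properties props = new Properties();
      props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitylink");
      props.setProperty("tokenize.options", "untokenizable=noneDelete");

      pipeline = new StanfordCoreNLP(props);
    }

    // Runs every configured annotator on the text and returns the annotated document.
    public CoreDocument recogniseEntities(String text) {
      CoreDocument doc = new CoreDocument(text);
      pipeline.annotate(doc);
      return doc;
    }
  }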

Thanks!

You could do this with TokensRegex (possibly with RegexNER too, though I don't think so). In a TokensRegex rule you can specify that certain part-of-speech tag patterns should be marked as an entity.

The full description of TokensRegex is provided here:

https://stanfordnlp.github.io/CoreNLP/tokensregex.html

While @StanfordNLPHelp's answer was helpful, I thought I would add some more detail on what my final solution was.

Option 1:

Add a TokensRegex annotator as pointed out by the previous answer. This adds a more customisable annotator to the pipeline, and you can specify your own rules in a text file.

This is what my rules file (extended_ner.rules) looks like:

# these Java classes will be used by the rules
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
tokens = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }

# rule for recognizing compound names
{ ruleType: "tokens", pattern: ([{tag:"NN"}] [{tag:"IN"}] [{tag:"DT"}] [{tag:"NNP"}]), action: Annotate($0, ner, "COMPOUND"), result: "COMPOUND_RESULT" }

You can see a breakdown of the rule syntax in the TokensRegex documentation.
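To make it concrete what the pattern in the rule above is looking for, here is a minimal plain-Java sketch that scans a list of POS tags for the exact sequence NN IN DT NNP. It does not use CoreNLP at all; the class name, the method names, and the hand-written tokens and tags are all made up for illustration, and a real tagger may assign different tags to the same words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PosPatternSketch {

    // Tag sequence mirrored from the rule: noun, preposition, determiner, proper noun.
    static final List<String> PATTERN = Arrays.asList("NN", "IN", "DT", "NNP");

    /** Returns [start, end) token index pairs where the tags match PATTERN exactly. */
    static List<int[]> findSpans(List<String> tags) {
        List<int[]> spans = new ArrayList<>();
        for (int start = 0; start + PATTERN.size() <= tags.size(); start++) {
            if (tags.subList(start, start + PATTERN.size()).equals(PATTERN)) {
                spans.add(new int[] {start, start + PATTERN.size()});
            }
        }
        return spans;
    }

    /** Joins the tokens covered by a span back into one entity string. */
    static String spanText(List<String> tokens, int[] span) {
        return String.join(" ", tokens.subList(span[0], span[1]));
    }

    public static void main(String[] args) {
        // Hand-written example input (an assumption, not tagger output).
        List<String> tokens = Arrays.asList("I", "saw", "lord", "of", "the", "Bracelets", "yesterday");
        List<String> tags = Arrays.asList("PRP", "VBD", "NN", "IN", "DT", "NNP", "NN");
        for (int[] span : findSpans(tags)) {
            System.out.println(spanText(tokens, span));
        }
    }
}
```

Because the match is exact, a first word tagged NNP instead of NN would not be picked up by this sequence, so it is worth checking the tags your tagger actually produces for your sentences before settling on a pattern.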

Note: The TokensRegex annotator must be added after the ner annotator. Otherwise, the results will be overwritten.

This is what the Java code would look like:

  public NamedEntityRecognition() {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex,entitylink");
    props.setProperty("tokensregex.rules", "extended_ner.rules");
    props.setProperty("tokenize.options", "untokenizable=noneDelete");

    pipeline = new StanfordCoreNLP(props);
  }

Option 2 (the one I chose):

Instead of adding another annotator, the rules file can be passed to the ner annotator via the "ner.additional.tokensregex.rules" property. Here are the docs.

I chose this option because it seems simpler, and adding another annotator to the pipeline seemed like overkill for my case.

The rules file is exactly the same as in Option 1. The Java code is now:

  public NamedEntityRecognition() {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitylink");
    props.setProperty("ner.additional.tokensregex.rules", "extended_ner.rules");

    props.setProperty("tokenize.options", "untokenizable=noneDelete");

    pipeline = new StanfordCoreNLP(props);
  }

Note: For this to work, the property "ner.applyFineGrained" must be true (default value).
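If you would rather keep the Option 2 settings as text than as a series of setProperty calls, the same three properties can be written in .properties syntax and loaded with java.util.Properties. This is only a sketch: the class name and the inline string are illustrative, and the loaded object would then be passed to new StanfordCoreNLP(props) exactly as in the constructor above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class PipelinePropsSketch {

    // The same three settings as Option 2, written in .properties syntax.
    static final String CONFIG = String.join("\n",
        "annotators = tokenize,ssplit,pos,lemma,ner,entitylink",
        "ner.additional.tokensregex.rules = extended_ner.rules",
        "tokenize.options = untokenizable=noneDelete");

    /** Parses CONFIG into a Properties object ready to hand to the pipeline. */
    static Properties load() {
        Properties props = new Properties();
        try {
            props.load(new StringReader(CONFIG));
        } catch (IOException e) {
            // Reading from an in-memory string should not fail; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = load();
        System.out.println(props.getProperty("ner.additional.tokensregex.rules"));
    }
}
```

Only the first "=" on a line separates key from value, so the value of tokenize.options keeps its own "=" intact.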
