
Determine the type (part of speech) of a word in Lucene

I'm new to Lucene, and I'm looking for a way to find out, for each token in a TokenStream, whether it is a verb, a noun, or something else. I see that the Token class has a type() method, but I'm using the CharTermAttribute class.

I have already searched the API docs, but I didn't find anything for this.

@EJP is correct that Lucene doesn't know anything about parts of speech on its own.

However, you could implement a custom filter that handles this well. Lucene's TokenStream API example implementation actually does exactly that. I'll include a few pertinent bits here, but also look over the complete example (starts about halfway down the page).

Two things in particular are of primary interest. The first is a custom PartOfSpeechAttribute interface extending Attribute. Attributes are used to attach extra data to tokens during analysis. Here is a simplified version of the one provided in the example (again, visit the link above to see their more robust implementation, including the PartOfSpeechAttributeImpl class that backs the interface):

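import org.apache.lucene.util.Attribute;

// Must be an interface: addAttribute() only accepts interfaces, and by default
// Lucene looks for the implementation in a class named PartOfSpeechAttributeImpl.
public interface PartOfSpeechAttribute extends Attribute {
  public static enum PartOfSpeech {
    Noun, Verb, Adjective, Adverb, Pronoun, Preposition, Conjunction, Article, Unknown
  }

  // store the part of speech for the current token
  void setPartOfSpeech(PartOfSpeech pos);

  // read back the part of speech for the current token
  PartOfSpeech getPartOfSpeech();
}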

Then you will need to implement a custom filter that sets this attribute on each token.

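import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PartOfSpeechTaggingFilter extends TokenFilter {
  // the term text of the current token, shared with the rest of the analysis chain
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PartOfSpeechAttribute posAtt = addAttribute(PartOfSpeechAttribute.class);

  protected PartOfSpeechTaggingFilter(TokenStream input) {
    super(input);
  }

  @Override
  public final boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {return false;}
    posAtt.setPartOfSpeech(determinePOS(termAtt.buffer(), 0, termAtt.length()));
    return true;
  }

  // determine the part of speech for the given term
  protected PartOfSpeechAttribute.PartOfSpeech determinePOS(char[] term, int offset, int length) {
    // placeholder: plug in a real tagger here
    return PartOfSpeechAttribute.PartOfSpeech.Unknown;
  }
}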

Then you would be able to extract the PartOfSpeechAttributes from the TokenStream much like you would any other attribute.

Of course, that doesn't answer how to determine the Part of Speech. The implementation of determinePOS is somewhat beyond the scope I can expect to cover here though. OpenNLP might be a good library for making that determination, among others.
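Just to give a rough idea of the shape determinePOS could take, here is a minimal sketch that looks the term up in a tiny hand-written lexicon. Everything in it is an assumption made for illustration: the ToyPosTagger class name, the lexicon entries, and the local PartOfSpeech enum (a stand-in for PartOfSpeechAttribute.PartOfSpeech so that the sketch compiles without Lucene). A word list like this cannot handle ambiguous words, so a real implementation would more likely delegate to a trained tagger.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ToyPosTagger {
  // Local stand-in for PartOfSpeechAttribute.PartOfSpeech (assumption, keeps the sketch Lucene-free).
  public enum PartOfSpeech {
    Noun, Verb, Adjective, Adverb, Pronoun, Preposition, Conjunction, Article, Unknown
  }

  // Hypothetical hand-written lexicon; a real tagger would use a trained model instead.
  private static final Map<String, PartOfSpeech> LEXICON = new HashMap<>();
  static {
    LEXICON.put("the", PartOfSpeech.Article);
    LEXICON.put("a", PartOfSpeech.Article);
    LEXICON.put("dog", PartOfSpeech.Noun);
    LEXICON.put("runs", PartOfSpeech.Verb);
    LEXICON.put("quickly", PartOfSpeech.Adverb);
    LEXICON.put("and", PartOfSpeech.Conjunction);
  }

  // Same parameter list as determinePOS in the filter: the term lives in
  // term[offset .. offset + length), as with CharTermAttribute.buffer().
  public static PartOfSpeech determinePOS(char[] term, int offset, int length) {
    String word = new String(term, offset, length).toLowerCase(Locale.ROOT);
    // Words missing from the lexicon fall back to Unknown.
    return LEXICON.getOrDefault(word, PartOfSpeech.Unknown);
  }

  public static void main(String[] args) {
    for (String w : new String[] {"The", "dog", "runs", "quickly", "Lucene"}) {
      char[] buf = w.toCharArray();
      System.out.println(w + " -> " + determinePOS(buf, 0, buf.length));
    }
  }
}
```

Inside the filter, the body of determinePOS would do the same lookup and return the matching PartOfSpeechAttribute.PartOfSpeech constant instead of the local enum.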

There is also some in-development work on an OpenNLP analysis module that would definitely be worth a look, though it doesn't look like it handles parts of speech yet, nor has it had much attention in about a year.
