Tokenization and indexing with Lucene: how to handle an external tokenizer and part-of-speech tags?
I would like to build my own tokenizer (in the Lucene sense) or my own analyzer; I am not sure which one. I have already written code that tokenizes my documents into words (as a List<String> or a List<Word>, where Word is a simple container class with three public Strings: word, pos, and lemma; pos stands for the part-of-speech tag).
I am not sure what I am going to index; maybe only "Word.lemma", or something like "Word.lemma + '#' + Word.pos". I will probably also do some filtering against a stop-word list based on part-of-speech.
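For illustration, composing the indexed term from lemma and POS tag, with POS-based stop filtering, could be sketched like this. This is a minimal standalone sketch: the Word class mirrors the container described above, and the tag names (DT, IN, NNS, ...) are just an assumption in Penn Treebank style.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Word {
    public String word, pos, lemma;
    Word(String word, String pos, String lemma) {
        this.word = word; this.pos = pos; this.lemma = lemma;
    }
}

public class IndexTermSketch {
    // POS tags to drop (hypothetical Penn Treebank-style stop list)
    static final Set<String> STOP_POS = new HashSet<>(Arrays.asList("DT", "IN", "CC"));

    // Build the terms that would actually be indexed: lemma + '#' + pos
    static List<String> toIndexTerms(List<Word> words) {
        List<String> terms = new ArrayList<>();
        for (Word w : words) {
            if (STOP_POS.contains(w.pos)) continue; // POS-based stop filtering
            terms.add(w.lemma + "#" + w.pos);       // e.g. "be#VBP"
        }
        return terms;
    }

    public static void main(String[] args) {
        List<Word> tagged = Arrays.asList(
                new Word("The", "DT", "the"),
                new Word("cats", "NNS", "cat"),
                new Word("are", "VBP", "be"),
                new Word("sleeping", "VBG", "sleep"));
        System.out.println(toIndexTerms(tagged)); // [cat#NNS, be#VBP, sleep#VBG]
    }
}
```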
BTW, here is my misunderstanding: I am not sure where I should plug into the Lucene API.
Should I wrap my own tokenizer inside a new Tokenizer? Should I rewrite TokenStream? Should I consider that this is the job of the analyzer rather than the tokenizer? Or should I bypass everything and build my index directly, adding my words to the index with IndexWriter, Fieldable and so on? (If so, do you know of any documentation on how to create one's own index from scratch while bypassing the analysis process?)
Best regards
EDIT: maybe the simplest way would be to org.apache.commons.lang.StringUtils.join my Word-s with a space at the exit of my personal tokenizer/analyzer, and rely on WhitespaceTokenizer (and the other classical filters) to feed Lucene?
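A minimal illustration of that round trip, using java.util's String.join in place of StringUtils.join (they behave the same here). One caveat this makes visible: offsets and positions would then refer to the joined string, not to the original text.

```java
import java.util.Arrays;
import java.util.List;

public class JoinSketch {
    // What StringUtils.join(lemmas, ' ') would produce
    static String joinLemmas(List<String> lemmas) {
        return String.join(" ", lemmas);
    }

    public static void main(String[] args) {
        // pretend these lemmas came out of the external tagger
        List<String> lemmas = Arrays.asList("cat", "be", "sleep");
        String joined = joinLemmas(lemmas);
        System.out.println(joined); // cat be sleep
        // a whitespace tokenizer would then simply split this back apart:
        System.out.println(Arrays.asList(joined.split("\\s+"))); // [cat, be, sleep]
    }
}
```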
EDIT: so, I have read the EnglishLemmaTokenizer pointed to by Larsmans... but what still confuses me is that my own analysis/tokenization process ends with a complete *List<Word>* (the Word class wrapping .form/.pos/.lemma). This process relies on an external binary that I have wrapped in Java (this is a must / cannot be done otherwise; it is not stream-oriented from a consumer point of view, I get the full list as a result), and I still do not see how I should wrap it again to get back into the normal Lucene analysis process.
Also, I will be using the TermVector feature with TF.IDF-like scoring (maybe redefining my own), and I may also be interested in proximity searching; thus, discarding some words based on their part-of-speech before providing them to a Lucene built-in tokenizer or internal analyzer may seem a bad idea. And I have difficulty thinking of a "proper" way to expose Word.form / Word.pos / Word.lemma (or even some other Word.anyOtherInterestingAttribute) the Lucene way.
EDIT: BTW, here is a piece of code that I wrote, inspired by the one from @Larsmans:
class MyLuceneTokenizer extends TokenStream {
    private final PositionIncrementAttribute posIncrement;
    private final CharTermAttribute termAttribute;
    private final List<TaggedWord> tagged;
    private int position;

    public MyLuceneTokenizer(Reader input, String language, String pathToExternalBinary) {
        super();
        posIncrement = addAttribute(PositionIncrementAttribute.class);
        termAttribute = addAttribute(CharTermAttribute.class); // TermAttribute is deprecated!
        try {
            // import com.google.common.io.CharStreams;
            // see http://stackoverflow.com/questions/309424/in-java-how-do-i-read-convert-an-inputstream-to-a-string
            String text = CharStreams.toString(input);
            tagged = MyTaggerWrapper.doTagging(text, language, pathToExternalBinary);
        } catch (IOException e) {
            throw new RuntimeException("could not read input for tagging", e);
        }
        position = 0;
    }

    @Override
    public final boolean incrementToken() throws IOException {
        int increment = 1; // grows by 1 for every token filtered out, so positions stay correct
        while (position < tagged.size()) {
            TaggedWord current = tagged.get(position);
            position++;
            String form = current.word;
            String pos = current.pos;
            String lemma = current.lemma;
            // POS-based filtering logic should go here...
            // BTW we have broken the idea behind the Lucene nested filters or analyzers!
            String kept = lemma;
            if (kept != null) {
                clearAttributes();
                posIncrement.setPositionIncrement(increment);
                termAttribute.copyBuffer(kept.toCharArray(), 0, kept.length());
                return true;
            }
            increment++; // token was filtered: widen the gap for proximity queries
        }
        return false;
    }
}
class MyLuceneAnalyzer extends Analyzer {
    private final String language;
    private final String pathToExternalBinary;

    public MyLuceneAnalyzer(String language, String pathToExternalBinary) {
        this.language = language;
        this.pathToExternalBinary = pathToExternalBinary;
    }

    @Override
    public TokenStream tokenStream(String fieldname, Reader input) {
        return new MyLuceneTokenizer(input, language, pathToExternalBinary);
    }
}
There are various options here, but when I tried to wrap a POS tagger in Lucene, I found that implementing a new TokenStream and wrapping that inside a new Analyzer was the easiest option. In any case, mucking with IndexWriter directly seems like a bad idea. You can find my code on my GitHub.