Apache Lucene: How to use TokenStream to manually accept or reject a token when indexing

I am looking for a way to write a custom index with Apache Lucene (PyLucene to be precise, but a Java answer is fine).

What I would like to do is the following: when adding a document to the index, Lucene will tokenize it, remove stop words, etc. This is usually done with the Analyzer, if I am not mistaken.

What I would like to implement is the following: before Lucene stores a given term, I would like to perform a lookup (say, in a dictionary) to check whether to keep the term or discard it (if the term is present in my dictionary, I keep it; otherwise I discard it).

How should I proceed?

Here is (in Python) my custom implementation of the Analyzer:

# Assuming PyLucene 4.10.x with lucene.initVM() already called.
from org.apache.pylucene.analysis import PythonAnalyzer
from org.apache.lucene.analysis import Analyzer
from org.apache.lucene.analysis.core import LowerCaseFilter, StopFilter, StopAnalyzer
from org.apache.lucene.analysis.standard import StandardTokenizer, StandardFilter
from org.apache.lucene.analysis.tokenattributes import CharTermAttribute, OffsetAttribute
from org.apache.lucene.util import Version

class CustomAnalyzer(PythonAnalyzer):

    def createComponents(self, fieldName, reader):

        source = StandardTokenizer(Version.LUCENE_4_10_1, reader)
        filter = StandardFilter(Version.LUCENE_4_10_1, source)
        filter = LowerCaseFilter(Version.LUCENE_4_10_1, filter)
        filter = StopFilter(Version.LUCENE_4_10_1, filter,
                            StopAnalyzer.ENGLISH_STOP_WORDS_SET)

        # Wrap the chain in TokenStreamComponents, then pull the stream
        # back out to inspect the tokens one by one.
        tokenStream = Analyzer.TokenStreamComponents(source, filter)
        ts = tokenStream.getTokenStream()
        token = ts.addAttribute(CharTermAttribute.class_)
        offset = ts.addAttribute(OffsetAttribute.class_)

        ts.reset()

        while ts.incrementToken():
            startOffset = offset.startOffset()
            endOffset = offset.endOffset()
            term = token.toString()
            # accept or reject term

        ts.end()
        ts.close()

        # How to store the terms in the index now?

        return ????

Thank you in advance for your guidance!

EDIT 1: After digging into Lucene's documentation, I figured it had something to do with TokenStreamComponents. It returns a TokenStream with which you can iterate through the token list of the field you are indexing.

Now there is something about the Attributes that I do not understand. Or, more precisely, I can read the tokens, but have no idea how I should proceed afterwards.

EDIT 2: I found this post where they mention the use of CharTermAttribute. However (in Python, at least) I cannot access or get a CharTermAttribute. Any thoughts?

EDIT 3: I can now access each term; see the updated code snippet. What is left to be done is actually storing the desired terms...

The way I was trying to solve the problem was wrong. This post and femtoRgon's answer were the solution.

By defining a filter extending PythonFilteringTokenFilter, I can make use of the accept() method (like the one used in StopFilter, for instance).

Here is the corresponding code snippet:

# Assuming PyLucene 4.10.x.
from org.apache.pylucene.analysis import PythonFilteringTokenFilter
from org.apache.lucene.analysis.tokenattributes import CharTermAttribute

class MyFilter(PythonFilteringTokenFilter):

  def __init__(self, version, tokenStream):
    super(MyFilter, self).__init__(version, tokenStream)
    self.termAtt = self.addAttribute(CharTermAttribute.class_)

  def accept(self):
    # Called once per token; return True to keep it, False to drop it.
    term = self.termAtt.toString()
    accepted = False
    # Do whatever is needed with the term
    # accepted = ... (True/False)
    return accepted
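
For the concrete use case in the question (keep a term only if it appears in a dictionary), accept() boils down to a membership test. Here is a minimal sketch, assuming the dictionary fits in an in-memory Python set; the class name DictionaryFilter and the dictionary parameter are illustrative, not part of the PyLucene API:

class DictionaryFilter(PythonFilteringTokenFilter):

  def __init__(self, version, tokenStream, dictionary):
    super(DictionaryFilter, self).__init__(version, tokenStream)
    self.termAtt = self.addAttribute(CharTermAttribute.class_)
    self.dictionary = dictionary  # e.g. a set of allowed terms

  def accept(self):
    # Keep the token only if it appears in the dictionary.
    return self.termAtt.toString() in self.dictionary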

Then just append the filter to the other filters (as in the code snippet in the question):

filter = MyFilter(Version.LUCENE_4_10_1, filter)
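
Putting it all together, createComponents simply returns the TokenStreamComponents with the custom filter as the last link in the chain; there is no need to iterate over the tokens yourself, since Lucene consumes the stream and stores only the surviving terms when the document is indexed. A sketch, again assuming PyLucene 4.10.x and the imports shown earlier:

class CustomAnalyzer(PythonAnalyzer):

    def createComponents(self, fieldName, reader):
        source = StandardTokenizer(Version.LUCENE_4_10_1, reader)
        filter = StandardFilter(Version.LUCENE_4_10_1, source)
        filter = LowerCaseFilter(Version.LUCENE_4_10_1, filter)
        filter = StopFilter(Version.LUCENE_4_10_1, filter,
                            StopAnalyzer.ENGLISH_STOP_WORDS_SET)
        filter = MyFilter(Version.LUCENE_4_10_1, filter)
        # Lucene drives the stream itself: tokens rejected by accept()
        # never reach the index.
        return Analyzer.TokenStreamComponents(source, filter)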
