
Adding tokens to a Lucene TokenStream

I wrote a TokenFilter which adds tokens to a stream.

1. Tests show it works, but I don't completely understand why.

If someone could shed some light on the semantics, I'd be grateful. In particular, at (*), when restoring the state, doesn't that mean we overwrite either the current token or the token created before capturing the state?

This is roughly what I did:

private final LinkedList<String> extraTokens = new LinkedList<String>();
private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
private State savedState;

@Override
public boolean incrementToken() throws IOException {
    if (!extraTokens.isEmpty()) {
        // Do we not lose/overwrite the current termAtt token here? (*)
        restoreState(savedState);
        termAtt.setEmpty().append(extraTokens.remove());
        return true;
    }
    if (input.incrementToken()) {
        if (/* condition */) {
           extraTokens.add("fo");
           savedState = captureState();
        }
        return true;
    }
    return false;
}

Does that mean, for an input stream of the whitespace-tokenized string "a b c"

 (a) -> (b) -> (c) -> ...

where bb is a new synonym for b, that the graph will be constructed like this when restoreState is used?

    (a)
   /   \
(b)    (bb)
   \   /
    (c)
     |
    ...

2. Attributes

Given the text "foo bar baz", with fo being the stem of foo and qux being a synonym for "bar baz", have I constructed the correct attribute table?

+--------+---------------+-----------+--------------+-----------+
|  Term  |  startOffset  | endOffset | posIncrement | posLength |
+--------+---------------+-----------+--------------+-----------+
|  foo   |       0       |     3     |      1       |     1     |
|  fo    |       0       |     3     |      0       |     1     |
|  qux   |       4       |     11    |      0       |     2     |
|  bar   |       4       |     7     |      1       |     1     |
|  baz   |       8       |     11    |      1       |     1     |
+--------+---------------+-----------+--------------+-----------+

1.

The Attribute-based API works like this: every TokenStream in your analyzer chain modifies the state of some Attributes on every call of incrementToken(). The last element in your chain then produces the final tokens.

Whenever the client of your analyzer chain calls incrementToken(), the last TokenStream sets the state of some Attributes to whatever is necessary to represent the next token. If it is unable to do so on its own, it may call incrementToken() on its input, to let the previous TokenStream do its work. This goes on until the last TokenStream returns false, indicating that no more tokens are available.

captureState copies the state of all Attributes of the calling TokenStream into a State; restoreState overwrites every Attribute's state with whatever was captured before (the State given as its argument).
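The copy and overwrite semantics can be illustrated with a small plain-Java model. This is only a sketch and not Lucene code: the class AttributeModel and the attribute names "term" and "startOffset" are hypothetical stand-ins for the attributes shared by the chain, and the real captureState()/restoreState(State) live on Lucene's AttributeSource.

```java
import java.util.HashMap;
import java.util.Map;

final class CaptureRestoreDemo {
    // Hypothetical model of an attribute source: one shared, mutable set of
    // attribute values that every stage of the chain reads and writes.
    static final class AttributeModel {
        private final Map<String, Object> values = new HashMap<>();

        void set(String name, Object value) { values.put(name, value); }
        Object get(String name) { return values.get(name); }

        // Models captureState(): an independent snapshot of all current values.
        Map<String, Object> capture() { return new HashMap<>(values); }

        // Models restoreState(state): overwrite current values with the snapshot.
        void restore(Map<String, Object> snapshot) { values.putAll(snapshot); }
    }

    static String run() {
        AttributeModel atts = new AttributeModel();
        atts.set("term", "b");
        atts.set("startOffset", 2);
        Map<String, Object> saved = atts.capture();

        // Some other stage overwrites the shared values in the meantime.
        atts.set("term", "c");
        atts.set("startOffset", 4);

        // Restoring brings back "b" with its offset; only the term is replaced.
        atts.restore(saved);
        atts.set("term", "bb");
        return atts.get("term") + "@" + atts.get("startOffset");
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "bb@2"
    }
}
```

The snapshot is a copy, so later writes to the shared values do not affect it; restoring then discards whatever was current at that moment.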

Your token filter works like this: it calls input.incrementToken(), so that the previous TokenStream sets the Attributes' state to what will be the next token. Then, if your condition holds (say, termAtt is "b"), it adds "bb" to a queue, saves the state and returns true, so that the client can consume the token. On the next call of incrementToken(), it does not call input.incrementToken(). Whatever the current state is at that point, it represents the previous, already consumed token. The filter then restores the saved state, so that everything is exactly as it was for "b", sets "bb" as the current term and returns true, so that the client can consume this token as well. Only on the call after that does it (again) pull the next token from the previous filter.

This won't actually produce the graph you displayed; it inserts "bb" after "b", so the result is really

(a) -> (b) -> (bb) -> (c)

So why save the state in the first place? When producing tokens, you want to make sure that features such as phrase queries or highlighting work correctly. When you have the text "a b c" and "bb" is a synonym for "b", you'd expect the phrase query "b c" to match, as well as "bb c". You have to tell the index that "b" and "bb" are in the same position. Lucene uses a position increment for that. By default the position increment is 1, meaning that every new token (read: every call of incrementToken()) comes one position after the previous one. So, with the final positions, the produced stream is

(a:1) -> (b:2) -> (bb:3) -> (c:4)

while you actually want

(a:1) --> (b:2)  --> (c:3)
      \             /
       --> (bb:2) --

So, for your filter to produce the graph, you have to set the position increment to 0 for the inserted "bb":

private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
// later in incrementToken
restoreState(savedState);
posIncAtt.setPositionIncrement(0);
termAtt.setEmpty().append(extraTokens.remove());

restoreState makes sure that the other attributes, such as offsets and token type, are preserved, so you only have to change the ones your use case requires. Yes, you are overwriting whatever state was there before restoreState, so it is your responsibility to use it in the right place. And as long as you don't call input.incrementToken(), you don't advance the input stream, so you can do whatever you want with the state.
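The effect of the position increment on the final positions can be checked with a small plain-Java simulation. This is a sketch under stated assumptions, not Lucene code: the Token class and the injectSynonym and render helpers are hypothetical, and positions are counted from 1 only to match the diagrams in this answer.

```java
import java.util.ArrayList;
import java.util.List;

final class SynonymPositionDemo {
    // Hypothetical stand-in for a token: just a term and its position increment.
    static final class Token {
        final String term;
        final int posInc;
        Token(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    // Mimics the filter: pass every token through and emit "bb" right after "b".
    // An increment of 1 models the original filter, 0 the corrected one.
    static List<Token> injectSynonym(List<Token> input, int synonymPosInc) {
        List<Token> out = new ArrayList<>();
        for (Token t : input) {
            out.add(t);
            if (t.term.equals("b")) {
                out.add(new Token("bb", synonymPosInc));
            }
        }
        return out;
    }

    // Absolute position is the running sum of the increments.
    static String render(List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        int position = 0;
        for (Token t : tokens) {
            position += t.posInc;
            if (sb.length() > 0) sb.append(' ');
            sb.append(t.term).append(':').append(position);
        }
        return sb.toString();
    }

    static List<Token> abc() {
        return List.of(new Token("a", 1), new Token("b", 1), new Token("c", 1));
    }

    public static void main(String[] args) {
        System.out.println(render(injectSynonym(abc(), 1))); // a:1 b:2 bb:3 c:4
        System.out.println(render(injectSynonym(abc(), 0))); // a:1 b:2 bb:2 c:3
    }
}
```

With an increment of 0, "bb" lands on the same position as "b" and "c" follows both of them directly, which is what the phrase queries "b c" and "bb c" rely on.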

2.

A stemmer only changes the token; it typically neither produces new tokens nor changes the position increment or offsets. Also, since the position increment means that the current term comes positionIncrement positions after the previous token, qux should have an increment of 1, because it is the next token after fo, and bar should have an increment of 0, because it is in the same position as qux. The table should rather look like this:

+--------+---------------+-----------+--------------+-----------+
|  Term  |  startOffset  | endOffset | posIncrement | posLength |
+--------+---------------+-----------+--------------+-----------+
|  fo    |       0       |     3     |      1       |     1     |
|  qux   |       4       |     11    |      1       |     2     |
|  bar   |       4       |     7     |      0       |     1     |
|  baz   |       8       |     11    |      1       |     1     |
+--------+---------------+-----------+--------------+-----------+

As a basic rule, for multi-term synonyms where "ABC" is a synonym for "a b c", you should see that

  • positionIncrement("ABC") > 0 (the increment of the first token)
  • positionIncrement(*) >= 0 (positions must not go backwards)
  • startOffset("ABC") == startOffset("a") and endOffset("ABC") == endOffset("c")
    • actually, tokens at the same (start|end) position must have the same (start|end) offset
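These rules can be turned into a small consistency check over an attribute table. The following is a plain-Java sketch, not part of Lucene: the Row class and firstViolation are hypothetical, the end position of a token is assumed to be its position plus its position length, and the first rule is checked as "the first token of the stream has an increment greater than 0".

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class TokenGraphCheck {
    // Hypothetical row of the attribute table; not a Lucene type.
    static final class Row {
        final String term;
        final int startOffset, endOffset, posInc, posLength;
        Row(String term, int startOffset, int endOffset, int posInc, int posLength) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.posInc = posInc;
            this.posLength = posLength;
        }
    }

    // Returns a description of the first rule violation, or empty if there is none.
    static Optional<String> firstViolation(List<Row> rows) {
        Map<Integer, Integer> startOffsetAt = new HashMap<>(); // start position -> start offset
        Map<Integer, Integer> endOffsetAt = new HashMap<>();   // end position -> end offset
        int position = 0;
        boolean first = true;
        for (Row r : rows) {
            if (first && r.posInc <= 0) return Optional.of(r.term + ": first increment must be > 0");
            if (r.posInc < 0) return Optional.of(r.term + ": increment must be >= 0");
            first = false;
            position += r.posInc;

            Integer knownStart = startOffsetAt.putIfAbsent(position, r.startOffset);
            if (knownStart != null && knownStart != r.startOffset) {
                return Optional.of(r.term + ": start offset differs at position " + position);
            }
            int endPosition = position + r.posLength;
            Integer knownEnd = endOffsetAt.putIfAbsent(endPosition, r.endOffset);
            if (knownEnd != null && knownEnd != r.endOffset) {
                return Optional.of(r.term + ": end offset differs at end position " + endPosition);
            }
        }
        return Optional.empty();
    }

    // The table from the question.
    static List<Row> questionTable() {
        return List.of(new Row("foo", 0, 3, 1, 1), new Row("fo", 0, 3, 0, 1),
                new Row("qux", 4, 11, 0, 2), new Row("bar", 4, 7, 1, 1), new Row("baz", 8, 11, 1, 1));
    }

    // The corrected table from this answer.
    static List<Row> answerTable() {
        return List.of(new Row("fo", 0, 3, 1, 1), new Row("qux", 4, 11, 1, 2),
                new Row("bar", 4, 7, 0, 1), new Row("baz", 8, 11, 1, 1));
    }
}
```

Under these assumptions the corrected table passes, while the table from the question is flagged at qux: with an increment of 0 it shares a position with foo, but its start offset is 4 rather than 0.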

Hope this helps to shed some light.
