Why do I get different terms in the same document in Lucene?

I have a sentence-based index where every document stores the content plus one field per token. The content field is named content and stores the whole sentence as a String; the token fields are named token0, token1, ..., token_n-1, each containing one token as a String.

I used the code sample from the Apache Lucene migration guide to get all unique terms for each field of the example sentence "This index is sentence-based.":

for(String field : fields) {
   Terms terms = fields.terms(field);
   TermsEnum termsEnum = terms.iterator(null);
   BytesRef text;
   while((text = termsEnum.next()) != null) {
     System.out.println("field=" + field + "; text=" + text.utf8ToString());
   }
}

sentence-based is recognized as a term in the field token3, but in the content field only sentence and based are recognized. It seems like fields.terms(field) uses a different Analyzer for each field.

I have no clue why I get different terms when calling fields.terms("content"). I want to get sentence-based as a term out of the content field instead of sentence and based.

I hope there is an explanation for this phenomenon.

It sounds like the content field is being analyzed and the tokenN fields aren't. Analysis happens at index time, so what you need to do is revisit your indexing code and find out why the content and token fields are being analyzed differently.

If you want the content field to be analyzed, just differently from how it is analyzed now, then okay. Do that! SimpleAnalyzer may be a good place to start.
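To make the difference concrete, here is a rough plain-Java sketch of three ways a field value can end up as terms. It is not Lucene code: the class AnalysisSketch and its method names are hypothetical, and each method only approximates the behavior it is named after. verbatim mimics a field indexed without analysis (like the tokenN fields appear to be), letterTokens mimics letter-based splitting with lowercasing (roughly what SimpleAnalyzer does), and whitespaceTokens mimics splitting on whitespace only (roughly what WhitespaceAnalyzer does).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; these methods approximate analyzer behavior
// and do not call into Lucene.
public class AnalysisSketch {

    // Unanalyzed field: the whole value becomes a single term.
    static List<String> verbatim(String value) {
        return Collections.singletonList(value);
    }

    // Letter-based analysis: break at every non-letter character and lowercase.
    static List<String> letterTokens(String value) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                terms.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    // Whitespace-based analysis: break only at whitespace, keep punctuation and case.
    static List<String> whitespaceTokens(String value) {
        List<String> terms = new ArrayList<>();
        for (String part : value.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String sentence = "This index is sentence-based.";
        System.out.println(verbatim("sentence-based"));   // [sentence-based]
        System.out.println(letterTokens(sentence));       // [this, index, is, sentence, based]
        System.out.println(whitespaceTokens(sentence));   // [This, index, is, sentence-based.]
    }
}
```

Note that letter-based splitting also breaks at the hyphen, so if the goal is to keep sentence-based as one term in the content field, a whitespace-based analyzer may be closer to what is wanted, at the cost of keeping the trailing period attached to the last token.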
