
Lucene filter case sensitivity

I'm migrating from Lucene 3.0.1 to 4.1.0. After a few days of analysis, I suspect there is a difference in how these versions filter query results: after the migration, the same queries and filters return different results.

The situation is as follows:

I was using Lucene 3.0.1, but the StandardAnalyzer for the IndexWriter, for example, was configured this way:

new StandardAnalyzer(Version.LUCENE_24)

The same configuration was used for the QueryParser. There are a few Fields that are NOT_ANALYZED (meaning not indexed; deprecated in 4.x), and these cause the problem after migrating to 4.0.0 or 4.1.0. The problem is that the values of some NOT_ANALYZED Fields are UPPER CASE. The search process looks as follows:

  1. The QueryParser gets a Field (a Document has many values for the same Field, and these are the most important information for users) and a keyword.
  2. Filters with additional user criteria are prepared as QueryWrapperFilter(TermQuery(...)).
  3. I override getDocIdSet from org.apache.lucene.search.Filter, iterate over all prepared Filters calling filter.getDocIdSet(IndexReader), and collect the filtered elements.

I have found this answer regarding case sensitivity. I know that LowerCaseFilter is used in Lucene 2.4. What I did was rebuild the index with 4.x, but with all NOT_ANALYZED values now in lower case. Then the problem disappeared.

What could be the reason that, for my solution using 3.0.3, case sensitivity "does not matter", while in 4.x "it matters"? Maybe someone could explain what is happening under the hood.

Indexing and analyzing are two different things.

Analyzing means the field is put through the Analyzer of choice. Fields that are not analyzed are put in the index just the way they are.

If you index an uppercase string without analyzing it, it will stay uppercase in the index and will not be found by a lowercase query.
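To make the point concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: the class `ExactTermLookupSketch` and its methods are hypothetical stand-ins, assuming that an analyzed field is split into lowercased tokens while a not-analyzed field is stored as one verbatim term, and that a TermQuery-style lookup compares terms exactly.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index (NOT Lucene): illustrates why an exact term lookup
// is case sensitive for values that were indexed without analysis.
public class ExactTermLookupSketch {
    // "field:term" -> ids of the documents containing that exact term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Stand-in for an analyzed field: split on whitespace, lowercase each token.
    public void addAnalyzed(int docId, String field, String value) {
        for (String token : value.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                add(docId, field, token.toLowerCase(Locale.ROOT));
            }
        }
    }

    // Stand-in for a not-analyzed field: the whole value is one term, unchanged.
    public void addVerbatim(int docId, String field, String value) {
        add(docId, field, value);
    }

    private void add(int docId, String field, String term) {
        postings.computeIfAbsent(field + ":" + term, k -> new TreeSet<>()).add(docId);
    }

    // Exact lookup: the term is compared as given, with no lowercasing.
    public Set<Integer> lookup(String field, String term) {
        return postings.getOrDefault(field + ":" + term, Collections.emptySet());
    }

    public static void main(String[] args) {
        ExactTermLookupSketch index = new ExactTermLookupSketch();
        index.addVerbatim(1, "status", "ACTIVE");
        index.addAnalyzed(1, "title", "Lucene Filter Example");

        System.out.println(index.lookup("status", "ACTIVE")); // [1]
        System.out.println(index.lookup("status", "active")); // []
        System.out.println(index.lookup("title", "lucene"));  // [1]
        System.out.println(index.lookup("title", "Lucene"));  // []
    }
}
```

In this model, the only ways to match the uppercase value are to query with the exact same casing or to normalize the value before indexing, which is in effect what rebuilding the index with lowercase values did.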
