Lucene wildcard queries

I have a question about Lucene.

I have a form, I get some text from it, and I want to perform a full-text search across several fields. Suppose the input text is "textToLook".

I have a Lucene Analyzer with several filters. One of them is lowerCaseFilter, so when I create the index, words are lowercased.

Imagine I want to search two fields, field1 and field2, so the Lucene query would be something like this (note that 'textToLook' is now 'texttolook'):

field1:texttolook* field2:texttolook*

In my class I have something like this to create the query. It works when there is no wildcard.

String text = "textToLook";
String[] fields = {"field1", "field2"};
//analyser is the same as the one used for indexing
Analyzer analyzer = fullTextEntityManager.getSearchFactory().getAnalyzer("customAnalyzer");
MultiFieldQueryParser parser = new MultiFieldQueryParser(fields, analyzer);
org.apache.lucene.search.Query queryTextoLibre = parser.parse(text);

With this code the query would be:

field1:texttolook field2:texttolook

But if I set text to "textToLook*" I get

field1:textToLook* field2:textToLook*

which won't match anything, because the indexed terms are in lowercase.

I have read this on the Lucene website:

"Wildcard, Prefix, and Fuzzy queries are not passed through the Analyzer, which is the component that performs operations such as stemming and lowercasing"

My problem cannot be solved by making the behaviour case-insensitive, because my analyzer has other filters which, for example, remove some word suffixes.

I think I can solve the problem by finding out what the text looks like after it has gone through my analyzer's filters; then I could append the "*" and build the Query with MultiFieldQueryParser. So in this example I would start with "textToLook", and after passing it through the filters I would get "texttolook". After this I could make "texttolook*".

But is there any way to get the value of my text variable after it has gone through all of my analyzer's filters? How can I get all the filters of my analyzer? Is this possible?

Thanks

Can you use QueryParser.setLowercaseExpandedTerms(true)?

http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F

** EDIT **

Okay, I understand your issue now. You actually want the wildcarded term to be stemmed before it's run through the wildcard query.

You can subclass QueryParser and override

protected Query getWildcardQuery(String field, String termStr) throws ParseException

to run termStr through the analyzer before the WildcardQuery is constructed.
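As a rough illustration of what such an override would do with termStr, here is a minimal sketch in plain Java. It does not use the Lucene API: the WildcardTermNormalizer class and the UnaryOperator filter chain are hypothetical stand-ins for the analyzer, and only trailing wildcards are handled. In a real subclass, the literal part would presumably be run through the analyzer's token stream instead.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical helper, not a Lucene class: models "analyze the literal part,
// then put the wildcard back".
final class WildcardTermNormalizer {
    // Stand-ins for the analyzer's filters (lowercasing, suffix removal, ...).
    private final List<UnaryOperator<String>> filters;

    WildcardTermNormalizer(List<UnaryOperator<String>> filters) {
        this.filters = filters;
    }

    /** Applies every filter to the literal prefix and leaves a trailing run of '*' or '?' untouched. */
    String normalize(String termStr) {
        int end = termStr.length();
        while (end > 0 && (termStr.charAt(end - 1) == '*' || termStr.charAt(end - 1) == '?')) {
            end--;
        }
        String literal = termStr.substring(0, end);
        String wildcards = termStr.substring(end);
        for (UnaryOperator<String> filter : filters) {
            literal = filter.apply(literal);
        }
        return literal + wildcards;
    }

    public static void main(String[] args) {
        // Assumption: a single lowercasing step stands in for the custom analyzer.
        UnaryOperator<String> lowerCase = s -> s.toLowerCase(Locale.ROOT);
        WildcardTermNormalizer normalizer = new WildcardTermNormalizer(Arrays.asList(lowerCase));
        System.out.println(normalizer.normalize("textToLook*")); // prints texttolook*
    }
}
```

The normalized string is what would then be handed to the wildcard query, so the prefix is compared against terms in the same form in which they were indexed.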

This might not be what the user expects, though. There's a reason why they've decided not to run wildcarded terms through the analyzer, per the FAQ:

The reason for skipping the Analyzer is that if you were searching for "dogs*" you would not want "dogs" first stemmed to "dog", since that would then match "dog*", which is not the intended query.
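The effect the FAQ describes can be seen with a toy example. This is only a sketch in plain Java: the stem method is a made-up stemmer that drops one trailing "s", and the term list is invented for illustration, so neither reflects a real Lucene analyzer or index.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class StemmedPrefixDemo {
    /** Toy stemmer (an assumption for illustration): drops one trailing 's'. */
    static String stem(String word) {
        return word.length() > 1 && word.endsWith("s")
                ? word.substring(0, word.length() - 1)
                : word;
    }

    /** Returns the terms that start with the given prefix, like a prefix query would. */
    static List<String> prefixMatches(List<String> terms, String prefix) {
        return terms.stream()
                .filter(term -> term.startsWith(prefix))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Invented index contents.
        List<String> indexedTerms = Arrays.asList("dog", "dogma", "dogsled");

        // The prefix as the user typed it.
        System.out.println(prefixMatches(indexedTerms, "dogs"));       // [dogsled]
        // The prefix after stemming: the query silently becomes "dog*".
        System.out.println(prefixMatches(indexedTerms, stem("dogs"))); // [dog, dogma, dogsled]
    }
}
```

Stemming the prefix widens the match set, which is the trade-off to weigh before analyzing wildcard terms.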
