
Lucene - Wildcards in phrases

I am currently attempting to use Lucene to search data populated in an index.

I can match exact phrases by enclosing them in quotation marks (e.g. "Processing Documents"), but I cannot get Lucene to find that phrase with any form of "Processing Document*".

The obvious difference is the wildcard at the end.

I am currently using Luke to view and search the index (it drops the asterisk at the end of the phrase when parsing).

Adding the quotes around the data seems to be the main culprit: searching for document* works, but "document*" does not.

Any assistance would be greatly appreciated.

Lucene 2.9 has ComplexPhraseQueryParser, which can handle wildcards in phrases.

What you're looking for is FuzzyQuery, which lets you search for results with similar words based on Levenshtein distance. Alternatively, you may also want to consider using the slop of PhraseQuery (also available in MultiPhraseQuery) if the order of words isn't significant.
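To make "similar words based on Levenshtein distance" concrete, here is a minimal plain-Java sketch of the edit distance itself. It is not Lucene's FuzzyQuery implementation, and the class name LevenshteinSketch is made up for illustration. It also shows that a fuzzy match is not the same thing as a trailing wildcard: a word that extends the prefix by several characters is several edits away.

```java
// Illustration only: textbook edit distance, not Lucene's FuzzyQuery code.
// The class and method names are hypothetical.
final class LevenshteinSketch {

    // Two-row dynamic programming: number of single-character insertions,
    // deletions and substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("document", "documents"));     // 1
        System.out.println(distance("document", "documentation")); // 5
    }
}
```

So "documents" is a close fuzzy match for "document", while "documentation" is five edits away even though document* would match both.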

It seems that the default QueryParser cannot handle this. You can probably create a custom QueryParser for wildcards in phrases. If your example is representative, stemming may solve your problem. Please read the documentation for PorterStemFilter to see whether it fits.

Not only does the QueryParser not support wildcards in phrases, PhraseQuery itself only supports Terms. MultiPhraseQuery comes closer, but as its summary says, you still need to enumerate IndexReader.terms yourself to find the terms that match the wildcard.
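The "enumerate the terms yourself" step can be sketched without Lucene. The example below assumes the term dictionary is available as a sorted set of strings (a stand-in for what IndexReader.terms walks over) and collects every term that starts with the prefix. The names PrefixExpansionSketch and expand are hypothetical. The resulting list is what you would then hand to MultiPhraseQuery as the alternatives for the last phrase position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Illustration only: a sorted in-memory set stands in for the index's
// term dictionary. Names here are hypothetical, not Lucene API.
final class PrefixExpansionSketch {

    // Terms sharing a prefix are contiguous in sorted order, so seek to the
    // prefix and read forward until a term no longer starts with it.
    static List<String> expand(TreeSet<String> termDictionary, String prefix) {
        List<String> matches = new ArrayList<>();
        for (String term : termDictionary.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(Arrays.asList(
                "doctor", "document", "documents", "dog", "processing"));
        // Candidate terms for the last position of "processing document*"
        System.out.println(expand(terms, "document")); // [document, documents]
    }
}
```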

Use a SpanNearQuery with a slop of 0.

Unfortunately there's no SpanWildcardQuery in Lucene.Net. You'll either need to use SpanMultiTermQueryWrapper, or with a little effort you can convert the Java version to C#.
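For anyone unsure what a span-near match with a slop of 0 means when the second clause is a wildcard, here is a rough plain-Java sketch of the matching rule only. It assumes the clauses must appear in order and that the wildcard is a simple trailing one (a prefix). It does not use any Lucene classes, and AdjacentPrefixMatchSketch is a made-up name.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: the matching rule, not Lucene's span machinery.
// Assumes in-order matching and a trailing wildcard on the second word.
final class AdjacentPrefixMatchSketch {

    // True if some token equals `first` and the token directly after it
    // (slop 0, no gap allowed) starts with `secondPrefix`.
    static boolean matches(List<String> tokens, String first, String secondPrefix) {
        for (int i = 0; i + 1 < tokens.size(); i++) {
            if (tokens.get(i).equals(first)
                    && tokens.get(i + 1).startsWith(secondPrefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> adjacent = Arrays.asList("processing", "documents", "quickly");
        List<String> gap = Arrays.asList("processing", "the", "documents");
        System.out.println(matches(adjacent, "processing", "document")); // true
        System.out.println(matches(gap, "processing", "document"));      // false
    }
}
```

A slop greater than 0 would let the second example match too, which is why 0 is the value to use for phrase-like behaviour.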

Another alternative is to use NGrams, and specifically the EdgeNGram: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory

This will index ngrams, i.e. parts of words. With a min ngram size of 5 and a max ngram size of 8, "Documents" would be indexed as: Docum, Docume, Documen, Document.

There is a bit of a tradeoff in index size and indexing time. One of the Solr books quotes as a rough guide: indexing takes 10 times longer, uses 5 times more disk space, and creates 6 times more distinct terms.

However, the EdgeNGram will do better than that, since it only produces grams anchored at the start of each word.

You do need to make sure that you don't submit wildcard characters in your queries. You aren't doing a wildcard search; you are matching a plain search term against the indexed ngrams (parts of words).
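Here is a small plain-Java sketch of the idea, in case it helps. It is not the Solr filter itself: EdgeNGramSketch and edgeNGrams are hypothetical names, and lowercasing at index time is my assumption. It generates the leading-edge grams for a token and then shows that a query term without any wildcard matches by simple lookup.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustration only: mimics what an edge n-gram filter emits for one token.
// Names are hypothetical; this is not Solr's EdgeNGramFilterFactory.
final class EdgeNGramSketch {

    // Prefixes of the token with lengths from minGram up to maxGram
    // (or the token length, whichever is smaller).
    static List<String> edgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, token.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Index time: store lowercased grams (lowercasing is an assumption).
        Set<String> indexed = new HashSet<>();
        for (String gram : edgeNGrams("Documents", 5, 8)) {
            indexed.add(gram.toLowerCase(Locale.ROOT));
        }
        // Query time: plain term lookup, no '*' in the query.
        System.out.println(indexed.contains("docume")); // true
        System.out.println(indexed.contains("doc"));    // false: shorter than minGram
    }
}
```

Note that a query term longer than the max gram size (here the full word "documents", 9 characters) would not be found among the grams alone, so pick the max size with your longest expected query in mind.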

I was also looking for the same thing, and what I found is that PrefixQuery gives you the equivalent of something like "Processing Document*". The catch is that the field you are searching must be untokenized, and you must store it in lowercase yourself (since the field is untokenized, the indexer won't lowercase the values for you) for this to work. Here is the PrefixQuery code that worked for me:

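List<SearchResult> results = new List<SearchResult>();
// Open the existing index directory (false = do not create a new one)
Lucene.Net.Store.Directory searchDir = FSDirectory.GetDirectory(this._indexLocation, false);
IndexSearcher searcher = new IndexSearcher( searchDir );
Hits hits;

BooleanQuery query = new BooleanQuery();
// Match every value of the untokenized field that starts with the lowercased input
query.Add(new PrefixQuery(new Term(FILE_NAME_KEY, keyWords.ToLower())), BooleanClause.Occur.MUST);
hits = searcher.Search(query);
// FillResults is the poster's own helper that copies the hits into results
this.FillResults(hits, results);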
