Lucene.net - can't do a multiword search
I have the following documents stored in my Lucene index:
{
    "id": 1,
    "name": "John Smith",
    "description": "worker",
    "additionalData": "faster data",
    "attributes": "is_hired=not"
},
{
    "id": 2,
    "name": "Alan Smith",
    "description": "hired",
    "additionalData": "faster drive",
    "attributes": "is_hired=not"
},
{
    "id": 3,
    "name": "Mike Std",
    "description": "hired",
    "additionalData": "faster check",
    "attributes": "is_hired=not"
}
Now I want to search across the fields to check whether a given value exists:
search term: "John data check"
This should return the documents with IDs 1 and 3, but it doesn't. Why?
var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
BooleanQuery mainQuery = new BooleanQuery();
mainQuery.MinimumNumberShouldMatch = 1;
var cols = new string[] {
    "name",
    "additionalData"
};
string[] words = searchData.text.Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries);
var queryParser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_30, cols, analyzer);
foreach (var word in words)
{
    BooleanQuery innerQuery = new BooleanQuery();
    innerQuery.MinimumNumberShouldMatch = 1;
    innerQuery.Add(queryParser.Parse(word), Occur.SHOULD);
    mainQuery.Add(innerQuery, Occur.MUST);
}
TopDocs hits = searcher.Search(mainQuery, null, int.MaxValue, Sort.RELEVANCE);
// hits.TotalHits is 0 !!
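The query you construct effectively requires all three words to match.
You wrap each word in a BooleanQuery with a SHOULD clause. This is equivalent to using the inner query directly: you are only adding an indirection that doesn't change the query's behavior. That boolean query has a single clause, which has to match for the boolean query to match.
Then you wrap each of these in another boolean query, this time as a MUST clause. This means every clause must match for the query to match.
For a BooleanQuery to match, all MUST clauses have to be satisfied; if there are none, at least one SHOULD clause has to be satisfied (MinimumNumberShouldMatch can raise that minimum). Leave that property at its default value, as the documented behavior is:
By default no optional clauses are necessary for a match (unless there are no required clauses).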
Effectively, your query is (leaving out the MultiFieldQueryParser for simplicity):
+(john) +(data) +(check)
Or, in tree form:
BooleanQuery
    MUST: BooleanQuery
        SHOULD: TermQuery: john
    MUST: BooleanQuery
        SHOULD: TermQuery: data
    MUST: BooleanQuery
        SHOULD: TermQuery: check
Which simplifies to:
BooleanQuery
    MUST: TermQuery: john
    MUST: TermQuery: data
    MUST: TermQuery: check
But the query you want is:
BooleanQuery
    SHOULD: TermQuery: john
    SHOULD: TermQuery: data
    SHOULD: TermQuery: check
So, remove the mainQuery.MinimumNumberShouldMatch = 1; line and replace the body of your foreach with the following, and it should do the job:
mainQuery.Add(queryParser.Parse(word), Occur.SHOULD);
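The matching rule described above can be checked with a small model in plain C#. This is only a sketch: the Node, Term, and Bool types and the lower-cased token sets are hypothetical stand-ins written for illustration, not Lucene.Net classes, and scoring, analysis, and MUST_NOT clauses are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Plain-C# model of the matching rule; none of these types come from Lucene.Net.
enum Occur { Must, Should }

abstract class Node
{
    public abstract bool Matches(ISet<string> docTerms);
}

sealed class Term : Node
{
    private readonly string _text;
    public Term(string text) { _text = text; }
    public override bool Matches(ISet<string> docTerms) => docTerms.Contains(_text);
}

sealed class Bool : Node
{
    private readonly List<(Node Query, Occur Kind)> _clauses = new List<(Node Query, Occur Kind)>();
    public int MinimumNumberShouldMatch { get; set; }

    public Bool Add(Node query, Occur kind) { _clauses.Add((query, kind)); return this; }

    public override bool Matches(ISet<string> docTerms)
    {
        var must = _clauses.Where(c => c.Kind == Occur.Must).ToList();
        var should = _clauses.Where(c => c.Kind == Occur.Should).ToList();

        // Every MUST clause has to match.
        if (must.Any(c => !c.Query.Matches(docTerms))) return false;

        // With no MUST clauses, at least one SHOULD clause has to match.
        int required = Math.Max(MinimumNumberShouldMatch, must.Count == 0 ? 1 : 0);
        return should.Count(c => c.Query.Matches(docTerms)) >= required;
    }
}

static class Demo
{
    static void Main()
    {
        // Tokens of the "name" and "additionalData" fields, lower-cased by hand.
        var docs = new[]
        {
            (Id: 1, Terms: new HashSet<string> { "john", "smith", "faster", "data" }),
            (Id: 2, Terms: new HashSet<string> { "alan", "smith", "faster", "drive" }),
            (Id: 3, Terms: new HashSet<string> { "mike", "std", "faster", "check" }),
        };
        var words = new[] { "john", "data", "check" };

        // Shape of the question's query: each word wrapped, then added as MUST.
        var original = new Bool();
        foreach (var w in words)
            original.Add(new Bool().Add(new Term(w), Occur.Should), Occur.Must);

        // Shape of the suggested fix: each word added directly as SHOULD.
        var corrected = new Bool();
        foreach (var w in words)
            corrected.Add(new Term(w), Occur.Should);

        Console.WriteLine("original:  " + string.Join(",", docs.Where(d => original.Matches(d.Terms)).Select(d => d.Id)));
        Console.WriteLine("corrected: " + string.Join(",", docs.Where(d => corrected.Matches(d.Terms)).Select(d => d.Id)));
    }
}
```

In this model the MUST-wrapped query matches no document, because no single document contains all three words, while the SHOULD version matches documents 1 and 3.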
OK, here's a complete example that works for me:
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
var directory = new RAMDirectory();

// Index the three sample documents in memory.
using (var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
{
    var doc = new Document();
    doc.Add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("name", "John Smith", Field.Store.NO, Field.Index.ANALYZED));
    doc.Add(new Field("additionalData", "faster data", Field.Store.NO, Field.Index.ANALYZED));
    writer.AddDocument(doc);

    doc = new Document();
    doc.Add(new Field("id", "2", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("name", "Alan Smith", Field.Store.NO, Field.Index.ANALYZED));
    doc.Add(new Field("additionalData", "faster drive", Field.Store.NO, Field.Index.ANALYZED));
    writer.AddDocument(doc);

    doc = new Document();
    doc.Add(new Field("id", "3", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("name", "Mike Std", Field.Store.NO, Field.Index.ANALYZED));
    doc.Add(new Field("additionalData", "faster check", Field.Store.NO, Field.Index.ANALYZED));
    writer.AddDocument(doc);
}

// One SHOULD clause per word: a document matches if any word matches.
var words = new[] {"John", "data", "check"};
var parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_30, new[] {"name", "additionalData"}, analyzer);
var mainQuery = new BooleanQuery();
foreach (var word in words)
    mainQuery.Add(parser.Parse(word), Occur.SHOULD); // Should probably use parser.Parse(QueryParser.Escape(word)) instead

using (var searcher = new IndexSearcher(directory))
{
    var results = searcher.Search(mainQuery, null, int.MaxValue, Sort.RELEVANCE);
    var idFieldSelector = new MapFieldSelector("id"); // load only the stored "id" field
    foreach (var scoreDoc in results.ScoreDocs)
    {
        var doc = searcher.Doc(scoreDoc.Doc, idFieldSelector);
        Console.WriteLine("Found: {0}", doc.Get("id"));
    }
}
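The comment in the example suggests escaping each word before parsing it. A minimal sketch of that idea, assuming Lucene.Net 3.0.x where QueryParser.Escape is a static method; the EscapedSearch class and its Build method are hypothetical names used only for illustration:

```csharp
using System;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

static class EscapedSearch
{
    // Hypothetical helper: ORs together the escaped words of a raw user input.
    public static BooleanQuery Build(QueryParser parser, string userInput)
    {
        var query = new BooleanQuery();
        var words = userInput.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var word in words)
        {
            // Escape turns query-syntax characters such as ( ) : * ? into literals,
            // so a stray "(" in the input no longer causes a ParseException.
            query.Add(parser.Parse(QueryParser.Escape(word)), Occur.SHOULD);
        }
        return query;
    }
}
```

It could be called as EscapedSearch.Build(parser, "John (data check"). Escaping also neutralises * and ?, so it is not suitable for input that is meant to contain wildcards.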
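Well, in my case I stored an array of strings under the same field name, so I had to retrieve all the field values from the resulting Document, because Document.Get("field_name") returns only the first value when several fields share the same name: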
var multi_fields = doc.GetFields("field_name");
var field_values = multi_fields.Select(x => x.StringValue).ToArray();
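The difference can be seen on an in-memory Document without any index. This is a sketch assuming Lucene.Net 3.0.x; the sample values "red" and "blue" are made up, and Document.GetValues is shown as a shorter alternative to the GetFields/Select combination above:

```csharp
using System;
using Lucene.Net.Documents;

static class MultiValueDemo
{
    static void Main()
    {
        // Two values stored under the same field name (sample values are made up).
        var doc = new Document();
        doc.Add(new Field("field_name", "red", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("field_name", "blue", Field.Store.YES, Field.Index.NOT_ANALYZED));

        string first = doc.Get("field_name");        // only the first value
        string[] all = doc.GetValues("field_name");  // every value, in the order added

        Console.WriteLine(first);
        Console.WriteLine(string.Join(", ", all));
    }
}
```

For documents loaded from a search hit, only fields indexed with Field.Store.YES can be read back this way.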
Also, I had to enable wildcard search, because the search would fail if I didn't type the complete word (for example, Jo instead of John):
string[] words = "Jo data check"
    .Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries)
    .Select(x => string.Format("*{0}*", x))
    .ToArray();
var queryParser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_30, cols, analyzer);
queryParser.AllowLeadingWildcard = true; // must be set before parsing terms that start with '*'
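A leading wildcard makes Lucene enumerate every term of the field, which can get slow on a large index. If matching from the start of a word is enough (Jo finding John), prefix queries built directly are one possible alternative. This is a sketch assuming Lucene.Net 3.0.x and an index built with StandardAnalyzer; the PrefixSearch class and Build method are hypothetical names, and the input is lower-cased by hand because programmatically built queries are not passed through the analyzer:

```csharp
using System.Collections.Generic;
using Lucene.Net.Index;
using Lucene.Net.Search;

static class PrefixSearch
{
    // Hypothetical helper: one SHOULD prefix clause per (word, field) pair.
    public static BooleanQuery Build(IEnumerable<string> words, IEnumerable<string> fields)
    {
        var fieldList = new List<string>(fields);
        var query = new BooleanQuery();
        foreach (var word in words)
        {
            // StandardAnalyzer lower-cases terms at index time, so do the same here.
            var prefix = word.ToLowerInvariant();
            foreach (var field in fieldList)
                query.Add(new PrefixQuery(new Term(field, prefix)), Occur.SHOULD);
        }
        return query;
    }
}
```

With the names from the snippet above, PrefixSearch.Build(new[] { "Jo", "data", "check" }, cols) would produce a query that can be passed to searcher.Search in place of the parsed one. Unlike *Jo*, it does not match text in the middle of a word.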