Lucene.net - can't do a multiword search

I have stored the following documents in my Lucene index:

{
"id" : 1,
"name": "John Smith",
"description": "worker",
"additionalData": "faster data",
"attributes": "is_hired=not"
},
{
"id" : 2,
"name": "Alan Smith",
"description": "hired",
"additionalData": "faster drive",
"attributes": "is_hired=not"
},
{
"id" : 3,
"name": "Mike Std",
"description": "hired",
"additionalData": "faster check",
"attributes": "is_hired=not"
}

Now I want to search over all the fields to check whether any of the given values exists:

search term: "John data check"

This should return the documents with IDs 1 and 3, but it doesn't. Why?

var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);

BooleanQuery mainQuery = new BooleanQuery();
mainQuery.MinimumNumberShouldMatch = 1;

var cols = new string[] {
                         "name",
                         "additionalData"
                        };

 string[] words = searchData.text.Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries);

 var queryParser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_30, cols, analyzer);

 foreach (var word in words)
 {
    BooleanQuery innerQuery = new BooleanQuery();
    innerQuery.MinimumNumberShouldMatch = 1;

    innerQuery.Add(queryParser.Parse(word), Occur.SHOULD);

    mainQuery.Add(innerQuery, Occur.MUST);
 }

 TopDocs hits = searcher.Search(mainQuery, null, int.MaxValue, Sort.RELEVANCE);

 //hits.TotalHits is 0 !!

The query you constructed basically requires all three words to match.

You wrap each word in a BooleanQuery with a SHOULD clause. This is equivalent to using the inner query directly (you're just adding an indirection which does not change the behavior of the query). The boolean query has only one clause, which should match for the boolean query to match.

Then, you wrap each one of these in another boolean query, this time with a MUST clause for each. This means each clause must match for the query to match.

For a BooleanQuery to match, all MUST clauses have to be satisfied. If there are no MUST clauses, at least one SHOULD clause has to be satisfied (or MinimumNumberShouldMatch of them, if that property is set higher). Leave that property at its default value, as the documented behavior is:

By default no optional clauses are necessary for a match (unless there are no required clauses).

Effectively, your query is (leaving out the per-field expansion done by MultiFieldQueryParser, for simplicity):

+(john) +(data) +(check)

Or, in tree form:

BooleanQuery
    MUST: BooleanQuery
        SHOULD: TermQuery: john
    MUST: BooleanQuery
        SHOULD: TermQuery: data
    MUST: BooleanQuery
        SHOULD: TermQuery: check

Which can be simplified to:

BooleanQuery
    MUST: TermQuery: john
    MUST: TermQuery: data
    MUST: TermQuery: check

But the query you want is:

BooleanQuery
    SHOULD: TermQuery: john
    SHOULD: TermQuery: data
    SHOULD: TermQuery: check

So, remove the mainQuery.MinimumNumberShouldMatch = 1; line, then replace your foreach body with the following and it should get the job done:

mainQuery.Add(queryParser.Parse(word), Occur.SHOULD);
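To see why the MUST version returns nothing while the SHOULD version returns documents 1 and 3, here is a minimal sketch of the matching rule described above. It is a toy model in plain C#, not Lucene code: `ToyOccur`, `BooleanMatchSketch` and `Matches` are made-up names, and each document is reduced to the lower-cased tokens of its `name` and `additionalData` fields.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy model only: mirrors the MUST/SHOULD rule, it does not call Lucene.
enum ToyOccur { Must, Should }

static class BooleanMatchSketch
{
    public static bool Matches(ISet<string> docTerms,
                               IEnumerable<KeyValuePair<ToyOccur, string>> clauses,
                               int minimumNumberShouldMatch = 0)
    {
        var list = clauses.ToList();
        var must = list.Where(c => c.Key == ToyOccur.Must).ToList();
        int shouldHits = list.Count(c => c.Key == ToyOccur.Should && docTerms.Contains(c.Value));

        // Every required clause has to match.
        if (must.Any(c => !docTerms.Contains(c.Value))) return false;

        // With no required clauses, at least one optional clause has to match.
        int needed = must.Count == 0
            ? Math.Max(1, minimumNumberShouldMatch)
            : minimumNumberShouldMatch;
        return shouldHits >= needed;
    }

    static void Main()
    {
        var docs = new Dictionary<int, HashSet<string>>
        {
            { 1, new HashSet<string> { "john", "smith", "faster", "data" } },
            { 2, new HashSet<string> { "alan", "smith", "faster", "drive" } },
            { 3, new HashSet<string> { "mike", "std", "faster", "check" } },
        };
        var words = new[] { "john", "data", "check" };

        foreach (var occur in new[] { ToyOccur.Must, ToyOccur.Should })
        {
            var clauses = words
                .Select(w => new KeyValuePair<ToyOccur, string>(occur, w))
                .ToList();
            var ids = docs.OrderBy(d => d.Key)
                          .Where(d => Matches(d.Value, clauses))
                          .Select(d => d.Key);
            // Prints "Must: []" and then "Should: [1, 3]".
            Console.WriteLine("{0}: [{1}]", occur, string.Join(", ", ids));
        }
    }
}
```

No document contains all three words, so the all-MUST query is empty, while the all-SHOULD query accepts any document containing at least one of them.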

OK, so here's a full example, which works for me:

var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);

var directory = new RAMDirectory();

using (var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
{
    var doc = new Document();
    doc.Add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("name", "John Smith", Field.Store.NO, Field.Index.ANALYZED));
    doc.Add(new Field("additionalData", "faster data", Field.Store.NO, Field.Index.ANALYZED));
    writer.AddDocument(doc);

    doc = new Document();
    doc.Add(new Field("id", "2", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("name", "Alan Smith", Field.Store.NO, Field.Index.ANALYZED));
    doc.Add(new Field("additionalData", "faster drive", Field.Store.NO, Field.Index.ANALYZED));
    writer.AddDocument(doc);

    doc = new Document();
    doc.Add(new Field("id", "3", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("name", "Mike Std", Field.Store.NO, Field.Index.ANALYZED));
    doc.Add(new Field("additionalData", "faster check", Field.Store.NO, Field.Index.ANALYZED));
    writer.AddDocument(doc);
}

var words = new[] {"John", "data", "check"};
var parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_30, new[] {"name", "additionalData"}, analyzer);


var mainQuery = new BooleanQuery();
foreach (var word in words)
    mainQuery.Add(parser.Parse(word), Occur.SHOULD); // Should probably use parser.Parse(QueryParser.Escape(word)) instead

using (var searcher = new IndexSearcher(directory))
{
    var results = searcher.Search(mainQuery, null, int.MaxValue, Sort.RELEVANCE);
    var idFieldSelector = new MapFieldSelector("id");

    foreach (var scoreDoc in results.ScoreDocs)
    {
        var doc = searcher.Doc(scoreDoc.Doc, idFieldSelector);
        Console.WriteLine("Found: {0}", doc.Get("id"));
    }
}
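The comment in the loop above suggests escaping each word before parsing it. A minimal sketch of that variant, assuming Lucene.Net 3.0.3 and wrapped in a hypothetical helper (`AnyWordQuery.Build` is not part of Lucene.Net), could look like this:

```csharp
using System.Collections.Generic;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

static class AnyWordQuery
{
    // Hypothetical helper: ORs together one parsed query per input word.
    public static BooleanQuery Build(QueryParser parser, IEnumerable<string> words)
    {
        var query = new BooleanQuery();
        foreach (var word in words)
        {
            // Escape turns query-syntax characters such as '(' or ':' into
            // literals, so raw user input like "(john" is not read as syntax.
            // Note that it also escapes '*' and '?', which disables wildcards.
            query.Add(parser.Parse(QueryParser.Escape(word)), Occur.SHOULD);
        }
        return query;
    }
}
```

With the example above it would be called as `AnyWordQuery.Build(parser, words)` in place of the foreach loop.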

Well, in my case I stored a string array under a single field name, so I had to retrieve all the field values from the result Document, because Document.Get("field_name") returns only the first value when there are several fields with the same name:

var multi_fields = doc.GetFields("field_name");
var field_values = multi_fields.Select(x => x.StringValue).ToArray();

Plus, I had to enable wildcard search, because the search fails if I don't type a full word, e.g. Jo instead of John:

 string[] words = "Jo data check".Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries).Select(x => string.Format("*{0}*", x)).ToArray();

 var queryParser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_30, cols, analyzer);
 queryParser.AllowLeadingWildcard = true;
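As an alternative to going through the parser, the same "contains" search can be sketched by building WildcardQuery objects directly. This assumes Lucene.Net 3.0.3 and fields indexed with StandardAnalyzer; `ContainsQuery.Build` is a hypothetical helper, not a Lucene.Net API.

```csharp
using System;
using Lucene.Net.Index;
using Lucene.Net.Search;

static class ContainsQuery
{
    // Hypothetical helper: matches documents in which any of the given
    // fields has a token containing any of the words as a substring.
    public static BooleanQuery Build(string[] fields, string text)
    {
        var query = new BooleanQuery();
        var words = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var word in words)
        {
            // Terms built by hand are not analyzed, so lower-case them here
            // to line up with the tokens StandardAnalyzer put in the index.
            var pattern = "*" + word.ToLowerInvariant() + "*";
            foreach (var field in fields)
                query.Add(new WildcardQuery(new Term(field, pattern)), Occur.SHOULD);
        }
        return query;
    }
}
```

A call such as `ContainsQuery.Build(cols, "Jo data check")` produces one SHOULD clause per word and field. Leading wildcards force Lucene to scan every term of the field, so this can get slow on large indexes.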
