How to highlight nested fields in Elasticsearch

Despite the way Lucene structures nested documents internally, I'm trying to get my nested fields highlighted when a search term matches their content.

Here is the explanation from the Elasticsearch documentation (mapping, nested type):

Internal Implementation

Internally, nested objects are indexed as additional documents, but, since they can be guaranteed to be indexed within the same "block", it allows for extremely fast joining with parent docs.

Those internal nested documents are automatically masked away when doing operations against the index (like searching with a match_all query), and they bubble out when using the nested query.

Because nested docs are always masked to the parent doc, the nested docs can never be accessed outside the scope of the nested query. For example stored fields can be enabled on fields inside nested objects, but there is no way of retrieving them, since stored fields are fetched outside of the nested query scope.

0. In my case

I have an Elasticsearch index containing a mapping like the following:

{
    "my_documents": {
        "dynamic_date_formats": [
            "dd.MM.yyyy",
            "yyyy-MM-dd",
            "yyyy-MM-dd HH:mm:ss"
        ],
        "index_analyzer": "Analyzer2_index",
        "search_analyzer": "Analyzer2_search_decompound",
        "_timestamp": {
            "enabled": true
        },
        "properties": {
            "identifier": {
                "type": "string"
            },
            "description": {
                "type": "multi_field",
                "fields": {
                    "sort": {
                        "type": "string",
                        "index": "not_analyzed"
                    },
                    "description": {
                        "type": "string"
                    }
                }
            },
            "files": {
                "type": "nested",
                "include_in_root": true,
                "properties": {
                    "content": {
                        "type": "string",
                        "include_in_root": true
                    }
                }
            },
            "and then some other": "normal string fields"
        }
    }
}

I'm trying to execute a query like this:

{
    "size": 100,
    "query": {
        "bool": {
            "should": [
                {
                    "nested": {
                        "path": "files",
                        "query": {
                            "bool": {
                                "should": {
                                    "match": {
                                        "content": {
                                            "query": "burpcontrol",
                                            "minimum_should_match": "85%"
                                        }
                                    }
                                }
                            }
                        }
                    }
                },
                {
                    "match": {
                        "description": {
                            "query": "burpcontrol",
                            "minimum_should_match": "85%"
                        }
                    }
                },
                {
                    "match": {
                        "identifier": {
                            "query": "burpcontrol",
                            "minimum_should_match": "85%"
                        }
                    }
                }
            ]
        }
    },
    "highlight": {
        "pre_tags": [
            "<span style=\"background-color: yellow\">"
        ],
        "post_tags": [
            "</span>"
        ],
        "order": "score",
        "no_match_size": 100,
        "fragment_size": 50,
        "number_of_fragments": 3,
        "require_field_match": true,
        "fields": {
            "files.content": {},
            "description": {},
            "identifier": {}
        }
    }
}

The problems I have are:

1. require_field_match

If I use "require_field_match": false, the search term is highlighted in ALL the fields, even though highlighting does not otherwise work on nested fields. This is the solution I'm actually using, but the performance is horrible: my query needs 5 seconds for 10 documents, 25 seconds for 50 documents, and about 50 seconds for 100 documents. If I remove the nested field from the highlighting, everything is lightning fast!
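As a quick sanity check on those numbers, here is a small Python sketch that works out the cost per returned document. It uses only the timings quoted above (with the 100-document figure taken as exactly 50 seconds, which is an assumption). A constant per-document cost would suggest, though not prove, that the time goes into highlighting each hit rather than into finding the matches.

```python
# Timings reported above: number of returned documents -> seconds taken.
# The 100-document value is "about 50" in the text; 50.0 is assumed here.
timings = {10: 5.0, 50: 25.0, 100: 50.0}

# Seconds spent per returned document at each result size.
per_doc = {size: seconds / size for size, seconds in timings.items()}

for size, cost in per_doc.items():
    print(f"{size} documents: {cost:.2f} s per document")
```

If the cost really is flat per hit, reducing "size" or the amount of text highlighted per hit would be the first things to measure.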

2. include_in_root

I would like to have a flattened version of my nested fields (so as to store them as normal objects / fields). To do this I should specify

"files": { "type": "nested", "include_in_root": true, ...

but I don't know why, after reindexing, I cannot see any additional flattened field in the document root (I was expecting something like "files.content":["content1", "content2", "..."]).

If it worked, it would then be possible to access the content of the nested field (through the flattened field) and perform the highlighting on it.
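To make that expectation concrete, here is a minimal Python sketch of the flattened shape being described. It only illustrates the data layout: the helper name flatten_nested and the sample document are hypothetical, and it does not reproduce what Elasticsearch does internally. As far as I understand, include_in_root changes what is indexed for the root document and leaves the stored _source untouched, which may be why no extra field shows up when the document is retrieved.

```python
def flatten_nested(doc, path):
    """Return a copy of doc where the nested objects under `path` are
    collected into flat "path.field" lists (hypothetical helper)."""
    # Keep every top-level field except the nested one.
    flat = {key: value for key, value in doc.items() if key != path}
    # Gather each sub-field of every nested object into one list per field.
    for item in doc.get(path, []):
        for field, value in item.items():
            flat.setdefault(f"{path}.{field}", []).append(value)
    return flat


# Hypothetical document shaped like the mapping above.
doc = {
    "identifier": "doc-1",
    "files": [{"content": "content1"}, {"content": "content2"}],
}

flattened = flatten_nested(doc, "files")
print(flattened)
```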

Do you know whether it is possible to achieve good (and performant) highlighting on nested fields or, at least, can you suggest why my query is so slow? (I have already optimised the fragments.)

There are a number of things you can do here, with a parent/child relationship. I'll go over a few, and hopefully that will lead you in the right direction; it will still take lots of testing to figure out whether this solution is going to be more performant for you. Also, I left out a few of the details of your setup, for clarity. Please forgive the long post.

I set up a parent/child mapping as follows:

DELETE /test_index

PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0
   },
   "mappings": {
      "parent_doc": {
         "properties": {
            "identifier": {
               "type": "string"
            },
            "description": {
               "type": "string"
            }
         }
      },
      "child_doc": {
         "_parent": {
            "type": "parent_doc"
         },
         "properties": {
            "content": {
               "type": "string"
            }
         }
      }
   }
}

Then added some test docs:

POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"parent_doc","_id":1}}
{"identifier": "first", "description":"some special text"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":1}}
{"content":"text that is special"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":1}}
{"content":"text that is not"}
{"index":{"_index":"test_index","_type":"parent_doc","_id":2}}
{"identifier": "second", "description":"some different text"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":2}}
{"content":"different child text, but special"}
{"index":{"_index":"test_index","_type":"parent_doc","_id":3}}
{"identifier": "third", "description":"we don't want this parent"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":3}}
{"content":"or this child"}
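The bulk body above is newline-delimited JSON: one action line followed by one source line per document, with a trailing newline at the end. If you generate it from code rather than typing it into Sense, a minimal Python sketch could look like the following. The helper names (parent_action, child_action, bulk_body) are hypothetical, and actually sending the body to /test_index/_bulk is left out.

```python
import json


def parent_action(doc_id, identifier, description):
    """Action and source lines for one parent_doc (hypothetical helper)."""
    action = {"index": {"_index": "test_index", "_type": "parent_doc", "_id": doc_id}}
    return action, {"identifier": identifier, "description": description}


def child_action(parent_id, content):
    """Action and source lines for one child_doc routed to its parent."""
    action = {"index": {"_index": "test_index", "_type": "child_doc", "_parent": parent_id}}
    return action, {"content": content}


def bulk_body(actions):
    """Serialize (action, source) pairs as newline-delimited JSON."""
    lines = []
    for action, source in actions:
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


actions = [
    parent_action(1, "first", "some special text"),
    child_action(1, "text that is special"),
    child_action(1, "text that is not"),
    parent_action(2, "second", "some different text"),
    child_action(2, "different child text, but special"),
    parent_action(3, "third", "we don't want this parent"),
    child_action(3, "or this child"),
]

body = bulk_body(actions)
print(body)
```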

If I'm understanding your specs correctly, we would want a query for "special" to return every one of these documents except the last two (correct me if I'm wrong). We want docs that match the text, have a child that matches the text, or have a parent that matches the text.
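Those three conditions can be sketched in plain Python against the test docs above, just to check which documents should come back. This is not how Elasticsearch evaluates the query: the matches function is a crude whole-word stand-in for an analyzed match query, and all names here are hypothetical.

```python
# Test data copied from the bulk request above.
parents = {
    1: "some special text",
    2: "some different text",
    3: "we don't want this parent",
}
children = [
    (1, "text that is special"),
    (1, "text that is not"),
    (2, "different child text, but special"),
    (3, "or this child"),
]


def matches(text, term):
    """Crude stand-in for a match query: case-insensitive whole-word test."""
    return term in text.lower().replace(",", " ").split()


def expected_hits(term):
    """Docs that match, have a matching child, or have a matching parent."""
    hits = []
    for parent_id, description in parents.items():
        has_matching_child = any(
            matches(content, term) for pid, content in children if pid == parent_id
        )
        if matches(description, term) or has_matching_child:
            hits.append(("parent_doc", parent_id))
    for parent_id, content in children:
        if matches(content, term) or matches(parents[parent_id], term):
            hits.append(("child_doc", content))
    return hits


hits = expected_hits("special")
for hit in hits:
    print(hit)
```

Under these assumptions the sketch selects five of the seven documents, leaving out parent 3 and its child.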

We can get back parents that match the query like this:

POST /test_index/parent_doc/_search
{
    "query": {
        "match": {
           "description": "special"
        }
    },
    "highlight": {
        "fields": {
            "description": {},
            "identifier": {}
        }
    }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1.1263815,
      "hits": [
         {
            "_index": "test_index",
            "_type": "parent_doc",
            "_id": "1",
            "_score": 1.1263815,
            "_source": {
               "identifier": "first",
               "description": "some special text"
            },
            "highlight": {
               "description": [
                  "some <em>special</em> text"
               ]
            }
         }
      ]
   }
}

And we can get back children that match the query like this:

POST /test_index/child_doc/_search
{
    "query": {
        "match": {
           "content": "special"
        }
    },
    "highlight": {
        "fields": {
            "content": {}
        }
    }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0.92364895,
      "hits": [
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "geUFenxITZSL7epvB568uA",
            "_score": 0.92364895,
            "_source": {
               "content": "text that is special"
            },
            "highlight": {
               "content": [
                  "text that is <em>special</em>"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "IMHXhM3VRsCLGkshx52uAQ",
            "_score": 0.80819285,
            "_source": {
               "content": "different child text, but special"
            },
            "highlight": {
               "content": [
                  "different child text, but <em>special</em>"
               ]
            }
         }
      ]
   }
}

We can get back parents that match the text and children that match the text like this:

POST /test_index/parent_doc,child_doc/_search
{
    "query": {
        "multi_match": {
           "query": "special",
           "fields": ["description", "content"]
        }
    },
    "highlight": {
        "fields": {
            "description": {},
            "identifier": {},
            "content": {}
        }
    }
}
...
{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 3,
      "max_score": 1.1263815,
      "hits": [
         {
            "_index": "test_index",
            "_type": "parent_doc",
            "_id": "1",
            "_score": 1.1263815,
            "_source": {
               "identifier": "first",
               "description": "some special text"
            },
            "highlight": {
               "description": [
                  "some <em>special</em> text"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "geUFenxITZSL7epvB568uA",
            "_score": 0.75740534,
            "_source": {
               "content": "text that is special"
            },
            "highlight": {
               "content": [
                  "text that is <em>special</em>"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "IMHXhM3VRsCLGkshx52uAQ",
            "_score": 0.6627297,
            "_source": {
               "content": "different child text, but special"
            },
            "highlight": {
               "content": [
                  "different child text, but <em>special</em>"
               ]
            }
         }
      ]
   }
}

However, to get back all the docs related to this query, we need to use a bool query:

POST /test_index/parent_doc,child_doc/_search
{
   "query": {
      "bool": {
         "should": [
            {
               "multi_match": {
                  "query": "special",
                  "fields": [
                     "description",
                     "content"
                  ]
               }
            },
            {
               "has_child": {
                  "type": "child_doc",
                  "query": {
                     "match": {
                        "content": "special"
                     }
                  }
               }
            },
            {
               "has_parent": {
                  "type": "parent_doc",
                  "query": {
                     "match": {
                        "description": "special"
                     }
                  }
               }
            }
         ]
      }
   },
    "highlight": {
        "fields": {
            "description": {},
            "identifier": {},
            "content": {}
        }
    },
    "fields": ["_parent", "_source"]
}
...
{
   "took": 5,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 5,
      "max_score": 0.8866254,
      "hits": [
         {
            "_index": "test_index",
            "_type": "parent_doc",
            "_id": "1",
            "_score": 0.8866254,
            "_source": {
               "identifier": "first",
               "description": "some special text"
            },
            "highlight": {
               "description": [
                  "some <em>special</em> text"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "geUFenxITZSL7epvB568uA",
            "_score": 0.67829096,
            "_source": {
               "content": "text that is special"
            },
            "fields": {
               "_parent": "1"
            },
            "highlight": {
               "content": [
                  "text that is <em>special</em>"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "IMHXhM3VRsCLGkshx52uAQ",
            "_score": 0.18709806,
            "_source": {
               "content": "different child text, but special"
            },
            "fields": {
               "_parent": "2"
            },
            "highlight": {
               "content": [
                  "different child text, but <em>special</em>"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "NiwsP2VEQBKjqu1M4AdjCg",
            "_score": 0.12531912,
            "_source": {
               "content": "text that is not"
            },
            "fields": {
               "_parent": "1"
            }
         },
         {
            "_index": "test_index",
            "_type": "parent_doc",
            "_id": "2",
            "_score": 0.12531912,
            "_source": {
               "identifier": "second",
               "description": "some different text"
            }
         }
      ]
   }
}

(I included the "_parent" field to make it easier to see why docs were included in the results, as shown here.)
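Because parents and children come back as separate hits, the client has to put them back together. Here is a minimal Python sketch of that step, assuming the response shape shown above (children carry their parent id under "fields"."_parent", parents are identified by "_id"). The hits list is trimmed from that response and group_by_parent is a hypothetical helper.

```python
from collections import defaultdict


def group_by_parent(hits):
    """Group hits under the id of the parent document they belong to."""
    groups = defaultdict(list)
    for hit in hits:
        if hit["_type"] == "parent_doc":
            parent_id = hit["_id"]
        else:
            # Child hits expose their parent id through the requested field.
            parent_id = hit["fields"]["_parent"]
        groups[parent_id].append(hit)
    return dict(groups)


# Trimmed copy of the hits from the bool query response above.
hits = [
    {"_type": "parent_doc", "_id": "1",
     "highlight": {"description": ["some <em>special</em> text"]}},
    {"_type": "child_doc", "_id": "geUFenxITZSL7epvB568uA",
     "fields": {"_parent": "1"},
     "highlight": {"content": ["text that is <em>special</em>"]}},
    {"_type": "child_doc", "_id": "IMHXhM3VRsCLGkshx52uAQ",
     "fields": {"_parent": "2"},
     "highlight": {"content": ["different child text, but <em>special</em>"]}},
    {"_type": "child_doc", "_id": "NiwsP2VEQBKjqu1M4AdjCg",
     "fields": {"_parent": "1"}},
    {"_type": "parent_doc", "_id": "2"},
]

grouped = group_by_parent(hits)
for parent_id, members in grouped.items():
    # Collect every highlighted fragment found in this parent's group.
    fragments = [
        fragment
        for member in members
        for field_fragments in member.get("highlight", {}).values()
        for fragment in field_fragments
    ]
    print(parent_id, fragments)
```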

Let me know if this helps.

Here is the code I used:

http://sense.qbox.io/gist/d69a4d6531dc063faa4b4e094cff2a472a73c5a6
