繁体   English   中英

如何突出显示Elasticsearch中的嵌套字段

[英]How to highlight nested fields in Elasticsearch

虽然是Lucene逻辑结构,但我试图在我们的内容中出现一些搜索结果时突出显示我的嵌套字段

以下是Elasticsearch文档的解释(映射嵌套类型 `)

内部实施

在内部,嵌套对象被索引为附加文档,但是,由于可以保证它们在同一“块”中被索引,因此可以非常快速地与父文档连接。

在对索引执行操作时会自动屏蔽这些内部嵌套文档(例如使用match_all查询进行搜索),并且在使用嵌套查询时它们会冒泡。

由于嵌套文档始终屏蔽到父文档,因此永远不能在嵌套查询的范围之外访问嵌套文档。 例如,可以在嵌套对象内的字段上启用存储字段,但无法检索它们,因为存储字段是在嵌套查询范围之外获取的。

在我的情况下

我有一个Elasticsearch索引,其中包含如下映射

{
    "my_documents": {
        "dynamic_date_formats": [
            "dd.MM.yyyy",
            "yyyy-MM-dd",
            "yyyy-MM-dd HH:mm:ss"
        ],
        "index_analyzer": "Analyzer2_index",
        "search_analyzer": "Analyzer2_search_decompound",
        "_timestamp": {
            "enabled": true
        },
        "properties": {
            "identifier": {
                "type": "string"
            },
            "description": {
                "type": "multi_field",
                "fields": {
                    "sort": {
                        "type": "string",
                        "index": "not_analyzed"
                    },
                    "description": {
                        "type": "string"
                    }
                }
            },
            "files": {
                "type": "nested",
                "include_in_root": true,
                "properties": {
                    "content": {
                        "type": "string",
                        "include_in_root": true
                    }
                }
            },
            "and then some other": "normal string fields"
        }
    }
}

我正在尝试执行这样的查询:

{
    "size": 100,
    "query": {
        "bool": {
            "should": [
                {
                    "nested": {
                        "path": "files",
                        "query": {
                            "bool": {
                                "should": {
                                    "match": {
                                        "content": {
                                            "query": "burpcontrol",
                                            "minimum_should_match": "85%"
                                        }
                                    }
                                }
                            }
                        }
                    }
                },
                {
                    "match": {
                        "description": {
                            "query": "burpcontrol",
                            "minimum_should_match": "85%"
                        }
                    }
                },
                {
                    "match": {
                        "identifier": {
                            "query": "burpcontrol",
                            "minimum_should_match": "85%"
                        }
                    }
                }            ]
        }
    },
    "highlight": {
        "pre_tags": [
            "<span style=\"background-color: yellow\">"
        ],
        "post_tags": [
            "</span>"
        ],
        "order": "score",
        "no_match_size": 100,
        "fragment_size": 50,
        "number_of_fragments": 3,
        "require_field_match": true,
        "fields": {
            "files.content": {},
            "description": {},
            "identifier": {}
        }
    }
}

我遇到的问题是:

1. require_field_match

如果我使用"require_field_match": false我得到了,即使突出显示不适用于嵌套字段,搜索词仍然会在所有字段中突出显示。 这是我实际使用的解决方案,但表现非常糟糕。 对于50个文档,我的查询需要25秒。 100个文件约50secs。 10个文件5个。 如果我从突出显示中删除嵌套字段,一切都像光一样快!

2 .include_in_root

我想有一个扁平版本的嵌套字段 (所以将它们存储为普通的对象 / 字段 。为此,我应该指定

“files”:{“type”:“nested”,“ include_in_root ”:true,...

但是我不知道为什么在重新索引之后,我在文档根目录中看不到任何额外的扁平化字段(我期待像"files.content":["content1", "content2", "..."]这样的东西"files.content":["content1", "content2", "..."] )。

如果它可以工作,则可以访问(在展平的字段中)嵌套字段的内容,并对其执行突出显示。

你知道是否有可能在嵌套字段上实现一个好的(和高性能的)突出显示,或者至少建议我为什么我的查询这么慢? (我已经优化了片段)

你可以在这里做很多事情,有父/子关系。 我会过几点,希望这会引导你朝着正确的方向前进; 它仍然需要进行大量测试才能确定这种解决方案是否会对您更有效。 另外,为了清楚起见,我省略了一些设置细节。 请原谅长篇文章。

我设置了父/子映射,如下所示:

DELETE /test_index

PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0
   },
   "mappings": {
      "parent_doc": {
         "properties": {
            "identifier": {
               "type": "string"
            },
            "description": {
               "type": "string"
            }
         }
      },
      "child_doc": {
         "_parent": {
            "type": "parent_doc"
         },
         "properties": {
            "content": {
               "type": "string"
            }
         }
      }
   }
}

然后添加了一些测试文档:

POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"parent_doc","_id":1}}
{"identifier": "first", "description":"some special text"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":1}}
{"content":"text that is special"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":1}}
{"content":"text that is not"}
{"index":{"_index":"test_index","_type":"parent_doc","_id":2}}
{"identifier": "second", "description":"some different text"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":2}}
{"content":"different child text, but special"}
{"index":{"_index":"test_index","_type":"parent_doc","_id":3}}
{"identifier": "third", "description":"we don't want this parent"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":3}}
{"content":"or this child"}

如果我正确理解你的规格,我们会希望查询"special"以返回除最后两个之外的所有这些文件(如果我错了,请纠正我)。 我们需要与文本匹配的文档,具有与文本匹配的子项,或者具有与文本匹配的父项。

我们可以像这样找回与查询匹配的父母:

POST /test_index/parent_doc/_search
{
    "query": {
        "match": {
           "description": "special"
        }
    },
    "highlight": {
        "fields": {
            "description": {},
            "identifier": {}
        }
    }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1.1263815,
      "hits": [
         {
            "_index": "test_index",
            "_type": "parent_doc",
            "_id": "1",
            "_score": 1.1263815,
            "_source": {
               "identifier": "first",
               "description": "some special text"
            },
            "highlight": {
               "description": [
                  "some <em>special</em> text"
               ]
            }
         }
      ]
   }
}

我们可以像这样找回与查询匹配的子项:

POST /test_index/child_doc/_search
{
    "query": {
        "match": {
           "content": "special"
        }
    },
    "highlight": {
        "fields": {
            "content": {}
        }
    }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0.92364895,
      "hits": [
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "geUFenxITZSL7epvB568uA",
            "_score": 0.92364895,
            "_source": {
               "content": "text that is special"
            },
            "highlight": {
               "content": [
                  "text that is <em>special</em>"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "IMHXhM3VRsCLGkshx52uAQ",
            "_score": 0.80819285,
            "_source": {
               "content": "different child text, but special"
            },
            "highlight": {
               "content": [
                  "different child text, but <em>special</em>"
               ]
            }
         }
      ]
   }
}

我们可以找回匹配文本的父母和与文本匹配的子项,如下所示:

POST /test_index/parent_doc,child_doc/_search
{
    "query": {
        "multi_match": {
           "query": "special",
           "fields": ["description", "content"]
        }
    },
    "highlight": {
        "fields": {
            "description": {},
            "identifier": {},
            "content": {}
        }
    }
}
...
{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 3,
      "max_score": 1.1263815,
      "hits": [
         {
            "_index": "test_index",
            "_type": "parent_doc",
            "_id": "1",
            "_score": 1.1263815,
            "_source": {
               "identifier": "first",
               "description": "some special text"
            },
            "highlight": {
               "description": [
                  "some <em>special</em> text"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "geUFenxITZSL7epvB568uA",
            "_score": 0.75740534,
            "_source": {
               "content": "text that is special"
            },
            "highlight": {
               "content": [
                  "text that is <em>special</em>"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "IMHXhM3VRsCLGkshx52uAQ",
            "_score": 0.6627297,
            "_source": {
               "content": "different child text, but special"
            },
            "highlight": {
               "content": [
                  "different child text, but <em>special</em>"
               ]
            }
         }
      ]
   }
}

但是,要获取与此查询相关的所有文档,我们需要使用bool查询:

POST /test_index/parent_doc,child_doc/_search
{
   "query": {
      "bool": {
         "should": [
            {
               "multi_match": {
                  "query": "special",
                  "fields": [
                     "description",
                     "content"
                  ]
               }
            },
            {
               "has_child": {
                  "type": "child_doc",
                  "query": {
                     "match": {
                        "content": "special"
                     }
                  }
               }
            },
            {
               "has_parent": {
                  "type": "parent_doc",
                  "query": {
                     "match": {
                        "description": "special"
                     }
                  }
               }
            }
         ]
      }
   },
    "highlight": {
        "fields": {
            "description": {},
            "identifier": {},
            "content": {}
        }
    },
    "fields": ["_parent", "_source"]
}
...
{
   "took": 5,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 5,
      "max_score": 0.8866254,
      "hits": [
         {
            "_index": "test_index",
            "_type": "parent_doc",
            "_id": "1",
            "_score": 0.8866254,
            "_source": {
               "identifier": "first",
               "description": "some special text"
            },
            "highlight": {
               "description": [
                  "some <em>special</em> text"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "geUFenxITZSL7epvB568uA",
            "_score": 0.67829096,
            "_source": {
               "content": "text that is special"
            },
            "fields": {
               "_parent": "1"
            },
            "highlight": {
               "content": [
                  "text that is <em>special</em>"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "IMHXhM3VRsCLGkshx52uAQ",
            "_score": 0.18709806,
            "_source": {
               "content": "different child text, but special"
            },
            "fields": {
               "_parent": "2"
            },
            "highlight": {
               "content": [
                  "different child text, but <em>special</em>"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "NiwsP2VEQBKjqu1M4AdjCg",
            "_score": 0.12531912,
            "_source": {
               "content": "text that is not"
            },
            "fields": {
               "_parent": "1"
            }
         },
         {
            "_index": "test_index",
            "_type": "parent_doc",
            "_id": "2",
            "_score": 0.12531912,
            "_source": {
               "identifier": "second",
               "description": "some different text"
            }
         }
      ]
   }
}

(我包括"_parent"领域,使其更容易看到为什么文档被列入结果,如图所示这里 )。

如果这有帮助,请告诉我。

这是我使用的代码:

http://sense.qbox.io/gist/d69a4d6531dc063faa4b4e094cff2a472a73c5a6

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM