簡體   English   中英

如何突出顯示Elasticsearch中的嵌套字段

[英]How to highlight nested fields in Elasticsearch

雖然是Lucene邏輯結構,但我試圖在我們的內容中出現一些搜索結果時突出顯示我的嵌套字段

以下是Elasticsearch文檔的解釋(映射嵌套類型 `)

內部實施

在內部,嵌套對象被索引為附加文檔,但是,由於可以保證它們在同一“塊”中被索引,因此可以非常快速地與父文檔連接。

在對索引執行操作時會自動屏蔽這些內部嵌套文檔(例如使用match_all查詢進行搜索),並且在使用嵌套查詢時它們會冒泡。

由於嵌套文檔始終屏蔽到父文檔,因此永遠不能在嵌套查詢的范圍之外訪問嵌套文檔。 例如,可以在嵌套對象內的字段上啟用存儲字段,但無法檢索它們,因為存儲字段是在嵌套查詢范圍之外獲取的。

在我的情況下

我有一個Elasticsearch索引,其中包含如下映射

{
    "my_documents": {
        "dynamic_date_formats": [
            "dd.MM.yyyy",
            "yyyy-MM-dd",
            "yyyy-MM-dd HH:mm:ss"
        ],
        "index_analyzer": "Analyzer2_index",
        "search_analyzer": "Analyzer2_search_decompound",
        "_timestamp": {
            "enabled": true
        },
        "properties": {
            "identifier": {
                "type": "string"
            },
            "description": {
                "type": "multi_field",
                "fields": {
                    "sort": {
                        "type": "string",
                        "index": "not_analyzed"
                    },
                    "description": {
                        "type": "string"
                    }
                }
            },
            "files": {
                "type": "nested",
                "include_in_root": true,
                "properties": {
                    "content": {
                        "type": "string",
                        "include_in_root": true
                    }
                }
            },
            "and then some other": "normal string fields"
        }
    }
}

我正在嘗試執行這樣的查詢:

{
    "size": 100,
    "query": {
        "bool": {
            "should": [
                {
                    "nested": {
                        "path": "files",
                        "query": {
                            "bool": {
                                "should": {
                                    "match": {
                                        "content": {
                                            "query": "burpcontrol",
                                            "minimum_should_match": "85%"
                                        }
                                    }
                                }
                            }
                        }
                    }
                },
                {
                    "match": {
                        "description": {
                            "query": "burpcontrol",
                            "minimum_should_match": "85%"
                        }
                    }
                },
                {
                    "match": {
                        "identifier": {
                            "query": "burpcontrol",
                            "minimum_should_match": "85%"
                        }
                    }
                }            ]
        }
    },
    "highlight": {
        "pre_tags": [
            "<span style=\"background-color: yellow\">"
        ],
        "post_tags": [
            "</span>"
        ],
        "order": "score",
        "no_match_size": 100,
        "fragment_size": 50,
        "number_of_fragments": 3,
        "require_field_match": true,
        "fields": {
            "files.content": {},
            "description": {},
            "identifier": {}
        }
    }
}

我遇到的問題是:

1. require_field_match

如果我使用"require_field_match": false我得到了,即使突出顯示不適用於嵌套字段,搜索詞仍然會在所有字段中突出顯示。 這是我實際使用的解決方案,但表現非常糟糕。 對於50個文檔,我的查詢需要25秒。 100個文件約50secs。 10個文件5個。 如果我從突出顯示中刪除嵌套字段,一切都像光一樣快!

2 .include_in_root

我想有一個扁平版本的嵌套字段 (所以將它們存儲為普通的對象 / 字段 。為此,我應該指定

“files”:{“type”:“nested”,“ include_in_root ”:true,...

但是我不知道為什么在重新索引之后,我在文檔根目錄中看不到任何額外的扁平化字段(我期待像"files.content":["content1", "content2", "..."]這樣的東西"files.content":["content1", "content2", "..."] )。

如果它可以工作,則可以訪問(在展平的字段中)嵌套字段的內容,並對其執行突出顯示。

你知道是否有可能在嵌套字段上實現一個好的(和高性能的)突出顯示,或者至少建議我為什么我的查詢這么慢? (我已經優化了片段)

你可以在這里做很多事情,有父/子關系。 我會過幾點,希望這會引導你朝着正確的方向前進; 它仍然需要進行大量測試才能確定這種解決方案是否會對您更有效。 另外,為了清楚起見,我省略了一些設置細節。 請原諒長篇文章。

我設置了父/子映射,如下所示:

DELETE /test_index

PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0
   },
   "mappings": {
      "parent_doc": {
         "properties": {
            "identifier": {
               "type": "string"
            },
            "description": {
               "type": "string"
            }
         }
      },
      "child_doc": {
         "_parent": {
            "type": "parent_doc"
         },
         "properties": {
            "content": {
               "type": "string"
            }
         }
      }
   }
}

然后添加了一些測試文檔:

POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"parent_doc","_id":1}}
{"identifier": "first", "description":"some special text"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":1}}
{"content":"text that is special"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":1}}
{"content":"text that is not"}
{"index":{"_index":"test_index","_type":"parent_doc","_id":2}}
{"identifier": "second", "description":"some different text"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":2}}
{"content":"different child text, but special"}
{"index":{"_index":"test_index","_type":"parent_doc","_id":3}}
{"identifier": "third", "description":"we don't want this parent"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":3}}
{"content":"or this child"}

如果我正確理解你的規格,我們會希望查詢"special"以返回除最后兩個之外的所有這些文件(如果我錯了,請糾正我)。 我們需要與文本匹配的文檔,具有與文本匹配的子項,或者具有與文本匹配的父項。

我們可以像這樣找回與查詢匹配的父母:

POST /test_index/parent_doc/_search
{
    "query": {
        "match": {
           "description": "special"
        }
    },
    "highlight": {
        "fields": {
            "description": {},
            "identifier": {}
        }
    }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1.1263815,
      "hits": [
         {
            "_index": "test_index",
            "_type": "parent_doc",
            "_id": "1",
            "_score": 1.1263815,
            "_source": {
               "identifier": "first",
               "description": "some special text"
            },
            "highlight": {
               "description": [
                  "some <em>special</em> text"
               ]
            }
         }
      ]
   }
}

我們可以像這樣找回與查詢匹配的子項:

POST /test_index/child_doc/_search
{
    "query": {
        "match": {
           "content": "special"
        }
    },
    "highlight": {
        "fields": {
            "content": {}
        }
    }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0.92364895,
      "hits": [
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "geUFenxITZSL7epvB568uA",
            "_score": 0.92364895,
            "_source": {
               "content": "text that is special"
            },
            "highlight": {
               "content": [
                  "text that is <em>special</em>"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "IMHXhM3VRsCLGkshx52uAQ",
            "_score": 0.80819285,
            "_source": {
               "content": "different child text, but special"
            },
            "highlight": {
               "content": [
                  "different child text, but <em>special</em>"
               ]
            }
         }
      ]
   }
}

我們可以找回匹配文本的父母和與文本匹配的子項,如下所示:

POST /test_index/parent_doc,child_doc/_search
{
    "query": {
        "multi_match": {
           "query": "special",
           "fields": ["description", "content"]
        }
    },
    "highlight": {
        "fields": {
            "description": {},
            "identifier": {},
            "content": {}
        }
    }
}
...
{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 3,
      "max_score": 1.1263815,
      "hits": [
         {
            "_index": "test_index",
            "_type": "parent_doc",
            "_id": "1",
            "_score": 1.1263815,
            "_source": {
               "identifier": "first",
               "description": "some special text"
            },
            "highlight": {
               "description": [
                  "some <em>special</em> text"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "geUFenxITZSL7epvB568uA",
            "_score": 0.75740534,
            "_source": {
               "content": "text that is special"
            },
            "highlight": {
               "content": [
                  "text that is <em>special</em>"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "IMHXhM3VRsCLGkshx52uAQ",
            "_score": 0.6627297,
            "_source": {
               "content": "different child text, but special"
            },
            "highlight": {
               "content": [
                  "different child text, but <em>special</em>"
               ]
            }
         }
      ]
   }
}

但是,要獲取與此查詢相關的所有文檔,我們需要使用bool查詢:

POST /test_index/parent_doc,child_doc/_search
{
   "query": {
      "bool": {
         "should": [
            {
               "multi_match": {
                  "query": "special",
                  "fields": [
                     "description",
                     "content"
                  ]
               }
            },
            {
               "has_child": {
                  "type": "child_doc",
                  "query": {
                     "match": {
                        "content": "special"
                     }
                  }
               }
            },
            {
               "has_parent": {
                  "type": "parent_doc",
                  "query": {
                     "match": {
                        "description": "special"
                     }
                  }
               }
            }
         ]
      }
   },
    "highlight": {
        "fields": {
            "description": {},
            "identifier": {},
            "content": {}
        }
    },
    "fields": ["_parent", "_source"]
}
...
{
   "took": 5,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 5,
      "max_score": 0.8866254,
      "hits": [
         {
            "_index": "test_index",
            "_type": "parent_doc",
            "_id": "1",
            "_score": 0.8866254,
            "_source": {
               "identifier": "first",
               "description": "some special text"
            },
            "highlight": {
               "description": [
                  "some <em>special</em> text"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "geUFenxITZSL7epvB568uA",
            "_score": 0.67829096,
            "_source": {
               "content": "text that is special"
            },
            "fields": {
               "_parent": "1"
            },
            "highlight": {
               "content": [
                  "text that is <em>special</em>"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "IMHXhM3VRsCLGkshx52uAQ",
            "_score": 0.18709806,
            "_source": {
               "content": "different child text, but special"
            },
            "fields": {
               "_parent": "2"
            },
            "highlight": {
               "content": [
                  "different child text, but <em>special</em>"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "NiwsP2VEQBKjqu1M4AdjCg",
            "_score": 0.12531912,
            "_source": {
               "content": "text that is not"
            },
            "fields": {
               "_parent": "1"
            }
         },
         {
            "_index": "test_index",
            "_type": "parent_doc",
            "_id": "2",
            "_score": 0.12531912,
            "_source": {
               "identifier": "second",
               "description": "some different text"
            }
         }
      ]
   }
}

(我包括"_parent"領域,使其更容易看到為什么文檔被列入結果,如圖所示這里 )。

如果這有幫助,請告訴我。

這是我使用的代碼:

http://sense.qbox.io/gist/d69a4d6531dc063faa4b4e094cff2a472a73c5a6

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM