Elasticsearch index sharding explanation

Question

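I am trying to figure out the concept of an Elasticsearch index and don't quite understand it. I want to make a few points up front. I understand how an inverted document index works (mapping terms to document IDs), and I also understand how document ranking based on TF-IDF works. What I don't understand is the data structure of the actual index. The Elasticsearch documentation describes an index as a "table with a mapping to the documents".

So, here comes sharding! When you look at the typical picture of an Elasticsearch index, it is represented as follows:

[image]

What the picture does not show is how the actual partitioning happens and how this [table -> document] link is split across multiple shards. For instance, does each shard split the table vertically, meaning that the inverted index table contains only the terms present on that shard? For example, assume we have 3 shards: the first shard contains document 1, the second only document 2, and the third document 3. Would the first shard's index then contain only the terms present in document 1, in this case [blue, bright, butterfly, breeze, hangs]? If so, when someone searches for [forget], how does Elasticsearch "know" not to search shard 1, or does it search all shards every time?

When you look at the cluster image:

[image]

it is not clear what exactly is in shard1, shard2, and shard3. We go from Term -> DocumentId -> Document to a "rectangular" shard, but what exactly does a shard contain?

I would appreciate it if someone could explain it using the pictures above.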

Answer 1

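Theory

Elasticsearch is built on top of Lucene. Each shard is simply a Lucene index, and a Lucene index, simplified, is an inverted index. Every Elasticsearch index is a collection of shards, that is, of Lucene indices, so each shard's inverted index covers only the documents stored on that shard. When you query for a document, Elasticsearch sub-queries all shards, merges the results, and returns them to you. When you index a document, Elasticsearch calculates which shard the document should be written to using the formula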

shard = hash(routing) % number_of_primary_shards

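By default, Elasticsearch uses the document id as the routing value. If you specify the routing parameter, it is used instead of the id. You can use the routing parameter in search queries and in requests that index, delete, or update documents. By default, Murmur3 is used as the hash function.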
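To make the formula concrete, here is a minimal Python sketch. It is an illustration only: pick_shard is a hypothetical helper, not an Elasticsearch API, and CRC32 from the standard library stands in for Murmur3, so the shard numbers it prints will not match what a real cluster assigns.

```python
import zlib


def pick_shard(doc_id, number_of_primary_shards, routing=None):
    """Sketch of shard = hash(routing) % number_of_primary_shards."""
    # The document id is the routing value unless one is given explicitly.
    routing_value = routing if routing is not None else doc_id
    # ASSUMPTION: CRC32 is a stand-in hash; Elasticsearch uses Murmur3.
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_primary_shards


# Each id deterministically lands on one of the 3 shards.
for doc_id in ["1", "2", "3", "4"]:
    print(doc_id, "->", pick_shard(doc_id, 3))

# Documents that share a routing value end up on the same shard,
# whatever their ids are ("user-42" is a made-up routing value).
print(pick_shard("1", 3, routing="user-42") == pick_shard("2", 3, routing="user-42"))
```

Because the result depends only on the routing value and the shard count, any node can work out where a document lives without a lookup table.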

Example

Create an index with 3 shards

$ curl -XPUT localhost:9200/so -H 'Content-Type: application/json' -d '
{
    "settings" : {
        "index" : {
            "number_of_shards" : 3,
            "number_of_replicas" : 0
        }
    }
}'

Index a document

$ curl -XPUT localhost:9200/so/question/1 -H 'Content-Type: application/json' -d '
{
    "number" : 47011047,
    "title" : "need elasticsearch index sharding explanation"
}'

Query without routing

$ curl "localhost:9200/so/question/_search?pretty"

Response

Look at _shards.total: this is the number of shards that were queried. Also note that the document was found.

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "so",
        "_type" : "question",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "number" : 47011047,
          "title" : "need elasticsearch index sharding explanation"
        }
      }
    ]
  }
}

Query with correct routing

$ curl "localhost:9200/so/question/_search?explain=true&routing=1&pretty"

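Response

_shards.total is now 1: because we specified routing, Elasticsearch knows which shard to ask for the document. With the parameter explain=true I ask Elasticsearch to give me additional information about the query. Note hits._shard: it is set to [so][2]. This means our document is stored in shard 2 of the so index (shard numbers start at 0).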

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_shard" : "[so][2]",
        "_node" : "2skA6yiPSVOInMX0ZsD91Q",
        "_index" : "so",
        "_type" : "question",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "number" : 47011047,
          "title" : "need elasticsearch index sharding explanation"
        },
        ...
}

Query with incorrect routing

$ curl "localhost:9200/so/question/_search?explain=true&routing=2&pretty"

Response

_shards.total is 1 again, but Elasticsearch returned nothing for our query: we specified the wrong routing, so Elasticsearch queried a shard that does not contain the document.

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}
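The three queries above can be mimicked with a toy model in Python. This is a sketch under stated assumptions, not how Elasticsearch is implemented: ToyIndex and shard_for are hypothetical names, CRC32 stands in for Murmur3, text is split on whitespace instead of going through a Lucene analyzer, and there is no scoring. What it does show is that each shard keeps its own inverted index holding only the terms of its own documents, and that a search without routing has to ask every shard.

```python
import zlib
from collections import defaultdict


def shard_for(routing, num_shards):
    # ASSUMPTION: CRC32 is a stand-in hash; Elasticsearch uses Murmur3.
    return zlib.crc32(routing.encode("utf-8")) % num_shards


class ToyIndex:
    """Toy index: one inverted index (term -> doc ids) per shard."""

    def __init__(self, num_shards=3):
        self.num_shards = num_shards
        self.shards = [defaultdict(set) for _ in range(num_shards)]

    def index(self, doc_id, text, routing=None):
        # The id is the routing value unless one is given explicitly.
        key = routing if routing is not None else doc_id
        shard = shard_for(key, self.num_shards)
        # Terms are recorded only on the shard that owns the document.
        for term in text.lower().split():
            self.shards[shard][term].add(doc_id)
        return shard

    def search(self, term, routing=None):
        if routing is None:
            targets = list(range(self.num_shards))  # fan out to all shards
        else:
            targets = [shard_for(routing, self.num_shards)]  # one shard only
        hits = set()
        for shard in targets:
            hits |= self.shards[shard].get(term, set())
        return {"shards_total": len(targets), "hits": sorted(hits)}


idx = ToyIndex(num_shards=3)
home = idx.index("1", "need elasticsearch index sharding explanation")

# A routing value that hashes to a different shard than the document's.
wrong = next(r for r in map(str, range(2, 100)) if shard_for(r, 3) != home)

print(idx.search("sharding"))                 # {'shards_total': 3, 'hits': ['1']}
print(idx.search("sharding", routing="1"))    # {'shards_total': 1, 'hits': ['1']}
print(idx.search("sharding", routing=wrong))  # {'shards_total': 1, 'hits': []}
```

The unrouted search finds the document because it merges results from all three shards. The routed searches touch a single shard, so they succeed only when the routing value hashes to the shard that holds the document.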

Answer 1 (7) 2017-10-30 08:47:16
