
Elasticsearch - Single Index vs Multiple Indexes

I have more than 4,000 different fields in one of my indexes, and that number can grow over time. Elasticsearch imposes a default limit of 1,000 fields per index, and there must be a reason for that.

Now, I am thinking that I should not raise the limit set by Elasticsearch, and should instead break my single large index into multiple small indexes.
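For context, the default cap mentioned above is the `index.mapping.total_fields.limit` index setting. Raising it is not the approach taken here, but for completeness it is a dynamic per-index settings update (the index name `my-index` is just a placeholder):

```
PUT /my-index/_settings
{
  "index.mapping.total_fields.limit": 5000
}
```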

Before moving to multiple indexes, I have a few questions:

  1. The number of small indexes could grow to as many as 50. Would searching across all 50 indexes at once be slower than searching the single large index?

  2. Is a large number of fields really a reason to break my single large index into multiple indexes?

  3. With multiple small indexes, the total number of shards would increase drastically (to more than 250). Each index would have 5 shards (the default, which I don't want to change), so a search across these indexes would hit all 250 shards at once. Will this hurt my search performance? Note: the shard count may grow over time as well. Conversely, if I use a single large index with only 5 shards and a large number of documents, won't that overload those 5 shards?
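For reference, searching many indexes is still a single request: Elasticsearch accepts a comma-separated list of index names or a wildcard pattern in the URL. A minimal sketch, assuming the indexes share a common naming prefix such as `myindex-` (hypothetical):

```
GET /myindex-01,myindex-02/_search
{
  "query": { "match": { "title": "elasticsearch" } }
}

GET /myindex-*/_search
{
  "query": { "match": { "title": "elasticsearch" } }
}
```

Either form fans the query out to all matching shards and merges the results, which is exactly the cost question 1 is asking about.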

  1. It strongly depends on your infrastructure. If you run a single node with 50 shards, a query will take longer than it would with only 1 shard. If you have 50 nodes holding one shard each, it will most likely run faster than one node with 1 shard (given a big dataset). In the end, you have to test with real data to be sure.

  2. With a massive number of fields, Elasticsearch runs into performance problems and errors become more likely. The main problem is that every field has to be stored in the cluster state, which takes a toll on your master node(s). In many such cases you also end up with lots of sparse data (90% of fields empty).

  3. As a rule of thumb, one shard should contain between 30 GB and 50 GB of data. I would not worry about overloading shards in your use case; the opposite is more likely.

I suggest testing your use case with fewer shards; go down to 1 shard and 1 replica for your index. Compared to your small dataset, the overhead of searching multiple shards (5 primaries, multiplied by replicas) and then merging the results is massive.
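A minimal sketch of that suggested setup, assuming the index does not exist yet (the name `my-index` is a placeholder; shard count cannot be changed after creation without reindexing, replicas can):

```
PUT /my-index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}
```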

Keep in mind that document_type behaviour has changed and will change further. Since 6.x you can only have one document_type per index, and starting with 7.x document_type is removed entirely. Because the API listens on _doc, _doc is the suggested document_type to use in 6.x. Either move to one index per _type, or introduce a new field that stores your type if you need the data in one index.
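A sketch of the second option, replacing the mapping type with an ordinary field (the field name `doc_type` is an assumption for illustration, not an Elasticsearch built-in):

```
PUT /my-index/_doc/1
{
  "doc_type": "user",
  "name": "Alice"
}

GET /my-index/_search
{
  "query": { "term": { "doc_type": "user" } }
}
```

Filtering on such a field gives you the same partitioning the old _type field provided, while keeping everything in one index.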

