
How to index a document to a specific ElasticSearch shard?

I want to index a document to a specific ElasticSearch shard.

I know I can configure ES to look at a field and send the document to a specific shard based on that field.

I don't want to do that. I simply want to say: OK, I have decided to import all documents to Shard 1 this week because I feel like it.

I know there's a way to send a query to a specific shard, but what about an import?

How can I do this?

If you want complete control over shards, you should use multiple indices with a single shard each instead of a single index with multiple shards. This way you can decide which index (and therefore which shard, since there is only one shard per index) your data goes to. You can also create an alias that combines all such indices, so you don't have to list every index when searching.
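As a rough illustration of the alias idea, the sketch below builds the JSON body for a `POST /_aliases` request that adds several single-shard indices to one alias. The index names (`docs-1`, `docs-2`, `docs-3`), the alias name (`docs`), and the helper `aliasActions` are made up for this example; sending the request is left out.

```java
import java.util.List;
import java.util.stream.Collectors;

public class SingleShardIndices {
    // Builds the body for POST /_aliases: one "add" action per index.
    static String aliasActions(List<String> indices, String alias) {
        return indices.stream()
            .map(idx -> "{\"add\":{\"index\":\"" + idx + "\",\"alias\":\"" + alias + "\"}}")
            .collect(Collectors.joining(",", "{\"actions\":[", "]}"));
    }

    public static void main(String[] args) {
        // Hypothetical names: one single-shard index per "shard" you want to control.
        List<String> indices = List.of("docs-1", "docs-2", "docs-3");
        System.out.println(aliasActions(indices, "docs"));
    }
}
```

You would then index into whichever concrete index you choose (for example `docs-1` this week) and search through the alias.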

From a performance perspective, there is very little difference between searching a single index with 10 shards and searching 10 indices with a single shard each. In both cases you will be searching 10 shards. One thing you should worry about in this scenario is keeping the mappings compatible. You probably don't want a field indexed as a string in one index and as an integer in another.
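One way to catch incompatible mappings early is to compare the field types of the indices before indexing. The sketch below assumes you have already extracted a field-to-type map for each index (how you fetch the mappings is not shown); the helper `conflicts` and the sample fields are illustrative only.

```java
import java.util.Map;
import java.util.TreeMap;

public class MappingCheck {
    // Returns every field present in both maps whose type differs.
    static Map<String, String> conflicts(Map<String, String> a, Map<String, String> b) {
        Map<String, String> out = new TreeMap<>();
        for (Map.Entry<String, String> e : a.entrySet()) {
            String other = b.get(e.getKey());
            if (other != null && !other.equals(e.getValue())) {
                out.put(e.getKey(), e.getValue() + " vs " + other);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical field types for two single-shard indices.
        Map<String, String> index1 = Map.of("title", "text", "domain_id", "integer");
        Map<String, String> index2 = Map.of("title", "text", "domain_id", "keyword");
        System.out.println(conflicts(index1, index2)); // {domain_id=integer vs keyword}
    }
}
```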

I am sure you have already solved your problem or found another solution, but I had a similar issue in a project and want to post what we did to index a document to a specific shard.

You can achieve this with Elasticsearch's _routing field, by computing the shard number with the formula Elasticsearch gives:

shard_num = hash(_routing) % num_primary_shards

Let's say you want a document to land on shard number 2. You then need a routing name whose hash, taken modulo the number of shards, equals 2. To explain this in code, here is a Java example that computes the shard number for a few candidate routing names:

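public class RoutingDemo {
    public static void main(String[] args) {
        final int numberOfShards = 30;
        for (int i = 0; i < 5; i++) {
            String routing = "tenant" + i;
            // floorMod keeps the result in [0, numberOfShards) even when hashCode() is negative
            final int shard = Math.floorMod(routing.hashCode(), numberOfShards);
            System.out.println("Routing: " + routing + " - shard number: " + shard);
        }
    }
}

Output:

Routing: tenant0 - shard number: 28
Routing: tenant1 - shard number: 29
Routing: tenant2 - shard number: 0
Routing: tenant3 - shard number: 1
Routing: tenant4 - shard number: 2

You have to generate a string whose hash value, modulo the number of shards, gives the desired shard number. From the output above, the routing name tenant4 leads to shard number 2. Note that String.hashCode() only illustrates the search: Elasticsearch hashes the routing value with Murmur3, so the shard a routing name really maps to should be confirmed against the cluster, for example with GET /course/_search_shards?routing=tenant4.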
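The search for a routing name can be automated. The following is a minimal sketch, not Elasticsearch's own routing code: `findRouting` is a hypothetical helper that tries `prefix + i` until the hash lands on the target shard, and `String::hashCode` is only a stand-in hash (an assumption) that you would replace with the hash your cluster actually uses. `Math.floorMod` is used so that a negative hash never yields a negative shard number.

```java
import java.util.function.ToIntFunction;

public class RoutingFinder {
    // Returns the first candidate prefix+i that maps to targetShard,
    // or null if none is found within maxTries attempts.
    static String findRouting(String prefix, int targetShard, int numShards,
                              ToIntFunction<String> hash, int maxTries) {
        for (int i = 0; i < maxTries; i++) {
            String candidate = prefix + i;
            if (Math.floorMod(hash.applyAsInt(candidate), numShards) == targetShard) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // String::hashCode is a stand-in hash for illustration only.
        System.out.println(findRouting("tenant", 2, 30, String::hashCode, 1000));
    }
}
```

Passing the hash as a parameter keeps the search loop unchanged when you swap in a different hash function.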

As a full example, here is how indexing with a routing name looks.

Let's say we create the "course" index and make routing required:

PUT http://localhost:9200/course
{
    "settings": {
        "number_of_shards": 30
    },
    "mappings": {
        "_routing": {
           "required": true 
        }
    }
}

Then you index a document like this:

PUT http://localhost:9200/course/_doc/1?routing=tenant0&refresh=true
{
    "id": 1,
    "title": "Data Security course in Lidl",
    "description": "The course teaches our core Data Security measurements here in Lidl. As new regulations are out, ....",
    "text": "Text of the course goes here",
    "created_date": 152625632,
    "last_date": 152625632,
    "expiration_date": null,
    "domain_id": 10,
    "language_id": 2
}

In our case, we have multi-tenant software in which about 100 tenants (organizations) share the same Elasticsearch index, and we had to guarantee that one tenant can never see data from other tenants. Our solution was to create one index with 100 shards for all tenants and dedicate one shard to each tenant by finding a suitable routing name for each. As the index mapping example above shows, routing is set to "required", so every CRUD operation you send to Elasticsearch must specify a routing value; otherwise Elasticsearch throws a routing_missing_exception.
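To make sure every request carries the tenant's routing value, the lookup can be centralized. The sketch below is one possible shape under stated assumptions: the tenant ids, the routing names in `ROUTING_BY_TENANT`, and the helper `docPath` are all hypothetical, and the code only builds the request path rather than calling Elasticsearch.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class TenantRouting {
    // Hypothetical table: tenant id -> routing name found beforehand for that tenant's shard.
    static final Map<Integer, String> ROUTING_BY_TENANT = Map.of(10, "tenant0", 11, "tenant4");

    // Builds the document path including the mandatory routing parameter.
    static String docPath(String index, String docId, int tenantId) {
        String routing = ROUTING_BY_TENANT.get(tenantId);
        if (routing == null) {
            // Fail early instead of letting Elasticsearch reject the request.
            throw new IllegalArgumentException("No routing configured for tenant " + tenantId);
        }
        return "/" + index + "/_doc/" + docId
            + "?routing=" + URLEncoder.encode(routing, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(docPath("course", "1", 10)); // /course/_doc/1?routing=tenant0
    }
}
```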
