
Citus data - How to query data from a single shard in a query?

Original post: 2022-09-21 15:58:54 | Tags: postgresql, citus

We are evaluating Citus for the large-scale data use cases in our organization. As part of the evaluation, I am trying to find out whether the following can be achieved with Citus:

We want to create a distributed table (customers) with customer_id as the shard/distribution key (customer_id is a UUID generated by the application).
While we can use regular SQL queries for all CRUD operations on these entities, we also need to query the table periodically (a periodic task): select multiple entries matching some filter criteria, fetch the result set into the application, update a few columns, and write the rows back (a read-and-update operation).
Our application is a horizontally scalable microservice with multiple instances of the service running in parallel.
So we want to split the periodic task into multiple sub-tasks that run in parallel on multiple instances of the service.

So I am looking for a way to query results from a specific shard within a sub-task, so that each sub-task is responsible for fetching and updating the data on one shard only. This would let us run the periodic task in parallel without worrying about conflicts, as each sub-task operates on a single shard.

I have not been able to find anything in the documentation on how to achieve this. Is this possible with Citus?

1 Answer

Citus (by default) distributes data across the shards using the hash value of the distribution column, which is customer_id in your case.

To achieve this, you might need to store a (customer_id, shard_id) mapping in your application, assign sub-tasks to shards, and send queries from each sub-task using this mapping.
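The assignment step described above could be sketched roughly as follows. This assumes the application has already loaded the (customer_id, shard_id) pairs, for example with the Citus function get_shard_id_for_distribution_column. The helper names, the round-robin policy, the sample shard IDs, and the `%s` placeholder style are illustrative assumptions, not something Citus prescribes.

```python
from collections import defaultdict

# Query the application could run to learn which shard holds a customer.
# get_shard_id_for_distribution_column is a Citus function; the table name
# 'customers' comes from the question. The %s placeholder style is an
# assumption about the database driver in use.
SHARD_LOOKUP_SQL = (
    "SELECT get_shard_id_for_distribution_column('customers', %s::uuid)"
)


def group_by_shard(pairs):
    """Group customer IDs by shard ID from (customer_id, shard_id) pairs."""
    by_shard = defaultdict(list)
    for customer_id, shard_id in pairs:
        by_shard[shard_id].append(customer_id)
    return dict(by_shard)


def assign_shards(shard_ids, num_subtasks):
    """Round-robin shards over sub-tasks so each shard has exactly one owner."""
    owners = {i: [] for i in range(num_subtasks)}
    for position, shard_id in enumerate(sorted(shard_ids)):
        owners[position % num_subtasks].append(shard_id)
    return owners


# Hypothetical sample data: shortened customer IDs and made-up shard IDs.
pairs = [("c1", 102008), ("c2", 102009), ("c3", 102008), ("c4", 102010)]
by_shard = group_by_shard(pairs)
owners = assign_shards(by_shard.keys(), num_subtasks=2)
```

Each sub-task would then restrict its queries to the customer_ids of the shards it owns, so no two sub-tasks touch the same row.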

One hacky solution you might consider: add a dummy column (I will name it shard_id) and make it the distribution column, so that your application knows which rows should be fetched/updated by which sub-task. In other words, each sub-task fetches/updates the rows with a particular value of the shard_id column, and all of those rows are located on the same shard because they have the same distribution column value. This way you control which customer_ids end up on the same shard and which ones are kept apart, by assigning them the shard_id you want.
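A minimal sketch of the dummy-column idea, assuming the application derives shard_id from the customer UUID with a simple modulo. The bucket count, the function name, the table definition, and the `%s` placeholder style are assumptions made for illustration; only create_distributed_table is an actual Citus function.

```python
import uuid

NUM_BUCKETS = 8  # assumed number of shard_id values, e.g. one per sub-task


def bucket_for(customer_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Derive a stable shard_id value from a customer UUID string."""
    return uuid.UUID(customer_id).int % num_buckets


# Hypothetical schema: shard_id is the distribution column, and the primary
# key includes it, as Citus requires for unique constraints.
CREATE_SQL = """
CREATE TABLE customers (
    customer_id uuid NOT NULL,
    shard_id    int  NOT NULL,
    PRIMARY KEY (shard_id, customer_id)
);
SELECT create_distributed_table('customers', 'shard_id');
"""

# Each sub-task filters on its own shard_id value, so Citus can route the
# query to the single shard that holds that value.
SUBTASK_SQL = "SELECT * FROM customers WHERE shard_id = %s"

example_bucket = bucket_for("00000000-0000-0000-0000-00000000000a")
```

Note that two different shard_id values may still hash to the same physical shard. The sub-tasks remain conflict-free regardless, because each one only touches rows carrying its own shard_id value.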

I would also suggest taking a look at "tenant isolation", which is mentioned in the latest blog post: https://www.citusdata.com/blog/2022/09/19/citus-11-1-shards-postgres-tables-without-interruption/#isolate-tenant It basically isolates a tenant (all data with the same customer_id in your case) into a single shard. It may work for you at some point.