
Citus data - How to query data from a single shard in a query?

We are evaluating Citus for the large-scale data use cases in our organization. As part of that analysis, I am trying to find out whether the following can be achieved with Citus:

  • We want to create a distributed table (customers) with customer_id as the shard/distribution key (customer_id is a UUID generated on the application side).
  • While we can use regular SQL queries for all CRUD operations on these entities, we also need to query the table periodically (a periodic task) to select multiple entries based on some filter criteria, fetch the result set into the application, update a few columns, and write the rows back (a read-and-update operation).
  • Our application is a horizontally scalable microservice with multiple instances of the service running in parallel.
  • So we want to split the periodic task into multiple sub-tasks that run on multiple instances of the service and execute in parallel.

So I am looking for a way to query results from a specific shard within a sub-task, so that each sub-task is responsible for fetching and updating the data on one shard only. This would let us run the periodic task in parallel without worrying about conflicts, because each sub-task operates on its own shard.

I could not find anything in the documentation on how to achieve this. Is this possible with Citus?

Citus (by default) distributes data across the shards using the hash value of the distribution column, which is customer_id in your case.
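To picture why the application cannot simply guess a shard from a customer_id, here is a small illustrative model of hash distribution: the signed 32-bit hash space is cut into contiguous ranges, one per shard. This is only a sketch. The equal-width split and the names `shard_ranges` and `shard_index_for_hash` are assumptions for illustration, and the hash of a UUID is computed inside PostgreSQL, so it is not reproduced here.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def shard_ranges(shard_count):
    """Split the signed 32-bit hash space into contiguous ranges, one per shard."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    step = 2**32 // shard_count
    ranges = []
    for i in range(shard_count):
        low = INT32_MIN + i * step
        # The last shard absorbs any remainder so the whole space is covered.
        high = INT32_MAX if i == shard_count - 1 else low + step - 1
        ranges.append((low, high))
    return ranges


def shard_index_for_hash(hash_value, shard_count):
    """Return the position of the range that contains hash_value."""
    step = 2**32 // shard_count
    return min((hash_value - INT32_MIN) // step, shard_count - 1)
```

The real ranges for a table are stored in the pg_dist_shard metadata table (columns shardminvalue and shardmaxvalue), so they can be inspected there rather than derived.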

To achieve this, you might need to store a (customer_id, shard_id) mapping in your application, assign sub-tasks to shards, and have each sub-task send its queries based on this mapping.
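A minimal sketch of such a mapping follows. It relies on the Citus function get_shard_id_for_distribution_column, which returns the shard ID for a table and a distribution value. Everything else is hypothetical: the helper names, the DB-API style cursor, the `%s` placeholder style, and the round-robin assignment are assumptions chosen for illustration.

```python
from collections import defaultdict

# Run on the coordinator; returns the ID of the shard that holds the value.
SHARD_LOOKUP_SQL = (
    "SELECT get_shard_id_for_distribution_column(%s::regclass, %s::uuid)"
)


def lookup_shard_id(cursor, customer_id, table="customers"):
    """Ask Citus which shard holds customer_id (cursor: any DB-API cursor)."""
    cursor.execute(SHARD_LOOKUP_SQL, (table, customer_id))
    return cursor.fetchone()[0]


def group_by_shard(customer_ids, lookup):
    """Build {shard_id: [customer_id, ...]} using a lookup callable."""
    groups = defaultdict(list)
    for customer_id in customer_ids:
        groups[lookup(customer_id)].append(customer_id)
    return dict(groups)


def assign_to_subtasks(groups, subtask_count):
    """Hand out whole shards round-robin so no shard is shared by two sub-tasks."""
    assignment = {i: [] for i in range(subtask_count)}
    for n, shard_id in enumerate(sorted(groups)):
        assignment[n % subtask_count].append(shard_id)
    return assignment
```

Each sub-task would then restrict its reads and updates to the customer_ids grouped under the shards it was assigned. The mapping has to be refreshed when new customers are added.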

One hacky solution you might consider: add a dummy column (I will call it shard_id) and make it the distribution column, so that your application knows which rows should be fetched/updated by which sub-task. In other words, each sub-task fetches/updates the rows with a particular value of the shard_id column, and all of those rows are located on the same shard because they share the same distribution column value. This way you control which customer_ids end up on the same shard and which ones are grouped separately, by assigning them the shard_id you want.
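The dummy-column idea could be sketched as follows. The bucket count, the modulo rule for deriving shard_id from the UUID, and the helper names are all assumptions for illustration; the only Citus call involved is create_distributed_table, shown as a string for the one-time setup.

```python
import uuid

BUCKET_COUNT = 8  # assumed number of distinct shard_id values (one per sub-task)

# Assumed one-time setup, run on the coordinator after adding the column.
DISTRIBUTE_SQL = "SELECT create_distributed_table('customers', 'shard_id')"


def bucket_for(customer_id, bucket_count=BUCKET_COUNT):
    """Derive a stable shard_id from the UUID so every service instance agrees."""
    return uuid.UUID(customer_id).int % bucket_count


def subtask_query(shard_id):
    """Return (sql, params) selecting only the rows owned by one sub-task."""
    sql = "SELECT * FROM customers WHERE shard_id = %s"
    return sql, (shard_id,)
```

Two caveats apply. Citus still hashes the shard_id values, so two different shard_id values can land on the same physical shard; what is guaranteed is that rows sharing one shard_id stay together. Also, a lookup by customer_id alone no longer targets a single shard unless the query supplies shard_id as well, which the application can recompute with `bucket_for`.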

I would also suggest taking a look at "tenant isolation", which is mentioned in the Citus 11.1 blog post: https://www.citusdata.com/blog/2022/09/19/citus-11-1-shards-postgres-tables-without-interruption/#isolate-tenant It basically isolates a tenant (all data with the same customer_id in your case) into a single shard. It may be useful to you at some point.
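For reference, tenant isolation is exposed through the Citus function isolate_tenant_to_new_shard, which takes the table and the tenant value and returns the ID of the new shard. A hedged sketch of calling it from the application, where the wrapper name, the DB-API style cursor, and the `%s` placeholder style are assumptions:

```python
# Moves all rows of one tenant into a dedicated shard and returns its shard ID.
ISOLATE_SQL = "SELECT isolate_tenant_to_new_shard(%s::regclass, %s::uuid)"


def isolate_customer(cursor, customer_id, table="customers"):
    """Isolate one customer into its own shard (cursor: any DB-API cursor)."""
    cursor.execute(ISOLATE_SQL, (table, customer_id))
    return cursor.fetchone()[0]
```

This isolates one customer at a time, so it suits a handful of large tenants rather than a general way of partitioning the periodic task.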
