How to efficiently partition an index-only table in Cassandra?

Question

I need to create an append-only table that stores only a pair of values (foreign_id, some_string). There will be a limited number of foreign_id values (say 100 to 10,000) and tens of millions of some_string values (they may not be evenly distributed between foreign_ids).

I am only interested in whether a given (foreign_id, some_string) pair exists in the table.

What would be the most efficient way of partitioning this table in terms of query response time?

I am pretty sure that creating the primary key PRIMARY KEY ((foreign_id), some_string) is a bad idea, because a single partition could easily grow beyond 100 MB, which is not recommended AFAIK.

Should I simply partition the table by both foreign_id and some_string, like this: PRIMARY KEY ((foreign_id, some_string))? Or is there some issue with this approach?

Answer 1

The primary philosophy of data modelling in Cassandra is this: for each application query, design a table that is optimised for that query. It is the complete opposite of data modelling in traditional relational databases.

Don't get hung up on how you will store the data in the table. Focus instead on what query your application requires, because the app query is the crucial aspect that determines how the table will be optimised for reads.

Looking at this statement from you:

I am only interested in whether a given (foreign_id, some_string) pair exists in the table.

My understanding is that your app query is something along the lines of:

"Does the pair (ID X, string Y) exist?"

which means that you should partition the table by both ID and string:

CREATE TABLE tbl_by_id_string (
    foreign_id text,
    some_string text,
    exists boolean,
    PRIMARY KEY ((foreign_id, some_string))
);
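
As a minimal sketch of the write side for this table, a row can be added as shown here. The literal values 'id-42' and 'example-string' are hypothetical and only for illustration. Writes in Cassandra are upserts, so inserting the same pair twice leaves a single row, which fits an append-only workload.

```cql
-- Hypothetical sample values; substitute real ones or bind markers.
INSERT INTO tbl_by_id_string (foreign_id, some_string, exists)
VALUES ('id-42', 'example-string', true);
```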

The equivalent CQL query to your app query is:

SELECT exists FROM tbl_by_id_string WHERE foreign_id = ? AND some_string = ?
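
To make the lookup concrete, here is the same query with the bind markers replaced by hypothetical literal values (the values are assumptions for illustration only). It returns one row if the pair has been written and no rows otherwise, so the application only needs to check whether the result set is empty.

```cql
-- Hypothetical sample values in place of the "?" bind markers.
SELECT exists
FROM tbl_by_id_string
WHERE foreign_id = 'id-42' AND some_string = 'example-string';
```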

This design is optimised for your app query and completely eliminates your concern about large partitions, because each partition in the table will only ever have ONE row and will never get any bigger than that.
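
For contrast, the layout the question was worried about might look like the sketch below. The table name tbl_by_id is hypothetical. Here foreign_id alone is the partition key and some_string is a clustering column, so every string for a given ID lands in the same partition, and that partition keeps growing as strings are appended.

```cql
-- Hypothetical alternative, shown only for comparison.
-- All some_string values for one foreign_id share a single partition.
CREATE TABLE tbl_by_id (
    foreign_id text,
    some_string text,
    PRIMARY KEY ((foreign_id), some_string)
);
```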

Also, you can have billions and billions of combinations of ID + string, and with this design they will be distributed evenly across the nodes in the cluster. Cheers!

Answer 1, 2022-09-09 01:53:56
