
Spark on Cassandra: is there a way to remove data by partition key?

The Spark Cassandra connector has the RDD.deleteFromCassandra(keyspaceName, tableName) method.

The values in the RDD are interpreted as primary key constraints.
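
For instance, a minimal sketch of such a full-primary-key delete, assuming a table whose primary key is the two columns (a, b); my_keyspace and my_table are placeholder names:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder names and a locally running Cassandra are assumed.
val conf = new SparkConf()
  .setAppName("cassandra-delete-demo")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Each tuple is matched against the full primary key (a, b),
// so this deletes exactly three rows - one tombstone per row.
val keys = sc.parallelize(Seq((1, 10), (1, 11), (2, 20)))
keys.deleteFromCassandra("my_keyspace", "my_table")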

I have a table like this:

CREATE TABLE my_table (a int, b int, c int, PRIMARY KEY (a, b));

As you can see, a is the partition key and b is the clustering key.

I need a Spark app that deletes efficiently by partition key, not by the full primary key.

Indeed, my goal is to always drop entire partitions by their partition keys, rather than creating a tombstone for each primary key.

How can I do that with the Spark connector?

Thank you

Yes, it's possible if you specify the keyColumns parameter to the .deleteFromCassandra function (docs). For example, if you have a composite partition key consisting of two columns, part1 and part2:

rdd.deleteFromCassandra("keyspace", "table", 
  keyColumns = SomeColumns("part1", "part2"))

This method works only with RDDs; if you use DataFrames, you just need to call df.rdd first. Also, in some versions of the connector, you may need to restrict the selection to just the partition columns - see the discussion in this answer.
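
Applied to the question's table with its single partition-key column a, a minimal sketch (my_keyspace and my_table are placeholders, and Tuple1 is used because there is only one key column):

import com.datastax.spark.connector._

// Partition-key values whose entire partitions should be dropped.
val partitions = sc.parallelize(Seq(Tuple1(1), Tuple1(2)))

// Restricting keyColumns to the partition key makes the connector
// issue one partition-level delete per value instead of a
// per-row tombstone for each primary key.
partitions.deleteFromCassandra("my_keyspace", "my_table",
  keyColumns = SomeColumns("a"))

// With a DataFrame, convert to an RDD of key tuples first:
// df.select("a").distinct().rdd
//   .map(row => Tuple1(row.getInt(0)))
//   .deleteFromCassandra("my_keyspace", "my_table",
//     keyColumns = SomeColumns("a"))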
