Spark on Cassandra: is there a way to remove data by partition key?
The Spark Cassandra Connector has the RDD.deleteFromCassandra(keyspaceName, tableName) method. The values in the RDD are interpreted as primary key constraints.
I have a table like this:
CREATE TABLE table (a int, b int, c int, PRIMARY KEY (a,b));
As you can see, a is the partition key and b is the clustering key.
I need a Spark app that deletes efficiently by partition key, not by full primary key.
Indeed, my goal is to always drop entire partitions by their partition keys, rather than creating a tombstone for each primary key.
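To illustrate the distinction at the CQL level, here is a sketch against the schema above (the values 1 and 2 are placeholders, not from the question):

```sql
-- Deleting by partition key alone writes a single partition-level tombstone
-- that covers every row in the partition:
DELETE FROM table WHERE a = 1;

-- Deleting by the full primary key writes one row-level tombstone
-- per (a, b) pair, which is what the question wants to avoid:
DELETE FROM table WHERE a = 1 AND b = 2;
```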
How can I do that with the Spark connector?
Thank you
Yes, it's possible if you pass the keyColumns parameter to the .deleteFromCassandra function ( docs ). For example, if you have a composite partition key consisting of two columns part1 & part2:
rdd.deleteFromCassandra("keyspace", "table",
keyColumns = SomeColumns("part1", "part2"))
This method works only with RDDs; if you use DataFrames, then you just need to call df.rdd first. Also, in some versions of the connector you may need to restrict the selection to just the partition columns - see the discussion in this answer.
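Putting the DataFrame case together for the question's schema, a minimal sketch could look like the following (the keyspace name "ks", the application name, and the session setup are assumptions, not from the question; this is untested and needs a reachable Cassandra cluster):

```scala
import com.datastax.spark.connector._
import org.apache.spark.sql.SparkSession

// Assumed Spark session; configure spark.cassandra.connection.host for your cluster.
val spark = SparkSession.builder().appName("drop-partitions").getOrCreate()

// Read the table as a DataFrame ("ks" is a placeholder keyspace name).
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "table"))
  .load()

// Keep only the partition column, convert to an RDD, and delete whole
// partitions: keyColumns = SomeColumns("a") makes the connector issue
// DELETE ... WHERE a = ? statements (partition-level tombstones only).
df.select("a").distinct().rdd
  .map(row => Tuple1(row.getInt(0)))
  .deleteFromCassandra("ks", "table", keyColumns = SomeColumns("a"))
```

The distinct() call is optional but avoids issuing the same partition delete once per row.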