
Spark on Cassandra: is there a way to remove data by partition key?

The Spark Cassandra connector has the RDD.deleteFromCassandra(keyspaceName, tableName) method.

The values in the RDD are interpreted as primary key constraints.
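
For instance, a minimal sketch of such a full-primary-key delete, assuming a table whose primary key is the two columns (a, b); my_keyspace and my_table are placeholder names:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder names and a locally running Cassandra are assumed.
val conf = new SparkConf()
  .setAppName("cassandra-delete-demo")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Each tuple is matched against the full primary key (a, b),
// so this deletes exactly three rows - one tombstone per row.
val keys = sc.parallelize(Seq((1, 10), (1, 11), (2, 20)))
keys.deleteFromCassandra("my_keyspace", "my_table")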

I have a table like this:

CREATE TABLE my_table (a int, b int, c int, PRIMARY KEY (a, b));

As you can see, a is the partition key and b is the clustering key.

I need a Spark app that deletes efficiently by partition key, not by the full primary key.

Indeed, my goal is to always drop entire partitions by their partition keys, rather than creating a tombstone for each primary key.

How can I do that with the Spark connector?

Thank you

Yes, it's possible if you specify the keyColumns parameter to the .deleteFromCassandra function (docs). For example, if you have a composite partition key consisting of two columns, part1 and part2:

rdd.deleteFromCassandra("keyspace", "table", 
  keyColumns = SomeColumns("part1", "part2"))

This method works only with RDDs; if you use DataFrames, you just need to call df.rdd first. Also, in some versions of the connector, you may need to restrict the selection to just the partition columns - see the discussion in this answer.
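
Applied to the question's table with its single partition-key column a, a minimal sketch (my_keyspace and my_table are placeholders, and Tuple1 is used because there is only one key column):

import com.datastax.spark.connector._

// Partition-key values whose entire partitions should be dropped.
val partitions = sc.parallelize(Seq(Tuple1(1), Tuple1(2)))

// Restricting keyColumns to the partition key makes the connector
// issue one partition-level delete per value instead of a
// per-row tombstone for each primary key.
partitions.deleteFromCassandra("my_keyspace", "my_table",
  keyColumns = SomeColumns("a"))

// With a DataFrame, convert to an RDD of key tuples first:
// df.select("a").distinct().rdd
//   .map(row => Tuple1(row.getInt(0)))
//   .deleteFromCassandra("my_keyspace", "my_table",
//     keyColumns = SomeColumns("a"))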
