Spark RDD join with Cassandra Table
I am joining a Spark RDD with a Cassandra table (used as a lookup), but I am not able to understand a few things. Will Spark pull all the records between range_start and range_end from the Cassandra table and then join them with the RDD in Spark memory, or will it push all the values from the RDD down to Cassandra and perform the join there? Where will the limit be applied (Cassandra or Spark)? Will Spark always pull the same number of records from Cassandra? Code below:
// creating a dataframe with the fields required for the join with the
// Cassandra table, and converting it to an RDD
val df_for_join = src_df.select(src_df("col1"), src_df("col2"))
val rdd_for_join = df_for_join.rdd

val result_rdd = rdd_for_join
  .joinWithCassandraTable("my_keyspace", "my_table",
    selectedColumns = SomeColumns("col1", "col2", "col3", "col4"),
    joinColumns = SomeColumns("col1", "col2"))
  .where("created_at > 'range_start' and created_at <= 'range_end'")
  .clusteringOrder(Ascending)
  .limit(1)
Cassandra table details -
PRIMARY KEY ((col1, col2), created_at) WITH CLUSTERING ORDER BY (created_at ASC)
joinWithCassandraTable extracts the partition/primary key values from the RDD you pass in, and converts them into individual requests against the corresponding partitions in Cassandra. On top of that, SCC may apply additional filtering, such as your where condition. If I remember correctly (but I could be wrong), the limit is not fully pushed down to Cassandra - it may still fetch up to limit rows for every partition.
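Conceptually it behaves like the hand-written version below - a minimal sketch only, assuming the table and columns from your question; SCC actually groups these per-key requests and runs them in parallel, so this illustrates the access pattern, not the connector's real implementation:

import com.datastax.spark.connector.cql.CassandraConnector

// Illustration only: each element of the RDD becomes its own SELECT
// against a single Cassandra partition. 'range_start'/'range_end' are
// the same placeholders as in the question.
val connector = CassandraConnector(sc.getConf)
val manual_join = rdd_for_join.mapPartitions { keys =>
  connector.withSessionDo { session =>
    val stmt = session.prepare(
      "SELECT col1, col2, col3, col4 FROM my_keyspace.my_table " +
      "WHERE col1 = ? AND col2 = ? " +
      "AND created_at > 'range_start' AND created_at <= 'range_end' LIMIT 1")
    keys.map { k =>
      // one single-partition request per key; the LIMIT applies to this
      // request only, i.e. per partition, not to the overall result
      session.execute(stmt.bind(
        k.get(0).asInstanceOf[AnyRef],
        k.get(1).asInstanceOf[AnyRef])).one()
    }.toList.iterator // materialize before the session is released
  }
}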
You can always check where the join happens by executing result_rdd.toDebugString. For my code:
val df_for_join = Seq((2, 5), (5, 2)).toDF("col1", "col2")
val rdd_for_join = df_for_join.rdd

val result_rdd = rdd_for_join
  .joinWithCassandraTable("test", "jt",
    selectedColumns = SomeColumns("col1", "col2", "v"),
    joinColumns = SomeColumns("col1", "col2"))
  .where("created_at > '2020-03-13T00:00:00Z' and created_at <= '2020-03-14T00:00:00Z'")
  .limit(1)
it gives the following:
scala> result_rdd.toDebugString
res7: String =
(2) CassandraJoinRDD[14] at RDD at CassandraRDD.scala:19 []
| MapPartitionsRDD[2] at rdd at <console>:45 []
| MapPartitionsRDD[1] at rdd at <console>:45 []
| ParallelCollectionRDD[0] at rdd at <console>:45 []
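Note that the lineage ends at a single CassandraJoinRDD and contains no shuffle stage - the join is performed by the connector itself. Incidentally, this example also gives a quick way to check the per-partition limit behaviour mentioned above (something to verify against your own data, not a documented guarantee):

// rdd_for_join has two keys; if the limit is applied per matched partition,
// this can print 2 even though .limit(1) was set
println(result_rdd.count())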
Whereas if you do a "normal" join, you'll get the following:
scala> val rdd1 = sc.parallelize(Seq((2, 5),(5, 2)))
rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[21] at parallelize at <console>:44
scala> val ct = sc.cassandraTable[(Int, Int)]("test", "jt").select("col1", "col2")
ct: com.datastax.spark.connector.rdd.CassandraTableScanRDD[(Int, Int)] = CassandraTableScanRDD[31] at RDD at CassandraRDD.scala:19
scala> rdd1.join(ct)
res15: org.apache.spark.rdd.RDD[(Int, (Int, Int))] = MapPartitionsRDD[34] at join at <console>:49
scala> rdd1.join(ct).toDebugString
res16: String =
(6) MapPartitionsRDD[37] at join at <console>:49 []
| MapPartitionsRDD[36] at join at <console>:49 []
| CoGroupedRDD[35] at join at <console>:49 []
+-(3) ParallelCollectionRDD[21] at parallelize at <console>:44 []
+-(6) CassandraTableScanRDD[31] at RDD at CassandraRDD.scala:19 []
Here, in contrast, the CoGroupedRDD and the full CassandraTableScanRDD in the lineage show that the whole table is scanned and shuffled rather than queried per key. The corresponding section of the SCC documentation has more information.