
Spark colocated join between two partitioned dataframes

For the following join between two DataFrames in Spark 1.6.0:

// Repartition both DataFrames into 32 partitions by the join key "a" and cache the results
val df0Rep = df0.repartition(32, col("a")).cache
val df1Rep = df1.repartition(32, col("a")).cache
// Join on "a" and force evaluation
val dfJoin = df0Rep.join(df1Rep, "a")
println(dfJoin.count)

Is this join not only co-partitioned but also co-located? I know that for RDDs, if both use the same partitioner and are shuffled in the same operation, the join will be co-located. But what about DataFrames? Thank you.
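A quick way to check this yourself is to inspect the physical plan: if Spark inserts an Exchange (shuffle) node directly above the join, the existing partitioning was not reused. This is a minimal sketch, assuming the df0Rep/df1Rep/dfJoin definitions from the snippet above; queryExecution is a developer API, but it works for ad-hoc inspection.

// Print the physical plan; an Exchange node directly above the join would mean
// Spark re-shuffles the data instead of reusing the repartitioning by "a".
dfJoin.explain()

// The plan is also available as a string if you want to check it programmatically.
val physicalPlan = dfJoin.queryExecution.executedPlan.toString
println(physicalPlan.contains("Exchange")) // true => at least one shuffle in the plan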

https://medium.com/@achilleus/https-medium-com-joins-in-apache-spark-part-3-1d40c1e51e1c

According to the article linked above, sort-merge join is the default join. I would like to add an important point:

For ideal performance of a sort-merge join, it is important that all rows having the same value for the join key are available in the same partition. This warrants the infamous partition exchange (shuffle) between executors. Collocated partitions can avoid unnecessary data shuffles. Data also needs to be evenly distributed on the join keys, and the join keys need to be unique enough that they can be equally distributed across the cluster to achieve maximum parallelism from the available partitions.
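One way to get collocated partitions without a shuffle at join time is to persist both tables bucketed on the join key. This is a minimal sketch rather than part of the original answer: the table names df0_bucketed and df1_bucketed and the bucket count 32 are illustrative, spark refers to a SparkSession, and DataFrameWriter.bucketBy requires Spark 2.x or later with a table catalog (it is not available in the Spark 1.6.0 mentioned in the question).

// Write both sides bucketed (and sorted) on the join key "a" so that rows with the
// same key always land in the same bucket; the bucket counts must match on both sides.
df0.write.bucketBy(32, "a").sortBy("a").saveAsTable("df0_bucketed")
df1.write.bucketBy(32, "a").sortBy("a").saveAsTable("df1_bucketed")

// Joining the bucketed tables on the bucketing column lets the planner skip the
// Exchange (and usually the Sort) step of the sort-merge join.
val joined = spark.table("df0_bucketed").join(spark.table("df1_bucketed"), "a")
joined.explain() // the physical plan should show no Exchange above the join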

