
How can I efficiently join a large RDD to a very large RDD in Spark?

I have two RDDs. One RDD has between 5 and 10 million entries and the other has between 500 and 750 million entries. At some point, I have to join these two RDDs on a common key.

val rddA = someData.rdd.map { x => (x.key, x) } // ~10 million
val rddB = someData.rdd.map { y => (y.key, y) } // ~600 million
val joinRDD = rddA.join(rddB)

When Spark plans this join, it chooses a ShuffledHashJoin. This causes many of the items in rddB to be shuffled across the network, and likewise some of rddA. In this case, rddA is too "big" to use as a broadcast variable, yet a BroadcastHashJoin seems like it would be more efficient. Is there a way to hint to Spark that it should use a BroadcastHashJoin? (Apache Flink supports this through join hints.)

If not, is increasing the autoBroadcastJoinThreshold the only option?
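For reference, a manual map-side join at the RDD level can emulate a broadcast join when the smaller side (or a trimmed-down projection of it) actually fits in memory. A minimal sketch, assuming sc is the SparkContext and the collected map of rddA fits on the driver and executors:

// Manual map-side join: broadcast the smaller side as a map and probe
// it on each partition of the larger side, so rddB is never shuffled.
val smallSide = sc.broadcast(rddA.collectAsMap())

val manualJoin = rddB.mapPartitions { iter =>
  iter.flatMap { case (k, b) =>
    // emit (key, (a, b)) only for keys present on both sides,
    // matching inner-join semantics of rddA.join(rddB)
    smallSide.value.get(k).map(a => (k, (a, b)))
  }
}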

Update 7/14

My performance issue appears to be rooted squarely in repartitioning. Normally, an RDD read from HDFS would be partitioned by block, but in this case the source was a Parquet datasource [that I made]. When Spark (Databricks) writes the Parquet file, it writes one file per partition, and likewise it reads one partition per file. So the best answer I've found is to partition the datasource by key during its production, write out the Parquet sink (which is then naturally co-partitioned), and use that as rddB.
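A rough sketch of that production step, assuming a DataFrame df with a key column on a reasonably recent Spark version (1.6+, where repartition by column is available); the column name and output path here are hypothetical:

// Repartition by the join key before writing; Spark writes one file
// per partition, so reading the Parquet back yields key-aligned partitions.
df.repartition(df("key"))
  .write
  .parquet("/path/to/rddB.parquet")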

The answer given is correct, but I think the details about the Parquet datasource may be useful to someone else.

You can partition the RDDs with the same partitioner; in this case, partitions with the same key will be collocated on the same executor.

This way you avoid a shuffle for the join operations.

The shuffle will happen only once, when you apply the partitioner, and if you cache the RDDs, all joins after that should be local to the executors.

import org.apache.spark.SparkContext._
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

class A
class B

val rddA: RDD[(String, A)] = ???
val rddB: RDD[(String, B)] = ???

val partitioner = new HashPartitioner(1000)

// partitionBy returns a new RDD, so keep the result; otherwise the
// co-partitioned, cached copies are never the ones used in later joins
val partitionedA = rddA.partitionBy(partitioner).cache()
val partitionedB = rddB.partitionBy(partitioner).cache()
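
With both sides cached under the same partitioner, a subsequent join can be evaluated partition-for-partition without shuffling either input:

// both inputs share the partitioner, so the join is local to executors
val joined = partitionedA.join(partitionedB)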

You can also try increasing the broadcast threshold size; maybe rddA can be broadcast:

--conf spark.sql.autoBroadcastJoinThreshold=300000000 # ~300 mb

We use 400 MB for broadcast joins, and it works well.
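The same threshold can also be set programmatically rather than on the command line; a sketch using the SQLContext configuration API (value in bytes):

// ~300 MB: plans whose smaller side is below this use a broadcast join
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "300000000")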
