
Many to many join on large datasets in Spark

I have two large datasets, A and B, which I wish to join on key K.

Each dataset contains many rows with the same value of K, so this is a many-to-many join.

This join fails with memory-related errors if I just try it naively.

Let's also say grouping both datasets by K, doing the join, and then exploding back out with some trickery to get the correct result isn't a viable option, again due to memory issues.

Are there any clever tricks people have found which improve the chance of this working?


Update:

Adding a very, very contrived concrete example:

spark-shell --master local[4] --driver-memory 5G --conf spark.sql.autoBroadcastJoinThreshold=-1 --conf spark.sql.shuffle.partitions=10000 --conf spark.default.parallelism=10000

// Dataset A: 100,000 rows, each carrying a ~1 MB payload and the constant join key 1
val numbersA = (1 to 100000).toList.toDS
val numbersWithDataA = numbersA.repartition(10000).map(n => (n, 1, Array.fill[Byte](1000*1000)(0)))
numbersWithDataA.write.mode("overwrite").parquet("numbersWithDataA.parquet")

// Dataset B: 100 rows, also with ~1 MB payloads and the same constant join key 1
val numbersB = (1 to 100).toList.toDS
val numbersWithDataB = numbersB.repartition(100).map(n => (n, 1, Array.fill[Byte](1000*1000)(0)))
numbersWithDataB.write.mode("overwrite").parquet("numbersWithDataB.parquet")


val numbersWithDataInA = spark.read.parquet("numbersWithDataA.parquet").toDF("numberA", "one", "dataA")
val numbersWithDataInB = spark.read.parquet("numbersWithDataB.parquet").toDF("numberB", "one", "dataB")

// Joining on the constant "one" key matches every A row with every B row (100,000 x 100 output rows)
numbersWithDataInA.join(numbersWithDataInB, Seq("one")).write.mode("overwrite").parquet("joined.parquet")

Fails with Caused by: java.lang.OutOfMemoryError: Java heap space

--conf spark.sql.autoBroadcastJoinThreshold=-1

means you are disabling the broadcast feature.

You can change it to any suitable value below 2 GB (since there is a 2 GB limit). According to the Spark documentation, spark.sql.autoBroadcastJoinThreshold defaults to 10 MB. I don't know why you have disabled it; if you disable it, SparkStrategies will switch the plan to a sort-merge join or a shuffle hash join. See my article for details.
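
For instance, a minimal sketch reusing the dataframes from the contrived example above (the 100 MB threshold is just an assumed example value, not a recommendation):

// Raise the auto-broadcast threshold (here an assumed 100 MB) so Spark can
// broadcast the smaller side instead of shuffling both sides.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)

// Or force a broadcast hash join explicitly with a hint, regardless of the threshold.
import org.apache.spark.sql.functions.broadcast
val joined = numbersWithDataInA.join(broadcast(numbersWithDataInB), Seq("one"))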

Other than that, I don't think there is any need to change anything, as this is the common pattern for joining two datasets.

Further reading: DataFrame join optimization - Broadcast Hash Join

UPDATE: Alternatively, in your real example (not contrived :-)) you can do these steps:

Steps:

1) For each dataset, find the join key values (for example, pick a unique/distinct category, country, or state field) and collect them as an array, since data that small can safely be collected to the driver.

2) For each category element in the array, join the two datasets (so you are only ever joining small slices of data) with the category as a where condition, and add each result to a sequence of dataframes.

3) Reduce and union these dataframes. Scala example (a fuller sketch follows the note below):

// Combine the per-category join results into a single dataframe
val dfCategories = Seq(df1Category1, df2Category2, df3Category3)
dfCategories.reduce(_ union _)

Note: for each join I still prefer BHJ (broadcast hash join), since there will be less or no shuffle.
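
Putting the steps together, here is a minimal sketch. The column name category and the input dataframes dfA and dfB are hypothetical stand-ins for the real data, and the broadcast hint assumes each per-category slice of B is small enough to broadcast:

import org.apache.spark.sql.functions.{broadcast, col}

// 1) Collect the distinct join-key values; assumed small enough to bring to the driver.
val categories = dfA.select("category").distinct().collect().map(_.getString(0))

// 2) Join the two datasets one category at a time, so each join only touches a small slice.
val perCategoryJoins = categories.map { c =>
  dfA.filter(col("category") === c)
    .join(broadcast(dfB.filter(col("category") === c)), Seq("category"))
}

// 3) Union the per-category results back into a single dataframe.
val result = perCategoryJoins.reduce(_ union _)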
