
Many to many join on large datasets in Spark

I have two large datasets, A and B, which I wish to join on key K.

Each dataset contains many rows with the same value of K, so this is a many-to-many join.

This join fails with memory-related errors if I just try it naively.

Let's also say grouping both datasets by K, doing the join, and then exploding back out with some trickery to get the correct result isn't a viable option, again due to memory issues.

Are there any clever tricks people have found which improve the chance of this working?


Update:

Adding a very, very contrived concrete example:

spark-shell --master local[4] --driver-memory 5G --conf spark.sql.autoBroadcastJoinThreshold=-1 --conf spark.sql.shuffle.partitions=10000 --conf spark.default.parallelism=10000

// Dataset A: 100,000 rows, each carrying a ~1 MB payload and the constant join key 1
val numbersA = (1 to 100000).toList.toDS
val numbersWithDataA = numbersA.repartition(10000).map(n => (n, 1, Array.fill[Byte](1000*1000)(0)))
numbersWithDataA.write.mode("overwrite").parquet("numbersWithDataA.parquet")

// Dataset B: 100 rows, also with ~1 MB payloads and the same constant join key 1
val numbersB = (1 to 100).toList.toDS
val numbersWithDataB = numbersB.repartition(100).map(n => (n, 1, Array.fill[Byte](1000*1000)(0)))
numbersWithDataB.write.mode("overwrite").parquet("numbersWithDataB.parquet")


val numbersWithDataInA = spark.read.parquet("numbersWithDataA.parquet").toDF("numberA", "one", "dataA")
val numbersWithDataInB = spark.read.parquet("numbersWithDataB.parquet").toDF("numberB", "one", "dataB")

// Joining on the constant "one" key matches every A row with every B row (100,000 x 100 output rows)
numbersWithDataInA.join(numbersWithDataInB, Seq("one")).write.mode("overwrite").parquet("joined.parquet")

Fails with Caused by: java.lang.OutOfMemoryError: Java heap space

--conf spark.sql.autoBroadcastJoinThreshold=-1

means you are disabling the broadcast feature.

You can change it to any suitable value below 2 GB (since there is a 2 GB limit). According to the Spark documentation, spark.sql.autoBroadcastJoinThreshold defaults to 10 MB. I don't know why you have disabled it; if you disable it, SparkStrategies will switch the plan to a sort-merge join or a shuffle hash join. See my article for details.
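
For instance, a minimal sketch reusing the dataframes from the contrived example above (the 100 MB threshold is just an assumed example value, not a recommendation):

// Raise the auto-broadcast threshold (here an assumed 100 MB) so Spark can
// broadcast the smaller side instead of shuffling both sides.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)

// Or force a broadcast hash join explicitly with a hint, regardless of the threshold.
import org.apache.spark.sql.functions.broadcast
val joined = numbersWithDataInA.join(broadcast(numbersWithDataInB), Seq("one"))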

Other than that, I don't think there is any need to change anything, as this is the common pattern for joining two datasets.

Further reading: DataFrame join optimization - Broadcast Hash Join

UPDATE: Alternatively, in your real example (not contrived :-)) you can do these steps:

Steps:

1) For each dataset, find the join key values (for example, pick a unique/distinct category, country, or state field) and collect them as an array, since data that small can safely be collected to the driver.

2) For each category element in the array, join the two datasets (so you are only ever joining small slices of data) with the category as a where condition, and add each result to a sequence of dataframes.

3) Reduce and union these dataframes. Scala example (a fuller sketch follows the note below):

// Combine the per-category join results into a single dataframe
val dfCategories = Seq(df1Category1, df2Category2, df3Category3)
dfCategories.reduce(_ union _)

Note: for each join I still prefer BHJ (broadcast hash join), since there will be less or no shuffle.
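
Putting the steps together, here is a minimal sketch. The column name category and the input dataframes dfA and dfB are hypothetical stand-ins for the real data, and the broadcast hint assumes each per-category slice of B is small enough to broadcast:

import org.apache.spark.sql.functions.{broadcast, col}

// 1) Collect the distinct join-key values; assumed small enough to bring to the driver.
val categories = dfA.select("category").distinct().collect().map(_.getString(0))

// 2) Join the two datasets one category at a time, so each join only touches a small slice.
val perCategoryJoins = categories.map { c =>
  dfA.filter(col("category") === c)
    .join(broadcast(dfB.filter(col("category") === c)), Seq("category"))
}

// 3) Union the per-category results back into a single dataframe.
val result = perCategoryJoins.reduce(_ union _)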
