
Cross join between two large datasets in Spark

I have 2 large datasets. The first dataset contains about 130 million entries.
The second dataset contains about 40,000 entries. The data is fetched from MySQL tables.

I need to do a cross join, but I am getting

java.sql.SQLException: GC overhead limit exceeded

What is the optimal technique to do this in Scala?

Following is a snippet of my code:

// Partitioned JDBC read: partition column "id", bounds 100..100000, 40 partitions
val df1 = spark.read.jdbc(jdbcURL, configurationLoader.mysql_table1, "id", 100, 100000, 40, MySqlConnection.getConnectionProperties)
val df2 = spark.read.jdbc(jdbcURL, configurationLoader.mysql_table2, MySqlConnection.getConnectionProperties)
val df2Cache = df2.repartition(40).cache()
val crossProduct = df1.join(df2Cache)   // join with no condition, i.e. the cross product

df1 is the larger dataset and df2 is the smaller one.

130M * 40K = 5.2 trillion records, which means 5.2 terabytes of memory are required to store this data even if we assume that each record is 1 byte, which is most certainly not true. If it is as much as 64 bytes (which I think is still a very conservative estimate), you'd need roughly 330 terabytes of memory just to store the data. That is a very large amount, so unless you have a very large cluster and a very fast network inside that cluster, you might want to rethink your algorithm to make it work.
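As a quick sanity check, here is the back-of-the-envelope arithmetic in plain Scala (the 64-byte row width is just an assumption for illustration):

val leftRows    = 130L * 1000 * 1000          // ~130 million entries
val rightRows   = 40L * 1000                  // ~40 thousand entries
val crossRows   = leftRows * rightRows        // 5,200,000,000,000 = 5.2 trillion rows
val bytesPerRow = 64L                         // assumed, conservative row width
val totalTB     = crossRows * bytesPerRow / 1e12
println(f"$crossRows%,d rows, ~$totalTB%.0f TB")   // ~333 TB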

That being said, when you join two SQL datasets/dataframes, the number of partitions that Spark uses to store the result of the join is controlled by the spark.sql.shuffle.partitions property (see here). You might want to set it to a very large number, and set the number of executors to the largest that you can. Then you might be able to run your processing to the end.
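For example (a minimal sketch only: df1 and df2Cache are the dataframes from the question, the partition count is a placeholder to tune for your cluster, and the executor count/memory would normally be given to spark-submit via --num-executors / --executor-memory):

// Many output partitions so each shuffle/result partition stays small
spark.conf.set("spark.sql.shuffle.partitions", "10000")

// Spell the cross join out explicitly (Spark 2.1+)
val crossProduct = df1.crossJoin(df2Cache)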

Additionally, you may want to look into the spark.shuffle.minNumPartitionsToHighlyCompress option; if you set it to less than your number of shuffle partitions, you might get another memory boost. Note that this option was a hardcoded constant set to 2000 until a recent Spark version, so depending on your environment you may just need to set spark.sql.shuffle.partitions to a number greater than 2000 to make use of it.
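A sketch of how the two settings relate, assuming Spark 2.4+ where SPARK-24519 made the threshold configurable (the option is internal, so it is usually passed at submit time with --conf; the values below are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("large-cross-join")
  // more shuffle partitions than the threshold below, so HighlyCompressedMapStatus kicks in
  .config("spark.sql.shuffle.partitions", "5000")
  .config("spark.shuffle.minNumPartitionsToHighlyCompress", "1000")
  .getOrCreate()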

I agree with Vladimir; just adding a few more points.

See MapStatus; set spark.sql.shuffle.partitions to 2001 (the old approach; the default is 200).

The new approach (spark.shuffle.minNumPartitionsToHighlyCompress) is the one Vladimir mentioned in his answer.

Why this change? MapStatus had the value 2000 hardcoded; see SPARK-24519.

It will apply a different algorithm depending on the number of partitions:

def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
    // Above the threshold: store only an average size plus a bitmap of empty blocks
    if (uncompressedSizes.length > minPartitionsToUseHighlyCompressMapStatus) {
      HighlyCompressedMapStatus(loc, uncompressedSizes)
    } else {
      // Otherwise track every block's size individually (one byte per block)
      new CompressedMapStatus(loc, uncompressedSizes)
    }
  }

HighlyCompressedMapStatus:

A MapStatus implementation that stores the accurate size of huge blocks, which are larger than spark.shuffle.accurateBlockThreshold. It stores the average size of other non-empty blocks, plus a bitmap for tracking which blocks are empty.

spark.shuffle.accurateBlockThreshold - see here: when we compress the size of shuffle blocks in HighlyCompressedMapStatus, we will record the size accurately if it's above this config. This helps to prevent OOM by avoiding underestimating shuffle block sizes when fetching shuffle blocks.
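If large shuffle blocks are being underestimated, this threshold can be lowered so that more block sizes are reported exactly. A sketch only: the 32 MB value is illustrative, the default is 100 MB, and the setting is normally passed at submit time with --conf.

import org.apache.spark.sql.SparkSession

// Report block sizes above 32 MB exactly instead of by the average
val spark = SparkSession.builder()
  .config("spark.shuffle.accurateBlockThreshold", 32L * 1024 * 1024)
  .getOrCreate()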


CompressedMapStatus:

A MapStatus implementation that tracks the size of each block. The size of each block is represented using a single byte.
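Roughly speaking, the one-byte encoding works on a logarithmic scale, so large sizes are only stored approximately. A self-contained illustration of the idea (not a copy of Spark's internals):

// Squeeze a block size into one byte on a log scale; precision drops as sizes grow
def compressSize(size: Long): Byte =
  if (size <= 0L) 0.toByte
  else math.min(255, math.ceil(math.log(size.toDouble) / math.log(1.1)).toInt).toByte

def decompressSize(compressed: Byte): Long =
  if (compressed == 0) 0L
  else math.pow(1.1, (compressed & 0xFF).toDouble).toLong

// A 1 GiB block round-trips to roughly the right magnitude, not the exact size
val approx = decompressSize(compressSize(1L << 30))   // ~1.2e9 rather than 1073741824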

Also add the following to your spark-submit:

--conf spark.yarn.executor.memoryOverhead=<10% of executor memory> --conf spark.shuffle.compress=true --conf spark.shuffle.spill.compress=true

In both cases, compression will use spark.io.compression.codec.

Conclusion: large tasks should use HighlyCompressedMapStatus, and executor memory overhead can be 10 percent of your executor memory.

Further, have a look at spark memory tuning.

Increase SPARK_EXECUTOR_MEMORY to a higher value and repartition to more partitions.
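A minimal sketch of that suggestion (the memory size and partition count are placeholders to tune; df1 and df2Cache are the dataframes from the question):

// spark-submit ... --executor-memory 16g      (or export SPARK_EXECUTOR_MEMORY=16g)
val df1Wide = df1.repartition(400)             // spread the large side over more partitions
val crossProduct = df1Wide.crossJoin(df2Cache)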
