
Cross join between two large datasets in Spark

I have two large datasets. The first contains about 130 million entries and the second about 40,000 entries. The data is fetched from MySQL tables.

I need to do a cross-join but I am getting

java.sql.SQLException: GC overhead limit exceeded

What is the optimal technique to do this in Scala?

Following is a snippet of my code:

val df1 = spark.read.jdbc(jdbcURL, configurationLoader.mysql_table1, "id", 100, 100000, 40, MySqlConnection.getConnectionProperties)
val df2 = spark.read.jdbc(jdbcURL, configurationLoader.mysql_table2, MySqlConnection.getConnectionProperties)
val df2Cache = df2.repartition(40).cache()
val crossProduct = df1.join(df2Cache)

df1 is the larger dataset and df2 is the smaller one.

130M * 40K = 5.2 trillion records, which means about 5.2 terabytes of memory just to store the result, and that is assuming each record is a single byte, which is most certainly not true. If a record is as much as 64 bytes (which I think is still a very conservative estimate), you'd need roughly 333 terabytes of memory just to store the data. That is a very large amount, so unless you have a very large cluster with a very fast network inside it, you might want to rethink your algorithm to make it work.

That being said, when you join two SQL datasets/dataframes, the number of partitions that Spark uses to store the result of the join is controlled by the spark.sql.shuffle.partitions property. You might want to set it to a very large number, and set the number of executors to the largest you can. Then you might be able to run your processing to the end.
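For illustration, a minimal sketch of what that could look like, assuming the SparkSession spark and the dataframes df1 / df2Cache from the question (the value 4000 is a placeholder, not a recommendation):

// Raise the shuffle partition count so each join task handles a smaller slice of the result.
// 4000 is a placeholder value; tune it to your cluster and data volume.
spark.conf.set("spark.sql.shuffle.partitions", "4000")

// crossJoin makes the intent explicit; a plain join() with no condition requires
// spark.sql.crossJoin.enabled in older Spark versions.
val crossProduct = df1.crossJoin(df2Cache)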

Additionally, you may want to look into the spark.shuffle.minNumPartitionsToHighlyCompress option; if you set it to a value below your number of shuffle partitions, you might get another memory boost. Note that this threshold was a hardcoded constant of 2000 until a recent Spark version, so depending on your environment you may simply need to set spark.sql.shuffle.partitions to a number greater than 2000 to make use of it.
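A hedged sketch of setting both options when the session is created; spark.shuffle.minNumPartitionsToHighlyCompress is an internal option introduced by SPARK-24519, so whether it is honored depends on your Spark version, and the numbers below are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("large-cross-join")
  // Placeholder values: keep the threshold below the shuffle partition count
  // so that HighlyCompressedMapStatus is used for the join's map outputs.
  .config("spark.sql.shuffle.partitions", "4000")
  .config("spark.shuffle.minNumPartitionsToHighlyCompress", "2000")
  .getOrCreate()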

I agree with Vladimir, and thought of adding a few more points.

Old approach: see MapStatus and set spark.sql.shuffle.partitions to 2001 (the default is 200).

New approach: spark.shuffle.minNumPartitionsToHighlyCompress, as Vladimir mentioned in his answer.

Why this change? MapStatus had 2000 hardcoded until SPARK-24519 made it configurable.

Above that threshold it will apply a different algorithm:

def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
  if (uncompressedSizes.length > minPartitionsToUseHighlyCompressMapStatus) {
    // many partitions: keep only an average size plus exact sizes of huge blocks
    HighlyCompressedMapStatus(loc, uncompressedSizes)
  } else {
    // few partitions: keep one compressed size byte per block
    new CompressedMapStatus(loc, uncompressedSizes)
  }
}

HighlyCompressedMapStatus:

A MapStatus implementation that stores the accurate size of huge blocks, which are larger than spark.shuffle.accurateBlockThreshold. It stores the average size of other non-empty blocks, plus a bitmap for tracking which blocks are empty.

spark.shuffle.accurateBlockThreshold: When we compress the size of shuffle blocks in HighlyCompressedMapStatus, we record the size accurately if it is above this threshold. This helps prevent OOM by avoiding underestimating shuffle block sizes when fetching shuffle blocks.


CompressedMapStatus:

A MapStatus implementation that tracks the size of each block. Size for each block is represented using a single byte.

Also set the following on your spark-submit:

--conf spark.yarn.executor.memoryOverhead=<10% of executor memory> --conf spark.shuffle.compress=true --conf spark.shuffle.spill.compress=true

In both cases, compression will use spark.io.compression.codec.

Conclusion: large tasks should use HighlyCompressedMapStatus, and executor memory overhead can be about 10 percent of your executor memory.

Further, have a look at Spark memory tuning.

Increase SPARK_EXECUTOR_MEMORY to a higher value and repartition into more partitions.
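A minimal sketch of that suggestion, assuming the dataframes from the question; the partition count is a placeholder, and executor memory itself is normally set at submit time (e.g. via --executor-memory or spark.executor.memory) rather than from inside the job:

// Repartition the large side into more, smaller partitions before the cross join.
// 2000 is a placeholder value.
val df1Repartitioned = df1.repartition(2000)
val crossProduct = df1Repartitioned.crossJoin(df2Cache)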
