
Cross join optimizations on AWS Glue/Spark

I have 2 dataframes:

df1 - 7 columns (IDs and VARCHARs), rows: 1,700,000

df2 - 7 columns (IDs and VARCHARs), rows: 25,000

I need to find all possible similarities, so there is no way to skip the Cartesian product.

AWS Glue: Cluster with 10 (or 20) G.1X Workers

I already tested with 178 partitions (Spark calculated this on the fly when df1 was filtered from a bigger DataFrame). Running time: 10 hours... I stopped the job, but more than 999 part-XXX-YYYYY files had already been written to S3.
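For context, the straightforward job looked roughly like this (a sketch only; the S3 paths and app name are placeholders, not from my actual job):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cross-join-baseline").getOrCreate()

df1 = spark.read.parquet("s3://my-bucket/df1/")   # ~1,700,000 rows
df2 = spark.read.parquet("s3://my-bucket/df2/")   # ~25,000 rows

# Plain cross join: every df1 row paired with every df2 row (~42.5 billion rows).
df1.crossJoin(df2).write.mode("overwrite").parquet("s3://my-bucket/output/")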

Question: How can this cross join be optimized on Glue/Spark, given that the cross join itself cannot be avoided?

With the approach and Glue configuration below, the job completed in 121 minutes:

Glue Details =>

Worker type => G.2X

Number of workers => 50. You could also try 149; that should complete the job in 35-45 minutes.

I created two files:

df1 => 7 columns, 1,700,000 rows, size 140 MB (depending on your column sizes, the file size may differ for you)

df2 => 7 columns, 25,000 rows, size 2 MB

Now I repartitioned the first DataFrame into roughly 40,500 partitions.

How did I get 40,500? First I created a version of DF1 with just 1 record, kept DF2 with all 25,000 records, cross joined them, and saved the output.
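As a sketch, that one-record measurement could look roughly like this (the S3 path is a placeholder, not from the original answer):

# Cross join a single df1 record with the full df2 and write the result,
# then check the size of the written file to estimate output size per record.
probe = df1.limit(1).crossJoin(df2)
probe.write.mode("overwrite").parquet("s3://my-bucket/tmp/crossjoin-size-probe/")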

The output was a 3.5 MB file. For best performance, the optimum partition size should be around 128 MB; let's assume you want each partition to be about 150 MB.

Since the output generated from 1 record was 3.5 MB, reaching a 150 MB partition size takes approximately 42 records per partition. We have 1,700,000 records, which makes approximately 40,500 partitions.
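The same sizing arithmetic as a small sketch you can rerun with your own measured numbers (the values below are the ones from this answer):

import math

per_record_output_mb = 3.5    # cross-join output produced by a single df1 record
target_partition_mb = 150.0   # desired output size per partition
total_records = 1_700_000     # rows in df1

records_per_partition = int(target_partition_mb // per_record_output_mb)  # -> 42
num_partitions = math.ceil(total_records / records_per_partition)         # -> 40477, rounded to ~40,500 above

print(records_per_partition, num_partitions)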

For you, the output size for 1 record may differ; use the same approach to calculate the partition count. After the repartition, just use a cross join with a broadcast of the smaller DataFrame.

from pyspark.sql.functions import broadcast

df1 = df1.repartition(40500)
result = df1.crossJoin(broadcast(df2))
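For what it's worth, broadcasting df2 should make Spark use a broadcast nested loop join, so the large DataFrame is not shuffled for the join itself; the repartition of df1 mainly controls how many tasks the work is split into and roughly how large each output file ends up.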
