
Cross join optimizations on AWS Glue/Spark

I have 2 dataframes:

df1 - 7 columns (IDs and VARCHARs), rows: 1,700,000

df2 - 7 columns (IDs and VARCHARs), rows: 25,000

I need to find all possible similarities, so there is no way to skip the Cartesian product.

AWS Glue: Cluster with 10 (or 20) G.1X Workers

I already tested with 178 partitions (the number Spark calculated on the fly when df1 was filtered out of a bigger dataframe). Running time: 10 hours... I stopped the job, but more than 999 part-XXX-YYYYY files had already been written to S3.

Question: how can I optimize this cross join on Glue/Spark, given that there is no way to avoid it?
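For reference, the job is essentially the baseline below (a minimal sketch; the S3 paths, file format, and dataframe names are placeholders, not the actual ones from my job):

# Baseline sketch of the current job; paths and format are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.read.parquet("s3://my-bucket/df1/")   # ~1,700,000 rows, 7 columns
df2 = spark.read.parquet("s3://my-bucket/df2/")   # ~25,000 rows, 7 columns

# Plain cross join: 1,700,000 x 25,000 = 42.5 billion output rows.
df1.crossJoin(df2).write.parquet("s3://my-bucket/cross-join-output/")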

With the approach and Glue configuration below, the job completed in 121 minutes:

Glue Details =>

Workers => G.2X

Number of Workers => 50. You could also try 149; that should complete the job in 35-45 minutes.

I have created two files:

df1 => 7 columns, rows: 1,700,000, size 140 MB (depending on your column sizes, the file size may differ for you)

df2 => 7 columns, rows: 25,000, size 2 MB

Now I have repartitioned the first dataframe into roughly 40,500 partitions.

How did I arrive at that number? First, I created DF1 with 1 record and DF2 with 25,000 records, ran the cross join, and saved the output.

The output was a 3.5 MB file. For best performance, the optimum partition size should be around 128 MB; let's assume you want each partition to be about 150 MB.

The output generated from 1 record was 3.5 MB, so to reach a 150 MB partition size we need approximately 42 records per partition. We have 1,700,000 records, which gives approximately 40,500 partitions.
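To make the arithmetic explicit, here is the same calculation as a small Python sketch (the 3.5 MB and 150 MB figures come from the measurement above; your values will differ):

# Back-of-the-envelope partition count, using the figures measured above.
import math

output_per_record_mb = 3.5   # cross join of 1 df1 record with all 25,000 df2 records
target_partition_mb = 150    # desired size of one output partition

records_per_partition = math.floor(target_partition_mb / output_per_record_mb)  # 42
num_partitions = math.ceil(1_700_000 / records_per_partition)                   # 40477, i.e. ~40,500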

For you, the output produced by 1 record may be a different size; use the same approach to calculate the partition count. After the repartition, just use a cross join with broadcast:

from pyspark.sql.functions import broadcast

df1 = df1.repartition(40500)
result = df1.crossJoin(broadcast(df2))
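Putting the pieces together, an end-to-end sketch of this approach could look like the following (the read/write paths and file format are placeholders; repartition, crossJoin, and broadcast are the calls described above):

# End-to-end sketch of the repartition + broadcast cross join; paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

df1 = spark.read.parquet("s3://my-bucket/df1/")   # big side, ~1,700,000 rows
df2 = spark.read.parquet("s3://my-bucket/df2/")   # small side, ~25,000 rows (~2 MB), safe to broadcast

# Spread the big side over ~40,500 partitions so each cross-join output partition is ~150 MB.
df1 = df1.repartition(40500)

# Broadcasting df2 ships the small table to every executor, so the big side is never shuffled.
result = df1.crossJoin(broadcast(df2))

result.write.parquet("s3://my-bucket/cross-join-output/")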
