

PySpark optimize left join of two big tables

I'm using the latest version of PySpark on Databricks. I have two tables, each around 25-30 GB in size. I want to join Table1 and Table2 on the "id" and "id_key" columns respectively. I can do that with the command below, but when I run my Spark job the join is skewed, with over 95% of the data landing on a single executor, so the job takes forever. This happens when I attempt to load the data after transforming it.

Table1 has 13 columns in total; its "id" column contains a lot of null values alongside some actual id values.

Table2 has 3 columns in total; its "id_key" column contains every possible id value, each appearing exactly once.

I tried broadcasting, but because the tables are pretty large, I get OutOfMemory errors when running the job.

Table1.join(Table2, Table1.id == Table2.id_key, "left")
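For reference, the broadcast attempt was roughly the following (using the standard broadcast hint; the exact call I used may have differed slightly):

from pyspark.sql import functions as F

# Hint Spark to ship Table2 to every executor; with a ~25-30 GB Table2
# this is what triggers the OutOfMemory errors mentioned above.
Table1.join(F.broadcast(Table2), Table1.id == Table2.id_key, "left")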

I'm thinking of salting, but I'm not sure how to go about it or whether it is the right solution.

As I understand your problem, Spark must be putting all the rows of Table1 with a null id into the same partition when it partitions the data for the join.
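You can confirm this quickly by counting how many rows fall into the null bucket, for example:

from pyspark.sql import functions as F

# Rough skew check: if the null bucket dominates, all of those rows hash
# to the same partition during the join's shuffle.
Table1.groupBy(F.col('id').isNull().alias('id_is_null')).count().show()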

To solve this, you can use the following pattern:

  • split the Table1 dataframe into a null-ids dataframe and a not-null-ids dataframe
  • join the not-null-ids dataframe with Table2
  • add the columns of Table2, filled with null values, to the null-ids dataframe
  • union the resulting not-null-ids dataframe and the null-ids dataframe

You can find the code translation of this pattern below:

from pyspark.sql import functions as F

# Split Table1 on whether the join key is null
Table1_with_null_ids = Table1.filter(F.col('id').isNull())
Table1_with_not_null_ids = Table1.filter(F.col('id').isNotNull())

# Join only the rows that can actually match something in Table2
Table1_with_not_null_ids_joined = Table1_with_not_null_ids.join(
  Table2,
  Table1_with_not_null_ids.id == Table2.id_key,
  'left'
)

# Add Table2's columns as nulls ('id_key', 'table2_column2' and
# 'table2_column3' stand for Table2's actual column names), which is what
# the left join would have produced for these rows anyway
Table1_with_null_ids_joined = Table1_with_null_ids \
  .withColumn('id_key', F.lit(None)) \
  .withColumn('table2_column2', F.lit(None)) \
  .withColumn('table2_column3', F.lit(None))

Table1_joined = Table1_with_not_null_ids_joined.unionByName(Table1_with_null_ids_joined)

This avoids manual salting, and it may improve performance since you join Table2 against far fewer rows on the Table1 side.

However, you need to compute the input Table1 twice, since you run filter twice on the same Table1. If computing Table1 is an expensive process, you can either cache Table1 before the double filtering, or proceed as you suggested: add a salting column to Table1 and Table2 and use it in your join expression.
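If you go the salting route instead, a minimal sketch could look like the following (the number of salt buckets and the extra column names are illustrative assumptions, not tuned for your data):

from pyspark.sql import functions as F

N_SALTS = 32  # illustrative; tune to the observed skew and your cluster

# Add a random salt bucket to the skewed side; rows with a null id get
# different salts, so they no longer all hash to the same partition.
Table1_salted = Table1.withColumn('salt', (F.rand() * N_SALTS).cast('int'))

# Replicate Table2 once per salt value so every (id_key, salt) pair exists
# on the right-hand side of the join.
Table2_salted = Table2.withColumn(
  'salt', F.explode(F.array([F.lit(i) for i in range(N_SALTS)]))
)

Table1_joined = Table1_salted.join(
  Table2_salted,
  (Table1_salted.id == Table2_salted.id_key) & (Table1_salted.salt == Table2_salted.salt),
  'left'
).drop(Table1_salted.salt).drop(Table2_salted.salt)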
