PySpark optimize left join of two big tables
I'm using the most recent version of PySpark on Databricks. I have two tables, each around 25-30 GB in size. I want to join Table1 and Table2 on the "id" and "id_key" columns respectively. I'm able to do that with the command below, but when I run my Spark job the join is skewed: over 95% of my data ends up on one executor, making the job take forever. This happens when I attempt to load the data after transforming it.
Table1 has 13 columns in total; its "id" column has a lot of null values alongside some actual id values.
Table2 has 3 columns in total; its "id_key" column contains every possible id value, each appearing once.
I tried broadcasting, but because the tables are pretty large I get OutOfMemory errors when running the job:
Table1.join(Table2, Table1.id == Table2.id_key, "left")
I'm thinking of salting, but I'm not sure how to go about it or whether it's the right solution.
As I understand your problem, Spark must be putting all the rows of Table1 with a null id into the same partition when it partitions the data for the join.
To solve this, you can use the following pattern:

- Split Table1 into a null-ids dataframe and a not-null-ids dataframe
- Join the not-null-ids dataframe with Table2
- Add Table2's columns, filled with null values, to the null-ids dataframe
- Union the two resulting dataframes

You can find the code translation of this pattern below:
from pyspark.sql import functions as F

# Split Table1 on whether "id" is null
Table1_with_null_ids = Table1.filter(F.col('id').isNull())
Table1_with_not_null_ids = Table1.filter(F.col('id').isNotNull())

# Join only the rows that can actually match something in Table2
Table1_with_not_null_ids_joined = Table1_with_not_null_ids.join(
    Table2,
    Table1_with_not_null_ids.id == Table2.id_key,
    'left'
)

# Pad the null-id rows with Table2's columns set to null
# (replace 'table2_column2'/'table2_column3' with Table2's real column names)
Table1_with_null_ids_joined = Table1_with_null_ids \
    .withColumn('id_key', F.lit(None)) \
    .withColumn('table2_column2', F.lit(None)) \
    .withColumn('table2_column3', F.lit(None))

# Reassemble the full left-join result
Table1_joined = Table1_with_not_null_ids_joined.unionByName(Table1_with_null_ids_joined)
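To convince yourself that this split/join/union produces the same rows as the plain left join, here is a small plain-Python sketch (no Spark; the toy data and the helper function are illustrative assumptions) that mirrors the SQL semantics where a null key never matches anything:

```python
def left_join(left, right, key_left, key_right):
    """Naive left join over lists of dicts; a None key never matches, as in SQL."""
    index = {}
    for r in right:
        index.setdefault(r[key_right], []).append(r)
    out = []
    for row in left:
        matches = index.get(row[key_left], []) if row[key_left] is not None else []
        if matches:
            for r in matches:
                out.append({**row, **r})
        else:
            # unmatched left row: pad the right side's columns with None
            out.append({**row, "id_key": None, "val": None})
    return out

table1 = [{"id": 1, "a": "x"}, {"id": None, "a": "y"}, {"id": 2, "a": "z"}]
table2 = [{"id_key": 1, "val": 10}, {"id_key": 2, "val": 20}]

# The pattern: split on null id, join only the non-null part, pad the null part
nulls = [row for row in table1 if row["id"] is None]
not_nulls = [row for row in table1 if row["id"] is not None]
joined = left_join(not_nulls, table2, "id", "id_key")
padded = [{**row, "id_key": None, "val": None} for row in nulls]
combined = joined + padded

# Same multiset of rows as joining table1 directly
direct = left_join(table1, table2, "id", "id_key")
```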
This avoids manual salting, and it may improve performance since you join Table2 against far fewer rows on the Table1 side.
However, you need to compute the input Table1 twice, since you apply a filter twice to the same Table1. If computing the input Table1 is an expensive process, you can either cache Table1 before the double filtering, or proceed as you suggested: add a salting column to Table1 and Table2 and use it in your join expression.