PySpark: unable to perform a join after repartitioning a DataFrame

Question

We are joining two huge files.

So we are trying to repartition on the key column and then join on that column.

Code snippet

def repartition_df(df,primary_key,partition_value):
    df = df.repartition(partition_value,primary_key)


df_1 = repartition_df(df1,'pk1', 4 )
df_2 = repartition_df(df2,'pk1', 4 )

df3 = df_1.join(df_2,on =  ['pk1'] , how = 'left')

Error message

An error was encountered:
'NoneType' object has no attribute 'join'
Traceback (most recent call last):
AttributeError: 'NoneType' object has no attribute 'join'

When it works:

If I don't repartition and go ahead with the join, it works fine.

But from a performance perspective, we would like to join after repartitioning.

Can you please let me know how to proceed?

Answer 1

repartition_df has no return statement, so it implicitly returns None, which leaves df_1 and df_2 set to None. Add a return statement and the join works.

def repartition_df(df, primary_key, partition_value):
    df = df.repartition(partition_value, primary_key)
    return df

df_1 = repartition_df(df1, 'pk1', 4)
df_2 = repartition_df(df2, 'pk1', 4)

df3 = df_1.join(df_2, on=['pk1'], how='left')
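
To see why the missing return produces the 'NoneType' error, here is a minimal sketch in plain Python that runs without Spark. FakeFrame is a hypothetical stand-in, not a PySpark class; it only mimics the fact that repartition returns a new object rather than modifying the one it is called on. The two helper function names are also made up for illustration.

```python
# Hypothetical stand-in for a DataFrame, used only to show return-value behaviour.
class FakeFrame:
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions

    def repartition(self, num_partitions, *cols):
        # Like DataFrame.repartition, return a new object instead of mutating self.
        return FakeFrame(num_partitions)


def repartition_no_return(df, primary_key, partition_value):
    # The result is bound to a local name and then discarded when the function ends.
    df = df.repartition(partition_value, primary_key)


def repartition_with_return(df, primary_key, partition_value):
    # Returning the result hands the new object back to the caller.
    return df.repartition(partition_value, primary_key)


original = FakeFrame(num_partitions=1)
broken = repartition_no_return(original, "pk1", 4)
fixed = repartition_with_return(original, "pk1", 4)

print(broken is None)           # True: calling broken.join(...) would raise AttributeError
print(fixed.num_partitions)     # 4
print(original.num_partitions)  # 1: the input object is left unchanged
```

Rebinding df inside the function only changes the local name, so the caller never sees the repartitioned object unless it is returned.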

Answer 1: accepted, 2 votes, 2020-12-02 06:47:08