
Unable to perform join after repartition of PySpark data frame

We are joining two huge files.

So we are trying to repartition both data frames on the key column and then join them on that same column.

Code snippet:

def repartition_df(df,primary_key,partition_value):
    df = df.repartition(partition_value,primary_key)


df_1 = repartition_df(df1,'pk1', 4 )
df_2 = repartition_df(df2,'pk1', 4 )

df3 = df_1.join(df_2,on =  ['pk1'] , how = 'left')

Error message:

An error was encountered:
'NoneType' object has no attribute 'join'
Traceback (most recent call last):
AttributeError: 'NoneType' object has no attribute 'join'

When it works:

If I don't repartition and go ahead with the join directly, it works fine.

But from a performance perspective, we would like to join after repartitioning.

Can you please let me know how to proceed?

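Your repartition_df function never returns the repartitioned data frame, so it implicitly returns None. That makes df_1 and df_2 both None, which is exactly what the error reports when .join is called. Just add a return statement and your solution will work fine.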

def repartition_df(df, primary_key, partition_value):
    df = df.repartition(partition_value, primary_key)
    return df

df_1 = repartition_df(df1, 'pk1', 4)
df_2 = repartition_df(df2, 'pk1', 4)

df3 = df_1.join(df_2, on=['pk1'], how='left')
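
To see why the missing return causes this exact error, here is a minimal sketch that reproduces it without a Spark session. FakeFrame is a hypothetical stand-in class invented for this illustration (it is not part of PySpark); the only assumption it encodes is that repartition returns a new object rather than modifying the data frame in place, as DataFrame.repartition does.

```python
class FakeFrame:
    """Hypothetical stand-in for a data frame; not part of PySpark."""

    def __init__(self, partitions=1):
        self.partitions = partitions

    def repartition(self, num_partitions, *cols):
        # Return a new object instead of mutating self.
        return FakeFrame(num_partitions)

    def join(self, other, on=None, how="inner"):
        return FakeFrame(self.partitions)


def repartition_no_return(df, primary_key, partition_value):
    # Rebinds the local name only; with no return statement
    # the caller receives None.
    df = df.repartition(partition_value, primary_key)


def repartition_with_return(df, primary_key, partition_value):
    return df.repartition(partition_value, primary_key)


broken = repartition_no_return(FakeFrame(), "pk1", 4)
fixed = repartition_with_return(FakeFrame(), "pk1", 4)

print(broken is None)    # True
print(fixed.partitions)  # 4

error_message = None
try:
    broken.join(fixed, on=["pk1"], how="left")
except AttributeError as exc:
    # Same failure as in the question: .join is looked up on None.
    error_message = str(exc)
    print(error_message)
```

Assigning to df inside the function only rebinds the local variable, so the caller never sees the repartitioned object unless it is returned.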
