
How to write a universal function to join two PySpark dataframes?

I want to write a function that performs an inner join on two dataframes and also eliminates the repeated common columns after joining. As far as I'm aware there is no way to do that, since we always have to specify the common columns manually when joining. Or is there a way?

If you need to include all the common columns in the join condition, you can extract them into a list and pass it to join(). After the join, call drop() on those same columns to eliminate them from the result.

# Collect the column names shared by both DataFrames.
common_cols = list(set(df.columns).intersection(set(df2.columns)))

# Join on all common columns, then drop them from the result.
df3 = df.join(df2, common_cols, how='inner').drop(*common_cols)
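
Wrapped up as the reusable function the question asks for, this might look like the sketch below. The function name and the empty-intersection guard are my additions, not part of the original answer.

from pyspark.sql import DataFrame

def inner_join_on_common_cols(df: DataFrame, df2: DataFrame) -> DataFrame:
    """Inner-join two DataFrames on every column they share, then drop those columns."""
    common_cols = list(set(df.columns).intersection(set(df2.columns)))
    if not common_cols:
        # Guard against joining frames that share no columns at all.
        raise ValueError("The DataFrames have no common columns to join on.")
    return df.join(df2, common_cols, how='inner').drop(*common_cols)

Calling inner_join_on_common_cols(df, df2) then reproduces the df3 result from the snippet above.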
