pyspark join with null conditions
I am trying to join two pyspark dataframes as below based on the "Year" and "invoice" columns. But if "Year" is missing in df1, then I need to join based on "invoice" alone.
df1:

Year  invoice  Status  Item
2020  262      YES     bag
2019  252      YES     ball
2018  240      YES     pen
2017  228      YES     ink
2016  216      NO      headphone
2015  213      NO      bicycle
      198      NO      ribbon
      175      YES     phone
df2:

Year  invoice
2020  262
2016  216
2014  175
2013  198
2019  252
Expected output:

Year  invoice  Status  Item
2020  262      YES     bag
2016  216      NO      headphone
2014  175      YES     phone
2013  198      NO      ribbon
2019  252      YES     ball
I am able to join df1 and df2 as below, but only based on the "Year" and "invoice" columns. If "Year" is missing in df1, I need to add the logic to join the two dataframes based on "invoice" alone.
df_results = df1.join(df2, on=['Year', 'invoice'], how='left') \
    .drop(df2.Year) \
    .drop(df2.invoice)
Please let me know how to join on "invoice" alone when "Year" is not available in df1. Thanks.
I don't have your data to test this, but I would try adding a condition to the join operation:
cond = ((df1.Year == df2.Year) | df1.Year.isNull()) & (df1.invoice == df2.invoice)
df_results = df1.join(df2, on=cond, how='left') \
    .drop(df2.Year) \
    .drop(df2.invoice)
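One caveat with the snippet above: `.drop(df2.Year)` discards df2's year, so rows where df1's Year is null would come back with a null Year rather than 2013/2014 as in the expected output; in PySpark, `F.coalesce(df1.Year, df2.Year)` after the join can recover it. To sanity-check the join condition itself without a Spark session, here is a plain-Python sketch of the same logic on the question's sample data (the `conditional_join` helper is hypothetical, written for this illustration only):

```python
# Sample data from the question; None stands for a missing Year in df1.
df1 = [
    {"Year": 2020, "invoice": 262, "Status": "YES", "Item": "bag"},
    {"Year": 2019, "invoice": 252, "Status": "YES", "Item": "ball"},
    {"Year": 2018, "invoice": 240, "Status": "YES", "Item": "pen"},
    {"Year": 2017, "invoice": 228, "Status": "YES", "Item": "ink"},
    {"Year": 2016, "invoice": 216, "Status": "NO",  "Item": "headphone"},
    {"Year": 2015, "invoice": 213, "Status": "NO",  "Item": "bicycle"},
    {"Year": None, "invoice": 198, "Status": "NO",  "Item": "ribbon"},
    {"Year": None, "invoice": 175, "Status": "YES", "Item": "phone"},
]
df2 = [
    {"Year": 2020, "invoice": 262},
    {"Year": 2016, "invoice": 216},
    {"Year": 2014, "invoice": 175},
    {"Year": 2013, "invoice": 198},
    {"Year": 2019, "invoice": 252},
]

def conditional_join(left, right):
    """Join on invoice; additionally require Year to match unless the
    left Year is missing. The output Year is taken from the left row
    when present, otherwise from the right row (a coalesce)."""
    out = []
    for r in right:                      # iterate df2 to keep its row order
        for l in left:
            year_ok = l["Year"] is None or l["Year"] == r["Year"]
            if year_ok and l["invoice"] == r["invoice"]:
                out.append({
                    "Year": l["Year"] if l["Year"] is not None else r["Year"],
                    "invoice": l["invoice"],
                    "Status": l["Status"],
                    "Item": l["Item"],
                })
    return out

for row in conditional_join(df1, df2):
    print(row)
```

Running this reproduces the five rows of the expected output, including Year 2014 and 2013 pulled from df2 for the two invoices that have no Year in df1.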