
pyspark join with null conditions

I am trying to join two pyspark dataframes as below, based on the "Year" and "invoice" columns. But if "Year" is missing in df1, then I need to join based on "invoice" alone.

df1:

Year    invoice    Status   Item
2020    262        YES      bag
2019    252        YES      ball
2018    240        YES      pen
2017    228        YES      ink
2016    216        NO       headphone
2015    213        NO       bicycle
        198        NO       ribbon
        175        YES      phone
    

df2:

Year    invoice
2020    262
2016    216
2014    175
2013    198
2019    252

Expected output:

Year    invoice    Status   Item
2020    262        YES      bag
2016    216        NO       headphone
2014    175        YES      phone
2013    198        NO       ribbon
2019    252        YES      ball

I am able to join df1 and df2 as below (only based on the "Year" and "invoice" columns). If Year is missing in df1, I need to add the logic of joining the two dataframes based on "invoice" alone.

# Current approach: equi-join on both columns, then drop df2's copies
df_results = df1.join(df2, on=['Year', 'invoice'], how='left') \
                .drop(df2.Year) \
                .drop(df2.invoice)

Please let me know how to perform the join when "Year" is not available in df1, so that the dataframes are joined based on "invoice" alone. Thanks.

I don't have your code to test this, but I would try to add a condition to the join operation:

# Match on invoice always; require matching Year only when df1.Year is present
cond = ((df1.Year == df2.Year) | df1.Year.isNull()) & (df1.invoice == df2.invoice)
df_results = df1.join(df2, on=cond, how='left') \
                .drop(df2.Year) \
                .drop(df2.invoice)
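
For completeness, here is a minimal, self-contained sketch of that idea run against the sample data above (assuming an active Spark session). Two details are my own adjustments, not part of the answer as posted: the join is started from df2 so that exactly its five rows survive, and the two Year columns are coalesced so that df2's value (e.g. 2014 for invoice 175) fills in where df1's Year is null, matching the expected output:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; None marks the missing Year values in df1
df1 = spark.createDataFrame(
    [(2020, 262, "YES", "bag"), (2019, 252, "YES", "ball"),
     (2018, 240, "YES", "pen"), (2017, 228, "YES", "ink"),
     (2016, 216, "NO", "headphone"), (2015, 213, "NO", "bicycle"),
     (None, 198, "NO", "ribbon"), (None, 175, "YES", "phone")],
    ["Year", "invoice", "Status", "Item"])

df2 = spark.createDataFrame(
    [(2020, 262), (2016, 216), (2014, 175), (2013, 198), (2019, 252)],
    ["Year", "invoice"])

# Match on invoice always; require matching Year only when df1.Year is present
cond = ((df1.Year == df2.Year) | df1.Year.isNull()) & (df1.invoice == df2.invoice)

# Left join from df2 keeps exactly its five rows; coalesce prefers df1.Year
# and falls back to df2.Year for the rows where df1's Year is null
df_results = (df2.join(df1, on=cond, how='left')
                 .select(F.coalesce(df1.Year, df2.Year).alias('Year'),
                         df2.invoice, df1.Status, df1.Item))

df_results.show()

Note that dropping df2.Year as in the answer above would instead keep df1's null Year for invoices 175 and 198; which behavior is wanted depends on the use case.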
