PySpark join with a null condition

Question

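I am trying to join two PySpark DataFrames on the "Year" and "invoice" columns, as shown below. However, when "Year" is missing in df1, I need the join to use "invoice" alone.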

df1:

Year    invoice    Status   Item
2020    262        YES      bag
2019    252        YES      ball
2018    240        YES      pen
2017    228        YES      ink
2016    216        NO       headphone
2015    213        NO       bicycle
        198        NO       ribbon
        175        YES      phone

df2:

Year    invoice
2020    262
2016    216
2014    175
2013    198
2019    252

Expected output:

Year    invoice    Status   Item
2020    262        YES      bag
2016    216        NO       headphone
2014    175        YES      phone
2013    198        NO       ribbon
2019    252        YES      ball

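I can join df1 and df2 as follows, based only on the "Year" and "invoice" columns. If Year is missing in df1, I need to add logic that joins the two DataFrames on invoice alone.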

df_results = df1.join(df2, on=['Year', 'invoice'], how='left') \
                .drop(df2.Year) \
                .drop(df2.invoice)

Please let me know how to write the join so that, when "Year" is not available in df1, the DataFrames are joined on "invoice" only. Thanks.

Answer 1

I don't have your code to test this, but I would try adding a condition to the join operation:

cond = ((df1.Year == df2.Year) | df1.Year.isNull()) & (df1.invoice == df2.invoice)
df_results = df1.join(df2, on=cond, how='left') \
                .drop(df2.Year) \
                .drop(df2.invoice)

Solution 1
1 accepted 2020-11-18 16:11:49
