pyspark join with conditions for empty string
I have three dataframes as below.
df_prod
Year ID Name brand Point
2020 20903 Ken KKK 2000
2019 12890 Matt MMM 209
2017 346780 Nene NNN 2000
2020 346780 Nene NNN 6000
df_miss
Name brand point
Holy HHH 345
Joshi JJJ 900
df_sale
ID Name Year brand
126789 Holy 2010
346780 Nene 2017 NNN
346780 Nene 2020 NNN
I need to join df_sale depending on the conditions below. If "brand" is NOT NULL, then I need an INNER join of df_sale with df_prod on Year and Name. If "brand" is NULL, then I need to join df_sale with df_miss on Name.
Is it possible to have a when condition during joins in pyspark? I could find some examples in Scala, but I am looking for a pyspark implementation.
Pseudo code logic:
if brand != null:
    df_sale.join(df_prod, on=['Year', 'ID'], how='inner') and df_sale['Name'] == df_prod['Name'] and df_sale['point'] = df_prod['point']
elif brand == null:
    df_sale.join(df_miss, on=['Name'], how='inner') and df_sale['point'] = df_miss['point']
Expected output:
ID Name Year brand point
126789 Holy 2010 345
346780 Nene 2017 NNN 2000
346780 Nene 2020 NNN 2000
Is it possible to do this in pyspark or SQL? Please give some pointers. Thanks.
When you think about IF ... ELSE ... conditions in DataFrames (or, for that matter, SQL tables), please be aware that these need to apply to the table as if you were traversing it row by row. This leaves you with two options (please note that f denotes pyspark.sql.functions):
1. Split your df_sale table in two - df_sale_brand_null and df_sale_brand - based on the f.col("brand").isNull() condition, using something like [input_df.filter(~fail_test), input_df.filter(fail_test)]. Then you join each part with the relevant table (df_sale_brand_null with df_miss) on the desired columns, handle any misaligned columns, and finally unionByName the two joined tables.
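For illustration, here is a minimal sketch of this first option. The SparkSession setup and sample data are reconstructed from the question (the "Point" column is normalised to lowercase "point", the missing brand in df_sale is represented as a null, and the join keys Year and Name follow the question's prose rather than the pseudocode):

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as f

    spark = SparkSession.builder.getOrCreate()

    # Sample data from the question; column casing normalised to "point".
    df_prod = spark.createDataFrame(
        [(2020, 20903, "Ken", "KKK", 2000), (2019, 12890, "Matt", "MMM", 209),
         (2017, 346780, "Nene", "NNN", 2000), (2020, 346780, "Nene", "NNN", 6000)],
        ["Year", "ID", "Name", "brand", "point"])
    df_miss = spark.createDataFrame(
        [("Holy", "HHH", 345), ("Joshi", "JJJ", 900)],
        ["Name", "brand", "point"])
    df_sale = spark.createDataFrame(
        [(126789, "Holy", 2010, None), (346780, "Nene", 2017, "NNN"),
         (346780, "Nene", 2020, "NNN")],
        ["ID", "Name", "Year", "brand"])

    # Split df_sale in two on the null test.
    fail_test = f.col("brand").isNull()
    df_sale_brand, df_sale_brand_null = df_sale.filter(~fail_test), df_sale.filter(fail_test)

    # Join each half with its relevant table and align the output columns.
    matched = (df_sale_brand
               .join(df_prod.select("Year", "Name", "point"), on=["Year", "Name"], how="inner")
               .select("ID", "Name", "Year", "brand", "point"))
    missed = (df_sale_brand_null
              .join(df_miss.select("Name", "point"), on=["Name"], how="inner")
              .select("ID", "Name", "Year", "brand", "point"))

    # Finally, union the two joined halves by column name.
    result = matched.unionByName(missed)
    result.show()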
2. union the dataframes df_miss and df_prod, handling the missing columns in df_miss. Then you join df_sale with the unioned table (aliased a and b respectively) on a conditional statement, such as f.when(f.col("brand").isNotNull(), (f.col("a.Year") == f.col("b.Year")) & (f.col("a.ID") == f.col("b.ID"))).otherwise(f.col("a.Name") == f.col("b.Name")). The output of f.when(...).otherwise(...) is a column, and therefore your join statement will recognise it as a valid input to the on= argument.
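And a minimal sketch of this second option, reusing the df_sale, df_miss and df_prod dataframes built in the previous sketch. One detail added here: brand is qualified as "a.brand" in the condition, since a bare "brand" would be ambiguous once both sides of the join are aliased:

    import pyspark.sql.functions as f

    # df_sale, df_miss and df_prod as built in the option-1 sketch above.

    # Align df_miss with df_prod's schema by adding null Year/ID columns, then union.
    df_miss_aligned = (df_miss
                       .withColumn("Year", f.lit(None).cast("long"))
                       .withColumn("ID", f.lit(None).cast("long"))
                       .select("Year", "ID", "Name", "brand", "point"))
    unioned = df_prod.select("Year", "ID", "Name", "brand", "point").unionByName(df_miss_aligned)

    # Conditional join key: match on Year and ID when brand is present, on Name otherwise.
    cond = (f.when(f.col("a.brand").isNotNull(),
                   (f.col("a.Year") == f.col("b.Year")) & (f.col("a.ID") == f.col("b.ID")))
            .otherwise(f.col("a.Name") == f.col("b.Name")))

    # f.when(...).otherwise(...) yields a boolean Column, so it is a valid on= argument.
    result = (df_sale.alias("a")
              .join(unioned.alias("b"), on=cond, how="inner")
              .select("a.ID", "a.Name", "a.Year", "a.brand", "b.point"))
    result.show()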