
pyspark join with conditions for empty string

I have three dataframes as below.

df_prod

Year  ID      Name   brand  Point 
2020  20903   Ken    KKK    2000
2019  12890   Matt   MMM    209
2017  346780  Nene   NNN    2000
2020  346780  Nene   NNN    6000

df_miss

Name    brand   point
Holy    HHH     345
Joshi   JJJ     900

df_sale

ID      Name  Year    brand   
126789  Holy  2010            
346780  Nene  2017    NNN     
346780  Nene  2020    NNN     

I need to join df_sale depending on the condition below. If "brand" is NOT NULL, I need an INNER join of df_sale with df_prod on Year and Name. If "brand" is NULL, I need to join df_sale with df_miss on Name.

Is it possible to have a when condition during joins in pyspark? I could see some examples in Scala, but I am looking for a pyspark implementation.

Pseudo-code logic

if brand is not null:
    df_sale.join(df_prod, on=['Year', 'Name'], how='inner')   # take 'point' from df_prod
elif brand is null:
    df_sale.join(df_miss, on=['Name'], how='inner')           # take 'point' from df_miss

Expected output:

ID      Name  Year    brand   point
126789  Holy  2010            345
346780  Nene  2017    NNN     2000
346780  Nene  2020    NNN     2000

Is it possible to do this in pyspark or SQL? Please give some pointers. Thanks.

When you think about IF ... ELSE ... conditions in DataFrames (or, for that matter, SQL tables), be aware that they need to apply to the table as if you were traversing it row by row.

This leaves you with two options (please note that f denotes pyspark.sql.functions):

  1. You split your df_sale table in two - df_sale_brand_null and df_sale_brand - based on the f.col("brand").isNull() condition, using something like [input_df.filter(~fail_test), input_df.filter(fail_test)]. Then you join each half with the relevant table (df_sale_brand_null with df_miss) on the desired columns, handle misaligned columns, and finally unionByName the two joined tables.
  2. You union the dataframe df_miss with df_prod, handling the columns missing from df_miss. Then you join df_sale with the unioned table (aliased a and b respectively) on a conditional statement, such as f.when(f.col("brand").isNotNull(), (f.col("a.Year") == f.col("b.Year")) & (f.col("a.ID") == f.col("b.ID"))).otherwise(f.col("a.Name") == f.col("b.Name")). The output of f.when(...).otherwise(...) is a column, so your join statement will recognise it as a valid input to the on= argument.
