
Want to join dataframes based on two columns in PySpark

I am using PySpark in Databricks. I have two dataframes, df1 and df2. I want to left join df1 with df2, but the join condition depends on the values in df2's columns.

For each row of df1, the join should check whether df1.ID is present in df2.A. If it is, the matching df2 row is taken. Otherwise, if df2.A is Null, the join should check df2.B and keep the row if df2.B is the same as df1.ID.

df1

| ID       | 
| -------- | 
| aaa      | 
| bbb      | 
| ccc      | 
| ddd      | 
| eee      | 

df2

|     A    |     B    |     C    |
| -------- | -------- | -------- |
| aaa      |    aaa   | 23       |
| eee      |    bbb   | 32       |
| Null     |    ccc   | 45       |
| Null     |    ddd   | 76       |

Expected output

| ID       |      A    |     B    |     C    |
| -------- | --------  | -------- | -------- | 
| aaa      | aaa       |    aaa   | 23       |
| bbb      |           |          |          |
| ccc      | Null      |    ccc   | 45       | 
| ddd      | Null      |    ddd   | 76       | 
| eee      |  eee      |    bbb   | 32       |


I tried the following, but it does not give me the correct results:

join_conditions = [
    df1.ID == df2.A,
    (df1.ID  == df2.B) | (df2.A.isNull())
]

df3 = df1.join(df2, join_conditions,"left")
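
A likely reason this misses rows: when `on` is a list of Column conditions, PySpark combines the entries with AND, and any comparison against Null evaluates to unknown rather than true. The sketch below is a plain-Python model of that behaviour (no Spark involved); the helper names `sql_eq`, `sql_and`, `sql_or` and `attempted_condition` are made up for illustration, and `None` stands in for Null. Under this model only `aaa` finds a match.

```python
# Plain-Python model of SQL three-valued logic; None stands for NULL/unknown.
# All helper names here are hypothetical and exist only for this illustration.

def sql_eq(a, b):
    """NULL-aware equality: any comparison involving NULL is unknown (None)."""
    if a is None or b is None:
        return None
    return a == b

def sql_and(x, y):
    """SQL AND: False dominates, otherwise unknown propagates."""
    if x is False or y is False:
        return False
    if x is None or y is None:
        return None
    return True

def sql_or(x, y):
    """SQL OR: True dominates, otherwise unknown propagates."""
    if x is True or y is True:
        return True
    if x is None or y is None:
        return None
    return False

def attempted_condition(id_, a, b):
    # The two list entries from the attempt above, combined with AND.
    first = sql_eq(id_, a)
    second = sql_or(sql_eq(id_, b), a is None)
    return sql_and(first, second)

df1_ids = ["aaa", "bbb", "ccc", "ddd", "eee"]
df2_rows = [
    ("aaa", "aaa", "23"),
    ("eee", "bbb", "32"),
    (None, "ccc", "45"),
    (None, "ddd", "76"),
]

# A join only keeps pairs whose condition is strictly True.
attempted_matches = {
    id_: [row for row in df2_rows if attempted_condition(id_, row[0], row[1]) is True]
    for id_ in df1_ids
}

for id_, rows in attempted_matches.items():
    print(id_, rows)
```

For `ccc` and `ddd` the first entry compares the ID against a Null `A`, so the whole conjunction is unknown; for `eee` the first entry is true but the second is false, because `B` differs and `A` is not Null.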

Use the conditions df2.A.isNotNull() & (df1.ID == df2.A) and df2.A.isNull() & (df1.ID == df2.B), and combine them with |:

import pyspark.sql.functions as F
df1 = spark.createDataFrame(data=[["aaa"],["bbb"],["ccc"],["ddd"],["eee"]], schema=["ID"])
df2 = spark.createDataFrame(data=[["aaa","aaa","23"],["eee","bbb","32"],[None,"ccc","45"],[None,"ddd","76"]], schema=["A","B","C"])

result_df = df1.join(df2,
                     (
                         (df2.A.isNotNull() & (df1.ID == df2.A))
                         |
                         (df2.A.isNull() & (df1.ID == df2.B))
                     ),
                     how="left"
                     )

result_df.show()
+---+----+----+----+
| ID|   A|   B|   C|
+---+----+----+----+
|aaa| aaa| aaa|  23|
|bbb|null|null|null|
|ccc|null| ccc|  45|
|ddd|null| ddd|  76|
|eee| eee| bbb|  32|
+---+----+----+----+
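
Since the rule amounts to "match on A, or on B when A is Null", the same join could probably be written more compactly as `df1.ID == F.coalesce(df2.A, df2.B)`, using the `F` alias imported above. That variant is not part of the original answer. The sketch below only models the idea in plain Python, with hypothetical helpers `coalesce` and `left_join_on_coalesced_key`, to check that it reproduces the table shown above.

```python
# Plain-Python model of a left join keyed on COALESCE(A, B); no Spark involved.
# Helper names are hypothetical and exist only for this illustration.

def coalesce(*values):
    """Return the first non-None value, mirroring SQL COALESCE."""
    return next((v for v in values if v is not None), None)

def left_join_on_coalesced_key(ids, rows):
    """Pair each ID with every row whose coalesced (A, B) key equals it."""
    result = []
    for id_ in ids:
        matched = [row for row in rows if coalesce(row[0], row[1]) == id_]
        if matched:
            result.extend((id_, *row) for row in matched)
        else:
            # A left join keeps unmatched left rows, padded with NULLs.
            result.append((id_, None, None, None))
    return result

df1_ids = ["aaa", "bbb", "ccc", "ddd", "eee"]
df2_rows = [
    ("aaa", "aaa", "23"),
    ("eee", "bbb", "32"),
    (None, "ccc", "45"),
    (None, "ddd", "76"),
]

joined = left_join_on_coalesced_key(df1_ids, df2_rows)
for row in joined:
    print(row)
```

One design note: the coalesced form reads closer to the stated rule, while the explicit isNotNull/isNull version makes the two branches visible, which can be easier to extend if the branches ever need different conditions.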

