Want to join dataframe based on two columns in Pyspark
I am using PySpark in Databricks. I have two dataframes, df1 and df2, and I want to left join df1 with df2, but with a condition based on df2's columns: for each row of df1, check whether df1.ID exists in df2.A. If it does, take that row; if df2.A is Null, check df2.B instead and keep the row where it equals df1.ID.
df1
| ID |
| -------- |
| aaa |
| bbb |
| ccc |
| ddd |
| eee |
df2
| A | B | C |
| -------- | -------- | -------- |
| aaa | aaa | 23 |
| eee | bbb | 32 |
| Null | ccc | 45 |
| Null | ddd | 76 |
Output
| ID | A | B | C |
| -------- | -------- | -------- | -------- |
| aaa | aaa | aaa | 23 |
| bbb | Null | Null | Null |
| ccc | Null | ccc | 45 |
| ddd | Null | ddd | 76 |
| eee | eee | bbb | 32 |
I tried the following, but it does not give me the correct result:
join_conditions = [
    df1.ID == df2.A,
    (df1.ID == df2.B) | (df2.A.isNull())
]
df3 = df1.join(df2, join_conditions, "left")
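Note that when join is given a list of Column conditions, PySpark combines them with AND, which is why the attempt above fails: it behaves like the single combined condition sketched below, under which only the aaa row can match, since every match also requires df1.ID == df2.A.

```python
# The list of conditions above is ANDed together, so it is equivalent
# to this single condition. Rows where df2.A is Null fail
# df1.ID == df2.A outright, and eee fails because df1.ID == df2.B is
# false for it, leaving aaa as the only matching row.
df3 = df1.join(
    df2,
    (df1.ID == df2.A) & ((df1.ID == df2.B) | df2.A.isNull()),
    "left",
)
```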
Use the condition df2.A.isNotNull() & (df1.ID == df2.A) together with the condition df2.A.isNull() & (df1.ID == df2.B), and combine them with |:
import pyspark.sql.functions as F

df1 = spark.createDataFrame(data=[["aaa"], ["bbb"], ["ccc"], ["ddd"], ["eee"]], schema=["ID"])
df2 = spark.createDataFrame(data=[["aaa", "aaa", "23"], ["eee", "bbb", "32"], [None, "ccc", "45"], [None, "ddd", "76"]], schema=["A", "B", "C"])

result_df = df1.join(
    df2,
    (
        # match on A when it is present...
        (df2.A.isNotNull() & (df1.ID == df2.A))
        # ...otherwise fall back to matching on B
        | (df2.A.isNull() & (df1.ID == df2.B))
    ),
    how="left",
)
result_df.show()
+---+----+----+----+
| ID| A| B| C|
+---+----+----+----+
|aaa| aaa| aaa| 23|
|bbb|null|null|null|
|ccc|null| ccc| 45|
|ddd|null| ddd| 76|
|eee| eee| bbb| 32|
+---+----+----+----+
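As an aside, the same matching rule can be written more compactly with F.coalesce, comparing df1.ID against df2.A when A is non-null and against df2.B otherwise. This is an alternative sketch rather than part of the answer above, but it yields the same result for this data:

```python
# coalesce(A, B) returns A when it is non-null and B otherwise, so a
# single equality check covers both branches of the join condition.
alt_df = df1.join(df2, df1.ID == F.coalesce(df2.A, df2.B), how="left")
alt_df.show()
```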