
Conditional Join in Spark DataFrame

I am trying to join two DataFrames with a condition.

I have two DataFrames, A and B.

A contains the columns id, m_cd, and c_cd. B contains the columns m_cd, c_cd, and record.

The conditions are:

  • If m_cd in A is null, then join A's c_cd with B's c_cd
  • If m_cd is not null, then join A's m_cd with B's m_cd

We can use when and otherwise() in the withColumn() method of a DataFrame, so is there a way to do the same thing in a join condition?
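For context, this is the kind of when/otherwise usage I mean in withColumn (a minimal sketch, assuming A is loaded as dfA and spark.implicits._ is in scope):

import org.apache.spark.sql.functions.when
import spark.implicits._

// Tag each row of A by which code column would drive the join.
val labeled = dfA.withColumn("key_source",
    when($"m_cd".isNull, "c_cd").otherwise("m_cd"))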

I have already done this using a union, but wanted to know if there is any other option available.

You can use when/otherwise in the join condition:

// when/otherwise come from functions; $-columns need the implicits.
import org.apache.spark.sql.functions.when
import spark.implicits._

// Sample data: Option[Int] lets m_cd / c_cd be null.
case class Foo(m_cd: Option[Int], c_cd: Option[Int])
val dfA = spark.createDataset(Array(
    Foo(Some(1), Some(2)),
    Foo(Some(2), Some(3)),
    Foo(None, Some(4))
))


val dfB = spark.createDataset(Array(
    Foo(Some(1), Some(5)),
    Foo(Some(2), Some(6)),
    Foo(Some(10), Some(4))
))

// Choose the key per row: fall back to c_cd when m_cd is null.
val joinCondition = when($"a.m_cd".isNull, $"a.c_cd" === $"b.c_cd")
    .otherwise($"a.m_cd" === $"b.m_cd")

dfA.as("a").join(dfB.as("b"), joinCondition).show()
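With the sample data above, the first two rows of dfA join through m_cd and the null-m_cd row joins through c_cd, so show should print something like this (row order may vary):

+----+----+----+----+
|m_cd|c_cd|m_cd|c_cd|
+----+----+----+----+
|   1|   2|   1|   5|
|   2|   3|   2|   6|
|null|   4|  10|   4|
+----+----+----+----+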

It might still be more readable to use the union, though.
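For reference, a minimal sketch of that union approach with the same dfA/dfB as above: split A on whether m_cd is null, join each half on its own key, then union the results.

// Rows with m_cd join on m_cd; the rest join on c_cd.
val viaMcd = dfA.filter($"m_cd".isNotNull).as("a")
    .join(dfB.as("b"), $"a.m_cd" === $"b.m_cd")
val viaCcd = dfA.filter($"m_cd".isNull).as("a")
    .join(dfB.as("b"), $"a.c_cd" === $"b.c_cd")
viaMcd.union(viaCcd).show()

Both halves carry the same four columns in the same order, so the union is safe here.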

In case someone is trying to do it in PySpark, here's the syntax:

# when/otherwise live in pyspark.sql.functions
from pyspark.sql.functions import when

join_condition = when(df1.azure_resourcegroup.startswith('a_string'),
                      df1.some_field == df2.somefield) \
    .otherwise((df1.servicename == df2.type) &
               (df1.resourcegroup == df2.esource_group) &
               (df1.subscriptionguid == df2.subscription_id))
df1 = df1.join(df2, join_condition, how='left')
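Note that how='left' keeps every row of df1 even when the condition matches nothing in df2; the df2 columns come back null for those rows.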
