简体   繁体   中英

Joining two dataframes in spark scala based on OR condition

I have two data frames 1) Accounts and 2) Customers. The schema of accounts is as:

Name  Id    Telehone     Mob       email
 AR     1     123        1234      test1@gmail.com
 BR     2     213        4123      test2@gmail.com
 CR     3     231        3214      test3@gmail.com
 KR     4     132        1324      test4@gmail.com

Second table Customers as:

Id    Phone   Email
  2     2344    testq@gmail.com
  6     132     testf@gmail.com
  7     64562    test1@gmail.com

I need to join these two dataframes such that Id is matching Id OR Phone is matching Telephone OR Mob Or Email is matching email . In Above case in first row of Customers is matching on ID, Second is matching on phone and third on email. The join Should be left containin all records of Accounts.

Check below code.

scala> accountDF.show(false)
+----+---+---------+----+---------------+
|name|id |telephone|mob |email          |
+----+---+---------+----+---------------+
|AR  |1  |123      |1234|test1@gmail.com|
|BR  |2  |213      |4123|test2@gmail.com|
|CR  |3  |231      |3214|test3@gmail.com|
|KR  |4  |132      |1324|test4@gmail.com|
+----+---+---------+----+---------------+
scala> customerDF.show(false)
+---+-----+---------------+
|id |phone|email          |
+---+-----+---------------+
|2  |2344 |testq@gmail.com|
|6  |132  |testf@gmail.com|
|7  |64562|test1@gmail.com|
+---+-----+---------------+
scala> accountDF.printSchema
root
 |-- name: string (nullable = true)
 |-- id: string (nullable = true)
 |-- telephone: string (nullable = true)
 |-- mob: string (nullable = true)
 |-- email: string (nullable = true)
scala> customerDF.printSchema
root
 |-- id: string (nullable = true)
 |-- phone: string (nullable = true)
 |-- email: string (nullable = true)

scala> 

accountDF.join(customerDF,(accountDF("id") === customerDF("id") || (accountDF("telephone") === customerDF("phone") ||accountDF("mob") === customerDF("phone")) || accountDF("email") === customerDF("email")),"left").show(false)

+----+---+---------+----+---------------+----+-----+---------------+
|name|id |telephone|mob |email          |id  |phone|email          |
+----+---+---------+----+---------------+----+-----+---------------+
|AR  |1  |123      |1234|test1@gmail.com|7   |64562|test1@gmail.com|
|BR  |2  |213      |4123|test2@gmail.com|2   |2344 |testq@gmail.com|
|CR  |3  |231      |3214|test3@gmail.com|null|null |null           |
|KR  |4  |132      |1324|test4@gmail.com|6   |132  |testf@gmail.com|
+----+---+---------+----+---------------+----+-----+---------------+

You can easily meet this requirement with spark SQL .

Code to refer -

import org.apache.spark.sql.functions._

val accountdf = sc.parallelize(Seq(("AR",1,123,1234,"test1@gmail.com"),("BR", 2, 213, 4123, "test2@gmail.com"),("CR", 3, 231, 3214, "test3@gmail.com"),("KR", 4, 132, 1324, "test4@gmail.com"))).toDF("name","id","telephone","mob","email")


accountdf.createOrReplaceTempView("account")

val customerdf = sc.parallelize(Seq((2,2344,"testq@gmail.com"),(6,132,"testf@gmail.com"),(7,64562,"test1@gmail.com"))).toDF("id","phone","email")
   
customerdf.createOrReplaceTempView("customer")

sql("select * from account a left join customer c on a.id = c.id or (a.telephone = c.phone or a.mob = c.phone) or a.email = c.email").show(false)

+----+---+---------+----+---------------+----+-----+---------------+
|name|id |telephone|mob |email          |id  |phone|email          |
+----+---+---------+----+---------------+----+-----+---------------+
|BR  |2  |213      |4123|test2@gmail.com|2   |2344 |testq@gmail.com|
|KR  |4  |132      |1324|test4@gmail.com|6   |132  |testf@gmail.com|
|AR  |1  |123      |1234|test1@gmail.com|7   |64562|test1@gmail.com|
|CR  |3  |231      |3214|test3@gmail.com|null|null |null           |
+----+---+---------+----+---------------+----+-----+---------------+
  val sourceDF = Seq(("AR",1,123,1234,"test1@gmail.com"),
    ("BR",2,213,4123,"test2@gmail.com"),
  ("CR",3,231,3214,"test3@gmail.com"),
  ("KR",4,132,1324,"test4@gmail.com")
  ).toDF("Name","Id","Telehone","Mob","email")


  val sourceDF2 = Seq((2,2344,"testq@gmail.com"),
    (6,132,"testf@gmail.com"),
    (7,64562,"test1@gmail.com")
  ).toDF("Id","Phone","Email")

  val joinDF = sourceDF.join(sourceDF2,
    sourceDF.col("Id") === sourceDF2.col("Id") ||
      (sourceDF.col("Telehone") === sourceDF2.col("Phone") ||
      sourceDF.col("Mob") === sourceDF2.col("Phone")) ||
      sourceDF.col("email") === sourceDF2.col("Email")
    ,
  "inner")
  // use "inner" or "left" or ...

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM