I have two data frames: 1) Accounts and 2) Customers. The schema of Accounts is:
Name Id Telephone Mob email
AR 1 123 1234 test1@gmail.com
BR 2 213 4123 test2@gmail.com
CR 3 231 3214 test3@gmail.com
KR 4 132 1324 test4@gmail.com
The second table, Customers, is:
Id Phone Email
2 2344 testq@gmail.com
6 132 testf@gmail.com
7 64562 test1@gmail.com
I need to join these two dataframes such that Id matches Id, OR Phone matches Telephone OR Mob, OR Email matches email. In the above case, the first row of Customers matches on Id, the second matches on phone, and the third matches on email. The join should be a left join containing all records of Accounts.
Check the code below.
scala> accountDF.show(false)
+----+---+---------+----+---------------+
|name|id |telephone|mob |email |
+----+---+---------+----+---------------+
|AR |1 |123 |1234|test1@gmail.com|
|BR |2 |213 |4123|test2@gmail.com|
|CR |3 |231 |3214|test3@gmail.com|
|KR |4 |132 |1324|test4@gmail.com|
+----+---+---------+----+---------------+
scala> customerDF.show(false)
+---+-----+---------------+
|id |phone|email |
+---+-----+---------------+
|2 |2344 |testq@gmail.com|
|6 |132 |testf@gmail.com|
|7 |64562|test1@gmail.com|
+---+-----+---------------+
scala> accountDF.printSchema
root
|-- name: string (nullable = true)
|-- id: string (nullable = true)
|-- telephone: string (nullable = true)
|-- mob: string (nullable = true)
|-- email: string (nullable = true)
scala> customerDF.printSchema
root
|-- id: string (nullable = true)
|-- phone: string (nullable = true)
|-- email: string (nullable = true)
scala> accountDF.join(customerDF,(accountDF("id") === customerDF("id") || (accountDF("telephone") === customerDF("phone") ||accountDF("mob") === customerDF("phone")) || accountDF("email") === customerDF("email")),"left").show(false)
+----+---+---------+----+---------------+----+-----+---------------+
|name|id |telephone|mob |email |id |phone|email |
+----+---+---------+----+---------------+----+-----+---------------+
|AR |1 |123 |1234|test1@gmail.com|7 |64562|test1@gmail.com|
|BR |2 |213 |4123|test2@gmail.com|2 |2344 |testq@gmail.com|
|CR |3 |231 |3214|test3@gmail.com|null|null |null |
|KR |4 |132 |1324|test4@gmail.com|6 |132 |testf@gmail.com|
+----+---+---------+----+---------------+----+-----+---------------+
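The joined output above contains two columns named id and two named email, so a later select("id") on it would fail as ambiguous. One way around this is to alias both sides and rename the customer columns. This is only a sketch: the aliases a and c, the customer_* column names, and the AliasedJoin object are illustrative choices, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object AliasedJoin {
  // Builds the same left join as above, but with unambiguous output columns.
  def build(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Sample data copied from the question; all columns are strings.
    val accountDF = Seq(
      ("AR", "1", "123", "1234", "test1@gmail.com"),
      ("BR", "2", "213", "4123", "test2@gmail.com"),
      ("CR", "3", "231", "3214", "test3@gmail.com"),
      ("KR", "4", "132", "1324", "test4@gmail.com")
    ).toDF("name", "id", "telephone", "mob", "email")

    val customerDF = Seq(
      ("2", "2344", "testq@gmail.com"),
      ("6", "132", "testf@gmail.com"),
      ("7", "64562", "test1@gmail.com")
    ).toDF("id", "phone", "email")

    // Hypothetical aliases so each side can be referenced by a qualified name.
    val a = accountDF.as("a")
    val c = customerDF.as("c")

    a.join(
        c,
        col("a.id") === col("c.id") ||
          col("a.telephone") === col("c.phone") ||
          col("a.mob") === col("c.phone") ||
          col("a.email") === col("c.email"),
        "left")
      .select(
        col("a.*"), // every account column, names unchanged
        col("c.id").as("customer_id"),
        col("c.phone").as("customer_phone"),
        col("c.email").as("customer_email"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("aliased-join").getOrCreate()
    build(spark).show(false)
    spark.stop()
  }
}
```

In spark-shell, where spark already exists, calling AliasedJoin.build(spark).show(false) should be enough.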
You can easily meet this requirement with Spark SQL.
Reference code:
import org.apache.spark.sql.functions._
val accountdf = sc.parallelize(Seq(("AR",1,123,1234,"test1@gmail.com"),("BR", 2, 213, 4123, "test2@gmail.com"),("CR", 3, 231, 3214, "test3@gmail.com"),("KR", 4, 132, 1324, "test4@gmail.com"))).toDF("name","id","telephone","mob","email")
accountdf.createOrReplaceTempView("account")
val customerdf = sc.parallelize(Seq((2,2344,"testq@gmail.com"),(6,132,"testf@gmail.com"),(7,64562,"test1@gmail.com"))).toDF("id","phone","email")
customerdf.createOrReplaceTempView("customer")
sql("select * from account a left join customer c on a.id = c.id or (a.telephone = c.phone or a.mob = c.phone) or a.email = c.email").show(false)
+----+---+---------+----+---------------+----+-----+---------------+
|name|id |telephone|mob |email |id |phone|email |
+----+---+---------+----+---------------+----+-----+---------------+
|BR |2 |213 |4123|test2@gmail.com|2 |2344 |testq@gmail.com|
|KR |4 |132 |1324|test4@gmail.com|6 |132 |testf@gmail.com|
|AR |1 |123 |1234|test1@gmail.com|7 |64562|test1@gmail.com|
|CR |3 |231 |3214|test3@gmail.com|null|null |null |
+----+---+---------+----+---------------+----+-----+---------------+
val sourceDF = Seq(("AR",1,123,1234,"test1@gmail.com"),
  ("BR",2,213,4123,"test2@gmail.com"),
  ("CR",3,231,3214,"test3@gmail.com"),
  ("KR",4,132,1324,"test4@gmail.com")
).toDF("Name","Id","Telephone","Mob","email")
val sourceDF2 = Seq((2,2344,"testq@gmail.com"),
  (6,132,"testf@gmail.com"),
  (7,64562,"test1@gmail.com")
).toDF("Id","Phone","Email")
// Match on Id, OR on Phone against Telephone or Mob, OR on email.
val joinDF = sourceDF.join(sourceDF2,
  sourceDF.col("Id") === sourceDF2.col("Id") ||
    (sourceDF.col("Telephone") === sourceDF2.col("Phone") ||
      sourceDF.col("Mob") === sourceDF2.col("Phone")) ||
    sourceDF.col("email") === sourceDF2.col("Email"),
  "left")
// "left" keeps every Accounts row, as the question requires; use "inner" to keep only matched rows
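The question describes each Customers row as matching on a different key (Id, phone, or email). To make that visible in the result, the three conditions can be named and reused in a when chain. This is a rough sketch rather than part of the answer above: the MatchedOnJoin object, the matched_on and customer_id column names, and the id-then-phone-then-email priority are all assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.when

object MatchedOnJoin {
  // Left join on the OR condition, plus a label for the first condition that matched.
  def build(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Sample data copied from the question's tables.
    val sourceDF = Seq(
      ("AR", 1, 123, 1234, "test1@gmail.com"),
      ("BR", 2, 213, 4123, "test2@gmail.com"),
      ("CR", 3, 231, 3214, "test3@gmail.com"),
      ("KR", 4, 132, 1324, "test4@gmail.com")
    ).toDF("Name", "Id", "Telephone", "Mob", "email")

    val sourceDF2 = Seq(
      (2, 2344, "testq@gmail.com"),
      (6, 132, "testf@gmail.com"),
      (7, 64562, "test1@gmail.com")
    ).toDF("Id", "Phone", "Email")

    // Each condition gets a name so it can be reused in the join and in the label.
    val idMatch = sourceDF("Id") === sourceDF2("Id")
    val phoneMatch =
      sourceDF("Telephone") === sourceDF2("Phone") || sourceDF("Mob") === sourceDF2("Phone")
    val emailMatch = sourceDF("email") === sourceDF2("Email")

    sourceDF
      .join(sourceDF2, idMatch || phoneMatch || emailMatch, "left")
      .select(
        sourceDF("Name"),
        sourceDF2("Id").as("customer_id"),
        // No otherwise(): unmatched account rows get a null label.
        when(idMatch, "id").when(phoneMatch, "phone").when(emailMatch, "email").as("matched_on"))
      .orderBy("Name")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("matched-on-join").getOrCreate()
    build(spark).show(false)
    spark.stop()
  }
}
```

With the sample data, AR should be labelled email, BR id, and KR phone, while CR has no matching customer and keeps a null label.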