
merge two dataframes in scala spark

I have two dataframes:

dataframe1:

+-----+------+--------------+
| id  | name | has_bank_acc |
+-----+------+--------------+
|   0 | qwe  | true         |
|   1 | asd  | false        |
|   2 | rty  | false        |
|   3 | tyu  | true         |
+-----+------+--------------+

dataframe2:

+-----+------+---------------+
| id  | name | has_email_acc |
+-----+------+---------------+
|   0 | qwe  | true          |
|   5 | hjk  | false         |
|   8 | oiu  | false         |
|   7 | nmb  | true          |
+-----+------+---------------+

I have to merge these dataframes to get the following:

+-----+------+--------------+---------------+
| id  | name | has_bank_acc | has_email_acc |
+-----+------+--------------+---------------+
|   0 | qwe  | true         | null          |
|   1 | asd  | false        | null          |
|   2 | rty  | false        | null          |
|   3 | tyu  | true         | null          |
|   0 | qwe  | null         | true          |
|   5 | hjk  | null         | false         |
|   8 | oiu  | null         | false         |
|   7 | nmb  | null         | true          |
+-----+------+--------------+---------------+

I have tried union and join, but I wasn't successful.

A plain union does not work when the two DataFrames have different columns, and adding the missing columns as untyped null literals can cause data type errors, so a join can be used instead.
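For example (a sketch, assuming df1 and df2 hold the data from the question), a positional union runs but puts the has_email_acc values under has_bank_acc, while unionByName rejects the mismatched column names:

// Runs, but silently mixes has_email_acc values into the has_bank_acc column:
df1.union(df2).show()

// Fails with an AnalysisException, since has_email_acc / has_bank_acc
// do not exist on both sides:
// df1.unionByName(df2)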

scala> df1.show()
+---+----+------------+
| id|name|has_bank_acc|
+---+----+------------+
|  0| qwe|        true|
|  1| asd|       false|
|  2| rty|       false|
|  3| tyu|        true|
+---+----+------------+


scala> df2.show()
+---+----+-------------+
| id|name|has_email_acc|
+---+----+-------------+
|  0| qwe|         true|
|  5| hjk|        false|
|  8| oiu|        false|
|  7| nmb|         true|
+---+----+-------------+


The trick is to add a helper column "fid" with a different literal value in each DataFrame: the full outer join on ("fid", "id", "name") then never finds a match, so every row from each side is kept with nulls in the other side's columns, and "fid" can be dropped afterwards.

scala> val df11 = df1.withColumn("fid", lit(1))

scala> val df22 = df2.withColumn("fid", lit(2))

scala> df11.alias("1").join(df22.alias("2"), List("fid", "id", "name"), "full").drop("fid").show()
+---+----+------------+-------------+
| id|name|has_bank_acc|has_email_acc|
+---+----+------------+-------------+
|  0| qwe|        true|         null|
|  1| asd|       false|         null|
|  2| rty|       false|         null|
|  3| tyu|        true|         null|
|  0| qwe|        null|         true|
|  5| hjk|        null|        false|
|  8| oiu|        null|        false|
|  7| nmb|        null|         true|
+---+----+------------+-------------+

The solution could be:

scala> df1.show
+---+----+------------+
| id|name|has_bank_acc|
+---+----+------------+
|  0| qwe|        true|
|  1| asd|       false|
|  2| rty|       false|
|  3| tyu|        true|
+---+----+------------+


scala> df2.show
+---+----+-------------+
| id|name|has_email_acc|
+---+----+-------------+
|  0| qwe|         true|
|  5| hjk|        false|
|  8| oiu|        false|
|  7| nmb|         true|
+---+----+-------------+


scala> val cols1 = df1.columns.toSet
cols1: scala.collection.immutable.Set[String] = Set(id, name, has_bank_acc)

scala> val cols2 = df2.columns.toSet
cols2: scala.collection.immutable.Set[String] = Set(id, name, has_email_acc)

scala> val total = cols1 ++ cols2
total: scala.collection.immutable.Set[String] = Set(id, name, has_bank_acc, has_email_acc)

scala> def expr(myCols: Set[String], allCols: Set[String]) = {
     | allCols.toList.map(x => x match {
     | case x if myCols.contains(x) => col(x)
     | case _ => lit(null).as(x)
     | })
     | }
expr: (myCols: Set[String], allCols: Set[String])List[org.apache.spark.sql.Column]

scala> df1.select(expr(cols1, total): _*).unionAll(df2.select(expr(cols2,total): _*)).show
warning: there was one deprecation warning; re-run with -deprecation for details
+---+----+------------+-------------+
| id|name|has_bank_acc|has_email_acc|
+---+----+------------+-------------+
|  0| qwe|        true|         null|
|  1| asd|       false|         null|
|  2| rty|       false|         null|
|  3| tyu|        true|         null|
|  0| qwe|        null|         true|
|  5| hjk|        null|        false|
|  8| oiu|        null|        false|
|  7| nmb|        null|         true|
+---+----+------------+-------------+
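The deprecation warning above comes from unionAll; on Spark 2.0 and later the same result can be written with union, which has the same semantics here (a small sketch reusing the expr helper and the column sets defined above):

// Identical to the unionAll call above, without the deprecation warning
df1.select(expr(cols1, total): _*)
  .union(df2.select(expr(cols2, total): _*))
  .show()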

Let me know if it helps!!

"UnionAll" with missed columns adding can help:

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.BooleanType

dataframe1
  .withColumn("has_email_acc", lit(null).cast(BooleanType))
  .unionByName(dataframe2.withColumn("has_bank_acc", lit(null).cast(BooleanType)))
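On Spark 3.1 and later, unionByName can add the missing columns by itself via allowMissingColumns, so the manual withColumn calls are not needed (a minimal sketch, assuming that Spark version):

// Spark 3.1+: columns missing on either side are filled with nulls automatically
dataframe1.unionByName(dataframe2, allowMissingColumns = true)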
Another option is to compute the missing columns on each side with diff, add them via foldLeft, and then unionByName the results:

val data = Seq((0,"qwe","true"),(1,"asd","false"),(2,"rty","false"),(3,"tyu","true")).toDF("id","name","has_bank_acc")
scala> data.show
+---+----+------------+
| id|name|has_bank_acc|
+---+----+------------+
|  0| qwe|        true|
|  1| asd|       false|
|  2| rty|       false|
|  3| tyu|        true|
+---+----+------------+

val data2 = Seq((0,"qwe","true"),(5,"hjk","false"),(8,"oiu","false"),(7,"nmb","true")).toDF("id","name","has_email_acc")

scala> data2.show
+---+----+-------------+
| id|name|has_email_acc|
+---+----+-------------+
|  0| qwe|         true|
|  5| hjk|        false|
|  8| oiu|        false|
|  7| nmb|         true|
+---+----+-------------+

val data_cols = data.columns
val data2_cols = data2.columns

val transformedData = data2_cols.diff(data_cols).foldLeft(data) {
  case (df, newCols) => df.withColumn(newCols, lit("null"))
}

val transformedData2 = data_cols.diff(data2_cols).foldLeft(data2) {
  case (df, newCols) => df.withColumn(newCols, lit("null"))
}

val finalData = transformedData2.unionByName(transformedData)

scala> finalData.show
+---+----+-------------+------------+
| id|name|has_email_acc|has_bank_acc|
+---+----+-------------+------------+
|  0| qwe|         true|        null|
|  5| hjk|        false|        null|
|  8| oiu|        false|        null|
|  7| nmb|         true|        null|
|  0| qwe|         null|        true|
|  1| asd|         null|       false|
|  2| rty|         null|       false|
|  3| tyu|         null|        true|
+---+----+-------------+------------+
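Note that lit("null") above fills the added columns with the literal string "null" rather than a real null. If true nulls are preferred, and assuming the columns are strings as in this example, the literal can be cast instead; a small sketch under that assumption (the *Nulls names are illustrative):

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

val transformedDataNulls = data2_cols.diff(data_cols).foldLeft(data) {
  case (df, newCol) => df.withColumn(newCol, lit(null).cast(StringType))
}
val transformedData2Nulls = data_cols.diff(data2_cols).foldLeft(data2) {
  case (df, newCol) => df.withColumn(newCol, lit(null).cast(StringType))
}
val finalDataNulls = transformedData2Nulls.unionByName(transformedDataNulls)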
