
merge two dataframes in scala spark

I have two dataframes:

dataframe1:

+-----+------+--------------+
| id  | name | has_bank_acc |
+-----+------+--------------+
|   0 | qwe  | true         |
|   1 | asd  | false        |
|   2 | rty  | false        |
|   3 | tyu  | true         |
+-----+------+--------------+

dataframe2:

+-----+------+---------------+
| id  | name | has_email_acc |
+-----+------+---------------+
|   0 | qwe  | true          |
|   5 | hjk  | false         |
|   8 | oiu  | false         |
|   7 | nmb  | true          |
+-----+------+---------------+

I have to merge these dataframes to get the following:

+-----+------+--------------+---------------+
| id  | name | has_bank_acc | has_email_acc |
+-----+------+--------------+---------------+
|   0 | qwe  | true         | null          |
|   1 | asd  | false        | null          |
|   2 | rty  | false        | null          |
|   3 | tyu  | true         | null          |
|   0 | qwe  | null         | true          |
|   5 | hjk  | null         | false         |
|   8 | oiu  | null         | false         |
|   7 | nmb  | null         | true          |
+-----+------+--------------+---------------+

I have tried union and join, but I wasn't successful.

A plain union does not work when the two DataFrames have different columns, and adding the missing columns as untyped null literals can cause data type errors, so a join can be used instead.
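For example (a sketch, assuming df1 and df2 hold the data from the question), a positional union runs but puts the has_email_acc values under has_bank_acc, while unionByName rejects the mismatched column names:

// Runs, but silently mixes has_email_acc values into the has_bank_acc column:
df1.union(df2).show()

// Fails with an AnalysisException, since has_email_acc / has_bank_acc
// do not exist on both sides:
// df1.unionByName(df2)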

scala> df1.show()
+---+----+------------+
| id|name|has_bank_acc|
+---+----+------------+
|  0| qwe|        true|
|  1| asd|       false|
|  2| rty|       false|
|  3| tyu|        true|
+---+----+------------+


scala> df2.show()
+---+----+-------------+
| id|name|has_email_acc|
+---+----+-------------+
|  0| qwe|         true|
|  5| hjk|        false|
|  8| oiu|        false|
|  7| nmb|         true|
+---+----+-------------+


The trick is to add a helper column "fid" with a different literal value in each DataFrame: the full outer join on ("fid", "id", "name") then never finds a match, so every row from each side is kept with nulls in the other side's columns, and "fid" can be dropped afterwards.

scala> val df11 = df1.withColumn("fid", lit(1))

scala> val df22 = df2.withColumn("fid", lit(2))

scala> df11.alias("1").join(df22.alias("2"), List("fid", "id", "name"), "full").drop("fid").show()
+---+----+------------+-------------+
| id|name|has_bank_acc|has_email_acc|
+---+----+------------+-------------+
|  0| qwe|        true|         null|
|  1| asd|       false|         null|
|  2| rty|       false|         null|
|  3| tyu|        true|         null|
|  0| qwe|        null|         true|
|  5| hjk|        null|        false|
|  8| oiu|        null|        false|
|  7| nmb|        null|         true|
+---+----+------------+-------------+

The solution could be:

scala> df1.show
+---+----+------------+
| id|name|has_bank_acc|
+---+----+------------+
|  0| qwe|        true|
|  1| asd|       false|
|  2| rty|       false|
|  3| tyu|        true|
+---+----+------------+


scala> df2.show
+---+----+-------------+
| id|name|has_email_acc|
+---+----+-------------+
|  0| qwe|         true|
|  5| hjk|        false|
|  8| oiu|        false|
|  7| nmb|         true|
+---+----+-------------+


scala> val cols1 = df1.columns.toSet
cols1: scala.collection.immutable.Set[String] = Set(id, name, has_bank_acc)

scala> val cols2 = df2.columns.toSet
cols2: scala.collection.immutable.Set[String] = Set(id, name, has_email_acc)

scala> val total = cols1 ++ cols2
total: scala.collection.immutable.Set[String] = Set(id, name, has_bank_acc, has_email_acc)

scala> def expr(myCols: Set[String], allCols: Set[String]) = {
     | allCols.toList.map(x => x match {
     | case x if myCols.contains(x) => col(x)
     | case _ => lit(null).as(x)
     | })
     | }
expr: (myCols: Set[String], allCols: Set[String])List[org.apache.spark.sql.Column]

scala> df1.select(expr(cols1, total): _*).unionAll(df2.select(expr(cols2,total): _*)).show
warning: there was one deprecation warning; re-run with -deprecation for details
+---+----+------------+-------------+
| id|name|has_bank_acc|has_email_acc|
+---+----+------------+-------------+
|  0| qwe|        true|         null|
|  1| asd|       false|         null|
|  2| rty|       false|         null|
|  3| tyu|        true|         null|
|  0| qwe|        null|         true|
|  5| hjk|        null|        false|
|  8| oiu|        null|        false|
|  7| nmb|        null|         true|
+---+----+------------+-------------+
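The deprecation warning above comes from unionAll; on Spark 2.0 and later the same result can be written with union, which has the same semantics here (a small sketch reusing the expr helper and the column sets defined above):

// Identical to the unionAll call above, without the deprecation warning
df1.select(expr(cols1, total): _*)
  .union(df2.select(expr(cols2, total): _*))
  .show()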

Let me know if it helps!!

"UnionAll" with missed columns adding can help:

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.BooleanType

dataframe1
  .withColumn("has_email_acc", lit(null).cast(BooleanType))
  .unionByName(dataframe2.withColumn("has_bank_acc", lit(null).cast(BooleanType)))
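On Spark 3.1 and later, unionByName can add the missing columns by itself via allowMissingColumns, so the manual withColumn calls are not needed (a minimal sketch, assuming that Spark version):

// Spark 3.1+: columns missing on either side are filled with nulls automatically
dataframe1.unionByName(dataframe2, allowMissingColumns = true)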
Another option is to compute the missing columns on each side with diff, add them via foldLeft, and then unionByName the results:

val data = Seq((0,"qwe","true"),(1,"asd","false"),(2,"rty","false"),(3,"tyu","true")).toDF("id","name","has_bank_acc")
scala> data.show
+---+----+------------+
| id|name|has_bank_acc|
+---+----+------------+
|  0| qwe|        true|
|  1| asd|       false|
|  2| rty|       false|
|  3| tyu|        true|
+---+----+------------+

val data2 = Seq((0,"qwe","true"),(5,"hjk","false"),(8,"oiu","false"),(7,"nmb","true")).toDF("id","name","has_email_acc")

scala> data2.show
+---+----+-------------+
| id|name|has_email_acc|
+---+----+-------------+
|  0| qwe|         true|
|  5| hjk|        false|
|  8| oiu|        false|
|  7| nmb|         true|
+---+----+-------------+

val data_cols = data.columns
val data2_cols = data2.columns

val transformedData = data2_cols.diff(data_cols).foldLeft(data) {
  case (df, newCols) => df.withColumn(newCols, lit("null"))
}

val transformedData2 = data_cols.diff(data2_cols).foldLeft(data2) {
  case (df, newCols) => df.withColumn(newCols, lit("null"))
}

val finalData = transformedData2.unionByName(transformedData)

scala> finalData.show
+---+----+-------------+------------+
| id|name|has_email_acc|has_bank_acc|
+---+----+-------------+------------+
|  0| qwe|         true|        null|
|  5| hjk|        false|        null|
|  8| oiu|        false|        null|
|  7| nmb|         true|        null|
|  0| qwe|         null|        true|
|  1| asd|         null|       false|
|  2| rty|         null|       false|
|  3| tyu|         null|        true|
+---+----+-------------+------------+
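Note that lit("null") above fills the added columns with the literal string "null" rather than a real null. If true nulls are preferred, and assuming the columns are strings as in this example, the literal can be cast instead; a small sketch under that assumption (the *Nulls names are illustrative):

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

val transformedDataNulls = data2_cols.diff(data_cols).foldLeft(data) {
  case (df, newCol) => df.withColumn(newCol, lit(null).cast(StringType))
}
val transformedData2Nulls = data_cols.diff(data2_cols).foldLeft(data2) {
  case (df, newCol) => df.withColumn(newCol, lit(null).cast(StringType))
}
val finalDataNulls = transformedData2Nulls.unionByName(transformedDataNulls)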
