
Issue with Spark-Scala join. Looking for a better approach

I have two DataFrames like the ones below.

+---+---+---+
|  M| c2| c3|
+---+---+---+
|  1|  2|  3|
|  2|  3|  4|
+---+---+---+

+---+---+---+
|  M| c2| c3|
+---+---+---+
|  1| 20| 30|
|  2| 30| 40|
+---+---+---+

What is the best approach to get a new DataFrame like the one below? The new DataFrame keeps the column names c2 and c3, but each value is df1's value and df2's value for that column joined with an underscore, i.e. concat(df1("c2"), lit("_"), df2("c2")) for c2. I can do this with df3.withColumn("c2_new", concat(df1("c2"), lit("_"), df2("c2"))) and then rename the new column to c2, but the issue is that I have 150+ columns in my DataFrames. What is the best approach here? (A sketch constructing these example DataFrames follows the expected output below.)

+---+----+----+
|  M|  c2|  c3|
+---+----+----+
|  1|2_20|3_30|
|  2|3_30|4_40|
+---+----+----+
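
For reference, here is a minimal sketch constructing the two example DataFrames above, assuming a local SparkSession named spark (the names df1 and df2 match the snippets that follow):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Two DataFrames sharing the key column M and the value columns c2 and c3
val df1 = Seq((1, 2, 3), (2, 3, 4)).toDF("M", "c2", "c3")
val df2 = Seq((1, 20, 30), (2, 30, 40)).toDF("M", "c2", "c3")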

If you have a wide DataFrame, you can iterate over the columns and apply the same transformation to each of them. In your case you can union the DataFrames and aggregate the columns like this:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

// Columns present in both DataFrames, excluding the key column M
val commonColumns = (df1.columns.toSet & df2.columns.toSet).filter(_ != "M").toSeq

df1.union(df2)
  .groupBy("M")
  .agg(count(lit(1)) as "cnt",  // dummy aggregate; agg needs one leading Column before the varargs
    commonColumns.map(c => concat_ws("_", collect_set(col(c).cast(StringType))) as c): _*)
  .select("M", commonColumns: _*)
  .show()

Here is the output:

+---+----+----+
|  M|  c2|  c3|
+---+----+----+
|  1|20_2|3_30|
|  2|3_30|40_4|
+---+----+----+

If you have a requirement on ordering (i.e. the value from df1 must be on the left and the value from df2 on the right), you can use this trick:

  1. Add the dataframe number (1 or 2) as a new column before the union
  2. Create a struct from the dataframe number and the column value
  3. During aggregation take the min and max of this struct (structs compare field by field from left to right, so the dataframe number decides which value wins; see the sketch after the result below)
  4. Extract the column value from the struct
  5. Concatenate the two values with an underscore

Code:

import spark.implicits._  // for the $"..." column syntax

df1
  .withColumn("src", lit(1))  // tag each row with its source dataframe
  .union(df2.withColumn("src", lit(2)))
  .groupBy("M")
  .agg(count(lit(1)) as "cnt",  // dummy aggregate; agg needs one leading Column before the varargs
    commonColumns.map(c => concat(
      min(struct($"src", col(c)))(c),  // value from df1 (src = 1)
      lit("_"),
      max(struct($"src", col(c)))(c)) as c): _*)  // value from df2 (src = 2)
  .select("M", commonColumns: _*)
  .show()

The final result is ordered:

+---+----+----+
|  M|  c2|  c3|
+---+----+----+
|  1|2_20|3_30|
|  2|3_30|4_40|
+---+----+----+
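
The trick works because Spark compares structs field by field from left to right, so min(struct($"src", col(c))) always selects the value tagged src = 1 (df1) and max selects src = 2 (df2). Here is a minimal standalone sketch of that comparison on a single column (hypothetical demo data, reusing the imports from above):

// One key, two tagged rows: (src = 1, c2 = 2) from df1 and (src = 2, c2 = 20) from df2
val demo = Seq((1, 1, 2), (1, 2, 20)).toDF("M", "src", "c2")

demo.groupBy("M")
  .agg(
    min(struct($"src", $"c2"))("c2") as "from_df1",  // struct (1, 2) is the minimum
    max(struct($"src", $"c2"))("c2") as "from_df2")  // struct (2, 20) is the maximum
  .show()

// +---+--------+--------+
// |  M|from_df1|from_df2|
// +---+--------+--------+
// |  1|       2|      20|
// +---+--------+--------+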

You can do this with a join:

// Build one concat expression per non-key column, keeping the original column name
val selectExpr = df1.columns.filterNot(_ == "M").map(c => concat_ws("_", df1(c), df2(c)).as(c))

df1.join(df2, "M")
  .select((col("M") +: selectExpr): _*)
  .show()

gives:

+---+----+----+
|  M|  c2|  c3|
+---+----+----+
|  1|2_20|3_30|
|  2|3_30|4_40|
+---+----+----+
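
Note the trade-off between the two answers: the join version assumes M is unique in each DataFrame (duplicate keys would multiply the joined rows), while the union/groupBy version collapses duplicates through aggregation. For a plain one-row-per-key merge like the example, the join is the more direct option.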
